Why are my snapshots not removed

Status
Not open for further replies.

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Hi,

I have set up my FreeNAS 8.3.0 system with three (recursive) periodic snapshots:
- every 15 minutes, between 8:30 am and 6:30 pm, mo-fr, keep for 2 days
- every 2 hours, between 8 am and 7 pm, mo-fr, keep for 1 month
- every day, between 9pm and 10pm, mo-su, keep for 1 year

I expected that this would give me about 2*4*10 = 80 snapshots for the 15 minute interval, 20-25 snapshots for the 2 hour interval, and max 365 for the 1 day interval.

What I see, though, is that the system keeps 850-900 snapshots that correspond to the 15 minute interval, which seems to match the retention time of the 2-hour interval.
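A quick sanity check of these numbers, as a sketch only. The figure of 22 working days per month is an assumption, and the function names are made up for illustration; snapshots per day are estimated by dividing the daily window by the interval.

```python
# Rough expected snapshot counts for a schedule.
# WORKDAYS_PER_MONTH is an assumed average, not a FreeNAS setting.
WORKDAYS_PER_MONTH = 22


def per_day(window_minutes, interval_minutes):
    """Snapshots taken in one daily window (ignoring the extra one at the window end)."""
    return window_minutes // interval_minutes


def expected(window_minutes, interval_minutes, days_kept):
    """Snapshots alive at once if each one is kept for `days_kept` snapshot days."""
    return per_day(window_minutes, interval_minutes) * days_kept


# 8:30 am to 6:30 pm is a 10 hour window, snapshots every 15 minutes.
print(expected(10 * 60, 15, 2))                   # kept for 2 days  -> 80
print(expected(10 * 60, 15, WORKDAYS_PER_MONTH))  # kept for 1 month -> 880
```

Under these assumptions, a one-month retention applied to the 15 minute schedule gives 880 snapshots, which falls inside the observed 850-900 range.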

I don't get this. Does the 2 hour interval snapshot depend on the 15 minute snapshots, and does that prevent their removal? If so, why? And why does the 1 day interval snapshot not seem to depend on the 15 minute snapshots?

Thanks for any help!
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Can you post what the names of your snapshots are? I'm pretty sure the snapshot script decides when to delete snapshots based on the date and retention sections of the snapshot name. So it's possible that there's a bug in the naming or parsing of the name that is treating "2d" as "1m". There shouldn't be any "dependency" on other snapshots.
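To make the idea of name-based expiry concrete, here is a minimal sketch. It assumes names of the form auto-YYYYMMDD.HHMM-2d (a date stamp followed by a retention suffix); that layout, the unit lengths, and the function names are assumptions for illustration, not the actual FreeNAS snapshot script.

```python
import re
from datetime import datetime, timedelta

# Assumed name layout: auto-YYYYMMDD.HHMM-<count><unit>, e.g. auto-20121213.1530-2d
NAME_RE = re.compile(r"^auto-(\d{8}\.\d{4})-(\d+)([hdwmy])$")

# Approximate unit lengths in days; the month and year lengths are assumptions.
UNIT_DAYS = {"h": 1 / 24, "d": 1, "w": 7, "m": 30, "y": 365}


def expiry(snapshot_name):
    """Return (created, expires) for a snapshot name, or None if it does not match."""
    short_name = snapshot_name.rsplit("@", 1)[-1]  # drop the "dataset@" part, if any
    match = NAME_RE.match(short_name)
    if match is None:
        return None
    stamp, count, unit = match.groups()
    created = datetime.strptime(stamp, "%Y%m%d.%H%M")
    return created, created + timedelta(days=int(count) * UNIT_DAYS[unit])


def is_expired(snapshot_name, now):
    """True if the name parses and its retention period has passed at `now`."""
    parsed = expiry(snapshot_name)
    return parsed is not None and now >= parsed[1]
```

With a scheme like this, a parser that read "2d" as "1m" would keep every 15 minute snapshot for a month, which is the kind of bug suspected here.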



Also, it strikes me that a more efficient algorithm would be similar to how Apple's TimeMachine functions; every hour a new backup is created, for the past 24 hours every hourly backup is kept, for the past week the first backup of the day is kept, and the first backup of the week is kept thereafter.

So, you could simplify your backups to creating a new snapshot every 15 minutes that is set to last for a year. Every hour, run a cronjob that keeps every snapshot for the past 2 days, even-hourly snapshots for the past month, and daily snapshots forever. The snapshot script is already set to delete those snapshots after a year anyway, but the new pruning script could enforce this as well. You could add other constraints to delete hourly and minutely snapshots made on Saturday and Sunday. At any given time, you'll have fewer snapshots to look through and the same level of backup security.
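The pruning rule described above could look roughly like this. It is a sketch under stated assumptions, not the script mentioned in this thread: "a month" is taken as 30 days, and DAILY_HOUR (which snapshot counts as the daily one) is a hypothetical choice.

```python
from datetime import datetime, timedelta

DAILY_HOUR = 12  # hypothetical choice of which snapshot counts as "the daily one"


def keep(created, now):
    """Decide whether a snapshot taken at `created` survives a pruning run at `now`."""
    age = now - created
    on_the_hour = created.minute == 0
    is_daily = on_the_hour and created.hour == DAILY_HOUR
    is_even_hourly = on_the_hour and created.hour % 2 == 0
    if age <= timedelta(days=2):
        return True                        # past 2 days: keep everything
    if age <= timedelta(days=30):          # past month (assumed to be 30 days)
        return is_even_hourly or is_daily  # never drop a future daily snapshot
    return is_daily                        # older: one snapshot per day
```

The middle tier also keeps the daily snapshot explicitly, so the rule still works if DAILY_HOUR is set to an odd hour.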

I actually have a script that does almost all of this already. Mine just replicates the TimeMachine behavior, but it'd just be one additional case to handle the quarter hour backups, another for days you don't want to keep, and some tweaking to the numbers used by the other intervals.

Here's mine if you're interested: (strike that, this will have to wait until I get a functioning browser to post it from)
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
There are some open tickets related to snapshots taken at the same time, so maybe this is related: Problem replicating multiple Snapshots

I'm still exploring ways to work with this, but one thing that appears to work is starting the snapshots at different times. Here's what I'm currently experimenting with:


From 6 a.m. through 10 p.m., every 15 minutes, delete after 2 hours
From 6:05 a.m. through 10:05 p.m., every 1 hour, delete after 24 hours
From 6:03 a.m. through 10:03 p.m., every 1 day, delete after 1 week
From 6:01 a.m. through 10:01 p.m., every 1 week, delete after 8 weeks

I'm not positive that this is working correctly yet, but I can confirm that the 15 minute snapshots are deleted after 2 hours.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
That bug report states something more important:
"You cannot have multiple periodic snapshots for the same dataset in ZFS replication, this is a current limitation."

I'm not sure what that actually means (what is 'a snapshot in ZFS replication'?), but it seems bad. If this is true, I can only have one schedule for the pool/dataset, and I cannot have the 15-minutes-for-2-hours and daily-for-a-year thing. I thought I did my homework before setting this thing up, but I guess I missed some chapters...

Does anyone have a clear description of how periodic snapshots work (especially in combination with ZFS replication, which I plan to use)?

(I'll post my snapshot names tomorrow, when I can access the machine again.)
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Another oddity with snapshots is how they are scheduled. Setting it to run from 6:05 to 10:05 every hour doesn't mean that they'll be made on the 5's. They're basically started whenever you set up the schedule and progress from there. So if you get a snapshot schedule entered into the GUI at 12:10, your snapshots will occur on the 10's. If something delays a snapshot by 5 minutes so you have one at 13:10 and the next at 14:15, your next snapshot will be at 15:15. For irrational reasons I preferred that mine occurred on the top of the hour, so I waited for a snapshot to be made (let's say it was at 19:35) and then renamed that most recent snapshot (to 19:00). The next snapshot occurred at zero minutes (20:00).
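The behaviour described here can be modelled in a few lines. This is only a model of what is observed above (each snapshot is scheduled relative to the previous one), not the FreeNAS scheduler itself; the function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def simulate(first, interval, delays):
    """Return snapshot times when each run is scheduled relative to the previous one.

    delays[i] postpones the snapshot after times[i]; every later snapshot
    inherits the shift, because nothing snaps back to the wall clock.
    """
    times = [first]
    for delay in delays:
        times.append(times[-1] + interval + delay)
    return times


start = datetime(2012, 12, 13, 12, 10)
runs = simulate(start, timedelta(hours=1),
                [timedelta(0), timedelta(minutes=5), timedelta(0)])
print([t.strftime("%H:%M") for t in runs])  # ['12:10', '13:10', '14:15', '15:15']
```

In this model, renaming the most recent snapshot to 19:00 simply changes the reference point, so the next one lands at 20:00.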

This likely won't be important to most, but it's an oddity that can make debugging a bit confusing.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
But if the time that snapshots are taken may 'randomly' change, how am I supposed to make sure they do not happen at the same time? This is confusing...

Should I just install cron jobs to create the snapshots instead?
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I personally think the best course is having one snapshot task at the smallest granularity that you need and then cleaning out the old snapshots when they aren't required. But, that's because it's what I'm doing.

Another option would be to update the snapshot system to be smarter about when snapshots are made (allow specifying the minute) and when they are skipped (a snapshot already exists at this timestamp) and when they are deleted (don't use the name to determine the expiration, keep track in a separate database).

I don't think the snapshot time is all that likely to randomly change; since manually setting all mine to occur on the hour I haven't had any that have migrated to different times.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
This seems to be turning out to be a real problem. I don't want to do any serious programming on this, because I lack the resources. That means that it will turn into a quick-fix-solution, which, eventually, will fail just at the time that I (or my co-workers) really need it.

So I may consider staying within the GUI-boundaries, and just use bollar's way of scheduling the snapshots, using rsync to copy the data to the backup system, and have the backup system manage its own snapshots. Or simply get one or two snapshots a day, and replicate that to the backup server.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Here's the script that I use. https://gist.github.com/4273770

It shouldn't be too hard to modify it to keep the intervals that you're interested in. If you're interested in that path, that is.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Thanks for sharing this!
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Thanks for the script. I can probably handle the modification of the retention times, or I may just accept the defaults.

What surprised me was how the '-r' switch works. I tried 'rollup.py -t -v Internal', but that didn't show any output. With 'rollup.py -t -v -r Internal' I got output for all snapshots of Internal and everything below that (Internal/*). Is that the intended behavior?

It also seems to assume that the snapshots are created every hour. So it happily marks all my 15 minute snapshots of the past 24 hours as 'h'.


The list of my current snapshots, as promised, is attached.
 

Attachments

  • snapshostlist_nameonly.zip
    13.3 KB · Views: 232

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
'-r' looks at snapshots at the current dataset and all children. So if you don't see any output from "Internal" it means you don't have any snapshots at that level, unless there's a bug. ... OK, from your list it looks like you do have snapshots at that level. And looking at the code I'm reminded that I added the '-r' flag just yesterday and have done so incorrectly. When passed to 'zfs get', '-r' means include child datasets and that's how I was accessing the snapshots. Excluding the '-r' only gets info on the requested dataset and doesn't return anything regarding the snapshots. I need to filter at a different point. For now, '-r' is required. I'll try to get a fix for this later tonight.
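One way to "filter at a different point", sketched under assumptions: always fetch the snapshot list recursively, then select by the dataset part of each name. This is not the code from the gist; the function name and the sample snapshot names in the usage note are hypothetical.

```python
def snapshots_for(dataset, snapshot_names, recursive=False):
    """Select snapshots of `dataset`; with recursive=True also include child datasets."""
    selected = []
    for name in snapshot_names:
        owner = name.split("@", 1)[0]  # "Internal/projects@auto-..." -> "Internal/projects"
        is_child = owner.startswith(dataset + "/")
        if owner == dataset or (recursive and is_child):
            selected.append(name)
    return selected
```

For example, given "Internal@auto-20121213.1530-2d" and "Internal/projects@auto-20121213.1530-2d", a call without recursive=True would return only the first name. Comparing against dataset + "/" avoids matching an unrelated dataset such as "InternalOther".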

It's not that it assumes snapshots are created every hour (though that is indeed how I'm using it); it's that an hour is the smallest interval that is handled. If you add an interval of 15 minutes and make snapshots every minute, all of those would fall into the quarter-hour bucket, unless you created a smaller interval bucket. I hope that makes sense; it sounds confusing when I read it back. Basically, you're right though: the smallest interval that is handled right now is an hour, so anything in the past day is put in the hourly bucket. One of my TODOs is to handle arbitrary buckets (patches welcome!).
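The bucket idea can be illustrated with a small sketch. It assumes a bucket is a fixed number of minutes counted from midnight, and that the earliest snapshot in each bucket is the one to keep; the names are hypothetical and this is not how the script in the gist is written.

```python
from datetime import datetime


def bucket_key(created, bucket_minutes):
    """Floor a timestamp to the start of its bucket, counting from midnight."""
    minutes = created.hour * 60 + created.minute
    floored = minutes - minutes % bucket_minutes
    return created.replace(hour=floored // 60, minute=floored % 60,
                           second=0, microsecond=0)


def first_per_bucket(times, bucket_minutes):
    """Keep only the earliest snapshot that falls in each bucket."""
    kept = {}
    for created in sorted(times):
        kept.setdefault(bucket_key(created, bucket_minutes), created)
    return sorted(kept.values())
```

With a 60 minute bucket, snapshots at 9:00, 9:15, 9:30 and 9:45 all share one key and only the 9:00 one is kept; with a 15 minute bucket each of them gets its own key.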

And your list does indicate that you do have some snapshots made at the same time on the same dataset, but with different retention times. But I'm not sure that should be any problem, as you're not currently using replication. Also, it doesn't explain why you have snapshots going back so long ago that are marked as '2d'. That definitely looks like a bug. I can try reproducing that myself tonight as well.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I'm definitely interested in updates of your script! If I find enough courage (and time), I might try to work on the arbitrary retention time. It's too bad I can't access the system from home.

I am using replication, I just wanted to fix one thing at a time. After running your script, I still had some replication errors. My analysis was that the replication code isn't smart enough to choose the correct reference snapshot. I guess it's just because of the sorting, which doesn't work correctly if you have two snapshot-names with the same time. I ran your script again, and because those problematic snapshots were by then expired and consequently purged by your script, the replication seemed to work fine after that.

I need to do some more testing to see if it keeps working, but things are slowly starting to make sense again. I'm not happy that the system allows you to do these things, but I can now work around it.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I rewrote a lot of the script to support arbitrary intervals and defined intervals for hourly, daily, weekly, monthly, and yearly. It's very much untested at this point, so always run it with '-tv' at least once. I should have a chance to test it more tomorrow. It still needs the ability to select intervals and edit existing and define new ones via arguments. And it'll need some tweaks to support a bi-hourly interval. Hopefully tomorrow or the weekend.

I should probably split this project off into a separate thread.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Just a suggestion to make the script more flexible: Wouldn't it be possible to define a short unit of time, say a minute, and then let the user of the script define the intervals based on that? They could then define an arbitrary number of these intervals, with the number of snapshots that should be kept.

In my setup, I'd like to make snapshots:
- every 15 minutes, between 8:30 am and 6:30 pm, mo-fr, keep for 2 days
- every 2 hours, between 8 am and 7 pm, mo-fr, keep for 1 month
- every day, between 9pm and 10pm, mo-su, keep for 1 year

That would translate to about 80 snapshots with a 15 minute interval, 130 snapshots with a 2 hour interval and 365 snapshots with a 1 day interval. I could specify that, for example, as something like 15:80 120:130 1440:365 (or 15M:80 2H:130 1d:365). Your schedule would translate into '1H:24 1d:7 1m:12 1y:10'.

Although for my application, the easiest would be to just specify the intervals that should be kept, and the maximum retention time, so: 15:2880 120:44640 1440:525600 (or 15m:2d 2h:1m 1d:1y). Anyway, I'm happy with anything you produce - these are just suggestions!
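A sketch of how such an "interval:retention" specification could be parsed into minutes. It uses the upper-case M and H for minutes and hours, because a lower-case "m" cannot mean both minutes and months; the 31 day month and 365 day year are inferred from the minute counts above (44640 and 525600). The function names are hypothetical.

```python
# Unit lengths in minutes. Upper-case M/H are minutes/hours and lower-case m
# is a month; the 31 day month and 365 day year match the minute counts
# quoted in the post (44640 and 525600).
UNIT_MINUTES = {"M": 1, "H": 60, "d": 1440, "m": 31 * 1440, "y": 365 * 1440}


def to_minutes(text):
    """Convert '15M', '2H', '1d', '1m', '1y' or a bare minute count to minutes."""
    if text.isdigit():
        return int(text)
    return int(text[:-1]) * UNIT_MINUTES[text[-1]]


def parse_spec(spec):
    """Parse 'interval:retention' pairs into (interval_minutes, retention_minutes)."""
    return [tuple(to_minutes(part) for part in pair.split(":"))
            for pair in spec.split()]


print(parse_spec("15M:2d 2H:1m 1d:1y"))
# [(15, 2880), (120, 44640), (1440, 525600)]
```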
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
"Just a suggestion to make the script more flexible: Wouldn't it be possible to define a short unit of time, say a minute, and then let the user of the script define the intervals based on that? They could then define an arbitrary number of these intervals, with the number of snapshots that should be kept."

I'm not sure that his script was designed with the intent of handing it out to the world. Additionally, if something is wrong with how FreeNAS handles snapshots, you really should be putting in a ticket at support.freenas.org. The last thing you'd want is for this problem to get more serious in a future version of FreeNAS, leaving you with zero snapshots because the bug now deletes them as soon as they are created instead of after they should "expire".
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
The bug is known (see bollar's message in this thread). It was submitted for version 8.0.4, and its current milestone is set to 9.1.0. Often, developers don't seem to like 'me too' comments on existing tickets/bug reports. Of course, I can disguise my 'me too' as a link to this thread :D

I know that fracai's script wasn't built for general use, but depending on his time and interests, he might be willing to revise it to be of more general use. No demands, just suggestions - and given his remark about a separate thread for the script, it seems like fracai is willing to share his work with us.

For me, the snapshot thing was not a requirement when I chose FreeNAS. But, if it works in the way I'd like it to, it is a nice addition. The nice-to-have status gives me some room for experimenting, as long as it only affects my snapshots.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I'm absolutely willing to adapt it further for general use. I actually intended to release it a while ago, but just hadn't gotten around to it. And unfortunately I really enjoy this sort of thing so I'm going to be working on it whether people find it useful or not ;-)

As for the arbitrary interval suggestion, I think I can handle what you want with what I've started, but now I have another idea that may work as well and make it easier to define new intervals. Currently, everything is keyed off of whether the interval is satisfied by the reference (for example: is there a snapshot saved for 2012-50 in weekly, where that's the year and week number?). Another approach would be to, as you suggest, define an interval and, if the last stored snapshot isn't older than the interval, keep looking for a new one. I'm not sure if this would lead to a stable approach, and I know it won't lead to "nice" snapshots (always saving the weekly snapshot as the first one to occur on Sunday, or the monthly being the first one created on the first of the month). That may not be important. Maybe I'll allow both formats. It needs some thought.
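The two approaches can be contrasted in a short sketch. Both functions are hypothetical illustrations rather than code from the script: reference_key builds a calendar key such as year and week number (using Python's %U, which counts weeks starting on Sunday), while keep_by_gap keeps a snapshot only when enough time has passed since the last one kept.

```python
from datetime import datetime, timedelta


def reference_key(created, key_format="%Y-%U"):
    """Calendar key for a snapshot, e.g. year and Sunday-based week number."""
    return created.strftime(key_format)


def keep_by_gap(times, interval):
    """Keep a snapshot only if it is at least `interval` after the last one kept."""
    kept = []
    for created in sorted(times):
        if not kept or created - kept[-1] >= interval:
            kept.append(created)
    return kept
```

The gap-based rule shows why the results may not be "nice": if a Sunday snapshot is missing, the next weekly keeper can land on any weekday, and every later keeper is measured from that one.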

I've started another thread to discuss it further.
 