[SOLVED] autosnap.py fails to clean up old snapshots


nickt (Contributor) | Joined: Feb 27, 2015 | Messages: 131

Hi all,

I had a power failure last night (yes, I know, I need a UPS; I'll be ordering one shortly). My FreeNAS server (latest stable 9.3 build) recovered OK: the pool and disks look fine (scrubs found no errors, SMART tests report no errors) and everything otherwise appears to be working.

However, since the power outage, autosnap.py has been logging the same pair of messages every minute, which suggests that cleanup of old snapshots is silently failing:

Code:
May  6 14:55:05 Saturn autosnap.py: [tools.autosnap:59] Popen()ing: /sbin/zfs get -H freenas:state "pond/Titan/Archive@auto-20150421.2300-2w"
May  6 14:55:05 Saturn autosnap.py: [tools.autosnap:59] Popen()ing: /sbin/zfs destroy -r -d "pond/Titan/Archive@auto-20150421.2300-2w"


This happens for each of my expired auto-snapshots (2-week retention), and the pair of log entries repeats every minute. If I run the same commands at the command line, zfs destroy fails silently (nothing is returned, but the snapshot is still there afterwards):

Code:
[root@Saturn] /var/log# /sbin/zfs get -H freenas:state "pond/Titan/Archive@auto-20150421.2300-2w"
pond/Titan/Archive@auto-20150421.2300-2w    freenas:state    -    -
[root@Saturn] /var/log# /sbin/zfs destroy -r -d "pond/Titan/Archive@auto-20150421.2300-2w"
[root@Saturn] /var/log# /sbin/zfs get -H freenas:state "pond/Titan/Archive@auto-20150421.2300-2w"
pond/Titan/Archive@auto-20150421.2300-2w    freenas:state    -    -
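
For reference, the snapshot's deferred-destroy flag and hold count can be queried directly; a small diagnostic sketch (defer_destroy and userrefs are standard read-only ZFS snapshot properties, and the snapshot name is the one from above):
Code:
# defer_destroy shows whether a deferred destroy (zfs destroy -d) is pending;
# a non-zero userrefs count means one or more holds are keeping the snapshot alive.
/sbin/zfs get -H defer_destroy,userrefs "pond/Titan/Archive@auto-20150421.2300-2w"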


If I try to destroy the old snapshot manually from the GUI, I get an error message saying the attempt failed because the dataset is busy.

Looking back at the logs, there is no question that this started only after the power failure, but I can't find any evidence of pool or disk corruption. Restarting the FreeNAS box hasn't helped.

Can anyone think of anything I could try? I assume something is causing the snapshots to be marked as "busy", whatever that means exactly. Is there anything I can do to "release" them? Could some kind of dataset "state" have become unsynchronised when the power failed?

Thanks in advance.

Nick
 

nickt (Contributor) | Joined: Feb 27, 2015 | Messages: 131

So I've figured out how to resolve this, but not why it happened in the first place.

Looking more closely, it turned out that only the oldest snapshot on each dataset was failing to clean up; all subsequent snapshots were destroyed just fine when their expiry time came around. So it was just that one snapshot per dataset that was stuck.

The documentation for the zfs destroy snapshot command (link) made it apparent that there was a hold on the old snapshot, preventing it from being destroyed.
Code:
[root@Saturn] ~# zfs holds "pond/Titan/Archive@auto-20150421.2300-2w"
NAME                                      TAG           TIMESTAMP
pond/Titan/Archive@auto-20150421.2300-2w  freenas:repl  Tue Apr 21 23:00 2015


Newer snapshots have no holds listed; for some reason, only these "stuck" snapshots had one.
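
To find every stuck snapshot in one pass rather than dataset by dataset, holds can be listed for all snapshots in the pool; a rough sketch, assuming the pool is named pond as above:
Code:
# List every snapshot in the pool and ask for its holds; any snapshot
# with a hold shows up with its TAG and TIMESTAMP in the output.
zfs list -H -r -t snapshot -o name pond | xargs zfs holds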

I was able to use the zfs release command to clear the hold.
Code:
[root@Saturn] ~# zfs release -r freenas:repl "pond/Titan/Archive@auto-20150421.2300-2w"


From there, the next pass of autosnap.py seemed to do its thing, and the repeated messages stopped appearing. All good now!
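
To double-check, the hold list can be re-queried right after the release, and once autosnap.py has made another pass the snapshot itself should be gone; a quick verification sketch using the same snapshot name:
Code:
# Immediately after the release: this should no longer list a freenas:repl hold.
zfs holds "pond/Titan/Archive@auto-20150421.2300-2w"
# After the next autosnap.py run: the expired snapshot should no longer exist.
zfs list -t snapshot | grep "auto-20150421.2300"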

If anyone has any idea why this happened in the first place, I'd be curious to hear suggestions. One thing I can say is that the power failure did not occur anywhere near a scheduled snapshot time, so it can't have been caused by the power failing at the exact moment a snapshot was being taken.
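
For what it's worth, the freenas:repl tag suggests the hold came from the replication machinery, which presumably holds the snapshot it is using as a replication base so it can't be destroyed mid-transfer, and releases it when it is done. Purely as an illustration of that mechanism (not the actual FreeNAS code), the sequence would look something like the following, so losing power between the hold and the release could plausibly leave the hold behind:
Code:
# Illustrative only: place a hold so the snapshot can't be destroyed...
zfs hold -r freenas:repl "pond/Titan/Archive@auto-20150421.2300-2w"
# ... replication runs here ...
# ... and release it afterwards. If this step never runs, the hold persists.
zfs release -r freenas:repl "pond/Titan/Archive@auto-20150421.2300-2w"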
 