Replication problems on 9.2.1.6

Status
Not open for further replies.

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
At some point, replication stopped between my two FreeNAS devices.

To begin with, I decided to try deleting the existing replication task and then creating a new one with matching parameters.

So I tried deleting the replication task, which hung the interface, and now my logs are being flooded:

Code:
Jul 23 13:02:34 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.1930-1y': no such tag on this dataset
Jul 23 13:02:34 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.1945-1y': no such tag on this dataset
Jul 23 13:02:34 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2000-1y': no such tag on this dataset
Jul 23 13:02:34 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2015-1y': no such tag on this dataset
Jul 23 13:02:35 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2030-1y': no such tag on this dataset
Jul 23 13:02:35 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2045-1y': no such tag on this dataset
Jul 23 13:02:35 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2100-1y': no such tag on this dataset
Jul 23 13:02:35 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2115-1y': no such tag on this dataset
Jul 23 13:02:35 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2130-1y': no such tag on this dataset
Jul 23 13:02:36 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2145-1y': no such tag on this dataset
Jul 23 13:02:36 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2200-1y': no such tag on this dataset
Jul 23 13:02:36 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2215-1y': no such tag on this dataset
Jul 23 13:02:36 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2230-1y': no such tag on this dataset
Jul 23 13:02:37 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2245-1y': no such tag on this dataset
Jul 23 13:02:37 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2300-1y': no such tag on this dataset
Jul 23 13:02:37 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2315-1y': no such tag on this dataset
Jul 23 13:02:37 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2330-1y': no such tag on this dataset
Jul 23 13:02:38 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140722.2345-1y': no such tag on this dataset
Jul 23 13:02:38 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0000-1y': no such tag on this dataset
Jul 23 13:02:38 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0015-1y': no such tag on this dataset
Jul 23 13:02:38 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0030-1y': no such tag on this dataset
Jul 23 13:02:39 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0045-1y': no such tag on this dataset
Jul 23 13:02:39 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0100-1y': no such tag on this dataset
Jul 23 13:02:39 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0115-1y': no such tag on this dataset
Jul 23 13:02:39 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0130-1y': no such tag on this dataset
Jul 23 13:02:39 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0145-1y': no such tag on this dataset
Jul 23 13:02:40 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0200-1y': no such tag on this dataset
Jul 23 13:02:40 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0215-1y': no such tag on this dataset
Jul 23 13:02:40 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0230-1y': no such tag on this dataset
Jul 23 13:02:40 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0245-1y': no such tag on this dataset
Jul 23 13:02:41 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0300-1y': no such tag on this dataset
Jul 23 13:02:41 freenas notifier: cannot release hold from snapshot 'storage/.system/cores@auto-20140723.0315-1y': no such tag on this dataset


It's been going on like that for quite some time (at least half an hour). I'll let you know whether the replication task actually gets deleted when it's all said and done.

I'm guessing that's a candidate for improvement.
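In case it's useful, a rough way to see which snapshots still carry the freenas:repl hold (and which ones the notifier is trying to release pointlessly) is to walk the snapshot list and ask for the holds on each one. This is just a sketch from the shell; adjust the dataset name to your own.

Code:
# List every snapshot under the dataset, then print any user holds each one carries.
# Snapshots that print nothing have no holds left, which is why "zfs release" complains.
zfs list -H -r -t snapshot -o name storage/.system/cores | xargs -n 10 zfs holds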

Then, on the subject of the actual replication failure, here are some sample logs showing my snapshots as well as (I think) my failing replication:

Code:
Jul 22 05:30:10 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0530-1y
Jul 22 05:30:10 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0530-1y
Jul 22 05:45:10 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0545-1y
Jul 22 05:45:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0545-1y
Jul 22 05:47:38 freenas autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "zfs list -Hr -o name -t snapshot -d 1 storage/remote | tail -n 1 | cut -d@ -f2"
Jul 22 05:47:39 freenas autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "/sbin/zfs inherit freenas:state storage/remote@auto-20140717.1930-1y"
Jul 22 05:47:39 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs inherit freenas:state storage@auto-20140717.1915-1y
Jul 22 05:47:39 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs release -r freenas:repl storage@auto-20140717.1915-1y
Jul 22 05:47:39 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST storage@auto-20140717.1930-1y
Jul 22 05:47:39 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl storage@auto-20140717.1930-1y
Jul 22 05:47:39 freenas common.pipesubr: cannot hold snapshot 'storage@auto-20140717.1930-1y': tag already exists on this dataset
Jul 22 06:00:12 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0600-1y
Jul 22 06:00:12 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0600-1y
Jul 22 06:15:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0615-1y
Jul 22 06:15:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0615-1y
Jul 22 06:30:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0630-1y
Jul 22 06:30:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0630-1y
Jul 22 06:32:52 freenas autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "zfs list -Hr -o name -t snapshot -d 1 storage/remote | tail -n 1 | cut -d@ -f2"
Jul 22 06:32:53 freenas autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "/sbin/zfs inherit freenas:state storage/remote@auto-20140717.1945-1y"
Jul 22 06:32:53 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs inherit freenas:state storage@auto-20140717.1930-1y
Jul 22 06:32:53 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs release -r freenas:repl storage@auto-20140717.1930-1y
Jul 22 06:32:54 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST storage@auto-20140717.1945-1y
Jul 22 06:32:54 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl storage@auto-20140717.1945-1y
Jul 22 06:32:54 freenas common.pipesubr: cannot hold snapshot 'storage@auto-20140717.1945-1y': tag already exists on this dataset
Jul 22 06:45:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140722.0645-1y
Jul 22 06:45:11 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140722.0645-1y


If anybody can help me determine what the heck is going on here, it would be much appreciated! I really don't understand the "cannot hold snapshot" error message, especially why it would be trying to hold a different four-day-old snapshot each time.
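For what it's worth, the commands visible in the logs suggest the replication script tracks two pieces of state per snapshot: a freenas:repl hold and a freenas:state property. Those are the things worth inspecting on the snapshot it keeps complaining about; a sketch only, with names taken from the log above:

Code:
# Show any user holds on the snapshot from the "tag already exists" message
zfs holds -r storage@auto-20140717.1930-1y
# Show the freenas:state property that autorepl.py sets and reads
zfs get freenas:state storage@auto-20140717.1930-1y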
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, if you are doing a snapshot every 15 minutes and keeping them for a year (holy smokes), you may have as many as 35,000 snapshots (4 per hour × 24 hours × 365 days ≈ 35,040). When you have problems with replication and snapshotting with that quantity of snapshots, it's going to take time to do things in the WebGUI, like listing your snapshots. So I'd say it's less a candidate for improvement and more a candidate for deciding whether you really need 35,000 snapshots. ;) 99% of users have fewer than 100 snapshots and don't have these kinds of problems.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Consider a setup with several rules for snapshots, like:

Every 15 minutes kept for a day
Every 2 hours kept for three days
Every day kept for a week
Every week kept for a year

The only disadvantage is that you can only expose one of these to Windows as Shadow Copies, as of now.
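As a rough illustration, here are hypothetical manual equivalents of those four tiers, assuming the lifetime suffix follows the same auto-YYYYMMDD.HHMM-<lifetime> naming pattern as the -1y and -2w snapshots seen elsewhere in this thread:

Code:
# Hypothetical snapshot names the four periodic tasks would produce
zfs snapshot -r storage@auto-20140722.0545-1d   # every 15 minutes, kept for a day
zfs snapshot -r storage@auto-20140722.0400-3d   # every 2 hours, kept for three days
zfs snapshot -r storage@auto-20140722.0000-1w   # daily, kept for a week
zfs snapshot -r storage@auto-20140720.0000-1y   # weekly, kept for a year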
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
Responses first, then more information.

@cyberjock: I think the quantity of snapshots is a red herring here. That periodic snapshot task has only been enabled for two weeks, and so far I only have 8,000 snapshots total appearing in the dashboard. I would argue that there are going to be plenty of systems out there with far more than 8,000 snapshots (even just a handful of datasets with hourly snapshots for a month would easily eclipse that total). I don't blame you for raising an eyebrow, of course; I have the frequency set high mostly as a stress test. I want to determine where snapshots and replication might fail before I begin to entrust critical business data to FreeNAS.

@Ericloewe: That's a great idea, but unfortunately you can't currently do that when you're using ZFS replication; for now it only supports a single periodic snapshot task. I know people have done things with custom scripts to replicate the Apple Time Machine–like configuration you describe, but I'm going to stick with FreeNAS's built-in functionality, both for the sake of future maintainers of our systems and for ease of migration to whatever improvements are eventually delivered to FreeNAS.

Now, it turns out replication hasn't stopped; it's just falling way behind.

When a replication task begins, the PULL system receives the snapshots as quickly as you would expect via gigabit LAN, usually near-instantaneously. But the PUSH system remains at "Status: Sending" for 45-60 minutes afterward.

As soon as it "completes," the next replication task runs right away… but it's now 45-60 minutes later, and three new snapshots have occurred in the meantime, which means the replication task hasn't got a hope in hell of catching up.

So this looks to be two distinct, but possibly related, problems:
  • Replication tasks are falling behind.
  • Deleting a replication task takes forever, as it attempts to release non-existent holds.
I wouldn't be surprised if the "hold" problems observed during both replication and deletion are the ultimate cause of both issues; they certainly appear to be connected.

I've opened bug reports for each (#5611, #5612).
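One way to check whether sheer data volume explains the 45-60 minute "Sending" phase is a dry run of the pending incremental, which should print an estimated stream size. This is only a sketch, assuming the dry-run flags are available in the ZFS version 9.2.1.6 ships; the snapshot names are taken from the logs above.

Code:
# Estimate the size of the recursive incremental stream between the last
# replicated snapshot and the newest one; -n sends nothing, -v prints an estimate.
zfs send -R -n -v -i storage@auto-20140717.1930-1y storage@auto-20140722.0630-1y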
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The total may be low, but 8,000 is still a boatload of snapshots to process. I was actually looking more at the 15-minute snapshots. The problem is that if you fall behind in snapshots (say you shut down the backup server for the weekend), you may be in a position where you never catch up. Indeed, you have also validated that it is much worse than that: it seems to be falling further and further behind with each passing snapshot.

So what's the hardware on the PUSH and PULL systems? Something isn't quite right. I've sent 20+ TB snapshots before, and it took mere seconds to process after the bulk of the data was sent.
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
Great points!

Both systems are brand new FreeNAS Mini devices with 16 GB ECC RAM.

My eye is on that "cannot hold snapshot, tag already exists on this dataset" message.

I did find this rambling bug report, which described encountering the same error message and also had Windows ACLs involved. I have three datasets with Windows ACLs for CIFS sharing and the remainder using UNIX ACLs, all within a single recursive snapshot/replication task. But that bug report indicated problems with data not appearing on the backup volume, which is not a problem I'm having; my data shows up on the remote volumes just fine, complete with correct ACL data.

I can try reconfiguring everything with Windows ACLs as an attempted workaround, but obviously that shouldn't be necessary.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, that bug wasn't related to ACLs despite what it sounds like.

I was actually at iX when one of the lead developers reproduced that bug. It was one of those situations where it was a one-off, and fixing it after an RC had been pushed out the door was just not feasible without major setbacks in the release schedule.

So try this for me. When a replication task seems to be in one of those states where it thinks it's busy but isn't, can you check "top" on both of the systems and post the output? I'm wondering if something is causing the CPU to get extremely busy (perhaps a bug or something) or if it's "something else".
 

pcbuilder123

Cadet
Joined
Jul 23, 2014
Messages
2
Hey guys, I really need help. My FreeNAS server has been up since yesterday, and I am using 9.2.1.6. I only have a 320 GB hard drive, but I'm just experimenting. I think I am doing something wrong in CIFS; my gaming rig and NAS are connected to the same modem using Ethernet cables (I am a newbie to NAS). Anyway, when I go to Computer and type in the IP of my NAS, it can't find my NAS and just opens a tab in Chrome. Help?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
@pcbuilder123 what does your question have to do with this thread? You'll have a much better chance of getting relevant answers if you post a new thread, in the correct forum, with a descriptive thread title.
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
Yeah, that bug wasn't related to ACLs despite what it sounds like.

I was actually at iX when one of the lead developers reproduced that bug. It was one of those situations where it was a one-off, and fixing it after an RC had been pushed out the door was just not feasible without major setbacks in the release schedule.

So try this for me. When a replication task seems to be in one of those states where it thinks it's busy but isn't, can you check "top" on both of the systems and post the output? I'm wondering if something is causing the CPU to get extremely busy (perhaps a bug or something) or if it's "something else".

Current ZFS Replication status is "Sending storage/.system/samba4@auto-20140718.1315-1y (83%)" (the first time I've ever seen it stuck on anything more than just the word "Sending").

PUSH server processes:

Code:
[root@freenas] ~# top -an 1000
last pid: 60321;  load averages:  0.18,  0.26,  0.30  up 16+02:22:43    15:32:57
204 processes: 1 running, 50 sleeping, 153 zombie

Mem: 246M Active, 218M Inact, 13G Wired, 571M Buf, 2178M Free
ARC: 11G Total, 8880M MFU, 2568M MRU, 274K Anon, 79M Header, 123M Other
Swap: 8192M Total, 8192M Free


  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
9585 root         12  20    0   156M 19704K uwait   3  47:21  0.00% /usr/local/sbin/collectd
3165 root          6  20    0   455M   242M usem    4  22:10  0.00% /usr/local/bin/python /usr/local/www/freenasUI/manage.py r
28915 uucp          1  20    0 14232K  5124K select  5   3:29  0.00% /usr/local/libexec/nut/usbhid-ups -a ups
56567 root          1  20    0   208M 14768K select  0   1:42  0.00% /usr/local/sbin/nmbd --daemon --configfile=/usr/local/etc/
56571 root          1  20    0   276M 21644K select  5   1:05  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
2689 root          1  20    0 22220K  3908K select  7   1:02  0.00% /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f
42667 root          1  20    0   331M 30336K select  0   0:56  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
28917 uucp          1  20    0 20352K  3876K select  0   0:52  0.00% /usr/local/sbin/upsd
39916 root          1  20    0   331M 29364K select  2   0:48  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
3263 nobody        1  20    0  9908K  2208K select  4   0:43  0.00% /usr/local/sbin/mdnsd
28942 uucp          1  20    0 20336K  3944K nanslp  7   0:37  0.00% /usr/local/sbin/upsmon localhost
83830 root          1  20    0   328M 25000K select  2   0:23  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
3257 www           1  20    0 30140K  6820K kqread  5   0:15  0.00% nginx: worker process (nginx)
54996 root          1  20    0 12036K  1756K select  2   0:14  0.00% /usr/sbin/syslogd -s -C
3837 root          1  38    0 14136K  1812K nanslp  5   0:10  0.00% /usr/sbin/cron -s
53880 root          1  30    0   158M 45796K wait    5   0:07  0.00% python /usr/local/www/freenasUI/tools/autorepl.py (python2
63170 root          1  44   10 18592K  3328K wait    4   0:06  0.00% /bin/sh /usr/local/sbin/pbid
40408 root          1  20    0   331M 24864K select  3   0:06  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
56578 root          1  20    0   262M 18796K select  4   0:04  0.00% /usr/local/sbin/winbindd --daemon --configfile=/usr/local/
56577 root          1  20    0   260M 18512K select  4   0:04  0.00% /usr/local/sbin/winbindd --daemon --configfile=/usr/local/
28966 uucp          1  20    0 20336K  3928K nanslp  6   0:03  0.00% /usr/local/bin/upslog -s ups -l /var/log/ups.log -i 300
56575 root          1  20    0   260M 18636K select  2   0:02  0.00% /usr/local/sbin/winbindd --daemon --configfile=/usr/local/
56579 root          1  20    0   260M 18416K select  0   0:02  0.00% /usr/local/sbin/winbindd --daemon --configfile=/usr/local/
41272 root          1  20    0   331M 24724K select  5   0:02  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
2107 root          1  20    0  6276K   736K select  2   0:01  0.00% /sbin/devd
3983 root          1  52    0   158M 46420K ttyin   6   0:01  0.00% python /etc/netcli (python2.7)
8515 root          4  52    0   168M 46208K select  6   0:01  0.00% /usr/local/bin/python /usr/local/libexec/nas/register_mdns
3022 root          1  20    0 28212K  4052K nanslp  3   0:01  0.00% /usr/local/sbin/smartd -i 1800 -c /usr/local/etc/smartd.co
42708 root          1  20    0   331M 26124K select  4   0:01  0.00% /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/
58956 root          1  20    0 69524K  5856K select  1   0:00  0.00% sshd: root@pts/0 (sshd)
58680 root          1  20    0 38340K  5204K select  4   0:00  0.00% [ssh]
59078 root          1  20    0 17520K  3684K pause   3   0:00  0.00% -csh (csh)
1709 _dhcp         1  20    0 12044K  1888K select  0   0:00  0.00% dhclient: lagg0 (dhclient)
1660 root          1  32    0 12044K  1848K select  0   0:00  0.00% dhclient: lagg0 [priv] (dhclient)
58677 root          1  20    0 14496K  2100K wait    5   0:00  0.00% [sh]
3989 root          1  52    0 12044K  1620K ttyin   6   0:00  0.00% /usr/libexec/getty Pc ttyv6
3985 root          1  52    0 12044K  1620K ttyin   5   0:00  0.00% /usr/libexec/getty Pc ttyv2
3987 root          1  52    0 12044K  1620K ttyin   4   0:00  0.00% /usr/libexec/getty Pc ttyv4
3986 root          1  52    0 12044K  1620K ttyin   7   0:00  0.00% /usr/libexec/getty Pc ttyv3
3990 root          1  52    0 12044K  1620K ttyin   0   0:00  0.00% /usr/libexec/getty Pc ttyv7
3988 root          1  52    0 12044K  1620K ttyin   3   0:00  0.00% /usr/libexec/getty Pc ttyv5
8511 root          1  20    0 46828K  5476K select  0   0:00  0.00% /usr/sbin/sshd
3984 root          1  52    0 12044K  1620K ttyin   1   0:00  0.00% /usr/libexec/getty Pc ttyv1
60321 root          1  20    0 16556K  2516K CPU6    6   0:00  0.00% top -an 1000
3314 root          1  20    0 12044K  1568K sigwai  1   0:00  0.00% daemon: /usr/local/libexec/nas/register_mdns.py[8515] (dae
28941 root          1  52    0 20336K  3836K piperd  0   0:00  0.00% /usr/local/sbin/upsmon localhost
3256 root          1  52    0 26044K  4976K pause   2   0:00  0.00% nginx: master process /usr/local/sbin/nginx
60263 root          1  44   10  3772K  1448K nanslp  3   0:00  0.00% sleep 300
3287 messagebus    1  52    0 14300K  2272K select  5   0:00  0.00% /usr/local/bin/dbus-daemon --system
53879 root          1  20    0 14136K  1840K wait    5   0:00  0.00% cron: running job (cron)
2048 root          1  52    0 14224K  1728K select  2   0:00  0.00% /usr/sbin/moused -p /dev/ums0 -t auto -I /var/run/moused.u


PUSH server logs:

Code:
Jul 24 13:26:49 freenas autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "zfs list -Hr -o name -t snapshot -d 1 storage/remote | tail -n 1 | cut -d@ -f2"
Jul 24 13:26:50 freenas autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "/sbin/zfs inherit freenas:state storage/remote@auto-20140718.1230-1y"
Jul 24 13:26:50 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs inherit freenas:state storage@auto-20140718.1215-1y
Jul 24 13:26:50 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs release -r freenas:repl storage@auto-20140718.1215-1y
Jul 24 13:26:50 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST storage@auto-20140718.1230-1y
Jul 24 13:26:50 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl storage@auto-20140718.1230-1y
Jul 24 13:31:13 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1331-1y
Jul 24 13:31:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1331-1y
Jul 24 13:46:13 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1346-1y
Jul 24 13:46:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1346-1y
Jul 24 14:01:13 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1401-1y
Jul 24 14:01:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1401-1y
Jul 24 14:15:57 freenas autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "zfs list -Hr -o name -t snapshot -d 1 storage/remote | tail -n 1 | cut -d@ -f2"
Jul 24 14:15:57 freenas autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "/sbin/zfs inherit freenas:state storage/remote@auto-20140718.1245-1y"
Jul 24 14:15:57 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs inherit freenas:state storage@auto-20140718.1230-1y
Jul 24 14:15:58 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs release -r freenas:repl storage@auto-20140718.1230-1y
Jul 24 14:15:58 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST storage@auto-20140718.1245-1y
Jul 24 14:15:58 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl storage@auto-20140718.1245-1y
Jul 24 14:16:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1416-1y
Jul 24 14:16:15 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1416-1y
Jul 24 14:31:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1431-1y
Jul 24 14:31:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1431-1y
Jul 24 14:46:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1446-1y
Jul 24 14:46:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1446-1y
Jul 24 15:01:13 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1501-1y
Jul 24 15:01:13 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1501-1y
Jul 24 15:04:27 freenas autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "zfs list -Hr -o name -t snapshot -d 1 storage/remote | tail -n 1 | cut -d@ -f2"
Jul 24 15:04:28 freenas autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.0.211 "/sbin/zfs inherit freenas:state storage/remote@auto-20140718.1300-1y"
Jul 24 15:04:28 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs inherit freenas:state storage@auto-20140718.1245-1y
Jul 24 15:04:28 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs release -r freenas:repl storage@auto-20140718.1245-1y
Jul 24 15:04:28 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST storage@auto-20140718.1300-1y
Jul 24 15:04:28 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl storage@auto-20140718.1300-1y
Jul 24 15:16:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW storage@auto-20140724.1516-1y
Jul 24 15:16:14 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs hold -r freenas:repl storage@auto-20140724.1516-1y


PULL server processes:

Code:
[root@freenas] ~# top -an 1000
last pid: 82616;  load averages:  0.11,  0.14,  0.14  up 14+01:52:49    15:34:37
33 processes:  1 running, 32 sleeping

Mem: 176M Active, 167M Inact, 5269M Wired, 1284K Cache, 458M Buf, 10G Free
ARC: 3234M Total, 730M MFU, 2316M MRU, 18K Anon, 81M Header, 108M Other
Swap: 8192M Total, 8192M Free


  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
82247 root          1  20    0 45784K 12500K tx->tx  3   0:26  0.39% /sbin/zfs receive -F -d storage/remote
11794 root         12  20    0   156M 22964K uwait   1  39:21  0.00% /usr/local/sbin/collectd
2774 root          6  20    0   363M   195M usem    4   2:57  0.00% /usr/local/bin/python /usr/local/www/freenasUI/manage.py r
2339 root          1  20    0 22220K  3908K select  4   0:55  0.00% /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f
2876 nobody        1  20    0  9908K  2192K select  0   0:35  0.00% /usr/local/sbin/mdnsd
2650 root          1  32   10 18592K  3236K wait    6   0:06  0.00% /bin/sh /usr/local/sbin/pbid
2097 root          1  20    0 12036K  1736K select  1   0:04  0.00% /usr/sbin/syslogd -s -C
3413 root          1  20    0 14136K  1812K nanslp  1   0:04  0.00% /usr/sbin/cron -s
68065 www           1  20    0 26044K  6424K kqread  6   0:03  0.00% nginx: worker process (nginx)
3409 root          1  20    0 46828K  5400K select  4   0:02  0.00% /usr/sbin/sshd
2631 root          1  20    0 28212K  4052K nanslp  0   0:01  0.00% /usr/local/sbin/smartd -i 1800 -c /usr/local/etc/smartd.co
3554 root          1  52    0   158M 46444K ttyin   0   0:01  0.00% python /etc/netcli (python2.7)
2937 root          4  52    0   168M 46104K select  1   0:01  0.00% /usr/local/bin/python /usr/local/libexec/nas/register_mdns
1761 root          1  20    0  6276K   736K select  2   0:00  0.00% /sbin/devd
82243 root          1  20    0 69524K  6396K select  4   0:00  0.00% sshd: root@notty (sshd)
82571 root          1  20    0 69524K  5852K select  7   0:00  0.00% sshd: root@pts/0 (sshd)
82574 root          1  20    0 17520K  3492K pause   2   0:00  0.00% -csh (csh)
1280 _dhcp         1  20    0 12044K  1896K select  0   0:00  0.00% dhclient: igb1 (dhclient)
1231 root          1  28    0 12044K  1852K select  0   0:00  0.00% dhclient: igb1 [priv] (dhclient)
82245 root          1  52    0 17520K  2672K pause   7   0:00  0.00% csh -c /sbin/zfs receive -F -d storage/remote && echo Succ
2869 root          1  20    0 26044K  5128K pause   5   0:00  0.00% nginx: master process /usr/local/sbin/nginx
3557 root          1  52    0 12044K  1620K ttyin   7   0:00  0.00% /usr/libexec/getty Pc ttyv3
3559 root          1  52    0 12044K  1620K ttyin   2   0:00  0.00% /usr/libexec/getty Pc ttyv5
3556 root          1  52    0 12044K  1620K ttyin   4   0:00  0.00% /usr/libexec/getty Pc ttyv2
3560 root          1  52    0 12044K  1620K ttyin   5   0:00  0.00% /usr/libexec/getty Pc ttyv6
3561 root          1  52    0 12044K  1620K ttyin   0   0:00  0.00% /usr/libexec/getty Pc ttyv7
3558 root          1  52    0 12044K  1620K ttyin   6   0:00  0.00% /usr/libexec/getty Pc ttyv4
3555 root          1  52    0 12044K  1620K ttyin   3   0:00  0.00% /usr/libexec/getty Pc ttyv1
82616 root          1  20    0 16556K  2312K CPU1    1   0:00  0.00% top -an 1000
82605 root          1  31   10  3772K  1448K nanslp  7   0:00  0.00% sleep 300
2900 messagebus    1  52    0 14300K  2268K select  1   0:00  0.00% /usr/local/bin/dbus-daemon --system
2936 root          1  52    0 12044K  1568K sigwai  7   0:00  0.00% daemon: /usr/local/libexec/nas/register_mdns.py[2937] (dae
1736 root          1  52    0 14224K  1728K select  5   0:00  0.00% /usr/sbin/moused -p /dev/ums0 -t auto -I /var/run/moused.u


Nothing of relevance in the PULL server logs.
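While it sits at "Sending", one rough way to check whether bytes are still crossing the wire (assuming replication goes out over lagg0 on PUSH, as the top output suggests) is to sample the interface byte counters twice:

Code:
# Sample the interface counters, wait, sample again; if the Obytes column barely
# moves between the two samples, the "Sending" state is stalled rather than slow.
netstat -ibn -I lagg0 ; sleep 10 ; netstat -ibn -I lagg0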
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
Noteworthy information:
  • On the PULL server, I can see the auto-20140718.1315-1y snapshots have been created, but each reports a "used" size of 0. The previous set of snapshots (auto-20140718.1300-1y) all show up properly, with the proper amount of space being used for each (ranging from 11.6K to 366K, totalling about 1M).
  • Notice how those auto-20140718.1315-1y snapshots don't appear anywhere in the PUSH logs yet.
Also, note that my "cannot hold snapshot: tag already exists on this dataset" error messages have stopped appearing. I don't know if that's a result of my earlier attempt to delete the "broken" replication task, but regardless, the absence of that recurring error message has not stopped the problem from occurring.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, snapshots always show a size of 0 until the snapshot is downloaded and integrated into the pool.

At this point I think you're going to have to wait for a developer to look at the tickets you submitted. :(
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
That's my guess too.

Thanks for taking a look, though! Your questions definitely helped me collect what I think is some super helpful data for both tickets.
 

Josep

Cadet
Joined
Apr 3, 2014
Messages
8
I'm having the same issue, and it's definitely not because of the number of periodic snapshots.
I have two FreeNAS 9.2.1.6 systems. One takes local periodic snapshots every hour (work hours only), kept for two weeks, and pushes them to a secondary FreeNAS that is used only for backup.
I keep getting this with every snapshot:

Code:
...
Jul 30 09:31:21 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl tank@auto-20140730.0930-2w
Jul 30 09:31:21 freenas common.pipesubr: cannot hold snapshot 'tank@auto-20140730.0930-2w': tag already exists on this dataset
...
Jul 30 10:31:23 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs set freenas:state=LATEST tank@auto-20140730.1030-2w
Jul 30 10:31:23 freenas autorepl.py: [common.pipesubr:72] Executing: /sbin/zfs hold -r freenas:repl tank@auto-20140730.1030-2w
...

Any idea of what could be the cause?
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
I'm going to share everything I've done in diagnosing and then reconfiguring my setup. I hope users will find some of this helpful, and I hope the FreeNAS developers will find some of it useful for improving the documentation and smoothing some of the rough edges.
  • I'm not doing recursive snapshots and replication for my entire ZFS volume anymore; instead I've moved the datasets I want to replicate into a parent dataset (using zfs rename; see the sketch after this list), and I do recursive snapshots and replication for that dataset only. This way I'm not replicating snapshot data for the highly volatile .system datasets, which had already taken me to the point where 5 MB of data carried 1 GB of completely useless snapshot data.
    • This also helps keep your total snapshot count low. That matters not because ZFS can't handle large snapshot counts, but because the FreeNAS interface chokes on them. My .system dataset has four child datasets (cores, rrd, samba4, syslog); recursively, that's five additional unneeded snapshots per period, which in my case more than doubled my snapshot count.
  • To further reduce snapshot count, I have snapshots occurring hourly from 7:00:00 to 22:59:59, Monday through Saturday (with ZFS scrubs on Sundays). Replication is still set to run around the clock, to help it catch up in case it ever "falls behind" again due to stalling replication jobs.
  • The "cannot hold snapshot" message doesn't appear to be important.
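Roughly, the reshuffle looks like this. The dataset names below are placeholders, not my real ones; note that shares pointing at the old paths need updating afterward, and a rename can fail if the dataset is busy.

Code:
# Create a parent dataset to hold everything that should be snapshotted/replicated
zfs create storage/replicated
# Move the datasets that matter under it (placeholder names)
zfs rename storage/documents storage/replicated/documents
zfs rename storage/media storage/replicated/media
# storage/.system and its children stay outside storage/replicated, so a recursive
# snapshot task on storage/replicated no longer touches them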
 

Sasayaki

Explorer
Joined
Apr 20, 2014
Messages
86
Hi Aaronadams,

As a newbie user who's experiencing the same problems you are, what's the best way to resolve this? Upgrade to the 9.3 nightly, or move my data so instead of being in, say, /mnt/Storage/, it's in /mnt/Storage/Storage and therefore not snapshotting the .system directory?
 

aaronadams

Dabbler
Joined
Apr 17, 2014
Messages
23
I'm still running 9.2.1.7 with no plans to upgrade; at this point everything is working properly, and given what I went through to get here, I am terrified to touch it ever again!

My approach was exactly what you are describing. Here is my local machine configuration:

Active Volumes:
[screenshot: Volumes.png]

Periodic Snapshot Tasks:
[screenshot: Snapshot Tasks.png]

ZFS Replication:
[screenshot: Replication.png]

This configuration has been quite robust, even when the remote box goes offline for a while; once the connection is restored, it gracefully recovers and quickly catches up.

It's important to note that each replication job, even without any new data to replicate, currently takes about 20 minutes (likely due to extremely slow zfs list). FreeNAS does not handle high snapshot counts well. There are massive overhead issues with snapshot count that slow even the most basic tasks to a crawl. In theory ZFS snapshots should be extremely lightweight, but in reality they get heavy in a hurry. So keep your snapshot count in the thousands at most (I'm currently sitting below 3,000 snapshots), and even fewer if you can manage. You'll thank yourself later.
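For reference, the count is easy to check from the shell; this just counts snapshot names, so substitute your own pool name:

Code:
# Count every snapshot on the pool named "storage"
zfs list -H -t snapshot -o name -r storage | wc -l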

Otherwise I'd just recommend following loosely-established best practices. For me, that means:
  • Snapshots every two hours from 08:00 to 20:00, Monday through Saturday;
  • SMART short tests (2-minute duration) at 00:00, Monday through Saturday;
  • A SMART long test (7-hour duration) at 12:00 on Sundays;
  • A ZFS scrub (45 minutes to several hours in duration) at 00:00 on Sundays.
One other "gotcha" worth noting: you can set start/end times for replication, but not frequency or day of the week. During periods when your remote machine is unreachable, the replication task will run every minute; be prepared for some lengthy error logs during those times.
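A quick way to see how noisy that gets, assuming the replication messages land in the system log the way the excerpts earlier in this thread suggest:

Code:
# Show the most recent autorepl activity and errors from the system log
grep autorepl /var/log/messages | tail -n 20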

Here's hoping for some seriously improved documentation for FreeNAS in the not-too-distant future. Now that everything is successfully configured I've been very happy, but getting to that point was incredibly difficult, and I doubt I'd recommend FreeNAS to anyone without a very high tolerance for frustration and experimentation. This certainly isn't point-and-click easy yet, though there are definitely people working hard to make it so.
 