truenas zfs replication Timeout in head().

Silverstar24

Dabbler
Joined
Jan 22, 2020
Messages
11
Hi

I run into an ZFS replication error. Since a few day I got the following error during ZFS Replication PULL Job's.

truenas zfs replication Timeout in head().

It looks like a connection issue but I change nothing on this configuration and PUSH replications work fine. SSH login on the system works so I can connect to the root user with out username and password (public key).

Now it tried the following things:

- Update both NAS to TrueNAS 13 U3.1
- Delete all snapshot's on both NAS
- Scrub source and destination pool
- Delete one of the Datasets in the destination NAS and start from scratch.

But nothing helps. Any ideas?

The Log:
[2022/11/28 20:04:49] INFO [Thread-13] [zettarepl.paramiko.replication_task__task_26] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2022/11/28 20:04:49] INFO [Thread-13] [zettarepl.paramiko.replication_task__task_26] Authentication (publickey) successful!
[2022/11/28 20:04:50] INFO [replication_task__task_26] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/11/28 20:04:50] INFO [replication_task__task_26] [zettarepl.replication.run] For replication task 'task_26': doing pull from 'ERB_Pool/xxx' to 'NUN_DATA2_Pool/xxx' of snapshot='auto-2022-11-28_20-03-2m' incremental_base=None receive_resume_token=None encryption=False
[2022/11/28 20:05:00] WARNING [replication_task__task_26] [zettarepl.replication.run] For task 'task_26' at attempt 1 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/11/28 20:05:00] INFO [replication_task__task_26] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2022/11/28 20:05:01] INFO [replication_task__task_26] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/11/28 20:05:01] INFO [replication_task__task_26] [zettarepl.replication.run] For replication task 'task_26': doing pull from 'ERB_Pool/xxx' to 'NUN_DATA2_Pool/xxx' of snapshot='auto-2022-11-28_20-03-2m' incremental_base=None receive_resume_token=None encryption=False
[2022/11/28 20:05:11] WARNING [replication_task__task_26] [zettarepl.replication.run] For task 'task_26' at attempt 2 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/11/28 20:05:11] INFO [replication_task__task_26] [zettarepl.replication.run] After recoverable error sleeping for 2 seconds
... 2 more lines ...
[2022/11/28 20:05:24] WARNING [replication_task__task_26] [zettarepl.replication.run] For task 'task_26' at attempt 3 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/11/28 20:05:24] INFO [replication_task__task_26] [zettarepl.replication.run] After recoverable error sleeping for 4 seconds
[2022/11/28 20:05:29] INFO [replication_task__task_26] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/11/28 20:05:29] INFO [replication_task__task_26] [zettarepl.replication.run] For replication task 'task_26': doing pull from 'ERB_Pool/xxx' to 'NUN_DATA2_Pool/xxx' of snapshot='auto-2022-11-28_20-03-2m' incremental_base=None receive_resume_token=None encryption=False
[2022/11/28 20:05:39] WARNING [replication_task__task_26] [zettarepl.replication.run] For task 'task_26' at attempt 4 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/11/28 20:05:39] INFO [replication_task__task_26] [zettarepl.replication.run] After recoverable error sleeping for 8 seconds
[2022/11/28 20:05:47] INFO [replication_task__task_26] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/11/28 20:05:47] INFO [replication_task__task_26] [zettarepl.replication.run] For replication task 'task_26': doing pull from 'ERB_Pool/xxx' to 'NUN_DATA2_Pool/xxx' of snapshot='auto-2022-11-28_20-03-2m' incremental_base=None receive_resume_token=None encryption=False
[2022/11/28 20:05:57] WARNING [replication_task__task_26] [zettarepl.replication.run] For task 'task_26' at attempt 5 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/11/28 20:05:57] ERROR [replication_task__task_26] [zettarepl.replication.run] Failed replication task 'task_26' after 5 retries
 

Silverstar24

Dabbler
Joined
Jan 22, 2020
Messages
11
ok, I found it ... it was an DNS issue. Strange and it cost me the half afternoon. First DNS resolve was working and the replica process shows all of the snapshots but the second request was a cache miss on the the DNS Server. As result the process failed when the data transfer should start.
So I added a static entry in the DNS Server as test and it works now.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm running into this, though I'm inclined not to blame DNS here, since I double-checked that everything is working as it should. Unless some part of the middleware is breaking while attempting to cache DNS results, it has no reason to break.

The error is wonderfully vague, so it's going to be a fun one to track down...

To allow for flexibility in terms of which snapshots get replicated (does anyone actually use that functionality?), zettarepl seems to pull snapshots one at a time (which is slow as molasses, but that's another discussion). Whatever is going wrong seems to be at the start of a snapshot, as errors don't seem to happen when large snapshots are in the process of being transferred.

Why am I doing a pull replication? I need to seed a TrueNAS 13 server from a FreeNAS 11 machine (no, I can't take it out of production; no I can't use FreeNAS 11, the HBA is too new for that; yes it should've been updated long ago, but it was deemed too potentially disruptive), and FreeNAS 11's SSH client is too old.
 
Top