Replication failing: "Partially received snapshot is saved" "Command failed with code 1.."

JaimieV (Guru, joined Oct 12, 2012, 742 messages)
I have three replications, pulled by my backup NAS from the primary NAS. Two of them, the big ones, work fine.
The third is of my iocage tree, and it has been failing since late last week, probably coinciding with the 11.3-U5 install. I have been getting emails saying:

* Replication "DataPool/iocage - DataPoolBackup/iocage" failed: failed to read from stream
Command failed with code 1..


All the settings are identical to the other two replications apart from the source and destination datasets.

I deleted and re-created the replication templated off one of the existing ones, changing only source and destination datasets, everything else identical. It identifies 17 snapshots to transfer, then again fails with code 1. So that's fun, I guess.

I also deleted and re-created the dataset, then remade the replication, same results.

But today, some excitement. Overnight I got this email instead:

New alerts:
* Replication "DataPool/iocage - DataPoolBackup/iocage" failed: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
zfs send -t 1-11552caf99-f0-789c636064000310a500c4ec50360710e72765a5269740f80cd8e4d3d28a534b18e00024cf86249f5459925acc802a8facbf243fbd34338581e19de1ca08568657950e48f29c60f9bcc4dc54060697c492c480fcfc1cfdccfce4c4f45487c4d2927c5d23032303430303433d430b03035da35ca8f9dc0c08ff24e7e71614a51617e76723dc0000ed82205d
Command failed with code 1..


Now every time I run the replication I get the same message with the same resume token. Something's stuck.

I can't use that suggested command line as-is: I would need to construct the rest of the transport details myself, piping through SSH and netcat manually, and I don't even know where to find the SSH keyfile it would need, as it doesn't seem to be in the usual places (buried in the DB somewhere, perhaps?). So, thanks for the message, but as printed it's practically useless. Can it be improved?
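
(For reference, I think the manual equivalent for a pull setup would run on the backup NAS and look roughly like the below. The hostname, user and receive flags are my guesses at what the middleware actually does, so treat it as a sketch:)

# On the backup NAS: pull the resumable stream from the primary over SSH,
# substituting the token from the alert email, and receive with -s so the
# transfer stays resumable if it breaks again.
ssh root@primary-nas "zfs send -t <resume_token_from_alert>" | zfs recv -s DataPoolBackup/iocage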

That aside, there's this partial receive. How do I clean it out? zfs list -t snapshot doesn't show it. Does it go away with a timeout? A reboot? Export and re-import? Only when I delete the dataset? Am I stuck with it forever, eating space?
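
(Partly answering myself from the man pages: the partial state apparently lives as a receive_resume_token property on the destination rather than as a snapshot, and zfs recv -A should discard it. Untested by me so far, so a sketch:)

# On the backup NAS: show any saved partial-receive state
zfs get -r receive_resume_token DataPoolBackup/iocage

# Abort the partial receive and free the space it is holding
zfs recv -A DataPoolBackup/iocage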

Any idea where to look for clues as to the 'code 1' failure?
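
(My first stop will be the replication engine's own log, which as far as I can tell lives at /var/log/zettarepl.log on 11.3, on both boxes:)

# Watch the detailed replication log while re-running the task
tail -f /var/log/zettarepl.log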

/Edit to add: The larger, working replications are 1TB and 20TB; the failing one is only 300GB or so.
 

JaimieV

I deleted and recreated the replication job again; back to just the plain "Command failed with code 1.." error.

I deleted the destination dataset and recreated both the dataset and the task. The first attempt made a good try (the progress window showed it copying a named snapshot) but ended up with:

[2020/10/18 13:18:36] INFO [Thread-176] [zettarepl.paramiko.replication_task__task_6] Connected (version 2.0, client OpenSSH_8.0-hpn14v15)
[2020/10/18 13:18:36] INFO [Thread-176] [zettarepl.paramiko.replication_task__task_6] Authentication (publickey) successful!
[2020/10/18 13:18:37] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'DataPool/iocage/download/11.2-RELEASE' to 'DataPoolBackup/iocage/download/11.2-RELEASE' of snapshot='auto-20201015.1800-2m' incremental_base=None receive_resume_token=None
[2020/10/18 13:18:37] INFO [replication_task__task_6] [zettarepl.paramiko.replication_task__task_6.sftp] [chan 3] Opened sftp connection (server version 3)
[2020/10/18 13:18:37] INFO [replication_task__task_6] [zettarepl.transport.ssh_netcat] Automatically chose connect address '10.0.0.2'
[2020/10/18 13:18:47] WARNING [replication_task__task_6.stdout_copy] [zettarepl.transport.base_ssh.root@10.0.0.2.shell.88.async_exec.477] Copying stdout from >> failed: timeout()
[2020/10/18 13:24:01] ERROR [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error SshNetcatExecException(ExecException(1, 'invalid backup stream\n'), ExecException(1, ''))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
... 6 more lines ...
File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 517, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 33, in run
raise self.process_exception
File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
self.replication_process.wait()
File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh_netcat.py", line 186, in wait
raise SshNetcatExecException(connect_exec_error, self.listen_exec_error) from None
zettarepl.transport.ssh_netcat.SshNetcatExecException: invalid backup stream
Command failed with code 1


Seems odd that the WARNING on the stdout copy timeout comes 5m14s before the ERROR hits.

Again, this is using the same single "SSH Connection" definition that the two working replications use. It's the only one I have defined.

I ran a scrub on the source pool last night; it came back clean. Is there anything else I can test at the source?
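
(One thing I can think of: send the snapshot to /dev/null locally on the primary, which exercises the ZFS read path with no network involved. Dataset and snapshot names taken from the log above:)

# On the primary NAS: verify the stream can be generated end to end
zfs send -v DataPool/iocage/download/11.2-RELEASE@auto-20201015.1800-2m > /dev/null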

Replicating this dataset to the backup was working for literally years before. It is puzzling.
 

JaimieV

Out of annoyance I spun up a FreeNAS VM, and it works fine there.
[Screenshot: iocage.png, replication progress in the test VM]


(/Later edit: that went fully to completion, no issues.)

I built the VM setup one-to-one from the settings on the physical server.

Back on the physical server, it times out while "Sending 1 of 1". Or, just now, while "Starting job...".
 

JaimieV

I left it and came back to this. Well, at least it's consistent now...

[Screenshot: iocage2.png, replication task statuses]


I'd have preferred it being consistent in the "all working" direction! Previously the top two were green, iocage was ERROR, and I've added the test one this afternoon.

The failures vary: invalid backup stream; timeout; connection reset by peer.

netstat -i shows zero errors, but I'm going to replace the network cable.

/Edit: I switched it for a known good 10gigE cable which didn't have any positive effect. Then to a 1gigE cable to the 1gigE switch. Still no improvement.
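
(While I'm swapping cables, some basic path checks seem worthwhile; large don't-fragment pings should expose an MTU mismatch. FreeBSD ping syntax, with 10.0.0.2 being the other box's address as it appears in my logs; the 8972 size assumes jumbo frames at MTU 9000, 1472 suits a standard 1500 MTU:)

ping -D -s 1472 10.0.0.2    # standard MTU, don't fragment
ping -D -s 8972 10.0.0.2    # only meaningful if both ends run MTU 9000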
 

JaimieV

Current status: the machine is haunted. I tried the builtin bge0-3 ethernet ports: they all negotiate 1gigE, get DHCP, and show no errors, but they won't pass packets even through a dumb 1gigE switch. I've powered it down in disappointment for the moment, hoping for ideas from y'all.
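
(For anyone suggesting diagnostics: my next step would have been watching the wire directly while pinging from another machine, something like this on the affected box:)

# Confirm whether ICMP actually arrives on the builtin port
tcpdump -ni bge0 icmp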

Oh, I also tried booting back to 11.3-U4 and 11.3-U3.1, and that made no difference to the success rate.
 

JaimieV

I set all the firmware (including network cards, BIOS, backplane, iDRAC) the same as on the working primary NAS R410. No change.
Then I set all the firmware to the very latest. No change. (So I did that to the primary NAS too, because why not. Nothing bad happened.)

I reverted to the builtin 1gigE Broadcom ports, and did a reset this time, which got the routing fixed. The earlier failure was my own fault: I didn't expect the old connection to still hold the route to 10.0.0.0/24 after it was disconnected and a new interface had come up via DHCP on that network. I'm sure it's sensible behaviour in some way I can't think of... yet another OS behavioural diff to keep in my head.
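
(For the record, the stale-route check and cleanup looked roughly like this, FreeBSD syntax:)

# Show which interface the route to the replication network is using
netstat -rn | grep '^10.0.0'

# Drop the stale route so the new interface's route takes over
route delete -net 10.0.0.0/24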

Using the builtin Broadcom works perfectly and consistently, including with the same 10gigE cable to the same 10gigE port (obviously at 1gig).

I swapped out the 10gigE card (Intel X540-T) for another one of the same model. No improvement, straight back to just the third replication failing.

Also tested something else: starting a Time Machine backup to a share on it. This wedged at under 400MB written through the X540, but works fine through the Broadcom. Yet I can copy an 8GB iso file to it through the X540 and it cksums OK, but then a second copy of the same file wedges and doesn't complete.
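
(To take ZFS and the file-sharing protocols out of the picture entirely, a raw TCP throughput test between the two boxes would be the next check; this assumes iperf3 is available on both ends, with 10.0.0.2 again being the other box as in my logs:)

# On one NAS: run the server side
iperf3 -s

# On the other NAS: one minute of sustained TCP through the X540
iperf3 -c 10.0.0.2 -t 60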

I tried the known-good X540-T out of the primary NAS. Same.

So, it appears that this particular Dell R410 has decided that it *really* doesn't like X540-T cards. Going by the revision numbers on the main board, it's as close to identical to my primary R410 as I'll ever see. I've gone through all the BIOS and iDRAC settings side by side: no diffs. No missing power cables or anything else visible. Both running 11.3-U5.

At the moment I've taken the 10gigE out entirely and I'll let it bed in on the Broadcom; currently it all looks perfect.

F*kin' Dells, amirite?

One thing I'd like to try is using the other PCIe slot, or going without the riser entirely (it's a 1U machine, so the card mounts sideways), but it's a bit messy as I'll have to do it without a backplate. I'll probably try that after this has settled and finished a terabyte of Time Machine dump from my desktop.
 

JaimieV

Turns out Dell disables all the onboard PCIe slots in the 1U R410 (without giving the user any choice), so running the card without the riser is out. Fair choice; they're not safely usable with the lid on. I'd never needed to discover this before.

So I tried the other slot in the PCIe riser, and... it all appears to be working. So it's the low-profile (LP) PCIe slot in the riser that's haunted, rather than any wider part of the system. I'll try the LP slot again to confirm, and if it's definitely the slot I'll have to find a full-height backplate for the X540 card; it's a bit worrying being loose.
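
(To confirm the card actually enumerates in whichever slot it's in, I'm checking it like this; ix is the device name FreeBSD's Intel 10gigE driver uses:)

# List PCI devices with driver attachments; the X540 should show as ix0/ix1
pciconf -lv | grep -A 4 '^ix'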

Further news as events warrant; I'm nowhere near convinced yet. It'll take a few days of good behaviour to get over this nonsense!
 

JaimieV

Tying this off: the card fails in the LP PCIe slot repeatably, and works fine in the full-height slot. It's had a week of good behaviour, and I've left a note to myself on the riser not to use the LP side, because I'll forget eventually!
 