SOLVED Replication to remote host failed - pool or dataset is busy

TN22 · Jan 14, 2023

This job has been working for several weeks, but broke today. Main pool is DATA, backup is called BACKUP. There are existing replication jobs to replicate from DATA to BACKUP once a week. These succeed. Other snapshot jobs on the DATA pool itself also succeed. The failing job is set up to replicate from the local BACKUP pool to a "remote" TN box called MIRROR.

When the job starts it fails immediately. The job logs in /var/log/jobs are below. The folder called out in the error is mounted from the DATA pool (from the .system folder). I don't see anything mounted from BACKUP.

Any thoughts about why this is failing, and how to fix it?

Running TrueNAS Scale 22.02.3 on both boxes.

[2023/01/14 16:47:47] INFO [Thread-435] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_8.4p1)
[2023/01/14 16:47:47] INFO [Thread-435] [zettarepl.paramiko.replication_task__task_3] Authentication (publickey) successful!
[2023/01/14 16:47:49] INFO [replication_task__task_3] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/01/14 16:47:50] INFO [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from 'BACKUP' to 'MIRROR' of snapshot='auto-week-2023-01-13_02-30' incremental_base=None include_intermediate=False receive_resume_token=None encryption=False
[2023/01/14 16:47:50] ERROR [replication_task__task_3] [zettarepl.replication.run] For task 'task_3' unhandled replication error ExecException(1, "Warning: Permanently added the ECDSA host key for IP address '192.168.xx.yy' to the list of known hosts.\ncannot unmount '/var/db/system/syslog-cd93307f360c4818ad53abf4dac4059c': pool or dataset is busy\n")
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
retry_contains_partially_complete_state(
File "/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py", line 16, in retry_contains_partially_complete_state
return func()
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 182, in <lambda>
lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 278, in run_replication_task_part
run_replication_steps(step_templates, observer)
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 611, in run_replication_steps
replicate_snapshots(step_template, incremental_base, snapshots, include_intermediate, encryption, observer)
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 652, in replicate_snapshots
run_replication_step(step, observer)
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 732, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 33, in run
raise self.process_exception
File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
self.replication_process.wait()
File "/usr/lib/python3/dist-packages/zettarepl/transport/ssh.py", line 154, in wait
stdout = self.async_exec.wait()
File "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py", line 104, in wait
raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: Warning: Permanently added the ECDSA host key for IP address '192.168.xx.yy' to the list of known hosts.
cannot unmount '/var/db/system/syslog-cd93307f360c4818ad53abf4dac4059c': pool or dataset is busy
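
For anyone scanning a similar job log, the busy mountpoint can be pulled out with a short script. This is only a sketch: it assumes the error text has the same shape as in the log above, it assumes the system dataset is mounted under /var/db/system (as the path in this error suggests), and the function names are hypothetical.

```python
import re

# Matches the ZFS error text seen in the log above, for example:
#   cannot unmount '/var/db/system/syslog-...': pool or dataset is busy
BUSY_RE = re.compile(r"cannot unmount '([^']+)': pool or dataset is busy")


def find_busy_mounts(log_text):
    """Return the unique mountpoints reported as busy, in order of appearance."""
    seen = []
    for path in BUSY_RE.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


def is_system_dataset_path(path, system_root="/var/db/system"):
    """True if the path is the assumed system dataset root or sits below it."""
    return path == system_root or path.startswith(system_root + "/")
```

Feeding it the contents of the job log and checking each result with `is_system_dataset_path` shows whether the busy mount belongs to the system dataset rather than to the data being replicated.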

morganL · Jan 14, 2023

It's best to follow the forum rules: hardware details etc. are required.

I'd suggest documenting what you did from the checklist on this page.

/scale/scaletutorials/dataprotection/replication/

Another user reported a similar problem. It was resolved by moving the system dataset back to the boot pool. Is your system dataset on the pool you are backing up to?

NAS-119072
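
One way to check where the system dataset lives is to ask the middleware and compare the answer with the pool involved in the replication. A minimal sketch, assuming `midclt call systemdataset.config` prints JSON that includes a `pool` key; the helper names are hypothetical, and the script would be run on the box whose pool is in question.

```python
import json
import subprocess


def fetch_system_dataset_config():
    """Run the middleware client and return its raw JSON output as a string."""
    result = subprocess.run(
        ["midclt", "call", "systemdataset.config"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def system_dataset_pool(config_json):
    """Return the pool name from the config JSON (assumes a 'pool' key)."""
    return json.loads(config_json).get("pool")


def pool_hosts_system_dataset(config_json, pool_name):
    """True if the system dataset sits on the given pool."""
    return system_dataset_pool(config_json) == pool_name
```

For example, `pool_hosts_system_dataset(fetch_system_dataset_config(), "BACKUP")` returning true would mean the system dataset is on BACKUP, in which case moving it to the boot pool is the workaround to try.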

TN22 · Jan 15, 2023

Sorry about that. I added it to my signature; hopefully that will help so I don't have to remember.

I changed the system dataset to the boot pool, and it threw a ton of errors about things being in use, etc. Restarted the box and it stayed on the boot pool though.

After doing so, the replication was successful.

I'm not sure why the issue cropped up now; nothing really changed on that device. But at least it's working now.

(I just need to keep that old remote machine online for a bit longer as I work to get the new backup NAS fully online.)

morganL · Jan 15, 2023

Thanks for that. This seems to be an OpenZFS limitation. I'm not sure why it exists, but we should document this workaround.
