Replication broken due to orphaned system dataset

NomsPlease

Cadet
Joined
Dec 14, 2023
Messages
3
I have been trying to find out why my replication is refusing to work. I had this set up previously, but only as a test that replicated a single dataset in the pool. I am now setting it up for the whole pool, excluding the dataset called Media, which looked trivial.

My system dataset blocked my ability to select the whole pool, giving the error below.
Code:
Active side: cannot unmount '/var/db/system/netdata-ae32c386e13840b2bf9c0083275e7941': pool or dataset is busy.


After getting these errors, I did the most logical thing and moved my system dataset to my boot SSDs. They have plenty of space, so it's better off there and out of the way of the other datasets. The move is done, and the main pool, BigPool, is now free of the .system datasets. (Other pool dataset names are redacted.)
Code:
boot-pool                                                                        25.6G  21.9G    96K  none
boot-pool/.system                                                                1.61G  21.9G  1.20G  legacy
boot-pool/.system/configs-f1f5036a6e4448d09a9ddb3c45165866                       6.88M  21.9G  6.88M  legacy
boot-pool/.system/cores                                                            96K  1024M    96K  legacy
boot-pool/.system/ctdb_shared_vol                                                  96K  21.9G    96K  legacy
boot-pool/.system/glusterd                                                        104K  21.9G   104K  legacy
boot-pool/.system/netdata-f1f5036a6e4448d09a9ddb3c45165866                        358M  21.9G   358M  legacy
boot-pool/.system/rrd-f1f5036a6e4448d09a9ddb3c45165866                           52.6M  21.9G  52.6M  legacy
boot-pool/.system/samba4                                                          284K  21.9G   284K  legacy
boot-pool/.system/services                                                         96K  21.9G    96K  legacy
boot-pool/.system/webui                                                            96K  21.9G    96K  legacy
boot-pool/ROOT                                                                   24.0G  21.9G    96K  none
boot-pool/ROOT/22.12.3.3                                                         6.09G  21.9G  6.08G  legacy
boot-pool/ROOT/22.12.4.2                                                         6.07G  21.9G  6.07G  legacy
boot-pool/ROOT/23.10.0                                                           5.94G  21.9G  5.94G  legacy
boot-pool/ROOT/23.10.0.1                                                         5.90G  21.9G  5.90G  legacy
boot-pool/ROOT/Initial-Install                                                      8K  21.9G  2.65G  /
boot-pool/grub                                                                   8.22M  21.9G  8.22M  legacy

Code:
BigPool                                                                          70.0T  46.1T   232K  /mnt/BigPool
BigPool/3*****                                                              47.1M  46.1T  47.1M  /mnt/BigPool/3*****
BigPool/A*****                                                                65.6G  46.1T   151K  /mnt/BigPool/A*****


When setting up my replication task, I select the entire pool and use the Exclude Child Datasets option to skip the unwanted datasets. After saving and trying to run it, I got the same error. So I figured I would check whether this directory exists and unmount it. Well, the directory doesn't exist, the dataset that would mount there doesn't exist either, and yet the task still seems stuck on it. I also added BigPool/.system to the excluded child datasets, which didn't fix it either.
Code:
root@truenas[~]# ls -lash /var/db/system/netdata-ae32c386e13840b2bf9c0083275e7941
ls: cannot access '/var/db/system/netdata-ae32c386e13840b2bf9c0083275e7941': No such file or directory


I tried recreating the task from scratch rather than reusing the previous one, in case that dataset had somehow gotten stuck in it. This made no difference and resulted in the same error. I'm unsure how replication keeps seeing a dataset that does not exist, or why this error keeps getting thrown for a non-existent directory.

The only way I could get the task to run and bypass the dataset error was to select every dataset individually instead of the pool; that worked. I would still prefer to replicate the entire pool, though, so that any dataset I add later is included by default and I have to exclude it manually if I don't want it.
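One way to double-check that nothing is still mounted at that path, beyond running ls, is to look at the kernel's mount table directly. The following is only a rough sketch: the function name is_mounted is made up for illustration, and it assumes a Linux-style /proc/mounts where the second field is the mount point.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrueNAS): succeed if the path in $1
# appears as a mount point in a mount table. The table defaults to
# /proc/mounts, where field 2 of each line is the mount point.
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit found ? 0 : 1 }' \
        "${2:-/proc/mounts}"
}

# Path taken from the error message above.
path='/var/db/system/netdata-ae32c386e13840b2bf9c0083275e7941'

if is_mounted "$path"; then
    echo "still mounted: $path"
else
    echo "not in mount table: $path"
fi
```

On the ZFS side, running zfs list -H -o name,mountpoint,mounted -r BigPool should show whether any dataset in the pool still claims a mount point under /var/db/system.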

Screenshot 2023-12-14 at 11.41.13 AM.png


Could anyone point me in a direction to resolve this? Any input would be appreciated.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
We need to start with the version of SCALE you are using. Bluefin or Cobia?

We do tend to recommend replicating individual datasets rather than pools with exclusions. It's cleaner, simpler and better tested.

If necessary, a top-level dataset can be used to hold the child datasets that need replication (advice for users planning their setup; not so easy if your system is already set up).
 

NomsPlease

Cadet
Joined
Dec 14, 2023
Messages
3
Yes, the version is essential information; I am running the current version of Cobia, TrueNAS-SCALE-23.10.0.1.

I see how selecting specific datasets would lessen the chance of accidentally replicating something undesired, but I cannot trust myself to update my replication tasks every time I add a dataset. Regardless, replication should not get stuck on something non-existent, which is what is happening here. I can replicate individual datasets, but selecting the pool always brings this .system dataset back into play.
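If I do end up staying with per-dataset tasks, a small check like the one below might at least catch datasets I forgot to add. It is only a sketch: the function name uncovered_datasets and the two list files are my own invention, and it assumes one dataset name per line in each file.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrueNAS): print dataset names that are
# listed in file $1 (every dataset in the pool) but not in file $2
# (datasets already replicated or deliberately excluded).
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
uncovered_datasets() {
    grep -vxFf "$2" "$1"
}

# Example usage on the NAS (file paths are placeholders):
#   zfs list -H -o name -r BigPool > /tmp/all.txt
#   uncovered_datasets /tmp/all.txt /tmp/covered.txt
```

The whole-line match matters here, since otherwise the pool name BigPool would match every child dataset and hide all of them.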
 