Importing zpool stuck after reboot

Status
Not open for further replies.

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Specs of the machine are in my sig (FS01 in this case). I can't post a debug or command output because I am mid-import from bootup.

Ok here is the situation.

Last night I updated to the latest FreeNAS build after testing it on two other servers with success. Everything came up great and I went to bed.

Today I did a large delete of probably 70TB of data. I then noticed FreeNAS was stuck trying to delete a snapshot that was weeks stale. I tried to kill the process, but it wouldn't let me, nor would it let me release the hold.
I turned off replication and rebooted with the intention of releasing/destroying that snapshot when the system came back up.

Instead, my OS kernel panicked on reboot. On the second reboot I got to "importing pool", but it's been a solid 45 minutes and nothing yet. I went down to the server and I can see the disks blinking, so I know it's doing something.

After some research, it sounds like this could be caused by that large delete. Is the zpool import doing some sort of remediation right now?

Thanks

Edit:
Here is a screenshot
[screenshot attached]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The pool import has to complete the in-process transaction. Deleting a large snapshot that involves a lot of I/O can in fact cause the responsible command to "hang" for the duration. ZFS has to wander around and update a lot of bookkeeping, especially if there are lots of small files. You reboot that big boy, and now you're going to find your filer actually out of service, because the stuff that it WAS doing while it was running in multiuser mode now has to be done as part of the pool import, where it cannot be serving files.
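
If the pool has the async_destroy feature (a sketch, assuming it does, and that the pool is named tank), you can at least watch the backlog of deferred frees drain once the import finally completes:

Code:
# space still queued to be freed by the destroy; zero means the bookkeeping is done
zpool get freeing tank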

LESSON TO LEARN: LEAVE IT ALONE. There are no shortcuts. There's just making the problem worse. Next time let it complete the snapshot deletion.

Sorry, probably not what you wanted to hear. Your only option now is to let it complete that operation and then it should eventually come up fine.
 
Joined
Oct 2, 2014
Messages
925
I don't have answers, but it seems @jgreco does; what I'm here for is I wanna see dis:

fs01 (607TB) - Dell R510 - 24core / 128GB RAM / 10GBe / 3x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs02 (607TB) - Dell R510 - 24core / 128GB RAM / 10GBe / 3x SC847J / LSI9300-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs03 (87TB) - Dell R510 - 24core / 64GB RAM / 10GBe / 1x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs04 (87TB) - Dell R510 - 24core / 64GB RAM / 10GBe / 1x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc

<insert pictures here>
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
The pool import has to complete the in-process transaction. Deleting a large snapshot that involves a lot of I/O can in fact cause the responsible command to "hang" for the duration. ZFS has to wander around and update a lot of bookkeeping, especially if there are lots of small files. You reboot that big boy, and now you're going to find your filer actually out of service, because the stuff that it WAS doing while it was running in multiuser mode now has to be done as part of the pool import, where it cannot be serving files.

LESSON TO LEARN: LEAVE IT ALONE. There are no shortcuts. There's just making the problem worse. Next time let it complete the snapshot deletion.

Sorry, probably not what you wanted to hear. Your only option now is to let it complete that operation and then it should eventually come up fine.

I wasn't actually manually deleting a snapshot. My snapshot schedule is every four hours, kept for four days, and FreeNAS had been attempting (but failing) to delete this one with its automated process for about three weeks, so I was forced to do something. FreeNAS replication had a ZFS hold on it which was busted because of an upgrade (it correlates with the snapshot time). I tried to force-kill the process so I could release the hold, but it would not work; I tried everything. That said, I do agree in the "leave it alone" sense, but the context is a little different.
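
For the record, this is roughly what I was trying (the snapshot name here is just an example):

Code:
# show any holds on the stuck snapshot
zfs holds tank/vol1@auto-20150915.1200-4d
# then try to drop the replication hold
zfs release freenas:repl tank/vol1@auto-20150915.1200-4d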

Regardless, after about 90 minutes the pool imported and I'm back online, but my replication is broken, which has happened before because of OS upgrades. The silver lining is that the new replication upgrades look very promising; I like the changes, and hopefully we're all good from here on out.

Just wondering if I can fix it or if I have to re-replicate 380TB of data.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Just wondering if I can fix it or if I have to re-replicate 380TB of data.
I've also been bitten by stale snapshots and dorked-up replication. After wrestling with only 30TB, I gave up: I deleted all my replication tasks and far-side datasets and started fresh.

You can try to manually release the holds using this command:
Code:
zfs list -t snapshot -H -o name,userrefs | grep 1\$ | cut -f1 | xargs sudo zfs release freenas:repl
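
(That pipeline lists every snapshot with its userrefs count, greps for counts ending in 1, i.e. held snapshots, cuts out just the snapshot name, and hands each one to zfs release to drop the freenas:repl hold.)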

As suggested in this bug report: https://bugs.freenas.org/issues/11647
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
I've also been bitten by stale snapshots and dorked-up replication. After wrestling with only 30TB, I gave up: I deleted all my replication tasks and far-side datasets and started fresh.

You can try to manually release the holds using this command:
Code:
zfs list -t snapshot -H -o name,userrefs | grep 1\$ | cut -f1 | xargs sudo zfs release freenas:repl

As suggested in this bug report: https://bugs.freenas.org/issues/11647

That's similar to the command I ran. I did release everything, but there is nothing I can do to "fix" replication again; I'm about to delete all my snapshots/replication tasks as you said. Unfortunate.
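
If anyone hits the same thing, a quick way to confirm nothing is still held before deleting (pool name assumed to be tank):

Code:
# any snapshot printed here still has a hold (userrefs > 0)
zfs list -H -t snapshot -o name,userrefs -r tank | awk '$2 > 0'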
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
I deleted / re-enabled everything and replication is still not starting. Status is showing as "Failed: WARNING enabled NONE in cipher".

Never seen this message before.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Did you delete and recreate (you said re-enabled, which is why I ask) the replication job?

Try changing the cipher to fast, then save, then change back to disabled and save to see if the error goes away.
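
If you want to test outside the GUI: as far as I know, the "disabled" cipher setting maps to HPN-SSH's NONE cipher, so you could check whether the remote sshd will accept it with something like this (user and host made up):

Code:
# NoneEnabled/NoneSwitch are HPN-SSH options; this fails if the remote side refuses the NONE cipher
ssh -o NoneEnabled=yes -o NoneSwitch=yes repl@remote-nas echo ok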
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Did you delete and recreate (you said re-enabled, which is why I ask) the replication job?

Try changing the cipher to fast, then save, then change back to disabled and save to see if the error goes away.

I deleted all snapshots from both sides, then deleted the replication tasks from the sending side, then re-created the replication tasks.
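
For anyone following along, the bulk snapshot delete can be done from the CLI with something like this (the dataset name is made up):

Code:
# list every snapshot under the dataset and destroy them one at a time
zfs list -H -t snapshot -o name -r tank/backups | xargs -n 1 zfs destroy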

I tried changing to fast and then back to disabled, but the problem persists. Is this a bug?

Update:

It took a while, but it looks to be sending the initial snapshot now; I can see network activity on the remote host (the error remains, however).

Sender
[screenshot attached]


Receiver
[screenshot attached]
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So does the new replication system support multiple retention tiers properly now?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(
I don't think there were any manual snapshot deletions mentioned. In fact, the OP specifically said he didn't.
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(

Like depasseg said, I didn't delete snapshots manually. That is, until I was forced to because an OS upgrade broke replication.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Did you delete using the GUI or the CLI?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Like depasseg said, I didn't delete snapshots manually. That is, until I was forced to because an OS upgrade broke replication.

Right, and you broke replication fatally by deleting the snapshot. I'm aware of people having problems with broken replication. I, personally, haven't had any problems with them. Not sure why I've been lucky when so many others haven't.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Cuz you're cool?

Cuz you're awesome?

Cuz you're lucky?

Cuz you're lying?

who knows :smile:
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Right, and you broke replication fatally by deleting the snapshot. I'm aware of people having problems with broken replication. I, personally, haven't had any problems with them. Not sure why I've been lucky when so many others haven't.

Replication was already fatally broken; it was stuck trying to replicate a snapshot that was 3-4 weeks old when my policy is set to 4 days. I correlated it to a bug that other users encountered. On a more positive note, I'm going to change my approach due to my large amount of data.

For each host that I'm backing up, I'm creating a dataset with its own snapshot and replication tasks. It won't affect the speed, but it will give me more control over individual hosts; most specifically, it will allow me to prioritize the most important ones.
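
Roughly like this (the names are made up):

Code:
# one dataset per host; each gets its own periodic snapshot and replication task
zfs create tank/backups/host-db01
zfs create tank/backups/host-web01
zfs create tank/backups/host-mail01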
 