Importing zpool stuck after reboot

Status
Not open for further replies.

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Specs of the machine are in my sig (FS01 in this case). I can't post a debug or command output because I am mid-import from bootup.

Ok here is the situation.

Last night I updated to the latest FreeNAS build after testing it on two other servers with success. Everything came up great and I went to bed.

Today I did a large delete of probably 70TB of data. I then noticed FreeNAS was stuck trying to delete a snapshot that was weeks stale. I tried to kill the process, but it wouldn't let me, nor would it let me release the hold.
I turned off replication and rebooted with the intention of releasing/destroying that snapshot when the system came back up.

Instead, my OS kernel panicked on reboot. On the second reboot I got to "importing pool", but it's been a solid 45 minutes and nothing yet. I went down to the server and I can see the disks blinking, so I know it's doing something.

After some research, it sounds like this could be caused by that large delete. Is the zpool import doing some sort of remediation right now?

Thanks

Edit:
Here is a screenshot
[screenshot attached]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The pool import has to complete the in-process transaction. Deleting a large snapshot that involves a lot of I/O can in fact cause the responsible command to "hang" for the duration. ZFS has to wander around and update a lot of bookkeeping, especially if there are lots of small files. You reboot that big boy, and now you're going to find your filer actually out of service, because the stuff that it WAS doing while it was running in multiuser mode now has to be done as part of the pool import, where it cannot be serving files.
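
If the pool has the async_destroy feature (a sketch, assuming it does, and that the pool is named tank), you can at least watch the backlog of deferred frees drain once the import finally completes:

Code:
# space still queued to be freed by the destroy; zero means the bookkeeping is done
zpool get freeing tank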

LESSON TO LEARN: LEAVE IT ALONE. There are no shortcuts. There's just making the problem worse. Next time let it complete the snapshot deletion.

Sorry, probably not what you wanted to hear. Your only option now is to let it complete that operation and then it should eventually come up fine.
 
Joined
Oct 2, 2014
Messages
925
I don't have answers, but it seems @jgreco does; what I'm here for is I wanna see dis:

fs01 (607TB) - Dell R510 - 24core / 128GB RAM / 10GBe / 3x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs02 (607TB) - Dell R510 - 24core / 128GB RAM / 10GBe / 3x SC847J / LSI9300-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs03 (87TB) - Dell R510 - 24core / 64GB RAM / 10GBe / 1x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc
fs04 (87TB) - Dell R510 - 24core / 64GB RAM / 10GBe / 1x SC847J / LSI9206-16E / 2x200GB SSD zil / 4x512GB SSD l2arc

<insert pictures here>
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
The pool import has to complete the in-process transaction. Deleting a large snapshot that involves a lot of I/O can in fact cause the responsible command to "hang" for the duration. ZFS has to wander around and update a lot of bookkeeping, especially if there are lots of small files. You reboot that big boy, and now you're going to find your filer actually out of service, because the stuff that it WAS doing while it was running in multiuser mode now has to be done as part of the pool import, where it cannot be serving files.

LESSON TO LEARN: LEAVE IT ALONE. There are no shortcuts. There's just making the problem worse. Next time let it complete the snapshot deletion.

Sorry, probably not what you wanted to hear. Your only option now is to let it complete that operation and then it should eventually come up fine.

I wasn't actually manually deleting a snapshot. My snapshot schedule is every four hours, kept for four days, and FreeNAS had been attempting (but failing) to delete this one with its automated process for about three weeks, so I was forced to do something. FreeNAS replication had a ZFS hold on it which was busted because of an upgrade (it correlates with the snapshot time). I tried to force-kill the process so I could release the hold, but it would not work; I tried everything. That said, I do agree in the "leave it alone" sense, but the context is a little different.
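
For the record, this is roughly what I was trying (the snapshot name here is just an example):

Code:
# show any holds on the stuck snapshot
zfs holds tank/vol1@auto-20150915.1200-4d
# then try to drop the replication hold
zfs release freenas:repl tank/vol1@auto-20150915.1200-4d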

Regardless, after about 90 minutes the pool imported and I'm back online, but my replication is broken, which has happened before because of OS upgrades. The silver lining is that the new replication upgrades look very promising; I like the changes, and hopefully we're all good from here on out.

Just wondering if I can fix it or if I have to re-replicate 380TB of data.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Just wondering if I can fix it or if I have to re-replicate 380TB of data.
I've also been bitten by stale snapshots and dorked-up replication. After wrestling with only 30TB, I gave up: I deleted all my replication tasks and far-side datasets and started fresh.

You can try to manually release the holds using this command:
Code:
zfs list -t snapshot -H -o name,userrefs | grep 1\$ | cut -f1 | xargs sudo zfs release freenas:repl
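
(That pipeline lists every snapshot with its userrefs count, greps for counts ending in 1, i.e. held snapshots, cuts out just the snapshot name, and hands each one to zfs release to drop the freenas:repl hold.)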

As suggested in this bug report: https://bugs.freenas.org/issues/11647
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
I've also been bitten by stale snapshots and dorked-up replication. After wrestling with only 30TB, I gave up: I deleted all my replication tasks and far-side datasets and started fresh.

You can try to manually release the holds using this command:
Code:
zfs list -t snapshot -H -o name,userrefs | grep 1\$ | cut -f1 | xargs sudo zfs release freenas:repl

As suggested in this bug report: https://bugs.freenas.org/issues/11647

That's similar to the command I ran. I did release everything, but there is nothing I can do to "fix" replication again; I'm about to delete all my snapshots/replication tasks as you said. Unfortunate.
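
If anyone hits the same thing, a quick way to confirm nothing is still held before deleting (pool name assumed to be tank):

Code:
# any snapshot printed here still has a hold (userrefs > 0)
zfs list -H -t snapshot -o name,userrefs -r tank | awk '$2 > 0'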
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
I deleted / re-enabled everything and replication is still not starting. Status is showing as "Failed: WARNING enabled NONE in cipher".

Never seen this message before.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Did you delete and recreate (you said re-enabled, which is why I ask) the replication job?

Try changing the cipher to fast, then save, then change back to disabled and save to see if the error goes away.
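
If you want to test outside the GUI: as far as I know, the "disabled" cipher setting maps to HPN-SSH's NONE cipher, so you could check whether the remote sshd will accept it with something like this (user and host made up):

Code:
# NoneEnabled/NoneSwitch are HPN-SSH options; this fails if the remote side refuses the NONE cipher
ssh -o NoneEnabled=yes -o NoneSwitch=yes repl@remote-nas echo ok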
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Did you delete and recreate (you said re-enabled, which is why I ask) the replication job?

Try changing the cipher to fast, then save, then change back to disabled and save to see if the error goes away.

I deleted all snapshots from both sides, then deleted the replication tasks from the sending side, then re-created the replication tasks.
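
For anyone following along, the bulk snapshot delete can be done from the CLI with something like this (the dataset name is made up):

Code:
# list every snapshot under the dataset and destroy them one at a time
zfs list -H -t snapshot -o name -r tank/backups | xargs -n 1 zfs destroy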

I tried changing to fast and then back to disabled, but the problem persists. Is this a bug?

Update:

It took a while, but it looks to be sending the initial snapshot now; I can see network activity on the remote host (the error remains, however).

Sender
[screenshot attached]


Receiver
[screenshot attached]
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So does the new replication system support multiple retention tiers properly now?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(
I don't think there were any manual snapshot deletions mentioned. In fact, the OP specifically said he didn't.
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
This is an example of why a week or so ago I said that people should avoid deleting snapshots manually. If you do, and you break replication, it's likely broken for good and a re-replication from scratch is required. :(

Like depasseg said, I didn't delete snapshots manually. That is, until I was forced to because an OS upgrade broke replication.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Did you delete using the GUI or the CLI?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Like depasseg said, I didn't delete snapshots manually. That is, until I was forced to because an OS upgrade broke replication.

Right, and you broke replication fatally by deleting the snapshot. I'm aware of people having problems with broken replication. I, personally, haven't had any problems with them. Not sure why I've been lucky when so many others haven't.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Cuz you're cool?

Cuz you're awesome?

Cuz you're lucky?

Cuz you're lying?

who knows :smile:
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Right, and you broke replication fatally by deleting the snapshot. I'm aware of people having problems with broken replication. I, personally, haven't had any problems with them. Not sure why I've been lucky when so many others haven't.

Replication was already fatally broken; it was stuck trying to replicate a snapshot that was 3-4 weeks old when my policy is set to 4 days. I correlated it to a bug that other users encountered. On a more positive note, I'm going to change my approach due to my large amount of data.

For each host that I'm backing up, I'm creating a dataset with its own snapshot and replication tasks. It won't affect the speed, but it will give me more control over individual hosts; most specifically, it will allow me to prioritize the most important ones.
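
Roughly like this (the names are made up):

Code:
# one dataset per host; each gets its own periodic snapshot and replication task
zfs create tank/backups/host-db01
zfs create tank/backups/host-web01
zfs create tank/backups/host-mail01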
 