FreeNAS all of a sudden crashed!!

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
Hi, I'm running into an issue with my previously working FreeNAS setup, which is now stuck in a boot loop. I was trying to do a network backup, after which FreeNAS became stuck in a boot loop with the output shown in the attached images. It seems to be related to loading the RAID-Z pool. When I boot FreeNAS as a fresh install, it boots fine. When I then try to import the existing pool, the console shows the same output and the system reboots. This FreeNAS setup has been working for quite some time and the drives seem to be fine. I am running it in ESXi. I have tried reducing the memory allocation and the number of vCPUs, with no positive results. Any thoughts?

Not sure if this is the right subforum to post under, so please let me know.
 

Attachments

  • 823_494_1.png (16.7 KB)
  • 403_82_1.png (1.4 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I am running it in ESXi.
Did you connect the drives via an HBA in PCI passthrough mode, or is VMware in the way in any manner (virtual disks, local RDM, etc.)?

Also, please post full hardware specs.
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
It's in passthrough mode using an LSI 9211-8i.

Hardware Specs:
Asus Rampage IV Formula X79
Intel Xeon CPU 12C/24T
32 GB Ripjaws memory (non-ECC) running at 1600 MHz
4x WD 4TB drives
LSI 9211-8i RAID controller (passthrough)
ESXi 6.7 U2
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
Tried following some of what was stated in this thread


specifically zpool import -F -f -n RAID-Z

Got the same issue...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Check the model of your WD drives against the list of known SMR drives; there are some firmware issues that could contribute to this. Non-ECC RAM could also be a culprit.

The short version is that it looks like your pool is corrupted. If you have a backup, restore from that. If not, read on.

(If this is irreplaceable data then start by block copying your drives using dd or a similar tool. Only work on the copies.)
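
For anyone unsure what "block copying" looks like in practice, here is a minimal dry-run sketch. It only prints the dd commands rather than running them. The device names (da0 to da3) and the destination directory are assumptions made up for this example, so check the real device names and make sure the destination has room for four full disk images before running anything for real.

```shell
#!/bin/sh
# Dry-run sketch: prints one dd command per disk instead of running it.
# DISKS and DEST are placeholders, not values taken from this thread.
DISKS="da0 da1 da2 da3"   # hypothetical device names for the 4 pool members
DEST="/mnt/scratch"       # hypothetical destination with enough free space

# image_cmd DISK: print the dd invocation that would image one whole disk.
# conv=noerror,sync keeps going past read errors and pads short reads, so
# offsets in the image stay aligned with the source disk.
# bs=1m is the FreeBSD spelling; GNU dd expects bs=1M.
image_cmd() {
    printf 'dd if=/dev/%s of=%s/%s.img bs=1m conv=noerror,sync\n' "$1" "$DEST" "$1"
}

for d in $DISKS; do
    image_cmd "$d"
done
```

Review the printed commands, then run them by hand one at a time once the paths are confirmed.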

With your pool exported, try using zdb -e RAIDZ -ul to list potential previous uberblocks and their transaction groups (txg), then run zpool import -F RAIDZ -T 12345678 -n -o readonly=on with an actual txg number in place of the filler number. If it says the import will work, remove the -n to do it for real, check the contents to make sure things are there, and then either copy everything off, or export and import again without the readonly flag to actually do the rollback.
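
The two steps above could be scripted roughly as follows. This is a sketch under assumptions: the pool name in POOL is a placeholder, the helper names list_txgs and dry_run_cmd are invented for this example, and the parser assumes the uberblock section of the zdb output contains lines of the form "txg = <number>". The printed command puts the options before the pool name, which is the order given in the zpool import synopsis.

```shell
#!/bin/sh
# Sketch only: POOL is a placeholder; use the name your pool actually has.
POOL="RAIDZ"

# list_txgs: read `zdb -e <pool> -ul` style text on stdin and print the
# unique txg numbers found on "txg = N" lines, newest (largest) first.
# Anchoring at the start of the line skips fields such as checkpoint_txg.
list_txgs() {
    sed -n 's/^[[:space:]]*txg[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' | sort -rnu
}

# dry_run_cmd TXG: print (do not run) the read-only, no-op import for one txg.
dry_run_cmd() {
    printf 'zpool import -F -T %s -n -o readonly=on %s\n' "$1" "$POOL"
}

# Possible usage, with the pool exported:
#   zdb -e "$POOL" -ul | list_txgs | while read -r txg; do dry_run_cmd "$txg"; done
```

Nothing here touches the pool: the helpers only parse text and print candidate commands for you to review.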
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
I'll give this a shot.

One thing I noticed while surfing the web on this controller today is that firmware 20 is recommended. I am currently running 18 and have been trying to get it flashed most of the day with no success. Could this be an issue as well or am I wasting my time?
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
So I managed to get the firmware flashed, and after reading your message I didn't expect that to do much. I ran the zdb command and it showed txg=498163. I then followed up with the zpool command, which basically hung. I am letting it sit overnight in case this is a long process, but I assume that if it has not imported the pool by tomorrow morning I am SOL.

I am planning to build a new rig soon with drives that are not on the SMR list and the new rig will have ECC. This is the first time that I have been bitten by anything like this. I thought ECC was to help prevent issues with sudden loss of power or is this just a portion of what it helps with?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I thought ECC was to help prevent issues with sudden loss of power or is this just a portion of what it helps with?

Hi,

That is really not how ECC works... If power is lost, the contents of RAM evaporate, ECC or not. The purpose of ECC is to detect errors and corruption in RAM and fix them. These errors can come from many different sources, defective RAM being only one example. Should you store a series of 1s in RAM and for some reason one of them is flipped to a 0, ECC will detect and fix it. Some RAM is even multi-bit ECC and can correct more than one error at once.

ZFS assumes that whatever gets written to RAM will be read back as is. As such, it does not checksum data in RAM the way it does for data written to the drives. Without ECC, that assumption has a much higher risk of ending up false. Depending on what gets corrupted, it can compromise up to the entire pool.
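
As a rough illustration of the "detect a flipped bit and fix it" idea, here is a toy Hamming(7,4) code in plain sh. It is a sketch, not a description of how any particular ECC DIMM or memory controller works: real ECC memory does this in hardware over much wider words. The function names encode and decode are made up for this example.

```shell
#!/bin/sh
# Toy Hamming(7,4): 4 data bits + 3 parity bits, corrects any single flipped bit.
# Codeword positions are numbered 1..7 (position 1 is the least significant bit).
# Parity bits sit at positions 1, 2, 4; data bits at positions 3, 5, 6, 7.

# encode NIBBLE (0..15): print the 7-bit codeword as an integer.
encode() {
    d1=$(( $1 & 1 )); d2=$(( ($1 >> 1) & 1 ))
    d3=$(( ($1 >> 2) & 1 )); d4=$(( ($1 >> 3) & 1 ))
    p1=$(( d1 ^ d2 ^ d4 ))   # parity over positions 3, 5, 7
    p2=$(( d1 ^ d3 ^ d4 ))   # parity over positions 3, 6, 7
    p4=$(( d2 ^ d3 ^ d4 ))   # parity over positions 5, 6, 7
    echo $(( p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6) ))
}

# decode WORD: fix up to one flipped bit, then print the original nibble.
decode() {
    w=$1
    s1=$(( (w ^ (w >> 2) ^ (w >> 4) ^ (w >> 6)) & 1 ))          # positions 1,3,5,7
    s2=$(( ((w >> 1) ^ (w >> 2) ^ (w >> 5) ^ (w >> 6)) & 1 ))   # positions 2,3,6,7
    s4=$(( ((w >> 3) ^ (w >> 4) ^ (w >> 5) ^ (w >> 6)) & 1 ))   # positions 4,5,6,7
    syndrome=$(( s1 | (s2 << 1) | (s4 << 2) ))   # position of the bad bit, 0 = none
    if [ "$syndrome" -ne 0 ]; then
        w=$(( w ^ (1 << (syndrome - 1)) ))       # flip the bad bit back
    fi
    echo $(( ((w >> 2) & 1) | (((w >> 4) & 1) << 1) | (((w >> 5) & 1) << 2) | (((w >> 6) & 1) << 3) ))
}

word=$(encode 11)            # store the value 11 together with its parity bits
bad=$(( word ^ (1 << 4) ))   # simulate a single flipped bit (position 5)
decode "$bad"                # prints 11
```

Without the parity bits, the flipped bit would simply be read back as a different value, which is the situation non-ECC RAM leaves ZFS in.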
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
Gotcha, I never really looked much into ECC RAM beyond knowing it was better than non-ECC.

I ended up having to rebuild the pool... Sucks but is what it is. Thanks for the help...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I ended up having to rebuild the pool... Sucks but is what it is. Thanks for the help...
I was going to suggest trying to go back a few txg numbers from what your most recent one was - it would have potentially cost you "some" data, but if you've already rebuilt the pool - well, I hope there wasn't anything terribly critical there.
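
"Going back a few txg numbers" could be sketched like this: take the txg list from zdb and keep only the few entries just older than the one that failed. The helper name older_txgs is hypothetical, and apart from 498163 (the value reported earlier in the thread) the numbers in the usage comment are invented for illustration.

```shell
#!/bin/sh
# older_txgs NEWEST COUNT: read txg numbers on stdin (any order, one per line)
# and print up to COUNT of them that are strictly older than NEWEST,
# closest to NEWEST first.
older_txgs() {
    sort -rnu | awk -v newest="$1" -v count="$2" \
        '$1 + 0 < newest + 0 && n < count { print $1; n++ }'
}

# Possible usage with made-up candidates around the reported txg:
#   printf '%s\n' 498163 498160 498150 498140 | older_txgs 498163 2
# Each printed txg can then be tried with the read-only dry-run import
# (zpool import -F -T <txg> -n -o readonly=on <pool>) before anything else.
```

The further back the chosen txg, the more recent writes are lost, so it makes sense to try the closest candidates first.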
 

marcus8699

Dabbler
Joined
May 4, 2020
Messages
17
I kept getting the same txg numbers and it would just hang for a considerable amount of time. I had some of the data backed up, about 60% of it. Nothing really critical, but it's a pain now to get back the movie collection I had stored. Thanks for your help. I am looking into CrashPlan running in a separate VM to back up the NAS for future issues like this. Hopefully the build that I am looking into will be better as well.

Also, I added a 4TB Red drive into a batch of 3x 4TB Green drives. It seems that the one Red drive is an SMR drive, so potentially this could've been the problem.
 

Matt_G

Explorer
Joined
Jan 24, 2016
Messages
65
Also, I added a 4TB Red drive into a batch of 3x 4TB Green drives. It seems that the one Red drive is an SMR drive, so potentially this could've been the problem.
Since a vdev can't be extended, may I ask how you added that drive?
When I see statements like that, I immediately have visions of someone adding a 1-drive vdev to an existing pool that originally consisted of a 3-drive RAIDZ1 vdev.

Not saying that's what you did but I am very curious...
 