Crashed FreeNAS box :(

Status
Not open for further replies.

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
Today I am a sad panda, and not sure where to go from here.

I have a drive that is completely dead; the whole system locks up if I try to mount the pool while that drive is installed.

I am at a loss as to what to try next; I am not sure what commands or processes I can use to rebuild / recover (I suspect I am screwed).

Are there any experienced troubleshooters available to assist with this?
I have spent a couple of days at it, trying to see if I can offline the drive fast enough, but no dice.

What logs or info can I supply to help?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
If you know which drive it is then just unplug it (I assume you have a RAID-Zn or striped mirrors volume on a proper system). If you want more help please follow the forum rules and post the list of your hardware.
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
Sorry - last night was my first exposure to the problem, and I was more in a panic than thinking reasonably about it.
Most of the data is not critical anymore, but I'd hate to have to replace it if I can avoid it.

My server is one of the iXsystems servers. It has a Supermicro motherboard and capacity for 36 drives, but currently holds 20 x 2TB drives.
The bulk of these are Hitachi Enterprise drives (as supplied with the NAS), plus Seagate / WD NAS or Enterprise drives.
It is a dual Xeon system with 48 GB of RAM.

The failed drive is part of a RAID-Z2 array called 'storage'.
The drive that is failing is /dev/da12.
The array is not encrypted.

If I try to boot the system with this drive in, all I get is a continuous stream of Unretryable Error messages, and the system never reaches the console menu or the UI.
If I remove this drive, the storage pool never mounts.

I can try to mount the pool once I pop this drive back in, but I get errors straight away - as soon as it starts trying to mount, I can no longer run any CLI commands, either directly on the box or via SSH.
I don't know how I can offline this drive and replace it before I lose the ability to enter commands.
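If it helps, the procedure I understand I would normally follow (assuming the pool stayed responsive long enough to run anything; the gptids below are placeholders rather than my real ones) is roughly:

    glabel status                                                              # map da12 to its gptid
    zpool offline storage gptid/<da12-partition>                               # take the failing member offline
    zpool replace storage gptid/<da12-partition> gptid/<new-disk-partition>    # swap in the replacement

but I can't get that far before the console locks up.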
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Sounds like maybe your boot device has gone bad. Can you get a new one and replace it? What version of FreeNAS are you running?
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
The system boots OK if the problematic drive is removed - it's not a boot disk issue.
Running 9.2.1.8
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Ahhhh, you're right, I should have read your post more slowly. What output do you get when you try to import the pool? Also, how about zpool status? I think it will be empty if you can't import the pool, but I want to double-check.
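Something like the following from an SSH session would be useful (both commands are read-only, so they are safe to run even in this state):

    zpool import       # scans the disks and lists pools available for import, with per-device status
    zpool status -v    # shows currently imported pools; probably empty here, but worth confirming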
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
zpool status doesn't report anything, as there is no pool imported.
The import of the pool starts, but generates errors on the disk almost immediately.

As soon as that happens, the CLI becomes unresponsive (so I can't offline the drive).
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Let me repeat this back to you.

You have a RAID-Z2 vdev called "storage". PRECISELY ONE SINGLE DRIVE in that vdev has failed, you are sure there are no other failures, and you are saying when you yank that drive, the pool won't mount/import at boot time.

Is that correct?
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
Hi, thanks for your reply.

Yes. If I look at the drives list when the one failed drive (da12) is removed, all remaining 18/19 drives are listed.
After the system is booted, I can connect the failed drive, run camcontrol rescan all, and then run zpool import.
It shows that the zpool storage is ONLINE.

If I then run zpool import storage, this drive starts flooding the console with Unretryable errors and no other commands work.
I can't offline or detach the failed drive, or try to replace it.
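For reference, the exact sequence I'm running (as described above) is:

    camcontrol rescan all    # re-detect the reattached da12
    zpool import             # storage is listed here as ONLINE and importable
    zpool import storage     # the Unretryable errors start flooding the console at this point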

I'll get some captures next time I'm in front of it over the weekend.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
OK dumb question. With 18/19 drives, it onlines the pool, and everything is fine.

Just take a new drive, put it in the 19th position, resilver it, and call it a day.

What am I missing?
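To be concrete, that amounts to something like the following (device names are placeholders; I believe the Replace button on the GUI's Volume Status page does the same thing under the hood):

    zpool replace storage <failed-or-missing-device> gptid/<new-disk-partition>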
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
So it's been a while since I've been able to progress this, but it seems leaving it alone helped.

In response to DrKK -
I had the same thoughts as you: why couldn't I just put a new drive in? I tried, but none of the zpool commands would work because the pool was not mounted (it showed as ONLINE, but was not imported).

I got called away on a project, and just left it sitting for 10 days.
When I got back and looked at it, I started seeing some issues with filenames of data stored in the pool - that's strange, I thought.

All of a sudden the pool was back online - it just needed time to 'mount', I guess.

So after that I popped in a new 2TB drive and selected to replace the failed one; this is where my current woes are.

The drive is continually resilvering. It never completes. (The failed drive is still installed, and I'm not sure if I should remove it.)
Every time I look at the output of zpool status, it shows the resilver as having started at a different time.

I have been able to recover the core of the data I wanted off this system.

Should I just remove the troublesome drive now and see what happens?

  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Aug 19 13:33:49 2015
        13.3T scanned out of 32.7T at 528M/s, 10h44m to go
        23.5G resilvered, 40.49% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        storage                                           ONLINE    1.24K     0     0
          raidz2-0                                        ONLINE        0     0     0
            gptid/5213272e-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/527a014f-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/52ddec11-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/534088df-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/53a57b8e-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/54104636-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/54755fcc-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/a7552ce7-8a7a-11e4-a46f-00259065cebd    ONLINE        0     0     0  block size: 512B configured, 4096B native
            gptid/5541205d-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/5196311c-e8b0-11e3-ba3e-00259065cebd    ONLINE        0     0     0  block size: 512B configured, 4096B native
            gptid/56102462-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
            gptid/5675e4b4-39f8-11e1-bb85-00259065cebc    ONLINE        0     0     0
          raidz1-1                                        ONLINE        0     0     0
            gptid/de789459-a395-11e4-8adf-00259065cebd    ONLINE        0     0     0
            gptid/c614533f-df38-11e3-815f-00259065cebd    ONLINE        0     0     0
            gptid/55ebee3a-9617-11e3-8fb0-00259065cebd    ONLINE        0     0     0
            gptid/586e26a9-9617-11e3-8fb0-00259065cebd    ONLINE        0     0     0
            gptid/76e94bcd-e025-11e3-8b40-00259065cebd    ONLINE        0     0     0
            gptid/775ae1ae-e025-11e3-8b40-00259065cebd    ONLINE        0     0     0
            replacing-4                                   ONLINE    1.24K     0     0
              gptid/671798f4-2572-11e5-afa2-00259065cebd  ONLINE    1.24K     0     5  (resilvering)
              gptid/64c2b9d4-40ce-11e5-8cb6-00259065cebd  ONLINE        0     0 2.48K  (resilvering)
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Resilvering can take a long time. You can see where the above status shows that you have 10 hours and 44 minutes left to go, right?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, but he's said that the "resilver in progress since" date/time keeps changing. That isn't expected behavior, AFAIK. I don't think this has anything to do with it, but the pool layout's kind of messed up--a 12-disk RAIDZ2 striped with a 7-disk RAIDZ1? Not good. And that's assuming that nothing's been added as a stripe, which we can't tell, since the zpool status wasn't posted in code tags.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
There appears to be a RAIDZ1 vdev called raidz1-1, which is striped with a RAIDZ2 vdev (which appears OK) to make a pool called storage. From what the OP said, the RAIDZ1 started with 8 disks. If it actually started with 7 disks there may be some hope. But if it started with 8 disks, it has now lost two disks and is trying to resilver two disks, one of which is good but the other is known to have failed. The fact that the pool could not be imported when one disk was removed certainly suggests that more than one disk had failed in that vdev. If there are 21 disks now in the machine, it seems almost certain that two have failed and one of the replacements cannot possibly work. The only hope is that it finally succeeds in resilvering the new disk, using whatever it can retrieve from the failed disk. Then there would only be one disk to replace. But that looks as though it may not happen.
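If the resilver onto the new disk does eventually finish, the old failing member of replacing-4 should be dropped automatically; if it isn't, it can in principle be detached by hand. Something like the following, using the gptid from the status output above - though I would double-check that this really is the old disk, and only do it once the resilver has completed:

    zpool detach storage gptid/671798f4-2572-11e5-afa2-00259065cebd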
 

auskento

Dabbler
Joined
Feb 7, 2013
Messages
13
Thanks for the replies.

Yes, the resilver keeps starting over. I will grab the data I want off it and then yank the drive and see what the result is.

Yes, the server originally came with 12 drives. The others have been added for additional space.
The array expansion didn't go as I'd expected (as seen).

I figure I'm going to recover what data I can and then scratch the array and start it over.
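When I do, the rough plan is something along these lines - device names and vdev layout are just an illustration, not a final decision, and on FreeNAS I'd actually do the creation through the GUI volume manager rather than at the command line:

    zpool destroy storage                                                           # only after everything worth keeping is copied off
    zpool create storage raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9             # one 10-disk RAID-Z2 vdev
    zpool add storage raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19      # a second, matching vdev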
 