Pool UNAVAILABLE due to persistent errors

Status
Not open for further replies.

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
I shut down my system before leaving for a trip of several weeks. When I restarted it, it showed that one drive had faulted. I rebooted, thinking that maybe the drive had just been a little slow to become ready, and that drive showed as OK but a different one had faulted. After another reboot, the same (second) drive still showed as faulted, so I started a replacement procedure. Then overnight two more drives faulted (three drives in one of the two striped RaidZ2 vdevs), and now the pool is marked as UNAVAILABLE.

Code:
zpool status -v 


shows a lot of errors in the snapshots that were created yesterday but also errors of the form

Code:
<metadata>:<0xXXX>


and

Code:
Pool1:<0xXX>


The output of

Code:
ls -l /mnt/Pool1


looks normal, and the few plain-text files I've examined seem to be OK, but I have no idea what other damage may exist.

None of this is of vital importance, but I wouldn't want to lose it.

This system IS my backup machine, having replaced a huge collection of DDS4 tapes.

smartctl shows all drives as PASSED.

Any suggestions on how to proceed?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Destroy this pool, recreate it, redo the backups ;)

But this server seems to have a big problem, which might be PSU-related, cable-related, HBA-related, or... I recommend checking everything before using this server again.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What's the real output of smartctl?

The SMART health assessment is utterly useless.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
It probably goes without saying, but you should figure out a new backup target in the meantime. It may take some work to figure out what exactly went wrong. Respect Murphy.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Smart_Results.txt attached: The three drives for which smart results are included are the drives that faulted.
 

Attachments

  • Smart_Results.txt
    20.1 KB

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Two of the drives have crazy high command timeout counts, which is unusual since only one of them actually has the CRC errors typical of cable issues.

I'd investigate PSU and/or host issues.
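
For anyone wanting to pull the same counters out of their own drives, here is a minimal sketch. It assumes the usual ATA attribute table printed by smartctl -A, with the raw value in the tenth column. The function name smart_key_attrs and the device name /dev/ada0 are just illustrative, and attribute names vary by vendor, so not every drive will report all four.

```shell
# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for a few
# attributes that tend to matter more than the overall PASSED/FAILED verdict.
# Assumes the standard ATA attribute table layout (raw value in column 10).
smart_key_attrs() {
    awk '$1 ~ /^[0-9]+$/ &&
         $2 ~ /^(Reallocated_Sector_Ct|Command_Timeout|Current_Pending_Sector|UDMA_CRC_Error_Count)$/ {
             print $2, $10
         }'
}

# Example use (device name is an assumption):
#   smartctl -A /dev/ada0 | smart_key_attrs
```

A rising UDMA_CRC_Error_Count usually points at cabling, while Command_Timeout without CRC errors is what makes the PSU or host side worth a look.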
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
The output of

zpool status

suggests that marking the pool as repaired using 'zpool clear' may allow some data to be recovered. Is it reasonable to expect the files that are still accessible to be OK still? I'm assuming that some may no longer be accessible at all; is that correct?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Whatever can be recovered should be accessible with or without running zpool clear, from what I know. Zpool clear just tells ZFS "ok, noted, problem taken care of".
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Would it be unsafe (i.e., cause even more damage) at this point to reboot after replacing the power supply and then install a couple of temporary drives onto which to copy -- as a temporary backup -- whatever can be recovered? (It should be quicker to copy this way than to copy over the network. Moreover, some of these files are backups of other machines as they were at some point in the past and not as they are now.)

Is there any way to discover what the <metadata>:<0xXXX> and Pool1:<0xXXX> entries (see my first post) correspond to? No-longer-accessible or damaged files?
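
For context, as far as I understand the ZFS documentation: an entry of the form <metadata>:<0xXXX> refers to an object in the pool's own metadata rather than to a file, and an entry of the form Pool1:<0xXXX> is an object in that dataset whose path could not be resolved (for example because the file no longer exists or the lookup itself failed). The sketch below only sorts the lines of zpool status -v into those groups so the list is easier to read; the function name and category labels are my own, and it does not map object numbers back to file names.

```shell
# Read `zpool status -v` output on stdin and label each line of the error list.
# Lines that are not error entries (pool name, config table, ...) come out as "other".
classify_zpool_errors() {
    while IFS= read -r line; do
        # Strip leading whitespace using POSIX parameter expansion.
        entry=${line#"${line%%[![:space:]]*}"}
        case "$entry" in
            '')                   ;;  # skip blank lines
            '<metadata>:<0x'*'>') printf 'pool-metadata\t%s\n' "$entry" ;;
            *':<0x'*'>')          printf 'unresolved-object\t%s\n' "$entry" ;;
            /*)                   printf 'file\t%s\n' "$entry" ;;
            *@*:/*)               printf 'snapshot-file\t%s\n' "$entry" ;;
            *)                    printf 'other\t%s\n' "$entry" ;;
        esac
    done
}

# Example use:
#   zpool status -v Pool1 | classify_zpool_errors | grep -v '^other'
```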
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It depends on the state of the hard drives. If the issues are mechanical (i.e. motor, R/W arm and heads, etc.), a reboot is probably a bad idea. If they're problems with the disk surface (bad sectors due to an unfortunate impact with the heads that caused no further damage, random bitflips, etc.), a reboot is probably safe.
If it's an electrical problem, I honestly don't know which option is better.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
I haven't rebooted -- waiting for a new power supply. Copying some files to a USB-connected drive (I know, I know) works OK, but some give a "Bad address" message and some a "Device not configured" message.
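
A rough sketch of a file-by-file copy that keeps going when individual files fail and records which ones did, so the failures can be reviewed afterwards. The function name salvage_copy, the log format, and the destination path in the example are all hypothetical; it also assumes file names contain no newlines. A tool such as rsync would likely be a better fit for a real recovery.

```shell
# Copy every regular file under SRC to DST, preserving the directory layout.
# cp's own error text (e.g. "Bad address") and a FAILED line naming the file
# are appended to LOG; the loop continues with the next file after a failure.
salvage_copy() {
    src=$1 dst=$2 log=$3
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        mkdir -p "$dst/$(dirname "$rel")"
        if ! cp -p "$src/$rel" "$dst/$rel" 2>>"$log"; then
            printf 'FAILED\t%s\n' "$rel" >>"$log"
        fi
    done
}

# Example use (destination and log paths are assumptions):
#   salvage_copy /mnt/Pool1 /mnt/usbdisk/rescue /tmp/salvage.log
#   grep '^FAILED' /tmp/salvage.log
```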
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
I had to reboot because I could no longer access the machine via the GUI and the console (via IPMI) was full of scrolling messages I couldn't read.

Now that I have rebooted,

zpool status -v

shows the pool as DEGRADED rather than UNAVAILABLE; it is resilvering and reporting "No known data errors"! Can I believe this?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The error counters are cleared upon reboot.
 