Power failure, disk DEGRADED, scrub, then REMOVED

Status
Not open for further replies.

iMIl

Cadet
Joined
Apr 26, 2016
Messages
3
Hi all,

This is my first FreeNAS setup, a brand-new 9.10 install with the latest updates. So far I've been mostly impressed, except this morning, when things went badly.

The machine, a standard (yes, I know) Dell Studio XPS 435MT motherboard with a Core i7 920 and 8 GB of non-ECC RAM, was in a panic() state. I powered off the system because no local or remote shell was available, added 4 more GB (because the panic() seemed related to swap space) and booted it up again.

I've run a complete memtest to make sure the memory was not faulty; everything's good on that side.

Once up, "zpool status" showed that one disk was in the DEGRADED state, with a couple of checksum errors. I assumed this had been caused by the power-off, so I ran a "zpool scrub". After about 1 hour, all disks were ONLINE, so I assumed (and I might have been very wrong here) that I could "zpool clear" at that point. When I ran "zpool status" once again, I realized another "scrub" had started, and shortly after that, both disks that had been ONLINE at first showed as REMOVED. The console showed that GEOM had destroyed them. At this point I thought the NAS was simply broken.

I then rebooted the server once again, and all my disks were imported without any errors. A "scrub" automatically ran for about 20 minutes and all my disks were back ONLINE.

I assume the process I followed was wrong, but I can't really figure out what happened there. Could any grownup elaborate on this failure, and on what the correct procedure would have been? I tried hard to find such a methodology, but I could only find information about disk replacement. Note that I'm not complaining about the initial panic(): I didn't use the recommended hardware, and I knew the risks.

Thanks a lot
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Your post is pretty hard to follow because there's so little information about your setup.
The console showed that GEOM had them destroyed.
GEOM destroying a device might indicate a disk disconnected due to a power, cable or port problem. It's what you'd see if you pulled a drive out of a hot-swap bay.
 

iMIl

Cadet
Joined
Apr 26, 2016
Messages
3
Your post is pretty hard to follow because there's so little information about your setup.

Right, so here's the current "zpool status -v" output

Code:
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: vault
 state: ONLINE
  scan: scrub repaired 0 in 8h39m with 0 errors on Tue Apr 26 20:24:51 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        vault                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/8de1803c-0a22-11e6-aed5-0024e8204055  ONLINE       0     0     0
            gptid/8f713372-0a22-11e6-aed5-0024e8204055  ONLINE       0     0     0
            gptid/909d5c8f-0a22-11e6-aed5-0024e8204055  ONLINE       0     0     0

errors: No known data errors


In my previous explanation, the 3rd disk was the one in the DEGRADED state. After the scrub and "zpool clear", the first two disks were destroyed by GEOM and then shown as REMOVED by "zpool status".
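
For anyone following along, here is a rough sketch of how the "zpool status" output above could be filtered down to just the rows that need attention (a state other than ONLINE, or a non-zero READ/WRITE/CKSUM counter). The function name unhealthy_vdevs is made up for this example, and the awk logic assumes the column layout shown above; it is not something FreeNAS ships.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints only
# the rows whose STATE is not ONLINE or whose error counters are non-zero.
unhealthy_vdevs() {
    awk '
        # The config table starts right after its header row.
        /^[[:space:]]*NAME[[:space:]]+STATE/ { in_config = 1; next }
        # A blank line or the "errors:" line ends the table.
        /^errors:/ || NF == 0 { in_config = 0 }
        # Rows inside the table: NAME STATE READ WRITE CKSUM
        in_config && NF >= 5 {
            if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0")
                print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

It would be used as "zpool status vault | unhealthy_vdevs". With the healthy output above it prints nothing. The built-in "zpool status -x" gives a similar pool-level summary without any scripting.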

GEOM destroying a device might indicate a disk disconnected due to a power, cable or port problem. It's what you'd see if you pulled a drive out of a hot-swap bay.

Yes, that's what the manpage says, but I assure you the disks were neither moved nor removed during the process.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
the disks were not moved nor removed
That's why you need to investigate cable, power and similar potential issues.
the panic() seemed related to swap space
This is consistent with the above (if a drive that had in-use swap disappeared from the system).
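
One way to check whether swap is actually in use on a given disk is to look at "swapinfo" output. The sketch below assumes the usual FreeBSD column layout (Device, 1K-blocks, Used, Avail, Capacity) and a possible "Total" row; the function name swap_in_use is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads `swapinfo -k` text on stdin and prints the
# swap devices that currently have a non-zero Used column.
swap_in_use() {
    # Skip the header row and the optional "Total" summary row.
    awk 'NR > 1 && $1 != "Total" && $3 > 0 { print $1, "used_kb=" $3 }'
}
```

Usage would be "swapinfo -k | swap_in_use". If a device listed there belongs to a disk that later detaches, that would fit the swap-related panic described above.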
 

iMIl

Cadet
Joined
Apr 26, 2016
Messages
3
I've just witnessed the behavior again, this time on a third disk. There was no power failure, and FWIW I have changed every SATA cable. Here's what the console shows:
Code:
(ada2:ata3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada2:ata3:0:0:0): CAM status: Command timeout
(ada2:ata3:0:0:0): Retrying command
ada2 at ata3 bus 0 scbus2 target 0 lun 0
ada2: <ST2000DL003-9VT166 CC3C> s/n 5YD7D98N detached
GEOM_ELI: Device ada2p1.eli destroyed.
GEOM_ELI: Detached ada2p1.eli on last close.
(ada2:ata3:0:0:0): Periph destroyed

Once rebooted, every device comes back online.

This thread https://forums.freenas.org/index.php?threads/freenas-dropping-harddrives.9425/ looks very much like what I'm describing, so I'll try changing the power supply too.
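
In case it helps anyone track the same symptom, here is a minimal sketch that counts CAM command timeouts and detach events per device from kernel log text. It only matches the two message shapes seen in the console output above, and the function name summarize_cam_events is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints, per
# device, how many command timeouts and detach events were logged.
summarize_cam_events() {
    awk '
        # e.g. "(ada2:ata3:0:0:0): CAM status: Command timeout"
        /CAM status: Command timeout/ {
            dev = $1
            sub(/^\(/, "", dev)   # drop the leading "("
            sub(/:.*/, "", dev)   # keep only the name before the first ":"
            timeouts[dev]++
            seen[dev] = 1
        }
        # e.g. "ada2: <ST2000DL003-9VT166 CC3C> s/n 5YD7D98N detached"
        / detached$/ {
            dev = $1
            sub(/:$/, "", dev)    # drop the trailing ":"
            detached[dev]++
            seen[dev] = 1
        }
        END {
            for (dev in seen)
                printf "%s timeouts=%d detached=%d\n", dev, timeouts[dev], detached[dev]
        }
    '
}
```

Something like "dmesg | summarize_cam_events" or "summarize_cam_events < /var/log/messages" would feed it. If the drops follow one port or cable rather than one drive, that points at the controller or power side rather than the disks themselves.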
 