Second drive failed while replacing another.

Status
Not open for further replies.

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Hey guys,

I'm using FreeNAS 8.3 with a raidz1 configuration (6x 2TB) and ESXi 5. For the last several months a disk has been dropping out of the array inexplicably, but it could be brought back with a full power cycle of the ESXi host (off and then on, not a reboot). Since this is a home server, that wasn't a big deal. This week I decided to replace the drive (SN ending in 8GG), and halfway through the resilvering (56%) another disk failed (SN ending in 0BM)! I'm not able to get 0BM to do anything and I'm worried I've lost the pool.

I was thinking that maybe I could undo the replacement of 8GG and instead replace 0BM; is this possible? My reasoning is that 8GG hadn't fully failed: it was only dropping out of the array occasionally, and I have yet to detach it. However, the pool is not recognized if I plug it back in (with or without its replacement disk). Running
Code:
zpool import files
while 8GG is plugged in gives an I/O error.

Is all lost, or is it possible to undo the replacement so I can replace the other disk, resilver, and then replace the first disk afterwards? Alternatively, is it possible to resilver using 8GG, or to clone it somehow so that I have all replicas and can subsequently replace 0BM?
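
Roughly what I have in mind, if ZFS even allows it (the gptids below are placeholders, not my real ones):
Code:
# cancel the in-progress replacement by detaching the NEW disk from the replacing vdev
zpool detach files gptid/<new-replacement-disk>
# then replace the disk that actually died
zpool replace files gptid/<0BM-disk> gptid/<spare-disk>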

My procedure for replacing 8GG (a rough CLI equivalent is sketched after the list):
  1. zpool scrub files
  2. offlined 8GG in the GUI
  3. shut down
  4. physically replaced the disk
  5. booted up
  6. replaced 8GG in the GUI
  7. this is where 0BM failed
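
Roughly the CLI equivalent of the above, I believe (gptids are placeholders):
Code:
zpool scrub files
zpool offline files gptid/<8GG-gptid>
# shut down, physically swap the disk, boot back up
zpool replace files gptid/<8GG-gptid> gptid/<new-disk-gptid>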


Booting up, I get this from 0BM, repeatedly:
Code:
(da4:mps0:0:3:0): READ(10). CDB: 28 0 e8 e0 87 80 0 1 0 0
(da4:mps0:0:3:0): CAM status: SCSI Status Error
(da4:mps0:0:3:0): SCSI status: Check Condition
(da4:mps0:0:3:0): SCSI sense: MEDIUM ERROR info:e8e08860 asc:11,0 (Unrecovered read error)


zpool status without 8GG plugged in; the FAULTED disk is 0BM, the UNAVAIL one is 8GG (being replaced):
Code:
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 8h51m with 203133 errors on Fri Mar 15 01:41:28 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0  336K
          raidz1-0                                        DEGRADED     0     0  959K
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/dsk/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    FAULTED      9    79     2  too many errors
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 203133 data errors, use '-v' for a list
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
Matt,

I'm sorry for your loss; there's not really anything you can do. I don't want to lecture you, because I know you probably already wish you'd done something differently. This is why we discourage people from using RAID-Z1, and also from running it in a virtual environment. If it were me, I would have looked into why the disks were dropping before things got to this point, but I'm sure you realize that now. Unless you have a lot of extra cash to waste on *attempting* data recovery, it's better to just learn from the mistake and work on rebuilding from scratch. :(
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Damn. I hate reading these threads.

ProtoSD is right; we discourage virtual machines because there's little you can do if/when disks drop (sometimes a single bad sector can cause ESXi to time out). If you read up on ZFS, it says it needs direct disk access, which isn't provided in virtual environments unless you are using PCI passthrough.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You haven't explained how you're talking to these disks yet. Are you using ESXi and virtual disks? Raw device mappings? PCI passthru?

Basically, whatever you're doing, stop for a bit. Can you get a replacement disk for the failing drive that's equivalent (or better)? If so, you may want to try using "ddrescue" to copy the failing disk to a fresh disk, and then see if ZFS will accept that as a member of the pool in place of the failing disk.
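
Something along these lines, run from a separate Linux box or live CD since FreeNAS doesn't ship ddrescue as far as I know (device names are placeholders for the failing and fresh disks):
Code:
ddrescue -v /dev/<failing-disk> /dev/<fresh-disk>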
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Thanks for the quick answers guys! This was my first raid ever and knowing what I know now I won't be doing raidz1 again.

You haven't explained how you're talking to these disks yet. Are you using ESXi and virtual disks? Raw device mappings? PCI passthru?

I'm using PCI Passthrough.

I'll see what I can get done with ddrescue, thanks for the suggestion.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Thanks for the quick answers guys! This was my first raid ever and knowing what I know now I won't be doing raidz1 again.



I'm using PCI Passthrough.

I'll see what I can get done with ddrescue, thanks for the suggestion.

WOW. This really sucks. So you were doing it properly and it still went bad :( Boo!

I'd try a dd copy as jgreco mentioned, if you can get it to work. I think 2 other people tried and never had any luck because dd pooped all over itself when it couldn't successfully read some bad sectors. Definitely report back and let us know how it turns out... for better or worse.
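
For reference, a plain dd attempt has to look something like this just to keep going past read errors (device names are placeholders), and even then it only pads the unreadable blocks with zeros and never retries them, which is why ddrescue is the better tool for this job:
Code:
dd if=/dev/<failing-disk> of=/dev/<fresh-disk> bs=64k conv=noerror,sync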
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I'm a big fan of ddrescue, and I would have suggested it if I thought it had a chance. I suppose when you don't have any other options, it can't hurt to try.
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
It looks good:
Code:
[root@centosd ddrescue-1.16]# hdparm -I /dev/sdb | grep Serial
        Serial Number:      5YD3E0BM
        Transport:          Serial, SATA Rev 3.0
[root@centosd ddrescue-1.16]# hdparm -I /dev/sdc | grep Serial
        Serial Number:      S1E1A006
        Transport:          Serial, SATA Rev 3.0
[root@centosd ddrescue-1.16]# ddrescue -v /dev/sdb /dev/sdc --force


GNU ddrescue 1.16
About to copy 2000 GBytes from /dev/sdb to /dev/sdc
    Starting positions: infile = 0 B,  outfile = 0 B
    Copy block size: 128 sectors       Initial skip size: 128 sectors
Sector size: 512 Bytes

Press Ctrl-C to interrupt
rescued:     2000 GB,  errsize:    4096 B,  current rate:        0 B/s
   ipos:     2000 GB,   errors:       1,    average rate:     102 MB/s
   opos:     2000 GB,     time since last successful read:      24 s
Finished


I placed the ddrescue'd clone in place of 0BM:

Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 8h51m with 203133 errors on Fri Mar 15 01:41:28 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0     0
          raidz1-0                                        DEGRADED     0     0     0
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/dsk/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     5
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 203133 data errors, use '-v' for a list


ONLINE!!!!! Scrubbing now and it is repairing a disk:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Mar 16 00:20:49 2013
        21.5G scanned out of 10.6T at 170M/s, 18h1m to go
        4K repaired, 0.20% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0   106
          raidz1-0                                        DEGRADED     0     0   636
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/dsk/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0    13  (repairing)
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 203220 data errors, use '-v' for a list
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Scrub completed:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 3.75G in 12h15m with 453193 errors on Sat Mar 16 12:36:18 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0  610K
          raidz1-0                                        DEGRADED     0     0 1.58M
            replacing-0                                   DEGRADED     0     0  152K
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/dsk/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0 1.17K
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 453193 data errors, use '-v' for a list


I tried to detach the original replaced drive in the GUI; it said it was successful, but it doesn't appear to have worked. This is after a reboot:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 3.75G in 12h15m with 453193 errors on Sat Mar 16 12:36:18 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0    17
          raidz1-0                                        DEGRADED     0     0    68
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/dsk/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 453193 data errors, use '-v' for a list


I don't believe the original drive replacement finished resilvering; wouldn't it have done so in the last scrub? Since it had failed at 56%, it would have had to repair far more than 3.75 GB.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
It looks good:
Code:
[root@centosd ddrescue-1.16]# hdparm -I /dev/sdb | grep Serial
        Serial Number:      5YD3E0BM
        Transport:          Serial, SATA Rev 3.0
[root@centosd ddrescue-1.16]# hdparm -I /dev/sdc | grep Serial
        Serial Number:      S1E1A006
        Transport:          Serial, SATA Rev 3.0
[root@centosd ddrescue-1.16]# ddrescue -v /dev/sdb /dev/sdc --force


Congrats, I'm glad you didn't listen to me ;)

I prefer to use ddrescue with the log file because it allows you to interrupt and resume, and because if it hits a bad spot it temporarily skips it and copies the rest, then comes back and works on scraping the areas it had trouble with.
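
Something like this, in other words, with the same devices you used (the log file name is just an example):
Code:
ddrescue -v /dev/sdb /dev/sdc rescue.log
# re-running the exact same command later resumes from the log
# instead of starting the whole copy over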

Man, I sure hope you go get a spare disk (or two) and do a backup before anything else bites you. Then hopefully someone with ESXi experience can help you figure out why your disk(s) were dropping.
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
The raid is still degraded and the largest folder, Media, remains inaccessible. I think this is because the initial replacement was interrupted. I'm not sure what to do next; I already scrubbed.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
The raid is still degraded and the largest folder, Media, remains inaccessible. I think this is because the initial replacement was interrupted. I'm not sure what to do next; I already scrubbed.
What happened to the original 8GG?

Never mind, I read the first post more carefully. Is it possible to ddrescue 8GG to another disk? Depending on the reason for the I/O error, you might see different behavior.

Also, how many drive bays & SATA ports do you have?
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
The raid is still degraded and the largest folder, Media, remains inaccessible. I think this is because the initial replacement was interrupted. I'm not sure what to do next; I already scrubbed.

A scrub should have completed whatever the resilver didn't AFAIK.
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Is it possible to ddrescue 8GG to another disk? Depending on the reason for the I/O error, you might see different behavior.

Is that possible after I replaced it in the GUI? I tried placing 8GG back in before, and no pool was recognized. Granted, that was when 0BM had failed as well.

Also, how many drive bays & SATA ports do you have?

This server has 20 hot-swap bays with an IBM M1015 (LSI 9240-8i) flashed to JBOD. The pool is 6x 2TB disks, so I have two ports available on the card without buying a SAS expander.
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Just to see what would happen, I put 8GG back (the original drive I replaced via the GUI):

Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 3.75G in 12h15m with 453193 errors on Sat Mar 16 12:36:18 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            replacing-0                                   ONLINE       0     0     4
              gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6  ONLINE       0     0     0
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 453193 data errors, use '-v' for a list


What is the best course of action now? Before, with 0BM causing problems, the pool was not recognized when 8GG was plugged in. Scrubbing now.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
What is the best course of action now? Before, with 0BM causing problems, the pool was not recognized when 8GG was plugged in. Scrubbing now.
If you think 8GG was causing import issues because of a hardware problem, then do a block copy of it to another drive. Otherwise, keep doing what you are doing now. Also, I would throw a restart in if the scrub seems to "finish" early.

If there are still errors after the scrub finishes, you can try:
Code:
zpool clear -Fn files
Make sure you use -n to check first. This should attempt to roll back the pool; run without -n, it actually discards the newer data. I wouldn't try it until everything you can currently access has been backed up first.
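
Roughly this sequence, in other words (only the last command actually changes anything):
Code:
zpool status -v files   # list the files flagged with permanent errors
zpool clear -Fn files   # dry run: check whether the rollback would work, discards nothing
zpool clear -F files    # the real thing, only after backing up everything you can read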
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
After a very long scrub:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 1.07T in 35h28m with 452537 errors on Wed Mar 20 08:39:38 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             ONLINE       0     0  628K
          raidz1-0                                        ONLINE       0     0 1.64M
            replacing-0                                   ONLINE       0     0 13.9M
              gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6  ONLINE       6 31.5M     0
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0 1.32K
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 452531 data errors, use '-v' for a list


It repaired about a terabyte, which is expected since the original replacement was interrupted halfway through. It looked the same after a reboot (minus the checksum counts, etc.).

After I removed 8GG (the GUI-replaced drive), the Media folder remains accessible, unlike before, but the pool is still degraded:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 1.07T in 35h28m with 452537 errors on Wed Mar 20 08:39:38 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0     2
          raidz1-0                                        DEGRADED     0     0     4
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 452531 data errors, use '-v' for a list


I tried to detach it in the GUI, but it remains.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
After I removed 8GG (the GUI-replaced drive), the Media folder remains accessible, unlike before, but the pool is still degraded:
Looks like the other failing drive had more trouble than it appeared: gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6 ONLINE 0 0 1.32K.

I tried to detach it in the GUI, but it remains.
Try:
Code:
zpool detach files 7423993124130373765
 

MattGG

Dabbler
Joined
Sep 10, 2011
Messages
12
Looks like the other failing drive had more trouble than it appeared: gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6 ONLINE 0 0 1.32K.

A couple of days of sitting idle after physically removing 8GG shows:
Code:
[Mathew@freenas ~]$ zpool status
  pool: files
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 1.07T in 35h28m with 452537 errors on Wed Mar 20 08:39:38 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        files                                             DEGRADED     0     0 71.2K
          raidz1-0                                        DEGRADED     0     0  281K
            replacing-0                                   DEGRADED     0     0     0
              7423993124130373765                         UNAVAIL      0     0     0  was /dev/gptid/b79a4f7f-ba3e-11e0-b27e-50e54950d8e6
              gptid/20a24ed9-8a8e-11e2-8608-000c29eb7541  ONLINE       0     0     0
            gptid/b83b4301-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b8b1b0f0-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b92173a6-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/b98c8805-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0
            gptid/ba01227d-ba3e-11e0-b27e-50e54950d8e6    ONLINE       0     0     0

errors: 452531 data errors, use '-v' for a list


Try:
Code:
zpool detach files 7423993124130373765

Code:
[Mathew@freenas] /mnt/files/home/Mathew# zpool detach files 7423993124130373765
cannot detach 7423993124130373765: no valid replicas
 