SOLVED ZFS keeps removing my drive after resilvering

Status
Not open for further replies.

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Hopefully the resilver will be quick. I don't like that ada0 error much, but with luck it was just a cable glitch caused by all the physical activity. You'll want to keep an eye on it.

Yes, just detach that offline drive and leave the 3 remaining good ones.
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
Well, it looks like my array might have died. :(

Code:
[root@freenas] /boot# zpool status -v
  pool: Root
 state: UNAVAIL
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 16 08:02:24 2014
        507G scanned out of 4.39T at 154M/s, 7h22m to go
        169G resilvered, 11.28% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Root                                            UNAVAIL      1     0     0
          raidz1-0                                      UNAVAIL     15     0     0
            1526374017946971693                         REMOVED      0     0     0  was /dev/gptid/4585e79d-6d59-11e4-b30c-08606e69c5e2  (resilvering)
            15432765115828312757                        REMOVED      0     0     0  was /dev/gptid/b19b1a9f-6cf6-11e2-b19a-08606e69c5e2
            gptid/b284e7f4-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x65>
        <metadata>:<0x7a>
        <metadata>:<0xd5>
        <metadata>:<0xe1>
        Root:<0x0>
        Root:<0xab62>
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
I'm not sure whether this is a URE (unrecoverable read error) and whether this is how Z1 dies with corrupt metadata, but it looks like it. The error you posted after the detach seems to have been the death blow. Sorry to see it go down like that. I thought you were out of the woods.

Can you online any of those devices? Or does this just reboot to insufficient redundant copies and no pool?
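If it helps to see at a glance which rows are in trouble, here is a rough Python sketch that pulls every non-ONLINE row out of the config table of a saved `zpool status` dump. The function name `unhealthy_devices` and the `SAMPLE` string are made up for illustration, and it assumes the column layout shown in the output above; other ZFS versions may format things differently.

```python
def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows whose state is not ONLINE."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True          # the device table starts after this line
            continue
        if stripped.startswith("errors:"):
            break                     # the device table ends before this line
        if not in_config or not stripped:
            continue
        fields = stripped.split()
        if fields[0] == "NAME":
            continue                  # skip the column header
        if len(fields) >= 2 and fields[1] != "ONLINE":
            rows.append((fields[0], fields[1]))
    return rows


# Config section copied from the status output posted above.
SAMPLE = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        Root                                            UNAVAIL      1     0     0
          raidz1-0                                      UNAVAIL     15     0     0
            1526374017946971693                         REMOVED      0     0     0  was /dev/gptid/4585e79d-6d59-11e4-b30c-08606e69c5e2  (resilvering)
            15432765115828312757                        REMOVED      0     0     0  was /dev/gptid/b19b1a9f-6cf6-11e2-b19a-08606e69c5e2
            gptid/b284e7f4-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""

for name, state in unhealthy_devices(SAMPLE):
    print(name, state)
```

On that sample it reports the pool, the raidz1-0 vdev, and the two REMOVED members, which is the whole problem in four lines: a RAIDZ1 vdev cannot run with two of its three members missing.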
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
I booted the server back up and all of my drives are online and resilvering.
Code:
[root@freenas] ~# zpool status
  pool: Root
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 16 08:02:24 2014
        3.54T scanned out of 4.39T at 160M/s, 1h32m to go
        1.12T resilvered, 80.64% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Root                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/4585e79d-6d59-11e4-b30c-08606e69c5e2  ONLINE       0     0     1  (resilvering)
            gptid/b19b1a9f-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0
            gptid/b284e7f4-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0

errors: No known data errors


I'm assuming that it's going to go offline again soon so I've been copying anything I care about off of the array. I'm going to buy another 3TB drive and rebuild my server using Z2 this time (4 x 3TB).
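For anyone weighing the same move, a back-of-the-envelope sketch of the capacity trade-off. It assumes the usual rule of thumb that a RAIDZ vdev gives roughly (disks minus parity) times the disk size, and it ignores metadata overhead, padding, and the TB/TiB difference. The helper name `raidz_usable_tb` is hypothetical.

```python
def raidz_usable_tb(disks, disk_tb, parity):
    """Rough usable capacity of one RAIDZ vdev: (disks - parity) * disk size."""
    if disks <= parity:
        raise ValueError("a RAIDZ vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb


# Current layout: 3 x 3TB in RAIDZ1 (survives one failed disk).
print(raidz_usable_tb(disks=3, disk_tb=3, parity=1))  # 6

# Planned layout: 4 x 3TB in RAIDZ2 (survives two failed disks).
print(raidz_usable_tb(disks=4, disk_tb=3, parity=2))  # 6
```

So the fourth drive buys no extra space by this estimate; what it buys is the ability to lose a second disk during a resilver without losing the pool.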

I think I'll do a clean install of FreeNAS just to make sure there's nothing lingering from before.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
All of that sounds like a great idea. Looks like you are an hour and a half out from healthy. z1 is way too scary.
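The "hour and a half" is just the remaining data divided by the scan rate. A small sketch of that arithmetic, assuming the K/M/G/T suffixes in `zpool status` are binary units (powers of 1024); the names `to_bytes` and `resilver_eta` are made up here. The displayed figures are rounded, so the estimate will not always agree to the minute.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a string such as '4.39T' or '160M' to bytes (binary units assumed)."""
    return float(size[:-1]) * UNITS[size[-1]]


def resilver_eta(scanned, total, rate_per_s):
    """Estimate the time left as (total - scanned) / rate, formatted like '1h32m'."""
    seconds = int((to_bytes(total) - to_bytes(scanned)) / to_bytes(rate_per_s))
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h{rem // 60}m"


# Figures from the two in-progress status outputs earlier in the thread.
print(resilver_eta("3.54T", "4.39T", "160M"))  # 1h32m
print(resilver_eta("507G", "4.39T", "154M"))   # 7h22m
```

Both results line up with the "to go" values ZFS printed, but the rate moves around during a resilver, so treat any such number as a rough guide.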
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
Holy crap, I think it recovered!
Code:
[root@freenas] ~# zpool status

  pool: Root
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 1.37T in 12h42m with 0 errors on Sun Nov 16 20:44:27 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Root                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/4585e79d-6d59-11e4-b30c-08606e69c5e2  ONLINE       0     0     1
            gptid/b19b1a9f-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0
            gptid/b284e7f4-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0

errors: No known data errors


I'm assuming it's safe to do a "zpool clear" now that everything is resilvered and online?
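A quick way to double-check before clearing is to read the `scan:` line and confirm the resilver actually finished with 0 errors. Here is a rough Python sketch; the regex assumes the exact wording of the `scan:` line shown above (it may differ in other ZFS versions), and `resilver_result` is a made-up name.

```python
import re

# Matches e.g. "scan: resilvered 1.37T in 12h42m with 0 errors on Sun Nov 16 ..."
SCAN_RE = re.compile(
    r"scan:\s+resilvered\s+(\S+)\s+in\s+(\S+)\s+with\s+(\d+)\s+errors"
)


def resilver_result(status_text):
    """Return a dict describing a finished resilver, or None if none is reported."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None  # still in progress, or a format this sketch does not know
    return {
        "resilvered": match.group(1),
        "duration": match.group(2),
        "errors": int(match.group(3)),
    }


done = "  scan: resilvered 1.37T in 12h42m with 0 errors on Sun Nov 16 20:44:27 2014"
running = "  scan: resilver in progress since Sun Nov 16 08:02:24 2014"

print(resilver_result(done))
print(resilver_result(running))  # None
```

If the result has `errors` equal to 0 and the `errors:` line at the bottom reads "No known data errors", clearing the counters only resets the bookkeeping; it does not repair anything.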
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Yep. Looks good.
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
Fingers crossed! :)
Code:
[root@freenas] ~# zpool status

  pool: Root
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support feature
        flags.
  scan: resilvered 1.37T in 12h42m with 0 errors on Sun Nov 16 20:44:27 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Root                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/4585e79d-6d59-11e4-b30c-08606e69c5e2  ONLINE       0     0     0
            gptid/b19b1a9f-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0
            gptid/b284e7f4-6cf6-11e2-b19a-08606e69c5e2  ONLINE       0     0     0

errors: No known data errors
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
I just replaced a second drive that was starting to fail and it went a lot more smoothly. I followed the documentation strictly and everything was fine. :)

That being said, I did get some data errors which I'm hoping a scrub will resolve.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
My guess is that you have a bad piece of hardware somewhere. Maybe it was just that drive, but it could be something like cables, power, or ports. Watch it carefully.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
RAIDZ1/RAID5 really is living on the edge...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I wonder where that puts RAID0? :)
"We make this sacrifice to you, oh Lord Cthulhu, master of the Bit Bucket in the Sky! Take this valuable data as a token of our gratitude for Your blessings!"
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
I'm baaaaack...

Code:
Nov 23 20:13:51 freenas smartd[2323]: Device: /dev/ada0, 1 Currently unreadable (pending) sectors
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): WRITE_DMA48. ACB: 35 00 c0 50 c9 40 16 00 00 00 00 01
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): CAM status: Command timeout
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): Retrying command
Nov 23 20:34:27 freenas kernel: ada0 at ata2 bus 0 scbus0 target 1 lun 0
Nov 23 20:34:27 freenas kernel: ada0: <WDC WD30EZRX-00D8PB0 80.00A80> s/n WD-WCC4N0974177 detached


I'm going to try picking up a new cable for ada0, or maybe even a 3-pack if they're cheap.
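To tell whether the problem follows the drive or the cable/port, it can help to count timeouts and detach events per device across the log. A rough Python sketch, assuming the FreeBSD kernel message format shown in the lines above; the regexes and the `summarize` name are made up for illustration and only cover these two message types.

```python
import re
from collections import Counter

# "(ada0:ata2:0:1:0): CAM status: Command timeout" -> captures "ada0"
TIMEOUT_RE = re.compile(r"kernel: \((\w+):[^)]*\): CAM status: Command timeout")
# "ada0: <model fw> s/n SERIAL detached" -> captures "ada0"
DETACH_RE = re.compile(r"kernel: (\w+): <[^>]*> s/n \S+ detached")


def summarize(log_text):
    """Count command timeouts and detach events per device name."""
    timeouts = Counter(TIMEOUT_RE.findall(log_text))
    detached = Counter(DETACH_RE.findall(log_text))
    return timeouts, detached


# Log lines copied from the post above.
LOG = """\
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): WRITE_DMA48. ACB: 35 00 c0 50 c9 40 16 00 00 00 00 01
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): CAM status: Command timeout
Nov 23 20:34:27 freenas kernel: (ada0:ata2:0:1:0): Retrying command
Nov 23 20:34:27 freenas kernel: ada0 at ata2 bus 0 scbus0 target 1 lun 0
Nov 23 20:34:27 freenas kernel: ada0: <WDC WD30EZRX-00D8PB0 80.00A80> s/n WD-WCC4N0974177 detached
"""

timeouts, detached = summarize(LOG)
print(dict(timeouts), dict(detached))
```

Keep in mind that adaN numbers can change when drives are moved between ports, so the serial number in the detach line is the more reliable thing to track.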
 

Pie

Dabbler
Joined
Jan 19, 2013
Messages
38
After switching ports once I got:
Code:
Nov 28 17:29:42 freenas kernel: (ada0:ata2:0:1:0): READ_DMA48. ACB: 25 00 80 b4 7b 40 cb 00 00 00 08 00
Nov 28 17:29:42 freenas kernel: (ada0:ata2:0:1:0): CAM status: Command timeout
Nov 28 17:29:42 freenas kernel: (ada0:ata2:0:1:0): Retrying command
Nov 28 17:29:42 freenas kernel: ada0 at ata2 bus 0 scbus0 target 1 lun 0
Nov 28 17:29:42 freenas kernel: ada0: <WDC WD30EZRX-00D8PB0 80.00A80> s/n WD-WCC4N0974177 detached

After switching ports again I got this on boot:
Code:
S.M.A.R.T. Status BAD, Backup and Replace

Looks like I'll be doing another RMA... :(
 