How to add drive removed by mistake back into ZFS pool?

Henrique Barbosa · Aug 7, 2015

Dear all,

I need help!! I removed a drive from my ZFS pool by mistake (by issuing a command, not by physically pulling it) and now I cannot put it back!! Below is a description of my system and of my problem.

DELL PowerEdge 2900
20 GB RAM, running the latest version: FreeNAS-9.3-STABLE-201506292130
The RAID controller is a PERC 6/i with 8 disks of 2 TB each
FreeNAS configured with RAID-Z2 (double parity)

What happened (or how I destroyed my data in 10 steps!!)

This morning I noticed a yellow status light on the FreeNAS webpage: it said the smartd service was not running. Obviously I checked, and all my 8 drives have S.M.A.R.T. enabled. This message appeared after the last upgrade, as many have reported online on different forums. More strangely, the services page said that S.M.A.R.T. was running, but when I tried to turn it off and on again, it did not start!

After looking at tons of threads online, I found that someone said "MegaRAID SAT layer is reportedly buggy", so FreeNAS will not do anything about it... while others gave advice on how to start the S.M.A.R.T. service from the command line, so I tried.

Step 1) check output of camcontrol, and everything seems ok

Code:

[root@x] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 0 lun 0 (pass0)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80>        at scbus0 target 6 lun 0 (pass6)
<ATA WDC WD20EZRX-00D 0A80>        at scbus0 target 7 lun 0 (pass7)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27>          at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)

Step 2) tried to start smart manually

Code:

[root@x] ~# smartctl -a /dev/pass0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/pass0 [SAT] failed: SATA device detected,
MegaRAID SAT layer is reportedly buggy, use '-d sat' to try anyhow

Step 3) I accepted the suggestion and tried the "-d sat" option, which gave some input/output errors and a high seek-error rate, but nonetheless, the drive was there and working! One thing I did not understand was the input/output error messages at the end, when reading the SMART logs.

=> see attached file: nas_bug_smartctl_pass0.txt

Step 4) Then I tried the "-d auto" option, and what happened!?!? The device is not there anymore??? It is neither in /dev/ nor reported by camcontrol!!

Code:

[root@x] ~# smartctl -d auto -a /dev/pass0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/pass0: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

[root@x] ~# ls /dev/pas*
pass1% pass2% pass3% pass4% pass5% pass6% pass7% pass8% pass9% 

[root@x] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:

<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80>        at scbus0 target 6 lun 0 (pass6)
<ATA WDC WD20EZRX-00D 0A80>        at scbus0 target 7 lun 0 (pass7)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27>          at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)

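For anyone who wants to spot this kind of disappearance without eyeballing two listings, here is a rough sketch in Python that compares two saved "camcontrol devlist -v" outputs. It only assumes the line layout shown in this post; the function names (devices, vanished) are made up for illustration and are not part of any FreeNAS tool.

```python
import re

# Matches lines such as:
# <ATA ST2000DM001-1ER1 CC25>        at scbus0 target 0 lun 0 (pass0)
DEV_RE = re.compile(
    r"^<(?P<model>[^>]*)>\s+at (?P<bus>scbus-?\d+) "
    r"target (?P<target>-?\d+) lun (?P<lun>-?\d+) \((?P<devs>[^)]*)\)"
)

def devices(devlist):
    """Return {device name: model string} for every attached device."""
    found = {}
    for line in devlist.splitlines():
        m = DEV_RE.match(line.strip())
        # Empty placeholder entries look like "<> ... ()" and are skipped.
        if m and m.group("devs"):
            for dev in m.group("devs").split(","):
                found[dev] = m.group("model")
    return found

def vanished(before, after):
    """Devices present in the first listing but absent from the second."""
    old, new = devices(before), devices(after)
    return {dev: model for dev, model in old.items() if dev not in new}
```

Feeding it the Step 1 listing as "before" and the listing above as "after" should report pass0 as the only device that went away.
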
Step 5) Then I looked at the ZPOOL status:

Code:

[root@x] ~# zpool status -v
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug  5 03:47:11 2015
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: volume1
state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    volume1                                         DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
        gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/685f3984-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0

errors: No known data errors

Step 6) I was stupid enough to try this on other drives as well. Thank god I did not try "-d sat" on all of them, just #1 and #7. As you can see below, for /dev/pass1 there were no input/output error messages but lots of seek-error counts, while for /dev/pass7 there were input/output errors while reading the SMART logs, but NO seek errors.

=> See attached file: nas_bug_smartctl_pass1to7.txt

Step 7) Three drives tested with "-d sat" and three different behaviours... and after that, device #7 went missing as well! But not device #1.

Summary: devices #0 and #7 don't show up in camcontrol, and the zpool status is DEGRADED; however, the drive STATE is just REMOVED:

Code:

[root@amazonia] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27>        at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80>        at scbus0 target 6 lun 0 (pass6)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27>          at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)

[root@x] ~# zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug  5 03:47:11 2015
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: volume1
state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    volume1                                         DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
        gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        13413779447659399635                        REMOVED      0     0     0  was /dev/gptid/685f3984-2038-11e5-a8a8-0022198c616a

errors: No known data errors

[root@x] ~# glabel status
                                      Name  Status  Components
gptid/6665a078-2038-11e5-a8a8-0022198c616a     N/A  mfid1p2
gptid/669ab163-2038-11e5-a8a8-0022198c616a     N/A  mfid2p2
gptid/66eb58db-2038-11e5-a8a8-0022198c616a     N/A  mfid3p2
gptid/67492df3-2038-11e5-a8a8-0022198c616a     N/A  mfid4p2
gptid/67a444e7-2038-11e5-a8a8-0022198c616a     N/A  mfid5p2
gptid/67d50035-2038-11e5-a8a8-0022198c616a     N/A  mfid6p2
gptid/721d8146-1f4e-11e5-ab27-0022198c616a     N/A  da0p1
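
The two listings above can also be cross-checked mechanically. Below is a minimal sketch in Python, assuming only the text layout of the "zpool status" and "glabel status" output shown in this post; the helper names (pool_members, glabel_names, missing_members) are hypothetical. It lists the pool members whose gptid label no longer exists on the system.

```python
def pool_members(zpool_status):
    """Map each gptid referenced by a pool to the state zpool reports."""
    members = {}
    for line in zpool_status.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0].startswith("gptid/"):
            # Present member: "gptid/<uuid>  ONLINE  0 0 0"
            members[fields[0]] = fields[1]
        elif "was" in fields:
            # Missing member: "<guid>  REMOVED  0 0 0  was /dev/gptid/<uuid>"
            idx = fields.index("was")
            if idx + 1 < len(fields):
                path = fields[idx + 1]
                if path.startswith("/dev/"):
                    path = path[len("/dev/"):]
                members[path] = fields[1]
    return members

def glabel_names(glabel_status):
    """Collect the gptid labels that glabel currently knows about."""
    names = set()
    for line in glabel_status.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("gptid/"):
            names.add(fields[0])
    return names

def missing_members(zpool_status, glabel_status):
    """Pool members that have no matching label on the running system."""
    labels = glabel_names(glabel_status)
    return sorted(m for m in pool_members(zpool_status) if m not in labels)
```

With the output above as input, this should return the two gptids ending in 662ee6df... and 685f3984..., i.e. the members shown as REMOVED, neither of which appears in the glabel list.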

Step 8) I also checked the system log. VD 00/0 changed from OPTIMAL to OFFLINE and was then deleted, and the same happened for VD 07/7... and no surprises for VD 01/1, although I ran "-d sat" on it... Does that mean these disks were about to fail anyway and I just precipitated it by forcing the SMART check?

=> See attached file nas_bug_demsg.txt

Step 9) After that I tried to add the drive back into the pool, since "zpool status" just says REMOVED, not FAILED or any other serious error... but the command below does not work:

Code:

[root@x] ~# zpool online -e volume1 gptid/685f3984-2038-11e5-a8a8-0022198c616a
warning: device 'gptid/685f3984-2038-11e5-a8a8-0022198c616a' onlined, but remains in faulted state

[root@x] ~# zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug  5 03:47:11 2015
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: volume1
state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    volume1                                         DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
        gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
        13413779447659399635                        REMOVED      0     0     0  was /dev/gptid/685f3984-2038-11e5-a8a8-0022198c616a

errors: No known data errors

Step 10) I decided to reboot the system... Maybe FreeNAS could detect that I never removed the drives physically and just put them back in the pool... and the system did not start again!! I will go to the site tonight to see the error messages during boot that may be preventing the system from starting.

What should I do?? Please, I need help!!

DrKK · Aug 8, 2015

Hooooo boy.

@cyberjock ....try to go easy on him.

cyberjock · Aug 8, 2015

Uh... it looks like you are doing hardware RAID with ZFS.

Please post a debug file from the WebGUI.. System -> Advanced -> Save Debug.

I won't lie.. you are probably "screwed" as hardware RAID is totally unsupported and totally unsustainable on FreeNAS. So the help you need may be someone saying "you need to find someone that will help you locally as your needs are beyond what the forum offers".

DrKK · Aug 8, 2015

Just for my own edification here, @cyberjock, the fact that his camcontrol devices are attached to "passN" is your indication that something like a hardware RAID is in play?

cyberjock · Aug 8, 2015

No. His info says mfi0 and he mentioned VD (virtual disks on hardware RAID)
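
As a rough illustration of that check, here is a small Python sketch that looks for the same two hints in saved command output: a CAM bus attached to an mfi controller, and labels backed by mfidN devices (the logical disks that the FreeBSD mfi driver exposes). The function name hardware_raid_hints is made up for this example, and the parsing assumes the output layout posted earlier in the thread.

```python
import re

def hardware_raid_hints(camcontrol_devlist, glabel_status):
    """Return human-readable hints that disks sit behind an mfi RAID controller."""
    hints = []
    for line in camcontrol_devlist.splitlines():
        # Bus header lines look like: "scbus0 on mfi0 bus 0:"
        m = re.match(r"^scbus\d+ on (mfi\d+) bus", line.strip())
        if m:
            hints.append("bus attached to controller " + m.group(1))
    for line in glabel_status.splitlines():
        fields = line.split()
        # Label lines look like: "gptid/<uuid>  N/A  mfid1p2"
        if len(fields) >= 3 and re.match(r"^mfid\d+", fields[-1]):
            hints.append(fields[0] + " is backed by " + fields[-1])
    return hints
```

An empty result does not prove the absence of hardware RAID; other controllers use other driver names, so this only covers the mfi case discussed here.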
