Henrique Barbosa
Cadet
- Joined: Aug 7, 2015
- Messages: 1
Dear all,
I need help!! I removed a drive from my ZFS pool by mistake (by issuing a command and not physically) and now I cannot put it back!! Below is a description of my system and of my problem.
DELL PowerEdge 2900
20 GB RAM, running the latest version: FreeNAS-9.3-STABLE-201506292130
The RAID controller is a PERC 6/i with 8 disks of 2 TB each
FreeNAS configured with RAID-Z2 (double parity)
What happened (or how I destroyed my data in 10 steps!!)
This morning I noticed a yellow status light on the FreeNAS webpage: it said the smartd service was not running. Naturally I checked, and all 8 of my drives have SMART enabled. This message appeared after the last upgrade, as many people have reported on different forums. More strangely, the services page said that SMART was running, but when I tried to turn it off and on again, it did not start!
After looking at tons of threads online, I found that some people said "MegaRAID SAT layer is reportedly buggy", so FreeNAS will not do anything about it, while others gave advice on how to start the SMART service from the command line, so I tried.
Step 1) I checked the output of camcontrol, and everything seemed OK:
Code:
[root@x] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 0 lun 0 (pass0)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 6 lun 0 (pass6)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 7 lun 0 (pass7)
<DP BACKPLANE 1.05> at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<> at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<> at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<> at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27> at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<> at scbus-1 target -1 lun -1 (xpt0)
Step 2) I tried to start SMART manually:
Code:
[root@x] ~# smartctl -a /dev/pass0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/pass0 [SAT] failed: SATA device detected, MegaRAID SAT layer is reportedly buggy, use '-d sat' to try anyhow
Step 3) I accepted the suggestion and tried the "-d sat" option, which gave some input/output errors and a high seek error rate, but nonetheless the drive was there and working! One thing I did not understand was the SMART log input/output error messages at the end.
=> see attached file: nas_bug_smartctl_pass0.txt
Step 4) Then I tried the "-d auto" option, and what happened!? The device is not there anymore! It is neither in /dev/ nor reported by camcontrol!!
Code:
[root@x] ~# smartctl -d auto -a /dev/pass0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/pass0: Unable to detect device type
Please specify device type with the -d option.
Use smartctl -h to get a usage summary
[root@x] ~# ls /dev/pas*
pass1% pass2% pass3% pass4% pass5% pass6% pass7% pass8% pass9%
[root@x] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 6 lun 0 (pass6)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 7 lun 0 (pass7)
<DP BACKPLANE 1.05> at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<> at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<> at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<> at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27> at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<> at scbus-1 target -1 lun -1 (xpt0)
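To pin down exactly which devices vanish after a command like the one above, it can help to save the `camcontrol devlist -v` output before and after and compare the two. What follows is only a minimal sketch, not a recovery tool: the function names `pass_devices` and `missing_devices` and the snapshot file names are hypothetical, and it assumes `grep -o`, `sort`, `comm`, and `mktemp` are available in the shell.

```shell
#!/bin/sh
# Hypothetical helper: list the passN names found in a saved
# `camcontrol devlist -v` output.
# $1: path to a file holding that output.
pass_devices() {
    # Device names sit in the trailing parentheses, e.g. "(pass8,ses0)".
    grep -o 'pass[0-9][0-9]*' "$1" | sort -u
}

# Hypothetical helper: print devices present in the "before" snapshot
# but absent from the "after" snapshot.
# $1: before snapshot, $2: after snapshot.
missing_devices() {
    _before=$(mktemp) && _after=$(mktemp) || return 1
    pass_devices "$1" > "$_before"
    pass_devices "$2" > "$_after"
    # comm -23 keeps only lines unique to the first (sorted) input.
    comm -23 "$_before" "$_after"
    rm -f "$_before" "$_after"
}
```

Assuming the two listings were saved as `before.txt` and `after.txt`, `missing_devices before.txt after.txt` prints one device name per line for everything that disappeared in between.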
Step 5) Then I looked at the zpool status:
Code:
[root@x] ~# zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug 5 03:47:11 2015
config:
NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0
errors: No known data errors
  pool: volume1
 state: DEGRADED
status: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
  scan: none requested
config:
NAME                                            STATE     READ WRITE CKSUM
volume1                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
    gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/685f3984-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
errors: No known data errors
Step 6) I was stupid enough to try this on other drives as well. Thank god I did not try "-d sat" on all of them, just #1 and #7. As the attached file shows, for /dev/pass1 there were no input/output error messages but lots of seek errors, while for /dev/pass7 there were input/output errors while reading the SMART logs, but NO seek errors.
=> See attached file: nas_bug_smartctl_pass1to7.txt
Step 7) Three drives tested with "-d sat", and three different behaviours... After that, device #7 went missing as well! But not device #1.
Summary: devices #0 and #7 don't show up in camcontrol, and the zpool status is DEGRADED; however, the drive STATE is just REMOVED:
Code:
[root@amazonia] ~# camcontrol devlist -v
scbus0 on mfi0 bus 0:
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 1 lun 0 (pass1)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 2 lun 0 (pass2)
<ATA ST2000DM001-1ER1 CC25> at scbus0 target 3 lun 0 (pass3)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 4 lun 0 (pass4)
<ATA ST2000DM001-1CH1 CC27> at scbus0 target 5 lun 0 (pass5)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 6 lun 0 (pass6)
<DP BACKPLANE 1.05> at scbus0 target 32 lun 0 (pass8,ses0)
scbus1 on ata2 bus 0:
<> at scbus1 target -1 lun -1 ()
scbus2 on ata3 bus 0:
<> at scbus2 target -1 lun -1 ()
scbus3 on camsim0 bus 0:
<> at scbus3 target -1 lun -1 ()
scbus4 on umass-sim0 bus 0:
<SanDisk Cruzer Fit 1.27> at scbus4 target 0 lun 0 (pass9,da0)
scbus-1 on xpt0 bus 0:
<> at scbus-1 target -1 lun -1 (xpt0)
[root@x] ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug 5 03:47:11 2015
config:
NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0
errors: No known data errors
  pool: volume1
 state: DEGRADED
status: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
  scan: none requested
config:
NAME                                            STATE     READ WRITE CKSUM
volume1                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
    gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    13413779447659399635                        REMOVED      0     0     0  was /dev/gptid/685f3984-2038-11e5-a8a8-0022198c616a
errors: No known data errors
[root@x] ~# glabel status
Name                                        Status  Components
gptid/6665a078-2038-11e5-a8a8-0022198c616a  N/A     mfid1p2
gptid/669ab163-2038-11e5-a8a8-0022198c616a  N/A     mfid2p2
gptid/66eb58db-2038-11e5-a8a8-0022198c616a  N/A     mfid3p2
gptid/67492df3-2038-11e5-a8a8-0022198c616a  N/A     mfid4p2
gptid/67a444e7-2038-11e5-a8a8-0022198c616a  N/A     mfid5p2
gptid/67d50035-2038-11e5-a8a8-0022198c616a  N/A     mfid6p2
gptid/721d8146-1f4e-11e5-ab27-0022198c616a  N/A     da0p1
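With long listings like the one above, the REMOVED members are easy to overlook. As a rough sketch (the function name `removed_vdevs` is hypothetical, and it assumes the `zpool status` column layout shown in this post, where a removed member ends with `was <old path>`), the affected GUIDs and their former paths can be pulled out with awk:

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` text on stdin and print
# "GUID former-path" for every vdev whose STATE column is REMOVED.
removed_vdevs() {
    # Column 2 is STATE; a removed member ends in "was <old path>".
    awk '$2 == "REMOVED" && $(NF-1) == "was" { print $1, $NF }'
}
```

It would be used as `zpool status volume1 | removed_vdevs`, which prints one line per removed member.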
Step 8) I also checked the system log. VD 00/0 changed from OPTIMAL to OFFLINE and was then Deleted, and the same happened to VD 07/7... but nothing unusual for VD 01/1, although I ran "-d sat" on it... Does that mean these disks were about to fail anyway and I just precipitated it by forcing the SMART check?
=> See attached file nas_bug_demsg.txt
Step 9) After that I tried to bring the drive back into the pool, since "zpool status" just says REMOVED, not FAILED or any other serious error... but the command below does not work:
Code:
[root@x] ~# zpool online -e volume1 gptid/685f3984-2038-11e5-a8a8-0022198c616a
warning: device 'gptid/685f3984-2038-11e5-a8a8-0022198c616a' onlined, but remains in faulted state
[root@x] ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Aug 5 03:47:11 2015
config:
NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0
errors: No known data errors
  pool: volume1
 state: DEGRADED
status: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
  scan: none requested
config:
NAME                                            STATE     READ WRITE CKSUM
volume1                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    7331790985355822600                         REMOVED      0     0     0  was /dev/gptid/662ee6df-2038-11e5-a8a8-0022198c616a
    gptid/6665a078-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/669ab163-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/66eb58db-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67492df3-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67a444e7-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    gptid/67d50035-2038-11e5-a8a8-0022198c616a  ONLINE       0     0     0
    13413779447659399635                        REMOVED      0     0     0  was /dev/gptid/685f3984-2038-11e5-a8a8-0022198c616a
errors: No known data errors
Step 10) I decided to reboot the system... Maybe FreeNAS would detect that I never removed the drives physically and just put them back in the pool... but the system did not start again!! I will go to the site in person tonight to see the error messages at boot time that may be preventing the system from starting.
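One thing worth checking about the failed `zpool online` attempt: the `glabel status` output posted earlier no longer lists gptid/662ee6df-... or gptid/685f3984-..., so the label passed to `zpool online` may simply have had no device behind it. The sketch below is only an illustration of that cross-check, under the assumption that label names appear in the first column of `glabel status`; the function name `check_labels` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether each gptid label given as an
# argument appears in the first column of `glabel status` output
# read from stdin.
check_labels() {
    _status=$(cat)   # capture stdin once so it can be scanned per label
    for _label in "$@"; do
        if printf '%s\n' "$_status" |
            awk -v l="$_label" '$1 == l { found = 1 } END { exit !found }'
        then
            echo "$_label present"
        else
            echo "$_label missing"
        fi
    done
}
```

For example, `glabel status | check_labels gptid/685f3984-2038-11e5-a8a8-0022198c616a` would report the label as missing for the listing shown in this post, which would be consistent with the device remaining in a faulted state.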
What should I do?? Please, I need help!!