Track down HDD issue, if any...

thorgrim · Mar 12, 2013

Hi guys !

Sorry for the long post :)

I have had my FreeNAS setup running for a year now. Here are the details :

Build FreeNAS-8.3.0-RELEASE-p1-x64 (r12825)
Platform AMD E-350 Processor
Memory 7774MB
HDD 5 x 2TB ( RAIDZ2 )
SSD 1 x 64GB ( cache )

The problem is that one of my HDDs seems to be dying, but the info from the GUI and from the command line doesn't match.
Everything started with the GUI displaying an alert message :

Code:

WARNING: The volume share (ZFS) status is UNKNOWN: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

Ok so I go and check the Storage, View Disks and Volume status tabs and here we go :

Storage is reported as healthy, while only 4 disks are displayed (instead of 6) and the Volume Status shows one disk and the SSD as null... That's when I decided to open an ssh connection to the machine and run a zpool status :

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 2h51m with 0 errors on Sun Mar  3 14:51:49 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE      10 1.12K     0
        cache
          ada4p1                                        ONLINE       0 19.3M     0

errors: No known data errors

Everything is shown as online, but the GUI still displays the error message it showed in the first place. Since everything is online, and I tend to trust the command line a bit more than the GUI, I ran a zpool clear to get rid of the error message. But nothing changed, the error message stayed.

So after all that, I decided to order a new drive to replace the "maybe" dying one. I stopped the machine to swap the drive but had forgotten to note the serial number, so when I opened the case I didn't know which one to remove. I just cleaned out a bit of dust and restarted the machine to get the info, but now everything was fully functional again: no more errors, everything online. Perfect... But wait, I just restarted the machine ? Never mind, I noted the serial number of the "dying" disk but decided to leave it alone for now, since everything was back to normal.

But as you can guess, a couple of hours later I was back at the same point: an error message, everything online, but not in all the views... So now I just want to know for sure whether the drive needs to be replaced. Are there any tools installed in FreeNAS to check that ? Which should I trust, the GUI or the command line ? And why doesn't a zpool clear remove the error message when zpool status says everything is online and healthy ?

Thanks to those who read this far, and I will take any advice you may have !

originalprime · Mar 12, 2013

thorgrim,

I know that SMART status isn't the be-all-end-all of HDD diagnosis, but I am curious what the SMART status is of the drive that seems to be error prone.

Assuming you are using SATA disks, the command would be (where X equals the ada assignment of the disk that's failing):

Code:

smartctl -a /dev/adaX

Sometimes a system has a brain fart and just needs to be rebooted. I schedule reboots on all of my servers just for this reason. I would certainly not assume the worst until you've done some more digging. Hopefully your disks are OK! And if they're not, then that's why you're using RAIDZ2. :)

paleoN · Mar 12, 2013

thorgrim said:
But as you can guess, a couple of hours later I was back at the same point: an error message, everything online, but not in all the views... So now I just want to know for sure whether the drive needs to be replaced.

Naturally, as the drive is likely dying. When you reboot all of the counters are reset. Then you start having additional errors.

Code:

glabel status

Plus the below smartctl command will give you the serials and let you look at the other attributes.

originalprime said:
Assuming you are using SATA disks, the command would be (where X equals the ada assignment of the disk that's failing):

Made for the web:

Code:

smartctl -q noserial -a /dev/adaX

originalprime said:
Sometimes a system has a brain fart and just needs to be rebooted. I schedule reboots on all of my servers just for this reason.

Maybe Windows? A properly functioning server shouldn't need to be rebooted e.g. weekly.

originalprime · Mar 12, 2013

paleoN said:
Maybe Windows? A properly functioning server shouldn't need to be rebooted e.g. weekly.

Yep. Windows takes the cake. And I'm referring to my work servers ; ) We schedule the reboots to alleviate weird and quirky brainfarts with various systems. We've found it far better to be proactive than to have to reboot a server during production (work) hours!

thorgrim · Mar 12, 2013

Thanks for the input !

glabel status output :

Code:

[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/a179b007-59c3-11e1-9ae0-14dae9686174     N/A  ada0p2
gptid/a20198bf-59c3-11e1-9ae0-14dae9686174     N/A  ada1p2
gptid/a2951acb-59c3-11e1-9ae0-14dae9686174     N/A  ada2p2
gptid/a322af78-59c3-11e1-9ae0-14dae9686174     N/A  ada3p2
gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174     N/A  ada5p2
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                    ufsid/504835b9736eb6c8     N/A  da0s1a
                            ufs/FreeNASs1a     N/A  da0s1a
                            ufs/FreeNASs2a     N/A  da0s2a

The faulty drive is (based on the GUI info) ada5, but smartctl doesn't run since ada5 does not exist under /dev :

Code:

[root@freenas] ~# smartctl -q noserial -a /dev/ada5
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/ada5p2: Unable to detect device type
Smartctl: please specify device type with the -d option.

Use smartctl -h to get a usage summary

[root@freenas] ~# smartctl -q noserial -a /dev/ada
ada0%   ada0p1% ada0p2% ada1%   ada1p1% ada1p2% ada2%   ada2p1% ada2p2% ada3%   ada3p1% ada3p2% 
[root@freenas] ~# smartctl -q noserial -a /dev/ada

I'm beginning to suspect maybe a hardware issue around the motherboard or the SATA cables. I'll look for an update for the BIOS maybe.

paleoN · Mar 12, 2013

thorgrim said:
The faulty drive is (based on the GUI info) ada5, but smartctl doesn't run since ada5 does not exist under /dev :

Neither does ada4 which I just noticed also has write errors. To confirm that the system doesn't see ada4 & ada5:

Code:

camcontrol devlist -v

thorgrim said:
I'm beginning to suspect maybe a hardware issue around the motherboard or the SATA cables. I'll look for an update for the BIOS maybe.

+1 Particularly if they are attached via the same controller.

originalprime said:
We've found it far better to be proactive than to have to reboot a server during production (work) hours!

True enough. ;)

thorgrim · Mar 12, 2013

paleoN said:
Neither does ada4 which I just noticed also has write errors.

Where do you see ada4 having write errors ? Just curious :) My guess was that ada4 was the SSD, which seems to be playing yo-yo too if you refer to the screenshots in my first post. The RAIDZ2 groups ada0, ada1, ada2, ada3 and ada5, while the cache is ada4. When I first built the machine I ordered 4 disks, and maybe two days after firing it up I ordered one more drive and the SSD; that must explain the disorder in the numbers.

So after a fresh reboot (after all, the machine had been running almost one full day since the last reboot... ;) )

Code:

[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/a179b007-59c3-11e1-9ae0-14dae9686174     N/A  ada0p2
gptid/a20198bf-59c3-11e1-9ae0-14dae9686174     N/A  ada1p2
gptid/a2951acb-59c3-11e1-9ae0-14dae9686174     N/A  ada2p2
gptid/a322af78-59c3-11e1-9ae0-14dae9686174     N/A  ada3p2
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                    ufsid/504835b9736eb6c8     N/A  da0s1a
                            ufs/FreeNASs1a     N/A  da0s1a
                            ufs/FreeNASs2a     N/A  da0s2a
[root@freenas] ~# camcontrol devlist -v
scbus0 on ahcich0 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus0 target 0 lun 0 (pass0,ada0)
<>                                 at scbus0 target -1 lun -1 ()
scbus1 on ahcich1 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus1 target 0 lun 0 (pass1,ada1)
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ahcich2 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus2 target 0 lun 0 (pass2,ada2)
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on ahcich3 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus3 target 0 lun 0 (pass3,ada3)
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on ata0 bus 0:
<>                                 at scbus4 target -1 lun -1 ()
scbus5 on ata1 bus 0:
<>                                 at scbus5 target -1 lun -1 ()
scbus6 on umass-sim0 bus 0:
<SMI USB DISK 1100>                at scbus6 target 0 lun 0 (pass4,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)
[root@freenas] ~# zpool status
  pool: share
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 2h51m with 0 errors on Sun Mar  3 14:51:49 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            10762394583603093102                        UNAVAIL      0     0     0  was /dev/gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174
        cache
          10921270572903689210                          UNAVAIL      0     0     0  was /dev/ada4p1

errors: No known data errors

glabel no longer lists either ada4 or ada5, they don't appear in the camcontrol output either, and zpool status clearly shows one drive and the SSD down.
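
For anyone following along, the cross-check between zpool status and glabel status can be sketched as a small script. This is only an illustration, not something the system ships with: the helper name label_component is made up, and the sample lines are copied from the outputs posted in this thread. It takes a gptid label and reports which device node backs it, or "missing" if glabel no longer lists it.

```shell
#!/bin/sh
# Look up which device node backs a given label in `glabel status` output.
# Prints the component (e.g. ada5p2), or "missing" if the label is not listed.
# label_component is a hypothetical helper name used for illustration.
label_component() {
    # $1 = label name; the glabel output is read from stdin
    awk -v want="$1" '$1 == want { print $3; found = 1 }
                      END { if (!found) print "missing" }'
}

# Sample lines copied from the glabel output posted earlier in the thread.
sample='gptid/a322af78-59c3-11e1-9ae0-14dae9686174     N/A  ada3p2
gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174     N/A  ada5p2'

printf '%s\n' "$sample" | label_component gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174
printf '%s\n' "$sample" | label_component gptid/does-not-exist

# On the live system (assumption: run as root on the NAS):
#   glabel status | label_component gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174
```

A label that shows up in zpool status but comes back as "missing" here points at the device having dropped off the bus rather than at ZFS itself.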

I was thinking maybe to remove the SSD, eliminating some confusion to start with. And change the ada5 drive with the new one I just ordered. What would be the best procedure for all that ?
I'll tackle that tomorrow, with the BIOS update.

Thanks for your help !

paleoN · Mar 13, 2013

thorgrim said:
Where do you see ada4 having write errors ? Just curious :)

What does the zpool status output mean?

thorgrim said:
I was thinking maybe to remove the SSD, eliminating some confusion to start with. And change the ada5 drive with the new one I just ordered.

There's no confusion. The cache device is ada4, well technically the first partition on ada4, and ada5 is part of the raidz2 vdev. The device numbering is irrelevant as long as the user knows which is what.

I agreed with your supposition above about a possible motherboard issue unless you think it more likely both the SSD & ada5 failed simultaneously? In which case swapping ada5 won't accomplish much. How and where are the SSD & ada5 connected?

thorgrim · Mar 13, 2013

paleoN said:
What does the zpool status output mean?

Ooops, I just thought the read/write columns were for actual read/write operations ongoing on the device... ;)
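
Since the READ, WRITE and CKSUM columns are error counters rather than I/O activity, a quick way to spot the affected devices is to filter zpool status for non-zero counters. The following is a rough sketch only: the function name nonzero_errors is made up for illustration, and the sample is an abridged copy of the zpool status output from the first post.

```shell
#!/bin/sh
# List pool members whose READ, WRITE or CKSUM error counter is non-zero.
# Reads `zpool status` output on stdin. nonzero_errors is a hypothetical name.
nonzero_errors() {
    awk '
        /^config:/ { in_cfg = 1; next }   # the device table starts after "config:"
        /^errors:/ { in_cfg = 0 }         # and ends at the "errors:" line
        in_cfg && NF >= 5 && $3 ~ /^[0-9]/ {
            # counters are compared as strings, so "1.12K" counts as non-zero
            if ($3 != "0" || $4 != "0" || $5 != "0")
                printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }'
}

# Abridged sample taken from the first zpool status in this thread.
sample='config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE      10 1.12K     0
        cache
          ada4p1                                        ONLINE       0 19.3M     0

errors: No known data errors'

printf '%s\n' "$sample" | nonzero_errors

# On the live system: zpool status share | nonzero_errors
```

With the sample above this flags both the ada5 partition in the raidz2 vdev and the ada4p1 cache device, which is what paleoN was pointing at.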

paleoN said:
I agreed with your supposition above about a possible motherboard issue unless you think it more likely both the SSD & ada5 failed simultaneously? In which case swapping ada5 won't accomplish much. How and where are the SSD & ada5 connected?

All drives are connected to the motherboard directly.

I did the BIOS update and verified that all drives are properly detected by the BIOS :

But now the BIOS settings are all back to default and my system does not boot anymore. This Asus BIOS is kind of tricky... I have to find the right values again.

paleoN · Mar 13, 2013

thorgrim said:
I did the BIOS update and verified that all drives are properly detected by the BIOS

For one thing try switching the drives to AHCI mode. For the boot issue you likely need to change the HD boot order and have the USB stick first.

thorgrim · Mar 13, 2013

paleoN said:
For one thing try switching the drives to AHCI mode. For the boot issue you likely need to change the HD boot order and have the USB stick first.

Ok just got it back to boot. I have an Asus E35M1-I motherboard and getting it to boot from USB is kind of tricky. You have to find the right combo of SATA, USB and "BBS" values... I'll write all that down for next time, if any :)

And now all drives are present under /dev/adaX so let's go for some smartctl love :

For the ada5 HDD :

Code:

[root@freenas] ~# smartctl -q noserial -a /dev/ada5
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (Adv. Format)
Device Model:     ST2000DL003-9VT166
Firmware Version: CC3C
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Mar 13 18:08:20 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  612) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 342) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   100   006    Pre-fail  Always       -       207763968
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   060   060   030    Pre-fail  Always       -       8591818457
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8303
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   060   045    Old_age   Always       -       28 (Min/Max 24/28)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   023   004   000    Old_age   Always       -       207763968
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       170694885253231
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       613751501
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1843876835

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

For the ada4 SSD :

Code:

[root@freenas] ~# smartctl -q noserial -a /dev/ada4
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron RealSSD C300/C400/m4
Device Model:     M4-CT064M4SSD2
Firmware Version: 0009
User Capacity:    64,023,257,088 bytes [64.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Wed Mar 13 18:09:19 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  295) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   4) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5216
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       22
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Levelling_Count    0x0033   100   100   010    Pre-fail  Always       -       7
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       0
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       241 74 167
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       48
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)       Completed without error       00%      5215         -
# 2  Vendor (0xff)       Completed without error       00%      5214         -
# 3  Vendor (0xff)       Completed without error       00%      5214         -
# 4  Vendor (0xff)       Completed without error       00%      5204         -
# 5  Vendor (0xff)       Completed without error       00%      5203         -
# 6  Vendor (0xff)       Completed without error       00%      5190         -
# 7  Vendor (0xff)       Completed without error       00%      5178         -
# 8  Vendor (0xff)       Completed without error       00%      5166         -
# 9  Vendor (0xff)       Completed without error       00%      5153         -
#10  Vendor (0xff)       Completed without error       00%      5141         -
#11  Vendor (0xff)       Completed without error       00%      5128         -
#12  Vendor (0xff)       Completed without error       00%      5116         -
#13  Vendor (0xff)       Completed without error       00%      5105         -
#14  Vendor (0xff)       Completed without error       00%      5093         -
#15  Vendor (0xff)       Completed without error       00%      5080         -
#16  Vendor (0xff)       Completed without error       00%      5068         -
#17  Vendor (0xff)       Completed without error       00%      4989         -
#18  Vendor (0xff)       Completed without error       00%      4976         -
#19  Vendor (0xff)       Completed without error       00%      4964         -
#20  Vendor (0xff)       Completed without error       00%      4951         -
#21  Vendor (0xff)       Completed without error       00%      4939         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Everything is in such good shape that a zpool clear removes the error message. My guess is, it's going to take less than 10 minutes for it to come back...
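
To avoid reading the whole report each time, the handful of attributes people usually watch can be pulled out of the smartctl -a output. This is a sketch under assumptions: the choice of attributes (5, 197, 198, 199) is a common rule of thumb rather than something prescribed in this thread, and the function name smart_summary is made up. UDMA_CRC_Error_Count is included because it tends to rise with cabling or interface trouble, which is one of the suspects here.

```shell
#!/bin/sh
# Print the raw value of a few SMART attributes from `smartctl -a` output (stdin).
# The attribute selection is an assumption; smart_summary is a hypothetical name.
smart_summary() {
    # Attribute rows have 10 columns; the NF check skips the selective
    # self-test table, whose rows also start with a small number.
    awk 'NF >= 10 && ($1 == "5" || $1 == "197" || $1 == "198" || $1 == "199") {
             print $2 "=" $10
         }'
}

# Sample rows copied from the ada5 report above.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8303
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    5        0        0  Not_testing'

printf '%s\n' "$sample" | smart_summary

# On the live system: smartctl -q noserial -a /dev/ada5 | smart_summary
```

All four come back as 0 for ada5, which matches the "PASSED" verdict but does not rule out a controller or cable problem.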

thorgrim · Mar 14, 2013

So, after a good night of doing mostly nothing, all drives except the SSD seem to be fine :

Code:

[root@freenas] ~# zpool status 
  pool: share
 state: ONLINE
  scan: scrub repaired 0 in 2h51m with 0 errors on Sun Mar  3 14:51:49 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
        cache
          ada4p1                                        ONLINE       0 93.0K     0

errors: No known data errors

Code:

[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/a179b007-59c3-11e1-9ae0-14dae9686174     N/A  ada0p2
gptid/a20198bf-59c3-11e1-9ae0-14dae9686174     N/A  ada1p2
gptid/a2951acb-59c3-11e1-9ae0-14dae9686174     N/A  ada2p2
gptid/a322af78-59c3-11e1-9ae0-14dae9686174     N/A  ada3p2
gptid/a395a1ac-59c3-11e1-9ae0-14dae9686174     N/A  ada5p1
gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174     N/A  ada5p2
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                    ufsid/504835b9736eb6c8     N/A  da0s1a
                            ufs/FreeNASs1a     N/A  da0s1a
                            ufs/FreeNASs2a     N/A  da0s2a

Code:

[root@freenas] ~# camcontrol devlist -v
scbus0 on ahcich0 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus0 target 0 lun 0 (pass0,ada0)
<>                                 at scbus0 target -1 lun -1 ()
scbus1 on ahcich1 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus1 target 0 lun 0 (pass1,ada1)
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ahcich2 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus2 target 0 lun 0 (pass2,ada2)
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on ahcich3 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus3 target 0 lun 0 (pass3,ada3)
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on ahcich4 bus 0:
<M4-CT064M4SSD2 0009>              at scbus4 target 0 lun 0 (ada4)
<>                                 at scbus4 target -1 lun -1 ()
scbus5 on ahcich5 bus 0:
<ST2000DL003-9VT166 CC3C>          at scbus5 target 0 lun 0 (pass5,ada5)
<>                                 at scbus5 target -1 lun -1 ()
scbus6 on umass-sim0 bus 0:
<SMI USB DISK 1100>                at scbus6 target 0 lun 0 (pass6,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)

It appears as null in the GUI, is not found by camcontrol nor glabel, but zpool status shows it ONLINE ? I'll leave it like that for the day and tonight I'll see what is left. But it does not smell really good for the SSD. At least ada5 seems to be back to normal life.

cyberjock · Mar 14, 2013

Yeah, ada4 looks like it's about done. But the SMART data for it (assuming it's the Crucial M4 you listed SMART data for above) doesn't seem to indicate any problems in the past. It sounds like either it has just failed, the disk actually is bad but isn't giving indications in SMART, or the disk itself isn't the problem and you should look elsewhere (cables, etc).

I wouldn't rule out the disk being the problem until you test it. But obviously something is not working with the caching.

Edit: As far as ada5 is concerned, of course it looks fine. The numbers reset on reboot and as you said "So, after a good night of doing mostly nothing" so there wasn't much to do that could go wrong. A parked car is never broken. I'd still question its reliability until you do a complete scrub without errors and a long SMART test. Considering you were smart and went with a RAIDZ2 you have a little more headroom than many people, but I wouldn't rely on it. I'd make the scrub and SMART test a high priority.
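
For reference, the scrub and the long SMART test boil down to a handful of commands. Below is a cautious sketch that only prints them unless RUN=1 is set. The pool name "share" and the device "ada5" are taken from this thread; the wrapper names run and check_plan are made up for illustration.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the scrub and long self-test discussed above.
# Pool "share" and disk "ada5" come from this thread; override via POOL / DISK.
POOL=${POOL:-share}
DISK=${DISK:-ada5}

run() {
    # Echo the command; execute it only when RUN=1 is set.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# check_plan is a hypothetical wrapper name.
check_plan() {
    run zpool scrub "$POOL"                 # start a full scrub of the pool
    run smartctl -t long "/dev/$DISK"       # start the extended self-test
    run zpool status "$POOL"                # scrub progress and error counters
    run smartctl -l selftest "/dev/$DISK"   # self-test log (results appear later)
}

check_plan
```

Both the scrub and the long test run in the background, so the last two commands are worth repeating later; the ada5 report above estimates about 342 minutes for the extended test.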

thorgrim · Mar 14, 2013

cyberjock said:
Yeah, ada4 looks like it's about done. But the SMART data for it (assuming it's the Crucial M4 whose SMART data you listed above) doesn't seem to indicate any problems in the past. It sounds like either it has just failed, the disk actually is bad but SMART isn't showing it, or the disk itself isn't the problem and you should look elsewhere (cables, etc.).

I wouldn't rule out the disk being the problem until you test it. But obviously something is not working with the caching.

Edit: As far as ada5 is concerned, of course it looks fine. The numbers reset on reboot and, as you said, "So, after a good night of doing mostly nothing", so there wasn't much to do that could go wrong. A parked car is never broken. I'd still question its reliability until you do a complete scrub without errors and a long SMART test. Considering you were smart and went with RAIDZ2, you have a little more headroom than many people, but I wouldn't rely on it. I'd make the scrub and SMART test a high priority.

Ok, so I started a scrub on the pool. Everything went well for about 20 minutes, running fine with no errors, then all of a sudden it just stopped, still displaying no error, as if everything were perfectly fine.

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
  scan: scrub in progress since Thu Mar 14 05:21:46 2013
        223G scanned out of 1.44T at 215M/s, 1h39m to go
        0 repaired, 15.12% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
        cache
          ada4p1                                        ONLINE       0  191K     0

errors: No known data errors

    .... .... .... ....

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
  scan: scrub repaired 0 in 0h21m with 0 errors on Thu Mar 14 05:43:16 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
        cache
          ada4p1                                        ONLINE       0  226K     0

errors: No known data errors

But I knew it had not gone through the entire pool, so I started another scrub and it immediately reported errors:

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Mar 14 05:46:14 2013
        25.4G scanned out of 1.44T at 124M/s, 3h20m to go
        1.24M repaired, 1.72% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0   317  (repairing)
        cache
          ada4p1                                        ONLINE       0  252K     0

errors: No known data errors

For now it still seems to be running, but at about half the speed it had before. The speed is not an issue as long as it gets to the end this time and at least confirms that the data on the 4 other disks is good in case I replace ada5.
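
The "to go" figure in the scan line appears to be simply the remaining data divided by the current scan rate, which is why the estimate stretches out as the rate drops. Here is a minimal Python sketch of that arithmetic. The helper names are made up for illustration, and it assumes that zpool's K/M/G/T suffixes are powers of 1024.

```python
import re

# Multipliers for zpool's size suffixes, assumed here to be binary (1K = 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a string such as '1.44T' or '124M' into a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def scrub_eta_seconds(scanned, total, rate_per_second):
    """Remaining data divided by the current scan rate, in seconds."""
    remaining = to_bytes(total) - to_bytes(scanned)
    return remaining / to_bytes(rate_per_second)


def format_eta(seconds):
    """Render a duration the way zpool does, e.g. '3h19m'."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}h{remainder // 60:02d}m"


# Figures taken from the second scrub's scan line above.
print(format_eta(scrub_eta_seconds("25.4G", "1.44T", "124M")))
```

With the rounded figures from the status output this comes out at roughly 3h19m, in line with the 3h20m that zpool reported.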

Just to be sure, when you say run a long SMART test, you mean something like smartctl -t long /dev/ada5?

I chose RAIDZ2 for just this kind of situation, to have time to take the necessary steps before total failure. And the really important stuff is backed up on two other external drives. Still, it would be nice not to have to rebuild everything :)

2pm update
The scrub is still running, and the further it goes, the more it slows down. Anything to fear? Also, it seems to have repaired quite a lot of "files":

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Mar 14 05:46:14 2013
        1.05T scanned out of 1.44T at 72.9M/s, 1h34m to go
        76.4G repaired, 72.71% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0 1.81M  (repairing)
        cache
          ada4p1                                        ONLINE       0 39.0M     0

errors: No known data errors

cyberjock · Mar 14, 2013

thorgrim said:
Ok, so I started a scrub on the pool. Everything went well for about 20 minutes, running fine with no errors, then all of a sudden it just stopped, still displaying no error, as if everything were perfectly fine.

That's kind of what I was expecting. As soon as you started verifying the data on the zpool you'd see ada5 was trashed. There's a chance that the scrub will never finish with ada5 installed. This is where having hard drives that have TLER (read: expensive) could have made the difference between a scrub completing and not completing. If the scrub starts saying it'll take weeks, months, or years, I'd consider replacing ada5 and not worry about the scrub finishing.

I would look at the SMART data for ada5 and see what it says for Current_Pending_Sector_Count (which is sometimes Current Uncorrectable Pending Sectors or similar). If the value is not zero and is increasing at a regular rate, the drive may not finish the scrub. Even if it can finish the scrub, I probably wouldn't waste my time letting it finish.
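
To watch that value over time, the attribute table printed by smartctl -A can be read with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes the usual ten-column table layout (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value) with attribute ID 197 being the pending sector count.

```python
def pending_sectors(smartctl_attribute_output):
    """Return the raw value of SMART attribute 197, or None if it is not listed."""
    for line in smartctl_attribute_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "197":
            return int(fields[9])
    return None


# One row in the assumed smartctl -A layout; the numbers are illustrative.
SAMPLE = (
    "197 Current_Pending_Sector  0x0012   100   100   000"
    "    Old_age   Always       -       0"
)
print(pending_sectors(SAMPLE))
```

Calling it on output captured at two different times and comparing the results shows whether the count is growing.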

thorgrim said:
Just to be sure, when you say run a long SMART test, you mean something like smartctl -t long /dev/ada5?

Exactly.

Edit: I wouldn't be fearful of it fixing stuff; it looks like it's trying to fix the broken drive (an exercise in futility). Considering none of the other drives have any errors, it looks like all of your time is going to be spent trying to fix a drive that you are going to pull out and RMA/trash anyway. ada5 will never be suitable for storing data again. So I'd just shut down the server and pull the bad drive now. Just follow the manual for the proper directions for offlining and replacing disks. If you have a spare disk already available, I'd install the new disk and let it resilver. Once that finishes I always do another scrub just to make sure everything is perfect, then call it a job well done.

thorgrim · Mar 14, 2013

cyberjock said:
That's kind of what I was expecting. As soon as you started verifying the data on the zpool you'd see ada5 was trashed. There's a chance that the scrub will never finish with ada5 installed. This is where having hard drives that have TLER (read: expensive) could have made the difference between a scrub completing and not completing. If the scrub starts saying it'll take weeks, months, or years, I'd consider replacing ada5 and not worry about the scrub finishing.

I would look at the SMART data for ada5 and see what it says for Current_Pending_Sector_Count (which is sometimes Current Uncorrectable Pending Sectors or similar). If the value is not zero and is increasing at a regular rate, the drive may not finish the scrub. Even if it can finish the scrub, I probably wouldn't waste my time letting it finish.

Looks like the scrub is about to end, at least that's what it says at the moment:

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Mar 14 05:46:14 2013
        1.32T scanned out of 1.44T at 63.4M/s, 0h32m to go
        120G repaired, 91.77% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0 2.84M  (repairing)
        cache
          ada4p1                                        ONLINE       0 71.1M     0

errors: No known data errors

cyberjock said:
Edit: I wouldn't be fearful of it fixing stuff; it looks like it's trying to fix the broken drive (an exercise in futility). Considering none of the other drives have any errors, it looks like all of your time is going to be spent trying to fix a drive that you are going to pull out and RMA/trash anyway. ada5 will never be suitable for storing data again. So I'd just shut down the server and pull the bad drive now. Just follow the manual for the proper directions for offlining and replacing disks. If you have a spare disk already available, I'd install the new disk and let it resilver. Once that finishes I always do another scrub just to make sure everything is perfect, then call it a job well done.

Yeah, I think that when I come home tonight ada5 is going to say goodbye to its buddies and take a trip back to Seagate. I'll let the scrub continue, just out of curiosity, to see whether it completes or not.

Can I run a SMART test while the scrub is running? I guess so, but I don't want to make things worse than they already are... And my SSD seems to be in bad shape too, with zpool status reporting lots of WRITE errors. I'll try to find a cable to give it a chance, but my guess is it will be kicked out soon as well.

Another thing I noticed is that the load average is getting really high:

Code:

last pid: 13820;  load averages:  6.09,  5.11,  5.59                                                                  up 0+17:56:52  12:00:30
43 processes:  4 running, 39 sleeping
CPU: 39.1% user,  0.0% nice, 60.9% system,  0.0% interrupt,  0.0% idle
Mem: 173M Active, 105M Inact, 2034M Wired, 2140K Cache, 236M Buf, 5189M Free
Swap: 8192M Total, 8192M Free

I've seen it go up to 8 and more today while the scrub was running. Is that normal, or another sign that something is wrong?
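
A load average is easier to judge relative to the number of CPU cores. The E-350 has two, so a load of 6 means about three runnable threads per core on average. A tiny Python sketch of that normalisation follows; the function name is made up for illustration.

```python
def load_per_core(load_average, cores):
    """Average number of runnable threads per CPU core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


# One-minute load from the top output above, with the E-350's two cores.
print(load_per_core(6.09, 2))
```

On the NAS itself the live figures could be fed in from os.getloadavg() and os.cpu_count(). A sustained value well above 1 per core only says that work is queuing for the CPU; it does not show which process or kernel thread is responsible.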

Edit: I've run smartctl to check the Current_Pending_Sector value, and it says 0:

Code:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

thorgrim · Mar 14, 2013

Scrub finished!

Code:

[root@freenas] ~# zpool status
  pool: share
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 128G in 6h35m with 0 errors on Thu Mar 14 12:21:23 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a179b007-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a20198bf-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a2951acb-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0 3.03M
        cache
          ada4p1                                        ONLINE       0 79.9M     0

errors: No known data errors

Time to go home and take this ada5 drive out of my NAS :)
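
For anyone who wants to pick out the failing devices without reading the table by eye, the READ/WRITE/CKSUM columns of zpool status can be scanned with a short Python script. This is a rough sketch with hypothetical function names, not an official tool. It assumes the abbreviated counters (such as 3.03M) use powers of 1024, so the converted numbers are only approximate.

```python
# Suffix multipliers for abbreviated counters; powers of 1024 are an assumption.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_count(text):
    """Turn a counter such as '0', '317' or '3.03M' into an approximate integer."""
    if text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def devices_with_errors(zpool_status_output):
    """Return (device, read, write, cksum) for each row with a non-zero counter."""
    found = []
    in_config = False
    for line in zpool_status_output.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # device rows follow the column header
            continue
        if line.startswith("errors:"):
            in_config = False
        # Device rows carry at least: name, state, read, write, cksum.
        if not in_config or len(fields) < 5:
            continue
        try:
            read, write, cksum = (parse_count(f) for f in fields[2:5])
        except ValueError:
            continue  # not a counter row
        if read or write or cksum:
            found.append((fields[0], read, write, cksum))
    return found


# Abridged copy of the final status output above.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        share                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/a322af78-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0     0
            gptid/a3ae295d-59c3-11e1-9ae0-14dae9686174  ONLINE       0     0 3.03M
        cache
          ada4p1                                        ONLINE       0 79.9M     0

errors: No known data errors
"""

for device, read, write, cksum in devices_with_errors(SAMPLE):
    print(device, read, write, cksum)
```

On the sample it lists only the last raidz2 member (checksum errors) and the cache device ada4p1 (write errors), which matches what the table shows.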

cyberjock · Mar 14, 2013

Yeah, I'd pull ada5. Try a different cable for the SSD and see if that helps. If not, maybe RMA the SSD too.

At least you didn't lose any data!

thorgrim · Mar 14, 2013

Resilvering is under way. Time to go the RMA route.

The SSD seems to be back. I changed the power plug, but not the SATA cable, as I have no spare. I'll have one tomorrow and will swap it, hoping this will do the trick. But for now everything is working perfectly fine: no errors reported, resilvering progressing at full speed (250M/s, instead of the 60M/s the scrub managed all day), and all that while streaming a movie. I love FreeNAS :)

Thanks for the help!
