Critical Alert

morxy49 · Sep 23, 2015

Just got an email from my NAS saying
"The volume pandora_vol0 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."

The NAS is currently not responding via the web interface. Should I try to restart it manually?

If I had to guess, it's ada0 being a little bitch again, as it usually is. But I assume there's a log I can check somewhere? How do I make sense of this error message?

The version is FreeNAS-9.3-STABLE-201412090314

morxy49 · Sep 23, 2015

UPDATE: The NAS rebooted itself without me doing anything. Waiting for it to boot atm.

morxy49 · Sep 23, 2015

Ok, so the NAS is up and running again. No error messages whatsoever, which is kind of concerning...
I checked out ada0, and this is what I got:

Code:

[root@PandorasBox] ~# smartctl -a /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00AZ6B0
Serial Number:    WD-WCC070254514
LU WWN Device Id: 5 0014ee 25d5df87f
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 23 21:23:07 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (50700) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 487) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   244   021    Pre-fail  Always       -       7991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       94
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       14706
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       746406
194 Temperature_Celsius     0x0022   101   097   000    Old_age   Always       -       51
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     14705         4183154992
# 2  Short offline       Completed: read failure       90%     14670         4162610798
# 3  Short offline       Completed: read failure       90%     14646         4162610792
# 4  Short offline       Completed: read failure       90%     14622         4162610796
# 5  Short offline       Completed: read failure       90%     14598         4162610798
# 6  Short offline       Completed: read failure       50%     14574         4162610798
# 7  Short offline       Completed: read failure       90%     14550         4162610792
# 8  Short offline       Completed: read failure       90%     14526         4162610798
# 9  Short offline       Completed: read failure       90%     14502         4162610798
#10  Short offline       Completed: read failure       90%     14478         4162610798
#11  Short offline       Completed: read failure       90%     14454         4162610798
#12  Short offline       Completed: read failure       90%     14430         4162610798
#13  Short offline       Completed: read failure       10%     14406         4162610798
#14  Short offline       Completed: read failure       90%     14382         4162610798
#15  Short offline       Completed: read failure       90%     14358         4162610798
#16  Short offline       Completed: read failure       90%     14334         4162610798
#17  Short offline       Completed: read failure       90%     14310         4162610798
#18  Short offline       Completed: read failure       90%     14287         4162610798
#19  Short offline       Completed: read failure       90%     14263         4162610798
#20  Short offline       Completed: read failure       90%     14239         4162610798
#21  Short offline       Completed: read failure       90%     14215         4162610798

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The current pending sector count has been 2 for about a year now, so I don't really think that's the problem. What else could it be?

Oh, and the version is FreeNAS-9.3-STABLE-201412090314. Edited the first post as well.

Jailer · Sep 23, 2015

I'd be more concerned about the load cycle count on that drive. It also has not completed a short test. I'd replace that drive and look into using wdidle3.exe on its replacement if you intend on sticking with WD Green drives.
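
To put that load cycle count in perspective, here is a rough back-of-the-envelope calculation using the Load_Cycle_Count and Power_On_Hours raw values from the smartctl output above. It is only arithmetic on the posted numbers; the function name is made up for illustration.

```python
def load_cycles_per_hour(load_cycle_count: int, power_on_hours: int) -> float:
    """Average number of head load/unload cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# Raw values of attributes 193 and 9 from the posted output for ada0.
rate = load_cycles_per_hour(746406, 14706)
seconds_between_cycles = 3600 / rate

print(f"{rate:.1f} cycles/hour, one roughly every {seconds_between_cycles:.0f} s")
```

That works out to about 51 load cycles per hour, so on average the heads have been parking a little more often than once every minute and a half for the drive's whole life.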

morxy49 · Sep 23, 2015

Jailer said:
I'd be more concerned about the load cycle count on that drive. It also has not completed a short test. I'd replace that drive and look into using wdidle3.exe on its replacement if you intend on sticking with WD Green drives.

Yeah, I'm not really a fan of WD Green drives, actually. I just happened to have 3 of them lying around when I built my NAS. pandora_vol0 consists of 3 WD Greens and 5 WD Reds in a RAIDZ2.
I agree that it may be time to replace that drive, but can I be certain that nothing else is causing this critical alert?
As I said, this disk has been nagging me for about a year now (current pending sector count, temperatures, etc.) without anything actually happening, so I'd be surprised if it really is that disk causing the problems.

danb35 · Sep 23, 2015

Every SMART short test (for the last three weeks, anyway) has failed, and the disk has a load cycle count over 750k (and I believe the lifetime spec is 600k). I'd agree with @Jailer's recommendation to replace the drive.

The ZFS error counts are reset when the machine reboots, so it's expected that they'd show 0 now. If you start a scrub, expect to see errors again.
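
For anyone who wants to tally the failed self-tests rather than eyeball them, a minimal sketch along these lines might help. It assumes the self-test log rows look like the ones posted above; the regular expression and the `summarize` helper are illustrative only and are not part of smartmontools.

```python
import re

# A few rows copied from the self-test log posted earlier in the thread.
SAMPLE_LOG = """\
# 1  Short offline       Completed: read failure       10%     14705         4183154992
# 2  Short offline       Completed: read failure       90%     14670         4162610798
# 3  Short offline       Completed: read failure       90%     14646         4162610792
# 4  Short offline       Completed: read failure       90%     14622         4162610796
"""

# Columns: number, description, status, remaining %, lifetime hours, first-error LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def summarize(log_text):
    """Return (number of read failures, sorted distinct first-error LBAs)."""
    failed = 0
    lbas = set()
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip headers and anything that is not a log row
        if "read failure" in match.group("status"):
            failed += 1
            if match.group("lba") != "-":
                lbas.add(int(match.group("lba")))
    return failed, sorted(lbas)


failed, lbas = summarize(SAMPLE_LOG)
print(f"{failed} failed self-tests, first-error LBAs: {lbas}")
```

Feeding it the full log instead of the four sample rows would show the same small cluster of LBAs failing over and over, which points at a persistent bad spot rather than a one-off glitch.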

morxy49 · Sep 23, 2015

danb35 said:
Every SMART short test (for the last three weeks, anyway) has failed, and the disk has a load cycle count over 750k (and I believe the lifetime spec is 600k). I'd agree with @Jailer's recommendation to replace the drive.

The ZFS error counts are reset when the machine reboots, so it's expected that they'd show 0 now. If you start a scrub, expect to see errors again.

Oh, ok. Well, maybe I should buy a replacement drive ASAP then. I'm currently backing up everything before I do any more experimenting. Better safe than sorry.

Jailer · Sep 23, 2015

I'd be checking those other Green drives out real close as well. If they've been in the system as long as this one, then they are likely suffering from excessive load cycle counts too and will probably reach the end of their life sooner rather than later.

morxy49 · Sep 23, 2015

Jailer said:
I'd be checking those other Green drives out real close as well. If they've been in the system as long as this one, then they are likely suffering from excessive load cycle counts too and will probably reach the end of their life sooner rather than later.

*sigh* Yeah, I know... I've been waiting for this day, and now it has come. Well, time to open up the big wallet!

morxy49 · Sep 23, 2015

Ok, but back to the real question. Is there a log somewhere I can check to see what actually caused the critical alert?

Bidule0hm · Sep 23, 2015

51 °C, wow, that's a very high temp; no wonder this drive is erratic. The LCC is also well over the designed max value.

You should keep your drives under 40 °C at all times. And you should replace this drive ASAP as it's pretty much toast.

You should also set up a long SMART test every one or two weeks; short tests are not that useful.

morxy49 · Sep 23, 2015

Bidule0hm said:
51 °C, wow, that's a very high temp; no wonder this drive is erratic. The LCC is also well over the designed max value.

You should keep your drives under 40 °C at all times. And you should replace this drive ASAP as it's pretty much toast.

You should also set up a long SMART test every one or two weeks; short tests are not that useful.

Yeah, my two other Greens are at 577k and 696k LCC. This isn't good at all.
All my drives run at around 46-50 °C though, even the Red ones.
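
As a rough illustration of how the numbers in this thread compare with the limits people have mentioned, here is a small sketch. The 600k load cycle figure and the 40 °C ceiling are simply the values quoted by other posters, not checked against a datasheet, and the names "green-2" and "green-3" are placeholders because the device names of the other two Greens were not posted.

```python
from typing import NamedTuple, Optional

LCC_LIMIT = 600_000  # figure quoted in the thread, not verified against a datasheet
TEMP_LIMIT_C = 40    # ceiling recommended in the thread


class Drive(NamedTuple):
    name: str
    load_cycles: int
    temp_c: Optional[int] = None  # None when no per-drive temperature was posted


def warnings(drive: Drive) -> list:
    """Return a list of human-readable warnings for one drive."""
    found = []
    if drive.load_cycles > LCC_LIMIT:
        found.append(f"load cycle count {drive.load_cycles} exceeds {LCC_LIMIT}")
    if drive.temp_c is not None and drive.temp_c > TEMP_LIMIT_C:
        found.append(f"temperature {drive.temp_c} °C exceeds {TEMP_LIMIT_C} °C")
    return found


drives = [
    Drive("ada0", 746_406, 51),  # values from the posted smartctl output
    Drive("green-2", 577_000),   # placeholder name, approximate count from the post
    Drive("green-3", 696_000),   # placeholder name, approximate count from the post
]

for drive in drives:
    for message in warnings(drive) or ["nothing flagged"]:
        print(f"{drive.name}: {message}")
```

By these thresholds ada0 is flagged on both counts, one of the other Greens is already past the quoted load cycle figure, and the third is close behind.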

Jailer · Sep 23, 2015

You need to address your cooling with temps that high, otherwise you'll be opening up your big wallet much more often than you'd like.
