Critical Alert - A device has experienced an unrecoverable error

rdybro · Apr 21, 2016

Hi everyone.

My FreeNAS is giving me this alert:

CRITICAL: The volume volume01 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Can anyone tell me something about this error? Which logs should I look in to find out what might have caused it?

My FreeNAS seems to run fine.

I have Googled this, but I couldn't really find any information on it.

Regards, Rasmus

hugovsky · Apr 21, 2016

You should check your drives with SMART tests. Also, please post your full system specs.

rdybro · Apr 21, 2016

hugovsky said:
You should check your drives with SMART tests. Also, please post your full system specs.

Build FreeNAS-9.10-RELEASE (2def9c8)
Platform Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
Memory 16328MB (ECC)
Disks 5x3TB disk (3x WD Red and 2x Seagate Green)

Don't know if there is anything else worth noting in regards to my system specs?
I know the Seagate disks are a bad idea, but they were some disks I already had. They are being replaced with WD Red as they die.

I just set up a short self-test to run once every hour. I had not set up any SMART tests before. Will this notify me if it finds an error, or do I have to check some log files? And is the short self-test enough, or should I run one of the other options?

Regards

danb35 · Apr 21, 2016

rdybro said:
I just set up a short self-test to run once every hour.

A short test every hour is way too frequent--once a day is plenty. You should also set up a long test every week or two. If the tests find errors, and you've configured email for the SMART service in FreeNAS, the system will email you alerts. For now, post either the output of zpool status or a screenshot of the volume status page.

hugovsky · Apr 21, 2016

Your hardware is fine. Now that we can see it. ;)

Do take danb35's advice seriously and post what was asked so we can try to help you.

rdybro · Apr 21, 2016

danb35 said:
A short test every hour is way too frequent--once a day is plenty. You should also set up a long test every week or two. If the tests find errors, and you've configured email for the SMART service in FreeNAS, the system will email you alerts. For now, post either the output of zpool status or a screenshot of the volume status page.

Thanks - yeah, I thought it would be way too often. It was only meant to stay for a couple of hours until I got a result or figured this out.

I will set up a schedule later with one short self-test daily and one long self-test every one or two weeks. I have not set up email reporting, but I will in the coming days. Thanks for the suggestions.

I've attached a screenshot of the result when running a "zpool status". Is that the right command? Is it the CKSUM value on one of the disks that is the problem?
The disk overview on the web interface is not showing any errors, or anything suspicious on any of the disks.

rdybro · Apr 21, 2016

hugovsky said:
Your hardware is fine. Now that we can see it. ;)

Do take danb35's advice seriously and post what was asked so we can try to help you.

Great - I thought so ;) I've spent a lot of time in here finding a good hardware solution :)

danb35 · Apr 21, 2016

rdybro said:
Is it the CKSUM value on one of the disks that is the problem?

Yes, it likely is. Start a long SMART self-test on that drive ('smartctl -t long /dev/whatever'). Once it finishes (likely in several hours), post the output of 'smartctl -x /dev/whatever' in code tags. This will help determine if the problem is with the drive itself or somewhere else.
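
As a rough illustration of the two steps described above (start the long test, then read the report once it finishes), here is a minimal Python sketch. It assumes `smartctl` is on the PATH and that the script runs as root. The helper names (`start_long_test`, `read_full_report`, `parse_selftest_log`, `latest_extended`) and the regular expression are made up for illustration and are not part of smartmontools. The sample rows are copied from the self-test log posted later in this thread.

```python
import re
import subprocess

# Rows copied from the self-test log that appears later in this thread.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3541         -
# 2  Short offline       Completed without error       00%      3536         -
# 7  Short offline       Completed without error       00%       225         -
"""

# Illustrative pattern for one log row; columns are separated by 2+ spaces.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$")


def parse_selftest_log(text):
    """Return the log rows as dicts, in the order smartctl prints them (newest first)."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "num": int(m.group(1)),
                "test": m.group(2),
                "status": m.group(3),
                "remaining_pct": int(m.group(4)),
                "lifetime_hours": int(m.group(5)),
                "first_error_lba": m.group(6),
            })
    return rows


def latest_extended(rows):
    """Return the newest extended (long) test row, or None if there is none."""
    for row in rows:
        if row["test"].startswith("Extended"):
            return row
    return None


def start_long_test(device):
    """Ask the drive to start a long self-test; the test runs inside the drive."""
    # smartctl's exit status is a bit mask, so it is not treated as pass/fail here.
    return subprocess.run(["smartctl", "-t", "long", device])


def read_full_report(device):
    """Return the text of `smartctl -x DEVICE` for posting or parsing."""
    done = subprocess.run(["smartctl", "-x", device], capture_output=True, text=True)
    return done.stdout
```

On a real system one would call `start_long_test("/dev/ada1")`, wait for the test to finish, and then pass `read_full_report("/dev/ada1")` to `parse_selftest_log`. Nothing in the block touches a drive until those two functions are called.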

Sakuru · Apr 21, 2016

Just to make sure you know where to find the code tags button :)

rdybro · Apr 21, 2016

danb35 said:
Yes, it likely is. Start a long SMART self-test on that drive ('smartctl -t long /dev/whatever'). Once it finishes (likely in several hours), post the output of 'smartctl -x /dev/whatever' in code tags. This will help determine if the problem is with the drive itself or somewhere else.

Thanks. I have started the self test now. I will return once there is a result

rdybro · Apr 21, 2016

Sakuru said:
Just to make sure you know where to find the code tags button :)

Thanks, I actually didn't :D

rdybro · Apr 21, 2016

danb35 said:
Yes, it likely is. Start a long SMART self-test on that drive ('smartctl -t long /dev/whatever'). Once it finishes (likely in several hours), post the output of 'smartctl -x /dev/whatever' in code tags. This will help determine if the problem is with the drive itself or somewhere else.

Here you go, this should be the disk that has caused the error. Thanks a lot for your help.

Code:

[root@fil03] ~# smartctl -x /dev/ada1
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RC3 amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1ER166
Serial Number:    W501QMAD
LU WWN Device Id: 5 000c50 08aa31f8c
Firmware Version: CC25
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Apr 22 00:59:54 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   89) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 324) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   115   099   006    -    99181608
  3 Spin_Up_Time            PO----   098   094   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    26
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   073   060   030    -    25125496
  9 Power_On_Hours          -O--CK   096   096   000    -    3544
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
12 Power_Cycle_Count       -O--CK   100   100   020    -    21
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   071   041   045    Past 29 (1 72 31 27 0)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    8
193 Load_Cycle_Count        -O--CK   100   100   000    -    192
194 Temperature_Celsius     -O---K   029   059   000    -    29 (0 17 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    3544h+36m+47.476s
241 Total_LBAs_Written      ------   100   253   000    -    5452510284
242 Total_LBAs_Read         ------   100   253   000    -    241761913334
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5176  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3541         -
# 2  Short offline       Completed without error       00%      3536         -
# 3  Short offline       Completed without error       00%      3535         -
# 4  Short offline       Completed without error       00%      3534         -
# 5  Short offline       Completed without error       00%      3533         -
# 6  Short offline       Completed without error       00%      3532         -
# 7  Short offline       Completed without error       00%       225         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    29 Celsius
Power Cycle Min/Max Temperature:     27/31 Celsius
Lifetime    Min/Max Temperature:     17/59 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

I have a bonus question, if you guys don't mind. When scheduling my SMART tests, should I schedule different disks at different times, or doesn't that matter?

depasseg · Apr 21, 2016

I do mine all at the same time.

Glorious1 · Apr 21, 2016

Just wondering if you're sure you're testing the same disk that showed checksum errors in zpool status. One of my frustrations is it seems everything shows you a different identifier for the disks. Tying them all together is certainly doable, but a pain.

You've got a huge number of raw read errors on that disk (line 66, Raw_Read_Error_Rate), and it looks like it's exceeded the max temp threshold (lines 79 and 155, Airflow_Temperature_Cel and the lifetime max temperature). But it did pass the extended test (line 130, the first entry in the self-test log), and if I read it right the errors were correctable (lines 84-85, Current_Pending_Sector and Offline_Uncorrectable). I'm not sure where that leaves you. Not an expert, but my first impulse would be to (a) make sure the drives don't overheat in the future, and (b) clear the errors as the zpool status output suggested and keep a close eye on things.
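
The reading of the report above can be sketched in code. This minimal Python example parses a few rows copied from the attribute table posted earlier, lists the attributes whose FAIL column is not `-`, and reads the raw sector counters. The function names are made up for illustration, and the choice of attributes 5, 197 and 198 as "the ones to watch" is an assumption of this sketch, not something smartctl defines.

```python
# Rows copied from the attribute table in the smartctl -x output above.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   115   099   006    -    99181608
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
190 Airflow_Temperature_Cel -O---K   071   041   045    Past 29 (1 72 31 27 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
"""


def parse_attributes(text):
    """Parse the attribute table into a dict keyed by attribute ID."""
    attrs = {}
    for line in text.splitlines():
        # Eight columns; the raw value may itself contain spaces.
        parts = line.split(None, 7)
        if len(parts) == 8 and parts[0].isdigit():
            attrs[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "fail": parts[6],
                "raw": parts[7].strip(),
            }
    return attrs


def threshold_failures(attrs):
    """Attributes whose FAIL column is set (for example "Past")."""
    return [(a["name"], a["fail"]) for _, a in sorted(attrs.items()) if a["fail"] != "-"]


def sector_counters(attrs, ids=(5, 197, 198)):
    """Raw counts for reallocated, pending and offline-uncorrectable sectors."""
    return {attrs[i]["name"]: int(attrs[i]["raw"].split()[0]) for i in ids if i in attrs}


attrs = parse_attributes(SAMPLE_ATTRS)
print(threshold_failures(attrs))
print(sector_counters(attrs))
```

For this drive the only flagged attribute is Airflow_Temperature_Cel, whose worst value (041) is below its threshold (045), and all three sector counters are 0.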

danb35 · Apr 21, 2016

Glorious1 said:
You've got a huge number of raw read errors on that disk (line 66, Raw_Read_Error_Rate),

Not really. The drive's a Seagate, and they report that field differently. I don't understand or remember the detail, but in short, on a Seagate, the higher that number is, the better.

Glorious1 said:
and it looks like it's exceeded the max temp threshold (lines 79 and 155, Airflow_Temperature_Cel and the lifetime max temperature).

Definitely--it's seen 59 C, which is way too high (back in my younger and dumber days, I had a couple of drives in a server in my attic. In the summer. In South Carolina. The highest they saw was 54 C, which is way too high as it is).

Nothing jumps out as problematic, so I'll echo @Glorious1's question: are you sure you tested the right drive? It's certainly possible you did, but we'd want to be sure.
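
The remark above that Seagate reports Raw_Read_Error_Rate differently leaves the detail open. One commonly cited community interpretation, which this thread does not confirm and which should be treated as an assumption, is that the raw value is a 48-bit field whose upper 16 bits count errors and whose lower 32 bits count operations. A minimal Python sketch under that assumption (the function name `split_seagate_raw` is hypothetical):

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    ASSUMPTION: upper 16 bits = errors, lower 32 bits = operations.
    This is a community interpretation, not stated by the thread or by smartctl.
    """
    raw &= (1 << 48) - 1          # keep only the low 48 bits
    return raw >> 32, raw & 0xFFFFFFFF


# Raw values copied from the report above.
print(split_seagate_raw(99181608))   # Raw_Read_Error_Rate
print(split_seagate_raw(25125496))   # Seek_Error_Rate
```

Both raw values in the report are smaller than 2^32, so under this assumption the error part is 0 for each, which would fit the view that the large number is not an error count.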

rdybro · Apr 22, 2016

This is from zpool status:

Code:

  pool: volume01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 3h59m with 0 errors on Sun Apr 17 03:59:54 2016
config:

NAME                                            STATE     READ WRITE CKSUM
volume01                                        ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/c6560518-99de-11e5-8e9c-d05099c00546  ONLINE       0     0     0
    gptid/a5b85a87-824b-11e5-a90d-d05099c00546  ONLINE       0     0   705
    gptid/a621326c-824b-11e5-a90d-d05099c00546  ONLINE       0     0     0
    gptid/d07c7909-9e98-11e5-92f0-d05099c00546  ONLINE       0     0     0
    gptid/a715c6ee-824b-11e5-a90d-d05099c00546  ONLINE       0     0     0

errors: No known data errors

And this is from glabel status:

Code:

                                      Name  Status  Components
gptid/a715c6ee-824b-11e5-a90d-d05099c00546     N/A  ada0p2
gptid/a5b85a87-824b-11e5-a90d-d05099c00546     N/A  ada1p2
gptid/a621326c-824b-11e5-a90d-d05099c00546     N/A  ada2p2
gptid/c6560518-99de-11e5-8e9c-d05099c00546     N/A  ada3p2
gptid/d07c7909-9e98-11e5-92f0-d05099c00546     N/A  ada4p2
gptid/ad5dd534-7a81-11e5-a03a-d05099c00546     N/A  da0p1

And I ran smartctl -t long /dev/ada1 to start the long self-test, and smartctl -x /dev/ada1 to get the results.

The temperature issue goes a long way back, to when I built the machine and had not yet got the air flowing right inside my case. It should not be a problem anymore; the temperature as I read it is now 29 degrees Celsius.
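
The check done by eye above (which device sits behind the gptid that shows checksum errors) can be sketched in Python. The two sample strings are copied from the zpool status and glabel status output in this post. The function names are made up for illustration, and the parsing assumes error counters are printed as plain integers.

```python
import re

ZPOOL_STATUS = """\
NAME                                            STATE     READ WRITE CKSUM
volume01                                        ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/c6560518-99de-11e5-8e9c-d05099c00546  ONLINE       0     0     0
    gptid/a5b85a87-824b-11e5-a90d-d05099c00546  ONLINE       0     0   705
    gptid/a621326c-824b-11e5-a90d-d05099c00546  ONLINE       0     0     0
"""

GLABEL_STATUS = """\
                                      Name  Status  Components
gptid/a5b85a87-824b-11e5-a90d-d05099c00546     N/A  ada1p2
gptid/a621326c-824b-11e5-a90d-d05099c00546     N/A  ada2p2
gptid/c6560518-99de-11e5-8e9c-d05099c00546     N/A  ada3p2
"""


def vdevs_with_errors(zpool_text):
    """Return (name, read, write, cksum) for every row with a non-zero counter."""
    hits = []
    for line in zpool_text.splitlines():
        f = line.split()
        if len(f) == 5 and all(c.isdigit() for c in f[2:]):
            read, write, cksum = map(int, f[2:])
            if read or write or cksum:
                hits.append((f[0], read, write, cksum))
    return hits


def label_map(glabel_text):
    """Map each label (e.g. gptid/...) to its component (e.g. ada1p2)."""
    labels = {}
    for line in glabel_text.splitlines():
        f = line.split()
        if len(f) == 3 and f[0] != "Name":
            labels[f[0]] = f[2]
    return labels


def devices_with_errors(zpool_text, glabel_text):
    """Join both outputs: (label, disk device, checksum errors)."""
    labels = label_map(glabel_text)
    result = []
    for name, _read, _write, cksum in vdevs_with_errors(zpool_text):
        partition = labels.get(name)
        # Strip the GPT partition suffix: ada1p2 -> ada1.
        device = re.sub(r"p\d+$", "", partition) if partition else None
        result.append((name, device, cksum))
    return result


print(devices_with_errors(ZPOOL_STATUS, GLABEL_STATUS))
```

With the output from this post, the only entry returned is the gptid ending in a5b85a87..., mapped to `ada1` with 705 checksum errors, which matches the device the SMART test was run on.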

Ericloewe · Apr 23, 2016

rdybro said:
I have a bonus question, if you guys don't mind. When scheduling my SMART tests, should I schedule different disks at different times, or doesn't that matter?

My instinct tells me you'll just multiply the time the pool spends with degraded performance by the number of drives. Since the pool is mostly limited by its slowest drive (if only a single vdev exists), you're just limiting the pool's performance for longer, about as much as you'd see from doing the SMART tests all at once.

So don't, unless you're looking for learning experiences.
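
The argument above comes down to simple arithmetic, sketched here in Python. The figures are assumptions taken from elsewhere in the thread: five disks, and 324 minutes per long test, which is the recommended polling time reported for the Seagate drive (the WD Red drives may differ). The function name is made up for illustration, and the model assumes staggered tests never overlap.

```python
def degraded_minutes(n_drives, minutes_per_test, staggered):
    """Minutes during which at least one drive in the vdev is busy testing."""
    if staggered:
        # One drive at a time: the slow periods add up.
        return n_drives * minutes_per_test
    # All drives at once: the slow periods coincide.
    return minutes_per_test


print(degraded_minutes(5, 324, staggered=False))  # 324
print(degraded_minutes(5, 324, staggered=True))   # 1620
```

Under these assumptions, staggering stretches the slow period from about 5.4 hours to 27 hours.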

rdybro · Apr 24, 2016

Ericloewe said:
My instinct tells me you'll just multiply the time the pool spends with degraded performance by the number of drives. Since the pool is mostly limited by its slowest drive (if only a single vdev exists), you're just limiting the pool's performance for longer, about as much as you'd see from doing the SMART tests all at once.

So don't, unless you're looking for learning experiences.

Oh yeah, that makes sense - thanks

rdybro · Apr 24, 2016

danb35 said:
Not really. The drive's a Seagate, and they report that field differently. I don't understand or remember the detail, but in short, on a Seagate, the higher that number is, the better.

Definitely--it's seen 59 C, which is way too high (back in my younger and dumber days, I had a couple of drives in a server in my attic. In the summer. In South Carolina. The highest they saw was 54 C, which is way too high as it is).

Nothing jumps out as problematic, so I'll echo @Glorious1's question: are you sure you tested the right drive? It's certainly possible you did, but we'd want to be sure.

Sorry, I don't mean to be impatient. I just think I didn't press "Reply" when I wrote my last post, so I wanted to make sure that you got a notification about it.

It is this one.
