SmallGuy
Guru
- Joined
- Jun 7, 2013
- Messages
- 560
Hello Guys,
Set-Up:
ASRock E3C226D2I; OnBoard NIC INTEL I210AT ; Intel Core i3-4330 CPU @ 3.50GHz ; 2x Kingston KVR16E11/8 ; 6x WD RED WD20EFRX - ZFS RAIDZ 2 ; FreeNAS-9.10-STABLE
I am having trouble with ada1 in my 6-disk RAIDZ2 pool.
This is the second time it has happened in less than a month.
Let me explain the circumstances in which the failure appeared.
I recently upgraded from 9.2.1.9 to 9.10 with a clean installation:
-Upgraded the BIOS to the latest version
-Upgraded the BMC to the latest version
-Bought a USB adapter (1->2)
-Bought 2 SanDisk Cruzer 16 GB sticks
-Installed all of those inside the case
-Did a clean install following the manual
-The pool was imported automagically
-Reloaded the configuration file
-Done
What is interesting here is that I opened the case (and that the upgrade itself apparently went smoothly from the software point of view)...
Once the installation was complete, right after the post-installation auto-reboot, FreeNAS detected ada1 as failing (ATA Error Count), and the system automatically set the disk offline.
So I suspected a bad connection, since I had put my BIG fingers inside the case to install the USB devices and had probably touched the wiring:
-Shut down the system, checked the connections and rebooted... The disk was attached back to the pool automagically.
-Ran a long SMART test successfully: the errors recorded at the first failure are still there, but everything else looks good, and the pool status is reported as Healthy.
So I decided to keep using the drive as is, and to keep an eye on it.
On Tuesday morning my scheduled scrub was launched, and this Wednesday morning I received a critical alert e-mail:
Code:
The volume Volume1 (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Device: /dev/ada1, ATA error count increased from 10 to 20
Thanks to the @Bidule0hm script, I got the following report:
Code:
########## SMART status report for ada1 drive (Western Digital Red: WD-WMC3012xxxxx) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p3 amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   172   021    Pre-fail  Always       -       4291
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       142
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27188
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       130
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       86
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       55
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

ATA Error Count: 20 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 20 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.312  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.308  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA

Error 19 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.308  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA

Error 18 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.304  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA

Error 17 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.300  READ DMA
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA

Error 16 occurred at disk power-on lifetime: 27187 hours (1132 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 80 fc 3f 40  Device Fault; Error: ABRT at LBA = 0x003ffc80 = 4193408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 fc 3f 40 08  24d+13:38:01.295  READ DMA

Test_Description    Status           Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Aborted by host  90%        26604            -
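
One thing that stands out in the report is how close together the logged errors are. As a rough check, here is a minimal Python sketch that converts the Powered_Up_Time stamps of the five READ DMA commands into milliseconds, assuming the DDd+hh:mm:SS.sss format described in the report header. The helper name `to_ms` is only for illustration and is not part of smartctl.

```python
import re

# Matches the smartctl Powered_Up_Time format, e.g. "24d+13:38:01.312".
STAMP = re.compile(r"(\d+)d\+(\d+):(\d+):(\d+)\.(\d{3})")


def to_ms(stamp: str) -> int:
    """Convert a DDd+hh:mm:SS.sss stamp to integer milliseconds since power-on."""
    match = STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    days, hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * 1000 + millis


# The five READ DMA commands from the report, oldest first.
stamps = [
    "24d+13:38:01.295",
    "24d+13:38:01.300",
    "24d+13:38:01.304",
    "24d+13:38:01.308",
    "24d+13:38:01.312",
]

times = [to_ms(s) for s in stamps]
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

print("gaps between commands (ms):", gaps)
print("total span (ms):", times[-1] - times[0])
```

All five commands target the same LBA and fall within a few milliseconds of each other, so the five log entries look like one burst of retries on a single read rather than five separate incidents.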
Surprisingly, there are some errors that I associate with communication, but no UDMA_CRC error.
No Raw_Read_Error_Rate either, and none of the errors one traditionally sees.
I suspect that when I reboot the system this evening, the disk will be re-attached automatically as it was the first time, and that the new long SMART test I will launch will come back PASSED.
I will post the full result of the long SMART test ASAP. (I stopped the last extended test myself because I had mistakenly re-run the long test from the command history, and the script reports only the most recent one...)
I have already ordered a new drive, as the warranty period is over, but I would like some advice/clarification.
To narrow down the troubleshooting, can somebody tell me with confidence what the "ABRT at LBA = 0x003ffc80 = 4193408" error and the "READ DMA" command mean, and whether this kind of error is generally related to the drive electronics, the connections (cable or bad contact), the power supply, or the disk controller?
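
To make the register dump easier to read, here is a minimal Python sketch of how the values appear to decode, assuming the classic 28-bit ATA task-file layout (SN, CL and CH hold LBA bits 0-23, and the low nibble of DH holds bits 24-27). The function name `decode_error` and the bit tables are my own illustration, not part of smartctl, and only the bits I am reasonably sure of are listed. The sketch only unpacks the bits; it says nothing about whether the drive, the cable, the power supply or the controller is at fault.

```python
# Commonly documented ATA Error register bits (assumption: classic ATA layout).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Commonly documented ATA Status register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

# Only the opcode that appears in the report.
COMMANDS = {0xC8: "READ DMA"}


def flags(value: int, table: dict) -> list:
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


def decode_error(er, st, sn, cl, ch, dh):
    """Decode one 'registers were' line from the smartctl error log."""
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error_flags": flags(er, ERROR_BITS),
        "status_flags": flags(st, STATUS_BITS),
        "lba": lba,
    }


# Register values taken from the report: ER ST SC SN CL CH DH = 04 61 00 80 fc 3f 40
decoded = decode_error(er=0x04, st=0x61, sn=0x80, cl=0xFC, ch=0x3F, dh=0x40)

print(decoded["error_flags"])          # bits set in ER
print(decoded["status_flags"])         # bits set in ST
print(hex(decoded["lba"]), decoded["lba"])
print(COMMANDS.get(0xC8, "unknown"))   # CR value of the logged commands
```

With the values from the report, the ER byte has only the ABRT bit set, the ST byte has DRDY, DF and ERR set (the DF bit is what smartctl prints as "Device Fault"), and the LBA comes out as 0x003ffc80 = 4193408, matching the text in the log.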