blahhumbug
I recently built my first FreeNAS box using 8xWD Red 3TB (WD30EFRX) drives. As part of my initial system checks I ran memtest86 for 48 hours and then ran the following tests in parallel across all 8 drives:
smartctl -t short
smartctl -t conveyance
smartctl -t long
smartctl -a
smartctl reported no errors and all tests passed across all 8 drives. At this point I created a RAID-Z2 volume on the drives so that I could play around with configuring some jails. While I did this, I ran jgreco's solnet-array-test-v2.sh script, which performs many parallel dd reads. The script ran for ~48 hours before completing. I then halted my jails, detached the RAID-Z2 volume, and ran the following tests:
badblocks -ns #non-destructive test
smartctl -t long
smartctl -a
On 1 of the 8 drives, I got the following smartctl report:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-************
LU WWN Device Id: * ****** *********
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed May 6 19:35:29 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       145
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       42
194 Temperature_Celsius     0x0022   115   109   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%       138         901099712
# 2  Extended offline    Completed without error       00%        32         -
# 3  Conveyance offline  Completed without error       00%        24         -
# 4  Short offline       Completed without error       00%        24         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The SMART error log shows "No Errors Logged", but the self-test log lists the extended offline test as "Completed: read failure" and gives an LBA for the first error. Multi_Zone_Error_Rate now has a raw value of 1, although value/worst/thresh are unchanged and match all the passing drives. I do not know what encoding WD uses for this SMART attribute, so that number may not be a cause for concern. Does "No Errors Logged" indicate that a bad sector may have been remapped?
I was expecting to see a non-zero Current_Pending_Sector count, but since it did not change, I'm unsure how serious this is. I ran the extended offline test once more and got the same failure at the same LBA. The error log still shows "No Errors Logged" and Multi_Zone_Error_Rate remained the same as above.
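For anyone who wants to check whether repeated runs fail at the same place, here is a minimal sketch that pulls the failing LBAs out of the self-test log. It is not what I actually ran: the function names failed_lbas and distinct_failed_lbas are made up for illustration, and it assumes the self-test log layout shown in the report above, where the LBA is the last column.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of smartmontools).

# Print LBA_of_first_error for every self-test entry that ended in a
# read failure. Expects `smartctl -l selftest` or `smartctl -a` text on stdin.
failed_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}

# Print how many distinct LBAs the failed tests stopped at.
# 1 means every failed run hit the same location.
distinct_failed_lbas() {
    failed_lbas | sort -u | awk 'END { print NR }'
}

# Usage, assuming smartmontools is installed and the drive is /dev/ada3:
#   smartctl -l selftest /dev/ada3 | failed_lbas
#   smartctl -l selftest /dev/ada3 | distinct_failed_lbas
```

Only the text parsing is shown here. How to interpret a repeated failure at one LBA is still the open question of this post.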
My assumption is that this is an issue worthy of an immediate RMA, but I wanted to ping the experts here to make sure I'm not misunderstanding the SMART results. Are there any additional steps I should take or tests I should perform before doing an RMA?
I saw a few threads elsewhere that describe using fdisk and a few other utilities to determine a block location and then using dd to get the drive to remap the bad block. In those cases Current_Pending_Sector was non-zero, so I'm not sure that approach applies here.
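To make the dd idea concrete, here is a rough sketch of how the reported LBA could be turned into a read-only probe. It only prints a command and never writes to the disk. The helper name probe_cmd is hypothetical, and it assumes the drive geometry from the report above (512-byte logical sectors, 4096-byte physical sectors) and that the LBA in the self-test log is relative to the whole device rather than a partition.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a read-only dd command that covers
# the 4096-byte physical sector containing a given 512-byte logical LBA.
probe_cmd() {
    dev=$1
    lba=$2
    per_phys=8                          # 4096 / 512 logical sectors per physical sector
    start=$(( lba - lba % per_phys ))   # round down to the physical sector boundary
    echo "dd if=$dev of=/dev/null bs=512 skip=$start count=$per_phys"
}

# Usage with the values from this post:
#   probe_cmd /dev/ada3 901099712
```

For LBA 901099712 the start is unchanged, since 901099712 is a multiple of 8, so the failing LBA is the first logical sector of its physical sector. If the printed dd read returns an I/O error, that would at least confirm the sector is unreadable from the host side as well as in the self-test.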
If needed, the full system hardware specs can be found in this post:
https://forums.freenas.org/index.php?threads/first-nas-build-avoton-8-core-and-24tb-wd-red.30460/
The drive in question is /dev/ada3, attached to SATA_3 on an ASRock C2750D4I (an Intel SATA 2.0 port, not one of the Marvell ports). It is in the 4th bay from the bottom of a Silverstone DS380B case.
After a few hours of stress testing with the parallel dd read script, I did observe that the drive reached 43C. This fluctuated a little during the dd reads, but mostly remained above 40C during random spot checks over the 2 days it ran. The drive at the top of the case also reached 43C, but all others remained under 40C. While this temperature is higher than I would like, I'm not sure it's an immediate cause for concern. When idle or under normal use-case load, the drive does stay between 35C and 39C.
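The spot checks above were done by hand. A small sketch like the following could log the temperature instead. The function name drive_temp is made up, and it assumes the attribute table layout shown in the report, where Temperature_Celsius is attribute 194 and the raw value is the tenth column.

```shell
#!/bin/sh
# Hypothetical helper: print the raw Temperature_Celsius value (attribute 194)
# from `smartctl -A` or `smartctl -a` text on stdin.
drive_temp() {
    awk '$1 == 194 { print $10 }'
}

# Usage, assuming smartmontools is installed and the drive is /dev/ada3:
#   smartctl -A /dev/ada3 | drive_temp
```

Run from a loop or a cron job during a stress test, this would give a continuous record rather than a handful of random samples.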