SOLVED New WD Gold 12TBs in Lenovo SA120: ATA error count increased

eroji · Feb 12, 2019

I have twelve 12TB WD Gold drives in a Lenovo SA120 DAS enclosure. As soon as I created the volume and mounted it, I started seeing these errors on all the disks, and I can't figure out what's causing them. Any ideas?

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD121KRYZ-01W0RB0
Serial Number:    8DJYV3JY
LU WWN Device Id: 5 000cca 253e9bfcc
Firmware Version: 01.01H01
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb 12 17:17:43 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1256) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       618
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 20/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   199   199   000    Old_age   Always       -       91

SMART Error Log Version: 1
ATA Error Count: 91 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 91 occurred at disk power-on lifetime: 618 hours (25 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 90 20 69 40 00  25d+18:16:49.740  READ FPDMA QUEUED
  60 18 00 58 20 69 40 00  25d+18:16:49.740  READ FPDMA QUEUED
  60 18 00 40 20 69 40 00  25d+18:16:49.740  READ FPDMA QUEUED
  60 18 00 20 20 69 40 00  25d+18:16:49.739  READ FPDMA QUEUED
  60 50 00 d0 1f 69 40 00  25d+18:16:48.884  READ FPDMA QUEUED

Error 90 occurred at disk power-on lifetime: 618 hours (25 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 60 8a 1b 40 00  25d+18:15:10.085  READ FPDMA QUEUED
  60 18 00 28 8a 1b 40 00  25d+18:15:10.083  READ FPDMA QUEUED
  60 18 00 b8 89 1b 40 00  25d+18:15:10.082  READ FPDMA QUEUED
  60 18 00 80 89 1b 40 00  25d+18:15:10.081  READ FPDMA QUEUED
  60 18 00 48 89 1b 40 00  25d+18:15:10.080  READ FPDMA QUEUED

Error 89 occurred at disk power-on lifetime: 618 hours (25 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 08 84 1b 40 00  25d+18:15:09.293  READ FPDMA QUEUED
  60 18 00 d0 83 1b 40 00  25d+18:15:09.292  READ FPDMA QUEUED
  60 18 00 28 83 1b 40 00  25d+18:15:09.292  READ FPDMA QUEUED
  60 18 00 98 83 1b 40 00  25d+18:15:09.292  READ FPDMA QUEUED
  60 18 00 f0 82 1b 40 00  25d+18:15:09.077  READ FPDMA QUEUED

Error 88 occurred at disk power-on lifetime: 618 hours (25 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 00 a8 7d 1b 40 00  25d+18:15:08.464  READ FPDMA QUEUED
  60 20 00 70 7d 1b 40 00  25d+18:15:08.307  READ FPDMA QUEUED
  60 48 18 60 7d 1b 40 00  25d+18:15:08.306  READ FPDMA QUEUED
  60 00 10 60 7c 1b 40 00  25d+18:15:08.306  READ FPDMA QUEUED
  60 00 08 60 7b 1b 40 00  25d+18:15:08.306  READ FPDMA QUEUED

Error 87 occurred at disk power-on lifetime: 618 hours (25 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 18 58 6d a7 40 00  25d+18:14:41.512  READ FPDMA QUEUED
  60 a0 00 f0 6f a7 40 00  25d+18:14:41.510  READ FPDMA QUEUED
  60 18 30 d0 6f a7 40 00  25d+18:14:41.490  READ FPDMA QUEUED
  60 78 28 58 6f a7 40 00  25d+18:14:41.490  READ FPDMA QUEUED
  60 00 20 58 6e a7 40 00  25d+18:14:41.490  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       604         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
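
For readers decoding the log above: every entry shows ER = 84, which smartctl labels "ICRC, ABRT". The following is a minimal sketch, not part of the original post, of how that register value breaks down into flags. The function name and the mapping table are illustrative assumptions; only the bits I am confident about are named, and any other set bit is reported generically.

```python
# Illustrative helper (hypothetical name): split an ATA Error register
# value into the flag names that smartctl prints for read commands.
KNOWN_ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error on the link
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}

def decode_error_register(value):
    """Return flag names for each set bit, highest bit first."""
    names = []
    for bit in range(7, -1, -1):
        mask = 1 << bit
        if value & mask:
            # Bits not in the table are reported as "bitN" rather than guessed.
            names.append(KNOWN_ERROR_BITS.get(mask, "bit%d" % bit))
    return names

print(decode_error_register(0x84))  # ['ICRC', 'ABRT']
```

ICRC with ABRT on a queued read means the transfer was rejected because of a link-level CRC failure, which is consistent with the 91 counted under UDMA_CRC_Error_Count.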

Chris Moore · Feb 12, 2019

Need all hardware details. For example look here:

Updated Forum Rules 12/5/18
https://forums.freenas.org/index.php?threads/updated-forum-rules-12-5-18.45124/

tom__w · Feb 13, 2019

BTW: Not sure it is relevant but it is interesting that I have had 2 out of 3 new Gold drives fail in the past month. All started showing SMART errors.

sretalla · Feb 13, 2019

It's worth noting that the Gold drives are 7200 RPM, so will run hot... is that heat properly accounted for in the chassis to avoid topping 35 Celsius?

Red and Purple drives spin at 5400 RPM and run cooler.

Johnnie Black · Feb 13, 2019

eroji said:
199 UDMA_CRC_Error_Count 0x000a 199 199 000 Old_age Always - 91

CRC errors are a connection problem. If they show up on all disks, it's not a slot problem. You can try new SAS cable(s), or bypass the enclosure just to test. It could also be some compatibility issue between those disks and the enclosure.
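
To check whether the CRC count is rising on every disk or only some, one option is to pull attribute 199 out of each disk's `smartctl -A` output. The following is a rough sketch, assuming the attribute table layout shown in the first post. The function names and the `da0`/`da1` disk labels are hypothetical, and collecting the text from each disk is left out.

```python
def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from smartctl attribute text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; 199 is UDMA_CRC_Error_Count
        # and the raw value is the tenth column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None

def shared_path_suspected(counts):
    """True when every disk reports CRC errors (points at cable, HBA or enclosure)."""
    return bool(counts) and all((c or 0) > 0 for c in counts.values())

sample = """\
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   199   199   000    Old_age   Always       -       91
"""

# Hypothetical disk names; in practice there would be one entry per drive.
counts = {"da0": crc_error_count(sample), "da1": crc_error_count(sample)}
print(counts)                         # {'da0': 91, 'da1': 91}
print(shared_path_suspected(counts))  # True
```

Running this before and after a cable swap makes it easy to see whether the raw count has stopped increasing.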

tom__w · Feb 13, 2019

sretalla said:
It's worth noting that the Gold drives are 7200 RPM, so will run hot... is that heat properly accounted for in the chassis to avoid topping 35 Celsius?

Red and Purple drives spin at 5400 RPM and run cooler.

My server rack is sitting at 55 F. I don't think heat is the issue here.

Chris Moore · Feb 13, 2019

sretalla said:
It's worth noting that the Gold drives are 7200 RPM, so will run hot... is that heat properly accounted for in the chassis to avoid topping 35 Celsius?

While I agree that cooler drives tend to last longer, I don't think this is related to temperature. The WD Gold drives are rated to 60°C, and I have a system full of the 6TB drives that had to keep running for almost a month while one of the cooling units was out in my server room at work. Temperatures were close to the limit for the whole month and the drives never slowed down or missed a tick. I agree with @Johnnie Black that it is probably a cabling issue or something to do with the controller, which is why I wanted more hardware details.

eroji · Feb 13, 2019

Chris Moore said:
Need all hardware details. For example look here:

Updated Forum Rules 12/5/18
https://forums.freenas.org/index.php?threads/updated-forum-rules-12-5-18.45124/

Here are the specs:
Supermicro X9DRi-F
2x Intel E5-2690
256GB of DDR3-1333
FreeNAS-11.1-U6
2x LSI 9211-8i (For internal disks)
1x LSI 9206-16e (to SA120)
Supermicro SC836 16 bay JBOD backplane
16x WD Purple 8TB (internal)
12x WD Gold 12TB (SA120)

Chris Moore · Feb 13, 2019

I got this from the Lenovo site, so it may not be the same as what you have. Do you have just one cable from the server to the external disk shelf?

I would try re-seating all the connectors on the SAS cabling.

eroji · Feb 13, 2019

So the SA120 comes with 1 IOM out of the box. I am only using one module with 1 cable from the external HBA. I rebooted the server and it got stuck with a "B2" code. Prior to the reboot I looked at the errors on the individual disks, and 3 of the 12 disks had errors while the rest did not have any. So I am a bit confused: were those disks the only ones that were actually bad?

I called Supermicro and they stated that it has to do with expansion cards. I then had someone reseat all the HBAs and reconnect the cable. The cable did not seat correctly on the first try, and the server booted up without seeing the disks in the SA120. I had him redo the cable and the disks came back. I detached the volume and reattached it, which took a very long time. It did finally finish successfully, but I am seeing this output in the KVM.

I'm not sure right now if it's a cable issue or what, but I have ordered a replacement, which should arrive by Friday. However, the pool seems to be resilvering on its own for some reason.

eroji · Feb 13, 2019

Pool status shows that 3 disks are being resilvered.

Code:

  pool: tank0
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 13 12:59:39 2019
        544G scanned at 1.23G/s, 1.66G issued at 3.85M/s, 90.3T total
        139M resilvered, 0.00% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank0                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/15f4b384-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/17490f91-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0  (resilvering)
            gptid/18ab0ec7-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/19fe4063-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0  (resilvering)
            gptid/1b55881b-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/1cc7edd0-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0  (resilvering)
            gptid/1e277c79-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/1f936f44-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/20fcbe2e-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/22608b35-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/23d100d0-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/25539295-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0

errors: No known data errors
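
As a small aid for tracking which members are affected, here is a sketch that lists the devices marked `(resilvering)` in `zpool status` text like the output above. The function name is hypothetical, and it assumes the marker appears as the last field on the device line, as it does here.

```python
def resilvering_devices(zpool_status):
    """Return device names whose status line ends with the (resilvering) marker."""
    devices = []
    for line in zpool_status.splitlines():
        fields = line.split()
        if fields and fields[-1] == "(resilvering)":
            devices.append(fields[0])
    return devices

sample = """\
            gptid/15f4b384-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0
            gptid/17490f91-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0  (resilvering)
            gptid/19fe4063-2a7b-11e9-8bf3-0025906bdbce  ONLINE       0     0     0  (resilvering)
"""

for name in resilvering_devices(sample):
    print(name)
```

Comparing this list against the disks that logged CRC errors would show whether the same three drives are involved in both.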

Chris Moore · Feb 13, 2019

eroji said:
Pool status shows that 3 disks are being resilvered.

That is exciting, isn't it?

eroji · Feb 14, 2019

The resilvering percentage keeps resetting back to 0. I'm guessing the errors I'm seeing in the KVM are related. I am leaning towards a cable or HBA communication issue of some sort. I went ahead and shut the NAS down and disconnected the cable (replacement cable arriving Friday). In the meantime I have updated all HBA firmware to the latest and I will upgrade FreeNAS to 11.2.

Chris Moore · Feb 16, 2019

eroji said:
replacement cable arriving Friday

Please keep us updated on developments.

eroji · Feb 16, 2019

So far so good with the replacement cable, though I'm not sure if the issue was caused by firmware or cable in the first place.

Chris Moore · Feb 16, 2019

eroji said:
Though I'm not sure if the issue was caused by firmware or cable in the first place

What firmware did you have before? What did you upgrade to?

eroji · Feb 16, 2019

20.0.4.0 to 20.0.7.0 I think.

Chris Moore · Feb 16, 2019

eroji said:
20.0.4.0

It could be, and I have no evidence of this, that the older firmware was intolerant of the 12TB drives. If you are willing to risk some testing, you might switch the cables back and see if the errors return. Please don't do it if you are not willing to take a chance.

eroji · Feb 20, 2019

I think it's safe to say this is resolved. No issues since the cable change and firmware upgrade. I'm leaning towards a firmware issue as the root cause.
