High # of Errors Corrected by ECC with new SAS HDD

mgb · Oct 7, 2015

I'm burn-in testing a new server, and 1 SAS HDD (out of 10) is reporting a much higher number of "Errors Corrected by ECC" than the others.

The system:

Chassis: Supermicro CSE-826BE26-R920LPB (Dual Expanders Backplane)
Motherboard: Supermicro X10DRi-T
HBA: 2x LSI 9207-8i (only 1 is currently connected)
HDDs: 10x HGST 2TB 7K4000 SAS2

The procedure:

After running memtest86 and memtest86+ on the server, I followed the "[How To] Hard Drive Burn-In Testing" guide (thanks @qwertymodo). Note the ~8 TB of data processed in the output below, which is the result of badblocks' default 4 passes over the 2 TB drive.
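As a quick sanity check on that ~8 TB figure, here is a minimal Python sketch. It assumes badblocks was run in destructive write mode (`-w`), whose four default test patterns each write and then read back the whole drive, and it uses the "User Capacity" value that smartctl reports for this drive. The function name is just for illustration.

```python
# Capacity as reported by smartctl ("User Capacity") for /dev/sdh.
CAPACITY_BYTES = 2_000_398_934_016

# Assumption: badblocks -w was used, which by default writes and
# verifies four patterns (0xaa, 0x55, 0xff, 0x00), i.e. 4 full passes.
PASSES = 4


def expected_gigabytes(capacity_bytes, passes):
    """Return the data volume in units of 10^9 bytes, as smartctl reports it."""
    return capacity_bytes * passes / 1e9


expected = expected_gigabytes(CAPACITY_BYTES, PASSES)
print(f"{expected:.3f}")  # 8001.596
```

That matches the "Gigabytes processed" value in the write row of the error counter log, so the drive does appear to have been written end to end four times.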

The suspicious HDD:

Code:

root@sysresccd /root % smartctl -q noserial -a /dev/sdh
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-3.14.50-std460-amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A280
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LB provisioning type: unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct  7 14:32:51 2015 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        85 C

Manufactured in week 14 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  4
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  6
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2048408550375424

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       5304      354         0      5658       8029       8001.691           0
write:         0        0         0         0          6       8001.596           0
verify:     1535      176         0      1711      26283          0.064           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      46                 - [-   -    -]
# 2  Background long   Completed                   -       6                 - [-   -    -]
# 3  Background short  Completed                   -       0                 - [-   -    -]

Long (extended) Self Test duration: 22650 seconds [377.5 minutes]

Of the other 9 HDDs, 2 had 0 errors, 2 had 1 error, 3 had fewer than 10 errors, and 2 had fewer than 30 errors. This one shows over 5000 corrected read errors, which is orders of magnitude higher than any of the others.
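To compare drives on a common footing, the raw counts can be normalized by the amount of data processed. The following is a rough Python sketch, not a tested tool: it parses the three error counter rows pasted from the smartctl output above and computes corrected errors per TB. The field names and helper functions are my own, chosen for illustration.

```python
import re

# Error counter log rows copied from the smartctl output above.
LOG = """\
read:       5304      354         0      5658       8029       8001.691           0
write:         0        0         0         0          6       8001.596           0
verify:     1535      176         0      1711      26283          0.064           0
"""

# Hypothetical field names, in smartctl's column order.
FIELDS = (
    "ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
    "invocations", "gigabytes", "uncorrected",
)


def parse_error_counter_log(text):
    """Map each operation (read/write/verify) to a dict of its counters."""
    rows = {}
    for line in text.splitlines():
        match = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # skip rows that do not have the expected columns
        rows[match.group(1)] = {
            name: float(value) if name == "gigabytes" else int(value)
            for name, value in zip(FIELDS, values)
        }
    return rows


def corrected_per_tb(row):
    """Corrected errors per 10^12 bytes processed, or None if nothing was processed."""
    if row["gigabytes"] == 0:
        return None
    return row["total_corrected"] / (row["gigabytes"] / 1000)


rows = parse_error_counter_log(LOG)
for operation, row in rows.items():
    print(operation, row["total_corrected"], corrected_per_tb(row))
```

For this drive the read row works out to roughly 707 corrected errors per TB read. Running the same calculation on the other 9 drives, which all went through the same 4 passes, would give a like-for-like comparison.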

This is definitely something I should be worried about, right?

Any suggestions on how to more thoroughly test this drive?

I'm thinking I should RMA the drive.

Thanks in advance!

--mgb

qwertymodo · Oct 9, 2015

What's the output of smartctl -A /dev/sdh?
