HGST SAS drive burn-in

TravisT

Patron
Joined
May 29, 2011
Messages
297
I just picked up 5 HGST 8TB HUH drives for a new pool, and was trying to perform the typical SMART tests for good measure, as the drives were manufactured a few years back.

I have only installed one of the 5 disks. The SMART short test ran with no problems, but the SMART long test fails immediately with the following error:

Code:
Long (extended) offline self test failed [unsupported field in scsi command]


I've read about the differences between SAS and SATA drives, but I haven't found anything that says the long test shouldn't run. Is this normal? Current SMART results:

Code:
smartctl -a /dev/da11
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUH72808CLAR8000
Revision:             M7K0
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2612961c4
Serial number:        VJGRSJ0X
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Jun 23 21:45:54 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 12 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  2
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  898
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2154618477871104

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0    2096739      31582.459           0
write:         0        0         0         0     615147      79855.226           0
verify:        0        0         0         0       3298          0.547           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       1                 - [-   -    -]
# 2  Background short  Completed                   -       1                 - [-   -    -]
# 3  Background short  Completed                   -       0                 - [-   -    -]
# 4  Background short  Completed                   -   21427                 - [-   -    -]
# 5  Background short  Completed                   -   19987                 - [-   -    -]
# 6  Background short  Completed                   -   18547                 - [-   -    -]
# 7  Background short  Completed                   -   17107                 - [-   -    -]
# 8  Background short  Completed                   -   15667                 - [-   -    -]
# 9  Background short  Completed                   -   14227                 - [-   -    -]
#10  Background short  Completed                   -   12787                 - [-   -    -]
#11  Background short  Completed                   -   11347                 - [-   -    -]
#12  Background short  Completed                   -    9887                 - [-   -    -]
#13  Background short  Completed                   -    8447                 - [-   -    -]
#14  Background short  Completed                   -    7007                 - [-   -    -]
#15  Background short  Completed                   -    5567                 - [-   -    -]
#16  Background short  Completed                   -    4105                 - [-   -    -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]


 

Fredda

Guru
Joined
Jul 9, 2019
Messages
608
This is not a general SAS issue; it works fine here. See this snippet from the SMART output of a SAS drive:
Code:
Vendor (Seagate) cache information

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0      1196      1196          0      34411.529           0
write:         0        0         0         0          0       3809.233           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   41411                 - [-   -    -]
# 2  Background long   Completed                   -   41243                 - [-   -    -]
# 3  Background long   Completed                   -   41075                 - [-   -    -]
# 4  Background long   Completed                   -   40907                 - [-   -    -]
# 5  Background long   Completed                   -   40739                 - [-   -    -]

Did you start the SMART test from the command line, or via the FreeNAS SMART tasks?
 

TravisT

I started it via the CLI using:

smartctl -t long /dev/da11
 

TravisT

I also installed a second drive this morning and ran the same process below:

smartctl -a /dev/da11
smartctl -t short /dev/da11
smartctl -t conveyance /dev/da11
smartctl -t long /dev/da11

No errors until trying to run the long SMART test. Received the same error on the second disk.

Code:
smartctl -t long /dev/da11
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Long (extended) offline self test failed [unsupported field in scsi command]
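
For reference, the two ways of starting a long test that come up in this thread differ only by the -C flag, which the smartctl man page describes as captive mode on ATA and foreground mode on SCSI. The sketch below assumes a POSIX shell; the function name start_selftest and the SMARTCTL override variable are made up for illustration and are not part of smartmontools. Going by the output later in this thread, a drive running a foreground test may answer NOT READY to other commands until the test finishes.

```sh
# Hypothetical helper: start a SMART self-test in background or foreground mode.
# SMARTCTL is an illustrative override so the command can be dry-run with "echo".
start_selftest() {
    dev=$1
    kind=$2                    # short | long
    mode=${3:-background}      # background (default) | foreground
    case $mode in
        background) "${SMARTCTL:-smartctl}" -t "$kind" "$dev" ;;
        foreground) "${SMARTCTL:-smartctl}" -t "$kind" -C "$dev" ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}
```

Running `SMARTCTL=echo start_selftest /dev/da11 long foreground` only prints the arguments that would be passed, without touching the drive.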
 

TravisT

Interestingly (to me at least), it appears that some SMART test is running on the suspect disk. I tried numerous things (including the -C option), which started throwing an insane number of errors on my console. The process wouldn't die, so I ended up just rebooting the server. That calmed the console messages down, but it also seems to have locked out my disk. An attempted smartctl -a gives me this:

Code:
smartctl -a /dev/da9
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUH72808CLAR8000
Revision:             M7K0
Compliance:           SPC-4
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2612961c4
Serial number:        VJGRSJ0X
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Jun 24 10:21:18 2020 EDT
device is NOT READY (e.g. spun down, busy)
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


Code:
smartctl -t short /dev/da9
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Can't start self-test without aborting current test (38% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.


The test was reporting 98% last night and seemed "stuck". This morning it was down to 53%, so apparently it is running. Is there any way to find out what test is running?
 

Fredda

Is there any way to find out what test is running?
A running SMART test should be reported in the smartctl -a /dev/xx output:
Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   71930671       46         0  71930717   71930717       1841.158           0
write:         0        0         1         1          1   2133947303.609           0

Non-medium error count:   479239

Self-test execution status:        99% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -   18064                 - [-   -    -]
....
#19  Background long   Completed                   -   17657                 - [-   -    -]
#20  Background short  Completed                   -   17633                 - [-   -    -]

Long (extended) Self-test duration: 1367 seconds [22.8 minutes]

I have no idea what is going wrong here. Can you put that drive into another computer, or put a different type of SAS drive in your server?
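
The two lines to look for in that output can be pulled out with a small filter. This is a minimal sketch, assuming the output layout shown above and a POSIX awk; the function name running_selftest is hypothetical. It only helps when the drive answers the query at all, which was not the case while the drive reported NOT READY.

```sh
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# self-test that is currently running, e.g. "Background long: 99% of test remaining".
# Prints nothing and returns 1 when no test is in progress.
running_selftest() {
    awk '
        /^Self-test execution status:/ {
            sub(/^Self-test execution status:[ \t]*/, "")
            status = $0
        }
        /^# *[0-9]+ / && /Self test in progress/ {
            sub(/^# *[0-9]+ +/, "")      # drop the entry number
            kind = $1 " " $2             # e.g. "Background long"
        }
        END {
            if (kind == "") exit 1
            print kind ": " (status == "" ? "status unknown" : status)
        }
    '
}
```

Typical use would be `smartctl -a /dev/da9 | running_selftest`.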
 

TravisT

Unfortunately, the only SAS drives I have on hand are the 5 I ordered. I did install the rest of them this morning and all are giving the exact same outputs, making me think there are no problems with the disks themselves. Now it's just a matter of figuring out what is going on with this one disk. Is it some sort of offline test that could be running?

Ah, one more thing of note - I did start running badblocks yesterday in tmux, however I would assume that a restart would interrupt that process... maybe I'm incorrect in that assumption.
 

TravisT

Looks like what was running was a foreground long smart test:

Code:
smartctl -a /dev/da9
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUH72808CLAR8000
Revision:             M7K0
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2612961c4
Serial number:        VJGRSJ0X
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Jun 24 19:13:04 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        60 C

Manufactured in week 12 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  2
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  898
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2161775319449600

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0    2096745      31582.459           0
write:         0        0         0         0     615231      79911.139           0
verify:        0        0         0         0       3374          0.547           0

Non-medium error count:        0

Self-test execution status:        86% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Foreground long   Completed                   -      21                 - [-   -    -]
# 3  Background short  Completed                   -       1                 - [-   -    -]
# 4  Background short  Completed                   -       1                 - [-   -    -]
# 5  Background short  Completed                   -       0                 - [-   -    -]
# 6  Background short  Completed                   -   21427                 - [-   -    -]
# 7  Background short  Completed                   -   19987                 - [-   -    -]
# 8  Background short  Completed                   -   18547                 - [-   -    -]
# 9  Background short  Completed                   -   17107                 - [-   -    -]
#10  Background short  Completed                   -   15667                 - [-   -    -]
#11  Background short  Completed                   -   14227                 - [-   -    -]
#12  Background short  Completed                   -   12787                 - [-   -    -]
#13  Background short  Completed                   -   11347                 - [-   -    -]
#14  Background short  Completed                   -    9887                 - [-   -    -]
#15  Background short  Completed                   -    8447                 - [-   -    -]
#16  Background short  Completed                   -    7007                 - [-   -    -]
#17  Background short  Completed                   -    5567                 - [-   -    -]
#18  Background short  Completed                   -    4105                 - [-   -    -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]


That is one LONG test, at almost 24 hours. Now that it's finished, it all seems back to normal (or at least the same as the other SAS drives).
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
# 5  Background short  Completed                   -       0                 - [-   -    -]
# 6  Background short  Completed                   -   21427                 - [-   -    -]
Nothing fishy to see here, citizen. Please move along.
Were these sold to you as new drives, or as used drives?
 

TravisT

The only fishy thing here was that the foreground long test took a LONG time. I didn't think it typically took ~24 hours for that test to run.

As for whether the drives are new: I had some concern that they weren't when I ordered them, but the seller did not state new/refurbished/used. Since there are previous SMART test completions in the log, I assume the drive has been used. Is that correct?
 

subhuman

It looks that way to me, and that there's at least 21,427 power-on-hours more than the drive is reporting now.
I have no problem with people buying used drives, as long as the seller isn't trying to pass them off as new. Whomever you bought these from I am disinclined to trust. If they zeroed out the POH, what else got overwritten? Do these drives have a history of a lot of errors, but that also got zeroed out?
Were these drives replaced due to age, which with 2.5 years of operating time is very possible, or is there a more serious underlying problem?
It just raises questions we don't have the answers to.
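
The same check can be run mechanically across all five drives. A minimal sketch, assuming the self-test log layout shown in this thread (newest entry first, LifeTime hours in the fifth column from the right) and a POSIX awk; the function name poh_rollbacks is hypothetical.

```sh
# Hypothetical check: read `smartctl -a` output on stdin and report places where
# a newer self-test log entry shows fewer lifetime hours than the older entry
# right below it. The log lists the newest test first.
poh_rollbacks() {
    awk '
        /^# *[0-9]+ / {
            sub(/^# */, "")                  # strip the leading "#" so $1 is the entry number
            num = $1
            hours = $(NF - 4)                # LifeTime (hours) column
            if (hours !~ /^[0-9]+$/) next    # skip "NOW" for a running test
            if (seen && prev_hours < hours + 0)
                printf "entry %d (%d h) is newer than entry %d (%d h)\n", prev_num, prev_hours, num, hours
            prev_num = num; prev_hours = hours + 0; seen = 1
        }
    '
}
```

Fed the first log posted in this thread, `smartctl -a /dev/da11 | poh_rollbacks` would print a single line flagging entry 3 (0 h) against entry 4 (21427 h).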

Long SMART tests can be very long. For an 8TB drive, 24 hours doesn't seem to be unreasonable. I have some 3TB drives in the 7 hour range, and in this thread we see 6TB HGST drives giving estimates of 13-16 hours: https://www.ixsystems.com/community...rt-extended-test-duration-any-theories.59328/
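
As a rough sanity check on those durations, the average read rate they imply can be worked out directly. This is a back-of-the-envelope sketch that assumes the long test reads the whole surface once at a steady rate, which is a simplification; the function name implied_mb_per_s is hypothetical, and 3 TB is taken as exactly 3 x 10^12 bytes.

```sh
# Hypothetical helper: average read rate (MB/s, 10^6 bytes) implied by a full
# surface scan of CAPACITY_BYTES that takes HOURS hours.
# Usage: implied_mb_per_s CAPACITY_BYTES HOURS
implied_mb_per_s() {
    awk -v bytes="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", bytes / (hours * 3600) / 1e6 }'
}
```

`implied_mb_per_s 8001563222016 24` prints 92.6 and `implied_mb_per_s 3000000000000 7` prints 119.0, so the two durations imply average rates of the same order.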
 

TravisT

All very good points. While I was ok with purchasing used drives, there was definitely some misleading info if these were used drives, as the seller did not mention it anywhere. I have reached out to them and am waiting for a response.

As for the long test - is there any good reason I can't run a background long test? The only tests I've been able to successfully run are the background short and the foreground long tests. While the conveyance test doesn't report errors when running it, it also doesn't seem to do anything (doesn't show in the completed tests section).

Edit: As for the POH, I'm not seeing that reported anywhere other than the "Lifetime Hours" column, but I'm assuming that only reports time since the particular test was run. If I understand correctly, you could have the drive in service for a year without running a SMART test and it would only start counting from when the first test is run. I know (now) that SATA drives report much more information than these SAS drives do.
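
For what it's worth, as far as I know the LifeTime column records the drive's power-on hours at the moment each test ran, and SAS drives that support the background scan results log report their current power-on time there; smartctl shows that log with -l background or as part of -x. The sketch below assumes the line is worded "Accumulated power on time, hours:minutes H:MM [...]" as in typical smartctl output for SAS drives; not every drive supports the log page, and the function name sas_power_on_hours is hypothetical.

```sh
# Hypothetical helper: read `smartctl -x` (or `smartctl -l background`) output on
# stdin and print the power-on hours from the background scan results log.
# Returns 1 if the drive did not report that line.
sas_power_on_hours() {
    awk '
        /Accumulated power on time, hours:minutes/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+:[0-9][0-9]$/) {
                    split($i, hm, ":")       # hm[1] = hours, hm[2] = minutes
                    print hm[1]
                    found = 1
                    exit                     # jumps to END
                }
        }
        END { exit !found }
    '
}
```

Typical use would be `smartctl -x /dev/da11 | sas_power_on_hours`; comparing the result with the hours in the self-test log would show whether the counter was reset.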
 