Can't view previous SMART test logs & is it in sequence?

cyberjock · Sep 6, 2014

We're looking at doing something like 7GB of physical RAM minimum to ensure we don't get "false positives". Most people aren't going to install 7GB of RAM just to avoid the warning. ;)

Ericloewe · Sep 6, 2014

cyberjock said:
We're looking at doing something like 7GB of physical RAM minimum to ensure we don't get "false positives". Most people aren't going to install 7GB of RAM just to avoid the warning. ;)

We should start a betting pool to see when the first instance of someone using one 4GB, one 2GB and one 1GB DIMM will be.

jgreco · Sep 6, 2014

/me goes digging to one-up this with a 512MB DIMM too ... 7.5GB

diskdiddler · Sep 6, 2014

I have rebooted the system and interestingly my array is healthy (which I guess makes sense, I haven't removed any disks or lost any data of any kind yet),
but when I view individual disks under "view disks" there's no indicator that any SMART error occurred there. This might be something to consider adding in future versions.
If for some (silly) reason I lost the email which told me I had bad sectors, I'd have no idea there's a disk with an issue at this point, just based on what the GUI is offering.

Sep 7 01:12:17 freenas smartd[18910]: Device: /dev/ada4, 360 Currently unreadable (pending) sectors
Sep 7 01:12:18 freenas smartd[18910]: Device: /dev/ada4, 45 Offline uncorrectable sectors

That is appearing in the console message log at the bottom (I saw someone recommend enabling that and thought it a good idea; mind you, it's not on by default).
So yeah, just a suggestion long term, although I know what some people's stance is on this regarding emails being configured properly in the first place (and it's true). If for some reason an email gets lost, I'm not seeing any indication of a disk failure without having to hunt for it (nothing under storage, and the console display isn't on by default).
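
As a rough illustration of "hunting for it" without the email, here is a minimal Python sketch that pulls the device and sector counts out of smartd messages like the two quoted above. The function name `find_sector_warnings` and the regular expression are assumptions made for this example, based only on those two log lines; other smartd versions may word the messages differently.

```python
import re

# Pattern inferred only from the two smartd lines quoted in this post (an assumption).
SMARTD_LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+), "
    r"(?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)


def find_sector_warnings(lines):
    """Return (device, kind, count) tuples for each matching smartd message."""
    warnings = []
    for line in lines:
        match = SMARTD_LINE.search(line)
        if match:
            warnings.append(
                (match.group("device"), match.group("kind"), int(match.group("count")))
            )
    return warnings


sample = [
    "Sep 7 01:12:17 freenas smartd[18910]: Device: /dev/ada4, 360 Currently unreadable (pending) sectors",
    "Sep 7 01:12:18 freenas smartd[18910]: Device: /dev/ada4, 45 Offline uncorrectable sectors",
]

for device, kind, count in find_sector_warnings(sample):
    print(f"{device}: {count} {kind} sectors")
```

Feeding it the system log instead of `sample` would list every disk smartd has complained about, which is roughly the indicator being asked for in the GUI.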

jgreco · Sep 6, 2014

It is perfectly possible for SMART to assess a drive as passed while simultaneously having unreadable sectors.

jgreco · Sep 6, 2014

jgreco said:
It is perfectly possible for SMART to assess a drive as passed while simultaneously having unreadable sectors.

Code:

# smartctl -a /dev/ada3
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST3000DM001-9YN166
Serial Number:   xxxxxx
LU WWN Device Id: xxxxxxxxxxxxx
Firmware Version: CC4H
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Sep  6 10:21:51 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 322) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   091   006    Pre-fail  Always       -       156785408
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       107
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       216
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       644730710
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16658
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       21
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   099   099   099    Old_age   Always   FAILING_NOW 1
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       21475164165
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   045   045    Old_age   Always   In_the_past 33 (Min/Max 21/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       667
194 Temperature_Celsius     0x0022   033   055   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       128763119549804
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       71301624261542
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       76475582325877

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     16656         237578075
# 2  Short offline       Interrupted (host reset)      10%     16647         -
# 3  Short offline       Completed: read failure       10%     16642         236872309
# 4  Extended offline    Completed: read failure       90%     16629         237578074
# 5  Short offline       Completed: read failure       10%     16620         236043972
# 6  Short offline       Completed: read failure       10%     16607         237578075
# 7  Short offline       Completed: read failure       10%     16604         236872309
# 8  Short offline       Completed: read failure       10%     16586         237577660
# 9  Short offline       Completed: read failure       10%     16544         237578075
#10  Short offline       Completed: read failure       10%     16527         237578075
#11  Short offline       Completed: read failure       10%     16479         237578075
#12  Short offline       Completed: read failure       10%     16477         237578075
#13  Extended offline    Completed: read failure       90%     16465         237578075
#14  Short offline       Completed: read failure       10%     16451         236043972
#15  Short offline       Completed: read failure       10%     16448         237578075
#16  Short offline       Completed: read failure       10%     16439         237578075
#17  Short offline       Completed: read failure       10%     16437         237578075
#18  Extended offline    Completed: read failure       90%     16406         237578075
#19  Short offline       Completed: read failure       10%     16404         237578075
#20  Short offline       Completed: read failure       10%     16402         237578075
#21  Short offline       Completed: read failure       10%     16395         237578075

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
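
The point of the output above is that the overall assessment says PASSED while the self-test log is full of read failures. A minimal Python sketch of checking the self-test log directly, rather than trusting the overall result, could look like this. The function name `summarize_self_tests` and the regular expression are assumptions written against the log layout shown above, not part of smartmontools.

```python
import re
from collections import Counter

# Matches self-test log rows in the layout shown above, for example:
# "# 4  Extended offline    Completed: read failure       90%     16629         237578074"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>\S+ offline)\s+"
    r"(?P<status>.+?)\s+(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def summarize_self_tests(text):
    """Return (rows_parsed, Counter of first-error LBAs for read failures)."""
    total = 0
    failures = Counter()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        total += 1
        # Only count rows that report a read failure with a concrete LBA.
        if "read failure" in match.group("status") and match.group("lba") != "-":
            failures[int(match.group("lba"))] += 1
    return total, failures


sample_log = """\
# 1  Short offline       Completed: read failure       10%     16656         237578075
# 2  Short offline       Interrupted (host reset)      10%     16647         -
# 3  Short offline       Completed: read failure       10%     16642         236872309
# 4  Extended offline    Completed: read failure       90%     16629         237578074
"""

total, failures = summarize_self_tests(sample_log)
print(f"{sum(failures.values())} of {total} logged self-tests ended in a read failure")
for lba, count in failures.most_common():
    print(f"  LBA {lba}: {count} time(s)")
```

Any non-zero failure count here is a reason to look harder at the drive, whatever the overall-health line says.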

diskdiddler · Sep 6, 2014

Just to clarify, "unreadable sectors" I take it, are the same as "bad sectors"? If I were to format this in Windows, do a full chkdsk /r I'd end up with bad sectors marked or no?

Ericloewe · Sep 6, 2014

diskdiddler said:
Just to clarify, "unreadable sectors" I take it, are the same as "bad sectors"? If I were to format this in Windows, do a full chkdsk /r I'd end up with bad sectors marked or no?

Unreadable amounts to bad, but bad sectors can also derive from them not being writable (but still providing nominally "valid" data) - the latter is a weird situation, though.

In any case, if you have more than one or two (I go up to four, personally, as the maximum), the drive is not to be trusted as it will probably develop more bad sectors.

DrKK · Sep 6, 2014

Ericloewe said:
In any case, if you have more than one or two (I go up to four, personally, as the maximum), the drive is not to be trusted as it will probably develop more bad sectors.

For the record, I'm less forgiving than Eric.

If I get one bad sector? Drive comes straight out. Experience in my life (which, who knows, may be skewed) is that there is no such thing as "1 bad sector"; there is only "0 bad sectors" or "precursor to entire drive failing within a few weeks".

diskdiddler · Sep 6, 2014

DrKK said:
For the record, I'm less forgiving than Eric.

If I get one bad sector? Drive comes straight out. Experience in my life (which, who knows, may be skewed) is that there is no such thing as "1 bad sector"; there is only "0 bad sectors" or "precursor to entire drive failing within a few weeks".

That's how I normally look at it too, it's indicative of long term issues even if it doesn't die instantly.

I'd love to know which one of those drives is doing the random head seeking noise, I'd assume it's 4 due to the problems but I can't honestly know that for certain. Driving me nuts.

titan_rw · Sep 8, 2014

If you're happy with your pool's redundancy, administratively take drive 4 offline. If the clicking stops, then that was it. On-lining the drive should only need to 'catch up' the drive from when it was offlined. It shouldn't have to do a full resilver.

Theoretically you could run through all your drives this way checking for whether the clicking goes away. If it never goes away, then it's possible that the drive always clicks, regardless of disk access, or it's not a drive clicking at all. I assume we know it is a drive making the noise at this point.

If you do go through all the drives and the clicking never goes away, you'll have to start powering down the drives to tell which one it is. Again, more than one way to do that too.

diskdiddler · Sep 8, 2014

When you take it offline does it actually cut power to the disk?

cyberjock · Sep 9, 2014

Not if you just take it out of the pool. You'll have to physically disconnect the disk from your PSU to cut power to the disk.

diskdiddler · Sep 9, 2014

Yeah that makes sense, I don't see why you'd need a spin down command to be issued. Would be a nice addition but not exactly a high priority.
Could be useful for removing disks from huge arrays of disks though, if the caddy has a light and it disengages it so you know exactly which one it is.

Hey - what's the best method to take this baby out, should I tell FreeNAS I'm pulling it first, or just shut down, pull it and let FreeNAS figure out it's missing a disk?

cyberjock · Sep 9, 2014

No, you shouldn't ideally *ever* spin down a disk automatically. The reason: if the disk is randomly spinning down automatically, you're going to have to deal with spin-up delay, which is 3-10 seconds. For some activities that's enough time to cause the read or write to fail. So no, you don't want it automatically spinning down. I'd consider that a more advanced function that should be strictly controlled by the server admin.

Removing disks should be done per the manual, so that's where I'll direct you to answer your last question. ;)

diskdiddler · Sep 9, 2014

Yeah I definitely meant a manual administration function - the page where you can edit / wipe disks, could have a "take offline" button

diskdiddler · Sep 10, 2014

Ok, thanks all for the help. I've installed a new fan in the server; it's very, very low power and moves very little air, but I suspect it might be the difference between the disks hitting 45c (113f) instead of 50 to 55c (122f / 131f). Not looking forward to summer (up to 33c (92f) indoors at my place!) :/
I suspect this could've contributed to the bad sectors on disk #4.
One last question: I've done the replace, and the resilver is going to take 24 hours (ouch, but so be it). Could I have actually done a SMART long check from the console before the resilvering? I suppose I could've, right? (And probably should have?)

jgreco · Sep 10, 2014

You should do a conveyance test and a long test on drives you've just received, then test them for a thousand hours; drives that survive that are less likely to fail early on in production.

diskdiddler · Sep 10, 2014

I've never heard of conveyance?
As a Windows user, I used to do a full format and use perfmon to monitor the speed the disks wrote at - on a huge 18 hour graph, it was great, you could clearly see disks have "slow points" if they were trouble.

1000 hours seems a bit excessive by any measurement.

jgreco · Sep 10, 2014

diskdiddler said:
I've never heard of conveyance?

Did you have a point you were making?

http://en.wikipedia.org/wiki/S.M.A.R.T.#Self-tests

diskdiddler said:
1000 hours seems a bit excessive by any measurement.

I'll take into consideration that you had no idea what a conveyance test was when I decide how much to value that opinion.

The issue of infant mortality is well-understood among storage professionals. You can try to bring yourself up to speed by checking out the Google reliability study, http://research.google.com/archive/disk_failures.pdf and other resources available via Google search.

An interpretation of the Google study's observation is that there's basically a window from 3 months to 3 years where drives are least likely to fail.

However, long experience by many people in the business is that drives are most likely to fail in the first month. 1000 hours is a conservative compromise; few failures are noticed between 1000 (my number) and 2160 hours (3 months).

So here's the thing you need to ponder: if you are building a pool with redundancy, and you've got a significantly greater chance of drive failure because of infant mortality issues, do you want to push those drives into production after a mere 24 hours of testing, and risk a double failure? Or is the smart move to wait at LEAST 1000 hours, maybe even 2160 hours, to get your drives properly validated? Because especially for a new pool, they've ALL got that greater risk of failure.

I might have agreed had you called 1000 hours insufficient.
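
To put the hour figures above on a calendar, a quick conversion helps. This is only arithmetic on the numbers quoted in this post; the helper name `hours_to_days` is made up for the example.

```python
HOURS_PER_DAY = 24


def hours_to_days(hours):
    """Convert a burn-in duration in hours to days."""
    return hours / HOURS_PER_DAY


# Durations mentioned in the post above.
for label, hours in [
    ("24 hours of testing", 24),
    ("1000-hour burn-in", 1000),
    ("3 months", 2160),
]:
    print(f"{label}: {hours} h = {hours_to_days(hours):.1f} days")
```

So 1000 hours is a little under six weeks, and 2160 hours is the 90 days used here as "3 months".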
