SMART error (Temperature) detected on host

Status
Not open for further replies.

MtK

Patron
Joined
Jun 22, 2013
Messages
471
Hey,
I just got my system up and running in a Fractal Design Define Mini with 6 drives.
I ran the first S.M.A.R.T. test (before setting up a pool) just to check that everything is OK, but got:
Code:
Device: /dev/ada0, Temperature 40 Celsius reached critical limit of 40 Celsius (Min/Max ??/40)

(on all 6 drives)

The fans are working and the system temperature looks OK.

Is that normal?
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
The optimal temperature for HDDs is 30-40°C; anything over 40°C will see increased failure rates. You should check the SMART attributes to see how hot the drives get in normal operation. Keep in mind that they tend to heat up quite a lot when you put strain on them (scrubs, read/write operations).

I'm also running into heat problems in summer, since my case is very compact and additionally I've got a passively cooled CPU in there. I'll definitely need to take care of that before next summer.
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
The system is completely new and had just booted.
As I mentioned in my post, I haven't even created a pool yet, so there is no real stress on the drives.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
What is the output of smartctl -a /dev/ada0 (in [code]-tags please to keep the formatting).
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
Code:
# smartctl -a /dev/ada0
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD5000AAKS-00UU3A0
Serial Number:    WD-WCAYU8141594
LU WWN Device Id: 5 0014ee 158bebaa2
Firmware Version: 01.03B01
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Nov 30 16:49:50 2013 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  36) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                ( 8400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 100) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   141   021    Pre-fail  Always       -       3941
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       84
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   103   101   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      40%        83         -
# 2  Extended offline    Completed without error       00%        81         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
Thanks, the last column "RAW_VALUE" is missing. Can you double-check?
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
Code:
# smartctl -a /dev/ada0
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD5000AAKS-00UU3A0
Serial Number:    WD-WCAYU8141594
LU WWN Device Id: 5 0014ee 158bebaa2
Firmware Version: 01.03B01
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Nov 30 16:56:55 2013 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  36) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                ( 8400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 100) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   141   021    Pre-fail  Always       -       3941
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       84
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   104   101   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      40%        83         -
# 2  Extended offline    Completed without error       00%        81         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
Yeah, it's running at 39°C right now. Since you've got a 7200 RPM WD Blue, it will run hotter than a Green model.

I'd keep an eye on the temperatures and see how they go while a scrub is in progress. If it regularly exceeds 45°C, I'd look into improving the cooling. Remember to set up SMART email reporting properly, so that you get a warning if a drive starts to fail.
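
For anyone who wants to script the "keep an eye on it" part, here is a minimal Python sketch. It assumes you have already captured the text of `smartctl -a` yourself (the sample below is copied from the output in this thread), and it assumes the temperature is the first number in the RAW_VALUE column of attribute 194, which holds for this WD Blue but may not for every drive. The function names and the 45°C default are just illustrative choices.

```python
# Sample rows copied from the smartctl -a output posted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   104   101   000    Old_age   Always       -       39
"""


def drive_temperature(smartctl_output):
    """Return the raw value of attribute 194 in Celsius, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # An attribute row has at least 10 whitespace-separated columns;
        # RAW_VALUE is the tenth (index 9).
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None


def needs_better_cooling(temp_c, limit_c=45):
    """True if the temperature is above the limit (45 is an assumed default)."""
    return temp_c is not None and temp_c > limit_c


temp = drive_temperature(SAMPLE)
print(temp, needs_better_cooling(temp))
```

Running something like this during a scrub, once per drive, would show whether the drives stay under the limit when they are actually working.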
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
So if I may opine.

The case you have comes with two cheapo 120mm fans. But it has room for 4 fans, and one or more of those mount points, if memory serves, will handle a 140mm fan, which can move as much as 50% more air.

Have you considered investing in maximizing your fans? I.e., start by taking any unused fan mount points and putting fans on them. "Cougar Vortex" is an exceptional fan, available in 140mm or 120mm models, for under $20.

You're at a borderline point (39-40C), which suggests to me that you can benefit a lot from incremental improvements like more/better fans.

Also, 40C is not the end of the world. You can go in there and reset the "critical" temp to 45 or something so you don't have the warning annoying you.
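
As a rough illustration of what that "critical" setting corresponds to: the warning in the first post looks like the one smartd logs for its `-W DIFF,INFO,CRIT` directive in smartd.conf, where a value of 0 disables that particular report. The Python sketch below only builds such a line as a string. The helper name is made up, and on FreeNAS the values are normally changed in the S.M.A.R.T. service settings rather than by editing the file by hand, so treat this as an explanation of the knob and not as a recipe.

```python
def smartd_temperature_directive(device, diff=0, info=0, crit=45):
    """Build a smartd.conf line using the -W DIFF,INFO,CRIT directive.

    diff: report when the temperature changes by this many degrees (0 = off)
    info: informational log message at or above this temperature (0 = off)
    crit: critical log message (and email, if configured) at or above this
    """
    for name, value in (("diff", diff), ("info", info), ("crit", crit)):
        if not 0 <= value <= 255:
            raise ValueError(f"{name} must be between 0 and 255, got {value}")
    # -a enables smartd's default set of checks for the device.
    return f"{device} -a -W {diff},{info},{crit}"


print(smartd_temperature_directive("/dev/ada0", crit=45))
```

With the critical limit at 45, a drive sitting at 39-40C would no longer trigger the message from the first post.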
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
I do plan on adding fans, yes.
But this is a newly installed machine, and I thought it would be in better shape before I was forced to add more fans...
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well, you've got six 7200 RPM drives in that case, sir, presumably near the edges of the case, and hence near the fluid dynamic stagnation points of the air. You have plenty of low cost options to solve this cooling problem.

And like I said, you're really close to having an OK thermal situation. It's just a few degrees hot. A couple more fans, of the right size and type, and we'll get this.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
What is the output of smartctl -a /dev/ada0 (in [code]-tags please to keep the formatting).
Running smartctl -l scttemp /dev/ada0 will give you much more detailed temperature info including the current power cycle & lifetime max/min values and a temperature graph. My WD REDs log the temperature every minute and keep the data for 8 hours.
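
If you want to compare those values across several drives, a small Python sketch like the following can pull them out of the text. It assumes the output has already been captured and that the field labels look the way they do for WD drives on smartctl 6.1; the sample lines use the same layout that smartctl prints for SCT status, and the function name and dictionary keys are made up for illustration.

```python
import re

# Sample lines in the layout smartctl -l scttemp prints for SCT status.
SAMPLE = """\
Current Temperature:                    43 Celsius
Power Cycle Min/Max Temperature:     22/46 Celsius
Lifetime    Min/Max Temperature:     42/48 Celsius
"""


def parse_sct_temperatures(text):
    """Extract current, power-cycle and lifetime temperatures in Celsius."""
    result = {}
    current = re.search(r"^Current Temperature:\s+(-?\d+) Celsius", text, re.M)
    if current:
        result["current"] = int(current.group(1))
    for key, label in (("power_cycle", "Power Cycle"), ("lifetime", "Lifetime")):
        match = re.search(
            rf"^{label}\s+Min/Max Temperature:\s+(-?\d+)/(-?\d+) Celsius",
            text,
            re.M,
        )
        if match:
            # Stored as a (min, max) tuple.
            result[key] = (int(match.group(1)), int(match.group(2)))
    return result


print(parse_sct_temperatures(SAMPLE))
```

Fields that are missing from the output are simply left out of the result, so a drive without SCT support gives an empty dictionary.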
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
according to this, I really (!) need a fan:
Code:
# smartctl -l scttemp /dev/ada5
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  2
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        SMART Off-line Data Collection executing in background (4)
Current Temperature:                    43 Celsius
Power Cycle Min/Max Temperature:     22/46 Celsius
Lifetime    Min/Max Temperature:     42/48 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      5/60 Celsius
Min/Max Temperature Limit:            1/85 Celsius
Temperature History Size (Index):    128 (34)
 
Index    Estimated Time   Temperature Celsius
  35    2013-11-30 16:30    43  ************************
 ...    ..( 24 skipped).    ..  ************************
  60    2013-11-30 16:55    43  ************************
  61    2013-11-30 16:56    42  ***********************
 ...    ..( 30 skipped).    ..  ***********************
  92    2013-11-30 17:27    42  ***********************
  93    2013-11-30 17:28    43  ************************
 ...    ..(  2 skipped).    ..  ************************
  96    2013-11-30 17:31    43  ************************
  97    2013-11-30 17:32    44  *************************
 ...    ..( 17 skipped).    ..  *************************
 115    2013-11-30 17:50    44  *************************
 116    2013-11-30 17:51    43  ************************
 ...    ..( 21 skipped).    ..  ************************
  10    2013-11-30 18:13    43  ************************
  11    2013-11-30 18:14    42  ***********************
  12    2013-11-30 18:15    43  ************************
 ...    ..(  2 skipped).    ..  ************************
  15    2013-11-30 18:18    43  ************************
  16    2013-11-30 18:19    44  *************************
 ...    ..(  2 skipped).    ..  *************************
  19    2013-11-30 18:22    44  *************************
  20    2013-11-30 18:23    43  ************************
 ...    ..( 13 skipped).    ..  ************************
  34    2013-11-30 18:37    43  ************************
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
according to this, I really (!) need a fan:

Yes, sir, I believe that's what I've been trying to tell you for about 3 posts. :)
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Here's mine...just for shits and giggles:

Code:

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Stand-by (1)
Current Temperature:                    20 Celsius
Power Cycle Min/Max Temperature:     15/28 Celsius
Lifetime    Min/Max Temperature:     15/28 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (440)

Index    Estimated Time   Temperature Celsius
 441    2013-11-30 04:24    20  *
 ...    ..(181 skipped).    ..  *
 145    2013-11-30 07:26    20  *
 146    2013-11-30 07:27    21  **
 ...    ..(285 skipped).    ..  **
 432    2013-11-30 12:13    21  **
 433    2013-11-30 12:14    20  *
 ...    ..(  6 skipped).    ..  *
 440    2013-11-30 12:21    20  *
 

MtK

Patron
Joined
Jun 22, 2013
Messages
471
Yes, sir, I believe that's what I've been trying to tell you for about 3 posts. :)

I didn't say I don't need one.
But take a look at my last post: it's ada5, which doesn't have a fan directly on it...
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
That's funny that my drives got down to 18C or so at times.

I guess we know how warm I keep my house. :)
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
That's funny that my drives got down to 18C or so at times.
According to the Google study (figures 4 & 5), such low temperatures aren't exactly beneficial either: "In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates."
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You're absolutely right, and Cyberjock and I were already discussing this at length weeks ago about my drive temps being "too low"!!!

But meh, what am I going to do? Put a heater in there? And in any case, the higher failure rate at "low" temperatures shows up when the drive is first put into service. Low temperatures don't seem to matter once the drive has been in service for a certain number of hours.

Whereas high temperatures are always bad, at any time in the drive's life.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Everybody else is adding fans, so maybe remove one? :) Or lower the RPM?
 