Should I worry about WD REDs running 40C on a mostly idle machine?

Status
Not open for further replies.

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
I have been reading here that disk temperatures should be under 40C. Mine is a home server setup with me being the single user (mostly), and the system is idle a lot of the time, but the disk temperatures always seem to hover around and sometimes above 40C. That's probably partly because I live in a tropical country, but the room temperatures are nowhere near that. My guess is that it's because the disks are always spinning.

The server is mostly read from over a single 1Gbps link, and occasionally written to. The writing is heavy(?) when it happens (like uploading HD videos from a camcorder, etc), but over the 1Gbps network. I also have a bunch of jails, but nothing that would be pounding the disks all the time. The most frequent access type should be over ssh, via jail. All this to say that I can't think of anything heavy on i/o that's running there.

The server is a FreeNAS Mini running 9.2.1.8 with 32GB RAM & 4x4TB WD RED WD40EFRX drives. I also have a backup server with less-capable hardware (HP N36L, 8GB ECC RAM, 4x4TB WD RED WD40EFRX) running 9.2.1.8, which has no network shares, no cron jobs and most services disabled (meaning it is even more idle, except during replications), but it too has similar disk temperatures.

Auto Snapshot Replication tasks are spaced 1 - 2 hours apart. There are 12 of them.

This is the output of "smartctl -a -q noserial /dev/ada0" on the mini, typical of all disks on both machines:

Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EFRX-68WT0N0
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 17 15:25:56 2014 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (53280) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 532) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   172   021    Pre-fail  Always       -       8325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       610
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   111   109   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       595         -
# 2  Short offline       Completed without error       00%       571         -
# 3  Short offline       Completed without error       00%       547         -
# 4  Short offline       Completed without error       00%       523         -
# 5  Short offline       Completed without error       00%       503         -
# 6  Short offline       Completed without error       00%       479         -
# 7  Short offline       Completed without error       00%       239         -
# 8  Short offline       Completed without error       00%       215         -
# 9  Short offline       Completed without error       00%       191         -
#10  Short offline       Completed without error       00%       167         -
#11  Conveyance offline  Completed without error       00%        49         -
#12  Conveyance offline  Completed without error       00%        47         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


The question is, should I be worried about premature disk failures, even though SMART says it's all ok for now? If yes, is there anything I can do about it, like APM (which I know nothing about)? As I said, this is a home server setup for storing home media (so not performance critical). I greatly value longevity over a small change in performance.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, you should worry - add a bit of cooling and you should be fine. That drive says the maximum temperature it's experienced is 41 degrees Celsius.

Ideally, you should have them idle closer to 30 degrees Celsius.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
I have the same problem. The room temperature around mine is about 30-35 degrees C (for domestic, not tropical, reasons). The only reliable solution is air conditioning (which is of course usual in data centres).
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
Any suggestions on possible cooling solutions, short of running 24x7 air conditioning, which is not an option for me?

And this will get worse in summer.

Thanks,
Saurav.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Any suggestions on possible cooling solutions, short of running 24x7 air conditioning, which is not an option for me?

And this will get worse in summer.

Thanks,
Saurav.
I don't know how feasible this is in the FreeNAS Mini, but adding fans and/or replacing those already present with stronger fans is the usual way of solving the problem.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There is no "good fix".

APM isn't a good choice. Spinning down your disks (assuming you have a second pool for your system dataset so you even *can* spin down disks) only wears them out faster. Even then, when you do scrubs, the drives are going to get hotter than whatever your idle temps are anyway, so you are still going to get hot disks.

The three options I see are to try adding a "better" fan to your Mini, move the Mini to a cooler location, or cool the location it's currently at. Those are about your only "good" options.

While you shouldn't be worrying about this to the point of not getting sleep, I'd say it definitely needs to be fixed.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
I am wondering whether a very small air conditioning unit ducted to the air intake of my computers would be economical to buy and run. Ducting would be a lot easier if I put them in a rack, perhaps.

Edit: the one thing that cannot be altered is the thermal resistance from the innards (die, or HDD works) to the case. For a given maximum power there is a fixed temperature gradient between the inside, where the temperature is measured and has its ill effects, and the case. Little can be done about that power in the case of a disk; even drives with adjustable power reduce it only a little, and at the cost of performance. All that further air flow will do is let the case temperature approach the ambient air temperature asymptotically.

Either liquid cooling with a refrigerator in circuit or cooling the incoming air is the only solution.
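
The thermal-resistance argument above can be sketched as a simple steady-state series model. The power and resistance figures below are hypothetical, picked only for illustration; they are not measurements of a WD Red or of the Mini.

```python
# Steady-state series model:
#   internal temp = ambient + power * (R_internal + R_case_to_air)
# All numbers are hypothetical illustrations, not WD Red measurements.

def internal_temp_c(ambient_c, power_w, r_internal, r_case_to_air):
    """Return the internal temperature in Celsius for a series thermal path."""
    return ambient_c + power_w * (r_internal + r_case_to_air)

AMBIENT_C = 30.0   # assumed intake air temperature
POWER_W = 4.0      # assumed drive power draw
R_INTERNAL = 1.5   # assumed internals-to-case resistance (C per watt), fixed by the drive

# Better airflow only lowers the case-to-air resistance.
for r_case_to_air in (2.0, 1.0, 0.5, 0.0):
    temp = internal_temp_c(AMBIENT_C, POWER_W, R_INTERNAL, r_case_to_air)
    print(f"case-to-air {r_case_to_air:.1f} C/W -> internal {temp:.1f} C")

# Even with perfect airflow the internals sit POWER_W * R_INTERNAL above ambient.
floor_c = internal_temp_c(AMBIENT_C, POWER_W, R_INTERNAL, 0.0)
```

Under these assumptions, airflow can only remove the case-to-air term, while lowering the intake air temperature shifts every result down by the same amount.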
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
I am wondering whether a very small air conditioning unit ducted to the air intake of my computers would be economical to buy and run. Ducting would be a lot easier if I put them in a rack, perhaps.

This is a very bad idea. Using chilled air to directly cool your components will result in condensation forming on them and this is obviously not good for a computer.

Also, FWIW, WD's rated spec for Reds is a 70C maximum operating temperature.

Google ran a study many years ago about SMART attributes and hard disk failures. You can find that here:
https://web.archive.org/web/20080826011702/http://labs.google.com/papers/disk_failures.pdf

They concluded that temperature was not as major a factor as previously thought.

"One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population."

Here is a graphic from said study as well.
afr_temp_age_dist.png


It shows that they found that too low temperatures actually resulted in a higher failure rate and that only really high temperatures for example above 45c started to show a slight increase in failures again.

Personally I would be fine with WD Reds running at 40C as long as they're not regularly getting above 45C, for example when you run scrubs and other high loads. I'm fine with this because the data that I have seen and that I trust points to this conclusion. But in the end the choice is yours.

Looks like your drive has hit 43c at most so far.
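
One way to arrive at a figure like 43C from the posted output, as a rough sketch: read the VALUE, WORST and RAW_VALUE columns of attribute 194 and assume the normalized value is a constant minus the Celsius reading. That relationship is an assumption inferred from the posted row, not something taken from WD documentation, and the helper names here are made up for illustration.

```python
# Sketch: read attribute 194 from `smartctl -A` style text and estimate the
# hottest recorded temperature from the normalized WORST column.
# ASSUMPTION: normalized value = (constant - Celsius reading); the constant is
# inferred from the current row, not taken from any WD documentation.

SAMPLE_ROW = (
    "194 Temperature_Celsius     0x0022   111   109   000    "
    "Old_age   Always       -       41"
)

def parse_temperature_row(row):
    """Return (value, worst, raw) as ints from an attribute-194 row."""
    fields = row.split()
    if fields[0] != "194":
        raise ValueError("not an attribute 194 row")
    value, worst = int(fields[3]), int(fields[4])
    raw = int(fields[9])  # first token of the RAW_VALUE column
    return value, worst, raw

def estimate_hottest_c(value, worst, raw):
    """Estimate the hottest reading, assuming value + temperature is constant."""
    offset = value + raw  # 111 + 41 for the sample row
    return offset - worst

value, worst, current_c = parse_temperature_row(SAMPLE_ROW)
hottest_c = estimate_hottest_c(value, worst, current_c)
print(f"current {current_c} C, estimated hottest {hottest_c} C")
```

If a drive's firmware normalizes the attribute differently, the estimate will be wrong, so treat it as a cross-check rather than a reading.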

Additionally Backblaze has run a much more recent study:
https://www.backblaze.com/blog/hard-drive-temperature-does-it-matter/

Their conclusions were this:

"How much does operating temperature affect the failure rates of disk drives? Not much."

"After looking at data on over 34,000 drives, I found that overall there is no correlation between temperature and failure rate."
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Chilled air has a low water content. Warm air over cold components leads to condensation.

(Edit: Air conditioning is good for demisting car windscreens.)
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well I mentioned that from experience. I have tried cooling computers with air conditioner units and a lot of condensation always forms on all the metal heatsinks.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Well I mentioned that from experience. I have tried cooling computers with air conditioner units and a lot of condensation always forms on all the metal heatsinks.

That's interesting! I'll have to try and think about why. Perhaps warm room air is being entrained with the cool air flow? If I do the experiment I'll let you know the answer here. Our relative humidity is usually in the 90s all year round (Wales) so it should be a good test.
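
For what it's worth, the condensation question can be framed as a dew point check: moisture forms when a surface is colder than the dew point of the air touching it. Below is a minimal sketch using the Magnus approximation; the coefficients are one commonly used set, the result is only approximate, and the room and heatsink figures are hypothetical.

```python
import math

# Magnus approximation for dew point; the coefficients are one commonly used
# set and the result is only approximate.
MAGNUS_A = 17.62
MAGNUS_B = 243.12  # degrees Celsius

def dew_point_c(air_temp_c, relative_humidity_pct):
    """Approximate dew point (Celsius) of air at the given temperature and RH."""
    gamma = (math.log(relative_humidity_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B + air_temp_c))
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)

def condenses(surface_temp_c, air_temp_c, relative_humidity_pct):
    """True if a surface is colder than the dew point of the air touching it."""
    return surface_temp_c < dew_point_c(air_temp_c, relative_humidity_pct)

# Hypothetical room: 30 C air at 90% relative humidity.
room_dew_point = dew_point_c(30.0, 90.0)
print(f"dew point of room air: {room_dew_point:.1f} C")
print("heatsink chilled to 20 C:", condenses(20.0, 30.0, 90.0))
print("heatsink at 40 C:", condenses(40.0, 30.0, 90.0))
```

On these numbers, a heatsink chilled well below room temperature would collect moisture from any humid room air that reaches it, while a heatsink running warmer than the room would not.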
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, there's a second chart for temps, that scales up badly after 40C.

Second, quoting Backblaze with such a small tidbit of information like "How much does operating temperature affect the failure rates of disk drives? Not much." is useless. Did they use drives that were all between 30 and 40C? If so I'd expect that answer. Now if they did 30C to 55C then their answer would be very different.

So you can't grab a small quote like that and expect it to mean a darn thing. ;)
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
The temperature reported by smartctl is from some time in the past, like during offline data collection or a previously run self-test, right?

I mean, it's NOT the temperature of the disk at the time the smartctl command is run? Or is it?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The temperature reported by smartctl is from some time in the past, like during offline data collection or a previously run self-test, right?

I mean, it's NOT the temperature of the disk at the time the smartctl command is run? Or is it?

All disks I've seen report current temperature and the highest recorded temperature - how often the latter is recorded is unknown, but it seems accurate enough to determine if something went wrong in the past.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
The temperature reported by smartctl is from some time in the past, like during offline data collection or a previously run self-test, right?

I mean, it's NOT the temperature of the disk at the time the smartctl command is run? Or is it?

I thought it *was* the current data at the time the command was run. Tests (in terms of pass or fail and any failing sector) are reported from the last time they were run. But I could be completely wrong!
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
My three real options
try adding a "better" fan to your Mini, move the Mini to a cooler location, or cool the location it's currently at.
are not easy to implement, and would take time. In the meantime, I lengthened the SMART check interval to 60 minutes (from the 30-minute default), so the disks are polled less often. The idle temperatures now hover around 38C.

With this being a home setup (not frequently read from or written to), would it be foolish to lengthen it even further (say, to 120 minutes)? Scrubs are 35 days apart, and SMART long self-tests are once a week.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
With this being a home setup (not frequently read from or written to), would it be foolish to lengthen it even further (say, to 120 minutes)? Scrubs are 35 days apart, and SMART long self-tests are once a week.

Honestly, I think it's a little foolish to change that time frame at all.

If you read my thread in the guides section, I recommend scrubs bi-weekly along with SMART long tests (but not at the same time obviously).
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
Honestly, I think it's a little foolish to change that time frame at all.
Because that has nothing to do with disk temperature? Or because it makes things dangerous? Or both?

I read your guide, but didn't remember the recommended scrub interval. I just left it to the default 35 days that seems to be set automatically when a pool is created.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, the default was set because too many people weren't setting a scrub schedule at all. So a year or two later when they needed to do disk replacements they couldn't because of pool problems and some people had problems dealing with the problems.

There's no reason to change the default and SMART is your first indicator of problems. It's just not recommended to deviate from that setting. There's no performance penalty or anything, so you're changing something for no gain and potential penalties.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Chilled air has a low water content. Warm air over cold components leads to condensation.

(Edit: Air conditioning is good for demisting car windscreens.)

Interesting. I just toured a fresh air modular datacenter. They raise temps to remove humidity. I suppose that's different than condensation, as yes, iced drinks form condensation quick in warmer air in my experience.
 