Airflow Temperature

Status
Not open for further replies.

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Hey Folks,

I had a video card hooked up to my FreeNAS box for testing. I recently rebuilt my 3-disk RAIDZ1 array as a 6-disk RAIDZ2 array following some recommendations here, and I was also trying to track down another hardware issue. Anyway, I noticed some warnings about temperature coming up on my console. I took the video card out and put the fans back in the case, and tomorrow I will be spacing the drives more evenly and adding another fan to improve airflow, but I'm concerned that I have damaged my new hard drives. The following is the output of smartctl -A /dev/ada5 on one of my drives.

Code:
[root@freenas] ~# smartctl -A /dev/ada5
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   100   006    Pre-fail  Always       -       76825240
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       221583
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       80
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   051   042   045    Old_age   Always   In_the_past 49 (0 52 50 47 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   049   058   000    Old_age   Always       -       49 (0 26 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       92968862089296
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1686666804
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       740888271


It looks like the drives have all been over the temperature threshold at some point. I am running a scrub now, and this is the output of zpool status:

Code:
[root@freenas] /var/log# zpool status
  pool: volume
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Mar 13 16:05:31 2013
        2.36T scanned out of 4.24T at 498M/s, 1h5m to go
        1.47M repaired, 55.77% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6a444eda-899d-11e2-823d-50e549b3a1da  ONLINE       0     0    11  (repairing)
            gptid/6ab74a08-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     9  (repairing)
            gptid/6b391d29-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     7  (repairing)
            gptid/6b97f34d-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     5  (repairing)
            gptid/6be28d09-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     6  (repairing)
            gptid/6c5396b1-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     9  (repairing)


Can anyone make suggestions?

Thanks!
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Completed scrub output:

Code:
[root@freenas] /var/log# zpool status -v
  pool: volume
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 2.38M in 2h28m with 0 errors on Wed Mar 13 18:33:49 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6a444eda-899d-11e2-823d-50e549b3a1da  ONLINE       0     0    17
            gptid/6ab74a08-899d-11e2-823d-50e549b3a1da  ONLINE       0     0    14
            gptid/6b391d29-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     9
            gptid/6b97f34d-899d-11e2-823d-50e549b3a1da  ONLINE       0     0    12
            gptid/6be28d09-899d-11e2-823d-50e549b3a1da  ONLINE       0     0     9
            gptid/6c5396b1-899d-11e2-823d-50e549b3a1da  ONLINE       0     0    15

errors: No known data errors


Should I use the zpool clear command as the output suggests?
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
A complaint about over-temperature doesn't constitute a device failure, but it will have been flagged as a warning. If ada5 is the one that got hottest (52°C), then you're almost certainly okay. Check the drives' datasheets for the maximum operating temperature - it's normally 50, 55, or 60°C.

Clear the pool. If the disks behave themselves (no further errors), then no harm done apart from some premature aging. Keep your backups up to date, but with a Z2 you should be pretty safe.
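To keep an eye on the temperatures going forward, something like this pulls attribute 194 out of the smartctl output. A minimal sketch: the awk field position assumes the table layout shown earlier in the thread, and the /dev/ada? glob is an assumption about your device names.

```shell
#!/bin/sh
# Extract the current temperature (attribute 194) from smartctl -A output.
# Field 10 is the RAW_VALUE column in the table layout posted above.
parse_temp() {
    awk '$1 == 194 { print $10 }'
}

# Parsing checked against a saved line from the output above:
line='194 Temperature_Celsius     0x0022   049   058   000    Old_age   Always       -       49 (0 26 0 0 0)'
temp=$(printf '%s\n' "$line" | parse_temp)
echo "ada5: ${temp}C"

# On the live box you would loop over the drives instead:
#   for d in /dev/ada?; do
#       printf '%s: %sC\n' "$d" "$(smartctl -A "$d" | parse_temp)"
#   done
```

And once the temperatures are under control, `zpool clear volume` (using the pool name from the status output above) resets the error counters.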
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So a somewhat related question because I've seen this before...

Anyone have any comments/ideas about the checksum errors appearing on all disks? My friend's FreeNAS server runs fine, is a RAIDZ2, and we've never identified the exact cause of these apparent errors. His server has been up since late January without a reboot, has a scrub scheduled for the 1st and 15th of every month, and the scrub on March 1st found checksum errors on all disks. I find it highly unlikely that all 10 disks have problems already (they are new, but all 10... I doubt it). Just like above, my friend's server has the message "One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.", which sounds like bad sectors on the drives. But if that's the case, why weren't there any read or write errors? I'd expect that during a scrub (which ideally would be only reads) you'd have read errors, not "checksum" errors. Also, when I check the SMART info for all of the disks, I see no indication of any errors reading or writing.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Thanks for your replies. I checked the manufacturer's website, which says the operating range is 0-60 degrees Celsius. I will check the SMART data on all the drives when I get home, but I don't think any were above 60. I am going to pick up brackets so I can space the drives better, and put another fan in the case in an attempt to keep things cool. I just spent a bunch of money on this project and I don't want it going down the tubes over something stupid. The next step is to replace the motherboard, CPU, and RAM with new parts that support ECC RAM (I'm thinking an ASUS M5A97 motherboard, a cheap dual-core AMD, and Kingston ECC RAM - any thoughts?). The reason is that I still suspect my motherboard caused the first issue I had, and I want ECC support. After that, a UPS is in order.

Regarding cyberjock's question, I am also curious why there would be checksum errors on all disks.

Thanks again!
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
CJ, dude! Start a new thread! ;)

I'd run, not walk, to do a memory test. Also check the physical location - particularly, see if anyone just powered up a ham radio nearby, or installed a microwave on the other side of the partition wall. RF interference spikes can cause transfer errors by messing with the signals in the cables.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
I was going to ask about those checksum errors anyway :)

I did 4 passes of memtest86 on the machine while looking for the previous problem and there were no errors. There are no ham radios or microwaves near the boxes.

Thanks for the responses!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
CJ, dude! Start a new thread! ;)

Normally I would have. But since this person is having issues I'm wondering if the checksum errors are also an indication of something more nefarious.

In my friend's case we've ignored them for just a few weeks short of a year. We've even done an upgrade from a mix of 1TB and 1.5TB drives to all 3TB drives. Despite using different slots (but the same controller) and it being a new zpool, we are still having the checksum issues. We've ignored them because the server has had zero issues aside from those checksum errors. But now that I see someone else having these weird, unexplained checksum errors, I'm wondering if they are a sign of something bigger down the road. As for my friend's setup, it's in a basement and has no electronic sources nearby that could cause interference.

Finding the checksum error cause may help the OP fix his issue though, hence I asked here.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, now I started doing homework and pulling out my notes and questioning everything. Here's an explanation of the error columns: (found here)

The second section of the configuration output displays error statistics. These errors are divided into three categories:

READ – I/O errors occurred while issuing a read request.

WRITE – I/O errors occurred while issuing a write request.

CKSUM – Checksum errors. The device returned corrupted data as the result of a read request.

These errors can be used to determine if the damage is permanent. A small number of I/O errors might indicate a temporary outage, while a large number might indicate a permanent problem with the device. These errors do not necessarily correspond to data corruption as interpreted by applications. If the device is in a redundant configuration, the disk devices might show uncorrectable errors, while no errors appear at the mirror or RAID-Z device level. If this scenario is the case, then ZFS successfully retrieved the good data and attempted to heal the damaged data from existing replicas.

So it sounds like a situation where the drive, during a scrub, was able to read the data, but the data on the drive was not consistent with what the zpool checksums thought should be there. Due to ZFS' self-healing capabilities the issue was corrected.

So I see the CKSUM errors as being 3 possible situations:

1. The drive's internal ECC had to correct the errors, and those errors are being passed through to the OS, where they are recorded as the CKSUM errors.
2. The drive's data was inconsistent with the rest of the zpool. For example, data was corrupted but the sector's ECC still said it was good (aka bitrot). This could potentially also be from bad drive cabling. (But again, since this is happening to all of the drives, do you really want to argue that bad cabling is the cause on every one of them?)
3. Some combination of both #1 and #2.

So it looks like a situation where, if the numbers stay small, you can safely ignore the problem, but it should be monitored and the hardware in the machine should be verified to be operating properly.
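The bitrot case in #2 is easy to demonstrate outside of ZFS: a strong checksum over a block catches a corrupted read even when every sector still "reads fine" as far as the drive is concerned. A rough sketch, not ZFS itself (it uses sha256sum; on FreeBSD the equivalent tool is sha256, and the /tmp paths are just placeholders):

```shell
#!/bin/sh
# Write a "block", make a copy, and silently flip one byte in the copy -
# the kind of corruption a drive can return without reporting a read error.
printf 'important data block' > /tmp/blk_good
cp /tmp/blk_good /tmp/blk_bad
printf 'X' | dd of=/tmp/blk_bad bs=1 seek=3 count=1 conv=notrunc 2>/dev/null

sum_good=$(sha256sum /tmp/blk_good | awk '{print $1}')
sum_bad=$(sha256sum /tmp/blk_bad | awk '{print $1}')

# ZFS stores the expected checksum with the block pointer, so a mismatch
# like this shows up as a CKSUM error and triggers repair from redundancy.
if [ "$sum_good" != "$sum_bad" ]; then
    echo "checksum mismatch: corruption detected"
fi
```

Note the drive itself reports no error in this scenario, which matches what we're seeing: CKSUM counts with clean READ/WRITE columns and clean SMART data.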
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Okay, so it seems that it isn't a big deal which is good. But I assume this isn't normal behavior. I will be interested to see if the problem goes away when I replace the PC components and install a UPS.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Good digging, Cyberjock - I failed to find that when I went hunting. Good to know.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So I see the CKSUM errors as being 3 possible situations:

1. The drive's internal ECC had to correct the errors, and those errors are being passed through to the OS, where they are recorded as the CKSUM errors.
2. The drive's data was inconsistent with the rest of the zpool. For example, data was corrupted but the sector's ECC still said it was good (aka bitrot). This could potentially also be from bad drive cabling. (But again, since this is happening to all of the drives, do you really want to argue that bad cabling is the cause on every one of them?)
3. Some combination of both #1 and #2

I don't get your conclusion at all.

1) isn't the case; if the drive's internal error handling were correcting things, you ought to see it in SMART (correctable errors or something like that).

2) Bad drive cabling is maybe possible but strikes me as not fitting the posted facts. If the drive's data is actually inconsistent with the rest of the zpool, it's probably because it was written incorrectly. If the drive's data merely appears to be inconsistent (i.e. a second read returns different data), then it's being read incorrectly. This could be bad cabling if it was happening on a single drive.

The likely culprit is corruption at the controller, system busses, CPU, or memory. If your controller is running too hot, for example, occasional bit flips aren't outside the realm of possibility. The ZFS CKSUM check is there to catch events in hardware that are not caught at other levels within the hardware.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
My 3 scenarios for what the CKSUM errors mean were generic things that could cause that number to increase. I wasn't saying that any(or all) of those were true for this situation. I was simply listing possible meanings for a CKSUM > 0 situation.

I disagree with #1. Seagates, for instance, actually record every ECC correction in a SMART parameter (1 Raw_Read_Error_Rate). It increases VERY rapidly when reading data - I've seen millions of corrections a minute. Seagate has said that the parameter records the number of ECC corrections since power-on and that it's normal for it to keep increasing. I call BS on their whole story, because Seagate is the only manufacturer that uses the parameter the way they describe. If you had even 1000 on any other brand for that SMART parameter, that would be a drive you'd be RMAing because it wasn't working.

ECC corrections are apparently very typical (there's an article somewhere saying that most data saved on a magnetic hard drive has at least 1 bit of error and requires correction). There's a chart somewhere on the internet listing the rates of different numbers of bit errors and their likelihood. Overall, more than 97% of all data read requires correction according to the chart. It scares me to think that I trust my data to drives that are so "unreliable". There's a reason why 512-byte-sector drives had ECC that could repair up to 50 bytes of errors and 4096-byte-sector drives can repair up to 100 bytes. Errors are simply considered normal and accounted for. (What I'd really like to know is the error rates for standard flash memory!)

I agree with you on #2 that bad cabling may be a cause, but it's unlikely in this situation because every drive showed errors. As I said above, my comment was a generic "I think this can cause CKSUM to increase" list, not a reference to this exact system.

The likely culprit is corruption at the controller, system busses, CPU, or memory. If your controller is running too hot, for example, occasional bit flips aren't outside the realm of possibility. The ZFS CKSUM check is there to catch events in hardware that are not caught at other levels within the hardware.

I agree 100%. To me, that falls into the #2 I explained above: anything that causes the drive to write incorrect data, or that corrupts data checked by the ZFS checksums, produces this error. Most RAID controllers use ECC RAM in my experience, and most CPUs (and RAID-controller CPUs with cache) have ECC internally. To be honest, I think the only two places without actual ECC are the PCI/PCIe transactions and the SATA transactions down the SATA cable. Anything else depends on whether the hardware itself supports and uses ECC. ZFS is great because errors introduced by unreliable hardware anywhere in the chain from the CPU to the media (and later from the media back to the CPU) can be corrected, but that makes it imperative that the RAM be 100% trustworthy. I used to not advocate for ECC RAM, but I've reconsidered my view based on some of the posts in the forum and the consequences of bad RAM (lost data... sometimes lost zpools).

My only 2 problems with that analysis are that the system has zero problems outside of these CKSUM errors, and that if it were an actual hardware issue I'd expect far more than double-digit errors on the drives. My friend's server has over 60 days of uptime, has all trustworthy hardware, is on a UPS, etc. None of it is new hardware that could suffer from infant mortality. In his case I can't even consider an improper shutdown as a possible cause, since at least 3 scrubs have run automatically during his current uptime. It's pretty clearly not the hard drives (unless it's a firmware bug or some other problem with the entire hard drive family).

Also, in my friend's case, when we filled his zpool (20+ TB) over 3 days and then did a scrub, I'd have expected a lot more errors than he sees now from his bi-weekly scrubs (100 to 200GB per week written to the server), but that wasn't the case. It was the same very low double digits.

I just find it fascinating because it is an unexplained enigma of his system that otherwise has no problems and I can't logically conclude where the problem actually is.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
So, the original issue has cropped up again. I was doing a file transfer and the FreeNAS box locked up. I am currently running a scrub and will post the results. Are there any logs I can post in the meantime to help diagnose this issue?

I'm suspecting either a crappy USB stick or a defective motherboard component. I have already ordered a new Adata USB stick for FreeNAS as well as a Seasonic PSU, but it seems I will need to move the ECC upgrade up the schedule. Also, an overnight memtest86 run came back with no errors.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
It looks like the FreeNAS box just bombed during the scrub. My SSH session stopped working and I couldn't ping the box so I restarted it. The scrub looks like it picked up where it left off.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It looks like the FreeNAS box just bombed during the scrub. My SSH session stopped working and I couldn't ping the box so I restarted it. The scrub looks like it picked up where it left off.

I don't think it picks up where it left off. If the system is rebooted it starts over from scratch.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
When I restarted the SSH session and ran the zpool status command it was at a slightly higher percentage than it was before the restart. It is too far back in the session for me to get at or I would show you.

At any rate, is there anything I can check for other causes? For example, is there any way to check the status of the SATA controllers?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't think so. If the machine is locking up, it's a little hard to view the logs or troubleshoot otherwise. Is your BIOS up to date? Have you tried a RAM test?
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Yeah, the BIOS is up to date and I did the memtest86. I'm just going to have to bite the bullet and buy new components. Should I just turn the FreeNAS box off until I have the new stuff ready and stream my media off the backup drives, or do you think it is safe to run things as is with the occasional freeze up until the new stuff arrives?
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
If you haven't already, set up syslog to send to a remote machine. That gives you better access to hardware error logs.
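If you go the syslog.conf route rather than the GUI, the change is one line (192.168.1.10 here is a made-up address for your log host):

```
# /etc/syslog.conf on the FreeNAS box - forward everything to a remote host
*.*    @192.168.1.10
```

You'll also need the receiving machine's syslog daemon configured to accept remote messages.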

Also remember that running the HDDs at full tilt during a scrub increases heat generation, and all that data flying around may increase the odds of triggering whatever the failure is... it's a bit of a catch-22, really.
 