I had a drive fail catastrophically in one of the servers I manage for work. According to the log, the drive hit 185°C before it died, and the heat was so intense it caused the two neighboring drives to throw errors too. It's a good thing the server was running RAIDz2 with a hot spare. I ended up having to replace three drives at once, which is rare... The server didn't burst into flames, though. That's very unlikely for anything short of an electrical fault. I had a power supply fail that smelled like it was on fire, but no flames ever appeared. These things don't happen often; I've only seen them because I've been dealing with many servers for many years.
- 365°F (I have to think in F past 100°C) ... that is crazy talk!!! (I believe you).
- I'm surprised it made it that far before dying ... do you know what caused the failure?
- I don't proclaim to know anything about hard drive engineering, but I'm surprised there isn't some sort of internal fail-safe that spins them down before they hit such a high temp. Programmatically, I suppose that would require the drive to issue its own smartmontools-style query, interpret the result, and conditionally decide whether to continue operating (so that wouldn't make much sense). But I'm surprised there isn't some sort of integrated thermal probe to achieve the same thing.
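- The host-side version of that check is easy enough to sketch. Here's a minimal, illustrative Python snippet that parses `smartctl -A`-style output for the temperature attribute and applies a cutoff; the sample line, its values, and the 60°C threshold are all made up for illustration, not taken from any real drive:

```python
import re

# Hypothetical spin-down threshold in degrees C -- illustrative only; a real
# firmware fail-safe would read its own thermal sensor, not parse text.
MAX_TEMP_C = 60

def parse_temperature(smartctl_output: str):
    """Pull the temperature (in C) out of smartctl -A style attribute output.

    SMART attribute 194 is Temperature_Celsius on most drives; the raw
    reading is the last field on that line.
    """
    for line in smartctl_output.splitlines():
        if "Temperature_Celsius" in line:
            match = re.search(r"(\d+)\s*$", line)
            if match:
                return int(match.group(1))
    return None  # attribute not reported

def should_spin_down(temp_c: int) -> bool:
    """Decide whether the drive should stop operating at this temperature."""
    return temp_c >= MAX_TEMP_C

# Sample attribute line in the shape smartctl -A prints (values invented):
sample = "194 Temperature_Celsius 0x0022 119 103 000 Old_age Always - 43"
temp = parse_temperature(sample)
```

Of course, that's exactly the self-referential loop I mean above: the drive would have to be both the sensor and the judge, which is why an independent thermal cutoff makes more sense.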
PS. We had one of the two cooling units for my section of the datacenter go down, and the room temp went up to the low 80s (Fahrenheit), pushing the drives up to the high 50s Celsius. It was a stressful couple of weeks waiting for the parts to fix the cooling unit. For commercial coolers like that, I would have thought parts were more readily available. The way the datacenter is partitioned, the other coolers didn't help my section much...
- Ugh that sucks.
- Stupid question #1: When you hear A & B Power or Cooling, that refers to 100% redundancy, right? So your two cooling units would be Cooling A and there wasn't a Cooling B?
- Stupid question #2: I've only been in a bona fide colo once, but working in F&A I know our cost for the ~12 racks we rented there was fixed. Point being, isn't there some sort of SLA from the colo operator guaranteeing power and cooling, i.e. the failed cooling unit was their problem (obviously still impacting you, unfortunately)? My inference from your wording is that your company had to maintain the cooling infrastructure.
- Somewhat surprised low-80s °F ambient = high-50s °C HDD temps. I suppose I can't extrapolate my own server to a datacenter (drives could be 15k / multiple servers in constrained space could require a cool aisle / the heat could have had a compounding effect), but I keep the air on 78°F during the summer and my warmest drive only hits 36°C, with an average of 34.4°C on the full-speed fan profile (and I imagine your servers would have wound up their fans to cope with the temp). Thankfully, for me, full speed isn't needed to stay sub-40°C @ 78°F ambient; usually Standard, with a throttle here and there to Heavy I/O, gets the job done (using X9 fan profile speak). Reference below image, top right (produced some time ago, when I was trying to get temps under control).
- Server catches on fire = no problem. PVC cooling loop melts and douses the fire with coolant.
- Cooling unit fails = no problem. You don't need cool air, you have cool liquid.
Off to bed for the night (hopefully I'm not awoken by a fire in my server closet) - Good Night. ;) [Yes, my sense of humor is that bad]
Edit: Bottom right should read "Full Speed Comparison," not "Standard Speed Comparison" (I used the same worksheet to compare all fan modes for a change I made and must have forgotten to update that text manually).