After GrumpyBear mentioned that it might be the LSI2308 overheating, I checked the heatsink and it was extremely hot! So I set up a fan to blow air directly onto the LSI heatsink, AND updated FreeNAS to the latest 9.3 RELEASE. I reported above that the error had not occurred since. After that post, however, I turned OFF the fan cooling the LSI and ran some scrubs to test whether it was actually the software update that fixed the issue. With the fan off, a scrub of 2 TB of data caused the error to crop up again, and when it occurred I felt the LSI heatsink and it was extremely hot.
So my conclusion at the moment is that the error occurs when the LSI overheats, which happens during a scrub because it's doing a lot of work at that point. I am carrying out ongoing testing and will keep the LSI fan in place permanently :)
I will report back with my results but I think that overheating is the cause of the issue!
Ahh - the old "change multiple things before testing" gotcha ;-)
Hopefully it keeps working with the fan on the LSI2308 heatsink. If you can repeat that result a few times, you have a reproducible fault you can report to SuperMicro. To my knowledge the MegaCLI tool will not report the die temperature of the HBA (strangely enough, it will report the battery backup unit temperature). Avago does not appear to publicly release the specifications for the chip, so I have no idea what the operating temperature is supposed to be. The only thing they have is a Product Brief. This Application Note for a card using two of the LSI2308s states a maximum heatsink temperature of 110 degrees Celsius, so these devices are designed to run hot.
I suspect that the design of the motherboard compounds the heat issue. The stock Intel coolers actually suck air in from the top and blow it down through the CPU heatsink. The LSI2308 is located less than 1cm from the edge of the cooler. Here is a quick & dirty photo showing this and the location of the "System" temperature sensor:
As multiple people have reported varying perceptions of how hot the heatsink is, and some have reported it as loosely bonded to the die, I suspect that there may be some quality control issues.
I also noted in my testing that the "System" temperature tended to run hot and was not an accurate reading of what the ambient temperature inside the chassis should be, given that any air flowing over the sensor has already passed over the hard disks.
With Cougar fans running in "Optimal" and CPU Utilization: 87.5% (133W)
Code:
System Temps:
CPU Temp 69 degrees C
System Temp 46 degrees C
Peripheral Temp 42 degrees C
PCH Temp 48 degrees C
VRM Temp 51 degrees C
DIMMA1 Temp 38 degrees C
DIMMB1 Temp 35 degrees C
Fan Speeds:
FAN1 1600 R.P.M
FAN2 600 R.P.M
FAN3 600 R.P.M
FAN4 500 R.P.M
FANA 600 R.P.M
Disk Temps:
da0 33 Celsius
da1 38 Celsius
da2 36 Celsius
da3 35 Celsius
da4 37 Celsius
da5 40 Celsius
da6 38 Celsius
da7 37 Celsius
With Noctua Industrial fans under the same conditions (CPU Utilization: 87.5%):
Code:
System Temps:
CPU Temp 68 degrees C
System Temp 43 degrees C
Peripheral Temp 39 degrees C
PCH Temp 46 degrees C
VRM Temp 51 degrees C
DIMMA1 Temp 35 degrees C
DIMMB1 Temp 30 degrees C
Fan Speeds:
FAN1 1500 R.P.M
FAN2 700 R.P.M
FAN3 600 R.P.M
FAN4 700 R.P.M
FANA 600 R.P.M
Disk Temps:
da0: 27 Celsius
da1: 32 Celsius
da2: 31 Celsius
da3: 29 Celsius
da4: 31 Celsius
da5: 33 Celsius
da6: 31 Celsius
da7: 29 Celsius
Note that the fans were controlled by PWM from the FANA header and reported their speeds on FAN2 through FAN4 and FANA. The Noctua fans have a much higher airflow than the Cougar fans, but note that while most temperatures were lower, the CPU and system sensors reported similar temperatures.
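To make the comparison between the two fan sets easier to read, here is a rough Python sketch that computes the per-sensor change from the Cougar run to the Noctua run. The numbers are copied by hand from the two listings above; the variable and function names are just made up for illustration, and this is not the script that produced the listings.

```python
# Readings in degrees C, copied by hand from the two listings above.
cougar = {"CPU": 69, "System": 46, "Peripheral": 42, "PCH": 48,
          "VRM": 51, "DIMMA1": 38, "DIMMB1": 35}
noctua = {"CPU": 68, "System": 43, "Peripheral": 39, "PCH": 46,
          "VRM": 51, "DIMMA1": 35, "DIMMB1": 30}

# Disk temperatures for da0 through da7, in the same order in both lists.
cougar_disks = [33, 38, 36, 35, 37, 40, 38, 37]
noctua_disks = [27, 32, 31, 29, 31, 33, 31, 29]


def deltas(before, after):
    """Return after - before for every sensor name present in `before`."""
    return {name: after[name] - before[name] for name in before}


sensor_delta = deltas(cougar, noctua)
disk_delta = [after - before for before, after in zip(cougar_disks, noctua_disks)]
mean_disk_delta = sum(disk_delta) / len(disk_delta)

# Negative values mean the Noctua run was cooler.
for name, change in sensor_delta.items():
    print(f"{name:<11}{change:+d} C")
print(f"Disks (avg) {mean_disk_delta:+.1f} C")
```

Going by these figures, the disks drop by about 6 degrees on average, while the CPU sensor moves by only 1 degree and the VRM sensor does not move at all.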