X10SL7-F scrub causes LSI 2308 to reset and removes drives from zpool

Status
Not open for further replies.

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
Okay, well your testing shows your LSI2308 runs perfectly. Yes exactly, 450 W and only using a fraction. I am ruling out the PSU for the moment.
Just checked the BIOS version and it's v2.0 with build date 04/24/2014.

I have tried directing a spare fan at the LSI heatsink, as it was very hot to the touch. However, even with that fan I have still encountered the LSI 2308 error, although the LSI heatsink is touchable now.
I am trying an update to the latest FreeNAS 9.3 RELEASE. Since the issue occurred I had stopped OS updates in order to pin down the cause, but I want to see if the update helps at all.
 

GrumpyBear

Contributor
Joined
Jan 28, 2015
Messages
141

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
Good, thanks for testing that! Yes, I have both of those on my board :smile: I have updated to the latest FreeNAS 9.3 RELEASE and will now do some more testing/scrubbing to check whether the error still occurs.
 

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
I have run 4 scrubs on my zpool, which holds between 600 GB and 1 TB of data: while the pool was not being written to, while copying 100 GB to the pool, and while running a disk benchmark that continuously writes and reads a 5 GB file on the zpool during the scrub.
I can happily report that I have had no errors since updating to the latest FreeNAS 9.3 RELEASE. However, I still don't know why the error was occurring in the first place, which is unfortunate! But if it works now and the error doesn't crop up again, then I'm happy.
Thanks for all the help everyone!
I will report back if my LSI2308 crashes again.
 

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
After GrumpyBear mentioned that it might be the LSI2308 overheating, I checked the heatsink and it was extremely hot! So I set up a fan to blow air directly onto the LSI heatsink, AND updated FreeNAS to the latest 9.3 RELEASE. Above I reported that the error had not occurred since. However, after that post I turned OFF the fan cooling the LSI and ran some scrubs to test whether it was actually the software that fixed the issue. A scrub of 2 TB of data then caused the error to crop up again, and when it occurred I felt the LSI heatsink and it was extremely hot.
So my conclusion at the moment is that the error occurs when the LSI overheats, which happens during a scrub because it's doing a lot of work at that point. I am carrying out ongoing testing and will keep the LSI fan in place permanently :smile:
I will report back with my results but I think that overheating is the cause of the issue!
 

GrumpyBear

Contributor
Joined
Jan 28, 2015
Messages
141
Ahh - the old change multiple things before testing gotcha ;-)
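
The gotcha can be made concrete by tabulating the runs reported so far in this thread and looking for pairs that differ in exactly one factor. This is only a minimal Python sketch: the `Run` record and the `single_factor_pairs` helper are hypothetical names, and the outcomes are a reading of the posts above, not measurements.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Run:
    new_freenas: bool  # True once updated to the latest 9.3 RELEASE
    lsi_fan: bool      # True when a fan blows on the LSI heatsink
    error: bool        # True if the LSI 2308 reset during a scrub


# Outcomes as described earlier in the thread (an interpretation of the posts).
RUNS = [
    Run(new_freenas=False, lsi_fan=False, error=True),
    Run(new_freenas=False, lsi_fan=True, error=True),
    Run(new_freenas=True, lsi_fan=True, error=False),
    Run(new_freenas=True, lsi_fan=False, error=True),
]


def single_factor_pairs(runs):
    """Yield (factor, a, b) for run pairs that differ in exactly one factor and in outcome."""
    for a, b in combinations(runs, 2):
        changed = [f for f in ("new_freenas", "lsi_fan") if getattr(a, f) != getattr(b, f)]
        if len(changed) == 1 and a.error != b.error:
            yield changed[0], a, b


if __name__ == "__main__":
    for factor, a, b in single_factor_pairs(RUNS):
        print(f"outcome changed when only {factor} changed: {a} -> {b}")
```

With these four runs both factors show up as suspects, which is why repeated runs that change only one thing at a time are needed before blaming either.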

Hopefully it will work with the fan on the LSI2308 heatsink. Then, if you can repeat it a few times, you have a reproducible fault you can report to SuperMicro. To my knowledge the MegaCLI tool will not report the die temperature of the HBA (strangely enough, it will report the battery backup unit temperature). Avago does not appear to publicly release the specifications for the chip, so I have no idea what the operating temperature is supposed to be. The only thing they have is a Product Brief.
This Application Note for a card using two LSI2308s states a maximum heatsink temperature of 110 degrees Celsius, so these devices are designed to run hot.

I suspect that the design of the motherboard compounds the heat issue. The stock Intel coolers actually suck air in from the top and blow it down through the CPU heatsink. The LSI2308 is located less than 1cm from the edge of the cooler. Here is a quick & dirty photo showing this and the location of the "System" temperature sensor:
X10SL7_SAS.jpg

As multiple people have reported varying perceptions of how hot the heatsink is, and some have reported it as loosely bonded to the die, I suspect that there may be some quality control issues.

I also noted in my testing that the System temperature tended to run hot and is not an accurate reading of the ambient temperature inside the chassis, since any air flowing over that sensor has already passed over the hard disks.

With Cougar fans running in "Optimal" and CPU Utilization: 87.5% (133W)
Code:
System Temps:
   CPU Temp          69 degrees C
   System Temp       46 degrees C
   Peripheral Temp   42 degrees C
   PCH Temp          48 degrees C
   VRM Temp          51 degrees C
   DIMMA1 Temp       38 degrees C
   DIMMB1 Temp       35 degrees C

Fan Speeds:
   FAN1   1600 R.P.M
   FAN2    600 R.P.M
   FAN3    600 R.P.M
   FAN4    500 R.P.M
   FANA    600 R.P.M

Disk Temps:
   da0   33 Celsius
   da1   38 Celsius
   da2   36 Celsius
   da3   35 Celsius
   da4   37 Celsius
   da5   40 Celsius
   da6   38 Celsius
   da7   37 Celsius

With Noctua Industrial fans under the same conditions: 87.5% Utilization:
Code:
System Temps:
   CPU Temp          68 degrees C
   System Temp       43 degrees C
   Peripheral Temp   39 degrees C
   PCH Temp          46 degrees C
   VRM Temp          51 degrees C
   DIMMA1 Temp       35 degrees C
   DIMMB1 Temp       30 degrees C

Fan Speeds:
   FAN1   1500 R.P.M
   FAN2    700 R.P.M
   FAN3    600 R.P.M
   FAN4    700 R.P.M
   FANA    600 R.P.M

Disk Temps:
   da0:   27 Celsius
   da1:   32 Celsius
   da2:   31 Celsius
   da3:   29 Celsius
   da4:   31 Celsius
   da5:   33 Celsius
   da6:   31 Celsius
   da7:   29 Celsius
Note that the Fans were controlled by PWM from the FANA header and were reporting their speeds on FAN2 through FAN4 and FANA. The Noctua Fans have a much higher airflow than the Cougar fans but note that while most temperatures were lower the CPU and system sensors were reporting similar temperatures.
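
To make the comparison between the two fan sets easier to read, the two "System Temps" readouts above can be diffed sensor by sensor. This is a minimal Python sketch that only re-uses the numbers posted above; the names `parse_temps` and `deltas` are hypothetical helpers, not part of any tool.

```python
import re

# Readouts copied from the two "System Temps" blocks above.
COUGAR = """\
CPU Temp          69 degrees C
System Temp       46 degrees C
Peripheral Temp   42 degrees C
PCH Temp          48 degrees C
VRM Temp          51 degrees C
DIMMA1 Temp       38 degrees C
DIMMB1 Temp       35 degrees C
"""

NOCTUA = """\
CPU Temp          68 degrees C
System Temp       43 degrees C
Peripheral Temp   39 degrees C
PCH Temp          46 degrees C
VRM Temp          51 degrees C
DIMMA1 Temp       35 degrees C
DIMMB1 Temp       30 degrees C
"""

# Sensor name, whitespace, integer reading, then the literal unit.
LINE = re.compile(r"^\s*(?P<name>.+?)\s+(?P<value>\d+)\s+degrees C\s*$")


def parse_temps(text):
    """Return {sensor name: temperature in degrees C} for lines matching the readout format."""
    temps = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            temps[match.group("name")] = int(match.group("value"))
    return temps


def deltas(before, after):
    """Return after - before for every sensor present in both readouts."""
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    for name, change in deltas(parse_temps(COUGAR), parse_temps(NOCTUA)).items():
        print(f"{name:<16} {change:+d} C")
```

On these figures the CPU reading moves by 1 degree and the VRM reading not at all, while the DIMM sensors drop by 3 to 5 degrees.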
 

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
Yup! :tongue: It is a reproducible fault, which I will report to the SuperMicro support person via email. However, I will just provide that for their purposes and keep my board with a fan on the LSI, or I could simply use the Intel SATA ports until I need to add more HDDs and then use the LSI.
However I plan to use the LSI and "test" it long term to make sure it is stable with the fan setup.

Okay, yeah, well I don't have any more information on it. I have read somewhere on these forums that the temperature limit is 100 degrees C, or it may have been 110 degrees C. The only way to really measure the temperature would be to touch a thermometer directly to the chip's heatsink.
But it was definitely very hot; it would burn my finger if I left it on there. Yes, I think the design doesn't help. It needs a bigger heatsink, and I think they have probably designed the motherboard assuming there will be some airflow over the LSI heatsink.
I may have received a particularly "bad" one that is more susceptible to the overheating LSI issue.
 

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
I can post the temperature readings from my board, however I'm not sure how to get the temperature readout like you have posted above?
Also, currently all fans except my CPU fan are connected to the Fractal case fan controller, which is set to 12 V (max).
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
They are server grade MB and it means that they will normally be in the path of a good airflow. However as much as I love SuperMicro MB I think it's a bad design practice to rely on external airflow to cool some part of the MB... When you design something, especially if it's for professional usage, you should use the worst case and this is definitely not the worst case...
 

GrumpyBear

Contributor
Joined
Jan 28, 2015
Messages
141
Connect the IPMI interface up and access it via a browser and then you can look at the values.

Or run ipmitool from a ssh session to the NAS IP address.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The problem is the IPMI interface doesn't give us the LSI temp unfortunately.
 

GrumpyBear

Contributor
Joined
Jan 28, 2015
Messages
141
Nothing instrumented on the motherboard or controller will give you the 2308 temp from what I can tell. But if you look at the picture above, the "System" temperature thermistor is about 1cm below and to the right of the 2308.
Code:
ipmitool -c sdr type Temperature
from an SSH session or from a console widget in the FreeNAS GUI will show the temperature sensors.
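
For logging these readings during a scrub, the CSV output of that command can be parsed with a few lines of Python. This is a sketch under assumptions: the exact column layout of `ipmitool -c` output is not shown in this thread and may vary between versions, so the reading is located by its "degrees" unit field instead of a fixed column, and `read_ipmi_csv`, `parse_ipmi_csv` and `hottest` are hypothetical helper names.

```python
import csv
import io
import subprocess


def read_ipmi_csv():
    """Run the command quoted above and return its raw CSV text (needs ipmitool on the host)."""
    result = subprocess.run(
        ["ipmitool", "-c", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ipmi_csv(text):
    """Map sensor name -> reading, using the numeric field just before a 'degrees ...' unit field.

    The column layout is an assumption, so the reading is found by its unit
    and the first field of each row is taken to be the sensor name.
    """
    temps = {}
    for row in csv.reader(io.StringIO(text)):
        for i in range(1, len(row)):
            if row[i].strip().startswith("degrees"):
                try:
                    temps[row[0].strip()] = float(row[i - 1])
                except ValueError:
                    pass  # sensor listed but no numeric reading
                break
    return temps


def hottest(temps):
    """Return the (name, value) pair with the highest reading, or None if there are none."""
    return max(temps.items(), key=lambda kv: kv[1]) if temps else None


if __name__ == "__main__":
    readings = parse_ipmi_csv(read_ipmi_csv())
    print(readings, hottest(readings))
```

None of these sensors is the 2308 itself, so at best this tracks the nearby "System" thermistor as a rough proxy.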

I was looking at fan controllers at my local computer store and noticed that several advertise 5 or more temperature sensors. Mind you, after going for the sleek minimalist Scandinavian Fractal, installing this two-bay glaring monster would not be consistent :eek: Also, the temperature range is 2 - 99 (2-99 what - degrees Kelvin???), so 110C might not only melt the sensor but also not show up. No indication of the accuracy either.

Servethehome has used a Forward Looking Infra Red (FLIR) thermometer to grab images in some of their newer articles. That would be the ultimate, but unfortunately they did not use it when they reviewed the X10SL7.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yeah but there is a big difference between the temp of the chip and the temp of the PCB 1 cm away.

Yeah, that's why I make my own, I don't trust this kind of thing at all (let alone the price...).

A thermal camera is in my dreams (but the prices start to fall greatly so not a dream for too long I hope...) so I can't tell, sorry.
 

TXAG26

Patron
Joined
Sep 20, 2013
Messages
310
I have the same board and 4TB SG drives as the OP and use a quiet tower case. From day one, I've had a 40mm Noctua fan zip-tied to the LSI heatsink on the motherboard. Same goes for the Intel/LSI expander heatsink. As mentioned above, Supermicro designed this board for a 1U or 2U server and their standard design has a 3/4/5 high-speed fan bank mere inches from where this heatsink would be in those types of cases. Active cooling is most certainly needed on these chips for most home users that aren't running racked equipment with 5,000rpm screaming fans.

My setup has been rock solid and like ECC memory and all the other things we do for reliability, the OP's original issue is a great example of the types of silent errors that can creep into a system when it's running near its thermal maximums.
 

nas2160

Dabbler
Joined
Feb 1, 2015
Messages
32
Hello everyone,
The error occurred again during a scrub of 3 TB of data, even with a dedicated fan for the LSI heatsink.
So I contacted Amazon and organised a return and refund :smile:
I will be receiving my new board next week and will be putting it through a thorough burn in testing phase including the LSI 2308!
I will report back once I have done that.
TXAG26: Good to hear. Yes, I imagine they did. I will test my new board both with and without a fan and see if I get the same error.
I may look into getting some smaller fans that can be zip-tied to the board like you have done. To date I have directed a 140 mm fan at the LSI heatsink, as that was the only fan available to me.
 

AndruXO

Dabbler
Joined
Jan 24, 2016
Messages
19
Found this old topic by googling my own issue; this is exactly what I needed. Thank you guys.
I can confirm that the LSI controller overheating issue exists on my X10SL7-F mainboard and that the system removes hard drives after some time.
I solved my issue with an additional 92mm fan blowing directly onto the LSI controller.
 