SOLVED A lot of errors and reboots, unusable

Status
Not open for further replies.

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
Hi everyone,

Sorry to skip the introduction, but I have a frustrating problem that needs fixing. It took me a while to get in here: signing up for the forums through Chrome somehow did not work, and it took a suggestion from the IRC channel to use another browser before I finally got in. Anyway, I have terrible problems with my new server configuration and it is really annoying. I have already opened an issue on the Bug Tracking System, see https://bugs.freenas.org/issues/17240. I actually opened two, because at first I thought I had two different problems, so look at this one as well: https://bugs.freenas.org/issues/17140. I can give a quick summary of what has happened so far, but to see the errors themselves, have a look over there and in the files I uploaded.

To summarize, I recently moved my server from my old Dell PC to new hardware, which is always exciting. The new hardware is a Supermicro X10SL7-F motherboard, an Intel Core i3-4170 and 4 x Crucial 8GB ECC CT102472BD160B, and while I was at it I also moved from a 3-disk RAID-Z1 pool to a 6-disk RAID-Z2 pool. I installed everything without trouble, booted into my original FreeNAS installation, and all seemed fine at first. Then the trouble started: the system would randomly start showing hundreds of errors, which always ended in a crash and a reboot. The error is not always the same: sometimes it is a kernel panic, sometimes CAM errors, and sometimes problems with the IOC, which I think is the I/O controller.

While I was troubleshooting, my motherboard at one point would not boot anymore and I had to ask for a replacement. I am not sure whether that was related to the original problem, because I have the replacement motherboard now and the issues are exactly the same.

I tried many different firmware versions for the SAS controller, always the IT version (16.0.1, 19, 20, 20.4). I also tried booting the system with some disks excluded to check for faulty disks or cables, but that did not help either. Booting without hyperthreading did not help, and neither did changing some of the settings in the SAS controller. I ran a memtest, which showed no problems.

The crashes do seem to happen when the system tries to open a volume: not one particular volume, but either of the two I have. I have tried multiple fresh installs of FreeNAS, and the problems seem to start as soon as I upload my configuration or try to import the volume via the storage page.

Have a look at both issues in the bug tracker, as I might not have mentioned everything that has happened. Does anyone have any ideas?

Thanks in advance,
Jelle Jan




EDIT: So I have 2 pools: one RAID-Z2 with all WD30EFRX disks, and one mirror with a 2 TB WD Green disk and a 2 TB Seagate Barracuda disk. If I connect only the mirror to the system, via either the SATA or SAS ports, the system seems to have no problems, but as soon as I connect the other pool I get the errors. The system says IOC fault 0x40002622 and resetting, then a LOT of messages saying 'Warning: io_cmds_active is out of sync - resynching to 0'. I attached a rar containing a recording where you can see the error happening. Skip to about 35 seconds. Are my WD Red 3 TB disks somehow incompatible with the LSI 2308 controller?
 

Attachments

  • FREENAS.rar
    1.6 MB · Views: 282

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Did you try a new or different boot USB? Make sure your cables are all plugged in correctly. What SATA cables are you using?

 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
Yes, I bought a new USB flash drive and tried a fresh install multiple times. All my cables are correctly plugged in, and I also tried different SATA cables. The cables I use now are Supermicro CBL-0044L 57.5 cm SATA cables. If I try to boot the system with just 3 of the 6 disks connected it starts normally but obviously cannot open the pool. If I connect 4 disks or more, the trouble starts.
 

Scareh

Contributor
Joined
Jul 31, 2012
Messages
182
Well, I'm seeing lots of seemingly unrelated errors, so in short:
Did you burn in your system when you bought it?
Stress test the CPU, test the RAM, and so on; look for the topic in the forums.

After every part has checked out, try a fresh install on a USB mirror instead of a single USB stick.
See if that stays stable, then reboot and import your disks again.

This way you'll at least have excluded hardware problems before anything else.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
I'm going to grasp at straws here but I've not read ANYTHING about attempts to troubleshoot POWER.
Just a WildAssedGuess based on this:
"If I try to boot the system with just 3 of the 6 disks connected it starts normally but obviously cannot open the pool. If I connect 4 disks or more then trouble starts."
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
I'd have to perform those tests next weekend, as I have to leave again tonight. But I am not convinced that the hardware is defective, because if I just have the mirrored pool connected the system performs normally. The temperatures are also normal: the CPU is around 34 degrees Celsius and the HDDs around 27 degrees. So if the hardware is not defective, any other ideas of what could be causing it? I'm not saying it isn't, just looking for different causes.
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
@BigDave I have not performed any real conclusive tests about power, but I did have the system on with all the disks connected to the power and therefore spinning up, but only the disks for the mirrored pool connected with data cables. The system was stable. BTW my PSU is a Seasonic G-series 360 watt.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
"@BigDave I have not performed any real conclusive tests about power, but I did have the system on with all the disks connected to the power and therefore spinning up, but only the disks for the mirrored pool connected with data cables. The system was stable. BTW my PSU is a Seasonic G-series 360 watt."
360W for 8 drives on that motherboard (built-in controller) and four sticks of RAM... I'm leaning heavily towards the PSU!
In the past, I've had a single bad disk cause a large and varied mix of issues. You have changed a bunch of variables in a short period of time, so you must go back to "square one" and be systematic in your process of elimination.
I would first attempt to eliminate the PSU as the source; after that, the drives would come next.
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
Alright, hopefully I can find a spare PSU with a higher power rating by the end of next week. Until then I will just have to wait.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Also check for firmware updates on the controller. I remember somebody having strange errors with WD drives that went away after a firmware update.
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
I updated the BIOS and tried p16 through p20.4 (the most recent) firmware for the LSI 2308 SAS controller. My server crashed again, by the way, now with just the mirrored pool (2 disks) installed. So no pool is stable. I installed my old motherboard again and now it is stable. That makes me doubt the PSU as a culprit even more.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
"hopefully I can find a spare PSU"
I'm thinking power as well. How do you have the drives connected to power (are they all split from the same power connector)? Maybe try a different power supply feed for some of the drives?
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
I used multiple SATA splitter cables and a few Molex to SATA adapter cables. Besides that I have 2 fans in front cooling the hard drives and one fan in the rear. The only other power consumers are the processor, its cooler and the motherboard. I realize power splitter cables can be tricky, but as long as the current doesn't exceed what the rails support, it should be fine, right? The +12V line supports 30A and the +5V line 16A. If I calculate the maximum consumption of eight disks according to the power rating on the drives' labels, the maximum is 3.6A on the 12V and 4.8A on the 5V. However, the average power consumption during read/write is much lower according to the specs: 4.1W per disk, so 32.8W in total. Forgive me for being stubborn, but I just don't see 360W not being enough for the entire system. Or am I missing something?

Did you mean the PSU is broken rather than insufficient? That would also make no sense to me, because my old hardware ran fine. It might be worth mentioning that my old setup was a Dell PC with an Intel Pentium E5300 and 4 x DDR2 2GB modules. Same case, same power supply, a different cheap PCI-E SATA controller and an Intel Gigabit CT Adapter. The RAM modules weren't even the same brand, yet that system ran reliably for over 2 years and I can't get my fancy new hardware to play nice.

BTW depasseg, I don't think my power supply has multiple feeds. It isn't modular and only has one SATA cable and one Molex cable, which I am already dividing over the drives, but I think they're just connected to the same feed.
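As a rough sanity check of the arithmetic in this post, here is a minimal Python sketch. It uses only the figures quoted above (eight disks, 3.6A and 4.8A label totals, 30A and 16A rail limits, 4.1W per disk during read/write). The function and constant names are made up for illustration, and it deliberately covers steady-state disk load only: spin-up peaks, CPU, fans and motherboard are not included and would have to come from the datasheets.

```python
# Steady-state disk power budget using only the figures quoted in the post.
# All names here are hypothetical and chosen for illustration.

DISK_COUNT = 8
LABEL_TOTAL_12V_A = 3.6   # summed label rating of all eight disks on +12V
LABEL_TOTAL_5V_A = 4.8    # summed label rating of all eight disks on +5V
RAIL_LIMIT_12V_A = 30.0   # PSU +12V rail limit quoted in the post
RAIL_LIMIT_5V_A = 16.0    # PSU +5V rail limit quoted in the post
RW_WATTS_PER_DISK = 4.1   # average read/write draw per disk from the spec


def rail_watts(amps, volts):
    """Convert a current draw on one rail to watts (P = I * V)."""
    return amps * volts


def rail_utilisation(load_amps, limit_amps):
    """Fraction of a rail's rated current that the load uses."""
    return load_amps / limit_amps


# Label-rated load of all disks, summed over both rails.
label_total_w = (rail_watts(LABEL_TOTAL_12V_A, 12.0)
                 + rail_watts(LABEL_TOTAL_5V_A, 5.0))

# Average read/write load of all disks according to the spec sheet figure.
rw_total_w = DISK_COUNT * RW_WATTS_PER_DISK

if __name__ == "__main__":
    print(f"Label-rated disk load: {label_total_w:.1f} W")
    print(f"Average read/write disk load: {rw_total_w:.1f} W")
    print(f"+12V rail used: "
          f"{rail_utilisation(LABEL_TOTAL_12V_A, RAIL_LIMIT_12V_A):.0%}")
    print(f"+5V rail used: "
          f"{rail_utilisation(LABEL_TOTAL_5V_A, RAIL_LIMIT_5V_A):.0%}")
```

With these inputs the label-rated disk load comes to 67.2W and the average read/write load to 32.8W, so on paper the disks alone sit well inside both rail limits. That says nothing about short peaks during spin-up, which this sketch does not model.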

Anyway, next weekend I'm just going to take my 850W PSU out of my gaming desktop and connect it to the server and see what happens. Can't hurt much. Keep giving me your thoughts, this problem is already causing sleeplessness.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
That was interesting, and it definitely gave me some new insights into the power consumption of hard disks. I knew there was a peak consumption during spinup, but thought the power ratings on the labels of the hard disks were the maximum consumption. Apparently that was a wrong assumption. I actually hope now that this is the problem and that connecting a different PSU to the server will fix it, yet I am not fully convinced. It does not explain why my server also fails with just two disks being powered, at least from what I understand. According to this guide I should have bought a power supply with a capacity of around 550W or higher for my 'fleet'. If so, then my gaming PSU should definitely offer enough breathing room for my server. Thanks BigDave.

Time for sleep.
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
I just installed a power supply with twice the capacity in my server and at least it didn't crash during booting. It booted with only the mirrored pool though. Almost immediately when I tried to import the RAID-Z2 pool I got a lot of read errors again and the system crashed. I'm just going to keep the system online with the mirrored pool for now and see if it is stable; if it doesn't crash within a few hours I can be pretty confident this configuration is stable. If that is the case, there might actually be two different issues: one with the PSU and one with the RAID-Z2 pool, maybe a defective disk or cable, or the pool is just corrupt. Any ideas on this?
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
When I woke up this morning the system was halted on a BIOS screen, so it must have crashed again, meaning it wasn't stable with just the two disks either.

EDIT: It now hangs every time in the BIOS at the same code. I really hope I do not have to RMA it again; they might not even replace it this time.
EDIT2: Tried shorting the CMOS battery but the same error ensues. I also tried disconnecting the USB drives, the SATA drives and even the memory, but nothing fixes it. I'm going to be really frustrated if it has to be sent to Supermicro again and I have to wait another month to continue troubleshooting this system. I don't know why the BIOS on this board would be corrupted after probably about 30-40 power cycles, since I got this replacement last weekend. I did of course swap out PSUs yesterday, so the only thing I can think of is reconnecting the original PSU. Otherwise I'm just going to leave it with the power cord disconnected; perhaps if I leave it like that for 24 hours the problem will resolve itself.
 

HertogArjan

Dabbler
Joined
Oct 16, 2016
Messages
30
Hi everyone, I contacted Supermicro support and they helped me out a lot. We came to the conclusion that my CPU might be defective and that I should try replacing it. They also helped me reflash my BIOS through IPMI, because I couldn't get past the BIOS in the first place. After reflashing and installing an i5 processor that I removed from a desktop computer of mine, the system successfully booted.

Unfortunately my joy did not last long: while I'm no longer getting the kernel panics I used to get, the SAS controller still resets as soon as I try to import my volume. I remembered the lesson you gave me about correct PSU power rating, and especially with the i5 I can expect high power consumption, so I tried my 850W PSU again to see if that solved the problem, but it did not. I used multiple cables from the PSU (it is a modular PSU, the Nexus RX-8500) and did not split the power, so as not to have too much current drawn from one socket.

I then also tried disconnecting certain drives and importing the volume degraded. On the first two attempts the SAS controller just reset immediately; with one configuration I could successfully import my volume in a degraded state. Yet when I started reading data from it at 118 MB/s via Samba, the SAS controller very quickly reset as well. I do not know whether that is a significant difference.

I do know that the problem is not caused by a single drive. I also know now that there have been two problems all along and that a defective CPU is not causing these issues, because the CPU used now has never caused problems in probably about 2 years of use. The PSU is also pretty much cleared of any wrongdoing, and so is the motherboard. I really do not know what could be causing it. Tonight I will run a memtest on it again, just because I do not have a use for it otherwise anyway, but I'm really interested to hear if anyone has any suggestions left that I could try.

Thanks in advance,
Jelle Jan
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Can you show us a photo of your system? I wonder if something is shorting out.

 