FreeNAS Server Randomly Reboots

Status
Not open for further replies.

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Server: IBM x3850 X5
CPU: 4x Intel Xeon E7-4830 2.13GHz
Memory: 1048528MB
RAID Card: 3x LSI M1015 (IT Mode)

FreeNAS-9.3-STABLE-201511020249

This past month I've been dealing with FreeNAS rebooting at random times, and I'm having a hard time pinpointing the issue. Currently I have eight modules, and on each module there are 8 sticks of RAM (16GB each). I did see the console show memory errors at one point, so for the last week I have been removing one module at a time to see if it still reboots. Since last week the server has rebooted three times and I have removed three modules, with five more to go. When I do catch FreeNAS rebooting, I notice the console displays tons of info super fast for about 30 seconds or so. I went into the /data/crash directory, untarred the file, and found the info below in msgbuf.txt. Sometimes FreeNAS can stay up for days; other times it will boot up and reboot within a matter of minutes. To me it sounds like a hardware issue?

panic: ctl_check_for_blockage: Invalid serialization value 1667590243 for 1 => 14
cpuid = 10
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xfffffeb2f8410440
kdb_backtrace() at kdb_backtrace+0x37/frame 0xfffffeb2f8410500
panic() at panic+0x1ce/frame 0xfffffeb2f8410600
ctl_check_ooa() at ctl_check_ooa+0xb7/frame 0xfffffeb2f8410660
ctl_work_thread() at ctl_work_thread+0x1f70/frame 0xfffffeb2f8410be0
fork_exit() at fork_exit+0x11f/frame 0xfffffeb2f8410c30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffeb2f8410c30
--- trap 0, rip = 0, rsp = 0xfffffeb2f8410cf0, rbp = 0 ---
KDB: enter: panic
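
When comparing several crash dumps, it can help to boil each msgbuf.txt down to the panic message and the functions in the backtrace. The following is only a minimal sketch: `summarize_panic` and the `GENERIC_FRAMES` set are hypothetical names made up for illustration, and the sample text is copied from the trace in this post rather than read from /data/crash.

```python
import re

# Frames that appear in almost any FreeBSD panic backtrace and say little
# about the cause. This set is an assumption chosen for this example.
GENERIC_FRAMES = {
    "db_trace_self_wrapper", "kdb_backtrace", "panic",
    "fork_exit", "fork_trampoline",
}

def summarize_panic(msgbuf_text):
    """Return (panic_message, interesting_frames) from msgbuf-style text."""
    panic_message = None
    frames = []
    for raw in msgbuf_text.splitlines():
        line = raw.strip()
        if line.startswith("panic:"):
            panic_message = line[len("panic:"):].strip()
            continue
        # Backtrace lines look like: "name() at name+0x2a/frame 0x..."
        match = re.match(r"^(\w+)\(\) at ", line)
        if match and match.group(1) not in GENERIC_FRAMES:
            frames.append(match.group(1))
    return panic_message, frames

sample = """\
panic: ctl_check_for_blockage: Invalid serialization value 1667590243 for 1 => 1
4
cpuid = 10
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xfffffeb2f8410440
kdb_backtrace() at kdb_backtrace+0x37/frame 0xfffffeb2f8410500
panic() at panic+0x1ce/frame 0xfffffeb2f8410600
ctl_check_ooa() at ctl_check_ooa+0xb7/frame 0xfffffeb2f8410660
ctl_work_thread() at ctl_work_thread+0x1f70/frame 0xfffffeb2f8410be0
fork_exit() at fork_exit+0x11f/frame 0xfffffeb2f8410c30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffeb2f8410c30
"""

message, frames = summarize_panic(sample)
print(message)
print(frames)
```

If every dump reduces to the same `ctl_` frames, that is one data point; if the panicking function changes from crash to crash, that would be more in line with flaky hardware. Either reading is a guess until the hardware has been tested.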
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, you probably want to go look at my build and burn-in sticky for some hints on what you need to be doing. If the platform is not stable, FreeNAS is going to be a miserable experience.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
jgreco said:
Yes, you probably want to go look at my build and burn-in sticky for some hints on what you need to be doing. If the platform is not stable, FreeNAS is going to be a miserable experience.

Sounds good. Everything was running fine for a couple of months, but once I moved to a new house I started to have odd issues. I'm going to take apart the entire box (remove the CPUs, memory, and PCI cards, re-apply thermal paste, etc.) and see if that helps. I will continue to try to find the one bad stick of memory.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Was there any significant bumping or jostling along the way? Speaking as someone who's had to move Big Gear, failures like this after a move often signify that something came undone. If you're going to be in there, consider using some electronic contact cleaner on the PCIe card edges and sockets, the DIMM card edges and sockets, etc. Try that probably *before* undoing CPUs and redoing thermal paste; most of the modern mounts screw on and are somewhat less likely to be the source of the problem. Running something like memtest successfully for several days is the bare minimum you should do before even trying to bring up FreeNAS.

https://forums.freenas.org/index.php?threads/building-burn-in-and-testing-your-freenas-system.17750/

Seriously: CPU burn-in and memory tests. Get those to 100% first, then move on to your disk subsystem.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Right now I'm avoiding running the memory test because I have 64 sticks of 16GB each, and I feel like I would be dead before it found the right stick. I figured it would be a little easier to just remove the modules one at a time to pinpoint the bad one (assuming it's only one). Once I find the module with the bad stick, I can run the memory test on that specific module.

As for moving the server, I was very careful with it and made sure there were two of us moving it at all times. But still, something could have happened during that 15-minute trip.

Worst case scenario, I remove all processors but one, insert one memory module, and see how long the server stays up. I'll also do a CPU burn-in like you said.
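
As a rough illustration of the search itself: if exactly one of the eight modules is faulty and each trial gives a reliable answer, testing half of the remaining candidates at a time narrows it down in three trials, where removing one module per trial can take up to seven. The sketch below is hypothetical throughout. `find_bad_module` and `is_stable` are made-up names, and `is_stable` stands in for "run the server with only these modules installed and see whether it stays up", which takes days per trial and may not be a valid memory configuration on every server.

```python
def find_bad_module(modules, is_stable):
    """Narrow down a single faulty module by halving the candidate set.

    modules   -- list of module labels
    is_stable -- callable taking a list of modules and returning True if
                 the server stays up with only those modules installed

    Assumes exactly one faulty module and a reliable stability check.
    Returns (faulty_module, number_of_trials).
    """
    candidates = list(modules)
    trials = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        trials += 1
        if is_stable(half):
            # The tested half is clean, so the fault is in the other half.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0], trials

# Simulated run: pretend module 6 is the faulty one.
modules = [1, 2, 3, 4, 5, 6, 7, 8]
simulated_check = lambda installed: 6 not in installed
bad, trials = find_bad_module(modules, simulated_check)
print(bad, trials)
```

With an intermittent fault, a "stable" result is only as trustworthy as the length of the observation window, and if more than one module is bad the halving logic breaks down, so treat this as a way to think about the search order rather than a procedure.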
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Pointeo13 said:
[...] Worst case scenario, I remove all processors but one, insert one memory module, and see how long the server stays up. I'll also do a CPU burn-in like you said.
Can the type of memory module you are using be tested in some other (known to be good) motherboard?

You should try to immediately start testing components systematically, and not wait for a crash.

P.S.
I do not believe in so many RAM modules failing at the same time, even with a move. CPU or motherboard damage is more likely than that many RAM modules failing at once.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
solarisguy said:
Can the type of memory module you are using be tested in some other (known to be good) motherboard?

You should try to immediately start testing components systematically, and not wait for a crash.

P.S.
I do not believe in so many RAM modules failing at the same time. Even with a move. CPU or motherboard damage is more likely, than some many RAM modules at the same time.

I can get my hands on another IBM x3850 X5. That way I can test whether it's the motherboard and start testing modules in it.
 
Joined
Apr 9, 2015
Messages
1,258
Just another wrench to throw into the scenario, but it is possible that the outlet it's plugged into has substandard power. If you have a UPS it will help, but eventually the batteries will just go down and you will have major problems.

I dealt with this before when working on someone's computer. At my place it was fine; at his it would not run right. I told him to try another outlet and it worked without issues, so either the outlet was bad or a connection was bad between that outlet and some other part of the system. It's rare, but it does happen. Switching to another room closer to the breaker box would be a good way to see if that is the issue, and it would be a lot faster than doing a memory test.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
When you're powering 2000 watt power supplies, you usually have dedicated circuits run.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
jgreco said:
When you're powering 2000 watt power supplies, you usually have dedicated circuits run.

Correct, I have four 30-amp circuits, each with its own UPS, and I had my house service upgraded to 200 amps. In the end I just wanted to make sure a crash like that is hardware. If it is, then no problem, I can start troubleshooting.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hard to know for certain where the problem is until you try a bunch of stuff. Just the nature of it.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I'll throw in my data point - I experience them too. H/W is almost a year old. It seems to happen when I'm doing stuff in the GUI (low load otherwise), like deleting a bunch of snapshots, or setting up a new jail. It's happened a couple times in the past month. I chalked it up to the software updates.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
depasseg said:
I'll throw in my data point - I experience them too. H/W is almost a year old. It seems to happen when I'm doing stuff in the GUI (low load otherwise), like deleting a bunch of snapshots, or setting up a new jail. It's happened a couple times in the past month. I chalked it up to the software updates.

Hmmm, here is what I might do. I have an extra IBM x3650 M3 that I used for FreeNAS before upgrading to the IBM x3850 X5. I think I'll go back to it, use 8GB memory modules in it, bring over only my LSI cards, and see if it happens again.
 