Random reboots on two 2950s

Status
Not open for further replies.

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
All,

I have been dealing with random reboots (usually 3 to 14 days apart) on two Dell 2950 servers. I set up syslog and enabled the debug kernel, but that has only added to the oddity. I cannot find any kernel panics in the logs (although I do get a Samba panic on one server), and there are no logs after whatever event triggers a reboot, except for the logs of the reboot itself. The only thing I can find from reviewing the logs is that they seem to end right before the server would run "/usr/libexec/atrun" via crontab. The last time this happened, syslog reported that the server responded to SNMP after missing the atrun, so syslog and SNMP were still operational right before it rebooted. I have disabled the cron job and I have been monitoring the atrun job queue, but I have not seen anything. On top of any suggestions you may have, could you also answer the following questions:

1. What does FreeNAS schedule with "at", and are the jobs supposed to stay in the folder?
2. If it is a bad USB flash drive, would using another USB flash drive in a mirror setup prevent this lockup?

I have already updated from 9.3-Stable-201506292332 to 9.3-Stable-201508250051, and it has been running fine for almost a week, but the issue is not predictable.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
First, is the reboot occurring on both servers or only one of them?

What NICs are you using? Realtek, Intel, etc...

Maybe you should post your system hardware configuration; please be detailed. Also, detail the hardware connecting the two servers, such as network switches, etc. Did you change out the Ethernet cable?

What kind of burn-in testing have you done to these servers to validate stability?

Using a mirrored USB will not prevent your type of problem. Assuming the USB flash drive is the problem, replacing it might.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
How much RAM?
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Both do it, but never together. Both have identical hardware.
Broadcom gigabit on the board, Intel gigabit on the card. How could the server respond to SNMP and report to an external syslog if it was an Ethernet or network issue?
32 GB RAM
They have been running for years without any problems with other operating systems.

I suspect something related to the atrun only because there is no cron log of it running (every five minutes) right before it reboots. I know it is a bit coincidental, but without any substantial logs pointing to something more definite, it is worth asking about.
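A check like the one described above can be scripted instead of done by eye. The following is only a minimal sketch, assuming the syslog server stores traditional "Mon DD HH:MM:SS" timestamps (which carry no year, so the year is passed in) and that each cron run leaves a line containing "atrun". The function name find_atrun_gaps and the one-minute slack are hypothetical choices for illustration, not anything FreeNAS provides.

```python
from datetime import datetime, timedelta


def find_atrun_gaps(lines, year, expected=timedelta(minutes=5),
                    slack=timedelta(minutes=1)):
    """Return (previous, current) timestamp pairs where consecutive
    atrun log entries are further apart than expected + slack.

    Assumption: each line starts with a 15-character traditional
    syslog timestamp such as "Sep  1 12:05:00".
    """
    stamps = []
    for line in lines:
        if "atrun" not in line:
            continue
        try:
            # Prepend the year because the syslog stamp does not include one.
            stamps.append(
                datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            )
        except ValueError:
            # Skip lines that do not start with a parsable timestamp.
            continue

    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > expected + slack:
            gaps.append((prev, cur))
    return gaps
```

Feeding it the lines of the collected log file (for example via open(path).readlines()) lists every interval in which an expected five-minute run is missing, which can then be compared against the reboot times.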
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You didn't give very specific system specs. What plugins do you have running?

Can you post your crontab file? I don't know if that will do anything though.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
I am pretty much running vanilla Dell 2950 servers, except I have IT firmware on Dell SAS 6. I found out after I set them up that these servers are plagued with issues, but everyone else has some errors to work off of.
I have no plugins running, again vanilla.
I have auto-tune enabled, but the settings it came up with are ones I was going to implement anyway. I have already tried disabling it, but that did not resolve the issue.
The crontab is vanilla except I commented the 'atrun' job.
And the FreeNAS boys issued more updates, I need to read into these.

I have no performance issues and everything has been running very well, save for a random reboot here and there.
I do not claim to be an expert on Unix or Linux, so that is why I am inquiring about how FreeNAS uses 'at' and 'atrun'. Are these needed for plugins/jails?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Tim Sievers said: "I found out after I set them up that these servers are plagued with issues, but everyone else has some errors to work off of."
Not sure I understand whether you are saying your servers have issues in general or your servers have issues running FreeNAS. Nor do I understand who has errors to work off of.


There could be a software compatibility issue. I'd have you do a few things:

1) Run MemTest 86+ on both servers for 3 solid days and ensure no RAM issues. Even ECC RAM or some other component can fail.
2) Run Prime95 or another CPU stress test for 30 minutes while monitoring the CPU temps, and check whether they get past the maximum temp for that CPU with all the cores you have installed (I have no idea what you've got; you're not telling us, and the Dell 2950 can be configured with a few different CPUs). Do not run it for longer than an hour; you will likely harm your CPU if you try to run it for extended periods of time like several hours or overnight.
3) I have no idea what your network configuration is, but have you attempted to rule out Ethernet cables and switches? Sometimes (yes, we have seen it) an Ethernet cable can cause issues and make the software skip a beat and fail. Normally that means a crash, but a reboot isn't far-fetched.
4) While you're running the RAM test, read up on if there are any FreeBSD issues with your server hardware.

Have you read this about the command atrun?
Read the entire thing, it tells you where those jobs are stored so if you feel this is your trigger, it may help you.

If you're not going to perform the routine steps above, then as far as I can say, I'm done. You have been resistant to answering the questions asked but expect someone to pull out a crystal ball. Someone very well may know the answer but needs some of that additional information to make the connection. If you feel the problem is in FreeNAS, then you should submit a bug report; however, be prepared to answer the questions I've asked and to state what steps you have taken to eliminate your hardware as a suspect. Reboots are typically hardware related for a product that is this mature, but I only say typically; I cannot rule out the software. And yes, I know it's happening on two identical servers, which will draw suspicion to software. However, again, it could be hardware: maybe each server has bad RAM or is being overclocked. Maybe it's an Ethernet cable. Please run the tests to rule those items out. See if FreeBSD has issues with your hardware (you will need to know your specific CPU).

And seriously, Good Luck.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Thank you for that constructive criticism. I was referring to the Dell PowerEdge 2950 series servers in general; if you do a forum search for '2950' you will see what I am talking about.
I have only run the integrated testing tools on the servers before reusing them for FreeNAS, and those did not run for more than two or three hours. You may be absolutely correct that I have hardware issues on both servers; they are old.
I have been researching the hardware I have, and almost always there are some errors associated with it. My CPU is a Xeon E5450, if it matters.
On the ethernet cable issue, would you suggest the one on the management interface or all of them?

I have also already read the articles about 'at' and 'atrun', but they do not explain what FreeNAS schedules using 'at'. I would really love some insight into how it is used in this situation, even if it is not the source of this issue. I am not looking to ruffle any feathers; I honestly don't know the internal workings of FreeBSD or FreeNAS, which is why I posted the questions.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Tim Sievers said: "I have also already read the articles about 'at' and 'atrun', but they do not explain what FreeNAS schedules using 'at'."
The bottom of the linked page I provided tells you the files that atrun will look at, and those files will show you the jobs it's trying to run. Not sure if anything will survive the reboot, but that's the best I have on that path.
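For anyone who wants to watch that job directory over time, here is a minimal sketch of listing whatever is queued. The default path /var/at/jobs is an assumption taken from the FILES section of the FreeBSD at(1) manual page; verify it on your own system. The function name list_at_jobs is hypothetical, and reading the directory will usually require root.

```python
from pathlib import Path


def list_at_jobs(spool_dir="/var/at/jobs"):
    """Return (name, size_in_bytes, mtime) for each queued job file.

    Assumption: spool_dir is where at(1) stores pending jobs; the
    default is a guess based on the FreeBSD manual page.
    Hidden files such as a lock file are skipped.
    """
    spool = Path(spool_dir)
    if not spool.is_dir():
        # Nothing to report if the directory does not exist here.
        return []

    jobs = []
    for entry in sorted(spool.iterdir()):
        if entry.name.startswith(".") or not entry.is_file():
            continue
        info = entry.stat()
        jobs.append((entry.name, info.st_size, info.st_mtime))
    return jobs
```

Running this from a periodic script and logging the result to the external syslog server would show whether any job ever appears in the queue before a reboot.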

Tim Sievers said: "On the ethernet cable issue, would you suggest the one on the management interface or all of them?"
I'd run the RAM and CPU tests first, as they are most likely to show an error. When it comes to the Ethernet cables, I still think it's a long shot, but I'd just start at one server and replace the cable. Wait to see if the problem occurs again. Work your way through all the cables. If the servers are in locations where that is not practical, bring them together and minimize the number of cables/switches. Again, do this after the RAM and CPU tests. I did notice how old those CPUs are, but really they had a short run of 3 years (2007 to 2010), so they are not that old. The key is how hard they were pushed and how hot they got due to fan failures.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Everything tested fine when I ran the tests (I only ran them on one server due to usage). I have not had a repeat of the issue since I performed the updates mentioned above. I am going with it being a software issue.
 