SOLVED Unexpected reboots

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Since updating from 11.2 to 11.2-U1, my FreeNAS server has rebooted on its own five times, one of those times after reverting to 11.2. Nothing obvious in the logs either.

Background:
I bought a few WD Easystore drives to shuck and resilver into my pool to expand it. After spending most of the last week running badblocks on them (which completed with no issues), I resilvered two of them into the pool yesterday. That ultimately completed, though it took much longer than I'd expected (perhaps trying to resilver two drives at the same time wasn't a good plan). Since I have plenty of spare bays in my server, the two replacement drives were in two of them. So last night, I pulled the two replaced drives out of my system to move the new drives into their "permanent" slots. The disks were showing in the GUI as da20 and da21. Offline da21 through the GUI, sesutil locate da21 on to make sure I'm pulling the right one, pull the disk, put it into the right bay, online through the GUI, sesutil locate all off (I didn't use da21 off because da21 was no longer in the same place, and I thought it might get confused). At this point, all the system fans stopped, and eventually the locate light was turned off. After giving it a few seconds to resilver da21, I then moved da20 using a similar process, though I ran sesutil locate da20 off once I'd confirmed da20's location.
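A rough sketch of that move sequence in Python, for illustration only. The function names are hypothetical, the offline/online steps stay in the GUI as described, and the last step assumes the per-disk off form that was used for da20 rather than all off:

```python
def locate_cmd(disk: str, state: str) -> list[str]:
    """Build a `sesutil locate` command line for one disk."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    return ["sesutil", "locate", disk, state]


def move_disk_steps(disk: str) -> list[str]:
    """Return the ordered steps for moving one disk to another bay.

    Nothing is executed here; the steps are only listed so they can be
    reviewed (or printed as a checklist) before touching the hardware.
    """
    return [
        f"GUI: offline {disk}",
        " ".join(locate_cmd(disk, "on")),   # light the bay to pull
        f"manual: pull {disk} and seat it in its permanent bay",
        f"GUI: online {disk}",
        " ".join(locate_cmd(disk, "off")),  # per-disk off, as with da20
    ]


if __name__ == "__main__":
    for step in move_disk_steps("da20"):
        print(step)
```

The device name may not be the same after the disk is reseated, so the last step only makes sense once the disk's location has been confirmed.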

The system fans hadn't restarted yet, but I expected they'd start back up as temps rose--the server's in a pretty cool location right now (and in a detached building from my house). About an hour later, I found that my assumption was wrong when I got a temperature alert email on a couple of my disks. So I logged into the IPMI admin page, set the fans to a higher speed, and checked again after a while. Surprisingly, disk temps didn't seem to be dropping. Since the system was prompting me to update to 11.2-U1 anyway, I went ahead and ran that update. Once it completed, I powered down the server, powered it back up, and my temp problems appeared solved.

When I got up this morning, I had three emails about uncommanded system restarts (side note--those emails give the time in UTC, not local time--very confusing). Checking system uptime confirmed that the system had really restarted. Not good. A little later in the morning, it happened again. Figuring it was something to do with 11.2-U1, I reverted to 11.2-RELEASE and rebooted. About an hour ago, the system rebooted again.
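Because the alert emails carry UTC times, a small helper can translate them when lining them up against local logs. A minimal sketch, assuming the timestamp has been copied out as YYYY-MM-DD HH:MM:SS (the real email format may differ) and using a fixed UTC offset rather than a named time zone; the function name and the sample timestamp are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def utc_to_local(stamp: str, offset_hours: float) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC timestamp to a fixed-offset local time."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    local = utc.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical restart time, shown for a UTC-5 location.
    print(utc_to_local("2019-01-20 03:15:00", -5))
```

A fixed offset ignores daylight saving time, so the offset has to match the date of the email.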

Troubleshooting:
IPMI event logs show nothing at all from the relevant timeframe. The system is attached to a 3 kVA UPS, which shows no history of power issues during the relevant period. The system has dual redundant power supplies, each with far more capacity than is necessary to run it in its current configuration. The system log doesn't appear to have anything interesting either. There's simply nothing at all for 3.5 hours preceding the last reboot, but here's the log from that (too big to include, pastebin here).

System configuration:
SuperMicro SuperStorage Server 6047R-E1R36L (Motherboard: X9DRD-7LN4F-JBOD, Chassis: SuperChassis 847E16-R1K28LPB)
2 x Xeon E5-2670, 128 GB RAM, Chelsio T420E-CR
Pool: 6 x 6 TB RAIDZ2, 6 x 4 TB RAIDZ2, (2 x 2 TB + 4 x 3 TB) RAIDZ2
Jails: Plex Media Server, Urbackup, Transmission/SABNZBd+/Sonarr/Radarr, BOINC
APC SUM3000RMXL2U UPS + SUM48RMXLBP2U battery pack

I'm kind of at a loss here, but not happy at all with a suddenly-unstable system. Thoughts?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478

danb35

I wonder if there's some basic incompatibility between the X9-series SuperMicro motherboards and FreeBSD/FreeNAS 11.2?
That would be very unfortunate. What it doesn't account for is about three weeks of uptime on 11.2 before yesterday. But it's still interesting to see similar issues...
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I would look into the Jails (iocage) as a possible culprit.
Around the FreeNAS 11.2 beta, I experienced several crashes (freezes but no reboot, first on my old Xeon E3 and later on my Threadripper 1900X). They seemed to be caused by jails, mostly when swap was exhausted or simply after running some intensive network transfer.
 

danb35

Thanks, I'll try disabling them and see if it changes anything. Though (as above) I haven't changed anything with my jails in the last couple of weeks, and the reporting pages are showing no swap in use over the last 24 hours.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
FWIW, I also have the X9DRD. Still on 11.2-RELEASE. Last week, I had crashes during resilver that were caused by a drive (not the ones being resilvered) giving a timeout. Taking that drive offline allowed me to continue.

 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Do you run any VMs?
 

danb35

Checking Supermicro's site, I see there's a BIOS update, pushed out earlier this year. The previous BIOS in the system was dated 2013, following the general "if it ain't broke, don't fix it" rule for BIOSs. So I installed the update (mutter mutter Supermicro mutter mutter IPMI mutter mutter license key...) and actually unplugged the power supplies for a bit, cleared out and reconfigured the BIOS settings, etc.--I'll see if that helps anything.
 

danb35

Well, I'm at 23 hours' uptime now, which is very encouraging. If I were being truly scientific, I would have only changed one thing at a time, but I made three changes at about the same time:
  • Updated BIOS from 3.0 (I think) to 3.3
  • Unplugged power as recommended by SuperMicro's instructions
  • Disabled a bunch of tunables, most of which shouldn't have been needed any more anyway
A scheduled scrub is running now. Once that finishes (assuming it completes without incident), I'll boot back into 11.2-U1 and see if things are still stable.
 

danb35

Actually ended up being quite a while longer than I'd expected for a backup to finish. I'm now at 72 hours' uptime, have just activated the 11.2-U1 boot environment and am rebooting the system. I'll check the tunables; if they're enabled I'll disable them and reboot again. But it is looking like things are fixed.
 

danb35

OK, it's now almost three days' uptime on 11.2-U1. I'm not sure if it was the hard power-off, the BIOS, the tunables, or some combination, but this seems to be resolved now.
 

DeletedUser080302028

Guest
Dev's response to my bug report suggests a bug in the mps driver. Interesting.
What is the mps driver? I'm having this same issue. An 11.2-STABLE system started rebooting a few days ago after I started a scrub.
 

danb35

The mps driver is used for at least some of the LSI SAS HBAs. According to the core dump files, the reboots on my system were triggered by sas2ircu, which is used by a monitoring script that runs every 15 minutes on my system. I've since done a couple more disk replacements, and run at least one scrub, without issues. My working hypothesis is that the sesutil locate all off put something (possibly one of the backplanes) into an unexpected state, which then resulted in sas2ircu causing a panic and rebooting the system. It wasn't until I pulled power from the system that the oddity was cleared.
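For anyone wanting to check whether their own crash dumps point at the same process, here is a minimal Python sketch that searches saved dump archives for a string such as sas2ircu. It rests on assumptions not stated above: that the dumps are tar archives named textdump.tar* and that they live in a directory like /data/crash; adjust both for your system. The function name is hypothetical.

```python
import tarfile
from pathlib import Path


def dumps_mentioning(crash_dir: str, needle: str) -> list[str]:
    """Return the names of dump archives in crash_dir whose contents mention needle."""
    hits = []
    # Assumed naming scheme: textdump.tar.0, textdump.tar.1, ...
    for archive in sorted(Path(crash_dir).glob("textdump.tar*")):
        with tarfile.open(archive) as tar:  # compression is auto-detected
            for member in tar:
                if not member.isfile():
                    continue
                text = tar.extractfile(member).read().decode("utf-8", "replace")
                if needle in text:
                    hits.append(archive.name)
                    break  # one match per archive is enough
    return hits


if __name__ == "__main__":
    # /data/crash is an assumed location, not confirmed in this thread.
    print(dumps_mentioning("/data/crash", "sas2ircu"))
```

This only reports which archives mention the string; reading the matching dump is still needed to see whether that process was actually running at the time of the panic.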
 

DeletedUser080302028

Guest
Thanks for the info! I've pulled power a few times and am still having reboots. Going to tear down the system and rebuild while swapping SATA cables.
 