SOLVED Daily page fault causing 3AM crash after powerloss

Status
Not open for further replies.

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
[SOLVED] Simply had to reinstall FreeNAS on a new USB stick and import my old config.

Hi all,

I think I'm currently having a hardware problem that crashes my server every night at 3AM due to a previous powerloss.

History
Everything used to work just fine. It had uptimes in 100s of days, only going down whenever I scheduled a FreeNAS update (9.0 > 9.3 > 9.10 & updates for those trains - always STABLE).
I'm running 4 jails: 2 VIMAGE ones (one for Iodine, one for OpenVPN), 2 normal ones (one that just downloads stuff, and one that offers a media server to the LAN).

Machine
  • Memory stick with FreeNAS installation (FreeNAS 9.10 stable w/ latest updates installed): SanDisk 16GB Cruzer Fit Flash Drive
  • Mobo+CPU: C2750D4I (CPU Avoton C2750 Octa-Core Processor)
  • RAM: 4x Kingston Technology ValueRAM 8GB 1600MHz DDR3L PC3-12800 ECC CL11 DIMM 1.35V
  • Disks: 6x WD Red 4TB (config RAIDZ2) + Samsung 840 series Pro 256GB
  • PSU: Silverstone SST-ST45SF-G
Problem
They were doing public works in my street on Thursday 2016-06-09 around noon. They cut off electricity, and the server crashed. I don't think it was doing anything in particular at that moment...
I turned the machine back on by midnight (1AM), and started all the jails. In the morning I noticed nothing happened overnight, and that the jails (apart from the autostarting iodine one) were turned off. By that time I was at work and apart from a slight unease I just started everything back up and didn't think back on it.
On Saturday I noticed the exact same thing happened again. I just let it go on for a few days, turning off features and scheduled tasks. But it kept on happening.
Tuesday 2016-06-14 I discovered a single dump for every day the server crashed in /data/crash : all on about the same timestamp (ranging from 3:03 to 3:05).
Here are the interesting parts of those dumps:
Jun11.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821c8a64
stack pointer = 0x28:0xfffffe0860792b90
frame pointer = 0x28:0xfffffe0860793920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 13036 (pfctl)
Jun12.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821c89f4
stack pointer = 0x28:0xfffffe0860c03b90
frame pointer = 0x28:0xfffffe0860c04920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 69265 (pfctl)
Jun13.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821d89f4
stack pointer = 0x28:0xfffffe0860de3b90
frame pointer = 0x28:0xfffffe0860de4920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 39449 (pfctl)
Jun14.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821d89f4
stack pointer = 0x28:0xfffffe0860a49b90
frame pointer = 0x28:0xfffffe0860a4a920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 98598 (pfctl)
Jun15.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821d89f4
stack pointer = 0x28:0xfffffe0860743b90
frame pointer = 0x28:0xfffffe0860744920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 56035 (pfctl)
Jun16.dump.gz
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 0e
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff821d89f4
stack pointer = 0x28:0xfffffe0860966b90
frame pointer = 0x28:0xfffffe0860967920
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 71528 (pfctl)
So it turns out some automated script that calls 'pfctl' runs at (probably?) 3AM, part of the usual FreeNAS housekeeping. But it causes a crash every single time, in exactly the same spot.
I've updated FreeNAS in between of the crashes, but nothing changed.

Cause?
I've done my fair share of googling since the crashes started happening. And every link ends up being about faulty hardware.
e.g.
https://forums.freenas.org/index.php?threads/fatal-trap-12-page-fault-while-in-kernel-mode.10268/ : similar error, but not related to pfctl
https://forums.freebsd.org/threads/16575/: FreeBSD, no solution for me
https://forums.freenas.org/index.ph...rap-12-page-fault-while-in-kernel-mode.25542/ : similar error, but not related to pfctl - was a corrupt flashdrive
https://forums.freenas.org/index.php?threads/please-help-me-fatal-trap-12.10321/ : not pfctl - RAM issues
https://forums.freenas.org/index.ph...ult-while-in-kernel-mode-cifs-transfer.12598/ : not pfctl - RAM issues
https://forums.freenas.org/index.php?threads/fatal-trap-12-page-fault-while-in-kernel-mode.24667/ : possibly corrupt flashdrive (never came back to confirm)
https://bugs.pcbsd.org/issues/3912 - 3AM reboot, too little RAM
https://bugs.pcbsd.org/issues/3789 : exactly what I'm experiencing, but was a bug and should have been fixed since then (and it never happened before the powerloss)

I've been running memtest on the machine for over a full day now, but it hasn't turned up anything yet. Memtest even says "nothing found, press escape to exit" but I'll keep it running for a bit longer. I've attached a screenshot from this morning (now over 24h and nothing changed).

Could anything but RAM cause my problem? I'll try burning, and running a fresh copy of freenas on a new stick if nothing turns up. Maybe the flash drive itself got damaged by the powerloss (is that a thing?).

Any help/insight/suggestions/... is welcome. More diagnostic tests apart from memtest are welcome.

Thank you
 

Attachments

  • jviewer.jpg
    jviewer.jpg
    53.3 KB · Views: 290
Last edited:

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
HAve you tried reinstalling freenas on a new boot drive and uploading a backed up config?
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
That's what I'm currently doing. Kept running memtest, but still no errors.

I've downloaded the latest FreeNAS version (FreeNAS-9.10-STABLE-201606072003.iso) and put it on a new memory stick.
Since when do you need two usb sticks to install FreeNAS anyways? If I recall correctly my initial setup just needed 1 usb key where all the files got extracted to. Now I need one key with the installer, and one key that serves as the destination.
It's been installed on the *new* stick (which hopefully works), and it's currently booting from that. When it ends generating a private key (this is taking forever), I'll upload my old config asap.
And then see if it survives the night.

Thank you for the suggestion.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Now I need one key with the installer, and one key that serves as the destination.
A wild guess is that this came with the patch when also the system boot device was converted to run ZFS.


Good luck!
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Hmm, that makes sense I suppose.
System is all up and running now. Booted, uploaded config, rebooted, and done. Running on the new stick now without seemingly any problems.

Haven't enabled any of the VIMAGE jails, but the two regular ones are running. Hopefully it won't crash tonight and then I can start enabling services one by one.

Got a weird warning when importing my zpool (attached), but I don't notice anything wrong. zpool status shows no problems.
 

Attachments

  • importwarning.jpg
    importwarning.jpg
    91.3 KB · Views: 319

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
A wild guess is that this came with the patch when also the system boot device was converted to run ZFS.


Good luck!
Not wild at all.

In any case, it's much simpler to just mount the ISO over IPMI. There's no advantage in bothering to copy the installer to a USB drive.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Yeah, I'd have. But whenever I mount something in the IPMI java app nothing seems to happen on the host. So I always need to result to using memory sticks & physical access.

Anyways, FreeNAS survived the night!
Both normal jails were still running, uptime is correct, and nothing in /data/crash.
Will enable the VIMAGE jails one by one and see if it keeps surviving the night.

But I'm cautiously optimistic.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
My guess is that the powerloss caused some sort of issue on the usbstick, affecting sectors relevant to the files at hand. The corrupted files were loaded, after restart into the system memory, not getting caught in some check.
From that assumption, this could have been alleviated by having mirror usb sticks, with an appropriate scrub schedule. Or, at least, the problems would've surfaced during scrub.
With a reinstall on fresh usb stick, the corruption problem is solved.

But, these are just guesses. Me trying to make sense of the situation.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Thank you, that sounds like it could be right. I'm just hoping for no reboots tonight with one VIMAGE jail running.

I'd never thought about mirroring the USB stick. But after reading this thread, I think I might go instead for a small SSD.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
The recent 'trend' on the forums is to move on from mirrored USB sticks towards a single SSD. Part of the reason is reliability of SSD's are <that much> better than the USB's sticks, probably ever increasingly so since ZFS is used on the boot drive. Not to mention the plummeting prices on SSD devices. At least where I live, 2x 16GB SanDisk Cruzer usb sticks are priced on par with a SSD.
Another reason for the trend to move away from usb mirrored sticks is the messy procedure to find out which stick failed. Due to manufacturing labels on the drives, they are not as easily identifiable as HDD devices. Which renders the 'replace the failed drive' a guessing game.
Add to this, the (normally) lack of drama when uploading an old configuration file to a fresh install...

I run a SSD too, but, I must admit I'm such a fan of the scrub functionality I'd love to run mirrors just for the sake of having that scrub functionality.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
I'll just need to keep an eye out for a cheap one then. Would feel bad using an SSD with >128GB just for FreeNAS.

I've just finished checking the old memory stick with rufus. But no errors or bad sectors were reported.
So huh.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
But no errors or bad sectors were reported.
A bitflip would not necessarily alarm as a hardware failure, in such analysis.
The stick may still be physically fine, yet the data at that point of the file could've suffered corruption.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Would feel bad using an SSD with >128GB just for FreeNAS.
Don't think about the wasted capacity, think about the cost. If a 128GB SSD is $30 and a 60GB SSD is $32, I can tell you which one I'll pick.

Also, I've been pushing the SSD boot device for a long time because they are more reliable.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Don't think about the wasted capacity, think about the cost. If a 128GB SSD is $30 and a 60GB SSD is $32, I can tell you which one I'll pick.

Also, I've been pushing the SSD boot device for a long time because they are more reliable.

I've still got a box or two with an actual hardware RAID controller and a pair of SSD's for boot, dating from back in the FreeNAS 8 days. Sweet, trouble-free booting.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Soon I'll have that sweet RAID setup too, just on my ESXi system thanks to @jgreco . Just waiting on a 256GB SSD to arrive to pair with my current SSD, then I'll be all set to take the system all apart and build it all back up. I also want to see how much dust the case may have collected, if any.

EDIT: What I didn't see were any postings about the need for a properly configured UPS system which would have allowed the original problem to be averted.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Interesting. I've just checked the documentation for my motherboard, and apparently there is a free PCIE port on it ("PCIe 2.0 x8").
So I'll probably just stick an SSD (adapter) in there for my boot drive. That way I can keep my last SATA port free.

About the UPS: it was configured once, but I never got to see it work, and every so often I got spam once the daemon lost connection (not sure how) with the UPS.
After a few times I just disabled the UPS support (but kept it hooked up - was still useful for small interruptions).
The first thing I did today was actually start the UPS service, but I'm still getting some errors. Will need to look into that once I got some more spare time.

Jun 19 10:28:43 fnas notifier: Network UPS Tools - Riello USB driver 0.03 (2.7.3)
Jun 19 10:28:43 fnas notifier: Warning: This is an experimental driver.
Jun 19 10:28:43 fnas notifier: Some features may not function correctly.
Jun 19 10:28:43 fnas notifier:
Jun 19 10:28:46 fnas notifier: Network UPS Tools - UPS driver controller 2.7.3
Jun 19 10:28:46 fnas notifier: Starting nut.
Jun 19 10:28:46 fnas notifier: fopen /var/db/nut/upsd.pid: No such file or directory
Jun 19 10:28:46 fnas notifier: listening on 127.0.0.1 port 3493
Jun 19 10:28:46 fnas notifier: Connected to UPS [NPW800]: riello_usb-NPW800
Jun 19 10:28:46 fnas notifier: /usr/local/etc/nut/cmdvartab not found - disabling descriptions
Jun 19 10:28:46 fnas notifier: Starting nut_upsmon.
Jun 19 10:28:46 fnas notifier: fopen /var/db/nut/upsmon.pid: No such file or directory
Jun 19 10:28:46 fnas notifier: Unable to use old-style MONITOR line without a username
Jun 19 10:28:46 fnas notifier: Convert it and add a username to upsd.users - see the documentation
Jun 19 10:28:46 fnas notifier: Fatal error: unusable configuration
Jun 19 10:28:46 fnas notifier: Network UPS Tools upsmon 2.7.3
Jun 19 10:28:46 fnas root: /usr/local/etc/rc.d/nut_upsmon: WARNING: failed to start nut_upsmon
Jun 19 10:28:46 fnas notifier: /usr/local/etc/rc.d/nut_upsmon: WARNING: failed to start nut_upsmon
Jun 19 10:28:46 fnas notifier: Starting nut_upslog.

EDIT: Went back to square one for this one: http://doc.freenas.org/9.10/freenas_services.html#ups
Seems fixed now. I think the problem was lacking a password for upsmon (was blank). I simply put in 'fixmepass' and done!
Jun 19 21:57:03 fnas notifier: Network UPS Tools - Riello USB driver 0.03 (2.7.3)
Jun 19 21:57:03 fnas notifier: Warning: This is an experimental driver.
Jun 19 21:57:03 fnas notifier: Some features may not function correctly.
Jun 19 21:57:03 fnas notifier:
Jun 19 21:57:05 fnas notifier: Network UPS Tools - UPS driver controller 2.7.3
Jun 19 21:57:05 fnas notifier: Starting nut.
Jun 19 21:57:05 fnas notifier: fopen /var/db/nut/upsd.pid: No such file or directory
Jun 19 21:57:05 fnas notifier: listening on 127.0.0.1 port 3493
Jun 19 21:57:05 fnas notifier: Connected to UPS [ups]: riello_usb-ups
Jun 19 21:57:05 fnas notifier: /usr/local/etc/nut/cmdvartab not found - disabling descriptions
Jun 19 21:57:05 fnas notifier: Starting nut_upsmon.
Jun 19 21:57:05 fnas notifier: fopen /var/db/nut/upsmon.pid: No such file or directory
Jun 19 21:57:05 fnas notifier: UPS: ups (master) (power value 1)
Jun 19 21:57:05 fnas notifier: Using power down flag file /etc/nokillpower
Jun 19 21:57:05 fnas notifier: Starting nut_upslog.

[root@fnas] /# ps aux | grep uucp | grep -v grep
uucp 11624 0.0 0.0 23088 2636 - Ss 9:57PM 0:00.17 /usr/local/libexec/nut/riello_usb -a ups
uucp 11626 0.0 0.0 45392 2380 - Ss 9:57PM 0:00.01 /usr/local/sbin/upsd
uucp 11675 0.0 0.0 20812 2368 - S 9:57PM 0:00.01 /usr/local/sbin/upsmon localhost
uucp 11712 0.0 0.0 20792 2344 - Ss 9:57PM 0:00.00 /usr/local/bin/upslog -s ups -l /var/log/ups.log -i 300

Would it matter what password you fill in there? Isn't it a local connection anyways? If it is important: why is the password prompt just plaintext?

EDIT#2: Just tried to test the UPS. Removed the power to it, it beeped a lot. I had a shell open, and the following message was displayed:
Broadcast Message from root@fnas
(no tty) at 22:13 CEST...

Communications with UPS ups lost
upsc is now reporting a driver error
[root@fnas] /# upsc ups@localhost
Error: Driver not connected
And now I keep receiving "NOCOMM - ups" mails every 5 minutes... so I suddenly remember why I turned it off in the first place. Because I need to stop the service in order to stop the spam.
Jun 19 22:13:13 fnas riello_usb[11624]: Permissions problem: Device not configured
Jun 19 22:13:13 fnas upsd[11626]: Can't connect to UPS [ups] (riello_usb-ups): No such file or directory
Jun 19 22:13:14 fnas ugen0.3: <Cypress Semiconductor> at usbus0 (disconnected)
Jun 19 22:13:14 fnas uhid0: at uhub1, port 2, addr 3 (disconnected)
Jun 19 22:13:14 fnas ugen0.3: <Cypress Semiconductor> at usbus0
Jun 19 22:13:14 fnas uhid0: <Cypress Semiconductor USB to Serial, class 0/0, rev 1.00/0.00, addr 3> on usbus0
Jun 19 22:13:17 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:13:17 fnas upsmon[11675]: Communications with UPS ups lost
Jun 19 22:13:22 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
... repeated every 5s
Jun 19 22:18:09 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:18:13 fnas upsd[11626]: Can't connect to UPS [ups] (riello_usb-ups): No such file or directory
Jun 19 22:18:14 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:18:14 fnas upsmon[11675]: UPS ups is unavailable
Jun 19 22:18:19 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
... repeated every 5s
Jun 19 22:23:12 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:23:15 fnas upsd[11626]: Can't connect to UPS [ups] (riello_usb-ups): No such file or directory
Jun 19 22:23:17 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:23:17 fnas upsmon[11675]: UPS ups is unavailable
Jun 19 22:23:23 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
... repeated every 5s
Jun 19 22:28:15 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:28:15 fnas upsd[11626]: Can't connect to UPS [ups] (riello_usb-ups): No such file or directory
Jun 19 22:28:20 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
Jun 19 22:28:20 fnas upsmon[11675]: UPS ups is unavailable
Jun 19 22:28:25 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected
... repeated every 5s
Jun 19 22:30:11 fnas upsmon[11675]: Poll UPS [ups] failed - Driver not connected

Apparently the driver crashes when the power gets disconnected, because now I get
[root@fnas] /# ps aux | grep uucp | grep -v grep
uucp 11626 0.0 0.0 45392 2380 - Ss 9:57PM 0:00.10 /usr/local/sbin/upsd
uucp 11675 0.0 0.0 20812 2368 - S 9:57PM 0:00.07 /usr/local/sbin/upsmon localhost
uucp 11712 0.0 0.0 20792 2368 - Is 9:57PM 0:00.01 /usr/local/bin/upslog -s ups -l /var/log/ups.log -i 300
And "/usr/local/libexec/nut/riello_usb -a ups" is missing.
Starting it again, fixes the errors:
[root@fnas] /# /usr/local/libexec/nut/riello_usb -a ups
Network UPS Tools - Riello USB driver 0.03 (2.7.3)
Warning: This is an experimental driver.
Some features may not function correctly.


Broadcast Message from root@fnas
(no tty) at 22:34 CEST...

Communications with UPS ups established
 
Last edited:

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Just a quick update. 2 days later, still 0 reboots. And everything is up and running again for a whole day now.
Seems like installing FreeNAS on a new USB stick and uploading the old config file worked.

My next 2 personal ToDos are: fix UPS setup & buy another SSD and use that one as a bootdrive.

Thank you all
 
Status
Not open for further replies.
Top