Spontaneous Reboots

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Hello,

I have had two separate reboots, ~6h apart. The error messages in the emails are as follows:

New alerts:
* SRV-STRG-002.core.testdev had an unscheduled system reboot.
The operating system successfully came back online at Sun Mar 27 02:37:48 2022.

How do I diagnose this?
Disks:
2x 250GB WD Blues (for OS, in Mirror)
5x 10TB WD Red Plus (NAS)
1x 1TB Samsung SSD (Write Cache)
1x 250GB Samsung SSD (Read Cache)

CPU:
Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz

Mobo:
ASUS WS C246M PRO uATX Server Motherboard LGA 1151 Intel C246
https://www.newegg.ca/asus-ws-c246m...rocesso/p/N82E16813119149?Item=9SIA7BB86S6417

Memory:
Kingston 8GB DDR4 ECC PC4-2666 Server Ram Memory Module CL19 (1x 8GB)
https://www.memoryexpress.com/Products/MX00115597

NIC: (I'm debating returning the NIC, based on nothing but a hunch.)
10Gtek for X540-T2 10GbE Converged Network Adapter(CNA), Dual Copper RJ45 Port, PCI Express 2.1 X8, Compare to Intel X540-T2

PSU:
800W Corsair PSU


/var/crash Directory
root@SRV-STRG-002[~]# ls -hal /var/crash/
total 6
drwxr-x--- 2 root wheel 64B Mar 26 08:32 .
drwxr-xr-x 28 root wheel 1.8K Mar 27 16:33 ..
-rw-r--r-- 1 root wheel 5B Mar 26 08:32 minfree
root@SRV-STRG-002[~]# cat /var/crash/minfree
2048
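
Since there's nothing in there yet (as I understand it, minfree just tells savecore how many KB of free space to leave on that filesystem), I also want to confirm a dump device is configured at all. A sketch of what I plan to run:

# list the configured kernel dump device, if any (dumpon -l on FreeBSD 12+)
dumpon -l
# FreeBSD typically dumps to a swap partition; list those too
swapinfo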


SOLUTION THAT WORKED FOR ME
 
Last edited:

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
In all likelihood, you're seeing power issues. Do you have a UPS?

Also, look through /var/log/console.log to see if there's anything that matches the times of your reboots.
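
For example, something like this pulls the window around each event (stock CORE log path; adjust the timestamp to match your reboot):

# sketch: find console.log lines near the 02:37 reboot
grep -n 'Mar 27 02:3' /var/log/console.log
# or page through the log starting at that point
less +/'Mar 27 02:3' /var/log/console.log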
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
In all likelihood, you're seeing power issues. Do you have a UPS?

Also, look through /var/log/console.log to see if there's anything that matches the times of your reboots.

First off, thanks for the effort.

Power Issues:
- It's a brand-new, quality PSU.
- Well cooled; nothing over 50°C even in use.
- 1000W UPS at 65% load. Power has also been steady on that circuit for all other connected systems; this is the only system to fail.



/var/log/console.log

Not much to show really; there's nothing preceding the event either.


WARNING
SRV-STRG-002.core.testdev had an unscheduled system reboot. The operating system successfully came back online at Sun Mar 27 02:37:48 2022.
2022-03-27 02:37:48 (America/Edmonton)

[Start of Log]
Mar 27 00:00:00 SRV-STRG-002 newsyslog[3386]: logfile turned over due to size>100K
Mar 27 02:38:13 SRV-STRG-002 Starting devd.
Mar 27 02:38:13 SRV-STRG-002 Autoloading module: ig4.ko
Mar 27 02:38:13 SRV-STRG-002 Autoloading[1164]: Last message 'module: ig4.ko' repeated 1 times, suppressed by syslog-ng on SRV-STRG-002.core.testdev
Mar 27 02:38:13 SRV-STRG-002 Starting zfsd.
Mar 27 02:38:13 SRV-STRG-002 <118>middlewared: starting



WARNING
SRV-STRG-002.core.testdev had an unscheduled system reboot. The operating system successfully came back online at Sun Mar 27 16:33:24 2022.
2022-03-27 16:33:24 (America/Edmonton)

Mar 27 02:38:20 SRV-STRG-002 Syncing multipaths...
Mar 27 02:38:21 SRV-STRG-002 Configuring vt: blanktime.
Mar 27 02:38:21 SRV-STRG-002 Starting cron.
Mar 27 02:38:21 SRV-STRG-002 net.inet.carp.allow: 0 -> 1
Mar 27 02:38:21 SRV-STRG-002
Mar 27 02:38:21 SRV-STRG-002 Sun Mar 27 02:38:21 MDT 2022
Mar 27 16:33:53 SRV-STRG-002 Starting devd.
Mar 27 16:33:53 SRV-STRG-002 Autoloading module: ig4.ko
Mar 27 16:33:53 SRV-STRG-002 Autoloading[1164]: Last message 'module: ig4.ko' repeated 1 times, suppressed by syslog-ng on SRV-STRG-002.core.testdev
Mar 27 16:33:53 SRV-STRG-002 Starting zfsd.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your power supply won't protect you from sags, brownouts, or power outages. This still smells like input power drops.
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Your power supply won't protect you from sags, brownouts, or power outages. This still smells like input power drops.
Totally agree, and I would love it if it were that simple, but this is a Dell 1000W (model) UPS backed by 7.2h of battery runtime. In addition, the line is monitored for those events by a power quality monitor, with no new events on that monitor or in the UPS event log (it turns red on new events).

In addition, my area has no heavy industry, large inductive motors, or lightning strikes that could cause power sags, spikes, or other oddities. The power only goes out in winter storms, and none of that has happened.

The last occurrence happened midday (16:00) with no effect on lights or other electronics.

I used to be an electrician, so trust me, I would rather troubleshoot that, but it's pretty locked down.
 
Last edited:

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Checked the BIOS: no updates available. All rails are within normal voltages (±2%); the 12V rail was a little high at 12.2V, but that's fine at +1.6%.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Are there any core files in /data/crash?
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Two and a half passes of memtest86 and I'm pretty sure it's not a memory issue; no overclocking, of course.

/data/crash

There's some good info here I think, thanks!

Files
===========================================
bounds : "2"
info.0
info.1
info.last@ -> info.1

textdump.tar.0.gz
textdump.tar.1.gz
textdump.tar.last.gz@ -> textdump.tar.1.gz
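
To read them I unpacked the latest textdump, roughly like this (assuming the stock FreeBSD textdump layout, where the tarball holds panic.txt, msgbuf.txt, ddb.txt, and so on):

cd /data/crash
# the .gz entries are gzipped tarballs; "last" is a symlink to the newest one
tar -xzf textdump.tar.last.gz
cat panic.txt    # the one-line panic string
less msgbuf.txt  # kernel messages leading up to the panic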
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1x 1TB Samsung SSD (Write Cache)
1x 250GB Samsung SSD (Read Cache)

Remove these. We don't recommend L2ARC until you have 64GB RAM at a minimum, and below 32GB, it is almost always detrimental unless you can actually explain to me why it isn't (rare but possible). You can definitely cause panics with wildly imbalanced and ill-advised L2ARC configurations.

ZFS does not have a "write cache" on SSD; your write cache is your system memory. If you have added a SLOG device thinking it will make things faster, it will not, at least, not in the way most people envision. Please see the "Some insights into SLOG/ZIL" article.
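
Removal is non-destructive and can be done live; roughly like this, with placeholder pool/device names (check yours with zpool status first):

zpool status tank                 # note the exact device names under "cache" and "logs"
zpool remove tank <cache-device>  # drops the L2ARC
zpool remove tank <log-device>    # drops the SLOG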
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Remove these. We don't recommend L2ARC until you have 64GB RAM at a minimum, and below 32GB, it is almost always detrimental unless you can actually explain to me why it isn't (rare but possible). You can definitely cause panics with wildly imbalanced and ill-advised L2ARC configurations.

ZFS does not have a "write cache" on SSD; your write cache is your system memory. If you have added a SLOG device thinking it will make things faster, it will not, at least, not in the way most people envision. Please see the "Some insights into SLOG/ZIL" article.
Done, should I retest now?

EDIT: Not to ping anyone, I'm just going to edit here: I'm going to take a day or two to accomplish this, as I want to add more raw storage and need parts. Thanks so much for the help.
 
Last edited:

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
I removed both the L2ARC and SLOG and added 3x 10TB drives, for a total of 8 in RAIDZ, which I suspect has removed the bug.

No crash yet; it's been 10h @ 250MB/s, so it is under load too.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So if you've got 80TB of disk on there, just be aware that you're really probably supposed to have 64GB of RAM, though 32GB is probably workable in a pinch. This will become more of a problem as your pool fills.
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
So if you've got 80TB of disk on there, just be aware that you're really probably supposed to have 64GB of RAM, though 32GB is probably workable in a pinch. This will become more of a problem as your pool fills.
Thanks, I'll look at upgrading in the near future. Can you elaborate on why that much RAM is required?

In addition, unfortunately, the issue has occurred again. The crash logs show the same error:

Dump header from device: /dev/ada7p1
Architecture: amd64
Architecture Version: 4
Dump Length: 350208
Blocksize: 512
Compression: none
Dumptime: Tue Mar 29 13:46:15 2022
Hostname: SRV-STRG-002.core.testdev
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-RELEASE-p12 ec84e0c52a1(HEAD) TRUENAS
Panic String: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, &zio->io_bookmark, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, psize, zio->io_abd, eabd, &no_crypt))
Dump Parity: 1866606199
Bounds: 0
Dump Status: good

So this is something about crypto? The pool is encrypted. Also, since the last test I added a 40mm fan to the southbridge chipset, just 'cause "why not?". I'm not sure, but I don't think this is an overheating issue; it's well cooled in general.
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
Additional messages:
<118>Configuring vt: blanktime.
<118>Starting cron.
<118>net.inet.carp.allow: 0 -> 1
<118>
<118>Mon Mar 28 21:27:44 MDT 2022
<6>ix0: link state changed to UP
<6>pid 1061 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
<6>arp: 10.40.3.1 moved from 80:61:5f:0e:8e:2a to 2c:4d:54:ec:94:cf on igb0
<6>arp: 10.40.3.1 moved from 2c:4d:54:ec:94:cf to 80:61:5f:0e:8e:2a on igb0
<6>arp: 10.40.3.1 moved from 2c:4d:54:ec:94:cf to 80:61:5f:0e:8e:2a on igb0
<6>arp: 10.40.3.1 moved from 2c:4d:54:ec:94:cf to 80:61:5f:0e:8e:2a on igb0
<6>arp: 10.40.3.1 moved from 2c:4d:54:ec:94:cf to 80:61:5f:0e:8e:2a on igb0
panic: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, &zio->io_bookmark, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, psize, zio->io_abd, eabd, &no_crypt)) failed (0 == 5)

cpuid = 3
time = 1648583175
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00fae8b890
vpanic() at vpanic+0x17b/frame 0xfffffe00fae8b8e0
spl_panic() at spl_panic+0x3a/frame 0xfffffe00fae8b940
zio_encrypt() at zio_encrypt+0x609/frame 0xfffffe00fae8b9d0
zio_execute() at zio_execute+0x6a/frame 0xfffffe00fae8ba20
taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe00fae8ba80
taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe00fae8bab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00fae8baf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00fae8baf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I got nothin' for you on the crypto.

ZFS requires the ability to cache metadata efficiently. Unlike the normal filesystems you may be used to, where you might have a handful of gigabytes or even a terabyte, ZFS has to slog through huge amounts of filesystem metadata, and the conventional wisdom is that you should shoot for 1GB of RAM per TB of storage. In practice, you can get by with a somewhat lower ratio once you're at a few dozen GB of RAM, and I probably would not be shocked if someone managed a petabyte on a 256GB or 384GB RAM system. Once you get out there, you have to observe what's going on and size according to the observed behaviour.
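
If you want raw numbers to watch, the ARC counters are exposed as sysctls on CORE; a sketch, using FreeBSD's kstat.zfs.misc.arcstats names:

# ARC size vs. ceiling, plus hit/miss counters -- the hit rate is the one to trend
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses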
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
I got nothin' for you on the crypto.

ZFS requires the ability to cache metadata efficiently. Unlike the normal filesystems you may be used to, where you might have a handful of gigabytes or even a terabyte, ZFS has to slog through huge amounts of filesystem metadata, and the conventional wisdom is that you should shoot for 1GB of RAM per TB of storage. In practice, you can get by with a somewhat lower ratio once you're at a few dozen GB of RAM, and I probably would not be shocked if someone managed a petabyte on a 256GB or 384GB RAM system. Once you get out there, you have to observe what's going on and size according to the observed behaviour.
What metrics would you monitor?
 

ISJ

Dabbler
Joined
Aug 4, 2019
Messages
45
I've made a JIRA ticket here, but I'm not sure if that's appropriate, so...
 
Last edited: