Random reboots still after 1 month of hardware testing

CoreyVidal · Jul 18, 2020

Random reboots. Please help. Oh God, please send help. Or alcohol. Or a gun. I'm just feeling crushed by putting so much work into this.

Setting the scene: I'm a lifelong Windows-only user (25+ years) who was running a bunch of services like Plex off of my daily-usage desktop computer (like an IDIOT). A few months ago I got hit by hackers who encrypted all of my data. 17 TB worth (I have 2 large external DAS's; I'm a videomaker and video files are massive). I managed to recover my files, but it taught me a valuable lesson: it was time to rethink the way I was organizing my digital life. I want/need a dedicated server, dedicated NAS, and off-site storage.

I built a small server off of a new Intel NUC (here's the hardware). I finally took the plunge into Linux (Ubuntu 20.04), working in the command line for the first time in my life. It's overwhelming, but I'm getting it done. Lots of Googling and forums. I learned all about Docker. Blew my mind. It's set up and humming along nicely with about 20 containers going. A month has gone by.

Then I started building my NAS. I decided to go with FreeNAS over Unraid to get better performance because of the needs of my large video files. I'm gonna turn my 2 DAS's (6 drives × 6 TB each) into one big NAS (12 × 6 TB)!

I transferred everything off of 1 of the 2 DAS's so I could disassemble it and put its 6 drives in my NAS.

Here's the hardware of my FreeNAS system. The motherboard, processors, SAS card, and RAM are used server parts from eBay. The case, PSU, and fans are all brand new.

Updated BIOS and firmwares. Followed instructions on flashing firmware for my SAS card into IT mode. Put in the 6 drives, and finally installed FreeNAS onto an additional internal dedicated 256 GB SSD.

That night, it randomly rebooted 3 times.

Thus began a month-long journey into investigating all my hardware. Here's everything I did/tested/ran:

CPU
(Booted Hiren's BootCD from USB into Windows 10)

OCCT:
Run Linpack: the latest version (for 3+ hours)
Run OCCT: small data set, no AVX (for 3+ hours)
Run OCCT: medium data set, no AVX (for 3+ hours)
Run OCCT: large data set, no AVX (for 3+ hours)

Prime95:
Run: Blend (for 12 hours)
Run: Small (for 6 hours)

————————————————————
Memory

memtest86+:
Ran it on all 12 of my sticks

A few hours in:
RANDOM REBOOT!

Took out all 12 sticks. Tested 4 at a time. Across those runs, it corrected 5 non-critical ECC errors.

NO RANDOM REBOOTS!

Also, my server is quieter now? It used to make a very quiet chugging sound that I didn't think anything of, but that has stopped. I wonder if having unplugged and replugged in all the sticks might have fixed something?

memtest86+ (take 2):
Ran it on all 12 of my sticks - again!

NO RANDOM REBOOT!
————————————————————
Storage

I ran this custom script by a FreeNAS user based on suggestions from this forum. It ran SMART tests and 4 passes of badblocks and more SMART tests on my drives. I first tried running it on all 6 drives at once...

RANDOM REBOOT!

Ran that script on 1 drive (took 72ish hours). No problems. So ran that script on the other 5 drives (took 120ish hours). No problems!

I think maybe we're good to move on?!

It's been weeks of testing, but we're back in FreeNAS. Let it just sit running for a day. No random reboots! I test copying files to it. Everything works great. I set up SMB. All good. Sharing to/from Windows. Wonderful. Stable for days. When I'm feeling confident, I plug my DAS into my NAS and use FreeNAS's Import Disk feature (Storage ➞ Import Disk) to copy everything.

RANDOM REBOOT.

"God dammit!" (← use your imagination)

I do a bunch of testing. Time passes. Life is pain and existence is a curse. I Google and try a bunch of different stuff. I check crash logs, but I don't really know what I'm looking for (I literally only learned command lines 2 months ago). I Google. I forum. I cry.

I extended the pool to add an additional SSD as a dedicated log drive. And somehow, magically, that instantly solved it. (EDITOR'S NOTE: I feel like this may be an important indicator of what might be wrong, maybe?!)

I try Import Disk again. It's ROCK SOLID! It takes a long time. The copy finishes. I now have a new copy of all my files on my FreeNAS, and have that same 17 TB of files still on my old DAS. I double-check and compare files on both to make sure everything's perfect. It is! I *NEED* to know that my NAS is rock solid. I don't trust it as the sole holder of this 17 TB yet. I'm gonna back up data to the cloud first, and then I'm going to take apart the remaining DAS and add it to my NAS.

In my Ubuntu server, I set up a permanent mount of my FreeNAS. All good. Gonna test some speed!

sync; dd if=/dev/zero of=tempfile bs=1M count=1024; sync
Result: 112 MB/s

dd if=tempfile of=/dev/null bs=1M count=1024
Result: 117 MB/s

sudo /sbin/sysctl -w vm.drop_caches=3
dd if=tempfile of=/dev/null bs=1M count=1024
Result: 117 MB/s

After displaying the last result, a few seconds go by...

RANDOM REBOOT.

The pain. The disappointment. It boots back up, I don't touch it. A few minutes later, another random reboot. Again, I don't touch it. 10 minutes later, another random reboot. Don't touch it. An hour later, another random reboot. 4 reboots in a row.

I gave up and started drinking. This was last night. I left it overnight. Now it hasn't rebooted in 12 hours.

Feeling bold, I just ran those above 4 speed test commands on it again, exactly the same way. This time? No random reboot.

An hour has passed as I've been writing this. But I'm too cynical to think this nightmare is over. I just don't really know what to do from here.
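
A side note on the dd numbers above, as a rough sketch rather than a diagnosis: 112 to 117 MB/s is about what a single gigabit Ethernet link can carry, so those results probably say more about the network than about the pool. The link speed, MTU, and header sizes below are assumptions (the post doesn't state them), and the file-sharing protocol on top adds overhead of its own, so treat the figure as an upper bound.

```python
# Assumptions (not stated in the post): 1 Gbit/s link, 1500-byte MTU,
# TCP over IPv4 with the timestamp option enabled.
LINK_BITS_PER_S = 1_000_000_000
MTU = 1500
ETH_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header + FCS + preamble/SFD + inter-frame gap
IP_TCP_HEADERS = 20 + 20 + 12    # IPv4 header + TCP header + TCP timestamp option

payload = MTU - IP_TCP_HEADERS   # application bytes carried per frame
wire = MTU + ETH_OVERHEAD        # bytes each frame occupies on the wire

# dd reports MB/s in units of 10^6 bytes per second.
ceiling_mb_s = LINK_BITS_PER_S / 8 * payload / wire / 1e6

# Results quoted in the post.
measured = {"write": 112, "read": 117, "read after drop_caches": 117}

print(f"theoretical ceiling: {ceiling_mb_s:.1f} MB/s")
for label, mb_s in measured.items():
    print(f"{label}: {mb_s} MB/s = {mb_s / ceiling_mb_s:.0%} of ceiling")
```

If that assumption holds, the reads are already at line rate, so a faster or slower pool would produce the same number.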

Samuel Tai · Jul 18, 2020

These sound like kernel crashes. Try setting a sysctl kern.corefile=/var/coredumps/%U/%N.core, which will drop a debug file in /var/coredumps at the next crash.

It also appears this motherboard uses a proprietary SAS controller (the PIKE2008), for which FreeBSD support may not be the best. Have you tried your system without this card?

Lastly, have you done the usual things, like making sure the BIOS and firmware are at the latest levels, and any overclocking in the BIOS is disabled?

Samuel Tai · Jul 18, 2020

Also, try running with half the RAM, to see if you're stable. This sounds like possibly a bad stick.

CoreyVidal · Jul 18, 2020

Thanks @Samuel Tai, I appreciate your time.

I've added kern.corefile, set to /var/coredumps/%U/%N.core, as a sysctl tunable in the web GUI at System ➞ Tunables.

I did a ton of research about the PIKE card, including here on these forums. The card is basically a 9220-8i; the chip/controller is an LSI SAS-2008. The recommendations showed that as long as I flashed it into IT mode, it would be fine. But hey, maybe not. However, if I unplug that card, then my NAS has no drives for its main pool.

All BIOS and firmware are the latest versions. No overclocking.

I did do extensive memtest86+ testing, in sets of 4, and then all 12 together. But you're right, I should try it in FreeNAS and see if it's stable.

I have that kern.corefile set, so I'm currently waiting for the next unexpected reboot.

pab49162 · Jul 18, 2020

I saw your posting and feel for you as random reboots are a pain to troubleshoot. Here are a couple thoughts:

I agree with @Samuel Tai that a bad memory stick could be the issue. Reading your test results though, it is interesting that memtest86+ never found a problem but you did get one random reboot.
Given the memtest86+ test results, I would lean more towards either a heat-related problem or a bad power supply. If you get a random reboot again and find no trace of the reboot in the coredump log, that would be an indication that one of these two things caused a hard reboot of the system.
If you have access to another power supply and it isn’t too hard to swap, I would try that first. Over the years, I have seen lots of weird problems (including many random reboots) caused by bad power supplies. Even though your PS was new, it might still have an issue. If you can rule out a bad power supply, that would be a good first step.
In your posting, you didn’t mention anything about processor temperature. From your hardware list, looks like you have lots of fans so I wouldn’t think that was a problem unless one or more of the fans isn’t running. Do you have any data on the case or processor temperatures?
I found your comment “Also, my server is quieter now? It used to make a very quiet chugging sound …” to be interesting. I have never seen unplugging/replugging memory sticks change the sound level of a computer. This comment would make me check all of the fans especially the one in the power supply. (I noticed your power supply has a zero RPM mode - is that enabled?)
Assuming you aren’t running headless, are the keyboard and mouse new? I recently had a client’s PC randomly reboot and traced it to a bad mouse. (Never seen that before -- looked like the Windows 10 kernel crashed because of weird USB traffic.) I really, really doubt this is the cause but it is worth asking.

The bottom line is that I suspect that the root cause of your issue is a bad hardware component. The challenge will be figuring out which one. Hopefully, it is only one component and not two. If it is two or more, then troubleshooting just became much more challenging.

Hope this helps.

Paul

CoreyVidal · Jul 19, 2020

Hey @Samuel Tai, so my computer just did an unscheduled reboot. I've looked for /var/coredumps, but the file or directory doesn't exist. What am I doing wrong?

Samuel Tai · Jul 19, 2020

You're not doing anything wrong. This is an indication the reboot isn't due to a software crash. We're definitely looking at a hardware issue.

CoreyVidal · Jul 19, 2020

Oh OK. I did find a file called log.20200719170945 (which was the time of the reboot) and it gave me this:

Code:

[2020/07/19 21:09:23] (INFO) middlewared.__init__():742 - Starting FreeNAS-11.3-U3.2 middleware
[2020/07/19 21:09:23] (DEBUG) raven.base.Client.set_dsn():272 - Configuring Raven for host: https://sentry.ixsystems.com
[2020/07/19 17:09:27] (DEBUG) middlewared.setup():1250 - Timezone set to America/Toronto
[2020/07/19 17:09:28] (DEBUG) middlewared.setup():2197 - Certificate setup for System complete
[2020/07/19 17:09:31] (WARNING) MailService.send_raw():385 - Failed to send email: [Errno 8] hostname nor servname provided, or not known
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/mail.py", line 363, in send_raw
    server = self._get_smtp_server(config, message['timeout'], local_hostname=local_hostname)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/mail.py", line 411, in _get_smtp_server
    local_hostname=local_hostname)
  File "/usr/local/lib/python3.7/smtplib.py", line 251, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python3.7/smtplib.py", line 336, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python3.7/smtplib.py", line 307, in _get_socket
    self.source_address)
  File "/usr/local/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] hostname nor servname provided, or not known
[2020/07/19 17:09:33] (DEBUG) middlewared.__plugins_setup():841 - All plugins loaded
[2020/07/19 17:09:33] (DEBUG) middlewared.__initialize():1375 - Accepting connections
[2020/07/19 17:09:35] (DEBUG) ServiceService._simplecmd():287 - Calling: restart(collectd)
[2020/07/19 17:09:35] (WARNING) ServiceService._system():309 - Command '/usr/sbin/service collectd-daemon onestop ' failed with code 1: b'collectd_daemon not running? (check /var/run/collectd-daemon.pid).\n'
[2020/07/19 17:09:35] (ERROR) middlewared.setup():826 - System dataset is not mounted
[2020/07/19 17:09:35] (ERROR) middlewared.render_body():25 - Collectd configuration file could not be generated
[2020/07/19 17:09:37] (ERROR) middlewared.set_sysctl():407 - Failed to set sysctl '<module 'sysctl' from '/usr/local/lib/python3.7/site-packages/sysctl/__init__.py'>' to '': sysctl: unknown oid 'kern.cam.ctl.ha_peer'

[2020/07/19 17:09:38] (DEBUG) smartd.ensure_smart_enabled():28 - SMART is not supported on ['/dev/da7', '-d', 'sat']
[2020/07/19 17:09:38] (DEBUG) smartd.ensure_smart_enabled():28 - SMART is not supported on ['/dev/da6', '-d', 'sat']
[2020/07/19 17:09:38] (INFO) EtcService.generate_all():284 - Skipping nginx group generation
[2020/07/19 17:09:38] (INFO) EtcService.generate_all():284 - Skipping collectd group generation
[2020/07/19 17:09:38] (INFO) EtcService.generate_all():284 - Skipping system_dataset group generation
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/logging/handlers.py", line 69, in emit
    if self.shouldRollover(record):
  File "/usr/local/lib/python3.7/logging/handlers.py", line 185, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/local/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 229, in __str__
    raise NameError("Undefined")
NameError: Undefined
Call stack:
  File "/usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/utils/io_thread_pool_executor.py", line 52, in _target
    work_item.run()
  File "/usr/local/lib/python3.7/site-packages/middlewared/utils/io_thread_pool_executor.py", line 25, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/etc.py", line 33, in do
    return tmpl.render(middleware=self.service.middleware)
  File "/usr/local/lib/python3.7/site-packages/mako/template.py", line 475, in render
    return runtime._render(self, self.callable_, args, data)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 882, in _render
    **_kwargs_for_callable(callable_, data)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 919, in _render_context
    _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 946, in _exec_template
    callable_(context, *args, **kwargs)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 425, in render_body
    parsed_conf = parse_config(db)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 416, in parse_config
    add_bind_interfaces(pc, db)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 114, in add_bind_interfaces
    interface)
Message: 'IP address [%s] is no longer in use and should be removed from SMB configuration.'
Arguments: (<mako.runtime.Undefined object at 0x82c143dd0>,)
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/logging/handlers.py", line 69, in emit
    if self.shouldRollover(record):
  File "/usr/local/lib/python3.7/logging/handlers.py", line 185, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/local/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 229, in __str__
    raise NameError("Undefined")
NameError: Undefined
Call stack:
  File "/usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/utils/io_thread_pool_executor.py", line 52, in _target
    work_item.run()
  File "/usr/local/lib/python3.7/site-packages/middlewared/utils/io_thread_pool_executor.py", line 25, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/etc.py", line 33, in do
    return tmpl.render(middleware=self.service.middleware)
  File "/usr/local/lib/python3.7/site-packages/mako/template.py", line 475, in render
    return runtime._render(self, self.callable_, args, data)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 882, in _render
    **_kwargs_for_callable(callable_, data)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 919, in _render_context
    _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/site-packages/mako/runtime.py", line 946, in _exec_template
    callable_(context, *args, **kwargs)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 425, in render_body
    parsed_conf = parse_config(db)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 416, in parse_config
    add_bind_interfaces(pc, db)
  File "/tmp/mako/usr/local/lib/python3.7/site-packages/middlewared/etc_files/local/smb4.conf.py", line 114, in add_bind_interfaces
    interface)
Message: 'IP address [%s] is no longer in use and should be removed from SMB configuration.'
Arguments: (<mako.runtime.Undefined object at 0x82c143dd0>,)
[2020/07/19 17:09:38] (WARNING) middlewared.parse_db_config():98 - Path [/mnt/Atlantis/Media] to share [Media] does not exist
[2020/07/19 17:09:38] (WARNING) middlewared.parse_db_config():98 - Path [/mnt/Atlantis/MySpace] to share [MySpace] does not exist
[2020/07/19 17:09:38] (INFO) EtcService.generate_all():284 - Skipping smb_configure group generation
[2020/07/19 17:09:39] (INFO) EtcService.generate_all():284 - Skipping syslogd group generation

Don't know if anything there might be helpful?
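
One thing that can look odd in that log: the first two lines are stamped 21:09 and everything after is 17:09. A plausible reading, offered here as an assumption, is that middlewared logs in UTC until it applies the configured timezone ("Timezone set to America/Toronto"). Toronto is UTC-4 (EDT) in July, and the small sketch below checks that the two clocks line up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the first log lines are in UTC, written before the
# America/Toronto timezone was applied. A fixed UTC-4 offset stands in
# for Toronto's daylight time in July.
EDT = timezone(timedelta(hours=-4), "EDT")

# Timestamp of the first log line, read as UTC.
first_line = datetime(2020, 7, 19, 21, 9, 23, tzinfo=timezone.utc)
local = first_line.astimezone(EDT)

print(local.strftime("%Y/%m/%d %H:%M:%S %Z"))  # 2020/07/19 17:09:23 EDT
```

Read that way, the log is one continuous boot sequence starting at 17:09:23 local time, not two separate events.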

Samuel Tai · Jul 19, 2020

Mostly this is uninteresting, but there are a few things that jump out as possibilities:

Code:

[2020/07/19 17:09:35] (ERROR) middlewared.setup():826 - System dataset is not mounted
[2020/07/19 17:09:38] (WARNING) middlewared.parse_db_config():98 - Path [/mnt/Atlantis/Media] to share [Media] does not exist
[2020/07/19 17:09:38] (WARNING) middlewared.parse_db_config():98 - Path [/mnt/Atlantis/MySpace] to share [MySpace] does not exist

This is pointing towards your HBA as bad.

Samuel Tai · Jul 19, 2020

Alternatively, you've got a power supply issue, where it's not supplying enough juice to spin up all the drives in time for the HBA to access, or some kind of grounding issue which is siphoning off voltage needed to power the drives.

pab49162 · Jul 19, 2020

I believe it is more likely there is an issue with your power supply than the HBA. Power supplies probably fail more often than any other component and the failure rate is much lower for HBAs.

A bad power supply can definitely cause random hard reboots. Since you didn't find a core dump, that sure sounds like what happened during your latest reboot.

Suggest you find another power supply, swap it in and see if you still get the random reboots.

ThreeDee · Jul 22, 2020

A bad cap inside your PSU, possibly. I was having similar issues with my Windows desktop that were ultimately resolved by replacing the power supply; taking the old PSU apart revealed a bulging cap.

CoreyVidal · Jul 23, 2020

Alright, I have an update.

Between the 2 stand-out ideas (either a problem with the HBA card or the PSU), problems with the HBA card seemed more likely, as it's a card I bought used. Plus I flashed it to IT mode without really knowing what I was doing. I thought my flash was successful, but maybe I was mistaken. Meanwhile, the PSU is a Corsair HX-850, which is a pretty high-end PSU, and I bought it brand new. It's still possible there's something wrong with it, but that seems less likely.
Well, the server was stable for a few days while leaving it alone, so yesterday I decided to take a shot at running it through some heavier usage. And, yes, I started getting reboots. Now here's something new:

It hangs before it reboots--it doesn't just turn off instantly. So I have about 30 seconds' warning before it actually reboots. And I happened to notice output on the VGA screen that I didn't recognize. But when the server rebooted I couldn't find that information anywhere in any files/folders/logs. "Luckily" (ha) the computer rebooted again, and I was able to get a picture of it with my phone. Again, I looked in a bunch of places on the server, but couldn't find that output anywhere. So referencing the picture I took, I wrote this out by hand:

Code:

(da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 d5 a9 49 70 00 00 00
command 0xfffffe00015???30 (<------ 3 question marks there I couldn't make out)
mps0: Sending reset from mpssas_send_abort for target ID 2
        (da5:mps0:0:5:0): WRITE(10). CDB: 2a 00 d5 a9 49 68 00 00 08 00 length 4096 SMID 953 Aborting
mps0: Sending reset from mpssas_send_abort for target ID 5
        (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 ca f8 ff 00 00 00 08 00 length 4096 SMID 449 Aborting
mps0: Sending reset from mpssas_send_abort for target ID 1
        (da3:mps0:0:3:0): WRITE(10). CDB: 2a 00 d5 a9 49 68 00 00 08 00 length 4096 SMID 863 Aborting
mps0: Sending reset from mpssas_send_abort for target ID 3
        (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 ca f8 ff 00 00 00 08 00 length 4096 SMID 934 Aborting
mps0: Sending reset from mpssas_send_abort for target ID 0
        (da4:mps0:0:4:0): WRITE(10). CDB: 2a 00 d5 a9 49 68 00 00 08 00 length 4096 SMID 1001 Aborting
0
mps0: Sending reset from mpssas_send_abort for target ID 4
        (xpt0:mps0:0:2:0): SMID 1 task mgmt 0xfffffe0001555150 timed out
mps0: Reinitializing controller,
mps0: Unfreezing devq for target ID 2
mps0: Unfreezing devq for target ID 5
mps0: Unfreezing devq for target ID 1
mps0: Unfreezing devq for target ID 3
mps0: Unfreezing devq for target ID 0
mps0: Unfreezing devq for target ID 4
panic: mps_iocfacts_allocate failed to get IOC Facts with error 16
 
cpuid = 9
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe1836631720
vpanic() at vpanic+0x17e/frame 0xfffffe1836631780
panic() at panic+0x43/frame 0xfffffe18366317e0
mps_iocfacts_allocate() at mps_iocfacts_allocate+0x13bb/frame 0xfffffe18366318e0
mps_reinit() at mps_reinit+0x112/frame 0xfffffe1836631910
softclock_call_cc() at softclock_call_cc+0x14f/frame 0xfffffe18366319c0
softclock() at softclock+0x79/frame 0xfffffe18366319e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe1836631a20
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe183663a70
fork_exit() at fork_exit+0x83/frame 0xfffffe1836631ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1836631ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KBD: enter: panic
[ thread pid 12 tid 100134 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db:0:kdb.enter.default> write cn_mute 1
cn_mute                0        =            0x1
db:0:kdb.enter.default> textdump dump

I checked carefully for typos, so hopefully it's perfect. There was a little bit at the top that I couldn't make out, so I put ? for a few characters.

Googling is leading me to believe this is a problem with the HBA? But I haven't been able to find anything that tells me what to check from here.
Also, does anyone know where that text gets stored (if it gets stored) so I don't have to write it out by hand?

Samuel Tai · Jul 23, 2020

The console messages should be in either /var/log/dmesg.today or /var/log/console.log, or in the compressed, rolled-over earlier versions of those files.

Yes, it looks like the HBA isn't communicating with various disks in your pool just before the reboot.
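
To make that concrete, here is a minimal sketch that tallies the "Sending reset" lines from the transcribed console output by controller and target ID. The regular expression and the helper name resets_per_target are illustrative choices, not part of any FreeNAS tool.

```python
import re
from collections import Counter

# "Sending reset" lines as transcribed in the console output above.
CONSOLE_LINES = [
    "mps0: Sending reset from mpssas_send_abort for target ID 2",
    "mps0: Sending reset from mpssas_send_abort for target ID 5",
    "mps0: Sending reset from mpssas_send_abort for target ID 1",
    "mps0: Sending reset from mpssas_send_abort for target ID 3",
    "mps0: Sending reset from mpssas_send_abort for target ID 0",
    "mps0: Sending reset from mpssas_send_abort for target ID 4",
]

RESET_RE = re.compile(
    r"^(?P<ctrl>\w+): Sending reset from mpssas_send_abort "
    r"for target ID (?P<target>\d+)$"
)

def resets_per_target(lines):
    """Count abort-triggered resets per (controller, target ID) pair."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match:
            counts[(match.group("ctrl"), int(match.group("target")))] += 1
    return counts

counts = resets_per_target(CONSOLE_LINES)
targets = sorted(target for _, target in counts)
controllers = sorted({ctrl for ctrl, _ in counts})

print(f"targets reset: {targets}")
print(f"controllers involved: {controllers}")
```

All six targets time out within the same burst and all sit behind mps0. That pattern fits something the drives share (the HBA, its cabling, or power) better than six drives failing independently.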

CoreyVidal · Jul 23, 2020

Yeah, I've looked there. The pre-reboot text shown on my VGA screen does not show up in console.log (and I don't have a dmesg.today).

As a test, I've downloaded to my desktop everything from: /var/, /usr/, and /data/, and used Visual Studio Code to scan every file for anything from that output. Can't find it.

Yorick · Jul 23, 2020

Could be the HBA, could also be a faulty cable. I'd swap the cable first, see whether that resolves it; and if it doesn't, then the HBA is likely to blame.

CoreyVidal · Jul 23, 2020

Hey @Yorick, when you say to swap the cable, which cable(s) are you referring to?

Samuel Tai · Jul 23, 2020

The SAS cables to those drives or backplane.

CoreyVidal · Jul 23, 2020

OK, I adjusted all the cables. Don't think it made any difference.

Had another reboot and this time managed to film a video of it, so I got everything. It's similar to before, but some differences. Also got more at the end:

Code:

(da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 02 08 c2 f4 f0 00 00 00 40 00 00 length 32768 SMID 869 Aborting command 0xfffffe000159c490
mps0: Sending reset from mpssas_send_abort for target ID 1

(da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 02 08 c2 f6 b8 00 00 01 00 00 00 length 131072 SMID 749 Aborting command 0xddddde0001592710
mps0: Sending reset from mpssas_send_abort for target ID 2

(da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 02 08 c2 f7 b0 00 00 00 c0 00 00 length 98304 SMID 1063 Aborting command 0xfffffe00015ac330
mps0: Sending reset from mpssas_send_abort for target ID 3

(da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 02 08 c2 f9 30 00 00 00 40 00 00 length 32768 SMID 1050 Aborting command 0xfffffe00015ab220
mps0: Sending reset from mpssas_send_abort for target ID 5

(da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 02 07 7a 76 78 00 00 00 40 00 00 length 32768 SMID 827 Aborting command 0xfffffe0001598d70
mps0: Sending reset from mpssas_send_abort for target ID 4

(da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 02 07 7a 76 78 00 00 00 40 00 00 length 32768 SMID 77 Aborting command 0xfffffe000155b510
mps0: Sending reset from mpssas_send_abort for target ID 0

(hung for a few seconds)

Code:

        (xpt0:mps0:0:1:0): SMID 1 task mgmt 0xfffffe0001555150 timed out
mps0: Reinitializing controller,
panic: mps_reinit hard reset failed with error 60
 
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe1836631820
vpanic() at vpanic+0x17e/frame 0xfffffe1836631880
panic() at panic+0x43/frame 0xfffffe18366318e0
mps_reinit() at mps_reinit+0x341/frame 0xfffffe1836631910
softclock_call_cc() at softclock_call_cc+0x14f/frame 0xfffffe18366319c0
softclock() at softclock+0x79/frame 0xfffffe18366319e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe1836631a20
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe1836631a70
fork_exit() at fork_exit+0x83/frame 0xfffffe1836631ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1836631ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KBD: enter: panic
[ thread pid 12 tid 100134 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db:0:kdb.enter.default> write cn_mute 1
cn_mute                0        =            0x1

(hung for a few seconds)

Code:

db:0:kdb.enter.default> textdump dump
mps0: mpssas_action_scsiio: Freezing devq for target ID 5
(da5:mps0:0:5:0): WRITE(10). CDB: 2a 00 00 40 00 7f 00 00 01 00
(da5:mps0:0:5:0): CAM status: CAM subsystem is busy
(da5:mps0:0:5:0): Error 5, Retries exhausted
Aborting dump due to I/O error.
textdump_writeblock: offset 2147483136, error 5
Textdump: Error 5 writing dump
db:0:kdb.enter.default> reset
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 2

Rebooted.

Any suggestions on where to turn next?

Samuel Tai · Jul 23, 2020

Code:

mps0: Reinitializing controller,
panic: mps_reinit hard reset failed with error 60

The only thing left is to replace the HBA. The CPU tried to reset the card via PCI-E, and the card didn't respond.
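
For anyone reading the numbers in those messages: assuming the driver is reporting standard FreeBSD errno values (an assumption, though the values fit the symptoms), they decode as in the sketch below. The table is hard-coded because Python's errno module reflects the host OS, and Linux numbers some of these differently.

```python
# Subset of FreeBSD errno values from sys/errno.h, hard-coded on purpose:
# on Linux, for example, ETIMEDOUT is 110 rather than 60.
FREEBSD_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device busy"),
    60: ("ETIMEDOUT", "Operation timed out"),
}

# Messages and error numbers as they appear in the console output in this thread.
REPORTED = [
    ("mps_iocfacts_allocate failed to get IOC Facts", 16),
    ("mps_reinit hard reset failed", 60),
    ("textdump_writeblock", 5),
]

def describe(code):
    """Return a readable label for a FreeBSD errno value in the table above."""
    name, text = FREEBSD_ERRNO.get(code, ("?", "not in this table"))
    return f"{code} = {name} ({text})"

for message, code in REPORTED:
    print(f"{message}: {describe(code)}")
```

Error 60 on the hard reset is a timeout, which matches the card not responding. The textdump failed with an I/O error because it was being written to da5, a disk behind that same controller, which would also explain why the panic text never showed up in any log file.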
