TrueNAS Core crashing - NVMe Issues

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
Forgive me, as I'm new to TrueNAS. I recently built a system with 14 regular SAS HDDs and a few m.2 NVMe's using PCIe to m.2 adapters and bifurcation. At first everything seemed to work great: I can see all the NVMe's (4 in one x16 slot and 3 in another) in TrueNAS. Version is TrueNAS-12.0-U8.

All of the NVMe's are WD_BLACK SN770: four 2TB and three 200GB. One of the 200GB drives is for booting the system.

When the NVMe's are under load, the entire system eventually crashes. I see the following in the logs and on the main console. It seems to happen to any of the NVMe's; I've seen nvme3, nvme4 and nvme5 so far. I tried setting "hw.nvme.use_nvd=0" in loader.conf, but that doesn't seem to make any difference, although it gives a slightly different result in the logs. When the issue happens the system locks up completely, and I need to force a reset to continue.

If someone can maybe point me in the right direction I would very much appreciate it. It's been years since I played with FreeBSD.

I did see this bug, which seems very similar, but I assume the fix is already in the version I'm using: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713 Can someone help confirm this?

Code:
Mar 29 21:42:25 truenas nvme5: Resetting controller due to a timeout and possible hot unplug.
Mar 29 21:42:25 truenas nvme5: resetting controller
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:120 nsid:1 lba:1497544880 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:123 nsid:1 lba:198272936 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:123 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:13 cid:121 nsid:1 lba:431014528 len:24
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:13 cid:121 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:15 cid:127 nsid:1 lba:864636432 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:126 nsid:1 lba:2445612184 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:126 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:120 nsid:1 lba:430503600 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:18 cid:123 nsid:1 lba:1499051024 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:123 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: WRITE sqid:18 cid:124 nsid:1 lba:1990077368 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:122 nsid:1 lba:1237765696 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:122 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:125 nsid:1 lba:180758264 len:16
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:20 cid:121 nsid:1 lba:2445612192 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:20 cid:121 cdw0:0
Mar 29 21:42:26 truenas nvd5: detached
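
For anyone comparing several crashes, here is a rough sketch of how a log like the one above could be tallied per controller. It is only an illustration: the regular expressions are guesses based on the line formats shown in this post, and the function name `summarize` is made up for the example. The sample text is a shortened copy of the lines above.

```python
import re
from collections import Counter

# Shortened copy of the log lines posted above.
LOG = """\
Mar 29 21:42:25 truenas nvme5: Resetting controller due to a timeout and possible hot unplug.
Mar 29 21:42:25 truenas nvme5: resetting controller
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:120 nsid:1 lba:1497544880 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:120 cdw0:0
Mar 29 21:42:26 truenas nvme5: WRITE sqid:18 cid:124 nsid:1 lba:1990077368 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
Mar 29 21:42:26 truenas nvd5: detached
"""

# Patterns inferred from the lines above (an assumption, not a driver spec).
RESET_RE = re.compile(r"(nvme\d+): Resetting controller")
CMD_RE = re.compile(r"(nvme\d+): (READ|WRITE) sqid:(\d+) cid:(\d+)")


def summarize(log_text):
    """Count timeout-triggered resets and failed READ/WRITE commands per controller."""
    resets = Counter()
    failed = Counter()
    for line in log_text.splitlines():
        reset = RESET_RE.search(line)
        if reset:
            resets[reset.group(1)] += 1
            continue
        cmd = CMD_RE.search(line)
        if cmd:
            # Key on (controller, opcode), e.g. ("nvme5", "READ").
            failed[(cmd.group(1), cmd.group(2))] += 1
    return resets, failed


resets, failed = summarize(LOG)
print(dict(resets))
print(dict(failed))
```

Running this over the full messages file instead of the sample would show whether the resets cluster on one controller or are spread across all of them.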


This is what I see when "hw.nvme.use_nvd=0" is set:
Code:
nvme3: Resetting controller due to a timeout and possible hot unplug.
nvme3: resetting controller
nvme3: failing outstanding i/o
nvme3: READ sqid:7 cid:127 nsid:1 lba:419546528 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:7 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1901c5a0 0 7 0 0 0
failing outstanding i/o
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nvme3: READ sqid:11 cid:127 nsid:1 lba:782841288 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2ea935c8 0 7 0 0 0
failing outstanding i/o
nvme3: READ sqid:11 cid:123 nsid:1 lba:704576056 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:12 cid:127 nsid:1 lba:1016402352 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:12 cid:125 nsid:1 lba:1824854760 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
WRITE sqid:13 cid:124 nsid:1 lba:1008638008 len:64
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:13 cid:125 nsid:1 lba:1008638152 len:56
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:15 cid:127 nsid:1 lba:783188688 len:8
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=29fefa38 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c9511b0 0 7 0 0 0
ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:15 cid:123 nsid:1 lba:1008553080 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:15 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:124 nsid:1 lba:147012776 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6cc512e8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e9838 0 3f 0 0 0
ABORTED - BY REQUEST (00/07) sqid:16 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:127 nsid:1 lba:2881895592 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:16 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:17 cid:127 nsid:1 lba:2574392744 len:16
nvme3: ABORTED - BY REQUEST (00/07) sqid:17 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:18 cid:126 nsid:1 lba:155895056 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e98c8 0 37 0 0 0
ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:19 cid:125 nsid:1 lba:151377120 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2eae82d0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1d4c78 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=8c33ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=abc63ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=99721da8 0 f 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=94ac510 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=905d4e0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nda3 at nvme3 bus 0 scbus13 target 0 lun 1
nda3: <WD_BLACK SN770 2TB 731030WD 21513C800057>
 s/n 21513C800057 detached
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
We have had this problem in various TrueNAS as well as plain FreeBSD installations. What finally fixed it was a firmware update of the SSDs we used. Intel P4510 series.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
I did actually see some of your posts about this, thx for the fast reply. Hmm... very interesting. I guess I will have to load up the SSD's in Windows and use the Western Digital utility to see if there is a firmware update.

Going to try that now, thanks!!!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Doesn't WD provide a bootable FW update image like Intel does? I booted my hosts from a special CD image via IPMI, FW update running automatically, reboot into FreeBSD, done.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
I don't see it on the WD site...I'm not surprised since these m.2's are for gaming technically. That said, I just quickly loaded them up in a Windows box, one by one since I didn't have a free x16 slot, just used m.2 on the motherboard. WD says the firmware is up to date (see screenshot). So no luck with that. Any other ideas?

Screen Shot 2022-03-31 at 5.16.10 PM.png
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Contact Warner Losh? Open an issue on the FreeBSD bug tracker or add to the one you linked? Have you tried disabling TRIM?
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
Should I just try it on TrueNAS Scale? Since I assume the problem is probably with FreeBSD.

Am I correct in thinking that I can just install Scale on a new boot disk and import my pools?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have been less than impressed with the WD Blacks we've bought in the past, with random cockups and dropouts. I just RMA'd a SN750 with a WD-installed heatsink. Thing just totally vanished. Normally this is where I say nice things about the Samsungs, but I had a 980 Pro corrupt a bunch of crap on me in a super-unstressy environment a few months back too. Mmm.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
I actually could return them. I just checked and I have a few days where I can still return them on Amazon. Can anybody recommend a good 2-4TB m.2 SSD?

Should I worry about the one I'm using to boot TrueNAS with? I doubt the I/O on boot disk would be enough to cause any issues.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The NVMe SSD market seems to be going through a bit of a SandForce phase in terms of reliability, but I don’t understand why. There are a bunch of controller vendors, firmware has had a good eight years to mature, and even the vertically-integrated vendors seem to be struggling more than usual.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Can anybody recommend a good 2-4TB m.2 SSD?
I run a dozen Samsung 970 EVO Plus - the "Plus" is important. No issues so far. Our data centre SSDs from Intel are all U.2 so not applicable in your case.

If you read through the discussion in the bugtracker, I guess the problem might also lie with your mainboard. I don't understand all the details, but interrupt routing and management seem to be a bit tricky with this "newfangled" stuff.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
So I am convinced that either there is some crazy issue with the SN770's and FreeBSD OR I somehow got a bad batch of 4. I tried to clone the SSD's in both Windows and a Clonezilla boot ISO; both crashed the OS. I even tried to load up TrueNAS Scale and had almost the same NVMe problem. To rule out my motherboard as the problem I even loaded up TrueNAS on an entirely different system, and had the same issues.

So.... I'm going to return the SN770's to Amazon. I drove an hour out to Micro Center yesterday and bought 8 2TB m.2 SSD's, this one: https://www.microcenter.com/product/642167/inland-2tb-perform-nvme-ssd-w-o

These are "store brand" NVMe's, using what looks like a newer Kioxia chip. The performance for the price seems impressive, with 5,000 MB/s read and 4,300 MB/s write. So of course now I have a new problem: I'm not getting anywhere near that performance. It doesn't seem to matter how I set up the vdev. See dd test below:

Code:
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# dd if=/dev/zero of=/mnt/ssd1/ssd-test1/test3 bs=1M count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 17.694433 secs (2963011029 bytes/sec)
root@truenas[~]#
root@truenas[~]# 


That dd test is about 370 MB/s, which is nowhere near what these disks should be capable of. When I do a real world test using SMB file transfer I can read at anywhere from 200 MB/s, and sometimes it jumps to almost 800 MB/s. With the previous SN770's... when they were working... I could easily read and write well over 1 GB/s. It would always max out my 10Gbps NIC.
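
A quick unit check on the dd figure, as a sketch only. dd prints its rate in bytes per second, so 2963011029 bytes/sec works out to roughly 2,963 MB/s; 370 is the number that comes out if that rate is also divided by 8, as if converting bits to bytes. The script below just redoes that arithmetic with the values from the dd output above and adds the 10Gbps NIC ceiling for comparison. The function name is made up for the example. Keep in mind that writing /dev/zero to a dataset with compression enabled (an assumption here, it was not checked) can overstate what the disks themselves are doing.

```python
# Values copied from the dd summary line above.
BYTES_TRANSFERRED = 52428800000
ELAPSED_SECONDS = 17.694433


def throughput_mb_s(nbytes, seconds):
    """Return throughput in decimal megabytes per second (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


dd_mb_s = throughput_mb_s(BYTES_TRANSFERRED, ELAPSED_SECONDS)

# Dividing the byte rate by 8 again, as if it were a bit rate.
dd_divided_by_8 = dd_mb_s / 8

# Line rate of a 10 Gbps link expressed in MB/s, ignoring protocol overhead.
nic_mb_s = 10e9 / 8 / 1e6

print(f"dd rate:          {dd_mb_s:.0f} MB/s")
print(f"dd rate / 8:      {dd_divided_by_8:.0f}")
print(f"10 Gbps NIC max:  {nic_mb_s:.0f} MB/s")
```

A test that reads from a file of incompressible data, or a tool such as fio, would give a number that says more about the drives.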

I did enable the nda driver using hw.nvme.use_nvd=0 and that seems to help a little bit, but not much.

So anybody have any ideas about this new issue? Maybe I need to start a new thread since it's a different issue. I'm having bad luck with NVMe. Very frustrating.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
So anybody have any ideas about this new issue?
Sorry. I'd never buy anything but a device from a reputable vendor. For Storage SSDs that's Samsung and Intel in our data centre, and a few Transcend and Crucial for boot devices.
That does not imply everything by these companies is perfect. See @jgreco's post above. But so far I stayed away from 980s, anyway, after reading that they are using some trickery to replace on-device cache with host memory. 960 and 970 it is for the time being.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I guess there are production problems due to lack of silicon, so companies are having to make compromises. It's just a shame that we the consumers have to guess what's going on some of the time.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
These new NVMe's I got also just BSOD'd my Windows test PC... I'm going crazy over here, it's statistically impossible to have this many issues with NVMe. Using them on macOS with a USB 3.0 enclosure works fine though (although since it's USB perhaps it's not writing/reading fast enough to cause a problem, I dk). About to test with a totally different Windows PC.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yeah - time to start suspecting other hardware methinks
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
Yeah - time to start suspecting other hardware methinks
I mean yeah I would agree, but I tested on totally different platforms. One is a 3U Supermicro and the other is a basic/typical "gaming PC build" that I use for a dedicated Windows Plex server. About to test with my newer Windows gaming PC that I never use, so that will be a third machine. The only thing I'm doing is moving the NVMe between the machines. I reformatted and copied a 20GB file to test.
 

ibrennan

Dabbler
Joined
Mar 31, 2022
Messages
12
Tested a bunch of them with a third machine, which has newer hardware. They performed flawlessly, no BSOD, no crash. I copied 100GB in just over one minute as a test.

So I'm back to this being a FreeBSD NVMe issue.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Tested a bunch of them with a third machine, which has newer hardware. They performed flawlessly, no BSOD, no crash. I copied 100GB in just over one minute as a test.

So I'm back to this being a FreeBSD NVMe issue.
Can you run TrueNAS/FreeBSD on the newer hardware? That would eliminate hardware as the issue.
 