SCALE: R730XD ASUS PCIe NVMe Card, NVMe Drives Randomly Drop

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
Hey All,

I have an R730XD with an ASUS PCIe M.2 card, and it randomly drops NVMe drives after a couple of days. The pool shows degraded and I have to restart the server; once I restart, all 4 NVMe drives come back online, and a health check from the CLI shows all drives as good. The drives are 3 months old and it just happens randomly.

PCIE Card https://a.co/d/cFE2GsR
Drives are Samsung 990 PROs

Anyone ever run into this before?
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
Digging a little more with the lspci command, it recognizes all 4 of the NVMe controllers.

The same device ID number keeps showing up as failed when interrogating the pool status in TrueNAS SCALE, but the drive serial number attached to it keeps changing. I am only able to get 3 drives to come online at one time now.

When I run the lspci -nv command, I find something interesting: the functioning devices are using the kernel driver "nvme", while the "failed" drive is using "vfio-pci" (a quick driver check/rebind sketch is below the output).
05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 0, IOMMU group 32
Memory at 92600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
06:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 77, NUMA node 0, IOMMU group 33
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: vfio-pci
Kernel modules: nvme

lspci:
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

lspci -k|grep mpt3sas
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
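
For anyone else chasing the same vfio-pci vs nvme question, here is a rough sketch of how to check which driver owns each NVMe controller, and (in principle) how to hand one back from vfio-pci to nvme. The 0000:06:00.0 address is just the one from my output above; adjust for your own box, and only try the rebind if the drive is already out of the pool.

# list every NVMe-class (0108) device together with its bound kernel driver
lspci -nnk -d ::0108

# sysfs view of the same thing: the 'driver' symlink shows what owns the controller
ls -l /sys/bus/pci/devices/0000:06:00.0/driver

# sketch only: release the device from vfio-pci and rebind it to the nvme driver
echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 0000:06:00.0 > /sys/bus/pci/drivers/nvme/bind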
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
Hopefully this helps someone; I'm not sure how this even happened. One of my passthrough devices on a VM got switched to my NVMe. It randomly happened in the middle of the afternoon today.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
If VMs are in the game, it would be good to know how that is set up.
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
If VMs are in the game, it would be good to know how that is set up.
Hey Chris,

Maybe I should add some more details.

I have a VM running a Coral TPU as a passthrough. That machine lost its Coral TPU, which I know was there because I had been using it yesterday and verified its usage. Somehow, in the middle of the day, that machine lost its Coral TPU passthrough and the passthrough moved to the NVMe. The NVMe is part of a pool, and the lost drive caused the pool status to go degraded.
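
For anyone wanting to double-check the same thing on their own system, here is a sketch of the kind of check that shows what vfio-pci currently holds (the device names it matches are just the ones relevant to my box):

# every PCI address currently claimed by vfio-pci
ls /sys/bus/pci/drivers/vfio-pci/ | grep '^0000:'

# cross-reference against the NVMe controllers and the Coral TPU
lspci -nnk | grep -iA3 'non-volatile\|global unichip'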
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Are you running this VM inside TrueNAS, or is TrueNAS itself also running as a VM? A diagram of some kind might be helpful.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
The R730xd isn't compatible out of the box with M.2 NVME drives... or any NVME drives really. I manage a couple dozen of these model servers as HCI storage devices. There's no documented NVME support, though the R730xd will boot from certain models of Intel and Micron NVME drives via PCIE to M.2 or PCIE to U.2 adapters.

We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives. It's possible there's a vague hardware compatibility issue with that generation of Dell server and Samsung NVME drives, or, more likely, the v3 and v4 Xeon E5 CPUs that will run in the 730xd. In our testing, even Intel and Micron NVME drives had a few stability issues and we ended up scrapping the idea altogether, leaving the servers to boot from their internal SATA-DOM or SAS SSDs in the front drive bays.

You may try different models of drives to see if that helps. That generation of servers only has PCIE gen 3 so there's not really much of a performance benefit from the gen 4 or gen 5 drives such as the 980 or 990.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
IIRC the Samsung 990 Pro has a write cache. Once that is saturated the write performance goes down considerably, if memory serves me right.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The R730xd isn't compatible out of the box with [...] or any NVME drives really.
There's no documented NVME support
Well, that's just not true:
Up to twenty 2.5-inch hard drives and up to four 2.5-inch (U.2) NVMe drives (in slots 20 to 23).
(This applies to the 24-bay model, but the motherboard is the same throughout the range.)

Now, it's important to keep in mind that Dell's enablement kit is pretty substantial. I haven't examined one, but I suspect they're using a PCIe switch for the retimer functionality (or maybe just plain retimers, but with a massive heatsink to make it look more expensive). This tells me that tolerances on the PCIe links are pretty tight, which to some extent leads into:
We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives
This is very unusual, but a relevant data point.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The R730XD SFF will work with NVMe devices as stated - but booting from them seems to be a challenge. That doesn't seem to be the issue at play here though.

@loca5790 I'm curious as to how the NVMe SSD got toggled to passthrough to a VM (vfio driver claim). I assume you're using an M.2/PCIe Coral TPU - how is that attached? Is that in the ASUS HyperX card, or on its own mount?
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Well, that's just not true:

(This applies to the 24-bay model, but the motherboard is the same throughout the range.)

Now, it's important to keep in mind that Dell's enablement kit is pretty substantial. I haven't examined one, but I suspect they're using a PCIe switch for the retimer functionality (or maybe just plain retimers, but with a massive heatsink to make it look more expensive). This tells me that tolerances on the PCIe links are pretty tight, which to some extent leads into:

This is very unusual, but a relevant data point.
And here I thought I knew everything! Guess I should have checked the source documentation first. I wasn't aware the R730xd had any NVME config available. I'd wager it's a rare backplane to find. Thanks for pointing that out.

I know for sure that the R730xd will not boot from a Samsung 970 Evo Plus or 980 Pro, Crucial T500, Western Digital Black (I don't remember the model), and a handful of random generic-branded NVMe SSDs extracted from Lenovo laptops. You can install an OS to them, but they do not show as bootable options in the BIOS and will not boot once the OS has been installed. We tested TrueNAS CORE and ESXi.

Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.

The R730xd will also boot from a BOSS-S1 card containing dual Micron 2300 NVME drives, though I wasn't able to find anywhere that said the BOSS-S1 card was supported on the R730xd.

I remember years ago there being a few competing standards for booting from NVME drives and my assumption here is that Dell only implemented one or a few of those in their 13th gen servers. Given that Intel and Micron drives are available directly from Dell with Dell firmware, it makes sense that Dell would support whatever protocol those companies were using at the time.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd wager it's a rare backplane to find.
Nope, all 24-bay units support NVMe, though I expect most are not wired for it. Very easy to retrofit, of course, it's just an adapter card and SFF-8643 cables. Same goes for R630 10-bay units.
I remember years ago there being a few competing standards for booting from NVME drives
I never heard anything about that, though I can easily imagine the Gen13 firmware being limited in terms of the NVMe driver. That's something you could work around with enough effort...
I know for sure that the R730xd will not boot from a Samsung 970 Evo Plus or 980 Pro, Crucial T500, Western Digital Black (I don't remember the model), and a handful of random generic-branded NVMe SSDs extracted from Lenovo laptops. You can install an OS to them, but they do not show as bootable options in the BIOS and will not boot once the OS has been installed. We tested TrueNAS CORE and ESXi.

Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.
Then again, that sounds a lot like they're whitelisting models they sell without locking the firmware to Dell-branded drives...
The R730xd will also boot from a BOSS-S1 card containing dual Micron 2300 NVME drives, though I wasn't able to find anywhere that said the BOSS-S1 card was supported on the R730xd.
Those things are basically RAID controllers, right? So they would use a different driver than plain NVMe disks...

There's enough boot nonsense with legacy options. On Gen15, I simply cannot boot ZFS Boot Menu from SATA disks attached to an HBA330 mini. The Linux kernel just crashes and burns, I suspect because nobody actually ever tested the mpt3sas code with the stub loader or something and a race condition with the UEFI driver is tripped. Everything works fine if I boot ZBM from, say, NVMe, and load the real OS from SATA. Haven't tried SAS disks, but I might for fun one day.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Nope, all 24-bay units support NVMe, though I expect most are not wired for it. Very easy to retrofit, of course, it's just an adapter card and SFF-8643 cables. Same goes for R630 10-bay units.
Now that's interesting, I didn't know that. We run 19 of that model in a split backplane config with dual HBA330s; this was a specific config we selected from the factory. Through just playing around with other R730xd servers, I discovered they all have the exact same backplane (or at least all of our R730xd servers have the same backplane) and can be converted to a split backplane, even the ones that weren't configured that way from the factory. It would make sense that they could all run that 4 x NVME config.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.
I went back and looked at my notes and discovered that this was incorrect. The card I was thinking of here is a BOSS-S2 card with Micron NVME drives which we did not test on the R730xd. The BOSS-S1 card that we tested has Micron 5100 M.2 SATA drives.

The reason for all these tests was that the vast majority of our R730xd servers are in use as VMware vSAN nodes with fully populated backplanes. They were originally configured to boot from a 64GB SATA-DOM drive plugged into the motherboard. However, over their lifespan, ESXi install requirements have grown, the original boot drives ordered with the servers are no longer sufficient, and we were not able to find a SATA-DOM drive larger than 64GB. Our testing was meant to see if we could boot the servers from either an M.2\PCIE or U.2\PCIE adapter card, so that we could meet the boot drive requirements of the new ESXi versions without sacrificing node storage capacity for vSAN.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
High-end USB flash drives might also be an option to consider, either the high-end native USB SSDs or USB/NVMe bridges with NVMe SSDs.
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
The R730XD SFF will work with NVMe devices as stated - but booting from them seems to be a challenge. That doesn't seem to be the issue at play here though.

@loca5790 I'm curious as to how the NVMe SSD got toggled to passthrough to a VM (vfio driver claim). I assume you're using an M.2/PCIe Coral TPU - how is that attached? Is that in the ASUS HyperX card, or on its own mount?
That is correct. It is on its own mount on an Ablecon PEX-MP117, based off what was recommended on the forums. The ASUS HyperX card only has the 4 990 Pros, and they are identified individually at 05-08 via lspci. I am also very curious as to how that routing changed, kicked my separate PCIe card out, and moved the vfio binding to an NVMe. This was also recorded in CheckMK when it identified a disk IO change in TrueNAS... it appears the shift somehow happened within TrueNAS, because when I tried shifting PCIe card slots in the server, the R730XD reported slot card changes via SNMP.

This is further compounded by the fact that the disk IO locations reported through SNMP to CheckMK are being lost and show up as "item not found" in the monitoring data.
The R730xd isn't compatible out of the box with M.2 NVME drives... or any NVME drives really. I manage a couple dozen of these model servers as HCI storage devices. There's no documented NVME support, though the R730xd will boot from certain models of Intel and Micron NVME drives via PCIE to M.2 or PCIE to U.2 adapters.

We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives. It's possible there's a vague hardware compatibility issue with that generation of Dell server and Samsung NVME drives, or, more likely, the v3 and v4 Xeon E5 CPUs that will run in the 730xd. In our testing, even Intel and Micron NVME drives had a few stability issues and we ended up scrapping the idea altogether, leaving the servers to boot from their internal SATA-DOM or SAS SSDs in the front drive bays.

You may try different models of drives to see if that helps. That generation of servers only has PCIE gen 3 so there's not really much of a performance benefit from the gen 4 or gen 5 drives such as the 980 or 990.
There is a documented enablement kit that connects via PCIe to a 24-bay backplane. I have a 12-bay and am using a PCIe card that many are reporting success with; booting from the drives is what seems to be hit or miss, and I am not booting from them. The reason I am using them is stability and longevity. When doing testing on them they were transferring around 6 gb/s, which is more than what I need.
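
For anyone wanting to sanity-check similar numbers on their own pool, a sequential fio run along these lines is one way to get a ballpark (the dataset path and sizes are just placeholders; ZFS caching will flatter reads, which is why this sketch tests writes):

# rough sequential throughput check against a dataset on the NVMe pool
# (the directory must already exist; fio creates its own test files in it)
fio --name=seq --directory=/mnt/VM_NVME/fio-test --rw=write --bs=1M \
    --size=16G --numjobs=4 --ioengine=libaio --iodepth=32 --group_reporting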

Are you running this VM inside TrueNAS, or is TrueNAS itself also running as a VM? A diagram of some kind might be helpful.
I am running TrueNAS as the base OS and everything else is a VM inside of it. The NVMe drives are in a pool that is dedicated strictly to VM use. I am booting off mirrored SSDs in the flex bays.



All, I have not seen any issues since I corrected the drive passthrough, but again, it takes a while to show up. Not sure why or how, but I'm hoping it's simply a me thing where I made a mistake.
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
It's back. Going to need some help here, as I believe the issue is TrueNAS.

lspci -nv (shows the 4 drives are in the same locations, but the kernel driver in use for 06:00.0 is missing; a kernel-log check sketch is below the output)

05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 32
Memory at 92600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

06:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: fast devsel, IRQ 47, NUMA node 0, IOMMU group 33
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel modules: nvme

07:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 49, NUMA node 0, IOMMU group 34
Memory at 92400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

08:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 0, IOMMU group 35
Memory at 92300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
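
The lspci output alone doesn't say why 06:00.0 lost its driver, so as a general sketch (not specific to any one box) the kernel log around the drop time is worth checking for NVMe timeouts or PCIe/AER errors:

# look for controller resets, timeouts, or PCIe errors around the time of the drop
dmesg -T | grep -iE 'nvme|aer|pcieport' | tail -n 100
journalctl -k --since "2 days ago" | grep -iE 'nvme[0-9].*(timeout|reset|remov)|aer'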

Inside the VM, the passthrough is still active for the Coral this time:
00:08.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a

It also appears the passthrough is still active for the GPU.
00:09.0 0300: 10de:1cb2 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 10de:11bd

Checkmk logged 3 of the 4 drives dropping out 22 hours ago.
[attached screenshot: CheckMK graph of the drive dropout]


At the same time, Home Assistant stops logging CPU short-term load via the TrueNAS integration, but SNMP is still active.
[attached screenshot: Home Assistant CPU load graph from the TrueNAS integration]


There are no errors in SNMP from checkmk going to the Dell Server.

zpool status -v
pool: VM_NVME
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
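
If the controller does show up again at the PCI level, this is a sketch of how the drive could be returned to the pool without a full reboot (the device name is a placeholder; take the real one from zpool status):

# ask the kernel to rescan the PCI bus in case the controller re-enumerates
echo 1 > /sys/bus/pci/rescan

# if the disk reappears, bring it back online in the pool
zpool online VM_NVME /dev/disk/by-partuuid/<id-from-zpool-status>
zpool status -v VM_NVME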

To note, the removed drive is the same drive that was being removed previously, based on the identification numbers. It's possible the drive could be acting up, but given the simultaneous loss of data to CheckMK and Home Assistant via TrueNAS SNMP... I'm not sure I'm ready to make that conclusion.

I will note that when it dropped out, CPU usage did spike, as was captured in CheckMK. I'm wondering if it's possible the CPU load was too high and it dropped a drive? Ignore the timestamp, as the system is on the wrong time scale.
[attached screenshot: CheckMK CPU load graph around the time of the dropout]


Wellllll..... welllllll.... wellll..... Dec 30th drop out
[attached screenshots: CheckMK graphs of the Dec 30th dropout]


Starting to wonder if it's kicking things out when CPU load is too high
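
To catch the exact moment next time, and whether it lines up with a load spike, a crude logging loop like this sketch should be enough (the log path and interval are arbitrary):

# log the load average plus which NVMe controllers still have a driver, once a minute
while true; do
  echo "=== $(date -Is) load: $(cut -d' ' -f1-3 /proc/loadavg)" >> /root/nvme-watch.log
  lspci -nnk -d ::0108 | grep -E '^[0-9a-f]|Kernel driver' >> /root/nvme-watch.log
  sleep 60
done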
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
The drive is now not showing up in lspci after a reboot. It appears it may have truly failed.

05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 0, IOMMU group 32
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

07:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 75, NUMA node 0, IOMMU group 33
Memory at 92400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

08:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 104, NUMA node 0, IOMMU group 34
Memory at 92300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
@loca5790 Do you happen to have any debugs or earlier information grabs that might show the firmware revisions of the drives? I recall the early days of the Samsung 990 had some "teething issues."
I don't, but I could pull one out and attach it to a Windows computer to use Samsung Magician, since it's not supported on anything other than Windows.

I have confirmed with Samsung, via the serial numbers, that they are built with the latest chipsets, though. "Those are some of the most stable drives we have; it's surprising you'd have any issue."
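
Side note in case it saves someone a drive pull: assuming nvme-cli is available in the SCALE shell, the running firmware revision shows up without Windows:

# firmware revision is in the FW Rev column
nvme list

# or per drive
nvme id-ctrl /dev/nvme0 | grep -i '^fr '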
 