Scale R730XD ASUS PCIE NVME Cards NVME's Randomly Drop

loca5790 · Dec 30, 2023

Hey All,

I have an R730XD with PCIE ASUS M.2 it randomly drops NVME's after a couple days. The poool shows degraded and I have to restart the server, once I restart the server all 4 NVME's come back on and if you run a health check from the CLI it shows all drives as good. The drives are 3 months old and it just happens randomly.

PCIE Card https://a.co/d/cFE2GsR
Drives are Samsun 990 PRO's

Anyone ever ran into this before?

loca5790 · Dec 30, 2023

Digging a little more with lspci command it recognizes all 4 of the NVME controllers.

I am having the same DEVICE ID number showing up as failed when interogating the pool status in Truenas scale but the drive SN is moving. I am only able to get 3 drives to come online at one time now.

When I run the lscpi -nv command.... I find something ineteresting. The devices functioning are using kernel driver "NVME" the "failed" drive is using "vfio-pci"

When I run the lscpi -nv command.... I find something ineteresting. The devices functioning are using kernel driver "NVME" the "failed" drive is using "vfio-pci"
05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 0, IOMMU group 32
Memory at 92600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
06:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 77, NUMA node 0, IOMMU group 33
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: vfio-pci
Kernel modules: nvme

-lspci
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

lspci -k|grep mpt3sas
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas

loca5790 · Dec 30, 2023

hopefully this helps someone and not sure how this even happened. One of my passthrough devices got switched on a VM to my NVME. It randomly happened middle of the afternoon today.

ChrisRJ · Dec 31, 2023

If VMs are in the game it would be good how that is.

loca5790 · Dec 31, 2023

ChrisRJ said:
If VMs are in the game it would be good how that is.

Hey Chris,

Maybe I should add some more details.

I have a VM running a coral TPU as a pass through. That machine lost it's coral TPU which I know was there because I had been using it yesterday and verified it's usage. Somehow middle of day that machine lost it's Coral TPU passthrough and it moved to the NVME. The NVME is part of a pool and caused the pool status to be degraded because of a lost drive.

ChrisRJ · Dec 31, 2023

Are you running this VM inside TrueNAS or is TrueNAS also running as a VM? A kind-of picture might be helpful

firesyde424 · Jan 3, 2024

The R730xd isn't compatible out of the box with M.2 NVME drives... or any NVME drives really. I manage a couple dozen of these model servers as HCI storage devices. There's no documented NVME support, though the R730xd will boot from certain models of Intel and Micron NVME drives via PCIE to M.2 or PCIE to U.2 adapters.

We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives. It's possible there's a vague hardware compatibility issue with that generation of Dell server and Samsung NVME drives, or, more likely, the v3 and v4 Xeon E5 CPUs that will run in the 730xd. In our testing, even Intel and Micron NVME drives had a few stability issues and we ended up scrapping the idea altogether, leaving the servers to boot from their internal SATA-DOM or SAS SSDs in the front drive bays.

You may try different models of drives to see if that helps. That generation of servers only has PCIE gen 3 so there's not really much of a performance benefit from the gen 4 or gen 5 drives such as the 980 or 990.

ChrisRJ · Jan 3, 2024

IIRC the Samsung 990 Pro has a write cache. Once that is saturated the write performance goes down considerably, if memory serves me right.

Ericloewe · Jan 3, 2024

firesyde424 said:
The R730xd isn't compatible out of the box with [...] or any NVME drives really.

firesyde424 said:
There's no documented NVME support

Well, that's just not true:

Up to twenty 2.5-inch hard drives and up to four 2.5-inch (U.2) NVMe drives (in slots 20 to 23).

(This applies to the 24-bay model, but the motherboard is the same throughout the range.)

Now, it's important to keep in mind that Dell's enablement kit is pretty substantial. I haven't examined one, but I suspect they're using a PCIe switch for the retimer functionality (or maybe just plain retimers, but with a massive heatsink to make it look more expensive). This tells me that tolerances on the PCIe links are pretty tight, which to some extend leads into:

firesyde424 said:
We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives

This is very unusual, but a relevant data point.

HoneyBadger · Jan 3, 2024

The R730XD SFF will work with NVMe devices as stated - but booting from them seems to be a challenge. That doesn't seem to be the issue at play here though.

@loca5790 I'm curious as to how the NVMe SSD got toggled to passthrough to a VM (vfio driver claim). I assume you're using an M.2/PCIe Coral TPU - how is that attached? Is that in the ASUS HyperX card, or on its own mount?

firesyde424 · Jan 3, 2024

Ericloewe said:
Well, that's just not true:

(This applies to the 24-bay model, but the motherboard is the same throughout the range.)

Now, it's important to keep in mind that Dell's enablement kit is pretty substantial. I haven't examined one, but I suspect they're using a PCIe switch for the retimer functionality (or maybe just plain retimers, but with a massive heatsink to make it look more expensive). This tells me that tolerances on the PCIe links are pretty tight, which to some extend leads into:

This is very unusual, but a relevant data point.

And here I thought I knew everything! Guess I should have checked the source documentation first. I wasn't aware the R730xd had any NVME config available. I'd wager it's a rare backplane to find. Thanks for pointing that out.

I know for sure that the R730xd will not boot from a Samsung 970 Evo Plus or 980 Pro, Crucial T500, Western Digital Black(I don't remember the model), and a hand full of random generic branded NVME SSDs extracted from Lenovo laptops. You can install an OS to them but they do not show as bootable options in the BIOS and will not boot once the OS has been installed. We tested TrueNAS Core and ESXi.

Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.

The R730xd will also boot from a BOSS-S1 card containing dual Micron 2300 NVME drives, though I wasn't able to find anywhere that said the BOSS-S1 card was supported on the R730xd.

I remember years ago there being a few competing standards for booting from NVME drives and my assumption here is that Dell only implemented one or a few of those in their 13th gen servers. Given that Intel and Micron drives are available directly from Dell with Dell firmware, it makes sense that Dell would support whatever protocol those companies were using at the time.

Ericloewe · Jan 3, 2024

firesyde424 said:
I'd wager it's a rare backplane to find.

Nope, all 24-bay units support NVMe, though I expect most are not wired for it. Very easy to retrofit, of course, it's just an adapter card and SFF-8643 cables. Same goes for R630 10-bay units.

firesyde424 said:
I remember years ago there being a few competing standards for booting from NVME drives

I never heard anything about that, though I can easily imagine the Gen13 firmware being limited in terms of the NVMe driver. That's something you could work around with enough effort...

firesyde424 said:
I know for sure that the R730xd will not boot from a Samsung 970 Evo Plus or 980 Pro, Crucial T500, Western Digital Black(I don't remember the model), and a hand full of random generic branded NVME SSDs extracted from Lenovo laptops. You can install an OS to them but they do not show as bootable options in the BIOS and will not boot once the OS has been installed. We tested TrueNAS Core and ESXi.

Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.

Then again, that sounds a lot like they're whitelisting models they sell without locking the firmware to Dell-branded drives...

firesyde424 said:
The R730xd will also boot from a BOSS-S1 card containing dual Micron 2300 NVME drives, though I wasn't able to find anywhere that said the BOSS-S1 card was supported on the R730xd.

Those things are basically RAID controllers, right? So they would use a different driver than plain NVMe disks...

There's enough boot nonsense with legacy options. On Gen15, I simply cannot boot ZFS Boot Menu from SATA disks attached to an HBA330 mini. The Linux kernel just crashes and burns, I suspect because nobody actually ever tested the mpt3sas code with the stub loader or something and a race condition with the UEFI driver is tripped. Everything works fine if I boot ZBM from, say, NVMe, and load the real OS from SATA. Haven't tried SAS disks, but I might for fun one day.

firesyde424 · Jan 4, 2024

Ericloewe said:
Nope, all 24-bay units support NVMe, though I expect most are not wired for it. Very easy to retrofit, of course, it's just an adapter card and SFF-8643 cables. Same goes for R630 10-bay units.

Now that's interesting, I didn't know that. We run 19 of that model in a split backplane config with dual HBA330s. This was a specific config we selected from the factory. Through just playing around with other R730xd servers I discovered they all have the exact same backplane, or at least all of our R730xd servers have the same backplane and can be converted to a split backplane, even the ones that weren't configured that way from the factory. It would make sense that they could all run that 4 x NVME config.

firesyde424 · Jan 5, 2024

firesyde424 said:
Via a U.2\U.3 to PCIE adapter card, I know for sure the R730xd will boot reliably from Intel P4510, Intel P4610, Intel P5500, Micron 7400 Pro, Micron 7400 Max, Micron 7450 pro, Micron 7450 Max, and Micron 9400 Pro NVME SSDs.

I went back and looked at my notes and discovered that this was incorrect. The card I was thinking of here is a BOSS-S2 card with Micron NVME drives which we did not test on the R730xd. The BOSS-S1 card that we tested has Micron 5100 M.2 SATA drives.

The reason for all these tests was because the vast majority of our R730xd servers are in use as VMware vSAN nodes with fully populated backplanes. They were originally configured to boot from a 64GB SATA-DOM drive plugged into the motherboard. However, over their lifespan, ESXi install requirements have grown and the original boot drives ordered with the servers are no longer sufficient and we were not able to find a SATA-DOM drive larger than 64GB. Our testing was meant to see if we could boot the servers from either an M.2\PCIE or U.2\PCIE adapter card so that way we could meet the boot drive requirements of the new ESXi versions without sacrificing node storage capacity for vSAN.

Ericloewe · Jan 5, 2024

High-end USB flash drives might also be an option to consider, either the high-end native USB SSDs or USB/NVMe bridges with NVMe SSDs.

loca5790 · Jan 5, 2024

HoneyBadger said:
The R730XD SFF will work with NVMe devices as stated - but booting from them seems to be a challenge. That doesn't seem to be the issue at play here though.

@loca5790 I'm curious as to how the NVMe SSD got toggled to passthrough to a VM (vfio driver claim). I assume you're using an M.2/PCIe Coral TPU - how is that attached? Is that in the ASUS HyperX card, or on its own mount?

That is correct. It is on it's own mount on an Ablecon PEX-MP117 based off what was recommended on the forms. The ASUS HyperX only has 4 990 Pro's. They are identified individually in 5-8 via LSPCI. I am slo very curious as to how that routing changed and kicked my separate PCIe card out and moved the vfio to an NVME. This was also recorded in CheckMK when it identified a Disk IO change in TrueNas... it appears somehow the shift happened in TrueNAS because when I tried shifting PCIE card slots in the server the R730XD reporded slot card changes via SNMP.

This is further compounded by the fact that the locations reported through SNMP to CheckMK is loosing disk IO locations and being reported as item not found in monitoring data.

firesyde424 said:
The R730xd isn't compatible out of the box with M.2 NVME drives... or any NVME drives really. I manage a couple dozen of these model servers as HCI storage devices. There's no documented NVME support, though the R730xd will boot from certain models of Intel and Micron NVME drives via PCIE to M.2 or PCIE to U.2 adapters.

We had a considerable number of issues when testing some of our R730xd servers with Samsung 980 pro drives. It's possible there's a vague hardware compatibility issue with that generation of Dell server and Samsung NVME drives, or, more likely, the v3 and v4 Xeon E5 CPUs that will run in the 730xd. In our testing, even Intel and Micron NVME drives had a few stability issues and we ended up scrapping the idea altogether, leaving the servers to boot from their internal SATA-DOM or SAS SSDs in the front drive bays.

You may try different models of drives to see if that helps. That generation of servers only has PCIE gen 3 so there's not really much of a performance benefit from the gen 4 or gen 5 drives such as the 980 or 990.

There is a documented enablement kit that uses a PCIE to a 24 bay backplane. I have a 12 bay and am using a PCIE card that many are reporting success with. The booting from them is what seems to be hit or miss. I am not using them to boot and the reason I am using them is for stabality and longevity. When doing testing on them they were transfering around 6 gb/s which is more than what I need.

ChrisRJ said:
Are you running this VM inside TrueNAS or is TrueNAS also running as a VM? A kind-of picture might be helpful

I am running TrueNas as the base OS and everything else is a VM inside of it. The NVMe's are in a pool that is dedicated strictly to VM use. I am booting of mirroed SSDs in the flexbay's.

All I have not seen any issues since I corrected the drive pass through but again it takes a while to show up. Not sure why or how but I'm hoping it's simply a me thing where I made a mistake.

loca5790 · Jan 8, 2024

It's back. Going to need some help here as I believe the issue is truenas.

lspci -nv (shows the 4 drives are in the same locations but the kernel driver in use for 6 is missing)

05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 32
Memory at 92600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

06:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: fast devsel, IRQ 47, NUMA node 0, IOMMU group 33
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel modules: nvme

07:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 49, NUMA node 0, IOMMU group 34
Memory at 92400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

08:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 0, IOMMU group 35
Memory at 92300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

Inside the VM this time the pass through is still active for the coral
00:08.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a

It also appears the pass through is still active for the GPU.
00:09.0 0300: 10de:1cb2 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 10de:11bd

Checkmk logged 3 of the 4 drives dropping out 22 hours ago.

At this same time home assistant stops logging CPU shortterm load via the truenas integration but SNMP is still active.

There are no errors in SNMP from checkmk going to the Dell Server.

zpool status -v
pool: VM_NVME
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.

To note the removed drive is the same drive that was being removed previoulsy via the identification numbers. it's possible the drive could be acting up but with the loss to checkmk and home assistant via truenas SNMP.... not sure I'm ready to make that conclusion.

I will note.... that when it dropped out CPU usage did spike as was captured on checkMK. I'm wondering if it's possible CPU load was too high and it dropped a drive? ignore the time stamp as the system is on the wrong time scale.

Wellllll..... welllllll.... wellll..... Dec 30th drop out

Starting to wonder if it's kicking things out when CPU load is too high

loca5790 · Jan 8, 2024

The drive is now not showing up in lspci after a reboot. It appears maybe it has truly failed.

05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 0, IOMMU group 32
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

07:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 75, NUMA node 0, IOMMU group 33
Memory at 92400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

08:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 104, NUMA node 0, IOMMU group 34
Memory at 92300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme

HoneyBadger · Jan 8, 2024

@loca5790 Do you happen to have any debugs or earlier information grabs that might show the firmware revisions of the drives? I recall the early days of the Samsung 990 had some "teething issues."

loca5790 · Jan 8, 2024

HoneyBadger said:
@loca5790 Do you happen to have any debugs or earlier information grabs that might show the firmware revisions of the drives? I recall the early days of the Samsung 990 had some "teething issues."

I don't, I could pull one out and attach it to a windows computer to use Samsung Magician since it's not support on anything other than windows.

I have confirmed with Samsung via serial numbers that they are built with the latest chipsets though. "Those are some of the most stable drives we have, it's suprising you'd have any issue."

Important Announcement for the TrueNAS Community.

Scale R730XD ASUS PCIE NVME Cards NVME's Randomly Drop

Dabbler

Dabbler

Dabbler

Wizard

Dabbler

Wizard

Contributor

Wizard

Server Wrangler

actually does care

Contributor

Server Wrangler

Contributor

Contributor

Server Wrangler

Dabbler

Dabbler

Dabbler

actually does care

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Scale R730XD ASUS PCIE NVME Cards NVME's Randomly Drop"

Similar threads