NVMe Drives Disconnect after 12-24 hours

Joined
Sep 15, 2021
Messages
4
I have four 1 TB ADATA SX8200 Pro NVMe drives that I use as L2ARC cache. They are on a PCIe card with a PLX switch. They keep disconnecting after 12-24 hours, usually one at a time: sometimes only one drops, sometimes two, sometimes all of them. After a restart they are immediately recognized and work correctly, and then the cycle repeats. I have tried two different 4x M.2 PCIe cards with switches, including a Supermicro one, and two different PCIe slots. I am using a Supermicro X9 board with 2x Intel processors and the latest version of TrueNAS. Does anyone have experience with this? Help appreciated! Specs below:

OS Version: TrueNAS-12.0-U8
Model: Icebreaker 4824
Memory: 256 GiB
CPU: 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
 

Attachments

  • debug-BRAVO-20220216114239.tgz
    2.2 MB · Views: 111

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Does your PCI-e switch card have adequate cooling? This sounds like it overheats and stops working.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't know about anyone else, but I just haven't had the greatest of luck with NVMe M.2s. Some of the WDs do a similar annoying thing in my experience.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not much of a sample size, but I haven't seen any issues with Samsung 970 EVO Plus units (4x 1 TB and 1x 500 GB).

I have seen some weird reports of people running custom BIOS settings (not exposed by the setup utility, but present in the firmware) to do x4/x4/x4/x4 bifurcation and then having odd problems not unlike those mentioned here: drives dropping, PCIe link downgrades, etc. I'm running such a setup on my main workstation (Samsung 970 Evo Plus 1 TB x2 + 500 GB x1 + WD SN700 500 GB) with no issues, so perhaps we are just looking at sucky SSD controllers?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'm going to go with @Samuel Tai's overheating comment, but expand on it. NVMe devices under continuous load can get really, really hot. I've had U.2 units in 2U servers trip into predictive failure due to fan noise control tweaks, and nearly burned my hand when I pulled them. Some of the Gen 4 units are rated at 2 A on 12 V plus 2 A on 5 V... 34 watts is double what the usual 2.5" spinning rust drive draws. The M.2 units on PCIe cards often have less surface area, and less airflow...
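
For reference, the 34 W figure above is just volts times amps summed over the two supply rails. A minimal sketch of that arithmetic; the function name is made up for illustration, and the rail values are the ones quoted in this post:

```python
def rail_power_w(rails):
    """Sum the power over (volts, amps) pairs, one pair per supply rail."""
    return sum(volts * amps for volts, amps in rails)


# Ratings quoted above: 2 A on the 12 V rail plus 2 A on the 5 V rail.
gen4_rails = [(12, 2), (5, 2)]
print(rail_power_w(gen4_rails))  # 12*2 + 5*2 = 34
```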
 
Joined
Sep 15, 2021
Messages
4
Good thinking, guys. The card does have active cooling (a small fan) and a super beefy heatsink. The ambient air temperature in the server room doesn't go above 80 degrees Fahrenheit; I have a logging thermometer outside the case. Would SMART log this data and report drive temperature? I will try aiming an additional fan directly at the PCIe card, but the case already has high-airflow fans. I'll also note that this card worked without disconnecting for 8 months in a Windows workstation, so I'm 70% confident it isn't a cooling issue. Since the card has a PLX switch, it doesn't need bifurcation, as far as I know. Does anyone have more insight into the interplay between bifurcation, other PCIe settings, and M.2 cards with active switches on them? The topic hasn't been covered in detail online, and it seems important for adding performance to older NAS systems.
 
Joined
Sep 15, 2021
Messages
4
Eric? Can you point to any of those instances where the bifurcation settings caused issues? I originally tried to get that to work with passive 2x M.2 PCIe cards, but gave up and went the active route. The cards were fairly expensive, I think like $300 each, so they should work. They do work flawlessly in other systems.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
SMART should tell you the temperature of each drive. What's the orientation of these drives? Are they parallel to the ground (in which case they'd need more active airflow, as the card would trap hot air underneath) or perpendicular (in which case normal front-to-back airflow should be sufficient)?
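
To line a disconnect up against the temperature just before it, one option is to poll SMART periodically and append the readings to a file. A minimal sketch, assuming smartmontools 7.0 or later (for `smartctl --json`) and that the controllers show up as /dev/nvme0 through /dev/nvme3; the device names, log path, and polling interval are placeholders, not taken from this system:

```python
import csv
import json
import subprocess
import time
from datetime import datetime

# Assumed device nodes; adjust to match the system.
DEVICES = ["/dev/nvme0", "/dev/nvme1", "/dev/nvme2", "/dev/nvme3"]
LOG_PATH = "nvme_temps.csv"  # hypothetical output file
INTERVAL_S = 60              # hypothetical polling interval


def parse_temperature(smartctl_json):
    """Return the current temperature in Celsius from smartctl JSON text, or None."""
    try:
        data = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    temp = data.get("temperature")
    if isinstance(temp, dict):
        return temp.get("current")
    return None


def read_temperature(device):
    """Query one device with smartctl; None means no usable reading came back."""
    try:
        result = subprocess.run(
            ["smartctl", "--json", "-A", device],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_temperature(result.stdout)


def log_forever():
    """Append one timestamped row per device every INTERVAL_S seconds."""
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = datetime.now().isoformat(timespec="seconds")
            for dev in DEVICES:
                writer.writerow([now, dev, read_temperature(dev)])
            f.flush()  # keep the file current in case the box is restarted
            time.sleep(INTERVAL_S)

# Call log_forever() (as root) to start logging.
```

A row with an empty temperature column means the drive returned no reading at that time, which should roughly mark when it dropped off the bus.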
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can you point to any of those instances where the bifurcation settings caused issues?
No, just comments on other forums. I mostly expect them to be sucky controllers and/or sucky adapter cards. Or both.
 
Joined
Sep 15, 2021
Messages
4
Here's something interesting. nvd0 was the drive that disconnected.
 

Attachments

  • Pasted Graphic.png
    613.4 KB · Views: 196

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Definitely looks like overheating. 45 C, when all the others are at 36 C.
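
The comparison being made here (one drive well above the rest) can be automated once temperatures are being collected. A minimal sketch, assuming readings are gathered into a dict; only nvd0 at 45 C and "the others" at 36 C come from the screenshot, while the names nvd1 to nvd3 and the 5 C margin are assumptions for illustration. It flags a relative difference only and says nothing about whether 45 C is itself a problem for this model:

```python
from statistics import median


def find_hot_drives(temps_c, margin_c=5):
    """Return drives running more than margin_c above the median of the other drives."""
    hot = []
    for name, temp in temps_c.items():
        others = [t for n, t in temps_c.items() if n != name]
        if others and temp - median(others) > margin_c:
            hot.append(name)
    return hot


# nvd0 = 45 C and the rest = 36 C as reported; the other device names are assumed.
readings = {"nvd0": 45, "nvd1": 36, "nvd2": 36, "nvd3": 36}
print(find_hot_drives(readings))  # ['nvd0']
```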
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Either the load is uneven, with one drive working and three idling, or there is a problem with the thermal interface between the card and the heatsink that leaves one specific card with significantly less thermal inertia, so that card heats up faster under a balanced load. However, 45 C is pretty low for this model (see https://www.techpowerup.com/review/adata-sx8200-pro-1-tb/7.html, specifically the chart "Thermal test, sequential write").
 