Killed an NVMe drive running a long SMART test

essinghigh

Dabbler
Joined
Feb 3, 2023
Messages
19
Hello folks,
Not looking for any support on this one - just thought it might be an interesting share.
I just killed a Sabrent Q4 2TiB NVMe by running a SMART test -- I'm not joking. It also happened to be my primary Windows machine's OS drive, so this will be fun to rebuild on a spare.

Was doing some performance testing and wanted to see how much the drive had been written to, so, as any sane Windows user does, I used scoop to install smartmontools. I checked the drive with smartctl -a C: and it reported 16TiB written. Not bad!
Then came the fatal mistake: I thought "what the hell - let's run a SMART test". And I did (smartctl -t long C:).
At first it seemed fine, but after about 5 seconds Windows bluescreened: a critical process had died. Weird, but whatever. I let it collect a dump so I could take a look later, and uh oh! I've booted into UEFI. That's not good. I go to reboot and notice the NVMe is not in the boot list, which is more alarmingly not good. No problem though - probably just some weird issue I've run into, I'll just boot into a live Mint environment and fix it.

Nope.

Mint couldn't see the drive, odd, okay. I go into UEFI - it doesn't even see the drive connected at the PCIe level. Weird -- maybe I did something to the slot?
I switch slot, no dice. Okay, time to move to another machine.
Another machine, no dice. Shit. It's dead.

I'm ordering a USB enclosure for the thing on the off chance I can at least get some life out of it, but for now it seems I've completely killed it. Very ironic, considering the method of its execution. And I can't really chalk it up to random coincidence.
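
For anyone who wants to repeat the "how much has been written" check without reading it off the text output: smartctl can emit JSON with the -j flag, and for NVMe drives the health log reports "Data Units Written", where one unit is 1000 blocks of 512 bytes. The following is only a minimal sketch. The JSON key names (nvme_smart_health_information_log, data_units_written) are an assumption based on recent smartmontools releases, and SAMPLE is a made-up value, not real output from the drive in this thread.

```python
import json

# One NVMe "data unit" is 1000 blocks of 512 bytes (SMART/Health log definition).
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tib(units: int) -> float:
    """Convert an NVMe 'Data Units Written' count to TiB."""
    return units * BYTES_PER_DATA_UNIT / 2**40


def written_tib_from_smartctl_json(text: str) -> float:
    """Extract data written (TiB) from `smartctl -j -a <device>` output.

    The key names used here are an assumption based on recent
    smartmontools releases; check them against your own output.
    """
    report = json.loads(text)
    health = report["nvme_smart_health_information_log"]
    return data_units_to_tib(health["data_units_written"])


# Hypothetical sample, NOT real output from the drive discussed here.
SAMPLE = '{"nvme_smart_health_information_log": {"data_units_written": 34359738}}'

if __name__ == "__main__":
    print(f"{written_tib_from_smartctl_json(SAMPLE):.1f} TiB written")
```

Reading the counter this way only queries the health log; it does not start a self-test.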
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It would be very odd for a 'read' test to damage the NVMe drive. And this drive was one of their Rocket line?

A few questions:
1. What motherboard do you have?
2. How is the NVMe drive connected to the motherboard (add-on card or built into the motherboard)?
3. How many hours do you feel you have on the nvme drive?
4. Have you checked the Sabrent website for a new firmware update for the nvme?
5. Is the nvme drive still under warranty?
6. What version of smartmontools were you using? I assume 7.4 but one should never assume anything.
7. Did the NVMe drive have a heatsink on it? I have to assume it was a PCIe 4.0 drive, and those do get hot. I guess it could have melted inside.

While you may already have done this, I would:
1. Unplug the computer for 5 minutes, then plug it back in.
2. Perform a factory/default BIOS reset.
3. If the NVMe is on an add-on card, does bifurcation need to be on in the BIOS?
4. Reconfigure the BIOS so the drive is present. Well, I'm hoping.

When all else fails you could resort to the "Calibrated Rubber Hammer" or "Calibrated Clawfoot Hammer" for some stress relief.

I know you were not asking for help but I'm thinking you have a system issue, likely not the nvme drive unless it catastrophically failed. A SMART test should not cause this.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
To be honest, as terrible as that scenario is, I wouldn't even be that surprised about such a firmware bug... Especially because it seems that the PCIe 4.0 and 5.0 SSDs are even buggier than the older ones (because of their fancy new controllers, I guess).
 
Joined
Oct 22, 2019
Messages
3,641
I have to agree. Something seems "off" about an NVMe suddenly dying after supposedly starting an internal "read" test. Five seconds in it dies. That smells like a firmware issue.

Now no computers can even detect it? Not even in the BIOS?

I'd contact Sabrent to file for an RMA. (Or avoid their NVMe SSDs and favor a different brand, such as Samsung.)
 

joeschmuck

Ericloewe said: "...such a firmware bug..."
I had an SSD that would stop writing at a specific power-on-hours value. Cycling power would reset it for about 1 hour, then it would stop again. That allowed enough time to apply a firmware bug fix. Thankfully others had reported this problem months before I had the issue, so the fixed firmware had recently become available.
 

joeschmuck

In reply to: "Now no computers can even detect it? Not even in the BIOS?"
So we have been told, but if there is an add-on card and if bifurcation is required, and the BIOS got reset... you can see where I'm going. Too many possibilities, so we need more information.
 

essinghigh

Answering joeschmuck's questions and suggestions above, in order:

1. I've tested this in the following --
a. Gigabyte Gaming X AX (original board it was installed in - all slots)
b. ASUS Maximus X Formula (via the slot I didn't have to take the motherboard apart to get at, and a PCIe add-in card (bifurcation enabled))
c. A Lenovo Thinkpad T480S (see end of post)

2. Originally built in, though I've tested both to no avail.

3. If I am remembering the SMART output correctly, it's had about 2000 power on hours with about 500 power cycles.

4. Sabrent very rarely release firmware updates, but I checked just to be sure, there hasn't been anything in a long while. Unfortunately I can't compare it to the firmware on the drive :)

5. I'd be able to get a refund for it via Amazon most likely - they provide 2 year "troubleshooting" support on these things, which includes replacement parts. My main concern is that the data is technically still present and, more importantly, unencrypted, so I am hesitant to ship it. I'll probably just sink the cost of a replacement myself.

6. 7.4-1

7. Yes. Idle temperatures were about 35-40°C. I get concerned about overheating on these PCIe 4.0 devices, as you mentioned, so I tend to keep an eye on them, especially where motherboard manufacturers like to put the slots right underneath the GPU.

In terms of troubleshooting, I've done all of this and the board just does not seem to recognize the drive at all. The Gigabyte board will let me know what is connected to which slot and where (i.e. the top slot will show that there is an x16 PCIe device connected). But the board could not see anything connected to the M.2 slot whatsoever.

Ericloewe said: "To be honest, as terrible as that scenario is, I wouldn't even be that surprised about such a firmware bug... Especially because it seems that the PCIe 4.0 and 5.0 SSDs are even buggier than the older ones (because of their fancy new controllers, I guess)."

Yes, I can well imagine that being the case with these. All in all the scenario isn't terrible - I have TrueNAS for a reason! Will just be a slight pain having to sort a replacement drive.

Quoting the earlier reply: "I have to agree. Something seems 'off' about an NVMe suddenly dying after supposedly starting an internal 'read' test. Five seconds in it dies. That smells like a firmware issue. Now no computers can even detect it? Not even in the BIOS? I'd contact Sabrent to file for an RMA. (Or avoid their NVMe SSDs and favor a different brand, such as Samsung.)"

joeschmuck said: "So we have been told, but if there is an add-on card and if bifurcation is required, and the BIOS got reset... you can see where I'm going. Too many possibilities, so we need more information."

My guess would be firmware as well. The five-second mark is just when the machine bluescreened; the drive may have died instantly, with Windows only reporting a critical process dying about 5 seconds later. I'm not actually sure how Windows operates in this regard, as I would have expected a complete freeze rather than a bluescreen.

Correct that no computers can detect it, via add-in card or otherwise. I will say I did get an exceptionally brief glimmer of hope with the Lenovo T480S - it detected the drive in the boot menu, once. I tried to boot into it and got nothing, straight back to the boot menu, though this time the drive didn't appear. I booted into a Mint live environment on it and there was no sign of the drive there either.
 

joeschmuck

If you boot up Mint live and enter nvme list at the command line, what does it return? I don't know if Mint live has the 'nvme' command installed. If you boot up SCALE, it is installed.
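
If the nvme command is not on the live image, a rough equivalent check is possible, because the Linux kernel lists every NVMe controller it has enumerated under /sys/class/nvme. What follows is only a minimal sketch. It assumes the usual sysfs attribute names (model, firmware_rev, state), and list_nvme_controllers is a hypothetical helper name, not part of any tool. An empty result would suggest the kernel never saw the drive at all.

```python
from __future__ import annotations

from pathlib import Path


def list_nvme_controllers(sysfs_root: str = "/sys/class/nvme") -> list[dict]:
    """Return one dict per NVMe controller the Linux kernel has enumerated."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # NVMe driver not loaded, or no controllers present

    def read(ctrl: Path, attr: str) -> str:
        # Attributes are small text files; a missing one yields "".
        try:
            return (ctrl / attr).read_text().strip()
        except OSError:
            return ""

    return [
        {
            "name": ctrl.name,
            "model": read(ctrl, "model"),
            "firmware": read(ctrl, "firmware_rev"),
            "state": read(ctrl, "state"),
        }
        for ctrl in sorted(root.iterdir())
    ]


if __name__ == "__main__":
    found = list_nvme_controllers()
    if not found:
        print("No NVMe controllers visible to the kernel")
    for ctrl in found:
        print(ctrl)
```

This only reads sysfs, so it sends nothing to the drive itself. The firmware field would also give something to compare against the vendor's latest release.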

I can see that you have tried to solve the issue. If by chance the drive is listed by nvme list, then you should be able to issue the format command or sanitize command. But do not do this until you have read some examples on the internet. I only know what the command syntax is supposed to be; I have no personal experience with it myself.

I still would contact Amazon about replacing the drive with the concern about data privacy. See if they even need it back.

I would also contact Sabrent to see if they can help. Never know, they may just ship you a replacement. Worth a try.

If you make any headway on this problem, please share.
 

essinghigh

An update arrives!

As I mentioned in the original post, I purchased a USB adapter for the drive - just some cheap "multibao" enclosure, though surprisingly well made!
Despite having tried every which way to get the thing up and running with absolutely no luck, I decided to give it a go - as this was something that Crucial recommend on their "Why did my SSD 'disappear' from my system" FAQ article.

It's a genuine (albeit early) Christmas miracle. I connected the USB enclosure to my machine, waited a good half an hour, nothing. Disconnected it, reconnected it, and it showed up! I'm not sure what could have happened here, Crucial's article mentions something regarding mapping tables, though I am not familiar enough with how NVMe works at this level.
I was a little concerned to do any testing on the drive, and prioritized getting the data from it onto my NAS as quickly as I could - which took a fair while.

After that though, I gave it another go in the M.2 slot on my system, and it worked with seemingly no issues.
I'm not exactly jumping to dump data back onto it - I'll need to keep a close eye on it for a while, but it seems that for now at least I've managed to get the data copied off and it LOOKS to be working.

I did check the firmware now that the drive is accessible and it is running the latest version, so I'm going to raise this with Sabrent to see if they can do any testing on their side.
 

joeschmuck

That is odd but glad it is working for you now. Yes, I agree, keep an eye on it for a while before trusting it. Run another SMART test to see if it happens again.
 