Strategy to comprehensively test SSDs

IronDuke

Dabbler
Joined
Jan 23, 2023
Messages
18
I'd be grateful for any suggestions as to how to comprehensively test 16x 2TB SATA 3 SSDs. Situation is thus:

I have 2x Gen9 DL360s, with 8x SFF slots each (see signature). I ordered 16x (very) cheap and (potentially) nasty 2TB Chinese knock-off SSDs. Had to wait a month for them to arrive, but they're here. I consciously bought them, knowing it was a bit of a punt, because they might be terrible. But with eBay's "Get what you ordered" guarantee, also bought on a credit card with purchase warranty, and the seller accepting returns, I feel as though I've mitigated my risk, even if they aren't good enough to use in a NAS and I need to look at something else. But if they ARE good enough, then it's a win all round.

My use cases aren't particularly stressful for a NAS to begin with:

1. Backups of desktop & laptop machines. Only about 250GB on each of 4 machines.
2. Possibly use Apple Time Machine on those machines too.
3. Media streaming (again only about 1.5TB of movies)
4. Storage of several thousand RAW images from a camera. Very infrequent read access to these.
5. Storage of MP4 video from camera (maybe 15 minutes per file, a few hundred files). Very infrequent read access to these too.
6. Running 3-6 VMs or Kubernetes apps, one of which is an NVR with up to 5x HD cameras. These only record when they detect motion, so only a fraction of the time, nothing like 24x7. Usually get less than 10 videos per day per camera, each 1-2mins long.

I can split these workloads between the two boxes too.

Both boxes have plenty of ECC RAM, and it's pretty cheap to add more should it be required: currently about $1.10 per GB, with plenty of empty DIMM slots available. The VMs might push me to do that. Plenty of CPU too, 2x E5-2640 v3 (8 cores) in each. Networking is 10Gb optical from the DL360s to the switch and 2.5Gb copper to the client machines, so right now I don't need NAS performance much faster than the 2.5Gb client links. I might move a couple of the client machines to 10Gb optical later in the year, so that's a consideration, but I would think an all-SSD NAS could saturate even a 10Gb link, or not be far off it.

So it's really all about the storage. I'm thinking all 8 drives in a RAIDZ2, giving about 10TB usable per box, which should be plenty, capacity-wise. Both boxes have H220 HBAs, so there's no hardware RAID in the way.
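(For the eventual pool itself, what I have in mind is roughly the one-liner below, whether done on the command line or through the NAS UI. The pool name and the eight by-id paths are placeholders; I'd use the real /dev/disk/by-id names so the pool isn't upset by bays being reshuffled.)

zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-SSD2TB-1 /dev/disk/by-id/ata-SSD2TB-2 \
    /dev/disk/by-id/ata-SSD2TB-3 /dev/disk/by-id/ata-SSD2TB-4 \
    /dev/disk/by-id/ata-SSD2TB-5 /dev/disk/by-id/ata-SSD2TB-6 \
    /dev/disk/by-id/ata-SSD2TB-7 /dev/disk/by-id/ata-SSD2TB-8
zpool status tank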

I'm still a bit wary of the SSDs though. So far, I've got a known-good 1TB SSD with Linux (Ubuntu Server) in slot 1 as the boot disk, and the "disk under test" is one of the new 2TB SSDs in slot 2. I was quite pleased that these SSDs do respond to smartctl, indicating they have at least some SMART capability. But I'm also conscious that nefarious influences exist in the world, and it wouldn't be impossible for the drives to simply lie in their SMART responses.
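For reference, this is the sort of thing I mean by "respond to smartctl" (with /dev/sdb standing in for whatever the disk under test enumerates as):

smartctl -i /dev/sdb    # identity: model, firmware, claimed capacity
smartctl -A /dev/sdb    # vendor attribute table, including raw values
smartctl -x /dev/sdb    # everything, including error and self-test logs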

In terms of testing, I need confidence that the capacities are all as they claim (ie all 2TB usable), which means I need to write every byte, and read it back to verify that it's OK, preferably several times. I'd also like to stress-test them to a degree, meaning subject them to continuous read/write for several hours, testing thermal susceptibility and integrity of the flash. If they pass both of those requirements, then I'm good to go.
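What I have in mind for the capacity check is a destructive badblocks pattern test, or possibly f3probe from the f3 package, which is aimed squarely at fake-capacity flash. Roughly the following, with /dev/sdb again a stand-in for the disk under test, and in the full knowledge that both commands wipe the drive:

badblocks -wsv -b 4096 /dev/sdb    # writes four patterns and reads each one back
f3probe --destructive /dev/sdb     # alternative: probes specifically for fake capacity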

I've run smartctl -t long on the DUT, and it takes a few minutes (I was expecting an hour, but I suppose it's an SSD) and reports no errors. But that could be a lie. So I'm looking for a suite of tests to verify each byte of the storage (ideally NOT using smartctl), and something to verify the durability during sustained read/write cycles. I suppose I could just run whatever the capacity verification ends up being in a loop for a few hours. With 8 drive bays and one boot disk, I can run these tests in parallel for 7 disks.
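For the sustained read/write part, my rough plan is a verified fio write across each whole device, looped a few times and launched once per bay. Something like the sketch below, with /dev/sdb through /dev/sdh as placeholders for the seven disks under test (again destructive, so the device names need double-checking before running):

for dev in /dev/sd{b..h}; do
    fio --name="burnin-${dev##*/}" --filename="$dev" \
        --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=16 \
        --verify=crc32c --do_verify=1 --loops=3 &
done
wait    # each job writes the whole device, then reads it back and checks the checksums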

Long post, but hopefully someone's already needed to do this kind of thing, and so no need to re-invent the wheel.

Thanks in advance for any pointers!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
In terms of testing, I need confidence that the capacities are all as they claim (ie all 2TB usable), which means I need to write every byte, and read it back to verify that it's OK, preferably several times. I'd also like to stress-test them to a degree, meaning subject them to continuous read/write for several hours, testing thermal susceptibility and integrity of the flash.
That would be a bad idea: the process itself would wear the NAND cells and shorten the drives' life. Unlike magnetic platters, NAND cannot sustain an indefinite number of rewrites.
Better to trust the SMART tests. And have a backup!
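If you do decide to exercise them anyway, at least capture the SMART attribute table before and after, so you can see exactly what the run cost in wear and whether any reallocations appeared. A rough sketch, with /dev/sdb standing in for whichever drive is on the bench:

smartctl -A /dev/sdb > smart_before.txt
# ... run the burn-in ...
smartctl -A /dev/sdb > smart_after.txt
diff smart_before.txt smart_after.txt    # look for jumps in wear or reallocation counters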
 

IronDuke

Dabbler
Joined
Jan 23, 2023
Messages
18
Actually I've just found a very useful resource:


That explains the function of all the SMART parameters, and I think I now understand why these disks were so inexpensive.

The RAW_VALUE for attribute 163 is 3000. This corresponds to the number of bad blocks found when the device was produced. A "normal" level seems to be a couple of hundred, so my assumption is that these drives were at the low end of the production yield. That actually makes me feel a lot better about them, assuming the controller does what it's supposed to and simply maps out those blocks, substituting spares. If I've got my math right, 3000 duff blocks is something like a 1.4-millionth of the capacity of the drive, so I'm not remotely worried about that, as long as other blocks don't degrade during use beyond what's expected.
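Spelling out the math, on my assumption (not stated in the SMART documentation) that attribute 163 counts 512-byte sectors rather than NAND erase blocks:

3000 blocks x 512 bytes ≈ 1.5 MB flagged bad at the factory
2 TB ≈ 2 x 10^12 bytes, and 2 x 10^12 / 1.5 x 10^6 ≈ 1.3-1.4 million (depending on decimal vs binary terabytes)

So the factory-bad blocks are on the order of a millionth of the capacity. If the attribute actually counts NAND erase blocks instead, the figure would be a few thousand times larger, though presumably still within the drive's spare area.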

I have to leave for a business trip now, but on Friday when I'm back, I'm eager to see what this count is on the other drives.
 