SSD for special vdev

ajp_anton

Dabbler
Joined
Mar 6, 2017
Messages
11
What hardware/drives do people use for their special vdevs like metadata or dedup (for those who use it)?

I'm running TrueNAS inside Proxmox and I currently have all special vdevs on virtual drives. This way I can have big SSDs handled by Proxmox, and then allocate a large number of tiny virtual drives for TrueNAS to use. But for optimum performance and safety, TrueNAS should have direct access to the physical drives?

For example, my 3TB (x2 mirrored) pool currently has an 8GB metadata+small-files vdev and another 8GB dedup vdev. I would have to replace these with four physical SSDs (two mirrored pairs). It feels like such a waste to buy humongous SSDs for such tiny vdevs.
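(For reference, the cutoff for what counts as a "small file" on the special vdev is a per-dataset property; mine is set roughly along these lines, with "tank" and the dataset name as placeholders:)

# see how the special and dedup vdevs are laid out
zpool status -v tank
# route blocks of 32K and smaller to the special (metadata + small files) vdev
zfs set special_small_blocks=32K tank/mydata
zfs get special_small_blocks tank/mydata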
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This way I can have big SSDs handled by Proxmox, and then allocate a large number of tiny virtual drives for TrueNAS to use. But for optimum performance and safety, TrueNAS should have direct access to the physical drives?

Please don't do this. Provide PCIe passthru access to the physical drives as outlined in the virtualization guide. This provides lower latency and protects you against various failure modes inherent in using hypervisor-provided virtual disks.
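On Proxmox the shape of it is roughly as follows (the PCI address and VM ID are only examples; check lspci on your own host):

# find the HBA or NVMe controller's PCI address on the Proxmox host
lspci -nn | grep -iE 'sas|nvme'
# hand the whole controller to the TrueNAS VM (VM ID 100 is an example)
qm set 100 -hostpci0 0000:03:00.0,pcie=1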
 

ajp_anton

Dabbler
Joined
Mar 6, 2017
Messages
11
Please don't do this. Provide PCIe passthru access to the physical drives as outlined in the virtualization guide. This provides lower latency and protects you against various failure modes inherent in using hypervisor-provided virtual disks.
Well, my post already mentions the same reasons to pursue the same end goal as that guide. But what about my question? How do I avoid spending on SSDs that are orders of magnitude oversized?

Not to mention there aren't enough PCIe lanes to connect them all. I guess I could use SATA SSDs behind an HBA, but which high-endurance SSDs with reasonably modern IOPS still use SATA?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, my post already mentions the same reasons to pursue the same end goal as that guide.

That wasn't my takeaway.

How do I avoid spending on SSDs that are orders of magnitude oversized?

Available choices: Change your expectations of your design, go with a bigger platform, go back to bare metal, or engage in ill-advised hacking.

Using a hypervisor is generally going to place constraints on your platform design; you say, for example, "then allocate a large number of tiny virtual drives", but that's a really bad idea: do NOT use virtual drives. If you do not use virtual drives and you also insist on using "humongous SSDs", you have painted yourself into a corner. You must either use SATA SSDs (you can attach as many as you like via SAS expanders), which is still problematic because you need to pass the controller through to TrueNAS via PCIe passthru, or you have to look at tricks such as partitioning or multiple NVMe namespaces. That lets you use, for example, a pair of large Optane SSDs cut up into your special, dedup, and SLOG vdevs, but it also precludes the system from helping you out with automatic drive replacement and pretty much guarantees downtime if a device fails and needs to be replaced.
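If you do go down that road anyway, the rough shape of it, purely as a sketch with made-up device names and sizes (and all of it CLI work rather than the GUI), is something like:

# carve each SSD into small partitions (sizes and devices are examples)
sgdisk -n1:0:+16G -t1:BF01 -n2:0:+16G -t2:BF01 -n3:0:+16G -t3:BF01 /dev/nvme0n1
sgdisk -n1:0:+16G -t1:BF01 -n2:0:+16G -t2:BF01 -n3:0:+16G -t3:BF01 /dev/nvme1n1
# mirror the matching partitions into special, dedup, and log vdevs
zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
zpool add tank dedup mirror /dev/nvme0n1p2 /dev/nvme1n1p2
zpool add tank log mirror /dev/nvme0n1p3 /dev/nvme1n1p3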

The root problem here is that TrueNAS is fundamentally designed to work with real drives and with only one ZFS partition per disk. It is an appliance design for iX hardware, and it is not optimized to do these other stupid tricks to make things work out better for hobbyists or tinkerers.

Not to mention there aren't enough PCIe lanes to connect them all.

Well, that's what a PLX switch is for.

I guess I could use SATA SSDs behind an HBA, but which high-endurance SSDs with reasonably modern IOPS still use SATA?

Solidigm D3-S4620 not good enough for you? The problem is still going to be that you need to pass the PCIe controller through to the TrueNAS VM. The use of a virtualization platform makes the I/O requirements very complicated.
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
@jgreco covers it all really.
- Physical - do not virtualize TrueNAS <-- just to back them up, because while they know better than anyone, one more person saying it never hurts, right? :D
- Why do you need dedupe, and why do you feel you need such high IOPS for it when your backend is spinning rust? (Dedupe is also quite CPU-heavy, I believe...)
- What is your network connection for this TrueNAS, 1Gb or 10Gb?

If you are going to build out a TrueNAS system, the idea around here is: do it right, and do it right the first time.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
@jgreco covers it all really.

Thank you! I really do try. My customers and I are heavily virtualized so it's a topic of interest, and I **GET** the frustration, but I just don't see a good way to virtualize this stuff without doing damage somewhere. What's really needed is something akin to SR-IOV/VF style support to allow portions of an NVMe device or HBA to be passed into a VM. Unfortunately this is kinda rough to do, due to the need to set up parameters such as which region of LBAs belongs to each VF NVMe disk, but I can at least picture a path there using NVMe namespaces, where each VF is tied to a namespace. This increases controller complexity, but that's more or less #whocares at this point. I also expect that this could evolve over the next decade, just as network card VFs have gone from quirky little curiosity to mandatory feature (at least for some folks!) over the last decade.
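The namespace half of that already exists today on controllers that support it (nvme-cli below; sizes are in 512-byte blocks and purely illustrative), it's the VF part that's missing:

# how many namespaces does this controller support?
nvme id-ctrl /dev/nvme0 | grep -w nn
# create a ~16GiB namespace and attach it to controller 0
nvme create-ns /dev/nvme0 --nsze=33554432 --ncap=33554432 --flbas=0
nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=0
nvme list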
 

ajp_anton

Dabbler
Joined
Mar 6, 2017
Messages
11
@jgreco
I guess what I don't really get is the purpose of these vdevs, or how they are supposed to be used. I mean I get the speed boost, but if money was no concern, I would just go all-in on SSDs. I thought they would work as a money-saving middle ground between all-SSDs and all-HDDs, but in practice it appears to be difficult to achieve unless you go against all the recommendations here.

Since SSDs aren't small anymore, having a dedicated mirror of huge SSDs for every single one of those tiny vdevs kind of defeats the purpose of the speed/cost tradeoff. This problem would exist even if I didn't virtualize; in fact, virtualizing actually gives me a way around it, even though it introduces other problems. That said, I still need it virtualized, but that's not really relevant to my main question.

The D3-S4620 would be a way to solve the PCIe lanes problem, but it doesn't solve the cost problem. The smallest one, at 480GB, is still needlessly expensive when only 8GB of it is going to be used.

@MrGuvernment
Well, I'm not looking for millions of IOPS, so maybe I worded that badly. I was just looking at appropriately small (and therefore cheaper) drives, and those are all really old (and, I assume, slow and unreliable).

This one pool has dedupe because it halves the amount of space required. My other pools don't benefit, but they still use metadata vdevs. I haven't seen any slowdowns because of dedup, and in any case CPU power shouldn't be an issue on this machine.
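(For what it's worth, this is roughly how I sanity-check that dedup is paying for itself; "tank" is a placeholder for the pool name:)

# overall dedup ratio for the pool
zpool list -o name,size,alloc,dedupratio tank
# dedup table (DDT) histogram, to see how big the table actually is
zpool status -D tank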

My TrueNAS has 10Gb internet access, and LAN access is a mix of 1Gb, 2.5Gb, 10Gb, and Proxmox's internal network.
 

ajp_anton

Dabbler
Joined
Mar 6, 2017
Messages
11
tricks such as partitioning or multiple NVMe namespaces. That lets you use, for example, a pair of large Optane SSDs cut up into your special, dedup, and SLOG vdevs, but it also precludes the system from helping you out with automatic drive replacement and pretty much guarantees downtime if a device fails and needs to be replaced.
This sounds interesting enough to at least have a look at as a "least bad" option. Didn't know you could actually have all those vdevs on the same physical drive. How does this work in terms of data loss protection? Say one of the (mirrored) drives failed and I replaced it - I assume it won't just start rebuilding itself. Would I have to partition the new drive again the same way, assign each partition as a "replacement drive", and only then will each vdev rebuild? That involves some manual work, but are there any other downsides?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Didn't know you could actually have all those vdevs on the same physical drive.

You can. It runs contrary to the appliance design, though.

Say one of the (mirrored) drives failed and I replaced it - I assume it won't just start rebuilding itself.

Correct. You could not even set it up with a spare drive, or you'd risk the system picking it as a replacement for one of the partitions that failed.

Would I have to partition the new drive again the same way, assign each partition as a "replacement drive",

The GUI won't let you assign partitions as a "replacement drive." This all becomes manual CLI work using ZFS and partitioning commands. Automatic spare replacement is not particularly intelligent and is mostly intended for the data drives where failures typically happen; the system would never figure out complicated partitioning and so on. You would be better off doing something like three-way mirroring for dedupe or special vdevs.
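As a rough sketch of what that manual work looks like (device names are made up, and the old-member identifiers are whatever zpool status shows for the failed partitions):

# copy the surviving disk's partition table onto the new disk, then re-randomize GUIDs
sgdisk --replicate=/dev/nvme1n1 /dev/nvme0n1
sgdisk --randomize-guids /dev/nvme1n1
# replace the failed member of each vdev, partition by partition
zpool replace tank <old-special-member> /dev/nvme1n1p1
zpool replace tank <old-dedup-member> /dev/nvme1n1p2
zpool status -v tank   # one resilver per affected vdev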

This is actually the way I've planned to go. Prices on flash have dropped enough that I'm seriously considering another run at the iSCSI filer I tried to build back in the ~2015 era. I'm hoping the bleeding at Samsung will result in further price drops on the 870 Evo 4TB, and I have a trio of 480GB Optane 905Ps I plan to use.

but are there any other downsides?

Yes, you get some other damage, most notably the loss of per-drive statistics (both in the GUI and stuff like gstat).
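You can still pull per-leaf-vdev numbers out of ZFS itself, which softens the blow a bit:

# per-vdev (including per-partition leaf) I/O statistics, refreshed every 5 seconds
zpool iostat -v tank 5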
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
To be quite frank here, I have, for reasons of laziness, absolutely virtualized FreeNAS "wrong" over the years.
Now, I am absolutely not saying that what you have done is "correct", but it will work. Until it doesn't. That's really the issue at hand here. ZFS does such a good job at what it does because it is both the logical volume manager and the filesystem. If you go out of your way to use something else as the logical volume manager and abstract the disks away from ZFS, you had better be damn sure that backing LUN is stable...
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Thank you! I really do try. My customers and I are heavily virtualized so it's a topic of interest, and I **GET** the frustration, but I just don't see a good way to virtualize this stuff without doing damage somewhere. What's really needed is something akin to SR-IOV/VF style support to allow portions of an NVMe device or HBA to be passed into a VM. Unfortunately this is kinda rough to do, due to the need to set up parameters such as which region of LBAs belongs to each VF NVMe disk, but I can at least picture a path there using NVMe namespaces, where each VF is tied to a namespace. This increases controller complexity, but that's more or less #whocares at this point. I also expect that this could evolve over the next decade, just as network card VFs have gone from quirky little curiosity to mandatory feature (at least for some folks!) over the last decade.

Put it this way: every time I read through TrueNAS threads and come across one of your postings, it makes me want to go back and check my entire TrueNAS setup (as simple as it is) to be sure I didn't miss something, or whether I could have done something better, or how I can make mine bigger and better, even if I don't need the space :D
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
You can always do these things. ;-) ;-)

Is MrsGuvernment objecting to the unnecessary budgetary strain? Heh!

:D Luckily for me, no :D She benefits from all my toys and storage and systems, so she can't really complain, especially given how often her mother loses her phone and the backups I keep of their stuff.
 