How safe is a PCIe to NVMe adapter card for fusion pool SSDs?

veritas

Dabbler
Joined
Mar 12, 2022
Messages
10
In planning my first TrueNAS setup I've spent countless hours reading on this forum, so first of all, my heartfelt thanks to everyone who has contributed to this invaluable resource.

I have come to a point where I can't find an answer as there doesn't seem to be a ton of information available on fusion pools yet.

Basically, I intend to use a special vdev/fusion pool to speed up both reads and writes to my pool and to provide quick access to the many small (<1KB) files that make up my knowledge management systems. Most of these files aren't accessed frequently or recently, so they wouldn't be helped by ARC/L2ARC.
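If I've understood the docs correctly, the knob that routes those small files onto the special vdev is the special_small_blocks dataset property; something like this (the dataset name and threshold here are just examples, not final values):

Code:
# Blocks at or below the threshold on this dataset get written to the special vdev
zfs set special_small_blocks=4K tank/notes
zfs get special_small_blocks tank/notes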

I have a 24-bay chassis but would like to use all the bays for HDDs to maximize storage. My H12SSL-C motherboard has an onboard HBA, which is used for the backplane, and only two other SATA ports, which I plan to use for my boot drives. The two M.2 slots will be taken by two mirrored M.2 SSDs for my VM storage (a separate pool, managed by Proxmox), so any additional drives need to go over PCIe. I found this card, which looks perfect as it can take four M.2 SSDs (the H12SSL-C supports bifurcation): https://www.delock.de/produkt/89017/technische_details.html?setLanguage=en This would allow me to do a three-way mirror with Seagate IronWolf 525s (ouch, my wallet), plus a spare slot should I, for instance, find that an L2ARC would be useful.

My concern with this plan is that I'm introducing a new single point of failure: the PCIe to NVMe adapter card. In the unlikely event that it were to fail, could I simply replace the card, boot TrueNAS back up and have everything back? Or am I jeopardizing my metadata, and thus my entire pool, by running it on drives connected in this way?

I am building a slightly OTT backup strategy, so I'm not so much worried about losing data as I am concerned about downtime, since this server will be running everything for my small business.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, just to frame your question in a different light, please explain to us how the HBA is not a single point of failure.

The 4xM.2 carrier card you link to looks like the commonly available Shenzhen special, and yes, these work fine as long as your board properly supports bifurcation. There is literally nothing to these cards, which is why they're only $25-$40.

I am a little bit concerned by your mention of Proxmox, though. If you have some delusion of trying to virtualize TrueNAS under Proxmox and then feed all your stuff into TrueNAS via PCIe passthru, please note that Proxmox's PCIe passthru is considered experimental by the Proxmox folks and is not necessarily going to be able to reliably handle a single device, much less a slew of them, over the long term. This is potentially hazardous to your pool and your uptime.
 

veritas

Dabbler
Joined
Mar 12, 2022
Messages
10
True, the HBA is also a single point of failure, but I'm thinking that it's all Supermicro hardware which is definitely compatible and has been successfully used by lots of people. I'm more nervous about entrusting the Shenzhen special with my metadata, and I wonder if I'd be unduly jeopardizing my pool. It's paradoxically reassuring to hear that they are dumb devices and are therefore more likely to be easily replaceable than not.

As for the virtualization, I am studying your Guide to not completely losing your data and also trying to understand all the caveats at Yes, You Can Virtualize FreeNAS. One of the reasons I'm thinking of adding a special vdev is this sentence in the latter:

Using a single disk leaves you vulnerable to pool metadata corruption which could cause the loss of the pool. To avoid this, you need a minimum of three vdevs, either striped or in a RAIDZ configuration. Since ZFS pool metadata is mirrored between three vdevs if they are available, using a minimum of three vdevs to build your pool is safer than a single vdev. Ideally vdevs that have their own redundancy are preferred.

At least initially I won't have three vdevs in one pool for cost reasons, so I was hoping I could mitigate this issue by moving the metadata to a three-way mirror special vdev, whilst reaping the other benefits of the special vdev at the same time. But to be honest I think I might be misunderstanding the quote above, because if I were so unlucky as to lose a RAIDZ2 or RAIDZ3 vdev in my pool, I'm thinking I'd need to rebuild it anyways, so perhaps having my metadata sitting inside a normal vdev is not an issue for me.

Just in case Proxmox with TrueNAS Core does die on me, data will be replicated daily to a dedicated backup server with TrueNAS Scale. So while it would be an absolute pain to have to rebuild, it wouldn't be the end of the world and it's a risk I'm willing to take to be able to use Proxmox and TrueNAS Core on the same server.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Since ZFS pool metadata is mirrored between three vdevs if they are available, using a minimum of three vdevs to build your pool is safer than a single vdev. Ideally vdevs that have their own redundancy are preferred.

This always struck me as a bunch of CYA blather.

What it's really saying is that you shouldn't have single disk vdevs because it is possible for metadata corruption to take out your pool. You can mitigate that by having three single disk vdevs. But this is an idiot's configuration. It's maybe fine if you're doing development work and you don't care about your data, but in that case, you DON'T CARE ABOUT YOUR DATA and so why do you give a crap about the metadata? It's really weird.

It is sort of important to understand that that blog post was written partly in annoyance at, and in response to, my resources here on the forums, and some of it seems like it was an attempt to show that there was stuff I hadn't touched on. Well, I don't touch on every aspect of every possibility, because I like to focus on what is safe and good to do. I think it just adds confusion to talk about these edge cases; here you are asking about one of them, point proven. :smile:

So here's the deal.

Probably 95-98% of the ZFS pools out there are single-vdev pools, hobbyists or SOHO users with a single RAIDZ2. This is in no way harmful, dangerous, risky, etc., to your metadata. ZFS maintains several copies of critical metadata, and while they will all end up on the same vdev in a single-vdev pool, they are all well-protected against corruption.

If you add more data vdevs, then ZFS will spread it out. That's good but not critical, unless you have no redundancy, which is the blog post's point. But if you don't have redundancy, you have bigger issues anyways.

Now, you need to be aware that the blog post was written in, I dunno, was it 2015, before ZFS allocation classes were a thing. What's being said there is not relevant to your issue.

What YOU need to consider is this:

Moving your metadata out of the HDD pool and onto a special allocation vdev is risky. Your pool will die if the special allocation vdev becomes unrecoverable. Just like any pool, EVERY component vdev MUST be available for ZFS to work, whether standard pool data vdevs or special allocation vdevs.

So what YOU need is to make certain that your special allocation vdev is redundant. And if it's important to you, that probably means three-way mirrors. Smart (paranoid) money would be to get a quad M.2 card and fill it with four SSDs, three mirrored and a spare.

So then you can pick your redundancy level for your 24 bays. I'm paranoid, so I suggest 11-disk RAIDZ3 with a spare, and if you do that, then you have 24 drives making two data vdevs of incredibly reliable storage, and then a special allocation vdev.
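To make that concrete, the layout would be created roughly like this (a sketch only; the device names are placeholders and will differ on your system):

Code:
# Two 11-wide RAIDZ3 data vdevs, two HDD hot spares, and a three-way
# mirrored special vdev for metadata (all device names are placeholders)
zpool create tank \
  raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 \
  raidz3 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
  special mirror nvd0 nvd1 nvd2 \
  spare da22 da23

The fourth M.2 stays free as a cold spare, or could be added later as L2ARC with "zpool add tank cache <device>" if that turns out to be useful.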

The blog post blather about number of vdevs doesn't end up being relevant here. Most/all of your metadata ends up on the special allocation vdev. Your pool is toast if that vdev is lost. Protect it.
 

veritas

Dabbler
Joined
Mar 12, 2022
Messages
10
Very clear, thank you! :smile:

I was definitely confused by this triple-vdev thing (even more so because it's repeated in the hardware guide) and thought it was somehow particularly related to virtualization. Seeing as you've confirmed my suspicions that three vdevs aren't really a requirement, I'll run with the metadata on my regular vdev and avoid the risk of the special vdev for now. I'll build the rest of the system, see how it performs and then determine if my workload would benefit from it. I understand that if I add it later I won't see the same benefits, as it won't migrate pre-existing metadata immediately, but I'm okay with that if it means I can hold off from dropping another $500 for the card plus 3-4 SSDs right now.

I also lean on the paranoid side, so I do plan to go with RAIDZ-3, and 11 or 12 drives, potentially making use of the new vdev expansion feature when it hopefully arrives.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
A couple of points (rough command-line examples below):
1. If you want to store metadata on faster disks, then an L2ARC configured for metadata only might be an option. It is not pool-critical.
2. If you add a special vdev later on (maybe with a Metadata Small Block Size > 0), then you can always, dataset by dataset, ZFS Send | ZFS Recv to a new dataset, destroy the old dataset and rename the new temporary dataset. This will populate the metadata and small files onto the special vdev. I have done this many times whilst tuning the small block size and never had to reconfigure NFS or SMB shares.
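Roughly what those two options look like from the command line; the pool, dataset and device names are placeholders, and the 32K threshold is just an example you would tune:

Code:
# 1. Metadata-only L2ARC: add a cache device, then cache only metadata in it
zpool add tank cache nvd0
zfs set secondarycache=metadata tank

# 2. Rewrite a dataset so its existing metadata/small blocks land on the special vdev
zfs set special_small_blocks=32K tank/mydata
zfs snapshot -r tank/mydata@migrate
zfs send -R tank/mydata@migrate | zfs recv tank/mydata_new
zfs destroy -r tank/mydata
zfs rename tank/mydata_new tank/mydata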
 