Planning a 2-disk-failure-safe, high speed, VM capable, NAS storage

xrm

Cadet
Joined
Dec 7, 2023
Messages
5
Hi *,

sorry for posting yet another "how do I build my system" post; like a lot of people here, I am rather unsure how to come up with the best solution for my requirements, given that I'm rather new to the TrueNAS world.

I plan to build a new storage system for a larger K-12-like school and have a couple of requirements:
Reliability. I have functional backups up and running, but I have experienced in the past how resilvering after a single bad disk brought down three more disks of a RAID5 within a month due to the increased load. As such, I really would like to be able to survive a failure of two disks.
Speed. I have about 20 VMs and some of them are IO intensive. Ideally, I would like to see peak data throughput in the mid to upper three-digit MB/s range. 600 to 800 MB/s would be a good start; something in the range of 1.5 GB/s would be ideal and give me some headroom. (More would be better of course, especially since we might be doing HD video editing in a few months, but I could handle that with local disks if it's not possible.)
VM capable. According to several sources, this should be done with multiple mirror vdevs. That alone would be acceptable, but together with my "reliability" wish, it would mean that I have to mirror 3 disks. That's ... rather a lot of overhead.
Expandable. Ideally, I don't want individual pools for my VM disks, since I do not yet know how their space requirements will change over the next few years and I don't like tinkering with my storage setup all the time.
Price. Money is a concern - but of course not as much as in a (typical) home lab. I can likely throw an "upper end" four-digit sum at this if the result is worth it, and I'm doing this pretty much as an honorary job, so my work time comes for free.

I currently have a test setup consisting of an HP ProLiant DL360 Gen9 with 96 GB RAM, an HPE 562FLR-SFP+ dual-port 10G SFP+ network card and a Broadcom 9300-8e that connects to an HPE D3710 with 25x 1.8 TB 12G SAS 10k HDDs. Two of these disks are used as spares, two for DEDUP. Everything else is configured in a single RAIDZ3. I also threw in two mirrored SSDs (800 GB) as LOG devices and disabled sync (not sure if this is a good idea - my understanding was that a SLOG allows disabling sync while mitigating the risk of data loss ... ?).

This peaks slightly below 300 MB/s when writing data locally continuously:
Code:
$ dd if=/dev/random of=./testfile bs=10G count=5 oflag=direct
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+5 records in
0+5 records out
10737397760 bytes (11 GB, 10 GiB) copied, 37.5479 s, 286 MB/s


Writing VM data through a TrueNAS SCALE NFS share drops to somewhere between 50 and 100 MB/s. Granted, I do not yet have a dedicated Ethernet infrastructure (and therefore no jumbo frames enabled yet), but my target and source systems are connected through a single 10G switch which is not at its limits yet. I do have LACP enabled and two paths, but of course that does not help during simple throughput tests.

So now I am a bit unsure how to improve these speeds.

My current "best-effort" plan would be to throw in another HPE D3710 and fill it over time with something like Samsung PM1643A 1.9 TB SSDs (I'd like to keep resilvering times ... reasonable). Configure the disks as mirrored pairs, no SLOG, no DEDUP, and sync set to default. Increase the RAM to 192 GB or so and get a dedicated Ethernet channel with jumbo frames. Add vdevs as disks come in, one big pool (or possibly multiple pools, with some vdevs on hot standby so I can plug them in when the remaining space drops below 50 %).

Pro: hopefully fast; reasonably priced; redundant (as long as two drives in the same mirror don't fail); expandable
Con: failure of two drives in the same mirror wrecks everything; still losing 50 % of my capacity
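
To make the plan above concrete, this is roughly what that grow-as-you-go layout would look like at the command line (pool and device names are made up; in practice I'd do this through the TrueNAS UI):
Code:
# hypothetical pool/device names, just to illustrate the layout
zpool create tank mirror da0 da1
# later, as more SSDs arrive, extend the pool with further mirrored pairs
zpool add tank mirror da2 da3
zpool add tank mirror da4 da5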

In the "old days" I probably would have cobbled a bunch of SAS datacenter HDDs together with a hardware RAID6 controller, dedicated two spares, installed Debian and would have been happy with my speeds. But with more demanding loads nowadays, I really would like to get some second opinion(s) on whether there's still room for improvement.

As such, any input to this would be required.

Thank you very much,
Sebastian
 

xrm

Cadet
Joined
Dec 7, 2023
Messages
5
Err. Can't edit my own post yet, but I meant "any input to this would be welcome". But required fits too, I guess ... :smile:

(Also, disregard what I wrote about a SLOG making disabled sync fine. I tried to find the source for that and, while searching for it, found an article that explains rather clearly that this is not true. All NFS speeds x 0.5 then, I guess.)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
dd if=/dev/random
That's a crap test... you will hit a CPU bottleneck producing random data before you push your pool to its max.

Use a dataset in the pool with no compression set, then use /dev/zero or learn about using fio to do proper tests.
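
Something along these lines would do as a starting point (pool/dataset names are just examples):
Code:
# create a test dataset with compression turned off
zfs create -o compression=off tank/benchmark
# sequential write test with fio
fio --name=seqwrite --directory=/mnt/tank/benchmark --rw=write --bs=1M --size=10G --numjobs=4 --ioengine=posixaio --end_fsync=1 --group_reporting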
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If you're going to allow sync writes (i.e. not setting it to disabled) and your workload is block storage over NFS, you're certainly right to be thinking about mirrors (perhaps consider adding spares to the pool to comfort yourself a bit about the single disk failure point).

Without a SLOG and with sync writes happening, you will see relatively poor performance on bursty writes; a SLOG will help with those bursts, but nothing will change about the sustained throughput (the SLOG will quickly exhaust itself and you're back to pool speed anyway).

Using sync=disabled will get you an idea of the maximum possible performance (which SLOG will only allow you to approximate for some of the time).
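
For example (dataset name is just a placeholder; testing only, so set it back afterwards):
Code:
zfs set sync=disabled tank/vmstore
# ...run your test over NFS, then:
zfs set sync=standard tank/vmstore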

 

xrm

Cadet
Joined
Dec 7, 2023
Messages
5
Okay. I wasn't aware that compression=off actually disables _all_ underlying compression of the written data. Doing as you recommended yields a speed of about 470 MB/s.

Spares would only help me in a mirror setup if there's enough time to resilver; if both disks are from the same batch, I'd still risk losing both disks at the same time, wouldn't I?

What do you mean by "SLOG will quickly exhaust itself"? I would have expected to be able to write at higher speed for at least as much data as my SLOG can hold?

Thanks!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
No matter how "big" the disk(s) you use for SLOG, it can only really hold at most 3 transaction groups (one of those being the current one, so incomplete).

Some folks around here have done the math on it and that equates to around 30GB (about as much as you can reasonably expect to capture in 10 seconds).

SLOG isn't a write cache.
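
Roughly, as a back-of-the-envelope sketch (assuming a 10GbE link and the default 5 second transaction group timeout; exact numbers depend on your link speed and tunables):
Code:
# max ingest rate : 10 Gb/s / 8        ~= 1.25 GB/s
# data per txg    : 1.25 GB/s * 5 s    ~= 6.25 GB
# ~3 txgs held    : 3 * 6.25 GB        ~= 19 GB
# => a few tens of GB (same ballpark as the ~30GB above) is all a SLOG will ever usefully hold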

Risk of losing the second half of a mirror during resilver is not as high as for RAIDZ, due to the lower load of a straight transfer, but indeed not zero... 3-way mirrors are the only way to really address that risk, but using spares is a "reasonable" tradeoff (taking cost into account... if losing data wouldn't be a costly exercise, then it doesn't need consideration).
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I agree with sretalla, 3-way mirrors look like the best way to go here... they could become 2-way mirrors with hot spares depending on how lucky you feel, though that's not recommended given the scale of this project.
This would solve:
  1. Your reliability requirement;
  2. Your performance and VM-use requirements;
  3. Your expansion requirement.
Of course, the price will likely be steep since you are getting only 1/3 of storage efficiency (which in reality means 1/6, since the sweet spot for block storage performance is 50% pool usage; make sure to read the resource posted by sretalla). You haven't given us any real numbers, but I guesstimate you will need more than a dozen TB of storage given the size of the project.
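As a rough sizing illustration (example numbers only):
Code:
# usable : raw = 1/3 (3-way mirror) * 1/2 (keep block storage below ~50% full) = 1/6
# e.g. 10 TB of VM data  ->  ~60 TB of raw disk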

Regarding dedup... you really want to avoid using it until fast dedup becomes available, since it would really exacerbate the hardware requirements.
If you must use it, however, there are a few things you could do... like a dedicated DDT VDEV (a 3-way mirror as well, in order to match the resiliency of the data VDEVs) made of NVMe drives (due to the low latency required)... going with L2ARC could be an option as well, but it poses different challenges; either way, you would need a lot of RAM.
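If you really had to, a dedicated DDT VDEV would be added along these lines (pool and device names are just placeholders):
Code:
# dedicated dedup (DDT) vdev, mirrored to match the resiliency of the data vdevs
zpool add tank dedup mirror nvme0n1 nvme1n1 nvme2n1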

Additionally, you might be tempted to virtualize TN itself in order to have everything in a single machine... please don't, but if you must, read the following resource first.

Finally, for the boot pool consider this reading.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
If IOPS for VMs and high resiliency to disk failure are both indispensable, how about a VM storage pool built from mirrored pairs of SSDs and a large but slower secondary RAIDZ2 pool built from HDDs? Local snapshots and replication from the first to the second.
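At the command line that would be something along these lines (pool, dataset and snapshot names are purely illustrative; in TrueNAS you would normally set this up as periodic snapshot and replication tasks in the UI):
Code:
# snapshot the fast SSD pool ...
zfs snapshot -r ssdpool/vms@auto-2023-12-07
# ... and send the increment since the previous snapshot to the slower HDD pool
zfs send -R -i @auto-2023-12-06 ssdpool/vms@auto-2023-12-07 | zfs receive -F hddpool/vms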
 

xrm

Cadet
Joined
Dec 7, 2023
Messages
5
Davvo, sretalla: Okay, so my current speed tests are even less meaningful than I assumed. That's ... good, I guess. As I wrote, I plan to go without SLOG and DEDUP and the way I understand your replies, that should be fine ... ?

Regarding 3-way mirrors and storage efficiency: okay, I kind of expected that. I had hoped that there'd be some better way unknown to me, but I guess I'll have to think hard about whether I reaaaaally need the safety of a third disk.

Patrick: well, my current concern is mostly the primary storage. I had planned to use the existing D3710 enclosure with the HDDs as a (slow) backup pool for it, so I still have some kind of backup - I'm just not eager to ever have to recover my complete IT from backup. :smile: (But then again, if I used multiple pools with vdevs added to them as they grow, I would - at least - only have to recover a few VMs instead of all of them ... )

Also, is there anything on the hardware side that I should change? I don't expect this to go live as soon as the parts arrive and can do quite a bit of testing, but buying something that is an obviously bad choice would be ... annoying.

Thank you all for your valuable input! :smile:
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Some thoughts here, bold for questions, italics for emphasis where necessary.

As I wrote, I plan to go without SLOG and DEDUP and the way I understand your replies, that should be fine ... ?
Definitely plan to go without DEDUP, unless you have full confidence that your workload benefits strongly from it.

However, since you're speaking of VMs, accessed from an outside hypervisor over NFS, you're very likely going to want an SLOG device even if you do opt for the PM1643A SSDs. Are those desired network speeds reads, writes, or a mix?

My current "best-effort" plan would be to throw in another HPE D3710 and fill it over time with something like Samsung PM1643A 1.9 TB SSDs (I'd like to keep resilvering times ... reasonable). Configure the disks as mirrored pairs, no SLOG, no DEDUP, and sync set to default. Increase the RAM to 192 GB or so and get a dedicated Ethernet channel with jumbo frames. Add vdevs as disks come in, one big pool (or possibly multiple pools, with some vdevs on hot standby so I can plug them in when the remaining space drops below 50 %).
Note that in a "grow as you go" setup there's no "automatic rebalance" in ZFS - if you create a VM when you have only two drives (one vdev) and then later grow to eight drives (four vdevs), that particular VM will still reside only on the first two, and potentially be limited in its read speeds. As the VM data is updated with new writes, it will be distributed among all devices - but reads won't change where the data lives.
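
If you want to see how existing data is spread, something like this shows per-vdev allocation and activity (pool name is an example):
Code:
zpool list -v tank
zpool iostat -v tank 5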

Two of these [1.8 TB 12G SAS 10k HDDs] are used as spares, two for DEDUP. Everything else is configured in a single RAIDZ3.
HDDs don't make good special vdevs of any type (META or DEDUP) - but you won't be able to remove them from this pool because of the use of RAIDZ3. This also hurts your pool redundancy: while you can lose up to three data disks (RAIDZ3), you can only lose one of the two disks assigned for DEDUP purposes before the pool goes offline.

I also threw in two mirrored SSDs (800 GB) as LOG devices and disabled sync (not sure if this is a good idea - my understanding was that a SLOG allows disabling sync while mitigating the risk of data loss ... ?).
If sync writes are disabled, your SLOG SSDs are doing absolutely nothing. The purpose of a SLOG device (ideally, a fast one) is to allow you to enable/enforce sync writes while providing a fast place to log those writes, so that you can be both "safe" and "fast" at the same time. Not every SSD makes a good SLOG device either, but if I had to guess, the combination of 800G (not a power-of-two size) and potentially being SAS gives them a better chance than a power-of-two-sized SATA drive would have.
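
For what it's worth, adding a mirrored SLOG to an existing pool later is a one-liner (pool and device names are placeholders):
Code:
zpool add tank log mirror da24 da25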

Question: Are you doing anything other than VM workloads on this, such as general file storage?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Adding to HoneyBadger's lines about SLOG:
 

xrm

Cadet
Joined
Dec 7, 2023
Messages
5
Davvo:
thanks for the article. While I actually read it before posting this thread, I totally failed to fully grasp its contents, as I am only slowly starting to understand what information I actually have to look for and what is not relevant to me. I read it again and things are starting to click into place now, thanks.

Also, since I kept forgetting to answer your earlier guesses: I'm currently planning for about 20 TB of data, so ~40 TB of disk storage, with the option to grow as required. There are a few uncertain projects in the pipeline, though, which might require growing sooner than currently planned.


HoneyBadger:
The speeds I wrote about are mostly writes, assuming that reads are going to be limited by my network bandwidth anyway. But I'm aiming for storage that can keep up with most of my requirements for the next few years.

Good point about the rebalancing, I hadn't considered that! So I probably would have to do a manual rebalance, i.e. copy some of the files to another location and then back again?

Regarding the DEDUP: I didn't put much thought into it when setting up my test system, sadly. I was mostly playing around, and since I only use the test box for backups at the moment, I thought it'd be useful, not fully aware that it becomes mission critical and cannot be removed afterwards. Well, lesson learned, luckily on the test system. :smile:

Interesting point about the SLOG. I thought I could skip a SLOG with SSDs. But yes, Davvo's linked article even mentions it clearly:
the pool is often busy and may be limited by seek speeds, so a ZIL stored in the pool is typically pretty slow.
So, since sretalla noted that there's hardly any need for a big SLOG, and you wrote that speed is important - should I go with two/three PLP NVMe drives then, or would that be (useless) overkill? Also, the article mentions that I could use a battery-backed RAID card as an "unorthodox" solution. I have a couple of them lying around, so that'd be an easy option - but is it actually something that you would go for? To me this sounds a bit like "yet another thing that could bring your setup down", and I feel like going for two proper disks would be a better solution ... ?

Regarding the VM/file-storage question: I might have some direct file storage on this as well - but I haven't decided yet whether I want to encapsulate that data within a VM or rather write directly to an NFS share. I could do both and tend towards the latter for a couple of reasons (mostly speed, let's be honest), but moving those files to another storage at some point would be way slower than moving one big VM file, so ... Either way, VMs are my main focus for the moment since I assume(d) them to be the hardest part. Please correct me if my assumptions are garbage, though.

Again, thank you all for your valuable input!
 