Pool layout for ESXi and surveillance storage

Which Layout?

  • 1
  • 3
  • Other

MrBucket101

Dabbler
Joined
Jul 9, 2018
Messages
18
Here is my hardware.
  • 10Gb Networking
  • 2x E5-2620 v1
  • 128GB RAM
  • LSI 9211-8i IT Mode
  • 18x 4TB HGST NAS
  • 2x 1TB 850 Pro
  • 2x Intel DC P3700 400GB
The goal for this system is to be an ESXi datastore (NFS or iSCSI), as well as to store my UniFi Video footage (12 1080p cameras). 8TB should be plenty for my VMs.

My main concern is performance; capacity would be nice, but it's not a priority. I want to have a pleasant, snappy experience viewing old footage. I have another (slower) system with better redundancy where I will be storing files long term.

I've come up with 3 possible layouts, and was hoping to get some feedback.

Layout 1 - 2 pools
Pool 1 (VMs only) - 4x4TB drives in RAID10. 1TB L2ARC, DC P3700 ZIL/SLOG - data quota of 80%
Pool 2 (surveillance) - undecided. Either all mirrors, or maybe buy another drive for 5x 3-drive RAIDZ1.

The primary benefit of this layout is that with a large ARC and a 1TB L2ARC, plus an 80% data quota, I can effectively cache ~18% of the entire pool. This should give me great performance for the VMs. Would this outweigh the benefit of the additional mirrors?
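Rough math behind that ~18% (a back-of-the-envelope sketch, assuming roughly 100GB of the 128GB RAM ends up usable as ARC):

Code:
# 4x 4TB in two mirrored vdevs = ~8TB, with an 80% quota = ~6.4TB of data;
# ~100GB of ARC + ~1000GB of L2ARC cached against that:
echo $(( (100 + 1000) * 100 / (8000 * 80 / 100) ))   # prints 17 -> roughly the ~18% above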

Layout 2 - 1 pool
Use all 18 drives as 9 mirrored vdevs, 1TB L2ARC, P3700 ZIL/SLOG.
I'm just not sure if the additional mirror vdevs in this setup will outweigh the benefits of the increased caching ability in Layout 1.

Layout 3 - 1 pool
Use all 18 drives as 6x 3-drive RAIDZ1 vdevs, 1TB L2ARC, P3700 ZIL/SLOG.
This has the highest capacity; I was wondering if this would be a nice middle ground between performance and capacity.
 
Joined
Oct 18, 2018
Messages
969
I think generally folks opt to go with mirrored vdevs for iSCSI or VM storage for performance, especially if you're using spinning disks to back the pool.

4x4TB drives in RAID10
Can you clarify what you mean by this? ZFS supports single-drive vdevs, mirrored drive vdevs, and RAIDZ1|2|3 vdevs offering 1, 2, and 3 drive failure tolerance per vdev respectively. It seems you likely mean you'll use 4 disks, two in each mirrored vdev?

1TB L2ARC
I'm not an expert on L2ARC. There are some who suggest that the size of the L2ARC should not exceed 5x the amount of RAM. The reason is that you need to store the L2ARC's index in RAM, so the larger the L2ARC, the more primary RAM it eats up. Here is a thread discussing the topic. I'm sure that, like a lot of things in FreeNAS, the size of the L2ARC is up for some debate.
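Just to put a rough number on that guideline for your build (purely the rule of thumb above, not a hard limit):

Code:
# "L2ARC no more than ~5x RAM" rule of thumb with 128GB of RAM:
echo $(( 128 * 5 ))   # prints 640 (GB) - a full 1TB device would exceed that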

DC P3700 ZIL/SLOG
Worth noting that some folks opt for mirrored SLOG devices. There is some debate here. You'll find many sources elsewhere stating that a mirrored SLOG is possibly not necessary but searching around these forums you'll find that mirrored devices seem to be a bit more popular. Worth looking into more if you haven't already done so.

Anyway, perhaps you've already considered the above, in which case ignore me; if not, hopefully this helps narrow down the hardware choices a bit.
 

MrBucket101

Dabbler
Joined
Jul 9, 2018
Messages
18
Can you clarify what you mean by this? ZFS supports single-drive vdevs, mirrored drive vdevs, and RAIDZ1|2|3 vdevs offering 1, 2, and 3 drive failure tolerance per vdev respectively. It seems you likely mean you'll use 4 disks, two in each mirrored vdev?
Sorry, my old HW RAID speak came out :P

I meant 4 drives, 2 mirrored vdevs striped together

I'm not an expert on L2ARC. There are some who suggest that the size of the L2ARC should not exceed 5x the amount of RAM. The reason is that you need to store the L2ARC's index in RAM, so the larger the L2ARC, the more primary RAM it eats up. Here is a thread discussing the topic. I'm sure that, like a lot of things in FreeNAS, the size of the L2ARC is up for some debate.
Hmmmm. I did not know this. Do you think this is still a factor with 128GB of RAM? I already own the 1TB 850 PROs, so I was planning to repurpose them.

Worth noting that some folks opt for mirrored SLOG devices. There is some debate here. You'll find many sources elsewhere stating that a mirrored SLOG is possibly not necessary but searching around these forums you'll find that mirrored devices seem to be a bit more popular. Worth looking into more if you haven't already done so.
I believe what I read a while back said that if your SLOG had power-loss protection, then the mirror wasn't necessary? I have the second disk, so a mirror is possible.
 
Joined
Oct 18, 2018
Messages
969
Do you think this is still a factor with 128GB of RAM? I already own the 1TB 850 PROs, so I was planning to repurpose them.
No matter how much RAM you have, if your L2ARC is too big you'll eventually see performance penalties. Imagine if you tried to use the largest SSD you could possibly buy as an L2ARC. I'm not 100% sure what you'll experience with 1TB against 128GB of RAM. Maybe someone with more experience there will chime in. You may also have luck searching around the forums for similar builds.

I believe what I read a while back said that if your SLOG had power-loss protection, then the mirror wasn't necessary? I have the second disk, so a mirror is possible.
Hmm, interesting thought. I wouldn't have considered mirrored SLOG devices unnecessary just because you've got PLP. PLP protects you from an unexpected power loss causing data not to get written to the SLOG, leaving it unavailable when the system reboots and preventing those missing transactions from being committed to your pool. Even with a UPS you'll want to go with a device with PLP if you use a SLOG. Mirrored devices, on the other hand, further protect you from a situation where a single device fails. How risky and important that is to you depends a lot on your data and your situation. If your incoming data is irreplaceable, or performance degradation while you wait to replace a dead device isn't tolerable, you may opt for a mirrored setup.

Hopefully this helps some. I'm not trying to suggest you have to do your build differently or go with mirrored devices etc; just offering further areas to explore while you put together your build list to find the performance and data protection that fits your needs, budget, and risk tolerance.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Given your desire for performance this suggests a mirror option; despite the disparity between the two workloads I'm tempted to suggest Option #2 with a few minor tweaks.

Option #2 gives you the greatest number of vdevs for your pool, and since most back-end pool operations end up choking on IOPS rather than raw bandwidth, this layout generates the most.

With the new L2ARC headers and compressed ARC, the impact of assigning huge SSDs to this vdev type is less brutal on the RAM; however, 1TB of 8K records will consume roughly 4.5GB of RAM to index (at an estimated 40 bytes/record). I'd wager that the benefit of the cache will outweigh the loss of that RAM, but bear in mind that it will take a very long time for that much cache to warm up, and it isn't persistent; a reboot of the server will cause it to flush entirely. You also don't need to mirror L2ARC devices, so you could theoretically have 2TB there.
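If you want to sanity-check that estimate, a rough sketch - the 8K record size and ~40 bytes/header above are the assumptions, and I believe the live header footprint shows up under the arcstats sysctl below:

Code:
# Estimated header RAM: L2ARC bytes / record size * bytes per header
echo $(( 1000000000000 / 8192 * 40 / 1024 / 1024 ))   # prints 4656 (MB), i.e. roughly 4.5GB
# Actual L2ARC header footprint reported by the running system:
sysctl kstat.zfs.misc.arcstats.l2_hdr_size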

Set the surveillance dataset to secondarycache=metadata in order to avoid ZFS trying to push chunks of video there, although the record size of the video files will probably prevent that. You can also use recordsize=1M on the same dataset.
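From the CLI that would look something like the following (dataset names are placeholders; the same properties can also be set per-dataset in the FreeNAS GUI):

Code:
# Cache only metadata from the surveillance dataset in L2ARC,
# and use 1M records for the large sequential video files:
zfs set secondarycache=metadata tank/surveillance
zfs set recordsize=1M tank/surveillance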

For the SLOG devices, the power-loss protection makes them viable devices; mirroring them provides an extra degree of protection from the rare edge case of losing an SLOG device at the same time as an unexpected power loss (e.g. the very moment you need it most) - if you have the two identical and suitable devices already, it'd be a shame not to use them. The VM dataset is also the only one that really needs the sync=always setting. If you're using NFS for VM storage, set your recordsize lower than the default 128K; most users settle on 16K or 32K as a balance that gives good random I/O performance without choking the sequential I/O too hard for bulk operations. Leave compression at the default LZ4 (and don't even think about deduplication).
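A sketch of the equivalent commands for the VM dataset and the mirrored SLOG (pool/dataset names are placeholders, and nvd0/nvd1 stand in for however the two P3700s enumerate on your system):

Code:
# Force sync writes (through the SLOG), smaller records for NFS VM storage,
# default LZ4 compression, and no dedup:
zfs set sync=always tank/vms
zfs set recordsize=16K tank/vms
zfs set compression=lz4 tank/vms
zfs set dedup=off tank/vms
# Add the two P3700s as a mirrored SLOG:
zpool add tank log mirror nvd0 nvd1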
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Further to add: I would start with just the single 1TB L2ARC device and see how quickly/fully it can populate. You'll want to tune the parameters below to allow your device to fill faster than the default (8MB/s) - but ensure you leave enough spare bandwidth so that reads aren't affected.

vfs.zfs.l2arc_write_max - this is the maximum write speed into L2ARC under normal operation, in bytes. Adjust this upwards while monitoring read latency on the device; lower it if read latency is being impacted.
vfs.zfs.l2arc_write_boost - after system boot, but before ARC is filled, L2ARC is allowed to fill at a faster rate since there will be minimal, if any, reads from the devices.
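For reference, both can be tested live from the shell (or added as sysctl tunables in the GUI so they persist); the values below are only example starting points, in bytes:

Code:
# Raise the steady-state L2ARC fill rate from the 8MB/s default to 64MB/s:
sysctl vfs.zfs.l2arc_write_max=67108864
# Allow a faster fill (here 128MB/s) after boot, before ARC has warmed up:
sysctl vfs.zfs.l2arc_write_boost=134217728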
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
8TB should be plenty for my VMs

Most of my VMs have either 32GB or 64GB vDisks. Why are you allocating 8TB for your datastore?
 

MrBucket101

Dabbler
Joined
Jul 9, 2018
Messages
18
Given your desire for performance this suggests a mirror option; despite the disparity between the two workloads I'm tempted to suggest Option #2 with a few minor tweaks.


Option #2 gives you the greatest number of vdevs for your pool, and since most back-end pool operations end up choking on IOPS rather than raw bandwidth, this layout generates the most.


With the new L2ARC headers and compressed ARC, the impact of assigning huge SSDs to this vdev type is less brutal on the RAM; however, 1TB of 8K records will consume roughly 4.5GB of RAM to index (at an estimated 40 bytes/record). I'd wager that the benefit of the cache will outweigh the loss of that RAM, but bear in mind that it will take a very long time for that much cache to warm up, and it isn't persistent; a reboot of the server will cause it to flush entirely. You also don't need to mirror L2ARC devices, so you could theoretically have 2TB there.

Set the surveillance dataset to secondarycache=metadata in order to avoid ZFS trying to push chunks of video there, although the record size of the video files will probably prevent that. You can also use recordsize=1M on the same dataset.

For the SLOG devices, the power-loss protection makes them viable devices; mirroring them provides an extra degree of protection from the rare edge case of losing an SLOG device at the same time as an unexpected power loss (e.g. the very moment you need it most) - if you have the two identical and suitable devices already, it'd be a shame not to use them. The VM dataset is also the only one that really needs the sync=always setting. If you're using NFS for VM storage, set your recordsize lower than the default 128K; most users settle on 16K or 32K as a balance that gives good random I/O performance without choking the sequential I/O too hard for bulk operations. Leave compression at the default LZ4 (and don't even think about deduplication).

Further to add: I would start with just the single 1TB L2ARC device and see how quickly/fully it can populate. You'll want to tune the parameters below to allow your device to fill faster than the default (8MB/s) - but ensure you leave enough spare bandwidth so that reads aren't affected.

vfs.zfs.l2arc_write_max - this is the maximum write speed into L2ARC under normal operation, in bytes. Adjust this upwards while monitoring read latency on the device; lower it if read latency is being impacted.

vfs.zfs.l2arc_write_boost - after system boot, but before ARC is filled, L2ARC is allowed to fill at a faster rate since there will be minimal, if any, reads from the devices.

Thank you for this wealth of information. It's going to be a lot for me to unpack, but I believe I understand what you are getting at. What size test file would you recommend I use for my performance tuning? I see a lot of people recommend bonnie++; would you also recommend it, or should I just stick with dd?

If I'm only going to have 1 pool, then mirroring the SLOG seems like a no-brainer, plus I don't have any other use for the extra SSD. I'll even make sure both SSDs are on the same CPU as the HBA to help even just that little bit extra.

Question though: what are your thoughts on the ashift option? I believe my HGST NAS drives are Advanced Format 512e, so would I benefit from setting ashift=12?

Most of my VMs have either 32GB or 64GB vDisks. Why are you allocating 8TB for your datastore?
I'm using around 2TB of storage with my VMs atm; I figured I'd quadruple it so I don't have to worry about it again. Though I'll probably be going with Layout 2 per HoneyBadger's recommendation.
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
I'm using around 2TB of storage with my VMs atm; I figured I'd quadruple it so I don't have to worry about it again.

Understood ... I have 2TB with 2x 1TB SSDs and will get a 2TB drive on Black Friday to increase my datastore, although with 15 VMs running I still have 500GB available.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thank you for this wealth of information. It's going to be a lot for me to unpack, but I believe I understand what you are getting at. What size test file would you recommend I use for my performance tuning? I see a lot of people recommend bonnie++; would you also recommend it, or should I just stick with dd?

Bonnie++ or FIO will give you good raw results; you can also test in-client from a VM for your VMFS results. For your test size, it needs to be sufficiently large to avoid ARC artificially inflating the read numbers - twice the total system RAM is often suggested.
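As a concrete example of that sizing with 128GB of RAM (paths are placeholders, and note that with LZ4 enabled a dd stream of zeroes compresses away to nearly nothing, so run the dd test against a dataset with compression=off or lean on FIO's random data):

Code:
# 256GB sequential write test (2x the 128GB of RAM) with dd:
dd if=/dev/zero of=/mnt/tank/test/ddfile bs=1M count=262144
# Random 8K reads over a 256GB working set with FIO:
fio --name=randread --directory=/mnt/tank/test --rw=randread --bs=8k \
    --size=256g --runtime=300 --time_based --group_reporting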

If I'm only going to have 1 pool, then mirroring the SLOG seems like a no-brainer, plus I don't have any other use for the extra SSD. I'll even make sure both SSDs are on the same CPU as the HBA to help even just that little bit extra.

Dual-CPU systems introduce the potential for NUMA-related issues. Definitely keep the PCIe-to-socket mapping in mind.

Question though: what are your thoughts on the ashift option? I believe my HGST NAS drives are Advanced Format 512e, so would I benefit from setting ashift=12?

FreeNAS by default won't create a pool with less than ashift=12 these days; and yes, 512e drives should absolutely be used with this setting to align with the physical sector size of 4KB.
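If you want to verify after the pool is created, something along these lines should do it (pool name is a placeholder; on FreeNAS I believe zdb needs to be pointed at /data/zfs/zpool.cache):

Code:
# Minimum ashift FreeNAS will use for new vdevs:
sysctl vfs.zfs.min_auto_ashift
# ashift actually in use on each vdev of the pool:
zdb -U /data/zfs/zpool.cache -C tank | grep ashift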

I'm using around 2TB of storage with my VMs atm; I figured I'd quadruple it so I don't have to worry about it again. Though I'll probably be going with Layout 2 per HoneyBadger's recommendation.

For a business, this is where I'd start them down the road of going all-flash and planning to expand as needed. Performance consistency will be much better, and you could start with the 2x1TB drives you have now - adding two more to make 4x1TB would give you 2TB in mirrors, and hopefully LZ4 compression would squash that down enough to let you actually put that amount logically on the drives without running afoul of the rough "80%" rule of thumb. (SSDs obviously are far more tolerant of fragmentation than spinning disks, so 80% usage isn't completely unreasonable.) But that would represent another chunk of budget allocated to two more SSDs right now.
 