
Recommended topology for a "fusion" pool?

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
Hi all,

I now have this test machine with 4x SATA HDDs and 2x SATA SSDs. My current production setup (same number of drives) uses a RAIDZ2 for "storage" and a mirror for my VMs.

I want to toy with this test system setting special_small_blocks to 16k for my VM ZVOLs and see how that works out. Would you recommend keeping the RAIDZ2 for the HDDs and adding the mirrored SSDs as the special allocation vdev, or should I setup everything as mirrored pairs? And either way - why? ;)

Thanks,
Patrick
 

ornias

Senior Member
Joined
Mar 6, 2020
Messages
281
This is a hard one, Patrick, and you know very well there is no "one right answer", you not-so-little troll :p

Okay, joking aside:
There is one thing I do not like about mirrored pairs (even though I use them myself): during a rebuild there is no way to recover a corrupt block. Simply put: if a disk fails, the data on the surviving disk becomes your new data, corrupt or not.


If you use your special vdev for both metadata and small blocks, I think it should already give you a significant improvement in IOPS from the disks, simply because fewer IOPS reach the HDDs (they get offloaded to the SSDs).

Whether to go all mirrored pairs comes down to one question: do you need even MORE IOPS than currently available?
Aka the age-old "what's your use case" question ;)
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
Well, more IOPS is not what I need - that's why I keep the VMs on a mirrored pool already. And while this new feature does look really intriguing, of course I notice that one would lose a degree of redundancy using it on the same hardware platform I currently have in production.
Since the mirrored SSD/VM pool is a lot smaller than the RAIDZ2 HDD pool, it's a no-brainer to replicate the VMs to the HDDs "just in case". That would be gone in a "fusion" setup.
Even worse: lose both SSDs (we all know the brown stuff that is supposed to happen once in a while) and the entire pool is gone for good.

OK, but let's assume I still do want a hybrid storage like the big commercial vendors promote - there will be no additional downside to mixing a RAIDZ2 vdev with a mirrored "special allocation" one, right? That was my main reason for asking. The IOPs aspects are all obvious - thanks ;)
Possibly in a commercial "enterprise" environment one would go with a RAIDZ2 or RAIDZ3 and at least a three-way mirror ...

Kind regards,
Patrick
 

ornias

Senior Member
Joined
Mar 6, 2020
Messages
281
OK, but let's assume I still do want a hybrid storage like the big commercial vendors promote - there will be no additional downside to mixing a RAIDZ2 vdev with a mirrored "special allocation" one, right? That was my main reason for asking. The IOPs aspects are all obvious - thanks ;)
Possibly in a commercial "enterprise" environment one would go with a RAIDZ2 or RAIDZ3 and at least a three-way mirror ...
Ahh, now I get it... Redundancy differences between using mirrored pairs for the special vdev and RAIDZ2 with double parity on the normal vdev.

Well, of course the SSDs are single-failure tolerant and the main storage array is double-failure tolerant.
But the smaller size and higher speed of SSDs also mean that recovery is a LOT faster, which means the risk of total data loss of that vdev is quite small. Even smaller than either choice for the main vdev.

But besides the obvious differences between the vdev types I highlighted above, there isn't really anything that makes the special vdev special in that regard. Special vdevs should have the same LEVEL of redundancy as the normal vdev they are connected to. But "level of redundancy" can refer either to "disks that can fail" or to "statistical data loss chance".
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
Here we go ...
Code:
  pool: fusion
 state: ONLINE
  scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    fusion                                          ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/0153f537-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0
        gptid/0170f15d-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0
        gptid/017f5a20-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0
        gptid/0164970c-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0
    special   
      mirror-1                                      ONLINE       0     0     0
        gptid/30135f24-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0
        gptid/3157fcd7-6612-11ea-99c2-0cc47afa3c72  ONLINE       0     0     0

errors: No known data errors
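For reference, a layout like the one above can be created in a single command. A sketch with hypothetical device names - the real pool was built from GPT partition ids:

```shell
# Hypothetical device names (da0..da3 = HDDs, ada0/ada1 = SSDs).
# RAIDZ2 data vdev plus a mirrored special allocation vdev:
zpool create fusion \
    raidz2 da0 da1 da2 da3 \
    special mirror ada0 ada1
```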
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
90
Yes, Fusion Pools should be designed based on "statistical data loss chance". No single vdev should be much worse than the others.

A mirror of small SSDs is similar to a Z2 stripe. However, there is always the question of sparing, or how long it takes to replace a failed drive. It's also good to know that the SSDs are reliable and power-safe.
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
Messing around with the test system I am delighted to find that you can create a fusion pool from the UI:

[Screenshot: creating a fusion pool in the TrueNAS UI]

Plus TrueNAS does the right thing [tm] and creates swap partitions on the disks of the data vdev only ...

Well done, folks, well done!
Patrick
 

jenksdrummer

Member
Joined
Jun 7, 2011
Messages
95
There is one thing I do not like about mirrored pairs (even though I use them myself): during a rebuild there is no way to recover a corrupt block. Simply put: if a disk fails, the data on the surviving disk becomes your new data, corrupt or not.
No offense intended; but I believe this is inaccurate with ZFS.


With traditional RAID1 you are correct: the controller basically makes an assumption as to which is the good block and which isn't.

As I understand it, ZFS mirrors are similar to traditional RAID1 with regards to the blocks, but you also get checksums for those blocks to compare against. Between the two blocks and their checksums you have four pieces of comparison data; so when ZFS comes across a block (or checksum) that doesn't match, it can make a valid assessment and correct it, because three items match and one doesn't.


But, like you, it messes with my 20+ years in the industry (though I'm semi-new to ZFS); so I run Z2.
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
@jenksdrummer in a typical situation one would pull one defective disk and replace it with a new one, so ZFS does not give you more safety than a traditional mirror implementation. If you have an empty drive bay or a hot spare - good for you ;)
 

jenksdrummer

Member
Joined
Jun 7, 2011
Messages
95
@jenksdrummer in a typical situation one would pull one defective disk and replace it with a new one, so ZFS does not give you more safety than a traditional mirror implementation. If you have an empty drive bay or a hot spare - good for you ;)
I get that, I'm pretty aggro about replacing disks that show any questionable blocks...

But there are two factors to consider...

A) A physical issue with the disk. In this case SMART shows an error and increments the value showing a portion of the disk is unusable. This happens only when data is read and can't be, or when data is being written and can't be. The mirrored disk would not have this issue and its data is not corrupt; the freshly inserted disk gets this copy and all is well - and this is the same for a traditional mirror as well as ZFS. To that end, one could still run a scrub against the disk showing the SMART error and correct it, as the block of data and its checksum vs. the mirrored block and checksum would still provide a strong level of assurance that the data is consistent - THEN replace the drive. Caveat: if it's got a significant number of bad clusters, it may be prudent to replace anyway, but if it's just one... scrub, then swap. :)

B) Bitrot - generally speaking, this is when data on the disk gets corrupted by some means or another, but the block is still readable; for traditional arrays this goes unnoticed until a human discovers it. If the traditional mirrored pair is resynced, there's a 50/50 chance it will sync the incorrect data; and any time the data is read, there are the same odds that what it returns is the corrupt version. With ZFS this gets noticed when the data is read, thanks to checksum verification, as well as when a scrub is run; and because both blocks have checksums, a mirrored pair gives three other pieces of data to check against to determine which block is the correct one. I.e., only a second corruption could potentially make it equal to a traditional mirror - and I say potentially, because that would require the two disks to hold different blocks of data that should be identical, with the checksum on each disk matching its respective block...

Under B, replacing a disk is not necessary. I've had bitrot corrected during a scrub. Once. The disks were fine - tested them every way from Sunday.
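The scrub-then-swap workflow described above boils down to two commands (pool name is a placeholder):

```shell
# Verify every block against its checksum; repairable corruption
# on a mirror is fixed from the other side's good copy.
zpool scrub tank

# Afterwards, check for repaired or unrecoverable errors:
zpool status -v tank
```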
 

jenksdrummer

Member
Joined
Jun 7, 2011
Messages
95
So, count me in on this topology for now; have some cycles free with my larger array and can do a bit of testing.

2 VDEV x 6 HDD Z2 via 3008-IR in JBOD mode. (onboard, warranty destroyed if I flash it to IT mode, so, leaving it...)
1 Special VDEV (meta/small) x 2 SSD Mirror via onboard SATA ports.
1 Cache NVMe M.2; though I might try this for dedup vdev if I have time.
Boot disks are 2x DOM SATA/SSD

If reporting worked a bit better, I might be able to see with some accuracy what data is being read/written at the disks, to tell whether the special vdev is being hit.
 

Yorick

Neophyte Sage
Joined
Nov 4, 2018
Messages
1,449
This will be obvious to everyone in this thread, but just in case someone who is not very ZFS-savvy comes across it and thinks "cool, it's like a tiered system / caching thing, I'll try it and get rid of it if I don't like it!":

"One of the ZFS mailing lists had a complaint from a user that added a special allocation vdev, played around with it, and then found they couldn't delete it easily like a SLOG or L2ARC device. N.B. For now it is permanent once added; the pool must be copied, destroyed and rewritten if you change your mind."

That rabbit hole is here: https://github.com/openzfs/zfs/issues/9038 . The guy thought that a vdev receiving data was like an L2ARC, and that just shows how non-intuitive ZFS can be to Joe Average Sysadmin.

As in everything ZFS, plan ahead. I expect that if it's a pool of all mirror vdevs, no raidz anywhere, then one could remove a mirror special vdev - but why one would want to do that is questionable. Do performance tests on a pool you don't mind blowing away, in case the special vdev turns out not to do much for your workload. And pay attention to the discussion about failure rates in this thread.
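To illustrate the one-way door (pool and vdev names taken from the zpool status posted earlier; behavior as I understand OpenZFS device removal at the time of writing):

```shell
# Device removal only works for pools built entirely of mirrors or
# single disks; with the raidz2-0 data vdev present, removing the
# special mirror is expected to fail:
zpool remove fusion mirror-1
```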
 

sretalla

Dedicated Sage
Joined
Jan 1, 2016
Messages
2,592

NickF

Member
Joined
Jun 12, 2014
Messages
92
I have related questions.

I currently run two separate pools, with obviously two separate iSCSI targets, for my VMs in my homelab. One is a mirror of two first-generation Samsung enterprise NVMe drives, the second is two Intel mixed read/write SATA SSDs.

Initially I only had the two Samsung drives but I needed more space and didn't want to relegate this storage to spinning disks for things like Plex metadata and a Syncthing target for some files.

I'm not saying it would really matter for my use case, but let me outline a scenario. I could theoretically migrate all the VMs in vSphere to the SATA SSD pool, blow up the NVMe pool, then add the NVMe drives as a special vdev on the SATA pool and build a "fusion drive".

Would I then get, under whatever "normal" scenario, mostly the performance of the NVMe pool but the combined space of both original pools? From a storage utilization and management perspective that would be kinda cool.

How does ZFS handle what goes where? Does it try to put stuff that was evicted from the ARC in the faster tier - sort of like a persistent "level 3" ARC/data vdev combo? ZFS usually distributes writes across vdevs based on existing allocation, no? I also remember Allan Jude talking about moving to a scheme where whichever vdev is "fastest" gets the write commit - if not the above, is that how this type of pool would function?

Sorry if I missed a beat somewhere. I haven't been following OpenZFS all that closely until recently.
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
@NickF By default ZFS puts only metadata on the special allocation vdev. You can then set a maximum block size for which data blocks will also go to the special vdev. I don't know yet what happens when this vdev reaches its capacity.

So in case you are running VMs and storage/sharing services on the same NAS, you could set that size to 16k. That way all VM disks (ZVOLs) would go to the special vdev and the larger (128k) storage blocks to the regular spinning disks.
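A minimal sketch of that split, assuming the VM zvols live under a dataset of their own (dataset names hypothetical):

```shell
# Blocks <= 16K from fusion/vms (i.e. the 16K zvol blocks) go to the
# SSD special vdev; metadata goes there by default anyway.
zfs set special_small_blocks=16K fusion/vms

# The bulk storage dataset keeps its 128K records on the RAIDZ2 HDDs:
zfs get recordsize,special_small_blocks fusion/storage
```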

HTH,
Patrick
 

Yorick

Neophyte Sage
Joined
Nov 4, 2018
Messages
1,449
That way all VM disks (ZVOLs) would go to the special vdev and the larger (128k) storage blocks to the regular spinning disks.
That’d surprise me. Special allocation classes can contain small file blocks, which is set on a per-dataset level. If VM storage is via NFS, the files aren’t small; if it’s via iSCSI, they’re not files. From the man page:

"Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property."

The only way I see to keep the performance of the NVMe and use them for VM storage is to have them in their own pool.

Or place them into the same pool as the SATA SSDs as a mirror vdev, and get the additional size but the pool performance of the SATA SSDs.

Be careful when experimenting. A special allocation vdev is a data vdev and follows the same rules. I think you can remove it again if your pool is all mirrors, with the expected RAM penalty for removal, but test that first.

You definitely can’t remove it again if there’s a raidz in the pool. Not your use case, just putting that here for others who may come across the thread.
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
I meant bhyve ZVOL disk images, which have a volblocksize of 16k by default - or 4k for the EXT4 or NTFS "disks" I create here.
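For context, such a zvol would be created along these lines (name and size hypothetical; 16K shown explicitly even though it is the default here):

```shell
# A 20 GB zvol with 16K volume blocks for a bhyve VM disk:
zfs create -V 20G -o volblocksize=16K fusion/vms/vm1-disk0
```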

Patrick
 

Yorick

Neophyte Sage
Joined
Nov 4, 2018
Messages
1,449
That’s my point - I don’t think zvols have that option. Only datasets with special small file block size.

Now one could set a dataset with a recordsize the same as special small blocks, and I don’t know what would happen then.

Test, test and test again, before making changes to a live pool. Intuitively, special allocation vdevs were meant for small files, not block storage. It might be possible to get around that, by making sure every file block is small, and I’d be very cautious to make sure that’s as designed and not a behavior that ZFS might consider a “bug” going forward.
 

Patrick M. Hausen

Dedicated Sage
Joined
Nov 25, 2013
Messages
1,601
When testing I set a small block size on the parent dataset of my zvols, and I could definitely see the SSDs filling.
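One way to watch that happen is per-vdev allocation (pool name from the status output earlier in the thread):

```shell
# The ALLOC column for mirror-1 under "special" grows as metadata
# and small blocks land on the SSDs:
zpool list -v fusion
```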
 