Multiple zpools vs multiple vdevs


Bulldog

Dabbler
Joined
Jan 16, 2016
Messages
18
I really don't see your point here in complaining that the pool is lost if any vdev fails. How is this any different from losing more than two disks in a single-vdev RAIDZ2 pool, or more than one in a RAIDZ1? You lose the entire pool there as well. If you lost more than two disks per vdev at the same time, you've got bigger problems anyway.
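To put that in concrete terms, here's a toy sketch (plain Python, nothing from ZFS itself; the vdev layouts below are made up for illustration) of the tolerance math:

```python
# Toy model (not ZFS code): a pool survives only while every vdev stays
# within its parity tolerance: 1 failed disk for RAIDZ1, 2 for RAIDZ2.
PARITY = {"raidz1": 1, "raidz2": 2}

def pool_survives(vdevs):
    """vdevs: list of (raidz_level, failed_disk_count) tuples, one per vdev."""
    return all(failed <= PARITY[level] for level, failed in vdevs)

print(pool_survives([("raidz2", 3)]))                   # False: 3 failures kill a lone RAIDZ2 vdev
print(pool_survives([("raidz2", 3), ("raidz2", 0),
                     ("raidz2", 0), ("raidz2", 0)]))    # False: same loss, just in one of four vdevs
print(pool_survives([("raidz2", 2)] * 4))               # True: up to 2 per vdev, 8 failures total
```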
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry to disagree, but the basic intended use case --the one that's always true regardless of what other use case they thought they were developing for-- was use by humans. Humans who will sometimes be tired, or hungover, or distracted, and who are guaranteed by physical law to make mistakes.

Okay, but let's put that "use case" theory to the test here.

If there is a problem, there is *no* scenario in existence with ZFS that forces you to administer it while tired, hungover, distracted, or otherwise not "fully functional" or else risk losing all of your data immediately. So the "immediate need" to handle a problem right away is simply not there. You can wait. As some people will attest, you can sometimes wait weeks to do things like replace a disk. This is not to be confused with a "production need" or "boss aggro" to get it done immediately, which is a whole different set of problems and has nothing to do with ZFS. Or, as we put it in nuclear power, there is "no need to have non-casualty procedures memorized". You have time to consult manuals, do mock-ups, and even get a second opinion (or a third, if you want).

When I worked in nuclear power, "human performance factors" (working long hours, working at night, all that pesky stuff that affects a human's ability to perform at tip-top shape) were taken seriously, and those kinds of things are *always* discussed at a job brief before any work is done. Jobs are sometimes rescheduled as a result. It happens. That's life.

And you know what? Even when you are on top of your game, things *can* and *will* go wrong. You might go out to do some very simple maintenance, shake a rack of components while working, and cause an emergency shutdown of a reactor. Yes, that has happened. The important thing is to have procedures and plans in place to recover when things do go badly. So do you have good backups? Are you checking them regularly? Are you in a position to restore those backups and come back up if your production server suddenly caught fire and you had no hardware to reuse? I'm guessing you probably don't. And to be honest, most people don't.

But I think it's a major falsehood to argue that you need ZFS to be usable by a guy who's hungover on a Saturday morning at 5 AM with no coffee. If that kind of guy is even being asked to work on your server, your employer has probably made a lot of other serious mistakes of much greater consequence than simply expecting a hungover IT guy to do a good job on a file server. They should have someone else they can turn to. Someone in the department who is experienced and trained should be "on call" and should be expected to not drink. I've worked those kinds of jobs my whole adult life. Nothing special there at all.

It's also the reason why the operators at a nuke facility (I used to know which one but can't now remember) put beer-tap handles on the crucial switches: the nitwit who designed the control panel had made it look neat and clean, all the switches alike, which of course is exactly what's wanted in an emergency: the inability to immediately tell which switches to throw to scram the reactor.

As someone who has stood in a control room (or a simulator) at several nuclear power plants, I can tell you that your story is a bit skewed and not reflective of reality.

Also, all of the plants I've seen had trip switches that were push-button. They looked *nothing* like anything else in the room. They were covered with a plexiglass cover to prevent inadvertent tripping. There were also two switches, so that if you *still* managed to screw up and pressed one switch, you wouldn't get a reactor trip; you had to press both simultaneously. That itself was another differentiator, because virtually nothing else required you to use both hands except for moving the control rods (totally different controls, which makes confusion there impossible).

Sorry for the lack of clarity! I meant porting ZFS's scrubbing code.
That does nothing. Literally, you're asking iXsystems to reinvent the SMART long test, which already checks the disks for bad sectors. What you really mean is porting ZFS's scrubbing code along with the checksumming and everything else that makes ZFS unique. That's totally implausible to even consider, given that it took Sun multiple years to develop ZFS while having total control of the OS, software and hardware.

What you are asking for is literally not what you want, and what you really want is "to use ZFS" because that's all iXsystems would be doing. Reinventing the wheel. ;)
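For what it's worth, both of the things being contrasted here already exist and can be run today. A minimal sketch, wrapping the stock CLI tools from Python; the device and pool names are placeholders:

```python
# Minimal sketch wrapping the two stock tools being contrasted above.
# Device and pool names are placeholders for your own.
import subprocess

def smart_long_test(device: str = "/dev/ada0") -> None:
    # SMART extended self-test: the drive scans its own surface for bad sectors.
    subprocess.run(["smartctl", "-t", "long", device], check=True)

def zfs_scrub(pool: str = "tank") -> None:
    # ZFS scrub: walks every allocated block and verifies its checksum,
    # repairing from redundancy where possible; this is the part that can't
    # be "ported" without taking the rest of ZFS along with it.
    subprocess.run(["zpool", "scrub", pool], check=True)
```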
 

fn369

Explorer
Joined
Jun 17, 2016
Messages
60
Getting back to the original title of the thread, "Multiple zpools vs multiple VDevs"...

My understanding is that, in order not to suffer performance issues, the VDevs in the zpool should not only contain the same number of drives, but also the same size drives.

For general storage / Plex media etc., I believe 5,400RPM drives in RAIDZ2 configuration are recommended - while for ESXi storage one might use SSDs in striped / mirrored config.

Can one combine both types of storage on a single server? I was wondering about using 2.5" to 3.5" mounting kits for this very purpose.

If so, wouldn't it make sense to create 2 pools, so as to optimise performance?
 

Attachment: SabrentSSDTray.jpeg

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
My understanding is that, in order not to suffer performance issues, the VDevs in the zpool should not only contain the same number of drives, but also the same size drives.
Wrong. ZFS will intelligently distribute writes among vdevs.
 

fn369

Explorer
Joined
Jun 17, 2016
Messages
60
Wrong. ZFS will intelligently distribute writes among vdevs.
OK, thank you, that's good to know. I could have sworn I'd read something to the contrary by @jgreco or @cyberjock over the past couple of days, but I'm happy to be wrong.

Do you have any thoughts about combining SSD VDevs with 5,400rpm VDevs in the same zpool, or does your previous response still apply?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK, thank you, that's good to know. I could have sworn I'd read something to the contrary by @jgreco or @cyberjock over the past couple of days, but I'm happy to be wrong.

ZFS is only so intelligent. It isn't actually magic, so if you create vdevs that are horribly different in their characteristics, that'll be bad.

You may be thinking of this, where I said:

I am a strong opponent of heterogeneous pools (search term: heterogeneous in the forum search). The drives should be the same size and ideally the same speed; mixing things will cause the pool to perform as poorly as the slowest/smallest/whatever.

We do not suggest using differing numbers of drives in future vdevs. Six is a great number. Mixing vdevs with different numbers of drives creates many performance challenges.

It is also "frowned upon" to use different drive SIZES in additional vdevs, but primarily because it may lead to unequal pool IOPS loading over time - a larger vdev will see a larger percent of the workload. All things considered, this is not terrible, and if it is convenient to create a 6-drive 6TB-disk vdev today, and in three months you want to add a 6-drive 8TB-disk or 10TB-disk vdev because you need the space, this is not going to be the end of the world. ZFS will intelligently cope with the situation as well as can reasonably be expected. It would probably be ugly in the long term if you ended up with a vdev of 1TB disks, and another of 2TB disks, and another of 6TB disks, and another of 12TB disks. You'd get very weird IOPS distribution and it might be "frustrating" that a 24 disk system was acting a lot slower (because so much of the load is ending up on the 12TB-disk vdev).

The more different your vdevs are from each other, the more unequal the IOPS loading will become. ZFS "copes" with vdevs of differing sizes implicitly by looking at factors such as the percentage full, but at the end of the day if you have a vdev of six 1TB drives and another vdev of six 12TB drives in a pool, the 12TB vdev is going to be seeing most of the IOPS, resulting in severely unequal loading. No amount of magic pixie dust can fix this.
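If it helps to see the imbalance, here's a back-of-the-envelope simulation (plain Python, not ZFS's actual metaslab allocator, which weighs more than just free space; the capacities are made-up round numbers) of writes being steered in proportion to each vdev's remaining free space:

```python
import random

# Toy allocator (not ZFS's real metaslab code): each write lands on a vdev
# with probability proportional to that vdev's remaining free space.
random.seed(0)
free = {"6x1TB-raidz2": 4_000.0, "6x12TB-raidz2": 48_000.0}   # rough usable GiB
writes = {name: 0 for name in free}

N = 100_000                          # 100k writes of 0.1 GiB each
for _ in range(N):
    r = random.uniform(0, sum(free.values()))
    for name, f in free.items():
        if r <= f:
            writes[name] += 1
            free[name] -= 0.1
            break
        r -= f

for name, count in writes.items():
    print(f"{name}: {100 * count / N:.0f}% of writes")
# Prints roughly 8% vs 92%: the big vdev absorbs nearly all of the write
# (and therefore later read) IOPS.
```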

Do you have any thoughts about combining SSD VDevs with 5,400rpm VDevs in the same zpool, or does your previous response still apply?

Yeah, don't do it. Make a separate pool for SSD.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Do you have any thoughts about combining SSD VDevs with 5,400rpm VDevs in the same zpool, or does your previous response still apply?
No, I was thinking more along the lines of mirrored 2TB vdevs when the existing vdevs are 500GB, for instance.

Mixing SSD and spinning rust isn't a good idea.
 

fn369

Explorer
Joined
Jun 17, 2016
Messages
60
ZFS is only so intelligent. It isn't actually magic, so if you create vdevs that are horribly different in their characteristics, that'll be bad.

You may be thinking of this, where I said:

The more different your vdevs are from each other, the more unequal the IOPS loading will become. ZFS "copes" with vdevs of differing sizes implicitly by looking at factors such as the percentage full, but at the end of the day if you have a vdev of six 1TB drives and another vdev of six 12TB drives in a pool, the 12TB vdev is going to be seeing most of the IOPS, resulting in severely unequal loading. No amount of magic pixie dust can fix this.
That's exactly what I was thinking of, thank you! I was starting to wonder whether I'd imagined it! It all makes a lot of sense, and I will be sure to keep it in mind.

Yeah, don't do it. Make a separate pool for SSD.
Thank you very much for providing a final answer to the discussion that ran through this thread. Extremely helpful.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's exactly what I was thinking of, thank you! I was starting to wonder whether I'd imagined it! It all makes a lot of sense, and I will be sure to keep it in mind.

No problem. It's both right to say "ZFS can cope with it" and also right to say "but the IOPS imbalance could make you want to tear your hair out." It really depends what you're doing with the pool as to whether or not it is a good idea.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
That's a very common point. I believe it had to do with a violation of atomicity of transactions, which effectively meant the pool has to be taken offline (with all that implies) so that a BPR can take place. Offline BPR is less interesting, but might be a more realistic objective.

The other issue is whether it might be possible to atomically rewrite block pointers while the pool is online. In fact it is, but the trick is that the BPR code touches all layers of ZFS, making it a nightmare to support; it becomes a layering violation (interestingly, ZFS itself is a layering violation). So once you add BPR, every future change has a very high chance of breaking BPR, with catastrophic data-loss effects.

True lock-free multithreaded code is hard. Doing it in C is even harder.

ZFS doesn't support online vdev expansion/reshaping because it's technically difficult to do.

I'd settle for offline expansion/reduction/reshaping/rebalancing/defragmentation.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
There are certainly individuals who believe that if it's free no one should complain. But fortunately there are other people --most of them engineers-- who believe that engineering should be done well regardless of the sticker price because it's their self-respect that's at stake.

Yes, ZFS is not perfect, but as I understand it there are 4 options for a modern FS with integrity.

1) ZFS
2) BtrFS
3) ReFS
4) HAMMER

ZFS has issues with being unable to reshape RAIDZx vdevs, and you can't shrink a vdev.

BtrFS has trouble rebuilding a RAID5/6 partition when it needs to.

ReFS is Microsoft specific, and not there yet.

HAMMER is DragonFly specific.

And unfortunately, Apple has dropped the ball on APFS for the moment; until they decide to add checksum/integrity features, it's a bit ho-hum.

Out of all of those options, only ZFS provides maturity, integrity and cross-platform availability. And until Btrfs RAID5/6 support is production-ready, I think ZFS is the engineer's best option; after all, engineering is the art of dealing with compromises while still attaining your goal.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
HAMMER is also a niche filesystem. Last I heard it was mostly being developed by Matt Dillon, and slowly at that. Whether this would ever become something more than a research-quality filesystem is questionable to me.

BtrFS is the most likely eventual competitor to ZFS.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
HAMMER is also a niche filesystem. Last I heard it was mostly being developed by Matt Dillon, and slowly at that. Whether this would ever become something more than a research-quality filesystem is questionable to me.

Right, I didn't want to judge it by saying it was a one-man FS, since I haven't investigated it much further than Wikipedia.

BtrFS is the most likely eventual competitor to ZFS.

Agreed. Pity they're both Oracle's.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Right, I didn't want to judge it by saying it was a one-man FS, since I haven't investigated it much further than Wikipedia.

http://apollo.backplane.com/DFlyMisc/hammer2.txt

Look at the dates and progress. I think what he really needs is a team of filesystem engineers and a bunch of funding.

Agreed. Pity they're both Oracle's.

Well, OpenZFS is pretty much no-longer-Oracle's.

UFS has been around for far too long and some of the design assumptions are showing their age. ZFS was about ten years too early, but has matured nicely and people tend to freak out less when you tell them "8GB RAM minimum" for FreeNAS.

We really could use a few other competent options. Not that I don't like ZFS, but new ideas are always good.
 

Allan_Guy

Dabbler
Joined
Feb 20, 2015
Messages
19
I'm not sure this was addressed.

If you have a bunch of slow drives and also a bunch of SSDs, it would seem to make sense to keep the SSDs and the slow drives in separate zpools. For instance, I have a ton of Plex users accessing my slow storage, which serves up the movie and TV files. However, if I want to do DVR and post-processing on SSDs to speed it up, would it make sense to keep them separated? Including a separate RAID controller (i.e. SAS2 for the slow drives and SAS3 for the SSDs)?

Secondly, if I create a zpool of, say, 11 drives in RAIDZ2 to copy stuff over from another server for migration, and I fill it to 99% (rendering it slower than slow), then remove the drives from the old server and add them to the same zpool as a second 11-drive RAIDZ2 vdev, will the zpool balance the data across the two vdevs at 50% capacity each? Or will the first vdev continue to perform as if it were 99% full?

I do understand that I could just move the drives and import the volume on the new server. But I'm going from a 10-drive Z2 to an 11-drive Z2... and will end up with 3 vdevs of 11 drives in Z2, plus 3 hot spares, on a 36-bay hot-swap server. The SSDs will then be mounted inside.

Thanks
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I'm not sure this was addressed.

If you have a bunch of slow drives and also a bunch of SSDs, it would seem to make sense to keep the SSDs and the slow drives in separate zpools. For instance, I have a ton of Plex users accessing my slow storage, which serves up the movie and TV files. However, if I want to do DVR and post-processing on SSDs to speed it up, would it make sense to keep them separated? Including a separate RAID controller (i.e. SAS2 for the slow drives and SAS3 for the SSDs)?

Maybe. Two questions here... (1) SSDs in separate pool or not, and (2) if yes on (1), do you need a separate HBA.

On (1), "it depends" on the I/O requirements (IOPS, bandwidth) of the processing software. If the software can drive more I/O than your HDDs can supply, then yes, use SSDs and it will speed up. (I don't know enough about your workload to give a more informed answer. On the limited video processing I do, I don't see a speed-up when using SSDs vs. HDDs.)

On (2), you don't want to use a RAID controller with ZFS, but an HBA (i.e. FreeNAS gets direct access to the disks). Assuming that's what you meant, then again "it depends" on the aggregate workload demand for I/O. Measure that and compare it against what a single HBA can handle.

If you have 12Gb/s SAS SSDs, then yes, put them on a SAS3 HBA.
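As a rough sanity check for (2), you can just add up the numbers. A quick sketch; every figure below is a ballpark assumption, not a measurement of your hardware:

```python
# Back-of-the-envelope: could the aggregate device throughput exceed one
# HBA's ceiling? Every number here is a rough placeholder, not a measurement.
hdd_count, hdd_mb_s = 33, 150      # e.g. 3 x 11-wide vdevs; ~150 MB/s per 5400rpm HDD (guess)
ssd_count, ssd_mb_s = 4, 500       # ~500 MB/s per SATA SSD (guess)

# A typical 8-lane SAS2 HBA in a PCIe 2.0 x8 slot ends up limited by the
# PCIe side; call it ~3,200 MB/s usable (assumption).
hba_ceiling_mb_s = 3_200

demand_mb_s = hdd_count * hdd_mb_s + ssd_count * ssd_mb_s
print(f"worst-case sequential demand: ~{demand_mb_s} MB/s vs ~{hba_ceiling_mb_s} MB/s per HBA")
# Only if the real workload can actually generate demand past the ceiling
# does a second HBA (e.g. SAS3 for the SSDs) start to pay off.
```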

Secondly, if I create a zpool of, say, 11 drives in RAIDZ2 to copy stuff over from another server for migration, and I fill it to 99% (rendering it slower than slow), then remove the drives from the old server and add them to the same zpool as a second 11-drive RAIDZ2 vdev, will the zpool balance the data across the two vdevs at 50% capacity each? Or will the first vdev continue to perform as if it were 99% full?

No, existing data in a pool is not re-balanced when adding a new vdev. However, any new writes will be distributed over the two vdevs in relative proportion to the free space in each. In your case since the original vdev is 99% full and the new vdev is 0% full, basically all new writes will go to the new vdev. The only way to "re-balance" is to read the data and rewrite it. i.e. copy it and delete the original.
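A tiny bit of arithmetic to make that concrete (the capacities are made-up round numbers, not a model of ZFS internals):

```python
# Toy arithmetic: where do new writes land after adding an empty vdev
# to a nearly full pool? Capacities below are assumptions.
old_cap = new_cap = 60_000                     # usable GiB per 11-wide RAIDZ2 vdev (assumption)
old_used, new_used = int(old_cap * 0.99), 0    # migrated data fills the first vdev to 99%

free_old = old_cap - old_used                  # 600 GiB
free_new = new_cap - new_used                  # 60,000 GiB
share_to_new = free_new / (free_old + free_new)
print(f"~{share_to_new:.0%} of brand-new writes land on the new, empty vdev")   # ~99%

# Existing data only spreads out if it is rewritten: copy it (the rewritten
# blocks are distributed by free space) and then delete the originals.
```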
 

Allan_Guy

Dabbler
Joined
Feb 20, 2015
Messages
19
I happen to have a second SAS3 HBA (a RAID controller in HBA mode).

When playing back a DVR recording while a live recording is going on, if the Plex workload is high and/or copy traffic is heavy, the DVR response time seems to lag a bit when skipping through commercials.

Thanks for the vdev info. I'll re-copy half the data before moving on. Out of curiosity, once I add the second and/or third vdev, will the alert disappear even though the first vdev is still performing horribly? Since most people add more storage only after the performance impact has already hit, or too late, should this be addressed in a new release?

Thanks
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I believe any "% full" notice from FreeNAS should disappear as the new capacity is added with new vdevs. (Assuming the new % full with the new capacity is below the alert threshold.)

Note, when doing the copy, I don't know if ZFS actually writes new data blocks if the copy is done within the same dataset, or if it just updates (or creates new) block pointers. I think it does create new data blocks if copied between datasets. But I would double check that or let someone more knowledgeable on this answer. (There are a few re-balance threads on the forum that might answer this point.)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
mv from one dataset to another should rebalance. Providing there are no snapshots of the first dataset.
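If you want to confirm the effect, you can watch per-vdev allocation before and after the move. A small sketch; the pool and dataset paths are placeholders:

```python
# Sketch: check per-vdev allocation before and after a cross-dataset move.
# Pool and dataset paths are placeholders; `zpool list -v` is the stock
# command that prints per-vdev ALLOC/FREE.
import subprocess

def show_vdev_usage(pool: str = "tank") -> None:
    out = subprocess.run(["zpool", "list", "-v", pool],
                         capture_output=True, text=True, check=True)
    print(out.stdout)

show_vdev_usage()
# ... then, with no snapshots holding the old blocks:
#     mv /mnt/tank/olddataset/* /mnt/tank/newdataset/
show_vdev_usage()
```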
 