Multiple zpools vs multiple vdevs

Status
Not open for further replies.

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If the vdevs are more or less evenly full, ZFS will use both for writes, spreading the workload to more disks. Reads of that data will naturally use both vdevs, obviously.

Also obviously, if you have separate pools, they'll be completely independent.
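As a quick illustration (disk names below are just examples), a pool built from two mirror vdevs stripes writes across both, and you can watch the per-vdev activity:

Code:
zpool create tank mirror ada0 ada1 mirror ada2 ada3   # one pool, two mirror vdevs
zpool iostat -v tank 5                                # per-vdev read/write activity every 5 seconds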
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
Though I'm new to ZFS, I've a fair bit of industry experience on the dev end, and my reaction on reading Cyberjock's primer was a case of the cold grues. Lose a vdev and your whole pool goes away? Can't afford enough memory and your whole pool goes away? Don't have a complete technical understanding of how to set things up and your whole pool goes away? That's pretty fragile!!! From here it looks like the filesystem equivalent of PostgreSQL: a group of academic-engineers' idea of a "neat hack" that was released into the wild with no attention to basic human-factors issues.

The late Sir Terry Pratchett OBE wrote that if someone installed, at the back of the most remote cave in the most difficult-to-reach part of the world, a switch that would destroy the planet, and put up a big sign "WARNING! END-OF-WORLD SWITCH! DO NOT TOUCH!!!!", the paint wouldn't even have time to dry.

So I was looking around for answers to the same question DG started this thread with: why not multiple pools with one vdev apiece? That was the easiest question I could think of. Harder ones would be "what on Earth possessed the designers to make vdevs non-expandable?" and "what made them think balancing everything on the edge of a precipice is good systems engineering?"

I'm building a pool of three mirror vdevs, each with just one drive for now. I was planning on adding a second drive to each vdev once I free those drives up from their current role as backup drives. But now I see that I can't do that -- vdevs emulate real devices in that they can't be made bigger (perhaps that was the designers' motive: complete hardware emulation? :rolleyes:)

But now that I know how fragile the architecture is, I am going to allocate 1 vdev per pool. I don't create spanned volumes now, so distributing my files won't be an inconvenience and distributing the vdevs will limit any disaster. So long as hardware is made in China with run-of-the-fab chips (probably recycled chips, too) and poor or no QA, disaster is to be expected. And with one vdev per pool, I can "expand" a vdev by creating a bigger one and then shifting the files. Less convenient than expanding in place, but it seems to be my only option.
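From what I can gather from the docs, that shift would look roughly like this (pool and dataset names are made up):

Code:
zfs snapshot -r oldpool/data@migrate                           # snapshot everything to be moved
zfs send -R oldpool/data@migrate | zfs recv -F newpool/data    # replicate it onto the bigger pool

and then retire the old pool once everything checks out on the new one.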
 
Last edited:

pirateghost

Unintelligible Geek
Joined
Feb 29, 2012
Messages
4,219
What? You can make a vdev bigger by replacing drives with larger ones.

It really isn't as fragile as you are trying to make it sound. Take off the tin foil hat and just plan your system accordingly
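For example (pool and device names are just examples), you swap one disk at a time and let each resilver finish:

Code:
zpool set autoexpand=on tank     # let the vdev grow once every disk has been replaced
zpool replace tank ada1 ada4     # swap a smaller disk for a bigger one
zpool status tank                # wait for the resilver to complete before the next replace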
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Though I'm new to ZFS, I've a fair bit of industry experience on the dev end, and my reaction on reading Cyberjock's primer was a case of the cold grues. Lose a vdev and your whole pool goes away? Can't afford enough memory and your whole pool goes away? Don't have a complete technical understanding of how to set things up and your whole pool goes away? That's pretty fragile!!! From here it looks like the filesystem equivalent of PostgreSQL: a group of academic-engineers' idea of a "neat hack" that was released into the wild with no attention to basic human-factors issues.

<snip>

Harder ones would be "what on Earth possessed the designers to make vdevs non-expandable?" and "what made them think balancing everything on the edge of a precipice is good systems engineering?"

Easy answers to all of your questions. ZFS was made by Sun. They had no expectation, or intention, of seeing ZFS used in anything but large-scale file servers with 5-figure (or larger) price tags, managed under a support contract, on hardware that they would sell you and have total control over.

Of course, you read that, and the little home server you built is not 5 figures in cost, has no support contract, and isn't controlled by a vendor that will tell you what you "can" install in your server for a fairly hefty price tag. Since you're taking it upon yourself to build the server rather than buy it from a vendor, and since FreeNAS is a free OS that you can use at no cost, you (the user) are expected to avoid the pitfalls that Sun's own engineers avoided by being responsible for the entire product line, beginning with engineering the hardware and software and ending with a support contract post-installation.

You had no chance of calling Sun and saying "I want a system with only X GB of RAM". They'd say "very well... have a nice day". You couldn't call them, name your price, and expect them not to laugh at you. They would tell you how much RAM you'd have to buy, and they would tell you how much you were going to pay. It was very non-negotiable. Even on TrueNAS, the smallest system is 64GB of RAM (if I'm not mistaken). For nearly 100% of us, that's more than our boards can even support, let alone what we'd actually install.

The system is not as fragile as you are making it out to be. You just have to know what the limitations are, and work with them instead of against them. Hardware RAID didn't always support OCE (online capacity expansion). Back in those days you couldn't add more disks to a RAID array, period. If you look at unRAID, there are limitations there. Check out LVM in Linux and it has limitations too. Even Windows' software RAID has limits. I included the limits so you'd know what your boundaries are based on what you could imagine as possible valid configurations. It's far easier to discuss the limits of a given file system than to discuss all of the possible scenarios that can exist.

In the meantime, I will happily keep running with multiple vdevs in my zpool.
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
The system is not as fragile as you are making it out to be. You just have to know what the limitations are, and work with them instead of against them.
Heh. That's the old-school sysadmin's version of the old-school software developer's "it's not hard to use, you just have to learn how".

Yet software, including ZFS, is much more mutable than humans are. Human limitations are inflexible: we're limited-capacity serial processors. We never have access to all our memories, we can never attend to everything in our environments. It's guaranteed by physical law that we'll make mistakes over and over again.

So even software destined to be used by organisations that can throw money at it should be designed and written to minimise the need to do that. That's something we humans can do: design and build to minimise the number of errors made and the consequences of the errors that slip through. To do less is just bad engineering.

It's just as hostile, though the consequences are less bloody, to expect the users of software to lose their data because the engineers couldn't be bothered to do a proper job as it would be to expect car buyers to accept that their brakes might go out on the highway if they happen to mis-steer and clip the Autobahn-style wakey-wakey corduroy at the edge of the roadway.
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
I included the limits so you'd know what your boundaries are based on what you could imagine as possible valid configurations.

Yes, I know that. And I GREATLY appreciate it, and all the rest of the information you included. Believe me, I'm not criticising you at all! I am criticising the ones at Sun who decided that good engineering was too expensive, or tiresome, or hard. And, a little bit, the FreeNAS folk for abandoning UFS rather than gluing the useful (e.g. scrubbing) features to it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I am criticising the ones at Sun who decided that good engineering was too expensive, or tiresome, or hard.
"Good engineering" always considers the intended use case. You criticize ZFS as not being suitable for the "Joe Sixpack" user, who doesn't read instructions, doesn't understand computers, but saw a 5-year-old YouTube video that says you can repurpose your castoff old computer into a NAS using FreeNAS. You may well be right that it's not very suitable for this user. But it's only poor engineering by Sun if that was their intended user--and it wasn't.

ZFS is a powerful, flexible filesystem. The only thing out there that comes anywhere close is BtrFS, and to the best of my knowledge, it just isn't there yet. But for its power and flexibility, it does have limitations. If they can ever get block pointer rewrite implemented, that would be fantastic--it would (or at least could) allow for adding devices to (and removing devices from) vdevs, changing RAID levels, removing vdevs from pools, adding or removing deduplication on existing data. But it isn't implemented, and it's questionable whether it will ever be implemented.
FreeNAS folk for abandoning UFS rather than gluing the useful (e.g. scrubbing) features to it.
What scrubbing feature of UFS would that be? Or are you suggesting that iXSystems should have coded one from scratch?
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I always find it fascinating when people are unhappy with something available to them at no cost. As if they have somehow been inconvenienced.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, I know that. And I GREATLY appreciate it, and all the rest of the information you included. Believe me, I'm not criticising you at all! I am criticising the ones at Sun who decided that good engineering was too expensive, or tiresome, or hard. And, a little bit, the FreeNAS folk for abandoning UFS rather than gluing the useful (e.g. scrubbing) features to it.

I didn't take your post as criticizing me. I took the stance that you don't want to kill the messenger, but the message sucks.

OpenZFS is doing some great things to make ZFS scale down better (to the point where it's a lot more palatable for home users). There were problems that plagued ZFS for a long time, and only an expert ZFS admin would know how to avoid them. A great example is the async destroy of datasets.

Say you had a 20TB dataset that you wanted to destroy. Until circa 2013, ZFS destroy commands were synchronous, so the safe approach was to delete all the contents first and *then* destroy the dataset. If you destroyed the 20TB dataset outright, all 20TB had to be freed in one single transaction. That sucked, because it could literally take the zpool out of commission while ZFS went looking for all the bits that needed to be cleared. In some cases it took mere hours, in other cases multiple days. Every ZFS admin at Sun knew about this potential problem and made customers *very* aware that they should "not do that".

I've personally seen customers in production destroy a 20TB zvol, after which the system stopped all activity while ZFS tried to do its cleanup, ultimately locking out all workloads. (Remember that workloads cannot write to the zpool while the destroy transaction is waiting to close.) So the customer, thinking he did something very wrong, rebooted the machine. The problem: rebooting just means that when ZFS goes to import the zpool, it *must* complete that transaction before the mount can finish. You might give it a few hours, then hit the reset button again. Unfortunately, you now have a new problem: every time you reboot, ZFS has to restart the destroy from scratch, so you did yourself no favors by interrupting it. The only solution was to wait it out, no matter how long, or simply give up on ever getting the data off of that zpool. I have personally seen people make production systems useless for 5 days because they did exactly this.

Now that Sun is gone and Oracle has closed their ZFS branch, the OpenZFS project created the feature flag "feature@async_destroy". Now ZFS destroys the dataset immediately, then clears the disk space in the background using available I/O (scheduled so it doesn't conflict with workload I/O). So if you destroyed a 20TB dataset, you wouldn't see the disk space become available immediately. With every transaction group (5 seconds by default on FreeNAS) you'd see a little more free space than the previous one. Workloads continue doing what they need to do and everyone is happy. I've seen 12TB datasets get destroyed and freed in about 6 hours.
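If you want to watch it happen on a FreeNAS box, something along these lines will show it (pool and dataset names are just examples):

Code:
zpool get feature@async_destroy tank   # should report enabled or active
zfs destroy -r tank/olddataset         # returns almost immediately
zpool get freeing tank                 # space still waiting to be reclaimed in the background
zpool list tank                        # free space creeps up with each transaction group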

You'll probably not hear about this problem (it was rarely discussed in these forums), but it absolutely existed, and a few people learned the hard way that ZFS expected you to be a pro at ZFS. That was Sun's expectation, but it gets harder to manage when every "user" must also be "an expert" to have a good experience.

There is lots more coming from OpenZFS in the future that will resolve more long-term issues. Big ones I'm aware of are L2ARC compression and the L2ARC not being discarded on a reboot. Of course everyone wants BPR (Block Pointer Rewrite), as that would allow you to defrag a zpool (awesomeness!). But with each passing year that's looking more and more like "not easy to implement" and "someone is gonna have to find large bags of money to get the required developer resources to make it all work". Someone (I forget who) even went so far as to say that if they had the required funding now, implementing it could very well be impossible because of technical factors.

The harsh reality is that Sun's ZFS was limited in scope relative to what it is today. We still have some of those issues, and they will exist for the foreseeable future (and possibly for the life of ZFS). Just as FAT32 is limited to 4GB files (and virtually anyone that uses Windows is aware of that limitation, among others), ZFS has its own quirks.

Sun's goal was to make a file system, from the ground up, that was supposed to be incorruptible, scalable to mind-boggling sizes, yet still perform well. With those kinds of things to consider, you'll have to make engineering decisions that will make someone, somewhere, unhappy. For Sun it was relatively easy to do certain things like "expect your ZFS admin to be an expert", because Sun wanted your support contract (and the money). It was just as easy for them to say "don't make a small zpool of multiple disks with no redundancy and expect the zpool to keep functioning", because that was not their target customer.

ZFS scales up very well (it was engineered to do so for the long-term) but doesn't scale down very well for home users. So either you 'upsell' your server to something that serves ZFS pretty well, or you are simply SOL. :(

I've never had a file server with 32GB of RAM before my current FreeNAS server. I'm planning to go bigger during the summer. The question is whether I'll start off at 128GB or 96GB.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Someone (I forget who) even went so far as to say that if they had the required funding now, implementing it could very well be impossible because of technical factors.
That's a very common point. I believe it had to do with a violation of the atomicity of transactions, which effectively meant the pool would have to be taken offline (with all that implies) for a BPR to take place. Offline BPR is less interesting, but might be a more realistic objective.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Personally, I'd be okay with an offline BPR. Of course, that's when you have to ask yourself "aren't I better off just making a new zpool and doing replication to restore from backup?"

Yes, many people have no backups. But implementing BPR for the sake of the few souls that don't have replicated ZFS backups is probably not going to be enough to open the money gates and make it happen.

So I feel like:

- Offline BPR is not likely, for financial reasons (if not technical ones).
- Online BPR is not likely, because of the atomicity of transactions (if not financial reasons).

End result = no BPR. :(
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
"Good engineering" always considers the intended use case. You criticize ZFS as not being suitable for the "Joe Sixpack" user, who doesn't read instructions, doesn't understand computers, but saw a 5-year-old YouTube video that says you can repurpose your castoff old computer into a NAS using FreeNAS. You may well be right that it's not very suitable for this user. But it's only poor engineering by Sun if that was their intended user--and it wasn't.

Sorry to disagree, but the basic intended use case --the one that's always true regardless of what other use case they thought they were developing for-- was use by humans. Humans who will sometimes be tired, or hungover, or distracted, and who are guaranteed by physical law to make mistakes.

That's broadly the reason why any car we buy to drive on the right-hand side of the road has the gas pedal on the right and the brake on the left, with the clutch pedal, if there is a clutch, on the far left, and why no automotive steering wheel behaves like a tiller: too many deaths would result from some cork-brained engineer deciding it would be more efficient or save $5 in unit costs to reverse the pedals or steering action.

It's also the reason why the operators at a nuke facility (I used to know which one but can't now remember) put beer-tap handles on the crucial switches: the nitwit who designed the control panel had made it look neat and clean, all the switches alike, which of course is exactly what's wanted in an emergency: the inability to immediately tell which switches to throw to scram the reactor.

What scrubbing feature of UFS would that be? Or are you suggesting that iXSystems should have coded one from scratch?

Sorry for the lack of clarity! I meant porting ZFS's scrubbing code.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
So what's your point? That ZFS could be better than it is? Hard to disagree with that. That the people who designed and built it made compromises? Of course they did.

Think of ZFS as alien technology that landed on Earth out of the blue. People figured out how it works, and came up with a set of guidelines that allow humans to get the best from it while minimizing risk. You can use it while taking suitable precautions and get something useful in return, or you can use something else that gives you warmer, fuzzier feelings. I get the distinct impression that ZFS doesn't give you warm fuzzies.

Being open source, the version of ZFS found in FreeBSD can be improved by anyone sufficiently skilled and motivated. Feel free to dismiss this as the typical response of any fan of open-source software.

I have no doubt that ZFS will improve over time, and that it will never be perfect or foolproof.
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
You'll probably not hear about this problem (it was rarely discussed in these forums), but it absolutely existed, and a few people learned the hard way that ZFS expected you to be a pro at ZFS. That was Sun's expectation, but it gets harder to manage when every "user" must also be "an expert" to have a good experience.

That's a beautiful example of a Grade 1 bug that should have stopped the release in its tracks. That it didn't is really not to Sun's credit. Thanks for describing it--I'd never heard of it.
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
I always find it fascinating when people are unhappy with something available to them at no cost. As if they have somehow been inconvenienced.
There are certainly individuals who believe that if it's free no one should complain. But fortunately there are other people --most of them engineers-- who believe that engineering should be done well regardless of the sticker price because it's their self-respect that's at stake.
 

MMacD

Explorer
Joined
Dec 12, 2014
Messages
55
Being open source, the version of ZFS found in FreeBSD can be improved by anyone sufficiently skilled and motivated. Feel free to dismiss this as the typical response of any fan of open-source software.
I'm a fan of socially-produced software, so no worries there. I'd work on ZFS myself, were I not already fully overcommitted to other projects.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
There are certainly individuals who believe that if it's free no one should complain. But fortunately there are other people --most of them engineers-- who believe that engineering should be done well regardless of the sticker price because it's their self-respect that's at stake.

Sure, anyone can complain. But it seems like a waste of bits unless there's a set of design trade-offs on the table to debate, doesn't it? Engineering obviously involves trade-offs and prioritization. Anything self-evident (that doesn't present an effective trade-off) should (and will) be done by any competent engineer.

The point about designing for human factors is a good one, and one that often gets deprioritized. (Which may be the point?)

For example, "Lose a vdev and your whole pool goes away?" Ok, what are the alternatives and what trade-offs do those alternative present?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
The title of this thread is "Multiple zpools vs multiple vdevs".

I see nothing wrong with debating the pros and cons of ZFS, or the engineering decisions made by its authors. However, here in the FreeNAS forums, everything is after the fact. Please take it to off-topic, or a different forum.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I see nothing wrong with debating the pros and cons of ZFS, or the engineering decisions made by its authors. However, here in the FreeNAS forums, everything is after the fact. Please take it to off-topic, or a different forum.
But the pros and cons of ZFS are inherently pros and cons of FreeNAS. FreeNAS may work to mitigate the cons, and it might do things to reduce some of the pros, but ZFS is the starting point for FreeNAS. Discussion of its merits (and demerits) is therefore, IMO, quite on-topic.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Sorry to disagree, but the basic intended use case --the one that's always true regardless of what other use case they thought they were developing for-- was use by humans. Humans who will sometimes be tired, or hungover, or distracted, and who are guaranteed by physical law to make mistakes.
Granted. But what does that mean in this case? Surely you can't mean that every product must always prevent its users from doing something stupid/destructive in every circumstance. Even if it were possible to do so, that would greatly limit the usefulness of the product. So "safety" (in any number of ways) is balanced with a number of other factors, including usability and budget.

Since you used the example of a car, let's look at that a little further. Cars are very effective at maiming and killing people--over 30,000 last year in the United States. Accidental deaths and homicides involving cars outweigh accidental deaths and homicides involving guns by about 3:1. But we accept that death toll, because cars are extraordinarily useful things. We don't say that a car is badly designed because a moment's inattention could prove fatal--that's inherent to the nature of a two-ton hunk of metal moving at a mile a minute. We can develop safety features that reduce the risks, but it isn't possible, with current technology, to eliminate them.

Because cars are dangerous, we don't let children drive them. You might let your small child drive one of these:
[Image: Little Tikes Cozy Coupe toy car]

But not a real car. It isn't nearly as useful, but it's also a whole lot less likely to kill its occupant or anyone else.

So, to some of your concerns with ZFS:
Lose a vdev and your whole pool goes away?
Yes. Data is dynamically striped across all vdevs in a pool, so if a vdev dies, your pool dies. It's as if you removed a disk from a RAID0 array. What alternative would you propose for integrating additional vdevs into a pool?
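That's exactly why each vdev should carry its own redundancy. A rough example (disk names are placeholders) is a pool of two RAIDZ2 vdevs, where each vdev can lose two disks without taking the pool with it:

Code:
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 \
                  raidz2 da6 da7 da8 da9 da10 da11
zpool status tank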

FreeNAS folk for abandoning UFS rather than gluing the useful (e.g. scrubbing) features to it.
You seriously think it would be a better solution for the FreeNAS devs to have written a new filesystem? Because that's what it would have taken to do this. There's no way this would have resulted in a filesystem that's safer than ZFS.
 