Huge zpool

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
Is there any consensus on a maximum size for a pool? I know the limit is technically 2^128 bytes, but I'm asking about practical, reliable storage.
I assume for the highest capacity (and reliability) one would choose raidz3.
How many disks per vdev is too much?
How many vdevs is too much? Striping vdevs is just like striping disks, multiplying the probability of a failure, so maybe it would be good to add an extra level of parity here at some point. Like a stripe of raidz vdevs, where each vdev is a raidz of disks. Is this possible in ZFS?

Suppose I want 10PB in a single volume, with decent bandwidth. Using 16TB disks I could make 42 vdevs, each with 18 disks in raidz3, for 240TB usable per vdev. Is this crazy?
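A quick back-of-envelope check of that layout (my own arithmetic, nothing official; raidz3 leaves 15 data disks per 18-disk vdev):
echo $(( 42 * 15 * 16 ))   # usable TB: 10080, roughly 10 PB
echo $(( 42 * 18 * 16 ))   # raw TB: 12096, across 42 x 18 = 756 disks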
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The Zettabyte File System (ZFS) is called that for a reason.

You can probably stripe up to whatever you can manage to power and cool in a single system... you just need to think about how you would back up that amount of information if it's all in one host.

a stripe of raidz vdevs, where each vdev is a raidz of disks. Is this possible in ZFS?
Yes that's exactly how it works.
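Listing several raidz vdevs in a single zpool create is all it takes; the pool stripes across them automatically. A minimal sketch (device names are placeholders, not from any real system):
zpool create tank raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl
zpool add tank raidz2 sdm sdn sdo sdp sdq sdr   # growing the pool later just adds another vdev to the stripe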

How many disks per vdev is too much?
The consensus around here is that 12 drives in a VDEV is already pushing the limit of what's sensible (though there's no hard technical limit to going wider)... the recommendation is 6-10 drives per RAIDZ VDEV.

I assume for highest capacity (and reliable) one would choose raidz3.
If your system is well managed and monitored, there's no reason RAIDZ2 isn't OK... with 16TB drives, you may be right that RAIDZ3 is a good idea for safety though... a resilver could take a long time on a 16TB drive.

Suppose I want 10TB in a single volume
I'm not sure you have your units correct here... you're talking about 756 x 16TB disks... that's way more than 10TB... did you mean 10PB? (756 x 16TB = 12 PB of raw storage... about 10 PB usable in that case)
 
Joined
Jul 3, 2015
Messages
926
I run lots of systems using the Supermicro 90-bay JBODs with 8/10/12TB HGST SAS drives. I make one zpool out of each JBOD using six vdevs of 15-disk RAIDZ3, which gives me between 400TB and 600TB usable depending on the drive size. Resilver times on my systems are about 24 hours at most; they are all roughly 50% full at the moment (about 250TB), so even on a full system (the 80% rule) I would expect no more than 48 hours to resilver. You need to consider carefully when to run your scrubs; I schedule them over the weekend to avoid disrupting users. Scrubs take about as long as resilvers.
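For illustration, one of those JBOD pools would be created roughly like this (device names invented, using bash brace expansion; in practice you would reference disks by gptid or label rather than raw device names):
zpool create tank \
  raidz3 /dev/da{0..14} \
  raidz3 /dev/da{15..29} \
  raidz3 /dev/da{30..44} \
  raidz3 /dev/da{45..59} \
  raidz3 /dev/da{60..74} \
  raidz3 /dev/da{75..89}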
 

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
@sretalla sorry, I meant to say 10PB. I've edited the post. I think the limit on disks per vdev really comes down to the probability of losing your remaining parity disks during a resilver, which of course goes up with larger disks (or, more precisely, with more data on each disk).

Yes that's exactly how it works.
I should have made my question clearer. I'm asking about adding an additional "level" of parity. I don't think this is a common practice, and I'm not sure it's possible. What I mean is a tank made of raidz "super-vdevs", where each super-vdev is made of raidz vdevs and each vdev is made of disks. I don't have enough disks available right now to demonstrate, but here is a simple example which fails:
zpool create tank raidz mirror sd{a..b} mirror sd{c..d} mirror sd{e..f} raidz mirror sd{g..h} mirror sd{i..j} mirror sd{k..l}

Nobody in their right mind would stripe 42 disks in a RAID 0. By the same logic, there is a risk in striping 42 vdevs together. The risk is smaller, since each vdev is far more reliable than a single disk, but the sheer number of vdevs still makes a failure more likely.
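To put a rough number on that intuition (the per-vdev failure probability here is invented purely for illustration): a stripe fails if any one vdev fails, so with 42 vdevs the small individual risks add up:
awk 'BEGIN { p = 0.001; printf "pool loss chance with 42 vdevs: %.1f%%\n", (1 - (1 - p)^42) * 100 }'   # ~4.1% if each vdev has a 0.1% chance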
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Nobody in their right mind would stripe 42 disks in a RAID 0
Correct, since a single drive failure would kill your pool.

If you think of a RAIDZ3 VDEV as just a single drive that's a bit less likely to fail, I think you're not giving enough consideration to the fault tolerance you introduce with RAIDZ3.

Yes, you can have really bad luck and see 4 drives in the same RAIDZ3 fail before you can replace them, but that's why you have monitoring, are prepared to jump into action when a disk fails, and ultimately have backups to allow for disaster recovery. Just building more RAID on top of RAID isn't a good answer from the point of view of either economy or performance (remember that a RAIDZ VDEV performs like a single drive for random I/O... if you now put those into a second layer of RAIDZ, you will have 756 disks performing at the speed of a single disk!!!).
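A rough way to see the performance point (assuming something like 200 random IOPS per spinning disk, which is only a ballpark figure):
echo $(( 42 * 200 ))   # flat stripe of 42 raidz3 vdevs: ~8400 random IOPS
echo $(( 1 * 200 ))    # the same 756 disks nested into one raidz of vdevs: ~200 IOPS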

To reduce the risk, use narrower RAIDZ3 VDEVs (6 or 7 disks) and many more VDEVs in total... more storage is lost to safety, but it's far less likely that multiple failures land in the same VDEV when they're spread across more VDEVs.
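To put numbers on the storage cost (same back-of-envelope style, 16TB drives assumed): a 6-wide RAIDZ3 vdev only yields 3 x 16TB = 48TB usable, so reaching roughly 10PB usable takes on the order of 210 vdevs and 1260 disks instead of 756:
echo $(( 10080 / (3 * 16) ))   # 6-wide raidz3 vdevs needed for ~10080TB usable: 210
echo $(( 210 * 6 ))            # total disks: 1260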

Depending on how your data is structured, also consider multiple pools to limit the damage if a pool is lost.
 

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
Good point about the IOPS.
Multiple pools are a consideration. That's why I asked how many vdevs and how many PB are generally considered "safe" in a single pool.
 

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
I have been reading about L2ARC and trying to decide how much is needed. Of course it will be best to test with real hardware and workloads, but for now I am just trying to plan on paper. Cyberjock's guidebook says you should usually max out system RAM before adding L2ARC, but I have a feeling this does not apply to large servers. A modern server like the Supermicro 6049 can hold up to 6TB of RAM, but it would be much cheaper to buy a few NVMe drives. How does one determine sizing for ARC and L2ARC?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The rule used to be 1GB of RAM for each 10GB of L2ARC... that's for active content, so people got really confused by it.
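As an example of what that heuristic implies (the 1:10 ratio is just the old rule of thumb; the real header overhead depends on record size and ZFS version): a 2TB NVMe L2ARC would want roughly 200GB of RAM set aside just to index it, which is part of why the usual advice is to max out RAM first. Attaching a cache device itself is a one-liner (pool and device names hypothetical):
echo $(( 2000 / 10 ))   # GB of RAM implied by the 1:10 rule for a 2TB L2ARC: 200
zpool add tank cache /dev/nvme0n1   # add an NVMe device as L2ARC to pool "tank"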
 