RAIDZ expansion, it's happening ... someday!

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Now that 4TB drives are the standard, with prices scaling up to 8TB and 20TB monsters available, I don't care anymore; I'm actually scared of a feature like this, just as I don't feel that comfortable going beyond 8TB drives. Maybe 12TB with RAID-Z3 (Z2 is still feasible with the SAS 10^16 URE rate)? Just maybe. I would bet the time to complete a rebuild is starting to be insane even with just 4TB drives?
We are using 6, 10 and 12 TB drives at work, in different servers, and having no problems. With good hardware and proper vdev sizing, rebuild times are less than 24 hours.
 

djjaeger82

Dabbler
Joined
Sep 12, 2019
Messages
16
We are using 6, 10 and 12 TB drives at work, in different servers, and having no problems. With good hardware and proper vdev sizing, rebuild times are less than 24 hours.

I'm assuming you're talking about regular rebuilds, or has there been progress on the expansion feature as of late? (crosses fingers... )
 

zizzithefox

Dabbler
Joined
Dec 18, 2017
Messages
41
We are using 6, 10 and 12 TB drives at work, in different servers, and having no problems. With good hardware and proper vdev sizing, rebuild times are less than 24 hours.

I think you misunderstood the entire topic: we're not talking about resilvering, but about the experimental vdev expansion feature, which I hope you are not using. Still, if you are doing RAID-Z2 with vdevs wider than 6x12TB and consumer drives (like 10^14 URE rated drives), you are a dangerous man.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I'm assuming you're talking about regular rebuilds, or has there been progress on the expansion feature as of late? (crosses fingers... )
Regular rebuilds. We are well behind the latest software build ... Stability is the goal. I have over 250TB of data in just one of the systems I manage for work. I don't want to risk that (and my job) to get a new feature.

I think you misunderstood the entire topic
Sorry, I did miss part of the conversation.
Still, if you are doing RAID-Z2 with vdevs wider than 6x12TB and consumer drives (like 10^14 URE rated drives), you are a dangerous man.
The drives at work are the expensive datacenter models from Seagate or Western Digital. I configure the RAIDz2 vdevs with either 6 or 8 drives depending on the hardware.
I use different hardware at work than my home system (in my signature) because the organization I work for has access to funds that I don't, so I cheap out on drives at home. Still, I went over a year without having to change a drive at home and my home NAS rebuilds a drive in about six hours, so I don't feel like I am being very dangerous. It is all about selecting the right drive, which can be tricky...
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Matt has time "later in the week" to take a look.
I think you understand now why I am always very vocal: "next month" or "somewhere next week" from the ZFS maintainers is actually closer to "we hope to get to it this year".

Anyhow, on topic:
I think this feature is delayed due to draid being almost done.
In theory draid is going to be the successor of raidz in a lot of cases, so I think they are not going to actually implement expansion for raidz for now and will just focus on draid.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
DRAID was stuck in "nobody actually wants to push this"-land for a few years, but someone picked it up again, now. I seem to recall it was in a state of "it's mostly ready, but stuck in this old branch and won't apply cleanly to the master branch" when it fizzled out a few years ago.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,972
@Tigersharke thanks for posting. It is a very interesting topic and I was following a lot of what was being said, but it would have been better to be there and actually talk to the developers to get a better understanding of how it all works. Broad terms are great for an intro, but I thrive on the details. DRAID will be very nice, making a huge impact on resilvering speeds, if/when it comes to town.
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
DRAID was stuck in "nobody actually wants to push this"-land for a few years, but someone picked it up again, now. I seem to recall it was in a state of "it's mostly ready, but stuck in this old branch and won't apply cleanly to the master branch" when it fizzled out a few years ago.
It almost made it into OpenZFS 2.0. Behlendorf (one of the maintainers) took it over some months back and made it his personal pet project :)

The reason it didn't make it was mostly that review took a lot of time (due to the scope of it), plus getting it back into running order after every round of review (which, awkwardly enough, is often the most time-consuming part, rather than processing the review feedback itself).

It has been rebased on master multiple times in the past 4 months and is quite up to date. It does build (most of the time) and mostly passes the tests (once a round of review has been implemented, it's made to pass the tests again) ^^

*edit*
Click:
 

zizzithefox

Dabbler
Joined
Dec 18, 2017
Messages
41
The drives at work are the expensive datacenter models from Seagate or Western Digital.

Good for you. Recently I am a little sensitive, since I just dodged the WD Red SMR affair by sheer chance (many friends have been hit) and have been wandering YouTube in its aftermath: it's full of people suggesting redeploying old desktops as FreeNAS servers (man, those systems without ECC memory are scary). Then I conducted a job interview for a friend regarding an IT position, and when I casually brought up URE rates in RAID systems and laid down the simple numbers... people simply didn't know what I was talking about.
The best answer was: well, at Backblaze they deploy barracuda drives so...
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
The best answer was: well, at Backblaze they deploy barracuda drives so...
Desktop drives are fine; yes, the URE rates are higher, but you need to remember that UREs are not spread linearly over time.
Handling a URE is what RAID is there to counter. Backblaze's position is that they can buy many more drives (and thus more redundancy) using desktop drives, because they are cheaper.

Let's say:
You can buy 1 enterprise drive for 100 bucks,
or
you can buy 2 consumer drives for 100 bucks with URE rates that are 25% worse (so 75% of the enterprise quality when it comes to URE rates) and put those in a mirror.

Which option effectively has the lower URE rate? The two consumer drives!
For simplicity's sake I'll ignore economies of scale in this example, but with scale also come some benefits when it comes to probabilities.
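A rough back-of-the-envelope version of that comparison, using made-up per-bit error rates rather than datasheet numbers (the point is the ratio, not the absolute values):

# Toy model: chance that a given bit read is unrecoverable on one drive.
awk 'BEGIN {
  p_ent = 1.0e-15          # hypothetical enterprise URE rate
  p_con = p_ent / 0.75     # the "25% worse" consumer rate from the example above
  # a mirror only loses a sector if BOTH copies are unreadable in the same spot
  printf "single enterprise drive: %.1e\n", p_ent
  printf "mirrored consumer pair : %.1e\n", p_con * p_con
}'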

I don't fear running without ECC; the chance of my non-ECC clients screwing up data is always there anyway, and the chance of data loss in a way ECC would have covered is many, MANY times lower than the chance of my whole pool failing.

TLDR:
Backblaze uses consumer drives because of economies of scale and because failures are not spread linearly over drive lifetime.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
There is that, and there was also necessity. Backblaze jumped into its consumer drive business because they had to shuck truckloads of external drives during the Thai flood aftermath, when datacenter-grade B2B drive costs reached astronomical levels and/or the drives were simply not available at all.

Between their very efficient parity system, low data center air temperatures, and a small army of techs whose sole job it is to wander the halls and replace dead drives, Backblaze is in a much better position than most companies with data center assets to deal with less-than-optimal-life drives. (No appointments needed, no dispatch, etc)

It’s the same reason I have multiple burned-in, cold spares in case one of my NAS drives goes belly-up. Yeah, my used helium drives likely do not have the same remaining life expectancy as new ones, but I was willing to reap the benefit of paying 1/2 for used data-center grade HGST drives vs new ones. Plus, it allowed me the benefit of buying models from the same families as the ones that Backblaze published stats for.

Given the utter amount of BS coming out of WD, what faith should end consumers place in hard drive life statements, URE figures, etc. from OEMs? After all, there is nothing that prevents an OEM from developing one drive family with one set of likely failure rates and then selling alleged failure rates along with bundled warranties at different price points. There is no harm in under-promising and over-delivering on actual performance vs. what the consumer paid for.

I have yet to see real documentation from HDD OEMs on differences in the spindles, bearings, motors, or platters used in "Data Center" vs. consumer drives. Yes, the PCB and the firmware may be different, but the actual selection process may be as limited as binning bare drives with more vs. fewer initial defects, marrying the appropriate PCB to each drive, and shoveling them into their various sales channels.
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
Back to the original topic.

One thing people overlook about ZFS RAID-Zx is that it works per vDev. Many hardware RAID solutions come with a fixed LUN -> RAID-5/6 set relationship, meaning you generally can't have 2 x RAID-6 sets exposed as 1 single LUN. So it was easier for those systems to take the concept of a 5-disk RAID-6 and make it a 6-disk RAID-6. ZFS was not originally designed to do that. Instead, you add a new vDev or change out each disk for a larger one.
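As a sketch of the "add a new vDev" route (the pool name and device names here are just placeholders):

# grow the pool by adding a second 6-wide RAID-Z2 vdev alongside the first;
# new writes are then spread across both vdevs
zpool add tank raidz2 da6 da7 da8 da9 da10 da11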

(To be clear, some non-enterprise hardware RAID controllers allow having multiple LUNs of different sizes exposed from a single RAID-5/6 set. I think Adaptec has a chip / card that can do that.)
 
Joined
Jan 1, 2021
Messages
1
Hi,

Expansion of raidz2/3 is an essential feature. Any idea when this will come to Linux? Here is what is happening to me right now:

I created a 5 disk raidz3 array with USB3.0 disks of 4TB each. This means 2 disks can fail and I would still have 3 copies of my data. ZFS shows me that I only have 4TB of usable space. After a few days of copying all my data from my old defunct backup machines I managed to fill up the 4TB.

So I went and bought another 5 disks of 4TB each.

But, now I discover raidz3 cannot be expanded - huh!?. So now I have 10x4TB of physical disk space and only 4TB of contiguous usable space.

Raidz3 has achieved a 10:1 reduction in usable disk space!! (Imagine such futuristic technology)

So now I am planning to copy the 4TB onto one disk, and create a 9-disk raidz3 array which should hopefully give me 12TB of space with 3 copies of all blocks.

Then I will copy the 4TB back onto raidz3.

(BTW the reason I want raidz3 is because I don't intend to ever monitor this setup. It's an install-it-and-leave-it scenario.)

So yeah, basically the whole point of ZFS is that it was supposed to obviate having to do weird things like copy-all-your-data-"somewhere"-and-then-reformat-all-your-drives (caydsatrayd).

ZFS was invented to avoid caydsatrayd, and now caydsatrayd is exactly what I am spending my whole day doing.

Is there something I am missing??
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
I suggest posting a picture of your pool layout from the GUI. Something sounds really off.

Your pool should now have 10 drives, arranged as two 5-disk Z3 VDEVs. Let's confirm that.

If the above is correct, I reckon the available pool capacity should be something like 16TB, of which you shouldn't use more than about 12.5TB.
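Roughly, as a back-of-the-envelope sketch (before ZFS metadata and overhead; the names and numbers below are just for illustration):

# usable raidz space ≈ (width - parity) * drive size, summed over vdevs
VDEVS=2; WIDTH=5; PARITY=3; SIZE_TB=4
echo "raw usable: $(( VDEVS * (WIDTH - PARITY) * SIZE_TB )) TB"   # 2 * 2 * 4 = 16 TB
echo "practical : keep usage below ~80%, so roughly 12-13 TB"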
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464
Any idea when this will come to Linux?
Was there something that made you think this was a Linux forum, so that you would join to post this question?
now I discover raidz3 cannot be expanded
So you didn't do any reading about ZFS before setting up your pool and committing data to it. But now you feel qualified to tell everyone what ZFS was designed to do. I don't see this ending well.
It's an install-it-and-leave-it scenario.
This either.
posting a picture of your pool layout from the GUI
That would assume s/he's using FreeNAS, which is contradicted by the reference to Linux.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
I created a 5 disk raidz3 array with USB3.0 disks of 4TB each
You're not setting yourself up for trouble-free sailing if you don't plan to monitor it and are using USB3 disks.

You really need to use SATA ports (probably as provided from a SAS HBA with breakout cables) in order to have stability and reliability.

You can follow the OpenZFS project on Github to find out about the schedule (which isn't set as yet) for that feature.

You haven't given us visibility on the hardware you're planning to use other than the USB3 disks, so it is hard to give advice beyond saying that those are a bad option.

Some of us may be OK with giving you some recommendations if you want to share the rest of your plans (since this is already in the off-topic section)

You can share the output of zpool status -v and/or zpool list -v if you want to help us to see your current pool layout and status.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Naw, TrueNAS SCALE is Linux, so that makes total sense to me.

Expansion is coming Soon(tm), which means "freaking eventually". I've seen activity in H2 2020, so maybe a beta build in 2021 if we're really lucky (it's in Alpha now) and production in 2022 / 2023.

In the meantime:

- If you got your pool layout wrong the first time, don't feel bad; it happens to a lot of folks. Happened to me. Just do what you are doing: copy the data off, redo your pool "properly", then copy it back on again. "Properly" depends on what the use case is. For bulk storage over GBit, where performance is not needed, 8-wide raidz2 or 10-wide raidz3 (or variations thereof) can absolutely be proper
- Consider future expansion needs. The wider your vdev, the more expensive eventual expansion by replacement of disks. Example:
I have an 8-wide raidz2 with 8TB disks, because I started 5-wide 8TB. This will likely last me for a decade+. Potentially forever. If I did want to expand and replace with N-TB disks, I'd have to buy 8 of them.
Were I to build today, I'd do 6-wide raidz2 with 12TB disks instead. This would last me for the same time, but now if I expand by replacement, I only need to buy 6 disks.
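The rough math on those two layouts (ignoring ZFS overhead):

# usable capacity ≈ (width - parity) * disk size
echo "8-wide raidz2 of 8 TB disks : $(( (8 - 2) * 8 )) TB usable, 8 disks to buy per refresh"
echo "6-wide raidz2 of 12 TB disks: $(( (6 - 2) * 12 )) TB usable, 6 disks to buy per refresh"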

You can of course always expand by adding another vdev, so if I, say, have a 6-wide raidz2, I add another 6-wide raidz2. But that comes with noise / power / heat / space concerns. It does give you twice the IOPS of a single vdev, but given that I don't care about IOPS - video files over Gbit is my use case - I am better off with a smallish build in a Node 804 with 8 disks and a single vdev. The one use case I have that does care about IOPS - I am running some medium-sized databases of around 400 GiB - I put on its own pool with a single mirror vdev of SSDs.

So, what did you miss?
- Pool / vdev layout in ZFS is fixed and needs to be planned first. Mistakes are natural when you're new to the tech; take what you learned and redo the layout, this time with an eye to use case and future expansion
- Any vdev can have all of its disks replaced one by one, with a resilver after every replacement, and when you are done, you have the new capacity of the larger disks (rough command sketch below)
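The replace-in-place route looks roughly like this (pool and disk names are placeholders; wait for each resilver to finish before touching the next disk):

zpool set autoexpand=on tank   # let the vdev grow once every member disk is larger
zpool replace tank da0 da8     # swap one disk for a bigger one...
zpool status tank              # ...and watch the resilver until it completes
# repeat the replace/wait cycle for the remaining disks; the extra capacity
# shows up after the last disk in the vdev has been replaced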

"Expected" use of ZFS is to:
- Build a pool that has the number of vdevs / type of disks / type of vdevs you need for your use case. File bulk storage, HDD and raidz2/3; block storage or 10GBit throughput, mirrors, SSD, more vdevs, as appropriate
- The general pool layout is fixed. If you had planned to add more vdevs, that's chill: you planned for it, you know where the physical media goes, and you know what conditions trigger adding a vdev
- Given that you built for your use case, simple capacity expansion is done by replacing all disks in a vdev, if you are already at the maximum number of vdevs you wanted in this pool
- Plan for no more than 50% use in block storage (25 or even 10 is better) and no more than 75% use in file storage. Assuming HDD. SSD is a bit more forgiving because IOPS
- Yada yada L2ARC and SLOG - very specific ideas for very specific use cases. L2ARC can help with "lots of small files" aka "oh God the metadata", and SLOG helps with sync write on block storage. Both have more wrinkles than can be discussed here, but the forum has great resources on when these are appropriate and how to design. TL;DR: If you don't know whether you need them, then you don't.

Edit: One more note on "I have an 8-wide raidz2 with 8TB disks, because I started 5-wide 8TB. This will likely last me for a decade+. Potentially forever." Given that HDD life is finite, even He HDD like mine, this might seem like a silly statement. With the sheer amount of space I have available and the yearly expansion I know I have, I am looking at a decade, potentially two to three. But my disks won't last that long. It's utterly reasonable to look at pricing and $/TB when a disk fails. In my case, ~150-200 USD per disk is reasonable. By the time these fail, maybe that gets me 20TB. I'd likely replace with something at the upper range of affordable, with a 2x minimum and going for 3x-ish, and then stick with that capacity point for all subsequent replacements, the price going down all the while. By the time the last disk has failed, I then have that 2x to 3x capacity. Not necessarily because I need it, but because "why not". The answer to "why not" is "good man, you are insane, what about resilver times on disks that large?"
Therefore, it's also entirely possible I'll do another switcheroo instead, and use the moment when a disk fails as an excuse to move from raidz to draid. This would vastly improve rebuild times and set me up for comfortable replacement as more disks fail. That'll be another 3-week action of burn-in, copying data off, redoing pool, and copying data back on. I've done it once, I can do it again.

Which is to say: Whatever you do, have a plan.
 
Last edited:

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
You're not setting yourself up for trouble-free sailing if you don't plan to monitor it and are using USB3 disks.

Oh yeah, excellent point. Stop doing that ASAP. There are so many better options: SATA onboard if you can stick everything into the main enclosure; an HBA with SAS to a SAS expander to SATA if you need to go to an external enclosure. Don't do USB. You're asking for trouble.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
So now I am planning to copy the 4TB onto one disk, and create a 9-disk raidz3 array which should hopefully give me 12TB of space with 3 copies of all blocks.

Ars Technica has a really good ZFS primer; I recommend you read it. 3x parity is not 3 copies of all blocks. You are describing a 3-wide mirror, which raidz3 is not.
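To put rough numbers on it (back-of-the-envelope, ignoring ZFS overhead):

# raidz3 stores parity, not extra copies: usable ≈ (width - 3) * disk size
echo "9-wide raidz3 of 4 TB disks: $(( (9 - 3) * 4 )) TB usable, survives any 3 disk failures"
# a 3-way mirror really does keep 3 copies, at a much steeper cost
echo "3-way mirror of 4 TB disks : 4 TB usable"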

BTW the reason I want raidz3 is because I don't intend to ever monitor this setup. It's an install-it-and-leave-it scenario.

A minimal amount of monitoring is to:
- Use TrueNAS SCALE (Linux)
- Configure an email account on there for alerts
- And just let it run. Snapshots if you want the ability to roll back on accidental deletion, but that's a nice-to-have

Why do you need alerts? Well, drives will fail. So your redundancy of 3 drops to 2, then 1, then none - and with the next failure all your data is gone. You can always restore from backup, but what a pain. Better to just replace a failed drive when it fails. TrueNAS SCALE makes this trivial. Just point it at your email address and let it do the rest.
 