Large array thoughts?

Status
Not open for further replies.

alpaca

Dabbler
Joined
Jul 24, 2014
Messages
24
We are doing some preliminary searching/quoting for a larger storage system. We presently use a self-built 24-bay Supermicro FreeNAS box (dual X5650, 96 GB RAM, 10GbE) running 11 mirrored vdevs with redundant SLOGs. Performance is good (not great) over NFS. The big wins for me have been the stability, snapshotting, scrubbing, and the other core ZFS features. Even with the relatively older hardware, it has been great.

The challenge for us is storage space and scalability. I know the "Z" stands for "zettabyte", but my concern/question is about the number of drives. Using 4 TB drives in 4+2 RAID-Z2 vdevs, we're talking about 200 disks for a half-petabyte array; that is roughly the scale we're exploring. SPOF issues aside, does anyone have any insight into a large (200+ disk) array? Our debate is between a ZFS array and a Ceph cluster. A similar-capacity Ceph cluster would end up being 20+ separate servers (a big $$$ difference).
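For anyone who wants to sanity-check that arithmetic, here's a rough Python sketch using the numbers above (4 TB drives, 6-disk RAID-Z2 vdevs); it counts raw TB only and ignores TB/TiB conversion, metadata, and free-space headroom:

```python
# Back-of-the-envelope capacity math: 4 TB drives in 4+2 RAID-Z2 vdevs.
# Raw TB only -- no TB/TiB conversion, metadata overhead, or headroom.
DRIVE_TB = 4
DISKS_PER_VDEV = 6        # 4 data + 2 parity (RAID-Z2)
DATA_DISKS_PER_VDEV = 4

def usable_tb(total_disks: int) -> int:
    vdevs = total_disks // DISKS_PER_VDEV
    return vdevs * DATA_DISKS_PER_VDEV * DRIVE_TB

print(usable_tb(200))     # 33 vdevs -> 528 TB, roughly half a petabyte
```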

We would definitely get in touch with iXsystems before purchasing anything at this scale, which is still a ways off, but I'm curious whether anyone in the community has any thoughts.

Thanks
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Actually, there's a practical limit of about 10 to 12 drives per vdev for performance reasons, but I've never heard of a limit on the number of vdevs per pool. Maybe I'm wrong and there is such a limit, so wait for confirmation ;)
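Just as a rough illustration of the trade-off (assuming 4 TB drives, RAID-Z2, a 200-disk budget, and the usual rule of thumb that each RAID-Z vdev delivers about one disk's worth of random IOPS, which is a guideline rather than a measurement):

```python
# Wider RAID-Z2 vdevs squeeze more usable space out of the same 200 disks,
# but leave you with fewer vdevs (a crude proxy for random-I/O performance)
# and longer resilvers. Assumed numbers, not benchmarks.
DRIVE_TB, TOTAL_DISKS, PARITY = 4, 200, 2

for width in (6, 10, 12):                  # disks per RAID-Z2 vdev
    vdevs = TOTAL_DISKS // width
    usable = vdevs * (width - PARITY) * DRIVE_TB
    spare = TOTAL_DISKS - vdevs * width
    print(f"{width}-wide: {vdevs} vdevs, {usable} TB usable, {spare} disks left over")
```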
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There are caveats with insanely large systems, but the filesystem itself is designed to scale to insane sizes by adding vdevs.

More vdevs == more performance, in very rough terms.

Your main problem would be RAM. We're definitely talking Xeon E7 amounts of RAM. I have no idea how FreeNAS behaves with 2 TB of RAM for 500 TB of disk, but I have the feeling that anything short of the most RAM you can throw into a server won't do.
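As a very rough yardstick, the rule of thumb that usually gets quoted around the forum (a small base plus about 1 GB of RAM per TB of pool) already lands in the hundreds of gigabytes for 500 TB. Treat it as a floor rather than a target; the base and ratio below are the commonly cited guideline, not anything specific to this build:

```python
# Forum rule of thumb: ~8 GB base plus ~1 GB of RAM per TB of pool.
# A starting point only; heavy NFS/VM workloads, L2ARC headroom, or dedup
# push the real requirement well beyond this.
def rule_of_thumb_ram_gb(pool_tb: float, base_gb: float = 8) -> float:
    return base_gb + pool_tb * 1.0

print(rule_of_thumb_ram_gb(500))   # ~508 GB for a 500 TB pool
```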
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Are there things specific to FreeNAS that would come into play at this scale, or does this fall under the general realm of ZFS?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A few things, since I have worked with this kind of scale:

1. You should definitely follow up with iXsystems like you want to. With large-scale servers, things that are often small and obscure performance "speed bumps" become large-scale performance nightmares. It's nice to be able to call someone and say "hey, I'm paying to make this your problem... fix it!" It's also nice to be able to buy from a company like iXsystems, do a remote session with them, have them tell you what the best practices are for your precise scenario, and then have them set up the zvols, datasets, zpool, and services accordingly. It sucks when you make some mistake you aren't aware of and the solution is to destroy the zpool and rebuild. ;)
2. Yes, expansion is really as simple as it sounds on the surface. Make sure you have lots of RAM; if you want to do 500 TB, I'd recommend no less than 384 GB of RAM, with 512 GB being even better. L2ARC and SLOG devices may or may not be required. However, that's just the surface. As things scale up, other things become much more important depending on how you use the server. It's impossible to discuss every aspect of this since there are so many ways to use a server. The short answer is that you should either be ready to deal with this yourself, or see #1 (pay someone else to know this stuff so you don't have to).
3. Definitely make sure you have lots of redundancy. Even if you have a backup, corrupting a 500 TB pool means a VERY, VERY long restore process. So definitely don't try to do 10-disk RAID-Z1 vdevs and then tell yourself all will be okay. With 500 TB worth of disks, it won't be okay, and having to destroy the zpool because 2 disks out of 200 failed *really* sucks. I've seen things like that happen, too. The people who actually have their data on it are never happy, and recovery (assuming a backup exists) is always, always extremely time-consuming.
4. (This may be the most important regardless of what you do...) Don't neglect your file server. When you are that big, disks are going to fail regularly, so have spares on-site to facilitate faster resilvering. The server doesn't need to be babied if you did your job well, but don't expect it to sit in a corner and never have problems. 200 disks means you should expect roughly a disk a week to fail; at large scales this is normal (there's a rough back-of-the-envelope sketch after this list). It makes me cringe to think that I might be replacing a disk every week or two in my server, and I have just 10 disks; 200 disks is a whole different beast. Also realize that with 200 disks and all the extra hardware, downtime is more likely. More metal means more crap that can (and probably will) fail: RAM goes bad, CPUs go bad, etc. The question is "how will you deal with an outage that may last 2 or 3 business days?"
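To put the "disk a week" point in perspective, here's a back-of-the-envelope sketch; the annualized failure rate (AFR) values below are assumptions, so plug in whatever number you trust for your drives:

```python
# Expected drive replacements for a 200-disk fleet at a few assumed AFRs.
# AFR values are placeholders, not measurements from any particular drive line.
DISKS = 200

for afr in (0.02, 0.05, 0.10):             # assumed annualized failure rate
    per_year = DISKS * afr
    print(f"AFR {afr:.0%}: ~{per_year:.0f} failures/year, "
          f"about one every {365 / per_year:.0f} days")
```

Whatever AFR you assume, at this scale drive replacement stops being an event and becomes routine maintenance, which is why the on-site spares matter.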

It is sometimes better (and cheaper) to have two or three smaller systems than one super-mega-have-the-whole-enterprise-on-one-file-server system. So don't let your desire to be the largest kid on the block cloud your judgement. I'd love to be the biggest kid on the block, but I wouldn't risk my data and empty my pocketbook just to claim I did it. ;)
 

alpaca

Dabbler
Joined
Jul 24, 2014
Messages
24
Thanks a bunch for the thoughts, and I look forward to any more still to come! It's a big (no pun intended) challenge. The idea of a single giant pool and management plane has a lot of appeal, but it may be unrealistic, maybe even the wrong way to go.
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Out of curiosity: is the large pool just a management-perspective preference, or an actual requirement at the data level? I have no experience with arrays this large, but I did do some tests for my specific usage requirements and quickly found it was beneficial for me to split my different types of data over several arrays/pools. There's no need for heavy-duty hardware for the archive, though I do want good redundancy there (WD Greens in RAID-Z3), while for VMs I want performance (15k disks, striped mirrors, SSD SLOG, etc.). This saved us a bunch of cash and headaches. For us, there wasn't a good enough reason to go all out and have the entire array perform top-notch while also having loads of storage and redundancy.

Sent from my HTC One_M8 using Tapatalk
 