Backblaze Storage Pod 4.0

jgreco · Mar 20, 2014

This looks like all sorts of awesome

http://blog.backblaze.com/2014/03/19/backblaze-storage-pod-4/

No more SATA port multipliers. I don't know if their selected cards are compatible with FreeNAS, but it'd be a hell of a lot easier to find alternatives like SAS expanders for this design!

cyberjock · Mar 20, 2014

You are slow.. we discussed this yesterday or the day before in IRC.

jgreco · Mar 20, 2014

Someday you'll be slow and old too.

HoneyBadger · Mar 20, 2014

Only issue with using a Backblaze pod as a FreeNAS base point (or any other OS for that matter) is that Backblaze generally considers the entire unit a disposable chunk. Multiple copies of the data are stored on multiple pods, so ultimate resiliency of any single one isn't given as high priority - they can lose a whole pod (45 drives) and not care because they just pull data from another one.

That said the Rocket 750 does say "FreeBSD" under supported OS, so ... hmm ...

jgreco · Mar 20, 2014

The Backblaze storage paradigm wouldn't be generally useful to most people.

The Rocket 750 is also potentially a problem, it is simply a controller that integrates ten port multipliers on a controller (to get 40 ports). I wasn't proposing to use it. From my point of view, you could strip out the controllers and mainboard. The chassis, power supply, and cabling are the interesting bits, and you can populate it with different options pretty easily.

diehard · Mar 21, 2014

That much storage/money without redundant psu's just seems kinda silly to me.

ser_rhaegar · Mar 21, 2014

The PODs themselves have redundancy. If one full pod goes down, there is another in their datacenter that has the data.

jgreco · Mar 21, 2014

ser_rhaegar said:
The PODs themselves have redundancy. If one full pod goes down, there is another in their datacenter that has the data.

Only in an exceedingly poor design.

There shouldn't be "another," there should be "all the others."

We called this design a "Redundant Array of Inexpensive Servers" ... :

Consider a massive Usenet provider around the year 2000 with terabytes of disks. So the two choices are to try to figure out how to attach lots of storage to a single massive server or figure out how to do distributed storage.

The goal is to keep a server as cheap as possible. As EIDE grew old and SATA became available, the largest cheapest per byte hard drives available were SATA, not SCSI.

So let's say you had a design where you had two sets of spools, server bank 0, letters a-f, and server bank 1, letters a-f, each with 16 drives.

The naive solution is to "mirror" them so that the content on server0a is the same as what's on server0b. This seems simple and is easy to set up ... right until server0a fails and the load on server1a DOUBLES.

So instead what you do is you use a different allocation function so that the data on server0a ends up 1/6th on server1a, 1/6th on server 1b, 1c, 1d, 1e, and 1f.

So now if server0a fails, the load only increases by maybe 15-20% on the servers in bank 1. Further, when you need to rebuild server0's disk, the data is evenly distributed in bank 1 so you aren't pounding them to do the rebuild.

Further, since it is more likely that a single disk would die than an entire server, you also apply different distribution functions to the individual disks, so that you can always reliably know where all the data in your spools ought to be, but if you look at the destination servers and disks in bank 0 and bank 1, they're always on different disks and servers so that if you need to rebuild a disk or server, the load is evenly spread out.

This discussion trivializes some of the fine points, but was an enabler for massive growth in the Usenet market, which went from an average of 7 days retention back in ~2000 to more than two thousand days retention at some providers today.

c32767a · Jun 4, 2014

jgreco said:
This looks like all sorts of awesome

http://blog.backblaze.com/2014/03/19/backblaze-storage-pod-4/

No more SATA port multipliers. I don't know if their selected cards are compatible with FreeNAS, but it'd be a hell of a lot easier to find alternatives like SAS expanders for this design!

I feel bad Necro-ing this thread, but I have one of these sitting on my bench right now. It's actually a pretty dang nice platform for putting 45 drives in a rack.. :)

I had enough LSI cards around to avoid having to deal with the rocket 750. That card looked scary to chance, given the price.

Other thoughts after playing with it a bit:

The PSU is an uncommon model from Zippy. If you opt for the single-PSU version, you would want to have a spare on the shelf, as I'm not sure where you could buy one, other than 45 drives.
Bring your own 120mm fans.. 6 in a push/pull config. With 30 drives in my lab (e.g. not a datacenter) the drives sit around 30C during a scrub.

I don't want to get into a philosophical debate, but I've been researching the "no more than xx drives in RAID-Zn" thing with this box..
I don't see any documentation that discusses the technical justification of the limitation. I've been playing with 15 drive raidZ3 vdevs and have found performance and behavior to be good. remove/replace a disk leads to a long resilver time, but it's not out of line with our Netapps and other platforms with similar sized RAID sets.
What failure mode should I be expecting? How do I trigger the scary demons that lurk behind an oversized vdev?

cyberjock · Jun 4, 2014

The issues mostly revolve around using non-standard stripe sizes leading to massive amounts of lost space, enormous performance penalties with small writes, penalties with handling resilvering and other disk tasks simultaneously, and increased risk of losing your pool due to multiple failed drives.

I'm running an 18 disk single vdev RAIDZ3. I chose to do it because I wanted to see just how much 'suckage' there was. There is enough that I'd never ever try to do this again except as a backup server and even the only as a backup server for a single server. Trying to do multiple backups at the same time could really be a boner.

jgreco · Jun 4, 2014

c32767a said:
I've been playing with 15 drive raidZ3 vdevs and have found performance and behavior to be good. remove/replace a disk leads to a long resilver time, but it's not out of line with our Netapps and other platforms with similar sized RAID sets.

If it's as slow as a Netcrapp, that's bad.

But seriously, try it as it'd happen in reality. Put and keep a heavy load on the system. Fail a disk and replace it. Start a rebuild. Then fail a second disk. Watch the rebuild times. The difference between "survivable" and "pleasant."

c32767a · Jun 4, 2014

jgreco said:
If it's as slow as a Netcrapp, that's bad. :)

But seriously, try it as it'd happen in reality. Put and keep a heavy load on the system. Fail a disk and replace it. Start a rebuild. Then fail a second disk. Watch the rebuild times. The difference between "survivable" and "pleasant."

Zing.. Netapp was doing dedupe, replication and snapshots on solid hardware almost 10 years ago. We've had nothing but solid, mid level performance from our FASs over the last 12 years. We never lost a single byte or had any unplanned downtime on any of our filers over the last 12 or so years. It's unfortunate that things have gone downhill there in the last few years.

Resilvering 1 disk didn't seem all that slow, though there wasn't much else going on. For the moment I'm amusing myself with bonnie++ runs. I'll get the filesystem filled up again and rip 2 drives out and see what happens. I am perhaps lucky in that I can reduce my active workload while I'm replacing a disk, so the system doesn't have to churn on tons of production IO while it's doing a resilver.

cyberjock · Jun 4, 2014

For me the real bummer is the loss of all the disk space. In some iterations its more than 15%! Because of how ZFS organizes itself it is possible to add more disks(notice I didn't say add more redundancy) and actually have less usable space as a result. Talk about being stupid with your money!

c32767a · Jun 11, 2014

jgreco said:
If it's as slow as a Netcrapp, that's bad. :)

But seriously, try it as it'd happen in reality. Put and keep a heavy load on the system. Fail a disk and replace it. Start a rebuild. Then fail a second disk. Watch the rebuild times. The difference between "survivable" and "pleasant."

So I played with this.. built a 9 disk -z3 pool with WD reds and failed a disk.. took about 12h to rebuild.. Took a 15 disk z3 pool with WD reds and failed a disk. took about 14h to rebuild.
From everything I can see, scrub and resilver gain I/O performance as additional spindles are added.. Overall performance seems to scale as well, though they're still slow SATA drives, so random io is still painful to watch.

I did a bunch of Bonnie++ runs. (80G file, 32g RAM) The numbers from a 7, 9 and 15 disk Raid-Z3 vddv all seemed to average out to what you'd expect to see.

write rewrite read

7 154203.7 100723.9 427224.7

9 166967.7 86567.8 269490.3

15 346852.7 205818.6 620628.3

cyberjock · Jun 11, 2014

Nothing you provided(excluding the numbers) really surprises me at all. Pretty much all expected. More disks means marginally more rebuild time. The killer is the wide-ness of the pool causing I/O performance bottlenecks and the lost disk space due to the extreme width.

Not sure what testing parameters you used(the command line used should always be included in bonnie tests because they matter... bigtime) but if you didn't tell bonnie to use random data instead of zeros your numbers are going to be messed up if your pool/dataset uses compression. I don't have my numbers handy, but those numbers look MUCH bigger than what my numbers were(we're talking an extra digit IIRC).

Don't get me wrong, not trying to tell you that you are stupid for doing what you are doing, just trying to get you in the habit of good practices. ;)

c32767a · Jun 11, 2014

Fair enough. Here's the command line from all 3 filesystems:

Code:

bonnie++ -u user -r 32767 -s 81920 -d /mnt/vol1/test -f -b -n 1

7 drive:

Code:

nas1.,80G,,,142816,20,97844,18,,,443393,28,66.0,0,1,60,0,+++++,+++,70,0,68,0,+++++,+++,66,
nas1.,80G,,,142131,20,101640,18,,,379975,24,61.5,0,1,64,0,+++++,+++,65,0,69,0,+++++,+++,69,0
nas1.,80G,,,148942,20,104105,19,,,438962,27,64.5,0,1,65,0,+++++,+++,68,0,69,0,+++++,+++,68,0
nas1.,80G,,,149140,20,99812,18,,,391043,24,65.4,0,1,62,0,+++++,+++,67,0,66,0,+++++,+++,68,0
nas1.,80G,,,172339,23,103372,18,,,455713,28,65.2,0,1,60,0,+++++,+++,68,0,72,0,+++++,+++,70,0
nas1.,80G,,,168009,23,104033,19,,,448350,28,67.2,0,1,57,0,+++++,+++,73,0,70,0,+++++,+++,70,0
nas1.,80G,,,134855,18,95563,17,,,454199,29,67.6,0,1,57,0,+++++,+++,68,0,69,0,+++++,+++,72,0
nas1.,80G,,,168409,23,110232,20,,,437780,28,60.2,0,1,59,0,+++++,+++,66,0,69,0,+++++,+++,64,0
nas1.,80G,,,154816,21,97137,17,,,435608,27,62.8,0,1,56,0,+++++,+++,70,0,67,0,+++++,+++,71,0
nas1.,80G,,,160580,22,93501,17,,,387224,24,59.9,0,1,58,0,+++++,+++,71,0,68,0,+++++,+++,66,0

9 drive:

Code:

[/SIZE][/FONT][/SIZE][/FONT][/SIZE][/FONT][/SIZE][/FONT]
nas1.,80G,,,176261,24,89237,16,,,281175,17,111.1,0,1,67,0,+++++,+++,75,0,75,0,+++++,+++,77,0
nas1.,80G,,,148053,21,75879,14,,,246913,15,84.6,0,1,39,0,+++++,+++,35,0,30,0,+++++,+++,27,0
nas1.,80G,,,159561,23,84219,15,,,274329,17,102.2,0,1,70,0,+++++,+++,77,0,80,0,+++++,+++,78,0
nas1.,80G,,,168098,24,91501,17,,,271169,17,107.7,0,1,67,0,+++++,+++,77,0,75,0,+++++,+++,80,0
nas1.,80G,,,160725,23,87467,16,,,277539,17,98.6,0,1,67,0,+++++,+++,80,0,80,0,+++++,+++,76,0
nas1.,80G,,,169522,23,92404,17,,,277476,17,99.2,0,1,61,0,+++++,+++,77,0,75,0,+++++,+++,38,0
nas1.,80G,,,164939,24,88873,16,,,290219,18,105.1,0,1,58,0,+++++,+++,76,0,78,0,+++++,+++,81,0
nas1.,80G,,,172088,24,90476,16,,,269739,17,107.5,0,1,67,0,+++++,+++,80,0,77,0,+++++,+++,75,0
nas1.,80G,,,176529,25,86208,16,,,265149,16,102.6,0,1,67,0,+++++,+++,77,0,78,0,+++++,+++,77,0
 
nas1.,80G,,,173901,24,79414,14,,,241195,15,84.0,0,1,66,0,+++++,+++,74,0,79,0,+++++,+++,78,0

15 drive:

Code:

nas1.,80G,,,398158,52,219568,42,,,679096,48,94.6,0,1,68,0,+++++,+++,76,0,72,0,+++++,+++,71,0
nas1.,80G,,,360338,47,210154,39,,,631173,44,91.6,0,1,66,0,+++++,+++,76,0,73,0,+++++,+++,75,0
nas1.,80G,,,280585,37,204045,38,,,602630,42,90.7,0,1,64,0,+++++,+++,86,0,81,0,+++++,+++,70,0
nas1.,80G,,,344371,44,205318,38,,,584715,41,88.0,0,1,64,0,+++++,+++,75,0,69,0,+++++,+++,72,0
nas1.,80G,,,344810,45,205007,39,,,645781,46,90.0,0,1,66,0,+++++,+++,75,0,73,0,+++++,+++,67,0
nas1.,80G,,,340166,45,191111,36,,,656911,46,89.7,0,1,69,0,+++++,+++,71,0,77,0,+++++,+++,73,0
nas1.,80G,,,330739,44,211546,40,,,595988,42,87.0,0,1,62,0,+++++,+++,75,0,81,0,+++++,+++,77,0
nas1.,80G,,,398158,52,198753,38,,,589718,42,85.9,0,1,57,0,+++++,+++,73,0,71,0,+++++,+++,73,0
nas1.,80G,,,357042,47,208142,40,,,625617,44,94.3,0,1,60,0,+++++,+++,67,0,71,0,+++++,+++,67,0
 
nas1.,80G,,,314160,42,204542,40,,,594654,43,92.9,0,1,64,0,+++++,+++,79,0,78,0,+++++,+++,70,0

I don't use compression or dedupe on any of my filesystems. With the datasets we store, it's useless.

Edit, sorry for the extra tags mixed in the code blocks. I'm not fighting with this HTML WYSINNWYG editor any more.. :)

cyberjock · Jun 11, 2014

Nice.. I can't find my damn documentation on Bonnie that I had. And my bookmark shows a dead link where i used to go. Go figure!

Well, it's like this:

1. I don't have the original document so I can't post it(which is what I was going to do).
2. Bonnie's tests are unique in that they do very specific testing. The problem is that the testing doesn't reflect real world loading in 99.99% of tests. Most OSes do not have an ARC/L2ARC/ZIL like ZFS does, so many of Bonnie's test results are complete crap, even if you do everything correctly. The problem is that Bonnie isn't a good "benchmark" for ZFS. The only benchmark I really believe in is what is called 'real world'. Most people don't want to hear that, and they may argue to the contrary, but it also happens to be the reality of it.
3. The 9 and 11 versus 15 drives sees a substantial jump in performance. I seem to remember something weird with bonnie and wide pools but I can't remember the specifics. Since I take the stance that benchmarking tools don't do a good job of benchmarking real-world nor handle ZFS in a sane way its easier to just dismiss them then to actually learn all the nitty gritty into why that is the case. Not to mention the benchmark tools are constantly being changed so that would also require you to keep up to date with what changes.

Now, back to your main reason for testing really wide vdevs. The problem has it's own problems that are specific to the structure of ZFS and cannot be easily understood by the laymen. Some stuff like the powers of 2 rule make sense for many people, but lots of other things make no sense without digging really deep. While the lost space is somewhat calculable, it also fluctuates based on real world. You can either lose small amounts of data with a specific block size or lose tons of data with other block sizes. ZFS only lets you set a maximum block size and you have no control with forcing any given size.

In fact, one of the ZFS developers has recently made a comment that some of the expectations of how ZFS works in the ZFS community are not true anymore and haven't for several years. At this point I don't want to go into it any more because someone in the future will take it as correct and I don't have enough solid information either way. So to prevent rumors I'm just going to say there's some discussion that things may not even be what we think they are.

I will say this in case you(or anyone else) are still debating how to build your pool. I'd never do more than a 3 disk RAIDZ1, 10 disk RAIDZ2 or 11 disk RAIDZ3 under any circumstances unless performance and data storage are NOT critical in the slightest. And to be honest, even for backups people usually care. If your backup takes 3 weeks to finish it's not too useful. You'd have to have a damn compelling reason to get me to agree to breaking those rules. I'm not even sure I could think one up that even remotely resembles any kind of reality.

Edit: And please don't take this as a slam in the face. It's not meant that way and I hope it doesn't sound that way. Its just what I've witnessed after spending several years with ZFS and doing all those things people say not to do in test scenarios "just to see what would happen". Generally, if we say not to do something you should probably not do that. ;)

panz · Jun 12, 2014

cyberjock said:
You are slow.. we discussed this yesterday or the day before in IRC.

They adopt 120x120mm case fans! ;)

Sir.Robin · Jun 13, 2014

jgreco said:
If it's as slow as a Netcrapp, that's bad. :)

Why is NetApp crap... or are you just joking?

Caffe Mocha · Jun 29, 2014

Just a noob question , if 180TB of hds are installed, how much ram is needed? Around 180G of ram? Does the 1g of ram per Tb rule still applies?

Important Announcement for the TrueNAS Community.

Backblaze Storage Pod 4.0

Resident Grinch

Inactive Account

Resident Grinch

actually does care

Resident Grinch

Contributor

Patron

Resident Grinch

Patron

Inactive Account

Resident Grinch

Patron

Inactive Account

Patron

Inactive Account

Patron

Inactive Account

Guru

Guru

Explorer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Backblaze Storage Pod 4.0"

Similar threads