Need MAX Sequential WRITE Performance (40 GbE SMB)


tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
I'm sorry that I don't have time this month, or next, to spend a lot of time on this, because it's the kind of thing I would normally love to talk about at length. I'll see if I can pop in to answer anything that's not been answered sufficiently.
If I was in/near Irvine, I'd offer to help at a reduced rate just to play around with it. It would be a very interesting project!

As for the OMG-way vdev... whoever did that needs to have their fingers broken. If you configured that, OP... well, all I can say is you REALLY need to spend some time reading these forums and most of the content in the "Resources" area. A 60-way vdev would be a horrendous idea for a typical system running gigabit... much less 40gig. It also makes me worry that you haven't done a lot of the other simple recommendations... like recording what serial number is on every drive (or you'll hate yourself when you have to replace a drive), configuring SMART testing and email alerting, pool scrubs, etc.

As for your pool configuration, I'd start with 10 6-disk RAIDZ2 vdevs. That'll give you a decent balance between performance, reliability, and capacity. I'm not sure you'll be able to saturate 40gig with that configuration, but it's where I'd start. That gets you 229.1TiB usable (accounting for the 80% rule, etc.)
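If you want to sanity-check that number, the arithmetic is easy to script. A rough Python sketch (assuming nominal 8TB drives and ignoring ZFS metadata/padding overhead, which is why the real figure comes out a bit lower):

```python
# Rough usable-capacity estimate for 10x 6-disk RAIDZ2 vdevs of 8TB drives.
# Assumes nominal (decimal) 8TB drives and ignores ZFS metadata/padding
# overhead, so the real number lands a bit below this estimate.

TB = 1000 ** 4    # drive vendors sell decimal terabytes
TIB = 1024 ** 4   # FreeNAS reports binary tebibytes

vdevs = 10
disks_per_vdev = 6
parity_per_vdev = 2        # RAIDZ2
drive_size_tb = 8

data_disks = vdevs * (disks_per_vdev - parity_per_vdev)   # 40
raw_data_tib = data_disks * drive_size_tb * TB / TIB
fill_target = raw_data_tib * 0.8                           # the 80% rule

print(f"data disks:        {data_disks}")
print(f"raw data capacity: {raw_data_tib:.1f} TiB")
print(f"80% fill target:   {fill_target:.1f} TiB")
```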

On your Mellanox config... 20.20.20.20? Where did this IP come from? Unless you're working for CSC, you really should stick to non-routable IPs as a best practice.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I think I haven't plugged the ZFS books in this thread yet, so I will. Definitely the place to start:

https://www.tiltedwindmillpress.com/?product=fm-storage

That's the whole bundle straight from Michael Lucas' store. The two ZFS books plus FreeBSD storage essentials and FreeBSD specialty filesystems.

They're quick reads, but you can learn a lot, especially if you're new to ZFS.

You can also get them from most major bookstores.
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
Indeed. Don't worry, we'll get you straightened out.

That's not a "dual-port 12Gbps SAS HBA". The connectors on the boards are not "channels" or "ports". There are four PHY's ("channels") on each connector, and therefore the controller offers eight channels (PHY's). A single channel is not capable of more than 12Gbps. That number at the end of an LSI part number, like "9341-8i" means 8 internal PHY's.

The individual channels can be used in a narrow-port configuration (12Gbps) or a wide-port configuration (4 x 12Gbps = 48Gbps), and it is common to see a wide-port cable go from the HBA to a SAS backplane. It is also possible to go to an 8x wide-port configuration. This is what @tvsjr was talking about, I believe. Mixing up all the various terms is not really helpful in getting points across, and it's useful to make sure we're all talking the same language. SAS is particularly crappy about the word "port", which is generally used to mean "the connection to a SAS endpoint" (how wonderfully abstract!) rather than any physical connector.
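To put rough numbers on the narrow vs. wide distinction, here's a trivial sketch. These are raw signalling rates, so treat them as ceilings rather than achievable throughput:

```python
# Raw line-rate ceilings for 12G SAS narrow vs. wide ports.
# These are signalling rates; usable payload throughput is lower once
# encoding and protocol overhead are subtracted.

lane_gbps = 12           # one PHY ("channel") on a 12Gbps SAS HBA
lanes_per_connector = 4  # four PHYs behind each physical connector

configs = {
    "narrow port (1 PHY)": lane_gbps,
    "wide port (4 PHYs)": lane_gbps * lanes_per_connector,   # 48 Gbps
    "wide port (8 PHYs)": lane_gbps * 8,                     # both connectors
}

for name, gbps in configs.items():
    print(f"{name:22s} {gbps:3d} Gbps  (~{gbps / 8:.0f} GB/s raw)")
```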

Back to the issue at hand, the point was made that you're going to need to walk through the system and tune it at multiple levels. Unlike many iSCSI products that mostly act as a conduit between a network controller and a RAID controller, such as that Win2016 solution mentioned, FreeNAS is actually doing substantial work and you will find it necessary to properly design your hardware (probably Chelsio), your pool (probably mirrors with lots of large disks), your network (probably turn off jumbo, or carefully tune this very sharp edge), etc. in order to get what you seek.

I did not want to step on toes by providing that same answer, but I am going to requote it and emphasize: this comment is 100% spot-on, could not have said it better myself. ZFS is a software package that does in software something that has traditionally been done in dedicated RAID controller silicon. What you want is absolutely possible, in my experience, but may be outside your ability to achieve, especially if you are unwilling or unable to take the deep technical dive into understanding the issues.

ZFS is capable of some truly amazing things, but usually to get there, there's a commitment of resources that has to happen that's a little greater than for some other packages.

I'm sorry that I don't have time this month, or next, to spend a lot of time on this, because it's the kind of thing I would normally love to talk about at length. I'll see if I can pop in to answer anything that's not been answered sufficiently.

Thanks for the education. We come from the Quantum / EMC world, so this FN stuff is a new adventure for us, and costs associated with the low end like this aren't really a primary consideration, as it looks like $100K in the FN world gets you what $1M in the Quantum/EMC world gets you.

Also, if you're in SoCal, we're happy to pay someone for their expertise. We have some significant ambition on this end -- like a target of 6-7PB total online archive storage at our DC.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Thanks for the education. We come from the Quantum / EMC world, so this FN stuff is a new adventure for us, and costs associated with the low end like this aren't really a primary consideration, as it looks like $100K in the FN world gets you what $1M in the Quantum/EMC world gets you.

Also, if you're in SoCal, we're happy to pay someone for their expertise. We have some significant ambition on this end -- like a target of 6-7PB total online archive storage at our DC.
FreeNAS is great, but the reason those other products cost so much more is that their features and performance can't be compared to FreeNAS.
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
If I was in/near Irvine, I'd offer to help at a reduced rate just to play around with it. It would be a very interesting project!

As for the OMG-way vdev... whoever did that needs to have their fingers broken. If you configured that, OP... well, all I can say is you REALLY need to spend some time reading these forums and most of the content in the "Resources" area. A 60-way vdev would be a horrendous idea for a typical system running gigabit... much less 40gig. It also makes me worry that you haven't done a lot of the other simple recommendations... like recording what serial number is on every drive (or you'll hate yourself when you have to replace a drive), configuring SMART testing and email alerting, pool scrubs, etc.

As for your pool configuration, I'd start with 10 6-disk RAIDZ2 vdevs. That'll give you a decent balance between performance, reliability, and capacity. I'm not sure you'll be able to saturate 40gig with that configuration, but it's where I'd start. That gets you 229.1TiB usable (accounting for the 80% rule, etc.)

On your Mellanox config... 20.20.20.20? Where did this IP come from? Unless you're working for CSC, you really should stick to non-routable IPs as a best practice.

Believe me, I wish you were in SoCal.

We're literally at the very beginning of this; I only downloaded it and did a blind install a few days ago, with little reading and no experience. So I am strong with the ignorance here, but I have thick skin and don't yet know what I don't know.

The 20.20 IP is strictly for a tiny, local direct-connect ad hoc network. Nothing special or production going on here.

Not sure what you mean about tracking drive serials. I assume when a disk fails you get a visual notification and just swap it out. Some underling can handle that stuff.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I assume when a disk fails you get a visual notification and just swap it out. Some underling can handle that stuff.
Well, that can get complicated. You can get it to do so with Supermicro hardware, but you need to research it and fiddle around with the SCSI Enclosure Services stuff. If you buy from iXsystems, they ship TrueNAS with that kind of thing working. FreeNAS certified too, presumably.
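For anyone who wants to poke at it, the building block on FreeBSD/FreeNAS is sesutil(8). A hypothetical little wrapper (assuming your backplane exposes SES locate elements and your disks show up as da devices, neither of which is guaranteed) might look something like:

```python
# Hypothetical helper around FreeBSD's sesutil(8) to toggle the locate LED
# for a disk. Assumes the enclosure actually exposes SES locate elements
# (not all backplanes do, which is why this usually needs some fiddling).

import subprocess
import sys

def set_locate(disk: str, on: bool) -> None:
    """Turn the locate LED for a disk (e.g. 'da12') on or off."""
    state = "on" if on else "off"
    subprocess.run(["sesutil", "locate", disk, state], check=True)

if __name__ == "__main__":
    # usage: python locate.py da12 on
    set_locate(sys.argv[1], sys.argv[2] == "on")
```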
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks for the education. We come from the Quantum / EMC world, so this FN stuff is a new adventure for us, and costs associated with the low end like this aren't really a primary consideration, as it looks like $100K in the FN world gets you what $1M in the Quantum/EMC world gets you.

That's really the thing. Sun's ambition with ZFS was to go after the pricey, specialized, low-yield storage silicon (and storage vendors) by betting that they could do it all in software, on relatively less expensive general purpose server/compute hardware.

They won that battle but lost the war. Some of us started with 386BSD and then FreeBSD back in the '90s, with the Sun idea that "the network IS the computer," and abandoned the "pricey" Sun hardware... why get one Sun when you can get two PCs for the same price? Redundant array of servers. Then, as massive companies like Google made scaling data centers through distributed and parallel processing a thing, Sun found its big servers to be of limited usefulness, as the industry leaped over them to build many cheap, smaller systems in a distributed network, making MUCH larger computers.

Storage is still hard to do that way, especially for what you're doing. And ZFS is pretty awesome, but you will find that it is a little finicky. Even for your big EMC storage systems, the amount of tuning and engineering that goes into making them work involves fleets of field engineers and a certain amount of frustration.

Also, if you're in SoCal, we're happy to pay someone for their expertise. We have some significant ambition on this end -- like a target of 6-7PB total online archive storage at our DC.

Sorry, kinda far away. Strongly suggest you follow some of the suggestions you've been given to this point, as there's a huge amount of room for relatively easy improvement, and @tvsjr has made some good points. If it gets to a point where you really do need some help, there are some of us who do this sort of stuff, but I have a strong preference towards making sure people know the basic concepts underpinning their gear, and while I can provide that on a silver platter for money, it's about the same amount of effort for you to get the information for free from the forum, which is better (dollarwise) for you.

There's also the possibility of buying a fully supported system from iXsystems, which helps fund FreeNAS development.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Believe me, I wish you were in SoCal.

We're literally at the very beginning of this; I only downloaded it and did a blind install a few days ago, with little reading and no experience. So I am strong with the ignorance here, but I have thick skin and don't yet know what I don't know.

The 20.20 IP is strictly for a tiny, local direct-connect ad hoc network. Nothing special or production going on here.

Not sure what you mean about tracking drive serials. I assume when a disk fails you get a visual notification and just swap it out. Some underling can handle that stuff.
Well, there are always remote options (VPN/etc.) if your environment supports it (since it seems like you have the physical side of things reasonably well configured, with the potential exception of that Mellanox card), but I do agree with jgreco... you need to gain a much better understanding of how all this stuff works, or you need to invest in a supported system like TrueNAS. Remember that a lot of what you're paying for when you buy a TrueNAS/EMC/whatever system is the support, testing, maintenance, etc. FN is an incredibly powerful product, but you're trading cost for needing a lot more knowledge.

7-8PB is gonna be a really interesting project - that's something on the order of 2K drives. Not only are you getting into lots of storage, you're also looking at massive amounts of heat, power consumption, and cost. You're going to have $400K tied up in drives alone, let alone the hardware!

20.20 is a public, routed IP, currently belonging to CSC. You shouldn't be using it if you don't own it. You should be using something from RFC1918 - some subnet inside of 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. It's most likely just an academic exercise, but it's the Right Way to do things.
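If you ever want a quick programmatic sanity check on whether an address falls inside those ranges, Python's standard ipaddress module does it in a couple of lines (a throwaway sketch):

```python
# Quick check: does an address fall inside the RFC1918 private ranges?
import ipaddress

for addr in ["20.20.20.20", "10.1.2.3", "172.16.0.10", "192.168.1.5"]:
    print(f"{addr:15s} private={ipaddress.ip_address(addr).is_private}")

# 20.20.20.20 prints private=False -- it's someone else's routable space,
# so pick something out of 10/8, 172.16/12, or 192.168/16 instead.
```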

I'm running all Supermicro gear myself, and I don't get visual indications of a drive failure. TrueNAS does that, yes, but I don't know if anyone has it working reliably. I have a spreadsheet that I built when I configured the system... showing the serial number for each drive bay. When a drive fails or starts throwing SMART errors, I get an email... I look up the serial, cross-reference from the spreadsheet, and go swap the drive. Simple.
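If the spreadsheet feels too manual, the same cheat sheet can be generated. Something along these lines would be a starting point (assuming smartmontools is installed and the pool disks show up as /dev/da0 through /dev/da59; adjust the device list for your system):

```python
# Rough sketch: dump disk -> serial number mappings with smartctl so the
# cheat sheet exists before a drive ever fails.
# Assumes smartmontools is installed and disks appear as /dev/da0../dev/da59.

import subprocess

def serial_of(dev: str) -> str:
    result = subprocess.run(["smartctl", "-i", dev],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        # SAS and SATA drives capitalize the label differently
        if line.lower().startswith("serial number"):
            return line.split(":", 1)[1].strip()
    return "unknown"   # device missing or smartctl couldn't read it

for n in range(60):
    dev = f"/dev/da{n}"
    print(f"{dev}\t{serial_of(dev)}")
```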

I would suggest being careful with the underlings... underlings have a bad habit of pulling the wrong drive and taking an entire pool out. I've seen it happen.
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
QUICK QUESTION: Best SEQUENTIAL READ performance on a 44-disk SAS JBOD (spread across two 12G SAS "ports" -- one channel has 24 disks, the other has 20 disks). We have two such JBODs identically configured off the SM Twin server. We want MAX sustained sequential READ throughput now.

To utilize all 44 disks, would a 4x 11-disk Z1 pool or an 11x 4-disk Z1 pool yield maximum performance?

Thanks,
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
Just as an update...

We're now running a total of exactly 100x 8TB HGST He SAS3 12G disks in 2x identical SM 44-disk dual expander 12G SAS3 JBODs (88 total) and 12x in the SM Twin server, EACH node having an identical physical configuration -- Dual E5-2630 v2 CPU, 128G DDR3 ECC, LSI 9300-8e Dual 12G SAS "Ports", and Mellanox CX3 EN Dual Port 40GbE QSFP+ NICs running over an Arista 7050QX-32 (16x 40GbE QSFP+ and 8x 10GbE SFP+ ports).
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
The more vdevs, the more IOPS, if all else is equal. Your best performing array would be 22x 2-way mirrors. Obviously, this is also the most expensive configuration (unless you want to talk about 3-way mirrors).
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
The more vdevs, the more IOPS, if all else is equal. Your best performing array would be 22x 2-way mirrors. Obviously, this is also the most expensive configuration (unless you want to talk about 3-way mirrors).

We could do 22x2. Is that really the best for sequential READ? Do all 44 disks serve up data on reads in a mirrored array?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Mirrors are going to give you the best overall performance. With mirrors, it's going to read from all disks that have the data. RAIDZx is very similar for reads, but fewer vdevs means fewer IOPS, so multiple readers will be slower.
 

BiffBlendon

Dabbler
Joined
Jan 6, 2018
Messages
20
That's good to know. So, if we are tuning for multiple read clients, more vdevs will help.

So, with that in mind, would there be a considerable performance difference in sequential READ throughput to two concurrent clients between an 11x4 Z1 and a 22x2 mirror?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
That's good to know. So, if we are tuning for multiple read clients, more vdevs will help.

So, with that in mind, would there be a considerable performance difference in sequential READ throughput to two concurrent clients between an 11x4 Z1 and a 22x2 mirror?
Well, you should never use RAIDZ1, so let's change that to RAIDZ2. Have you tested the speed with your workload? It's super easy to create a pool of each kind and test it. I suspect the sequential read performance will be about the same with each pool.
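To make "just test it" concrete, a crude first pass is timing a large sequential read off each candidate pool, something like the sketch below. The test file path is hypothetical, and fio is what you want once you care about queue depths and multiple workers:

```python
# Crude sequential-read timing of one large file on the pool under test.
# Make the file much bigger than RAM, or the ARC serves it from memory and
# the number is meaningless. fio is the better tool for anything serious.

import time

PATH = "/mnt/tank/testfile"   # hypothetical test file on the pool
BLOCK = 1024 * 1024           # 1 MiB reads

total = 0
start = time.monotonic()
with open(PATH, "rb", buffering=0) as f:
    while True:
        chunk = f.read(BLOCK)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.monotonic() - start

print(f"{total / 2**20:.0f} MiB in {elapsed:.1f}s "
      f"= {total / 2**20 / elapsed:.0f} MiB/s")
```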
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yeah, I know there's a solution.
Well, kind of a solution, at least. I'm hoping that more testing (and more eyes on the code) will result in a more robust solution.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Well, kind of a solution, at least. I'm hoping that more testing (and more eyes on the code) will result in a more robust solution.
Honestly, I've never worried about it because my server rack is in a closet. If I get an email with a drive going sideways, I look up the serial number, compare it to my cheat sheet, and pull the drive.

I could probably save a few watts by turning all the blinkenlights off :D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
We could do 22x2. Is that really the best for sequential READ? Do all 44 disks serve up data on reads in a mirrored array?

ZFS will tend to try to read from the least-used disk in a mirror. This *can* result in an improvement in read speeds where the client is faster than a single disk can sustain. Modern disks are capable of speeds around 200-250MBytes/sec in some cases, or ~2Gbps, so you won't see that on a 1GbE network, but you would on a 10 or 40. You will not necessarily double the 2Gbps number, but you can definitely improve on it. This works because both disks are working simultaneously to serve a single client. Going to two clients reduces the potential available I/O to "one HDD each".

The addition of more vdevs will substantially improve performance. Going beyond two disks in a mirror vdev allows that vdev to serve more simultaneous clients more quickly.

At the speeds you're discussing, the use of RAIDZ is probably not a good idea because of the CPU overhead.
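To put that in perspective, a back-of-the-envelope sketch of how many disks have to be streaming at once to fill a 40GbE pipe (ignoring SMB/TCP overhead, so this is optimistic):

```python
# Back-of-the-envelope: how many spinning disks must stream concurrently
# to approach a 40GbE link, using ~200-250 MB/s per disk as above.
# Ignores SMB/TCP overhead, so treat the result as optimistic.

link_gbps = 40
link_mbytes = link_gbps * 1000 / 8   # ~5000 MB/s of raw line rate

for per_disk in (200, 250):
    disks = link_mbytes / per_disk
    print(f"at {per_disk} MB/s per disk: ~{disks:.0f} disks streaming "
          f"concurrently to fill {link_gbps}GbE")
```

Which is roughly why a layout that can keep most of those 44 spindles busy at once is the only realistic way to get anywhere near that link speed.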
 