Looking for a little guidance from the community

kitt001 · Jun 2, 2017

I have a server that I'm planning to install FreeNAS on, but at the last minute I'm debating the plan that I started with; so I'm hoping some of the 'been there done that' that folks around here can either confirm I'm on the right track, or slap me in the right direction before I do something too stupid. :) My objective here is to use the unit as an iSCSI storage repository for a couple of ESX hosts. I have several other FreeNAS boxes accomplishing the same objective elsewhere, but nothing at this scale, so while I'm not completely new to FreeNAS, this is new territory for me

The server hardware is a Supermicro 6028R, with 128GB of memory, (2) 8 core processors, (12) 4TB HGST 7200RPM SAS disks, (2) 128GB SSD disks, and (2) 400GB Intel P3700 NVMe cards.

My initial plan was to install FreeNAS on the pair of SSD disks, mirror the NVMe modules and use them as the ZIL for the 12 spindle disks configured as RAIDZ3.

After further research, I noticed a recommendation that you shouldn't go beyond 11 disks, but some people also seem to argue that it's fine .. I haven't seen a clear piece of logic for the 11 disk limit (though I'm sure there must be one) ... so that's my first question - Is the 11 disk limit technical in someway (maybe performance nose dives beyond that?), and what might the fallout of using all 12 the way I was thinking?

My second question is around the Raid Z3 layout. If I take the 48TB of raw disk and elect Z3 the system projects 21TB usable ... and if I limit usage of that to below the 80% threshold (which I've also seen as 50% when discussing iSCSI in some places) I end up being able to use 11-17TB out of the 48 in the Chassis ... 31+TB lost (out of 48 total) to parity/reserved space seems like a tremendous amount of overhead, does that sound right? or could I be doing something silly here?

And lastly, am I just going about this wrong given the goal? I'm totally open to suggestions if there is a better way to set this up that anyone is willing to suggest, and if the best answer is that I need something more that I don't already have, I'm open to that too. I prefer to do things only once if I can help it

Appreciate any feedback anyone has to offer.

Ericloewe · Jun 2, 2017

kitt001 said:
maybe performance nose dives beyond that?

Around that, really. Vdevs have the IOPS of a single drive, so if you make them immensely wide, you will suffer.

kitt001 said:
My second question is around the Raid Z3 layout. If I take the 48TB of raw disk and elect Z3 the system projects 21TB usable ... and if I limit usage of that to below the 80% threshold (which I've also seen as 50% when discussing iSCSI in some places) I end up being able to use 11-17TB

How do you figure? This should help: https://forums.freenas.org/index.php?resources/zfs-raid-size-and-reliability-calculator.49/

kitt001 said:
My objective here is to use the unit as an iSCSI storage repository for a couple of ESX hosts. I have several other FreeNAS boxes accomplishing the same objective elsewhere, but nothing at this scale, so while I'm not completely new to FreeNAS, this is new territory for me

iSCSI, or any sort of block storage, really, is painful on ZFS and other CoW filesystems. You're looking at mirrors and 50% free space to maintain adequate performance, in most cases.

BigDave · Jun 2, 2017

kitt001 said:
so that's my first question - Is the 11 disk limit technical in someway (maybe performance nose dives beyond that?), and what might the fallout of using all 12 the way I was thinking?

This was @cyberjock 's recent response to a very similar situation, also I've read @jgreco
echo this same explanation, but can't seem to find that one...
https://forums.freenas.org/index.php?threads/12-disk-vdev.54836/#post-381551

jgreco · Jun 2, 2017

11 disks in RAIDZ3 is a recommended maximum that dates back to a time when it was recommended to have a 2^N number of data drives for RAIDZn, and therefore your choices were really 7, 11, or 19 drives, and 19 worked very poorly. With the advent of ZFS compression, the role of 2^N was substantially reduced.

Many arrays are still built on 12-drive or 24-drive arrays, however, and so 11 still works out as a convenient number to size your vdev at, since it allows for an optimal size and a spare drive. Additionally, you'll discover that recovery performance suffers as your vdevs widen, and once you get out there a bit more, you may have a really rough time of it, especially if you are doing something rough like block storage, which may require a lot of seeks to do the metadata traversal.

So here's the thing. This isn't a binary pass/fail. I can hook the smallest U-Haul trailer up to my truck and pull it no problem. But I can also hook a large trailer up to a small car and still be able to pull it. I might not be doing the car any favors, and maybe I can't go up mountains, and maybe I'll be sorry later when I find my engine ruined, but I can do it. Hell, people can sometimes pull a trailer with their bare hands. Doesn't make it wise.

Now for some free but unsolicited and also unapologetically blunt crystal-ball-gazing-derived advice. You won't want to be using RAIDZ for this block storage project. At least, you aren't likely to do it successfully, unless you maybe have some really unusual edge case. You'll suffer the IOPS effect. Twenty five years ago, we had 1GByte 7200RPM SCSI HDD's. Let's optimistically say you could get around 100 IOPS out of them. That translates to around two million sectors. Let's pretend the average write size was 4KB (8 sectors). To write randomly to the entire disk would take a quarter of a million sectors, or, put differently, around three days. Your 30TB-usable RAIDZ3 vdev will start out very fast, but over time performance will fall catastrophically, because the IOPS you can expect out of a single vdev is similar to that of the slowest component device. So, let's say optimistically that your modern drives can hit 250 IOPS, and because ZFS is awesome you manage to get your vdev to run at 500 IOPS (which might be possible if you kept occupancy below 50%, which lowers us to 15TB usable). So to fill 15TB with 4KB random writes at 500 IOPS, That's 84 days to write that. 100% busy. How usable is that array going to be?

You REALLY need the power of independent mirrors to be able to have any chance to get good performance AND high capacity.

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

https://forums.freenas.org/index.php?threads/zfs-fragmentation-issues.11818/

etc. The rest of what you have is the general foundation of a pretty decent filer. You are in the general realm of being able to succeed (not like the poor suckers who want to do iSCSI on 8GB RAM), but you'll probably need to readjust your pool design. Try instead:

Six two way mirror vdevs (24TB pool, ~10-12TB truly usable)
One 400GB NVMe for L2ARC
One 400GB NVMe for SLOG

Or you could continue to mirror the NVMe for SLOG and replace those 120's with some large cheap consumer SSD, you can probably handle ~~1TB-ish if you don't set a stupidly low block size on the pool, get two nice 500GB units. I think you'd find this a fairly enjoyable storage platform, or, at least, I've not seen any better setup on similar hardware with ZFS.

jgreco · Jun 3, 2017

and upon re-reading I meant "replace those 120's" to free up space for larger L2ARC devices, maybe 500's. This doesn't mean you cannot use the 120's for the purpose! The goal with ZFS for block storage should be to get as much frequently-read data into flash as possible, where "frequently-read" could easily mean as infrequently as "read once a day". Every read you manage to avoid fulfilling from the pool means one less seek for the HDD's.

The ideal ZFS system for block storage is one where most reads are fulfilled from ARC/L2ARC, because fragmentation has very low impact (only initial retrieval into ARC), and where most pool traffic is writes, where you can control the impact of fragmentation by maintaining gobs of free space, or accept lower write peformance in exchange for being able to use more of your pool space.

Of course, this all totally depends on your actual needs and expectations. If a really super-slow pool is acceptable, your design will be different than if you are looking for a large pool of "this is similar in responsiveness to SSD."

jgreco · Jun 3, 2017

BigDave said:
also I've read @jgreco
echo this same explanation, but can't seem to find that one...

<Grinch-as-Darth-Vader voice> "I find your lack of search disturbing"

kitt001 · Jun 4, 2017

Thanks for the feedback so far, but unsurprisingly, it lead to more questions. :)

First, around the scenario of using one NVMe for SLOG and the other for L2ARC while Mirroring the rest of the disks ... I was under the impression that having no redundancy on the SLOG, made it a single point of failure (or corruption) for the entire VDEV, even if the the rest of the VDEV was was redundant ... is that not accurate?

And second, I again have to admit my ignorance here, given my application, I didn't think that L2ARC would provide a huge benefit. My assumption there was based on the thinking (possibly flawed) that it would want to cache files, and in my application the 'files' are going to be entire hard disks. If I have 20 VMs running, they'd easily exceed the L2, and the system would be strugling to decide which should be cached.

If my L2 assumption is incorrect ... how do I size it? You mention 1TBish, and then state two 500's ... are you meaning to simply stripe them? Does failing to make the L2ARC redundant undermine the VDEV? Or is the fact that it's a read cache remove it from that part of the equation? And is larger simply better here?

danb35 · Jun 4, 2017

I believe that both ARC and L2ARC cache blocks, not files.

Stux · Jun 4, 2017

And L2ARC does not need redundancy as the data is backed by the pool and its failure should not bring the system down.

The only exception would be if its failure lowered the pools utility to the point of virtual failure.

Ericloewe · Jun 4, 2017

danb35 said:
I believe that both ARC and L2ARC cache blocks, not files.

Yup. In ZFS, everything is blocks. Files and directories don't exist as monolithic entities.

Stux said:
And L2ARC does not need redundancy as the data is backed by the pool and its failure should not bring the system down.

The only exception would be if its failure lowered the pools utility to the point of virtual failure.

SLOG is similar, with the catch that a coincidental failure of the SLOG device at the same time the system panics or loses power will cause the loss of the last few seconds of transactions.

kitt001 said:
First, around the scenario of using one NVMe for SLOG and the other for L2ARC while Mirroring the rest of the disks ... I was under the impression that having no redundancy on the SLOG, made it a single point of failure (or corruption) for the entire VDEV, even if the the rest of the VDEV was was redundant ... is that not accurate?

SLOG and L2ARC operate per pool, not per vdev.

Stux · Jun 5, 2017

Ericloewe said:
SLOG is similar, with the catch that a coincidental failure of the SLOG device at the same time the system panics or loses power will cause the loss of the last few seconds of transactions

Yes, and that is why it *may* be worth considering a mirrored SLOG, but only if your SLOG firstly features Power Loss Protection (which it should), and I would contend, if the last few transactions are that important to your application, you should perhaps be setting up a High Availability cluster.

If you don't need a High Availability cluster, then perhaps you also don't need mirrored slogs ;)

Loss of a SLOG will just revert to keeping the ZIL on the primary pool, which is a performance issue, not a reliability issue.

Ericloewe · Jun 5, 2017

Stux said:
but only if your SLOG firstly features Power Loss Protection

Well, SLOG without power loss protection is another name for "very expensive system decelerator".

kitt001 · Jun 5, 2017

Ericloewe said:
Yup. In ZFS, everything is blocks. Files and directories don't exist as monolithic entities.

OK, This makes sense and how it would be beneficial in my application after all. Which leads me back to my previous question, is this simply a "bigger is better" scenario or is there a point where too much of a good thing becomes detrimental? I'll swap out the (2) smallish SSDs with bigger ones ... I just want to be clear if there is an optimal ratio that I should be targeting.

Stux said:
Yes, and that is why it *may* be worth considering a mirrored SLOG, but only if your SLOG firstly features Power Loss Protection (which it should), and I would content, if the last few transactions are that important to your application

I was under the impression, again, perhaps mistakenly, that the loss/corruption of the SLOG would lead to the pool being unrecoverable regardless of the pools own underlying redundancy. (I thought I read this elsewhere on the forum, perhaps Cyberjocks powerpoint, not sure, but maybe I misinterpreted what was being said)

Stux said:
If you don't need a High Availability cluster, then perhaps you also don't need mirrored slogs ;)

The hard disks for the VMs that will be stored on here are development resources. Largely they'll be sitting idle accept the ones involved in an active development effort. While a failure of this system would be a major inconvenience (one I'd rather not deal with), it wouldn't be catastrophic as the systems will be backed up and stored elsewhere. That said, if mirroring the SLOG will mitigate it as a potential source of a future headache ... I'm content to just go that route unless it's truly a waste of resources. :)

Ericloewe said:
Well, SLOG without power loss protection is another name for "very expensive system decelerator".

This is my first foray into use of any NVMe modules, but I don't think the P3700 supports a battery ... And while I realize there is litany of other possible reasons that it would come in to play beyond environmental, the system has redundant power supplies, connected to redundant UPS', attached to redundant power circuits, backed up by redundant generators; so at least the mother nature related reasons are relatively covered.

I apologize if I'm getting things confused and appreciate the patience ... I'm trying to handle several unrelated projects that are all at the edge of my current knowledge base, so I'm trying to expand it as I go .. but there's a lot learning going on :).

Ericloewe · Jun 5, 2017

kitt001 said:
Which leads me back to my previous question, is this simply a "bigger is better" scenario or is there a point where too much of a good thing becomes detrimental?

No. L2ARC needs metadata in RAM, so there's a balance to pay attention to. For 128GB of RAM, something like 600-800GB should be a decent starting point (the recently-introduced compressed ARC changes things a bit, so there isn't a good feeling for what is a good size, yet).

kitt001 said:
I was under the impression, again, perhaps mistakenly, that the loss/corruption of the SLOG would lead to the pool being unrecoverable regardless of the pools own underlying redundancy. (I thought I read this elsewhere on the forum, perhaps Cyberjocks powerpoint, not sure, but maybe I misinterpreted what was being said)

That used to be true looooooong ago. These days, worst-case, you lose a few seconds of transactions, if you're really unlucky. If your luck is normal, you just get atrocious performance until you replace the SLOG.

kitt001 said:
That said, if mirroring the SLOG will mitigate it as a potential source of a future headache ... I'm content to just go that route unless it's truly a waste of resources.

It's your money and your decision to make.

kitt001 said:
but I don't think the P3700 supports a battery

That's because it includes the power loss protection on board. It's an array of capacitors.

kitt001 · Jun 5, 2017

Ericloewe said:
No. L2ARC needs metadata in RAM, so there's a balance to pay attention to. For 128GB of RAM, something like 600-800GB should be a decent starting point (the recently-introduced compressed ARC changes things a bit, so there isn't a good feeling for what is a good size, yet).

So with that as a starting point, the originally suggested 1TB seems reasonable ... which was suggested as (2) 500GB SSD's ... I'm guessing that those can be striped to create the 1TB L2? (I was unclear on that) ... or is a single disk better suited if I can acquire it?

Ericloewe said:
That used to be true looooooong ago. These days, worst-case, you lose a few seconds of transactions, if you're really unlucky. If your luck is normal, you just get atrocious performance until you replace the SLOG.

Woohoo! ... it may have been wrong in the current atmosphere but at least I didn't imagine it. lol. I don't think losing the last few transactions in this case would be any different than a physical machine (VS the virtual ones using this NAS) crashing in mid operation and having to recover file system errors on boot, so I don't think they are THAT critical - my only motivation for the mirrored SLOG was to protect the integrity of the entire pool if it went bad. If that is no longer the case, I'll pull the second NVMe and use it elsewhere.

That brings me to the OS ... originally, I was going to install it on the (2) 128 SSD's that I'm now pulling out in favor of the L2. Is there a risk to running it off a USB flash drive (the server has internal USB sockets)? I know it's done frequently, and I even have two or three very much smaller FreeNAS boxes doing it because those were the resources I had at the time, and though I've never had an issue; the whole concept leaves me feeling a tad iffy. But besides the initial boot of the system, I'm not sure how much the disk gets accessed, so It might really be a non-issue. And while I'm on that topic, does installing the OS on a pair of devices buy me anything that's worth having? It could simply be the old mentality in me that that the system disk should always have a backup when possible.

Thank you again for all of your insight.

Stux · Jun 5, 2017

USB disks are iffier than SSD.

Installing onto a pair of 128GB SSDs is the ideal solution (given that you can't buy decent 32GB SSDs anymore).

The benefit of a USB install is that it doesn't use any SATA ports and they can be cheaper.

kitt001 · Jun 5, 2017

My issue is, that although I still have SATA ports, my chassis is out of drive bays.

However, after reflecting on this a bit from my earlier post, I think that if the goal of the L2 is speed, maybe instead of gettiing a TB worth of SATA SSD, I'd be better served to get a Samsung SSD 960 Pro NVMe 1TB, use that for the L2, and leave the pair of 128 SSD's for the OS. I think the NVMe ends up being more cost effective than the SATA drives, and will perform much better.

Unless there's a compatability issue there, I think that's what I'm leaning toward.

Ericloewe · Jun 5, 2017

No, there's no compatibility issue with NVMe.

kitt001 · Jun 6, 2017

Sorry .. I didn't mean to imply a compatibility issue with NVMe in general ... just that particular module.

Ericloewe · Jun 6, 2017

kitt001 said:
Sorry .. I didn't mean to imply a compatibility issue with NVMe in general ... just that particular module.

Why would there be? It's all PCIe on the physical layer and NVMe on the storage layer.

Important Announcement for the TrueNAS Community.

Looking for a little guidance from the community

Dabbler

Server Wrangler

FreeNAS Enthusiast

Resident Grinch

Resident Grinch

Resident Grinch

Dabbler

Hall of Famer

MVP

Server Wrangler

MVP

Server Wrangler

Dabbler

Server Wrangler

Dabbler

MVP

Dabbler

Server Wrangler

Dabbler

Server Wrangler

Similar threads