FreeNAS on high end home built platform


bazagee

Cadet
Joined
May 2, 2013
Messages
8
Hello all,

Please forgive me for not doing due diligence in reading all the manuals before asking this question. Or for that matter actually installing and testing. ;)
Time is not my friend....

My ultimate question is basically how well does FreeNAS work with hardware controllers like LSI's 9750-8i SAS/SATA RAID controller? I have tried another 'shall remain nameless' product that works well, but not unless I 'buy' the support for this type of controller... I've given it a good shot with my limited Linux skill set.

I have a SuperMicro 32-drive case with the above LSI 3Ware controller using retail off-the-shelf 3TB drives. The product I'm currently testing initially allowed me to create partitions, but on needing to add another one it now refuses to cooperate. Been told it's a driver issue.... ooook then... works on first install then not so much....

Maybe more than most people have tried on this forum, but I thought I might give the question a shot...
Thanks in advance and happy to be a forum member.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The 9750-8i is a 3Ware-legacy LSI controller. In general the 3Ware stuff seems less ideal for FreeNAS than maybe a nice LSI SAS2008 based controller. I haven't played with the 9750, but I think it's a variant on the LSI 2108, so then you're probably going to be required to let the RAID controller make logical devices out of your physical drives, and this isn't a good place for a beginner to start.

You really want to read the document about LSI controllers in the hardware forum. It doesn't deal with the 3Ware legacy products, but the commentary on mfi based cards is generally relevant as to why you don't want to do that.

The underlying problem is that you have a RAID controller. But in the ZFS worldview, ZFS wants to be your RAID controller, and therefore a hardware RAID controller is redundant (and probably slower and definitely less-well-resourced). What you really want is a many-port HBA. So the popular ZFS solution is to take a low end RAID controller like IBM's M1015 and load firmware on it that makes it into an HBA.
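
For what it's worth, here's a minimal sketch of the usual crossflash, assuming a DOS or EFI boot stick with LSI's sas2flash utility and the 9211-8i IT firmware image on it (the "2118it.bin" name is just the customary one - use whatever ships in the firmware package you download; the M1015 typically also needs its IBM boot record cleared first, which I'm omitting here). Treat it as an outline, not a recipe:

Code:
sas2flash -listall            # confirm the controller is visible
sas2flash -o -e 6             # erase the existing IR/RAID firmware
sas2flash -o -f 2118it.bin    # write the IT-mode (plain HBA) firmware
sas2flash -listall            # verify it now reports IT firmware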
 

bazagee

Cadet
Joined
May 2, 2013
Messages
8
Thanks for the concise reply and straight shooting. What I might try then is blowing up the current 12-HDD RAID 50 and re-adding all the drives as JBOD in the controller. See if FreeNAS can manage that with ZFS. At least I'll know if the device drivers will work with the card. I'd like to try and keep the card for its known performance and the fact that it is installed with a BBU. Hopefully that will turn it into a smart HBA?

Thanks again. I'll post with results next week.
 

bazagee

Cadet
Joined
May 2, 2013
Messages
8
OK - on second thought I just ordered the M1015 - I just want this for company bulk near-line storage anyway. Why go through the LSI pain that you have already shown in your great write-ups? :D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's a better choice. What you'd have discovered is some misery. No one here is likely to accuse me of terse answers, but sometimes they are insufficiently long. I should have expanded that last reply a bit to help you understand.

ZFS wants to be your RAID controller.

In a hardware RAID controller, you have maybe something like a PowerPC 800MHz CPU, some number of SAS/SATA ports, and - if you're lucky enough to have a high end controller, not an entry level controller - some cache and maybe even write cache with a BBU. Fairly generous ones in this day and age seem to be about 1GB of cache.

ZFS can do all the same computational work for RAID on the host CPU. Yes, that slows the host CPU, but on a Xeon, it's not that large an issue, and in the end Sun figured out that CPU cores were cheap while specialized silicon was very expensive.

With specialized silicon, that means you need a more complicated device driver and management subsystem. ZFS needs that too, except that ZFS *is* that subsystem, so it is all cleanly integrated with the filesystem and storage layers.

ZFS wants to cache for you. For reads, it will happily cache on both a per-disk ("vdev") AND more abstracted ("ARC") basis. When you're reading things off the pool with ZFS, just as with a RAID controller, it will let those bits sit in cache until pressure flushes them out. The difference is that ZFS's cache is the system's RAM. So if you have a 64GB RAM system, a huge portion of that is being used for ARC. Also, a fixed amount of it is being used for vdev cache (which is similar to the "read-ahead" function on a RAID controller).

So point one is this: if you use a RAID card as a disk controller for ZFS, you're going to see almost no benefit from any read cache on the card, even if you enable "read-ahead" on the RAID controller, because ZFS is doing that already, and with substantially larger resources available to it (which you can even tune as needed).
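
If you want to see this for yourself, the relevant numbers are visible from the shell; a quick sketch (sysctl names as they appear on FreeBSD/FreeNAS, sizes in bytes):

Code:
# how much RAM the ARC is currently using, and its configured ceiling
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max
# hit/miss counters, for gauging how well the read cache is doing
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses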

But ZFS also wants to cache writes for you. A lot of data such as file writes may not be that important and can be written async to the pool. ZFS bunches these up into a transaction group and flushes them to disk when a txg fills, or every few seconds, all configurable. The thing is, a transaction group could potentially be huge - it is based on the size of your RAM. So if you are doing a bunch of modifications within files, this is actually being handled in RAM at very high speed, then flushed out to disk. And when it does, it's probably going to blat out a large amount of data. Much larger than your card's write cache is capable of caching.

So point two is this: if you have a RAID card with write cache, a lot of ZFS writes are going to be absolutely devastating to it because they're likely to be larger than the cache on the RAID.
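
To put a number on the "every few seconds" part: on FreeBSD/FreeNAS the txg flush interval shows up as a tunable (a sketch - the default has moved around between releases, so check your own box rather than trusting mine):

Code:
# how often, in seconds, an open transaction group is forced out to disk
sysctl vfs.zfs.txg.timeout
# it can be pinned at boot via a loader tunable, e.g. in /boot/loader.conf:
# vfs.zfs.txg.timeout="5"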

However, there is some opportunity for write cache to be of use for ZFS sync writes. In the ZFS model those are typically handled with a specific type of SSD, but it appears that a RAID controller with BBU-backed write cache could be very competitive - quite possibly faster and more durable - than an SSD. Only for the ZIL/SLOG, though. But this is pretty much the exception rather than the rule.
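
For comparison, the SSD version of that is just a dedicated log vdev; a sketch with made-up pool and partition names (mirroring the log device is still a good idea):

Code:
# attach a dedicated SLOG to an existing pool named "tank"
zpool add tank log gpt/slog0
# or, better, as a mirrored pair
zpool add tank log mirror gpt/slog0 gpt/slog1
zpool status tank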
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
For reads, it will happily cache on both a per-disk ("vdev") AND more abstracted ("ARC") basis.

Not wanting to derail your excellent post(s) on zfs, I had a question about this.

I was under the impression the vdev cache was disabled by default?

According to this webpage:

Prior to the Solaris Nevada build snv_70, the code caused problems for system with lots of disks because the extra prefetched data could cause congestion on the channel between the storage and the host.

According to this webpage the tunable is:

vfs.zfs.vdev.cache.size

On both my freenas machines, this sysctl is 0. I assume that means it's disabled.

I also found this sysctl:

vfs.zfs.vdev.cache.bshift

As far as I could tell, this is the amount of read-ahead to do per vdev? This defaults to 16 on both of my freenas machines.
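
For anyone following along, this is how I've been checking them (a sketch; as I understand it bshift is a power of two, so the default of 16 works out to 2^16 = 64KiB reads per device when the cache is active):

Code:
# bytes of cache per device; 0 means the per-device cache is disabled
sysctl vfs.zfs.vdev.cache.size
# inflated reads are 1<<bshift bytes, so 16 -> 64KiB
sysctl vfs.zfs.vdev.cache.bshift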

I assume it's still best to leave the vdev cache off, and rely on the more intelligent prefetch cache?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The first link seems to apply to Solaris, the second to FreeBSD. While both are "ZFS", the tunables, how they work, and what works best for Solaris and FreeBSD are NOT the same. Really screws things up, because when you find some super awesome tunable that you are sure will give you a free performance boost, it might not work correctly (or even be the right tunable) in FreeBSD.

So I think you may be reading about 2 different stories, but thinking they are the same. I'm not sure. I generally consider the default settings for ZFS to be good and tuning to be a very bad idea unless you have years of experience with ZFS.

But this vdev cache is interesting, I'll have to read up on it. I could see vdev cache getting out of hand really fast. I have a 24 drive system, and if I set it to a "meager" 128MB of RAM, that would really be 3GB allocated just for the vdev cache. Ouch!
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Yes, I realize the first link was about solaris. I was only quoting it because it basically says that at some point (on solaris), the vdev cache was disabled by default.

Then I quoted a freebsd page that mentioned the default for vdev cache was 10m, and recommended tuning it to 5m for a low memory (768m) machine. Quoting a machine with 768 megs of ram means (to me at least) this page was probably written some time ago.

Further checking the vdev cache size sysctl on my freenas machines, I find the default now seems to be disabled.

I am guilty of drawing two conclusions: that the vdev cache was enabled by default on solaris, and now is not. And that the vdev cache was enabled by default on freebsd, and now is not.

I am also not sure if the vdev cache is per device (disk), or per vdev. If it's per vdev, then 128m wouldn't be excessive. If it's per disk, then I agree, that's excessive.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It's a "per disk" setting. So if you set it to 10M and had 100 disks, you'd be consuming 1000MB of RAM. The drawback is that any RAM allocated to this cache is effectively 100% allocated. So that 1000MB of RAM will always be in use until you change/disable it(and reboot). Any other process that needs RAM will have to compete for whatever RAM your system has minus that 1000MB.

I just read a small article on it, and the main reason it is disabled is that it generally doesn't help much for pools with just a few disks, where pre-fetch will work nicely. Pre-fetch doesn't work "as well" for very very large pools. But for very very large pools, you probably don't want to have a 50 disk cache of 50MB per disk. So there really isn't a "good use" case for the disk cache unless you've run some ZFS tools and determined that your file system, file structure, usage patterns, and hardware limits actually dictate a desire to use it.
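
If someone does decide to experiment, keep in mind it's a boot-time loader tunable, so on FreeNAS it would go in as a tunable rather than a live sysctl. A sketch of the equivalent loader.conf entry, with 10M as an arbitrary example value:

Code:
# /boot/loader.conf (or a FreeNAS tunable with the same name and value)
vfs.zfs.vdev.cache.size="10485760"   # 10MB per device; 0 disables the cache
# a reboot is needed for the change to take effect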

As for your guilt about drawing conclusions.. welcome to the club. A lot of us draw conclusions (and it seems that Murphy loves us so much too) and we learn the lessons the hard way. I've had quite a few major ones over my years of computing. I've been lucky in that I'm pretty conservative with my conclusions, but I have friends that have no problem throwing all sorts of very bizarre conclusions out and they've lost lots of data they'll never get back (wedding pictures, baby pictures, etc.)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hm, interesting. I hadn't been in the vdev cache code lately, but it does appear to be disabled in v28 along with a sour commentary to the effect that it's no longer beneficial with (unspecified) other changes that have happened in the ZFS code.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
As for your guilt about drawing conclusions.. welcome to the club. A lot of us draw conclusions (and it seems that Murphy loves us so much too) and we learn the lessons the hard way. I've had quite a few major ones over my years of computing. I've been lucky in that I'm pretty conservative with my conclusions, but I have friends that have no problem throwing all sorts of very bizarre conclusions out and they've lost lots of data they'll never get back (wedding pictures, baby pictures, etc.)

I don't understand. I didn't say I was jumping to conclusions. I drew certain conclusions based on my investigations into the subject at hand. I have no direct experience with Solaris, but I imagine their own wiki is correct. So the conclusion that the vdev cache was disabled on solaris would be correct.

My other conclusion - that it is now disabled on freebsd (and therefore freenas), but wasn't previously - I proved correct. I booted up a vm running 8.2.0-release (zfs v15). sysctl vfs.zfs.vdev.cache.size shows 10m (10485760).

Maybe it's a difference in language, but to "draw a conclusion" to me is not a negative. It's a result you've arrived at after examining all the information pertaining to the matter at hand. On the other hand, "jumping to a conclusion" would have a bad connotation. For example, "freenas is memory hungry, therefore freenas is bad, and *insert other nas software here* is good". To me that's jumping to a conclusion.

Anyway, I digress. I just thought I would point out that I was under the impression vdev caching was disabled 'now'.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I've been wondering about this vdev cache thing. If you've read some of my other posts, I have an Areca 1280ML-24 port with a 2GB RAM stick installed. I've found that disabling the write cache and setting the hdd read-ahead and volume read-ahead cache to "normal" or "enabled" in my controller BIOS seems to provide a major performance boost (1GB/sec+ scrubbing versus 150-200MB/sec). I'm wondering if I could get similar results using the vdev cache and disabling the read-ahead cache settings on my controller.

My first thought is that this might not really matter in the big picture because the RAM on my controller will basically be unused, and why not use the RAM if you have it? I'm just not sure if one would be more effective than the other.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
My first thought is that this might not really matter in the big picture because the RAM on my controller will basically be unused, and why not use the RAM if you have it? I'm just not sure if one would be more effective than the other.

My fuzzy recollection is that there was some logic in the ZFS code to try to optimize the cases where it would be used. I've still been too lazy to research the reasoning behind disabling it in v28. Even a high rate of failure for prefetching would be acceptable if there was sufficient bandwidth to the disks and an occasional hit would reduce latency, which ought to translate into at least somewhat faster pool responsiveness. If prefetch is helping on your Areca, then I'd expect that the vdev cache in ZFS would be a similar win. Curious.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm 1/2 tempted to try it out.. wonder if there are any risks with using it... I guess more reading is in order.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
It's a "per disk" setting. So if you set it to 10M and had 100 disks, you'd be consuming 1000MB of RAM.
Actually it's per vdev based on what I've seen.

I've still been too lazy to research the reasoning behind disabling it in v28. Even a high rate of failure for prefetching would be acceptable if there was sufficient bandwidth to the disks and an occasional hit would reduce latency, which ought to translate into at least somewhat faster pool responsiveness.
FYI, device-level prefetch will only prefetch metadata now.

Evil Tuning#Device-Level Prefetching
zfs vdev cache consumes excessive memory
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Actually it's per vdev based on what I've seen.


FYI, device-level prefetch will only prefetch metadata now.

Evil Tuning#Device-Level Prefetching

I quoted that web page earlier in the thread, and was told it wasn't relevant because it was Solaris. Because Solaris and FreeBSD are different, you can't compare the two.

I understand they're different, but the zfs code in solaris, at that time at least, made its way down to freebsd, obviously modified for use in freebsd. I understand zfs is now closed source, so that won't (likely) happen again.

So is that web page a source of info that applies to freebsd (and therefore freenas)? That's kinda what I was aiming at originally.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I quoted that web page earlier in the thread, and was told it wasn't relevant because it was Solaris. Because Solaris and FreeBSD are different, you can't compare the two.
Sure you can. You just have to be aware of the differences.

I understand they're different, but the zfs code in solaris, at that time at least, made its way down to freebsd, obviously modified for use in freebsd. I understand zfs is now closed source, so that won't (likely) happen again.
Sun/Oracle is no longer FreeBSD's ZFS vendor. It's been illumos for some time now.

So is that web page a source of info that applies to freebsd (and therefore freenas)? That's kinda what I was aiming at originally.
UTSL.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, nice. I'm not sure this actually makes any sense for FreeNAS.

illumos/openindiana have a substantially smaller minimum hardware requirement, and it appears that the number was set to zero largely on the premise that 10MB was onerous if one had "lots of disks". I guess if you had a hundred disks at 10MB each and only 512MB of RAM, that'd be a problem, but FreeNAS's recommended minimum for ZFS is 8GB, and a larger hundred-drive system would be expected to have at least dozens of GB of RAM.

I'm going to play around a little...
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I enabled a 10m vdev cache on my backup nas. It has one zpool, and only one vdev in that pool (a z2). It hasn't been very active, as it doesn't do much besides receive snapshots, and act as a dumping ground for acronis backups. But here's the stats from about a day of uptime:

Code:
VDEV Cache Summary:                             1.92m
        Hit Ratio:                      35.19%  676.82k
        Miss Ratio:                     62.76%  1.21m
        Delegations:                    2.05%   39.40k


Not very good at only 35% hit rate.

By contrast, here's what it manages with prefetch:

Code:
File-Level Prefetch: (HEALTHY)

DMU Efficiency:                                 139.45m
        Hit Ratio:                      88.44%  123.33m
        Miss Ratio:                     11.56%  16.12m


Much better. An 88% hit rate, with only about 16 million wasted reads.


With the vdev cache seeing only about 2 million accesses, and at only a 35% hit rate, I really doubt it makes any difference one way or the other.
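
(If you don't have the stats script handy, the same counters come straight out of sysctl; the hit ratio is hits divided by the sum of all three:)

Code:
sysctl kstat.zfs.misc.vdev_cache_stats.hits
sysctl kstat.zfs.misc.vdev_cache_stats.misses
sysctl kstat.zfs.misc.vdev_cache_stats.delegations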
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think you're thinking about it the wrong way. A 35% hit rate would be fantastic on a busy pool. The thing is, on spinny media, it is the seeks that kill you. If you seek for something, then seek away for an unrelated task, and then have to seek back to get blocks that were related to the first request, being able to eliminate that third seek as part of the first by making it a longer read could be a massive improvement. If there is no I/O bus contention (think: SAS multichannel to expander, etc) then the cost is basically one of sitting around for a fraction of a millisecond to transfer extra data - and then having to store that in memory.

But the caveat here is that it isn't going to make sense under all circumstances. As a matter of fact, it may not make sense under many circumstances.... but I think the big question is whether or not the change to only running metadata through the vdev cache has killed the potential for effectiveness. If you look at the rates at which requests are being made on your filer, the vdev cache is not being hit very hard compared to the DMU prefetch rates.

I was going to play around with one of the servers here, but the fscking USB flash apparently got corrupted and the stupid thing needs to be reloaded now. The idea of USB keys is attractive but I keep seeing them rot to hell after a year or two.
 