ZFS best settings for iSCSI as ESXi Storage with ZIL/L2ARC


viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Hello,

I'm a little lost on how to configure storage to serve some VMware ESXi clients.

First of all let's describe what I have here:
Supermicro X9SCM-F with Xeon E3-1240 V2 and 32GB of ECC DDR3 at 1600MHz
Intel RS2WC080 flashed as LSI 9211-8i in IT mode
Supermicro Enclosure with 24 drive bays (and two internal)
Broadcom Gigabit Ethernet Card with iSCSI Support
24x Seagate 7200RPM 3TB SATA Disks (ST3000DM001)
4x Western Digital VelociRaptor 10k RPM 1TB (WD1000DHTZ)
2x Western Digital VelociRaptor 10k RPM 600GB (WD6000HLHX)
2x Kingston SSDNow V300 Series SV300S37A 120GB

So that's my current hardware. Of course we can't put all the disks in the enclosure, and I would like to know what's the best we can get out of this hardware.

I was thinking of a 10K RPM pool with the four 1TB drives using RAID-Z, and another pool with twenty of the 7200RPM disks, leaving the other four disks and the two 600GB VelociRaptors for other uses in other machines.

Since we have two large SSDs for the ZIL, I don't know if the 10K RPM disks are necessary.

Another problem: I don't know if RAID-Z will be fast enough to handle the virtual machines over iSCSI.

So I'm here to hear your thoughts and suggestions.

Thanks in advance,
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ok.

1) Mainboard, probably good (or at least as good as you'll get short of E5-land)
2) HBA, awesome
3) Which 24-bay enclosure? There are the individual SATA ones and then the SAS expander variants.
4) Broadcom with iSCSI, find a better use for it. Your FreeNAS is perfectly happy without NIC iSCSI support, and Broadcom, while they make a fine card, is not cheap. Intel is cheap AND works well.

You've given us nothing to grasp at performance-wise however. So I have limited comments:

A) For the VelociRaptors, do not waste them in a RAIDZ. Mirrored pairs. You can create two vdevs of 1TB each plus a 600GB vdev, and create a "fast" pool of 2.6TB of space. You of course do not have to use them all, but do use them in mirrored pairs.

B) For the 3TB's, fill the remaining slots in the chassis with them and put them in RAIDZ2 or RAIDZ3 for "slow" pool. RAIDZ is considered dangerous because it can only tolerate one drive failure.

C) For SLOG device, you do not need a 120GB SSD. You do want one with supercap or capacitor array. Explanation here.

D) Depending on your working set size, you may be able to make good use of at least one SSD for L2ARC.
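
Putting (A) and (B) together, a rough sketch of what the layout might look like from the CLI (device names are placeholders, and you'd normally do this through the GUI):

    # "fast" pool: three mirrored pairs (two 1TB pairs plus the 600GB pair), ~2.6TB usable
    zpool create fast mirror da0 da1 mirror da2 da3 mirror da4 da5

    # "slow" pool: start with one 10-disk RAIDZ2 vdev of 3TB drives; more can be added later
    zpool create slow raidz2 da6 da7 da8 da9 da10 da11 da12 da13 da14 da15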
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Thanks for the tips, jgreco. Let me make some comments and answer your questions:

Ok.

1) Mainboard, probably good (or at least as good as you'll get short of E5-land)
2) HBA, awesome
3) Which 24-bay enclosure? There are the individual SATA ones and then the SAS expander variants.

You got me there. It's a SuperMicro SC846 with this backplane:


4U Expander Backplane, features:
• SAS2 compliance
• Scalability through cascading
• 6Gb support
• Inband SES-2 Enclosure Management
• SAS/SATA support
• Single input/output SFF 8087 connectors


I think it's a SAS expander, since it only requires a single SFF-8087 cable.

4) Broadcom with iSCSI, find a better use for it. Your FreeNAS is perfectly happy without NIC iSCSI support, and Broadcom, while they make a fine card, is not cheap. Intel is cheap AND works well.

OK... we can use the two integrated Intel NICs: 82579LM and 82574L.

You've given us nothing to grasp at performance-wise however. So I have limited comments:

A) For the VelociRaptors, do not waste them in a RAIDZ. Mirrored pairs. You can create two vdevs of 1TB each plus a 600GB vdev, and create a "fast" pool of 2.6TB of space. You of course do not have to use them all, but do use them in mirrored pairs.

I really don't have any real data about bandwidth. But we have some performance-hungry servers: one Exchange 2013 server with 600 users, for example, and an Apache2 web server with a WordPress site.

I was thinking of using only the four 1TB disks, since the 600GB disks are not new. But another question came up: with a zpool of two 1TB RAID1 vdevs, do they work like a RAID10 array, with the performance gain of RAID0? Is there something like RAID10 in ZFS?

B) For the 3TB's, fill the remaining slots in the chassis with them and put them in RAIDZ2 or RAIDZ3 for "slow" pool. RAIDZ is considered dangerous because it can only tolerate one drive failure.

After reading your comment I was thinking of this scenario:
Two vdevs in RAID-Z2 with 10 drives each, to comply with the 2^n + 2 rule of RAID-Z. This will "waste" 20 drive slots, keeping 4 drive slots for the 10K zpool.

The point here is: I was thinking of separate zpools for these "slow" vdevs, to act as a backup of the other pool.

C) For SLOG device, you do not need a 120GB SSD. You do want one with supercap or capacitor array. Explanation here.

D) Depending on your working set size, you may be able to make good use of at least one SSD for L2ARC.

Can we partition these two SSDs and use the same devices for both ZIL and L2ARC? I don't know if we will have the budget to replace the SSDs.

And one last comment: these SSDs are attached directly to the SATA3 ports of the motherboard. Should we use the HBA instead? There's one SFF-8087 port available.

Many thanks in advance,
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
4U Expander Backplane, features:
• Single input/output SFF 8087 connectors

I think it's a SAS expander, since it only requires a single SFF-8087 cable.

Yes it is, and it'll work nicely. Bonus: the one controller can handle not only the existing chassis but you can also have the other SFF8087 wired up to an additional chassis if you wish. The only caveat with all that is that the aggregate bandwidth to the backplane becomes limited by the 4 6Gbps lanes (24Gbps or something a bit less than 3GB/sec). Contemporary drives run around 150MB/sec and 24 of them would require about 4GB/sec for "full" capability. Your system is unlikely to be able to push data that quickly in a useful manner, so in my opinion, not an issue. Could be an issue for someone building an all-SSD system though.

I really don't have any real data about bandwidth. But we have some performance-hungry servers: one Exchange 2013 server with 600 users, for example, and an Apache2 web server with a WordPress site.

Let me be blunt here: how in the $&#@(* can you have ESXi and not have any "real data" about your I/O requirements? Go to the host performance tab. Go to "Disk". Look at it.

I was thinking of using only the four 1TB disks, since the 600GB disks are not new. But another question came up: with a zpool of two 1TB RAID1 vdevs, do they work like a RAID10 array, with the performance gain of RAID0? Is there something like RAID10 in ZFS?

It is useless trying to compare ZFS with conventional RAID levels, the marketing droids have screwed up the terminology, and it isn't really correct for ZFS anyways.

ZFS implements virtual devices supporting RAID-style parity protection for data but does so on a block level with variable size blocks. This is RAIDZ or RAIDZ1, which protects against single disk failure. RAIDZ2 protects against double disk failure, and RAIDZ3 against triple. A pool is built out of one or more virtual devices, and ZFS typically balances the pool load amongst vdevs. In an ideal situation, this generally leads to striping between the vdevs.

We typically do not refer to "RAID1" as such, preferring the term "mirroring" instead. A mirror vdev of two devices is effectively RAID1, but you can have additional devices in a mirror for additional redundancy or performance.

ZFS has few limits as to the types of vdevs that can be used to build a pool. If you have a pool with a mirror and a RAIDZ2, it'll cheerfully balance the load.
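
To make the mirror case concrete: a pool of two mirror vdevs ends up with ZFS spreading writes across both mirrors, which is the behaviour people usually have in mind when they say "RAID10". In zpool status it looks roughly like this (illustrative, made-up device names):

      pool: fast
     state: ONLINE
    config:

            NAME          STATE     READ WRITE CKSUM
            fast          ONLINE       0     0     0
              mirror-0    ONLINE       0     0     0
                da0       ONLINE       0     0     0
                da1       ONLINE       0     0     0
              mirror-1    ONLINE       0     0     0
                da2       ONLINE       0     0     0
                da3       ONLINE       0     0     0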

After reading your comment I was thinking of this scenario:
Two vdevs in RAID-Z2 with 10 drives each, to comply with the 2^n + 2 rule of RAID-Z. This will "waste" 20 drive slots, keeping 4 drive slots for the 10K zpool.

I can think of no compelling reason to use separate pools for the two 10 drive RAIDZ2 vdevs, just use a separate pool for the 10K and that should be very flexible.
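
In sketch form (placeholder device names), that's one pool carrying both RAIDZ2 vdevs; the second vdev can also be added later with zpool add:

    zpool create slow \
        raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9 \
        raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19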

Can we partition these two SSDs and use the same devices for both ZIL and L2ARC? I don't know if we will have the budget to replace the SSDs.

Yes, but don't. If you're going to halfarse it, at least halfarse it properly. Just use iSCSI and disregard sync writes. Trying to use the wrong kind of SSD, and a non-dedicated one at that, for SLOG and L2ARC will merely lead to misery when performance sucks, and will lull you into a sense of having done it right, when you haven't. You will be encouraging contention between the SLOG and L2ARC functions.

Best plan is to get an appropriate SSD for SLOG use and then turn on sync writes when you've added this. Put it in next quarter's budget.
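
Both steps are one-liners when the time comes; something like this, with a made-up zvol name and device:

    # for now, accept the risk and let writes go through async
    zfs set sync=disabled fast/esxi-extent

    # later, once a power-protected SSD is in hand: add it as a SLOG, then force sync writes
    zpool add fast log da24
    zfs set sync=always fast/esxi-extent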

And one last comment: these SSDs are attached directly to the SATA3 ports of the motherboard. Should we use the HBA instead? There's one SFF-8087 port available.

Many thanks in advance,

Continue to use the SATA3 port. Great use for it. Leaves the controller less work, distributes the I/O load in the system better. Burning a precious SFF8087 reduces future expansion options.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Hello jgreco, let's continue:


Let me be blunt here: how in the $&#@(* can you have ESXi and not have any "real data" about your I/O requirements? Go to the host performance tab. Go to "Disk". Look at it.

OK, my bad. Since I'm new to the ESXi scene, I don't know basic things about ESXi... and only some servers are starting to be virtualized :)

Well, fortunately the Exchange server, which is the heavy one on I/O, is one that I've already virtualized, so we have this data:

Write Rate:
Min: 633KB/s
Avg: 2643KB/s
Max: 6247KB/s

Read Rate:
Min: 2304KB/s
Avg: 5412KB/s
Max: 11742KB/s

The other servers don't have much I/O, only marginal throughput, something like 80KB/s on average.

This is going to grow, since we are planning some redundancy for our Exchange server and other servers, things like high availability and automatic failover. But at the moment these are the I/O rates.

It is useless trying to compare ZFS with conventional RAID levels, the marketing droids have screwed up the terminology, and it isn't really correct for ZFS anyways.

ZFS implements virtual devices supporting RAID-style parity protection for data but does so on a block level with variable size blocks. This is RAIDZ or RAIDZ1, which protects against single disk failure. RAIDZ2 protects against double disk failure, and RAIDZ3 against triple. A pool is built out of one or more virtual devices, and ZFS typically balances the pool load amongst vdevs. In an ideal situation, this generally leads to striping between the vdevs.

We typically do not refer to "RAID1" as such, preferring the term "mirroring" instead. A mirror vdev of two devices is effectively RAID1, but you can have additional devices in a mirror for additional redundancy or performance.

ZFS has few limits as to the types of vdevs that can be used to build a pool. If you have a pool with a mirror and a RAIDZ2, it'll cheerfully balance the load.

So this is awesome... I know that you may not like the term, but it will work "like" RAID10 with a ZFS backend.

I can think of no compelling reason to use separate pools for the two 10 drive RAIDZ2 vdevs, just use a separate pool for the 10K and that should be very flexible.

I was thinking about the possibility of a ZFS crash on one of the RAID-Z2 pools. The other one would be a backup of that pool and a backup of the 10K RPM pool. And it appears to be a good balance: 10 + 10 + 4 disks.

The question here is: am I thinking about this wrong? This machine wasn't bought to be an iSCSI server for ESXi; it was intended to be a backup server, since we don't have _ANY_ backup at this moment. The 10K drives came into play when we changed our minds about VMware ESXi and its storage requirements. Now it makes sense why we have 24x 3TB drives :)

Yes, but don't. If you're going to halfarse it, at least halfarse it properly. Just use iSCSI and disregard sync writes. Trying to use the wrong kind of SSD, and a non-dedicated one at that, for SLOG and L2ARC will merely lead to misery when performance sucks, and will lull you into a sense of having done it right, when you haven't. You will be encouraging contention between the SLOG and L2ARC functions.

Ok... Simply don't use a ZIL at this moment. I will try iSCSI with synchronous writes without a ZIL; let's see how this will perform.

The question here is: what to do with these two SSD drives? Put them in the 10K RPM pool? L2ARC in striped mode, even with "only" 32GB of RAM?

Best plan is to get an appropriate SSD for SLOG use and then turn on sync writes when you've added this. Put it in next quarter's budget.

So I can add a ZIL later without any problem? And another thing: is it one ZIL for all the zpools, or must there be one for each pool? (In that case, two SSDs per pool, because the ZIL needs mirroring.) If we're going to need a pair of SSDs per pool, the partitioning question comes up once again: since they would be used only for the ZIL, can they be partitioned?

Continue to use the SATA3 port. Great use for it. Leaves the controller less work, distributes the I/O load in the system better. Burning a precious SFF8087 reduces future expansion options.

Awesome!

Thanks once again jgreco... You're helping a lot!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Max: 6247KB/s

Controlling variable. Well within the capabilities of something like an Intel 320 SSD. I've got one running backups that suffers a continuous ~20MBytes/sec.

I was thinking about the possibility of a ZFS crash on one of the RAID-Z2 pools. The other one would be a backup of that pool and a backup of the 10K RPM pool. And it appears to be a good balance: 10 + 10 + 4 disks.

The question here is: am I thinking about this wrong? This machine wasn't bought to be an iSCSI server for ESXi; it was intended to be a backup server, since we don't have _ANY_ backup at this moment. The 10K drives came into play when we changed our minds about VMware ESXi and its storage requirements. Now it makes sense why we have 24x 3TB drives :)

No, that's fine, as long as you have some specific reason. I didn't say a reason couldn't exist, just that it tends to be unusual.

Ok... Simply don't use a ZIL at this moment. I will try iSCSI with synchronous writes without a ZIL; let's see how this will perform.

I can save you the time: "poorly". :smile:

The question here is: what to do with these two SSD drives? Put them in the 10K RPM pool? L2ARC in striped mode, even with "only" 32GB of RAM?

You may or may not have need of them. L2ARC in addition to sufficient RAM tends to make the system faster, but if the system is already sufficiently fast, "faster" is meaningless.
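
If you do want to experiment, adding and removing a cache device is cheap and non-destructive, e.g.:

    zpool add fast cache da25     # attach one SSD as L2ARC
    zpool remove fast da25        # pull it back out if it doesn't help
    # note: L2ARC bookkeeping consumes some ARC (RAM), something to watch with only 32GB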

So I can add a ZIL later without any problem? And another thing: is it one ZIL for all the zpools, or must there be one for each pool? (In that case, two SSDs per pool, because the ZIL needs mirroring.) If we're going to need a pair of SSDs per pool, the partitioning question comes up once again: since they would be used only for the ZIL, can they be partitioned?

You do not need to mirror SLOG devices anymore. A SLOG device failure is not fatal, and the system reverts to using the in-pool ZIL. However, a SLOG device failure will of course cause performance to tank.

You can read more about all of that at the "some insights re: slog" link I provided above.

If I needed to provide multiple SLOG devices, I would take a different approach that I suspect(!) would be faster. I haven't actually tried this. Take a RAID controller with battery backed write cache. I'm *guessing* that any LSI2108 based controller (M5015, etc) with BBU would be fine. Take some fastish hard drives and create a RAID1 mirror. Now create a few small virtual disks or LUNs or whatever it calls them. The write cache forms a very fast SLOG device, the hard drives provide virtually unlimited endurance, and the LUN's provide multiple SLOGs on a single array. It is likely cheaper and more effective than trying to arrange "appropriate" SLOG devices for each pool. But as I said - I haven't tried this particular approach.
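
In other words (untested; on FreeBSD the MegaRAID mfi(4) volumes typically show up as mfidN), each small write-cache-backed LUN would just be handed to a different pool as its log vdev:

    zpool add fast log mfid0    # small LUN carved from the BBU-backed array
    zpool add slow log mfid1    # another small LUN from the same array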
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
You may or may not have need of them. L2ARC in addition to sufficient RAM tends to make the system faster, but if the system is already sufficiently fast, "faster" is meaningless.

LOL. I don't know what to do with these guys now... but OK, we'll find a use for them, even if I have to put them in our HPC cluster!

You do not need to mirror SLOG devices anymore. A SLOG device failure is not fatal, and the system reverts to using the in-pool ZIL. However, a SLOG device failure will of course cause performance to tank.

You can read more about all of that at the "some insights re: slog" link I provided above.

I'll read the entire thread again. Perhaps I read it too quickly and without enough attention (!).

But let's consider a power failure taking the entire storage down and frying the SLOG SSD. Would the data be safe? I'm asking because we've had this happen in the past, and not once or twice... more than five times! With hard disks.

If I needed to provide multiple SLOG devices, I would take a different approach that I suspect(!) would be faster. I haven't actually tried this. Take a RAID controller with battery backed write cache. I'm *guessing* that any LSI2108 based controller (M5015, etc) with BBU would be fine. Take some fastish hard drives and create a RAID1 mirror. Now create a few small virtual disks or LUNs or whatever it calls them. The write cache forms a very fast SLOG device, the hard drives provide virtually unlimited endurance, and the LUN's provide multiple SLOGs on a single array. It is likely cheaper and more effective than trying to arrange "appropriate" SLOG devices for each pool. But as I said - I haven't tried this particular approach.

This appears to be good. The old pair of 600GB 10K RPM disks would be useless otherwise, so getting a BBU controller to use those disks as a SLOG would be interesting. And it appears to be cheaper too, since otherwise we would need two pairs of SSDs (or two single ones, depending on whether the SLOG needs mirroring) to satisfy our needs. We don't need a SLOG on the backup pool, only on the "fast" pool and the "slow" pool.

jgreco, thanks once again! :)

PS: I've learned the difference between ZIL and SLOG, and I won't get it wrong again! haha!
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
But let's consider a power failure taking the entire storage down and frying the SLOG SSD. Would the data be safe? I'm asking because we've had this happen in the past, and not once or twice... more than five times! With hard disks.
Get a damn quality UPS already. If you addressed this elsewhere I missed it.

Did it fry only the SLOG and not the pool? Any "in-flight", i.e. sync, writes would be lost up to the previous txg. Best practice still is to mirror the SLOG same as always. Keep in mind this is an edge case you are protecting against: not committing the current txg, losing the SLOG, but keeping the pool.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's a very unusual edge case at that. The SLOG mirror advice is a remnant of the days when loss of SLOG would result in loss of pool. It is still a good idea to do, but doing it with SSD borders on the impractical, due to the need to buy relatively expensive SSD's for SLOG. I've been relatively unhappy with the available SLC offerings due to price, and the MLC offerings are not much better. Of course, if you simply disregard the power loss protection requirement, your options become much greater ... but then, as I suggested, just disregard the SLOG and disable sync writes and you've got a faster version of the same thing for a cheaper price.

But for a system with three pools, needing six expensive SSD's for mirrored SLOG becomes prohibitively costly and complicated to implement anyways. That's why it would probably be worth experimenting with a 2108, BBU, and a pair of conventional drives, mirrored, and create some small volumes/LUN's on it. I'm pretty sure it can do that.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
@jgreco; another variable in the game...

Our Kingston V300 SSDs are MLC... so can they be used? I saw this here: http://www.anandtech.com/show/6733/kingston-ssdnow-v300-review

And since we don't need a mirrored SLOG anymore, and we have two SSDs... this completes the storage server. One SSD for the 10K pool and the other for the RAID-Z2 pool, leaving the backup pool without an external SLOG device.

Thanks in advance,
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Make sure the ssd's are 'power protected'. Most (all?) ssd's use a dram cache as a write buffer to speed up data going to the flash chips. If the power fails, and there's data in the ssd's ram that hasn't made it onto the flash chips yet, will the ssd be able to flush the ram before the flash chips lose power? If not, you're still risking data loss due to unexpected power failure. If the ssd can't guarantee data in the dram buffer is 'safe', then I'd think there'd not be much difference from just disabling sync and accepting the potential loss of written data that way.

Some ssd's have a super capacitor, or a bank of regular capacitors, that can keep the flash chips 'powered up' long enough for the data in the dram chip(s) to make it onto stable storage. jgreco is definitely the go-to guy for which ssd's are suitably 'power protected', and which ones aren't.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Make sure the ssd's are 'power protected'. Most (all?) ssd's use a dram cache as a write buffer to speed up data going to the flash chips. If the power fails, and there's data in the ssd's ram that hasn't made it onto the flash chips yet, will the ssd be able to flush the ram before the flash chips lose power? If not, you're still risking data loss due to unexpected power failure. If the ssd can't guarantee data in the dram buffer is 'safe', then I'd think there'd not be much difference from just disabling sync and accepting the potential loss of written data that way.

Some ssd's have a super capacitor, or a bank of regular capacitors, that can keep the flash chips 'powered up' long enough for the data in the dram chip(s) to make it onto stable storage. jgreco is definitely the go-to guy for which ssd's are suitably 'power protected', and which ones aren't.

Hello titan, that's my question: aren't the MLC ones the version with the bank of capacitors?
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Depends on the ssd. Only certain ssd's have the capacitor thing. MLC / SLC doesn't determine whether they are 'good' to use as a SLOG device. Sometimes it's hard to determine if a particular ssd has that feature or not. It's certainly not a feature that will sell ssd's to the home pc market. As I said, jgreco knows far more than I about suitable ssd's for slog.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Depends on the ssd. Only certain ssd's have the capacitor thing. MLC / SLC doesn't determine whether they are 'good' to use as a SLOG device. Sometimes it's hard to determine if a particular ssd has that feature or not. It's certainly not a feature that will sell ssd's to the home pc market. As I said, jgreco knows far more than I about suitable ssd's for slog.

Hmm... Understood.

Let's wait for jgreco's blessing once again :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I actually don't have anything much more specific for you than the link I previously posted to my SLOG/ZIL thread. titan_rw may be a little too kind with his words because I've mostly tried to avoid needing to deploy SSD based SLOG. My poor little N36L backup target has a 40GB Intel 320 in it; the fricken' thing was still like $140 and it reduced a system that had been writing NFS with sync=disabled at ~40MBytes/sec to more like ~20MBytes/sec with the SLOG ... which was totally fine for the purpose. But the speed reduction and limited MLC endurance of that device makes me feel I haven't spent wisely, a feeling I hate. Finding a good SLOG is hard; you can find numerous other intelligent discussions out there.

Ultimately I'll end up with a SLOG device on a RAID controller for most future builds because we usually build on top of ESXi, and the cost differential to select a board with a 2208 is modest. The N36L was kind of an experiment because it was cheap at the time.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
It's a very unusual edge case at that. The SLOG mirror advice is a remnant of the days when loss of SLOG would result in loss of pool.
IIRC, Sun did not initially have a mirrored SLOG option as they viewed it the same way. Mirrored SLOG was added as customers saw it as a SPOF and a failed SLOG required manual intervention to import an exported pool. There were some undocumented switches that allowed this, and they eventually even added one specifically to import with a failed SLOG.

But for a system with three pools, needing six expensive SSD's for mirrored SLOG becomes prohibitively costly and complicated to implement anyways.
Or two SSD's shared between the three pools. Unless the expected write rate of the pools exceeds the SSDs' capacity.

That's why it would probably be worth experimenting with a 2108, BBU, and a pair of conventional drives, mirrored, and create some small volumes/LUN's on it. I'm pretty sure it can do that.
As long as the write rate does not exceed the controller's ability to keep up including actually writing it out. As you know SLOG is all about latency and NVRAM still beats out your typical SSD.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Or two SSD's shared between the three pools. Unless the expected write rate of the pools exceeds the SSDs' capacity.

Not really supported by FreeNAS but not horribly hard to do from the CLI.
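
Roughly, from the shell (partition sizes and labels are just illustrative, and FreeNAS won't manage devices set up this way for you):

    # carve each SSD into one small partition per pool
    gpart create -s gpt da24
    gpart add -t freebsd-zfs -s 8G -l slog0-fast da24
    gpart add -t freebsd-zfs -s 8G -l slog0-slow da24
    gpart create -s gpt da25
    gpart add -t freebsd-zfs -s 8G -l slog1-fast da25
    gpart add -t freebsd-zfs -s 8G -l slog1-slow da25

    # give each pool a mirrored SLOG built from one partition on each SSD
    zpool add fast log mirror gpt/slog0-fast gpt/slog1-fast
    zpool add slow log mirror gpt/slog0-slow gpt/slog1-slow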

As long as the write rate does not exceed the controller's ability to keep up including actually writing it out. As you know SLOG is all about latency and NVRAM still beats out your typical SSD.

For the listed system, RAID + battery backed write cache and a pair of reasonably fast hard drives ought to be sufficient, and quite possibly lower latency.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I know my Areca 1280ML-24 has a DDR2 ECC RAM slot (mine has 2GB of RAM). I've shoved more than 1GB of data into the write cache in less than 2 seconds before, and the Areca 1280ML-24 can be had for about $200 on eBay.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Or actually the 12-port variant, which is a different part number. Aside from being 3Gbps SATA, the potential for a large write cache seems interesting. Can you control what gets allocated to write cache?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not really. The options are "enabled, enabled with BBU present, and disabled".
 