Tiered storage - Mech + SSD in same pool or in separate pools?

SamM

Dabbler
Joined
May 29, 2017
Messages
39
Short version: Does it make any sense to add SSD mirror VDevs to mechanical drive mirrors? Is FreeNAS smart enough to put 'hot data' on the SSD portion of the pool and the rest of the data on the slower spindles; or do you only gain capacity in the pool by doing so? Is it better to have two pools, one mechanical and the other SSD, then sort what data goes to which pool manually via different (iSCSI to ESXi) shares?

My current setups tend to be a server with 12 mechanical 4TB drives, 4 VDevs of 3-way mirrors for a rough capacity of 16TB. My current test server has a 3-way mirror using the same 4TB mechanical drives, with another 6 sitting unused to play around with. This test server also has seven 1TB SSDs, the thinking being 2 sets of 3-way mirrors plus one hot spare. The test server let me add a 3-way mirror SSD VDev to the existing 3-way mirror mechanical pool, but I'm not sure if I gain anything other than capacity by doing so.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
A pool will be treated as a uniform device.

There is no auto tiering in FN other than ARC or L2ARC (which you have to add manually); a SLOG isn't caching or tiering.

Create a separate SSD pool and manually assign access.

iSCSI will likely require mirrored SLOG for performance reasons.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Short answer: FreeNAS is not smart enough to migrate hot data to an SSD. All ZFS storage is just block storage of varying speeds; there's no disk-speed-based autotiering. Before considering an SSD L2ARC, plan to up your RAM instead - this *is* used as the primary cache to speed things up. There are posts around on when/why adding L2ARC (and ZIL and SLOG) is useful, and more importantly when it is not; go read around the topic, because it's workload- and hardware-dependent.

Even shorter answer: Set yourself up a separate SSD pool and manually use it for SSD-type tasks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
ZFS does support storage tiering in the form of L2ARC, but really that's about it.

If you want really lightning fast storage for your hot spots on HDD, there are two basic things to do:

1) For read, have sufficient ARC+L2ARC to cover the working set. So if your working set is 1TB of data out of 10TB total data, have at least 128GB main memory and a pair of 500GB SSDs. Ideally 256GB main memory. Your working set reads will *fly*. This does not actually cause "storage tiering" in the sense that it moves data off of the HDD and onto SSD, but the effect is basically the same, with the added advantage that you do not need RAID or mirroring for the SSDs, unlike a typical tiering solution. All data is always on the HDD in the pool. The stuff you use frequently is on SSD.

2) For write, have GOBS of free space on the pool. Never fill a pool more than 50%. Keep it below 25% if you can. Once you get below that, it really starts to feel like an all-SSD pool even though it is HDD. This is because ZFS is converting all your writes, random, sequential, etc., into sequential writes. This isn't storage tiering either, but it will feel fast, so maybe that doesn't matter.
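To sanity-check the sizing arithmetic in point 1, here's a rough back-of-envelope sketch. The `arc_fraction` figure is an illustrative assumption (how much of main memory ends up acting as ARC), not a ZFS tunable:

```python
# Back-of-envelope ARC + L2ARC sizing for a cached working set.
# All figures are illustrative assumptions, not FreeNAS defaults.

def cache_coverage_gib(ram_gib, l2arc_gib, arc_fraction=0.75):
    """Rough read-cache capacity: a chunk of RAM acts as ARC, plus
    the L2ARC SSDs (striped -- no mirror needed, since all data
    still lives on the pool's HDDs)."""
    return ram_gib * arc_fraction + l2arc_gib

working_set_gib = 1024          # 1 TB of hot data out of a 10 TB pool
coverage = cache_coverage_gib(ram_gib=128, l2arc_gib=2 * 500)

print(f"cache ~{coverage:.0f} GiB vs working set {working_set_gib} GiB")
print("working set fits in cache:", coverage >= working_set_gib)
```

With 128GB RAM and a pair of 500GB SSDs, the working set just fits; 256GB RAM gives comfortable headroom (and remember that L2ARC headers themselves consume some ARC memory, which this sketch ignores).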

"iSCSI will likely require mirrored SLOG for performance reasons.

That reason being... lowering performance? Skip the mirroring unless it is critical that you maintain performance even in the face of a SLOG device failure. You will lose some performance to the mirror. Keep one SLOG device active and keep another device installed as a spare, and if it fails, disconnect the failed device and attach the spare. I'm pretty sure someone was working on automating that through zfsd but I don't recall the specifics.
 
Last edited:

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Never fill a pool more than 50%. Keep it below 25% if you can.

Is this really true? Maybe I'm misreading this, but it seems extremely wasteful to me.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is this really true?
Yes, it's really true:
If you want really lightning fast storage for your hot spots on HDD
For best write performance, have lots of free space. If write performance isn't as critical, neither is the free space.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Yes, it's really true:

This is so hard for me to believe. Even the standard recommendation to only have your pools at 80% max seems extremely wasteful (and expensive) to me, especially when we're getting into 24-48+TB.

I'd like to understand why this is the case. I'll Google around but if you could post any docs you have on this I'd appreciate it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
This is so hard for me to believe.
I don't see why. The best write performance is going to be with sequential writes. The greater the free space in your pool, the more writes will be able to be sequential. QED.

ZFS is not optimized for resource-constrained applications.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
The best write performance is going to be with sequential writes.

I think this just clicked for me. Thanks.

ZFS is not optimized for resource-constrained applications.

Tell me about it. I'd still be using XFS if backups didn't take 24+ hours on millions of files. I love ZFS, don't get me wrong, but it comes with a considerable cost.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I love ZFS don't get me wrong, but it comes with a considerable cost.

Well... But consider what the implications are here. Below 25%, virtually all writes get converted to sequential. Below 50%, most writes get converted. Above 50%, what happens? The statistics start to shift such that ZFS has to move the heads to "dodge" sectors already in use. These statistics keep sliding as the pool fills, and things start to get really bad around 80%. It's simply a statistics game. You want fast writes? Keep the pool empty and those heads still. Want to manage costs? Fill the pool. All it will do is slow down.
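The statistics game can be sketched with a toy simulation: randomly mark blocks as used and measure the average contiguous free run the allocator has to work with. This is purely illustrative; real ZFS allocation (metaslabs, first-fit/best-fit heuristics) is far more sophisticated.

```python
# Toy model: as a disk fills with randomly placed data, the average
# contiguous free run shrinks, so the heads must seek more often.
import random

def avg_free_run(fill_fraction, nblocks=100_000, seed=42):
    """Average length of contiguous free-block runs at a given fill level."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    used = [rng.random() < fill_fraction for _ in range(nblocks)]
    runs, run = [], 0
    for u in used:
        if not u:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return sum(runs) / len(runs)

for fill in (0.25, 0.50, 0.80):
    print(f"{fill:.0%} full -> avg contiguous free run "
          f"~{avg_free_run(fill):.1f} blocks")
```

At 25% full the average free run is roughly 4 blocks; at 80% it drops to barely over 1, which is exactly the "dodging sectors already in use" effect described above.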

As the sign at the Hot Rod shop says... "Speed costs money. How fast do you want to spend?"
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Hard drives have always slowed down as they get full; it has been that way from the beginning of time as far as I know. Maybe someone can correct me on this?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Hard drives have always slowed down as they get full; it has been that way from the beginning of time as far as I know. Maybe someone can correct me on this?

There's a bunch of reasons... You may be thinking of the linear velocity and areal density drop of the media as you seek inward towards the spindle. The inner cylinders effectively present their bits slower, because of good old C = 2 * Pi * r while the RPM stays constant. But the bit density also has to go up to keep the same number of sectors in a smaller track. Some drives used to shrink the number of sectors in the inner cylinders, or play games remapping two cylinders into one, etc... The net effect is the same: the heads have to seek more often.

There's a software reason too... Fragmentation. If you want to write a 1GB file to an empty disk, the FS driver can just slap it down anywhere. Writing that same 1GB file to a disk with only 1GB of free space remaining... Unless the filesystem driver is very, very clever, the head has to hop all over the place to fill in the "holes" where the free sectors are.
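The C = 2 * Pi * r point is easy to put numbers on. The radii below are ballpark assumptions for a 3.5" platter, chosen for illustration only:

```python
# At constant RPM, the media passes under the head more slowly on
# inner cylinders: linear velocity scales directly with radius.
import math

RPM = 7200
outer_r_mm, inner_r_mm = 46.0, 20.0   # assumed 3.5" platter radii

def linear_velocity_m_s(radius_mm, rpm=RPM):
    """Linear velocity of the track under the head, in m/s."""
    return 2 * math.pi * (radius_mm / 1000) * rpm / 60

v_outer = linear_velocity_m_s(outer_r_mm)
v_inner = linear_velocity_m_s(inner_r_mm)
print(f"outer track: {v_outer:.1f} m/s, inner track: {v_inner:.1f} m/s")
print(f"inner/outer ratio: {v_inner / v_outer:.2f}")  # equals 20/46
```

With these assumed radii, the innermost track moves at well under half the speed of the outermost, which is why drives that keep sectors-per-track constant present their bits that much slower near the spindle.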
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Well... But consider what the implications are here.

Interesting. This all makes sense. I don't have anything write heavy but I still plan on going no higher than 80%.

What about fragmentation? Does that factor in as free space decreases?
 


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't see why. The best write performance is going to be with sequential writes. The greater the free space in your pool, the more writes will be able to be sequential. QED.

Today's FreeNAS Forums Award for Most Concise Answer goes to @danb35 ... also saving the OP from having to read a much lengthier diatribe by me. :smile:
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
That reason being... lowering performance? Skip the mirroring unless it is critical that you maintain performance even in the face of a SLOG device failure. You will lose some performance to the mirror. Keep one SLOG device active and keep another device installed as a spare, and if it fails, disconnect the failed device and attach the spare. I'm pretty sure someone was working on automating that through zfsd but I don't recall the specifics.

My perhaps incorrect understanding was that losing the SLOG when using SYNC=ALWAYS, such as with iSCSI, can be "bad". Hence use a mirrored SLOG if a SLOG is needed, the SLOG being needed only for performance reasons, such as SYNC=ALWAYS with iSCSI.

https://forums.freebsd.org/threads/zfs-slog-ramifications-it-if-fails.66944/

Listen to those two, jgreco and cyberjock. They know what they're talking about.
As jgreco pointed out, ZFS uses a ZIL whether it's on the disks or on SSDs. A SLOG is just a separate ZIL. As far as a SLOG making a pool slightly less fragmented: I've never heard of that; I'd say don't worry about it.

The internet is plagued with ZFS, SLOG, L2ARC copypasta, so I'll continue...

if that fails then your pool will definitely fail as well.
No. The consistency of the application data may be compromised, but the pool will not fail. As stated, yes, it is a log of synchronous transactions. Though, in normal operation, data never flows "through" the SLOG. Data is only read from the SLOG on the next import after a power failure. Therefore, if your SLOG fails catastrophically during regular operation, your pool will not. ZFS will learn the device isn't accepting writes anymore and mark the pool as degraded. If your SLOG fails during or right after a power failure, you will lose whatever data is in the SLOG, but your pool will certainly not "definitely fail as well". If the SLOG is mirrored as it should be, ZFS will figure out which mirror device holds the correct data.

risks of using this approach
If you implement a SLOG incorrectly, it will be no safer than simply setting "sync=disabled" pool-wide. Correct implementation means mirrors of capacitor- or battery-backed flash. These are not suggestions, these are functional requirements. That said, if you do it right, yeah, you'll be pretty safe, and you'll benefit from the performance of SSDs for synchronous writes.

Why would you want to use synchronous transactions in the first place though? What is the goal here?
Synchronous writes exist for your most important data; without them, an application has no guarantee of consistency. Databases issue a lot of synchronous writes.

@freebsdinator , since you are running a database, I absolutely advise you to use a mirrored SLOG. This is a good application for the feature.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What about fragmentation? Does that factor in as free space decreases?

You have the beginnings of the right idea, but there's another factor.

So the thing you really need to understand is that ZFS does try to convert many of its writes into generally contiguous writes. The problem with this is that when you need to REwrite something, the fact that it's a Copy-on-Write filesystem means it will write that new bit of data somewhere ELSE, leaving a free space in the middle of previously-written data.

It is very hard to fight fragmentation when you're using a CoW filesystem. It's a losing battle in the long run. So one strategy is just to love the system and play to its strengths. Look at this:

https://extranet.www.sol.net/files/freenas/fragmentation/delphix-small.png

This graph was generated by Delphix on a single-drive pool to demonstrate the concept. It isn't letting me insert that for some this-forumware-is-stupid reason. Anyways, at the start, every ZFS pool is lightning fast for writes because it's empty. What's more interesting is what happens once your pool gets to equilibrium and you get a steady state performance. The graph helps to illustrate steady state performance, writing random data. Steady state means that there have been enough writes and frees to the pool that the performance has gotten as bad as it is likely to, and fragmentation has stabilized. This may well be a point you never reach, but it's worth contemplating.

What you should notice is that once you hit about 50%, the write speeds are already pretty low, and there isn't a huge drop between there and the commonly-suggested 80%. Once you get to 80%, other problems start to appear, but for performance, this is more meaningful.

So here's the other thing. The program driving that graph is writing random data to a ZFS filesystem on a single hard disk. Normally, if you were to write random data to an HDD-backed UFS or DOS filesystem, you would expect to see MAYBE 250KB-500KB/sec total (charitably, ~100 IOPS at 4K blocks ~= 400KB/sec, or more realistically ~250 IOPS for 512-byte blocks). At 10% occupancy, ZFS is punching stuff out there at 6MBytes/sec. But even at 80%, it appears to be maybe doing around 750KBytes/sec.
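The throughput arithmetic in that parenthetical, spelled out. These are the post's own rough IOPS assumptions for a single HDD, not benchmark results (and note the post rounds the numbers generously):

```python
# Random-write throughput is just IOPS times block size.

def iops_to_bytes_per_sec(iops, block_bytes):
    return iops * block_bytes

charitable = iops_to_bytes_per_sec(100, 4096)      # ~100 IOPS at 4K blocks
realistic = iops_to_bytes_per_sec(250, 512)        # ~250 IOPS at 512B blocks
print(f"charitable: ~{charitable / 1024:.0f} KB/s")  # ~400 KB/s
print(f"realistic:  ~{realistic / 1024:.0f} KB/s")   # ~125 KB/s
```

Against those few hundred KB/sec, the 6MBytes/sec that ZFS manages at 10% occupancy is roughly a 15x improvement, and even the 750KBytes/sec at 80% still beats the naive random-write figure.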

So if you keep your HDD pool largely empty, you start to get speeds that look more like SSD. You can take advantage of this to help design storage that works a lot faster than conventional storage, even though you're making it largely out of conventional HDD.

This is all compsci trickery where you are exchanging one thing for another. There's a price to be paid. ZFS is piggy on resources. If you resource a ZFS system sufficiently, it will fly. But it can be a lot of resources to get there.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
My perhaps incorrect understanding was that losing SLOG when using SYNC=ALWAYS such as iSCSI can be "bad". Hence use mirrored SLOG if SLOG is needed, SLOG being only needed for performance reasons, such as SYNC=ALWAYS with iSCSI.

I disagree; use cases vary.

I would not use a mirrored SLOG except in certain circumstances.

1) High quality SLOG devices are hideously expensive. If you mirror two of them, you are burning through twice as much capex, and when one of them fails, the other one may be near failure as well, meaning a sudden expense and a need to acquire a hard-to-source part (i.e. you can't just go down to Best Buy). So you probably need to buy three and keep one as a cold spare.

2) If a SLOG device fails, ZFS will fall back to using the in-pool ZIL. This will feel like hitting a brick wall from a performance point of view. I believe that someone was working on making zfsd be able to spare in a SLOG device in the event of failure. If that's the case, I'd totally try to go that route. Otherwise, if you can't afford a short term performance hit, perhaps mirrored SLOG is the right solution for you.

3) The other thing people forget is that the ONLY thing the SLOG buys you is consistency in the event that your FreeNAS system is hit by a power failure, panic, or other non-graceful reboot. How often does that actually happen?

4) Also don't pay attention to ancient advice. There was a time that ZFS would freak out if it lost a SLOG device, and wouldn't import a pool if the SLOG had failed. Lots of the "must mirror SLOG" information out there seems to assume this is still true. It isn't.

So if you're a bank and you simply MUST prevent inadvertent database corruption or repeat transactions, and you MUST maintain a high level of performance, that calls for mirrored SLOG all the way. If you're running VM's and don't want the hassle of corruption in the unlikely event your FreeNAS panics or loses power, but you can deal with a temporary performance loss while you slot in the new SLOG, then you don't need a mirror.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
4) Also don't pay attention to ancient advice. There was a time that ZFS would freak out if it lost a SLOG device, and wouldn't import a pool if the SLOG had failed. Lots of the "must mirror SLOG" information out there seems to assume this is still true. It isn't.

This seems to be where my information was out of date.

Thank you for the explanation.
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
Wow, this thread went from 0 to 60 all of a sudden today. Thanks, everybody, for all the input.

I thought this test system was sitting pretty at 96GB RAM, considering that my previous systems had 32GB to 64GB. But if the recommendation in jgreco's post #4 is 128GB, or even 256GB, then I've got a long way to go, considering all the current DIMM slots are full... The server can do it; it's just a matter of money that I don't have at the moment.

Regarding the part(s) about 25% vs 50%-80% free space in a pool translating to increased write performance, I understand the bit about the physical limitations of spinning platters and flailing actuator arms swinging heads wildly. Is this free-space-to-performance relationship still true if the pool is all SSD? SSDs don't have the seek times, nor the reduced-speed-as-the-drive-fills issue, that mechanical drives have, right?

Due to this thread and a change in my project objectives, I'm thinking of changing up my original plan spelled out in used-self-refurb-hp-dl380e-g8-sff-ssd-iscsi-server-primarily-for-esxi. In summary, what was supposed to start as a 3-way 1TB SSD VDev striped against a second identical 3-way VDev, and eventually a total of 8 such VDevs plus a single 1TB spare (25 total disks), is getting merged with the server's predecessor, which has 3-way 4TB mechanical drive VDevs striped across 4 identical VDevs (12 total disks). Reading above, it seems the advice is to keep the SSDs and mechanical disks as separate pools. So I'm thinking of taking the base server (96GB RAM, not the original 32GB), sticking the 12 3.5" 4TB mechanical drives in that with the same 3-way mirror VDevs striped over 4 VDevs (12 disks), then hanging a 25-bay (2.5"/SFF) enclosure off the base server with the SSDs over there (7 to start, eventually 25 if I hit capacity), again in sets of 3-way mirror VDevs + 1 spare, and in their own pool.

I have a pair of what I was told are mediocre NVMe drives (MyDigitalSSD 240GB (256GB) BPX 80mm (2280) M.2 PCI Express 3.0 x4 (PCIe Gen3 x4) NVMe MLC SSD), one of which I was originally going to use as a SLOG device. I was kind of talked out of it in the previous thread, as the gains were questionable since the pool was already all SSD. The only theoretical gain of using that NVMe as a SLOG against an SSD pool was to reduce writes being made to the SSD pool, thus extending the pool's write lifespan at the expense of the SLOG NVMe itself. That all changes if this server is to have a mechanical pool again, so perhaps I should repurpose that NVMe as a SLOG for the mechanical pool? Yes? No?

For what it's worth, my desired priorities, in order (highest to lowest), are integrity/availability, then cost, then speed & capacity somewhat tied; hence the triple-redundancy drives, redundant server hardware (albeit used/refurbished), etc... So even if I fill out 12 4TB disks (16TB) & 24/25 1TB SSDs (8TB), that's still a meager 24TB, or 19.2TB @ 80%. I will likely follow the advice of setting 'sync=always', and overall, the purpose of this server is to power ESXi (likely via iSCSI, which is what I've been doing thus far). Our data is largely VMs but includes some file servers and a physical database server. As resources permit, I'll build an identical server and then figure out how to auto-replicate the 1st to the 2nd as a very basic Disaster Recovery plan (aside from other point-in-time backups).
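The capacity figures in that plan can be sanity-checked quickly. This sketch uses raw drive sizes; real formatted/usable capacity will come out somewhat lower:

```python
# Usable capacity of striped N-way mirror vdevs: each mirror vdev
# contributes one drive's worth of space. Sizes in TB (raw).

def usable_tb(num_drives, drive_tb, mirror_way, spares=0):
    """Usable TB for a pool of striped mirror vdevs plus hot spares."""
    vdevs = (num_drives - spares) // mirror_way
    return vdevs * drive_tb

hdd = usable_tb(12, 4, mirror_way=3)              # 4 vdevs x 4 TB = 16 TB
ssd = usable_tb(25, 1, mirror_way=3, spares=1)    # 8 vdevs x 1 TB = 8 TB
total = hdd + ssd
print(f"HDD {hdd} TB + SSD {ssd} TB = {total} TB; "
      f"{total * 0.8:.1f} TB at the 80% guideline")
```

That reproduces the post's 16TB + 8TB = 24TB, or 19.2TB at the 80% fill guideline (and considerably less if you hold the HDD pool to the 25-50% levels discussed earlier in the thread).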
 