Absurd latencies with SMR drives

deafen · Aug 6, 2019

I posted a few days ago about high latencies with new WD 6TB drives (WD60EZAZ) and got a lesson about SMR. Unfortunately, I'm in the boat now, so I need to make the best of the situation.

In the middle of resilvering drive #6 of 7 in my raidz2 array, I'm seeing what I would consider truly absurd behavior. Read latencies are in the >500ms range, and I've seen them as high as 1300ms. Queue lengths are in the 60-70 ops range.

In the gstat output below, da0 is my boot SSD, da2 and da3 are the old 3TB PMR drives, da8 is the resilvering target, and the rest are new drives that have already been resilvered.

Is this just a pathological case of encountering lots of small files, resulting in a pile of tiny writes that are backing up due to having to rewrite full tracks? In this very same resilver, I've seen speeds as high as 50MB/s, which I assume were multi-GB files like VMDKs. This seems like one area in which a traditional track-by-track RAID6 implementation would have an advantage over raidz2. Is it possible to force that behavior in zfs? (Too late for this array, but might be an interesting idea for future SMR-based arrays.)

On another note, the gstat output doesn't make a lot of sense; shouldn't the ops/s equal the sum of r/s and w/s? That's off by more than a factor of 10 in this output.

deafen · Aug 6, 2019

Found the missing ops: they're delete ops. Not sure what they're for, but that would explain at least some of this, if every delete also caused a full-track rewrite.
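For anyone hitting the same confusion, here is a minimal sketch of the accounting. As I understand it, gstat's ops/s counts every operation type, while delete (BIO_DELETE) and other (BIO_FLUSH) ops only get their own columns when gstat is run with -d and -o. The numbers and the dictionary keys below are made up for illustration; they are not taken from the output in this thread.

```python
# Hypothetical per-device sample; values are invented for illustration only.
sample = {"ops_s": 1450.0, "r_s": 60.0, "w_s": 55.0, "d_s": 1335.0, "o_s": 0.0}

def hidden_without_delete_column(s):
    """Ops/s that look 'missing' when only r/s and w/s are displayed."""
    return s["ops_s"] - (s["r_s"] + s["w_s"])

def unaccounted_ops(s):
    """Ops/s still unexplained once delete and other ops are included."""
    return s["ops_s"] - (s["r_s"] + s["w_s"] + s["d_s"] + s["o_s"])

print(hidden_without_delete_column(sample))  # 1335.0
print(unaccounted_ops(sample))               # 0.0
```

With the delete column included, the totals reconcile; without it, ops/s can exceed r/s + w/s by a large factor.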

Constantin · Aug 6, 2019

Could a SLOG help with resilvering? I don’t know since I have yet to run a performance comparison. But I wonder if all those small writes may be hurting your latency a lot and hence a SLOG may help.

deafen · Aug 7, 2019

Constantin said:
Could a SLOG help with resilvering? I don’t know since I have yet to run a performance comparison. But I wonder if all those small writes may be hurting your latency a lot and hence a SLOG may help.

I've heard that it won't make a difference, because SLOG is only used for sync writes, and from what I understand resilvering is an inherently async operation because of the way they prioritize the I/O below user stuff.

I'm not too worried, because resilvering shouldn't happen often, and with raidz2 the chances of an additional double-disk failure while resilvering are still pretty low, even at 36 hours. And I've got good offsite backup. I just find it interesting how this particular technology seems to uncover some pretty pathological edge cases.

Constantin · Aug 7, 2019

I’m surprised that resilvering wouldn’t use sync, but I wasn’t able to find any info on that aspect of pool repair. I would have thought that the pool repair business would use the same preferences re: sync as the pool in general, i.e. on if the pool is set to sync and off if it is not.

deafen · Aug 7, 2019

Constantin said:
I’m surprised that resilvering wouldn’t use sync, but I wasn’t able to find any info on that aspect of pool repair. I would have thought that the pool repair business would use the same preferences re: sync as the pool in general, i.e. on if the pool is set to sync and off if it is not.

I think resilvering sits at a lower level than that, since it's effectively a hardware operation. And besides, I've got sync disabled on the whole pool.

Constantin · Aug 7, 2019

You are absolutely right, a SLOG wouldn’t help with an async pool. I’m surprised though that a sync=on pool would not resilver with sync.

Perhaps the system performs a full disk “scrub”, comparing what it should find vs. what it reads after the initial resilver process has completed? How else could one guarantee data integrity post-resilver?

anmnz · Aug 7, 2019

Constantin said:
I’m surprised though that a sync=on pool would not resilver with sync.

Um... why? The sole point of a sync write is that the write operation does not return to the client until the data is written to persistent storage. As there is no client to return to, going through the sync write mechanism for resilvering would be utterly pointless and a horrible waste of resources.
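To make the distinction concrete, here is a rough userspace analogy in Python, not ZFS code. The function names and payloads are made up; the only point is where the call returns relative to the data reaching stable storage.

```python
import os
import tempfile

def write_async(path, data):
    """Return once the data is handed to the OS; it may still be in RAM."""
    with open(path, "wb") as f:
        f.write(data)

def write_sync(path, data):
    """Return only after the OS reports the data has been flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # block until the OS has flushed to the device

background = b"resilver-like background write"
client = b"client write that must be durable"

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "async.bin")
    s = os.path.join(d, "sync.bin")
    write_async(a, background)
    write_sync(s, client)
    sizes = (os.path.getsize(a), os.path.getsize(s))

print(sizes)
```

Both files end up with the same kind of contents; the sync variant just makes the caller wait. With no caller waiting on a resilver, that wait buys nothing.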

Constantin · Aug 7, 2019

I must be misunderstanding how sync=on works then based on what I read in the SLOG demystified article that iXSystems hosts?

How does an async resilver guarantee that what was supposed to be written was in fact written despite the possibility of power loss etc? Does the system read every written sector as a separate transaction afterwards?

AlexGG · Aug 7, 2019

This is a resilver; it is different from an application write request.

You do not need to guarantee that what was supposed to be written was in fact written. Instead, you need to know the last point (in time or whatever) up to which data is guaranteed to have been written. If power fails, you then restart the resilver from the latest point which you know for certain was in fact written. This is possible because either you will have resilvered data (which will then verify good) or you have original data (which you will resilver again). The worst case penalty is that some small part of the pool has to be resilvered twice.
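The idea above can be sketched as a toy simulation. This is not how ZFS tracks resilver progress internally; the function, the checkpoint interval, and the block lists are all hypothetical and only illustrate "restart from the last point known to be written".

```python
def resilver(source, target, checkpoint, fail_at=None, checkpoint_every=4):
    """Copy blocks from `checkpoint` onward.

    Returns (checkpoint, blocks_written). On a simulated power loss the
    returned checkpoint is the last position known to be durable.
    """
    written = 0
    pos = checkpoint
    while pos < len(source):
        if fail_at is not None and pos == fail_at:
            # Power lost: anything after the checkpoint is not trusted.
            return checkpoint, written
        target[pos] = source[pos]
        written += 1
        pos += 1
        if pos % checkpoint_every == 0:
            checkpoint = pos  # record durable progress
    return len(source), written

source = list(range(10))
target = [None] * 10

cp, w1 = resilver(source, target, 0, fail_at=6)  # dies at block 6
cp2, w2 = resilver(source, target, cp)           # resumes from checkpoint 4

print(cp, w1, cp2, w2, target == source)  # 4 6 10 6 True
```

Twelve block writes for ten blocks: blocks 4 and 5 were written twice, which is the "small part resilvered twice" penalty.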

anmnz · Aug 7, 2019

Constantin said:
I must be misunderstanding how sync=on works then based on what I read in the SLOG demystified article that iXSystems hosts?

How does an async resilver guarantee that what was supposed to be written was in fact written despite the possibility of power loss etc? Does the system read every written sector as a separate transaction afterwards?

OK, I see where you are coming from. Maybe this helps:

It is a key guarantee of ZFS that the on-disk format is always consistent. Roughly, if you are doing a big write to disk, once the new blocks containing your data have been written to disk, the final write that actually links those blocks into the current state of the filesystem is small and atomic. (This is why we don't have an "fsck" for ZFS.)

So the answer to the question is: no guarantee is needed!

If you lose power in the middle of writing data to disk, it's like the write never happened. For a resilver that is totally fine -- after power is restored it will just continue the resilver from the previous state of the filesystem.

(Edit: didn't see @AlexGG's reply until after I posted this; I believe we are saying the same thing.)
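The "small and atomic final write" can be illustrated with a toy copy-on-write store. This is a deliberately simplified model with a made-up class and method names, not the real ZFS on-disk structures: new blocks always go to fresh locations, and the state only changes when the root pointer is switched.

```python
class TinyCowStore:
    def __init__(self):
        self.blocks = {}   # block id -> data; never overwritten in place
        self.root = None   # id of the index block for the current state
        self.next_id = 0

    def _alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid

    def write(self, files, crash_before_commit=False):
        # Step 1: write all new data blocks to fresh locations.
        index = {name: self._alloc(data) for name, data in files.items()}
        # Step 2: write a new index block pointing at them.
        new_root = self._alloc(index)
        if crash_before_commit:
            return  # power lost: new blocks exist but nothing references them
        # Step 3: the small, atomic step of switching the root pointer.
        self.root = new_root

    def read(self):
        if self.root is None:
            return {}
        index = self.blocks[self.root]
        return {name: self.blocks[bid] for name, bid in index.items()}

store = TinyCowStore()
store.write({"a": "v1"})
store.write({"a": "v2"}, crash_before_commit=True)
after_crash = store.read()
store.write({"a": "v2"})
after_retry = store.read()
print(after_crash, after_retry)  # {'a': 'v1'} {'a': 'v2'}
```

After the simulated crash the store still shows the old, consistent state, and simply redoing the write gets you to the new one.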

jgreco · Aug 8, 2019

Constantin said:
I must be misunderstanding how sync=on works then based on what I read in the SLOG demystified article that iXSystems hosts?

How does an async resilver guarantee that what was supposed to be written was in fact written despite the possibility of power loss etc? Does the system read every written sector as a separate transaction afterwards?

You're not misunderstanding how sync=on works. You're misunderstanding what the ZIL is and what a SLOG device is. It's not a write cache.

https://www.ixsystems.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

In order for host-managed SMR to work efficiently, you'd need to optimize a large number of blocks for a contiguous write. (Ok it may be slightly more complicated). The drive needs to know that you're going to be overwriting a large number of contiguous blocks so that it doesn't worry about the previous content on those blocks.

ZFS doesn't do this. Well, sometimes it does, but only accidentally, and not in a way that's compatible with host-managed SMR.

Johnnie Black · Aug 12, 2019

If your signature is correct and you're still on FreeNAS 11.1 you might want to try 11.2, there are indications that a resilver works much better on SMR drives, possibly because it's more sequential.

deafen · Aug 12, 2019

Johnnie Black said:
If your signature is correct and you're still on FreeNAS 11.1 you might want to try 11.2, there are indications that a resilver works much better on SMR drives, possibly because it's more sequential.

It's out of date - I'm on 11.2-U5 now. I should fix that.

deafen · Aug 13, 2019

And I'm out. The first five drives resilvered in ~30 hours each. The sixth ran for four days and was at 55% with 3.5 days projected remaining; I rebooted, and the resilver started over from scratch (which shouldn't happen, should it?) and it's saying 5+ days and growing.

Since I've got two good LSI SAS controllers, I'm going to cut my losses and switch to some 6TB SAS drives that I have found new for $104 each. I'll sell the SMR drives for whatever I can at work, might get $75 apiece for them? Small price to pay to have something with predictable performance. And they're more reliable drives, on paper at least.

Instead of serial replacement, though, I'm going to build the new array and then replicate the data over. Should be a lot faster.
