Baffling performance issues with large ZFS pool

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
ARC isn't hitting its max size, so unless you've recently rebooted there's probably downward pressure on it.

Cache hits are almost entirely (98.74%) metadata, which makes me think it's constantly looking up the DDT.

You've got about 6.5G of metadata in your ARC and another 6.7G of ghost metadata (effectively, metadata that was recently in ARC but got evicted) - this, combined with the DDT calculation you did earlier that came out to 14.5G, is what makes me think you've got more deduplication table than your RAM can handle.

I'd say try doubling your current arc_meta_limit with this shell command:

sysctl vfs.zfs.arc_meta_limit=37080287232

And also set dedup=off on the datasets to avoid more DDT updates.
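
Something like this, substituting your actual dataset names (the ones below are just placeholders):

Code:
# Turn off dedup for new writes on each dataset (names are examples)
zfs set dedup=off Data/dataset1
zfs set dedup=off Data/dataset2

# Confirm the property took effect across the pool
zfs get -r dedup Data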

To set it back to the previous value once dedup is gone:

sysctl vfs.zfs.arc_meta_limit=18540143616

Note to other readers - increasing arc_meta_limit will only band-aid the issue if you're binding on DDT lookups (having to check through it to compare new records) - if you're binding on DDT updates (writing the new DDT to disk) then the only thing that can help you is faster vdevs. Disabling dedup will mean that you don't generate more updates, but you'll still have to remove entries as the deduplicated records get removed. You're in for pain, certainly, but if you can still mount and access your pool it's at least "solvable pain."
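
If you want to confirm which side you're on before changing anything, a rough check (sysctl names as on FreeNAS 11.x, pool name assumed to be Data):

Code:
# ARC metadata in use vs. the current limit
sysctl vfs.zfs.arc_meta_used vfs.zfs.arc_meta_limit

# DDT summary: total entries plus average on-disk / in-core bytes per entry
zpool status -D Data | grep -i 'DDT entries'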

Edit: I'm guessing, based on the single Xeon 5600 and 72GB, that it's at 9x 8GB and "adding more RAM" isn't an easy option due to a lack of compatible or available 16GB sticks. If you do have some supported 16GB DIMMs, though, slap 'em in there to get to 144GB while you work on getting deduplication turned off.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I know I may get some eye rolls out of this, but I have had issues with LSI RAID cards in the past. For example, I had a Cisco C240 M3 loaded with 24x 1TB drives and an LSI HBA.

Sidebar on this one - note the difference between "RAID" and "HBA" or cards that claim to be able to do both. Also, was it an official LSI HBA, or was it a Cisco-branded one? Would be curious to see the exact hardware in use here, I've got a hunch as to what your root cause was. ;)
 

Sanman96

Dabbler
Joined
May 15, 2020
Messages
13
Sidebar on this one - note the difference between "RAID" and "HBA" or cards that claim to be able to do both. Also, was it an official LSI HBA, or was it a Cisco-branded one? Would be curious to see the exact hardware in use here, I've got a hunch as to what your root cause was. ;)

It is actually dual 5600s; not sure why FreeNAS doesn't say that. But yeah, the mobo only supports 144GB according to Intel ARK. I'll start the process of removing dedup soon and try your tune to see what happens. Thanks for the in-depth explanation, it is greatly appreciated! Learned about a new command, so I will mark that up as a win.

Looked up the POs. Originally it came with an LSI 9265-8i in JBOD mode, and it performed pretty well. FreeNAS ZFS, Ubuntu ZFS, and RAID 60 (in both Ubuntu and FreeNAS) compared nicely; neither really had an edge, and FreeNAS played well with it. I knew JBOD-mode RAID controllers weren't ideal, so instead of hassling with flashing a shiny new Cisco card I purchased a Dell H310 and flashed it to HBA mode, as many people in the forum have done and claimed amazing performance with. Slapped it in, connected the SAS expander, and boom: performance reduction. I'm sure it could have been fixed with tunables, but it was a very default FreeNAS install that followed the BPL once the HBA was installed. I misspoke before; once I saw the performance hit and re-flashed it back to RAID, I created a RAID 50, not a 60. Performance was about the same as the original card running the RAID 60. It ran for the short time I needed it, then got decommissioned and recycled after data destruction. There are a lot of variables there, but I hate figuring out the maze of LSI HBAs and RAID cards, like many others in the forum have said lol
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Let me know the results. Don't be alarmed if the DDT doesn't immediately shrink as you're moving/deleting the data, give it a minute and check to see if the counter starts dropping. Make sure you don't have snapshots of the old datasets or anything like that which would otherwise hold onto the records.
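
A rough way to keep an eye on it (pool name assumed to be Data):

Code:
# Watch the DDT entry count drop as the deduped data is destroyed
while true; do zpool status -D Data | grep 'DDT entries'; sleep 60; done

# List snapshots that could still be pinning the old records
zfs list -t snapshot -r Data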

For the Dell H310, did you use Dell's firmware, or the full crossflash to LSI?

I've got that exact card in a backup-target system and it chews through big-block sequential writes at about 500MB/s.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I left those out because the 9300 is $170 right now, from what I can see. VMware just ditched the 2xx8 entirely, not that this matters much for FreeNAS. For a corporate environment, USD 170 (and up for the 9305) seems reasonable; that's less than one drive. That said, sure, the 2xx8 work great, just a little older silicon.
 
Last edited:

Sanman96

Dabbler
Joined
May 15, 2020
Messages
13
Hi all, update time. Sorry it took so long, but I have rebuilt the pool without dedup. The problem is half gone. Now it will transfer at about 350MB/s for about 20 seconds and then fall off to about 175MB/s, and then the pausing starts again: it transfers at 175MB/s for about 10 seconds, then the transfer stops for about 5 seconds, and it just keeps repeating. The difference now is that the system does not completely lock up anymore; the WebGUI and SSH stay available. So now that dedup is ruled out, any ideas on how to troubleshoot the pausing, now that the system stays up and doesn't crash?

Since I had Linux on the box for about a day, I know the array can handle more sustained sequential speed, but I would be happy with ~200MB/s if it were sustained and not pausing all the time.
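
For reference, a local sequential-write check along these lines (scratch dataset with compression off so the zeroes aren't just compressed away; names are examples) takes the network out of the picture entirely:

Code:
# Quick local sequential-write test, run on the box itself
zfs create -o compression=off Data/speedtest
dd if=/dev/zero of=/mnt/Data/speedtest/testfile bs=1m count=20480
zfs destroy -r Data/speedtest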

Also, HoneyBadger: yes, it was a full LSI crossflash.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
It sounds like the write throttle might need some adjustments to make the slope more gradual, but even so it shouldn't pause entirely, just slow to a crawl.

Try creating and running Adam Leventhal's dtrace scripts for checking dirty data amount and txg duration.

Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

Code:
txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

Run them with dtrace -s scriptname.d poolname and post some sample results during a copy showing the ramp from "fast" to "slow" to "stalling".

Also, are you still running the 9650SE?
 

Sanman96

Dabbler
Joined
May 15, 2020
Messages
13
It sounds like the write throttle might need some adjustments to make the slope more gradual, but even so it shouldn't pause entirely, just slow to a crawl.

Try creating and running Adam Leventhal's dtrace scripts for checking dirty data amount and txg duration.

Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

Code:
txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

Run them with dtrace -s scriptname.d poolname and post some sample results during a copy showing the ramp from "fast" to "slow" to "stalling".

Also, are you still running the 9650SE?


Here are the outputs for transferring a 44GB file. It started out strong at 500MB/s, and by the halfway point it was down to 200MB/s and doing the stop-and-start. And yes, I still have the 9650SE.

Code:
root@freenas[~/troub]# dtrace -s duration.d Data
dtrace: script 'duration.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  7  66943                  none:txg-synced sync took 0.90 seconds
  2  66943                  none:txg-synced sync took 2.67 seconds
  7  66943                  none:txg-synced sync took 4.88 seconds
  1  66943                  none:txg-synced sync took 8.94 seconds
  6  66943                  none:txg-synced sync took 13.50 seconds
  6  66943                  none:txg-synced sync took 16.63 seconds
  4  66943                  none:txg-synced sync took 16.63 seconds
  4  66943                  none:txg-synced sync took 16.70 seconds
  5  66943                  none:txg-synced sync took 16.70 seconds
  7  66943                  none:txg-synced sync took 16.73 seconds
  4  66943                  none:txg-synced sync took 16.63 seconds
  0  66943                  none:txg-synced sync took 16.64 seconds


Code:
root@freenas[~/troub]# dtrace -s dirty.d Data
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  66942                 none:txg-syncing  327MB of 4096MB used
  2  66942                 none:txg-syncing 1126MB of 4096MB used
  7  66942                 none:txg-syncing 2077MB of 4096MB used
  1  66942                 none:txg-syncing 3292MB of 4096MB used
  6  66942                 none:txg-syncing 4096MB of 4096MB used
  6  66942                 none:txg-syncing 4096MB of 4096MB used
  4  66942                 none:txg-syncing 4096MB of 4096MB used
  4  66942                 none:txg-syncing 4096MB of 4096MB used
  5  66942                 none:txg-syncing 4096MB of 4096MB used
  7  66942                 none:txg-syncing 4096MB of 4096MB used
  4  66942                 none:txg-syncing 4096MB of 4096MB used
  0  66942                 none:txg-syncing 4096MB of 4096MB used
  4  66942                 none:txg-syncing 4096MB of 4096MB used
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
You're actually bouncing off the hard limit for the maximum amount of dirty data (4096MB), which is what's causing the stalls; your back-end vdevs aren't able to keep up with the incoming writes. Writes aren't actually stopping, but if they're being delayed by the maximum amount (100ms by default), that's going to reduce your pool to 10 (yes, 10) IOPS.

Your pool isn't at horrifically high occupancy or that badly fragmented; I know RAIDZ2 isn't the highest-performing layout, but 4 vdevs should be able to push a consistent 200MB/s, not choke up trying to sync it out.

Increasing the async write maximum is likely going to be the fix, but before we do that, I assume you've checked that all of your drives are reporting as healthy? Also, what driver is that 9650SE using? If it's not a well-oiled driver like mps, swapping to the crossflashed H310 might be worth considering (if you can take a downtime?).
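
For reference, these are the knobs involved (FreeBSD/FreeNAS 11.x sysctl names; the value at the end is only illustrative, not a recommendation yet):

Code:
# Current write throttle / dirty data settings
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.delay_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_max_active
sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent

# Example only: allow more concurrent async writes per vdev
sysctl vfs.zfs.vdev.async_write_max_active=20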
 

Sanman96

Dabbler
Joined
May 15, 2020
Messages
13
You're actually bouncing off the hard limit for the maximum amount of dirty data (4096MB), which is what's causing the stalls; your back-end vdevs aren't able to keep up with the incoming writes. Writes aren't actually stopping, but if they're being delayed by the maximum amount (100ms by default), that's going to reduce your pool to 10 (yes, 10) IOPS.

Your pool isn't at horrifically high occupancy or that badly fragmented; I know RAIDZ2 isn't the highest-performing layout, but 4 vdevs should be able to push a consistent 200MB/s, not choke up trying to sync it out.

Increasing the async write maximum is likely going to be the fix, but before we do that, I assume you've checked that all of your drives are reporting as healthy? Also, what driver is that 9650SE using? If it's not a well-oiled driver like mps, swapping to the crossflashed H310 might be worth considering (if you can take a downtime?).

Yes, I've checked the drives. I have written a few scripts to check all the SMART data. The only thing that comes up is that a couple of drives have a historical high-temperature warning, but that's it; everything else on all the drives seems healthy. Before posting last month, I ran short and long SMART tests as well.

Here is the dmesg output, so I assume it is using the twa driver.

Code:
3ware device driver for 9000 series storage controllers, version: 3.80.06.003
twa0: <3ware 9000 series Storage Controller> port 0x1000-0x10ff mem 0xb0000000-0xb1ffffff,0xb3900000-0xb3900fff irq 32 at device 0.0 on pci4
twa0: INFO: (0x15: 0x1300): Controller details:: Model 9650SE-24M8, 24 ports, Firmware FE9X 4.10.00.024, BIOS BE9X 4.08.00.004


I can take downtime, but the issue is that the backplane and RAID controller use a 50-pin SAS connector, which is 8 lanes and is even hard to research (which is why I don't know what they're officially called; you can Google this part number to see the end I'm talking about: 74596-2002). It's basically a double-wide SFF-8087. I can't find "aftermarket" backplanes that have the normal SFF-8087, so my choice of RAID/JBOD cards is pretty limited. Unless you know of cards that have the "MiniSAS 50 Pin" connector. I wanted to get a nice LSI for this build when I received it second-hand, but gave up trying to find one with these connectors.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Unless you know of cards that have the "MiniSAS 50 Pin" connector.

From what I can see, the 9650SE is a SATA RAID controller. It uses a MiniSAS form factor, but connects up to 8 SATA drives through that. It does not support SAS at all.

If that's so, then a possible solution I can see is a custom cable to that backplane: basically two Mini-SAS-to-4x-SATA breakout cables that don't converge into SATA connectors, but go into that 50-pin instead. Have you found documentation of its layout?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
the backplane and RAID controller use a 50-pin SAS connector, which is 8 lanes

This is most likely the SFF-8654 8i connector. There are a variety of cables that split this into 2x SFF-8087s.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I can take downtime, but the issue is that the backplane and RAID controller use a 50-pin SAS connector, which is 8 lanes and is even hard to research (which is why I don't know what they're officially called; you can Google this part number to see the end I'm talking about: 74596-2002). It's basically a double-wide SFF-8087. I can't find "aftermarket" backplanes that have the normal SFF-8087, so my choice of RAID/JBOD cards is pretty limited. Unless you know of cards that have the "MiniSAS 50 Pin" connector. I wanted to get a nice LSI for this build when I received it second-hand, but gave up trying to find one with these connectors.

I think you need 74596-1002 - but I'm uncertain if those cables are bidirectional. I know the "breakout" cables for SFF-8087 to SATA are, but I'm not sure if this would be as well. If it's just staying SAS all the way through it should be bidirectional.

What system hardware is this that's tied that closely to the oddball cabling?
 

Sanman96

Dabbler
Joined
May 15, 2020
Messages
13
Hi all,

This is most likely the SFF-8654 8i connector. There are a variety of cables that split this into 2x SFF-8087s.

That is not correct. The cable looks exactly like a double-wide 8087. The SFF-8654 has exposed pins, where the connector I need does not.

From what I can see, the 9650SE is a SATA RAID controller. It uses a MiniSAS form factor, but connects up to 8 SATA drives through that. It does not support SAS at all.

If that's so, then a possible solution I can see is a custom cable to that backplane: basically two Mini-SAS-to-4x-SATA breakout cables that don't converge into SATA connectors, but go into that 50-pin instead. Have you found documentation of its layout?

Yep, that is correct. It is just SATA, no SAS. That is why the backplane does not have an expander, just 3 of the double-wide 8087s. I cannot find any docs on this box short of contacting the manufacturer (discussed below).

I think you need 74596-1002 - but I'm uncertain if those cables are bidirectional. I know the "breakout" cables for SFF-8087 to SATA are, but I'm not sure if this would be as well. If it's just staying SAS all the way through it should be bidirectional.

What system hardware is this that's tied that closely to the oddball cabling?

Exactly; I am unsure if these are bidirectional, and I do not want to break anything by running them in reverse if the cables are not designed for that. It was originally a Windows 6U file server from LAMB, configured in RAID 6. I am assuming they went under, as a few Google searches haven't yielded any results. It is basically an Intel whitebox with a custom chassis.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Just want to start by saying I have searched the forum on performance issues for the past few months and they all come back to failing drives. I have tried all the SMART tests, etc., and all the drives seem healthy. So I feel like it's time to reach out to see if anyone can help.

The issue at hand... Copying large amounts of files to the system (200GB+), the system will burst extremely fast (900MB/s+) for a few seconds as expected. Then the speeds will settle down to about 250MB/s for about a minute. Then the fun begins. The system will become unresponsive: the drive activity lights will stop flashing, SSH will disconnect, the WebGUI becomes unresponsive, and jails hang or die. It will stay hung for upwards of 5 minutes, sometimes causing applications to fail along with any file transfers. I have been scratching my head on this, as I cannot figure out what's causing it. Just as a frame of reference, when I first got the box and it had Windows Server 2012 R2 installed, I could sustain 250MB/s to it all day without a single hiccup. Now that I've enabled JBOD and installed FreeNAS, the system is extremely unstable. Looking for help troubleshooting the hang. I've searched the logs and don't see anything obvious, so any help would be appreciated!

System build...

Intel dual SFP+ card configured for LACP to Nexus Core switching

Code:
OS Version:
FreeNAS-11.2-U8

(Build Date: Feb 14, 2020 15:55)

Processor:
Intel(R) Xeon(R) CPU E5607 @ 2.27GHz (8 cores)

Memory:
72 GiB


I have 2x 500GB HDDs connected to motherboard SATA and ZFS-mirrored. Then I have 24 Seagate Constellations configured as follows... The RAIDZ2 is required for drive-loss requirements set by others...

Code:
root@freenas[~]# zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
Data          43.5T  15.1T  28.4T        -         -    24%    34%  1.48x  ONLINE  /mnt
freenas-boot   464G  2.98G   461G        -         -      -     0%  1.00x  ONLINE  -
root@freenas[~]# zpool status
  pool: Data
state: ONLINE
  scan: scrub repaired 0 in 1 days 04:56:26 with 0 errors on Mon May  4 04:56:36 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/ba5b8d42-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/bb44aabe-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/bc29615c-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/bd218228-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/be12168e-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/bef2d192-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/bff8ace4-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c0e54728-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c1efc57f-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c2e5c00e-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c3d7d3af-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c4c47384-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/c5bb9669-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c6afba1d-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c7a915ef-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c8928c80-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/c994b8c2-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/ca9c8587-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
          raidz2-3                                      ONLINE       0     0     0
            gptid/cba20d8a-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/cc9edb60-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/cd968c08-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/cea7f25b-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/cfa689fd-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0
            gptid/d0a8b450-9abc-11e9-8ddb-001e672bf05c  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:34 with 0 errors on Wed May 13 03:45:35 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors



During the hang I cannot check gstat, as the system is completely unresponsive. Again, I'm at a loss on how to troubleshoot this further.

Thanks in advance!

While there are many valid points in the above answers, the basic diagnosis so far is incorrect. The correct explanation is obscure and very different.

Once I saw the very clear description of symptoms in your first post, I looked for dedup being on in your screenshots/outputs, and knew I'd see it. (I'll also guess you may be running fast network connections too, maybe 2.5-10 gigabit or more; see later on.) This collection of baffling behaviours is exactly what I had in the past, and this is what's actually happening.

The good news is, if your other hardware is good enough (sufficient RAM, etc.), your issues will probably be 100% resolved if you switch to special metadata SSD vdevs in TrueNAS 12 and replicate your pool to move the metadata there (zfs send -R | zfs recv). That will fix it.
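
Roughly, the migration looks like this - pool names are placeholders, and the receiving pool is a new one already built with the special vdevs attached:

Code:
# Snapshot everything, then replicate the whole pool so the new pool's
# metadata (including its DDT) lands on the special vdevs
zfs snapshot -r Data@migrate
zfs send -R Data@migrate | zfs recv -F NewData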

THE ISSUE:

In brief, deduplication places an immense demand on the system. Everyone knows it demands a lot of your RAM capacity. What's far less well known is the incredible levels of demand that dedup *also* places on 4k random IO and (not so relevant here) CPU for hashing.

With dedup enabled, *every* block read or written will *also* need multiple dedup table (DDT) entries read or written. That's inherent to dedup: all blocks are potentially deduped, not just the file data. That's what dedup is - you're reducing pool space requirements at the cost of very high 4k random IO and CPU usage (for dedup hashing). All those blocks need their DDT entries looked up for any read or write.

To give an idea of scale, my pool has 40 TB of actual data deduped down to 13.9 TB (about 3x) and needs almost 200 *million* DDT entries to do it. Those 200 million DDT entries are each just a few hundred bytes long, so read or write, it's all pure 4k random IO. Yours is much less deduped (~1.7x). You can see how many entries are in your dedup table using zpool status -Dv.
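
For a rough sense of the RAM footprint, something like this - the per-entry sizes and the arithmetic below are illustrative only; use whatever figures your own pool reports:

Code:
# Entry count plus average on-disk / in-core bytes per DDT entry
zpool status -D Data | grep 'DDT entries'
# e.g. "dedup: DDT entries 200000000, size 450 on disk, 320 in core" (made-up figures)
# entries x in-core bytes ~= RAM needed just to hold the whole DDT:
#   200,000,000 x 320 bytes ~= 64 GB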

Without an all-SSD pool or (v12 only) SSD special vdevs, that 4k random IO is what's ultimately destroying your pool's responsiveness and triggering the hangs. But it's doing it in a nasty, indirect way, and taking the network buffers and network IO capability down with it.

You can check this conclusively using gstat, but I know what you'll see.

Reboot (to clear any dedup/metadata/cached data from RAM/ARC/L2ARC). After reboot, run gstat on the console so it's not working across a network (something like gstat -acsp -I 2s), and with the network idle, do a nice big file transfer of the kind that usually falls over badly after a while. Writing a single 30GB to 100GB file from a client to the server is a good way to make this behaviour show up very clearly.
As the transfer progresses, and when your file transfer slows, stalls, or gets close to misbehaving, watch what disk IO is going on and which block sizes predominate. For HDDs, also listen for when they start and stop chattering, compared to what gstat says is going on.
Intuitively, for a single big file being written you'd probably expect to see lots of disk writing (for example, 128K writes if it's a single file and the pool has plenty of free space). Instead you'll probably see it largely sticking in long phases of processing hundreds or maybe thousands of 4k or mixed-size reads.
(The mixed sizes appear if your system is also stalling on loading spacemaps. But DDT is always 4k, and that's the vast majority of the problem. Another thing you should see, but probably won't, is a regular 5-second heartbeat as ZFS builds up IO data for 5 seconds and then writes it to disk - transaction groups, or TXGs. You won't see that nice clean pattern in your pool for long, if at all, because of the issues described below. After upgrading to special vdevs, that heartbeat came back on my system in gstat.)

That highly demanding level of 4k IO inherent in dedup is what sets off a cascade of escalating system problems.

ZFS treats DDT access as part of block access generally, so for throttle purposes it doesn't seem to notice that your disk IO is now invoking, in the background, much larger numbers of 4k random IOs to fetch the DDT data needed to execute the actual writes - plus spacemap data to know where to put them (spacemaps are 4k random metadata too...).

When I say dedup invokes highly demanding levels of 4k IO, the sheer scale of this isn't obvious at first. My deduped pool on TrueNAS-12-BETA, which *doesn't* have this issue any more, sucks up 4k IOs at half a *million* reads/writes a second - and not just briefly, but long and often enough for it to be a regular level of disk IO. Not just the few thousand 4k IOPS which HDDs can deliver.

[Attached screenshot: gstat output showing roughly a quarter-million 4k IOPS on each of the two mirrored special (metadata) vdevs]


That rather shocking figure isn't exaggerated.
This was gstat on my pool, once I moved DDT to special vdevs on 12-BETA1.
See the 2 x 1/4 million 4K IOPS on both of the mirrored metadata vdevs?
(mine were writes not reads, because this was during replication)
That's the backlog your pool is choking on, and unable to process because an HDD pool just can't do that.

So the read/write disk throttles don't control that 4k metadata access as much as needed, and eventually a backlog builds up from this 4k IO pressure which the pool just can't respond to quickly enough. Because it's a deduped pool, *nothing* can happen pool-wise until the relevant DDT entries for all the data being processed have been loaded into RAM... except they can't be read nearly fast enough, and the backlog builds up *really* badly.

By this point you've got minutes to tens of minutes of 4k IO backlog already built up, even if it were processed at full speed on a fast HDD pool with plenty of RAM, and even if nothing else arrives.

The next step in the cascade is that RAM fills up with the backlogged file handling queue. That's why it runs happily anyway for a while at good speeds. ZFS is using RAM to backlog the file transfers. It might tell the source to slow down a little, but by and large at this point everything's still mostly happy and nothing's in meltdown... yet ("the system will burst extremely fast (900MB/s+) for a few seconds as expected. Then the speeds will settle down to about 250MB/s for about a minute")

But that can't last forever. Eventually if the demanded file read/write is enough GB, or ongoing enough, the backlog starts to invoke a more serious RAM throttle as RAM fills up to the maximum it's allowed to use. So the next step in the cascade is triggered within the wider OS (not ZFS), telling whatever is sending the data to slow down even more, until the backlog clears a little.

Usually that solves it. But in this case, you've got ZFS massively choking on 4k already, and *gigabytes* of RAM of incoming data from your file *already* accepted into a queue/buffer waiting to *further* transact. All of that data needs its own 4k IOs on a deduped pool - a backlog that can *easily* trigger 0.4 to 4 *million* 4k random IOs for the related DDT records while processing (depending on the size of the backlog in RAM, your dedup level, what's already in ARC, and many other things). And your pool is simply floundering, unable to begin making inroads into that massive 4k IO demand.

So the usual remedial action doesn't work as expected. The source is indeed told to slow down. The source for your files is network traffic. The way a networked server tells a networked source to send slower, or pause sending a bit for catch-up, is to lower the TCP window to tell the source not to send as much. At a pinch if the problem continues it can lower it all the way to zero ("zero window"). If you use netstat or tcpdump in console (look up tcpdump tcp window, or tcpdump zero window, on Google) or Wireshark or tcpdump on the client, you'll see that's being handled absolutely correctly. Unfortunately it's just not much help at this point. Whatever is sent, still can't get out of the network buffers into the file system stack (i.e., VFS or ZFS) because those are still choking, and they're going to continue choking for ages.
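
A quick filter sketch for seeing it on the wire (the interface name and the SMB port are just examples for a typical setup):

Code:
# Show packets the NAS sends with a zero TCP window
# (bytes 14-15 of the TCP header are the raw window field)
tcpdump -nni ix0 'tcp port 445 and tcp[14:2] = 0'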

So the TCP window never gets to really bounce back. It's hammered down to near zero constantly for minutes, because the system simply can't accept more data. Any opening in the buffers (network or file system) is instantly stalled by backed-up 4k IO and makes no real difference whatsoever; it just becomes another item in a stalled backlog that stops anything new being accepted at more than a few bytes at a time. That's a continuing situation.

(Note: Network speed is a contributing/triggering factor as well. It's not that 10G is an issue, far from it. But 1 gigabit networking acts as a 120MB/sec throttle from the start on the incoming file data. ZFS might or might not just about cope with the 4k IO load at that speed, but it usually doesn't stand a chance with anything more, which can fill the pipeline 2.5-100x faster. The slower speed of 1G Ethernet also means the backlog builds up much more slowly, so there's much more time to get through the workload. I never saw this situation happen when, by chance, I downgraded my 10G LAN link to 1G. But I guess you might see it even on 1G with some combinations of CPU speed, pool IO speed, and RAM size.)

By now the 4k DDT traffic has backlogged ZFS IO, the file system stack, and now the network stack, which has told the source "don't send for a while" by signalling a zero or very low TCP window.....

The source checks for a while - "Can I send yet?" - but the network stack never really stops being hammered, so 90% of the time the reply is "NO, WAIT!" Eventually, by chance, it'll get told "NO, WAIT!" a few times in a row each time it asks (a client might only check about 4-6 times over the span of a minute if refused). At that point, the client gives up and assumes an error or issue has occurred at the remote end. That's your SMB/NFS/iSCSI session dying. Your sessions can't reconnect for the same reason they disconnected in the first place - the network packets for negotiating a reconnection are also being stalled or getting "TCP zero window" responses as well. ("The system will become unresponsive ... It will stay hung for upwards of 5 minutes sometimes causing applications to fail along with any file transfers.")

But that's not the end of the fun. The same network stack that handles those, also handles webUI and SSH, and even if those were low traffic, they also get told "TCP zero window" too, and at some point the same happens to them, and they conclude the server isn't responding properly and disconnect *their* sessions as well, after a minute or so of continual zero ability to send data across the network to the NAS ("ssh will disconnect, the WebGUI becomes unresponsive").

And finally, after all that, the sting in the tail. With no more incoming data, ZFS can finally and slowly catch up. That takes a few minutes. But what's happened? Oh dear! Your file transactions got cancelled. Which means there's immediately a new set of ZFS tasks to run, because ZFS/Samba/NFS/ctl (for iSCSI) now has to *unwrite* all of that stuff again, to unwind the part-completed transaction and keep the pool clean(!).

While RAM is emptying, Samba/NFS/ctl and your network stack can fill it up way faster than ZFS can process it with its 4k overhead, so whatever happens in effect it's kept at TCP zero window throughout, until it's almost totally clean.

After all that, finally, your server can begin to reaccept data properly. That's when your clients can finally reconnect.

LOCAL (NON-NETWORKED) FILE TRANSFERS:

Be aware this can also happen just as easily with local file transfers within the server, not just networked ones. But those don't disconnect so utterly as a rule, because they are much more tightly managed by the same OS, so there's much less to fall over and much better knowledge of whether an actual fault has arisen, instead of just judging by network responses. They just get huge long pauses and slowdowns instead, in many cases.

THE SOLUTION:

Either un-dedup and don't use dedup, or try switching to a 1G (or slower) network link if you're on anything faster, or move to 12 and migrate your data to force a rewrite onto a set of disks that includes special SSD vdevs - and make sure they are big enough, and good ones, and (of course!) redundant. At this point I'd say wait for 12-RELEASE, or at least 12-RC1, which is due out in a couple of weeks and fixes some things that are important. But even 12-BETA1 was rock solid for data safety AFAIK, and BETA2 is nice.

That's the real fix here.

You can also maybe try TrueNAS 12 tunables to load (and not unload) spacemap metadata, preload (and not unload) the DDT at boot, keep the L2ARC warm if you have one, and maybe reserve a certain amount of RAM/ARC for non-evictable ZFS metadata ("vfs.zfs.arc.meta_min", as a loader tunable I think?), if you have plenty of RAM.
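
For what it's worth, the sort of thing I mean looks roughly like this on 12 - treat the names and values as "check against your build", since OpenZFS 2.0 renamed a lot of tunables and some may need to go in as loader tunables rather than live sysctls:

Code:
# Reserve a floor of ARC for metadata (bytes; 16 GiB here is illustrative)
sysctl vfs.zfs.arc.meta_min=17179869184

# Keep L2ARC contents across reboots and let it cache prefetched data too
sysctl vfs.zfs.l2arc.rebuild_enabled=1
sysctl vfs.zfs.l2arc.noprefetch=0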

Note that tunables alone didn't help much on 11.x for me, because the all-important tunables are missing - a way to tell it to preload the entire DDT so it never has to do that scale of DDT 4k reads "on the spot" during a live file transfer, or, having loaded it the slow way over time, to keep it "warm" in L2ARC at least and not lose it from speedy access.
 
Last edited:

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Wow well spotted. And here we are going down a rabbit hole about cables.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Wow well spotted. And here we are going down a rabbit hole about cables.
I was troubleshooting this one off and on for over a year, maybe 2, till I figured it out for sure for myself.

In no specific order: Wireshark/tcpdump/netstat to confirm whether it was a client or server issue and what dialog was happening, netstat to watch the network buffers in parallel, gstat to watch disk IO/block size/direction (R/W) and queues, and multiple posts and bug reports to piece it together.

I started like people in this thread, at the network end (Wireshark on the client), where I could see the zero-window disconnects. Then tcpdump+netstat on the server confirmed it was genuine and not an artifact of networking problems - the server was indeed deliberately requesting this of clients via the TCP window, and the server's network buffers were indeed staying pretty much full as well, which all kinda fitted together and explained the dropouts, and I thought it *might* mean the server's network buffers couldn't empty for some reason... but why? Samba/SMB/NIC issues? Or what?

The breakthrough came after looking at gstat at the same time, when I could see the network buffers briefly "gulping" (becoming half empty and immediately filling again to the hilt), and that correlated with *read* disk activity in gstat and the noise of HDDs suddenly starting each batch of IO. What were the disks doing? What could they need to read that much just to write out a single big file, and why was ZFS disk IO in sync with the network buffers suffocating? Somehow, ZFS wasn't pulling data off the network properly. So what on earth was it doing with all that IO in that case?

As a first step I focused on spacemaps (free-space data). I replicated the pool to new disks to eliminate spacemap issues - with spacemaps you can control block size for read/write efficiency (16-64 KB?), preload them, and lock them in RAM.
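
(The spacemap knobs I mean are roughly these - names as on the FreeBSD ZFS of that era, so verify on your version, and they cost RAM:)

Code:
# Load all metaslab spacemaps at pool import and don't let them be unloaded
sysctl vfs.zfs.metaslab.debug_load=1
sysctl vfs.zfs.metaslab.debug_unload=1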

After that, all that gstat showed was incredible amounts of 4k reads when I was expecting 128k writes. I honestly didn't think HDDs could do them that fast. Only one thing that could be: DDT reads. So now I understood what ZFS was spending 100% of its time doing while this went on, although it took a while longer to comprehend that this was in fact what was blocking ZFS and why ZFS was blocking on it, and even more time to understand why so huge a 4k load, and why it had the other knock-on effects I was seeing and was triggering that full cascade of events. 4k reads had to be metadata (on a 50%-empty, newly written, non-fragmented pool), and with spacemap issues hopefully eliminated (and not 4k anyway), only one kind of metadata could produce that level of volume. Googling for some useful if very basic dtrace supported the idea, confirmed the gstat IO and block sizes involved, and added that disk latency on the 4k IO was just colossal. That was 2018 or 2019. I've been waiting for special vdevs, log-dedup, or other mitigation ever since. So I moved to 12-BETA as soon as it felt safe, to grab the special vdev feature ASAP, specifically to speed up DDT reads. Worked a dream.

As a rough idea of what I got and have now with special vdevs on 12-BETA2 - and everyone's pool will differ - mine went from the behaviour in the OP to being able to reliably handle 1-3 TB file writes at a constant 300-400 MB/sec for hours, over SMB (10G LAN). The iSCSI target (formatted as NTFS) is handling random disk access faster than the actual enterprise nearline HDDs plugged into my desktop itself.

That was the impact of really good special vdevs, and tunables to free them from the usual limits, on 12-BETA vs. 11.3. Everything else - other config, pool and dedup structures, the HDDs used - stayed the same.

So yeah.. I recognised the OP's descriptions ;-)

And I'll take a bet the OP either has *very* low-power hardware or is on a ~10G+ LAN, because 1 gigabit LAN usually can't feed incoming file data fast enough to overwhelm ZFS IO this way.
 
Last edited:

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
and make sure they are big enough, and good ones,

Since you've done so much work on this use case: What are your go-to recommendations for SSD models for a special alloc mirror vdev?
 