Samba + iSCSI failing - NAS sends "ZeroWindow" timeouts on simple workloads most of the time

Stilez · Jan 30, 2018

UPDATE @ 2019-01-08:

The cause for the issue described below is now known to be an unreported severe effect of ZFS space map (free space log) handling in some situations. Comprehensive info is on the ixsystems "redmine" bug tracker.

Basically, ZFS is known to have an issue with free space map management (2016 talk by Matt Ahrens). But in some cases, the "4K IO" described on pages 20-28 is enough to choke ZFS - and ZFS isn't catching it as it should, and telling the source to slow down. As ZFS backlogs under the 4K IO demand - which it needs before it can work out where to write files - transaction groups stop being able to writeout, so they can't accept data from the NIC. The network stack takes action itself, reducing the TCP window. At 10G network speeds, incoming data quickly fills any space in the network buffer, so the TCP window is quickly forced to zero enough of the time, and for minutes at a time, and the client file transfer dies.

------------------------

This is a clean update of 11.1-U1, as of a few days ago. No VMs, no jails, no weird config, no other services running, no authentication or share browsing issues, good hardware, and only configured in GUI. I just ran "verify install" and no errors. The hardware is good and new (no issues/errors known), the few tunables are well known ones that should be 100% safe, and there are no Samba auxiliary conf items related to sockets, networking or caching (and iSCSI with the same issue doesn't have such options provided anyway). Data is at the end of the post.

It's fine for a while, but at some point during a write, the server starts to issue ZeroWindow packets and after 50+ seconds of TCP win = 0, the connection gets killed from the client end. I've attached a Wireshark screenshot (tcpdump on the NAS says the same). The test job is a 50 GB single file write . The LAN has no other clients and all disks were idling before the test (checked with gstat). Periodically, the TCP window rapidly decreased from the server end (from ~ 28800 to 0, maybe a dozen time, each lasting 3-5 milliseconds). On most of them it quickly recovered; the last time was about 4 mins / 30GB into the transfer. The client probed for win > 0 at 1, 2, 5, 10, 20, 20 seconds (58s total) after which the client gave up and RST.

Now the NAS should be laughing at this transfer. It's got 128 GB of fast ECC to play with, a 3.5GHz Xeon v4, Chelsio NIC, no parity to calculate (mirrored pool) and 250GB of NVMe L2ARC as well. There's nothing in any log I can see, to explain why it's hitting this issue or to troubleshoot it. It does the same on both Samba and iSCSI, so it's almost certainly a file system or networking issue and not related to config within these services.

I've done a fair bit of digging with not a lot of knowledge at troubleshooting this sort of thing. top shows CPU is 34% system throughout - doesn't seem to be CPU issues. iostat shows HDDs almost dawdling (4-10 MB/s each disk for enterprise 7200s) and queue lengths typically 0-1 and sometimes 4-10 and consistently busy only 8-10%. iostat data was remarkably uunchanging throughout. netstat doesn't seem to report any mbuf delays/denies, and all "worrying" indicators such as "bad XYZ" seem zero or unchanging throughout. I'm less sure what to look for on netstat but it seems innocent. tcpdump confirms that the TCP window announced by the NAS is gradually (although quickly) reduced over periods of about 3 ms quite often. Sometimes down to zero, sometimes down to some other point, before recovering. (I've attached a graph of the 4 times it got down to zero, they show identical behaviour). I tested with L2ARC removed (Samba doesn't hit ZIL) and with iSCSI instead of Samba - same happens. That's all 3 systems checked for obvious signs as best I know to do, which isn't saying much :) and all 3 seem perfectly normal.

Conclusion so far: CPU low use, physical disks low use, network responding to something that's causing it to reduce window for incoming traffic but no clear signs of any specific networking buffer/queue that's congested... Hmm.

What can I do to find a specific place where the issue arises, and a fix for it?

Supermicro X10 + E5-1620 v4
RAM: 128GB (6 module) 2400 ECC
HBA: 2 x LSI 9211
NIC: Chelsio T420 dual 10G (foreseeable max: 2)
Storage:
16 x 7200 enterprise, mixed SAS/SATA (foreseeable max: 22-24)
Usage: 3 way mirrors (3x4TB+3x6TB+3x6TB+3x8TB) + test pool/spares, main pool capacity 22TB @ 55-60% usage, dedup saving 3.9x
ZIL: Intel P3700 NVMe
L2ARC: Samsung 250gb NVMe
Boot: 2 x Intel 320 SSD mirror
Power draw/temp: staggered start enabled, vertical fanned, disk temps 28-35C

Code:

hw.igb.max_interrupt_rate 32768        --- igb can over-poll, this rate limits polling. But not using igb (fallback NIC, currently disconnected)
kern.ps_arg_cache_limit 4096           --- provides more info in ps, doesn't affect sharing/networking
vfs.zfs.min_auto_ashift 12             --- ensures new pools/datasets 4K aligned.

dlavigne · Feb 5, 2018

Were you able to pinpoint the cause for this?

Stilez · Feb 6, 2018

dlavigne said:
Were you able to pinpoint the cause for this?

It's immensely helpful, that you take such care of us "mere mortals" and try to ensure issues are resolved.

It's not fixed, no. Networking connections still time out every time on large file writes.

I've done a load more troubleshooting and have now got a lot of evidence from netstat, gstat, and tcpdump all pointing in very similar directions, so I'm getting a good idea what's happening in the NAS when LAN file writes cause a connection timeout. I'm now 99.99% sure it'll end up being something like this:

"For some reason" data is not pulled from network RECV buffer to the "new" TXG. Instead it's left unprocessed in the networking buffer. Traffic flow resumes at the same time that the next ZFS TXG HDD writeout burst begins.
I have no idea the underlying reason for this happening, but this is what's ultimately causing the timeouts. Something is happening (probably in the ZFS/disk subsystems?) so that network buffers are being left unemptied too often when one would expect they are still be serviced.

In either case, the network RECV-Q buffers become congested/full and TCP window begins to decrease in response.
If the activity causing the stall occurs quickly enough, and doesn't recur "too soon", the session recovers. On resumption, the network queue empties extremely quickly, so it's not full often enough to make a disconnect likely (which requires ~58s of zero win as seen by client probes). That doesnt mean it's healthy - the issue is still recurring periodically even if not so bad as to cause session disconnect. , and an "up-down-up-down" cycle of queue size/TCP window happens. The user/client sees low/zero traffic flow for up to a minute each time, then speeding back up, periodically/cyclically.

But if "whatever" causes the stall, happens too often for the networking RECV buffer to reliably be emptied/recover before the next "stall" event occurs, then the RECV-Q buffer spends a lot of time being driven down to (at of near) zero and doesn't get a chance to recover much if at all. So there's quite a high chance of zero window several times in a row when the client probes. Eventually luck fails and the client gets ~10 zero window responses in a row, and deems the session timed out, so it RST's.
Since NAS resources are ample and it's quite likely that the NAS factory defaults *might well be* suboptimal/unsuited for the hardware/pool/LAN, this tends to suggest the answer will be to pin down a more precise "underlying cause", and remedy it with custom tunables that are more suited to the NAS.

I need help going from there - pinning down what my next steps/tests should be to find the exact problem/tunables/config that's not right for my system in "factory config", to target and fix the issue and understand more exactly what's causing the transfer of data out of the network buffer to periodically stall.

But I was hesitating because I didn't want to ask so much help from the community.

More broadly, I have nice FreeNAS news - my NAS is now working really nicely except for 3 issues, of which this is one. I'll probably have to bring the other 2 to the forum some time soon - one is that iperf isn't getting anything like the throughput it should (2 x chelsio T420-CR on a near-silent small network should be able to pull near 10G line speed, but not yet fully configured it because fixing this issue has been a lot more important to me than optimising an already-working link), the other is a permissions/ACLs issue (not very familiar with these although a lot's been written about them, need understanding on some stuff).

On this issue..... yes, help please!

The server has plenty of resources to handle the issue but I don't have the detailed knowledge of the FreeBSD networking/ZFS subsystems and stats commands needed, to refine my testing any further to pinpoint the precise bottleneck cause and the necessary measures.

Below is a summary of evidence and outputs, additional detail on the NAS that might be relevant (which didn't get mentioned when I wrote the OP) and a summary of my troubleshooting approach and results so far. I['ve put some useful screenshots of system stats while the issue is manifesting, at the end of the post. There's quite a lot of info but it's very straightforward. Just trying to provide good information in the hope of good help.

My best guess on interpreting the issue so far:

My best guess is that despite transaction groups being intended to accept and process current data while ZFS writes out a previous TXG, the problem is that on factory (or near-factory) settings, processing of new incoming data from the network almost totally halts, and resumes at the point ZFS next begins burst-writing out disks, and this leads to the NIC cutting down the TCP window. I don't understand this at all, since the whole point of TXGs is to accept/process data without interruptions while writout of previous data occurs in the background. But I'm sure that's the problem.

I can correlate this in numerous ways (see below). Not least, on 10G at ~350 MB/sec (current line rate for large files being written), netstat shows the RECV-Q hits a limit of about 1.9 MB (see screenshot attached). To fill 1.9 MB from zero, linearly, at that line rate, would take about 4 or 5 milliseconds - which is exactly the behaviour and time period I originally found and graphed in the OP (see OP attachment).

Usually, reducing theTCP window works as intended - the window size bounces back and the network RECV-Q both drop back to zero within a fraction of a second before the writeout burst starting. But netstat and tcpdump show it doesn't always happen. Periodically the condition just doesn't seem to clear at all for tens of seconds and becomes the "status quo" and that's when a connection timeout is happening.

I suspect this is just an artefact of observation timing - meaning that I think in reality the underlying condition just becomes so ongoing that the buffer can't really recover, or perhaps every time the client probes (which is only about 10 times over ~ 58 seconds following the usual waits of 1 -> 2 -> 5 -> 10 -> 20 -> 20 seconds) it's got high odds of being back down at zero again. Either way the client probes about 10 times over ~ 58 seconds, gets zero window each time it asks, and eventually at the end of that time, the client deems a timeout and RST's the connection. The NAS handles this normally, acknowledges, the congestion clears up very quickly, and the connection is good again very quickly.

Updated NAS + pool info:

The server is specced "up front" for heavy single-user use + dedup. It's on a 10G Chelsio LAN (9k Jumbo frames) with 3-way mirrored 7200 enterprise HDDs for speed (3x22TB 55% full), Supermicro + 4GHz Xeon v4, 128GB of 2400 ECC RAM (of which around 85GB is reserved for metadata via a tunable), a 250GB NVMe L2ARC and (not really relevant) a fast low write latency NVMe ZIL. The power is good here but it also has an APC Smart_UPS and EVGA Platinum 1600W PSU to ensure good voltage stability during HDD current peaks. It's more than overdone on the hardware, for what is really quite moderate usage, but I routinely move 500+ GB dirs and files around, and I like it fast.

Online examples propose 5GB per TB and x4 on RAM, based on an example pool that's totally unsuitable for dedup. I've specced server RAM from exact pool data. The pool has 2 datasets of which one is deduped (4 ~ 4.5 x). zdb shows 121M blocks @ 175bytes RAM each for the total pool, so the total DDT is about 20.3G, or 1.7G per TB. Also instead of giving metadata 25% of the pool, I got an extra 64GB and dedicated that 100% to metadata (including DDT), i.e. about 85GB ARC metadata limit, and 50G RAM + 250G NVMe L2ARC for everything else. I'd be happy to lock the metadata in RAM only, if there was a way and if it wouldn't waste the rest of RAM.

Setting the ARC metadata limit and disabling /dev/random harvesting from the network (recommended for 10G) are therefore the only tunables that should be needed for my setup.

But since clean reinstalling 11.1-U1, large single file writes over the LAN consistently timeout even on "factory config (with or without these two tunables)", even though neither the pool nor any hardware or client was changed.

Troubleshooting approach

Clean 11.1-U1 reinstall on newly-wiped spare SSDs with the barest minimum of config to get a user, a file share, and SSH working. No aux params.
Confirmed that networking itself isn't the problem. Used `iperf` both ways between NAS and same client, including variety of window and buffer sizes and 1 - 20 clients, for up to an hour session. No issues on any, works fine.
Confirmed that disk handling alone (as distinct from ZFS processes) isn't the problem. I connected a spare fast (500+ MB/s) SATA SSD to the NAS, used gpart to destroy and reformat as a 200GB ufs in console, mounted it, and used 'cp' to copy the identical files to and from the pool: No issues at all seen.

Found a set of tunables that allowed transfers to complete without TCP window issues. Taken from fairly reliable sources such as Calomel and FreeBSD Wiki that explain what they do. I've listed them below with comments.

Code:

##########  Tunables I used to work around the problem  ##########

# seems 2M isn't recommended for 10G
	  kern.ipc.maxsockbuf:							   16777216 (16M) (dflt: 2M)

# from Calomel - TCP rampup efficiency
	  net.inet.tcp.abc_l_var:							44 (dflt: 2)
	  net.inet.tcp.initcwnd_segments:					44 (dflt: 10)

# Remove write throttle on L2ARC in case this is slowing the pool. (Also prefer to burn out an SSD early, than slow the pool ;-) )
	  vfs.zfs.l2arc_write_boost:						 1000000000 (1G) (dflt: 8M)
	  vfs.zfs.l2arc_write_max:						   1000000000 (1G) (dflt: 8M)

# from Calomel - trust the NAS won't crash and allow much more dirty data before flushing (they use >10G of 32G). Risky in production.
	  vfs.zfs.dirty_data_max_percent:					20% (dflt: 10%)
	  vfs.zfs.dirty_data_max_max:						32212254720 (30G) (dflt: 4G)
	  vfs.zfs.dirty_data_max:							21474836480 (20G) (dflt: 4G)
	  vfs.zfs.dirty_data_sync:						   100000000 (100M) (dflt: about 65M?)
	  vfs.zfs.dirty_data_max_percent:					20% (dflt: 10%)
	  vfs.zfs.vdev.async_write_active_min_dirty_percent: 50% (dflt: 30%)
	  vfs.zfs.vdev.async_write_active_max_dirty_percent: 75% (dflt: 60%)

# Throttles dirty data creation when it gets to this level - Calomel
	  vfs.zfs.delay_min_dirty_percent:				   95% (dflt: 60%)

# not sure what default value is (5-10M?) but giving lots more might reduce impact. 100MB x 4 is a tiny amount for the NAS.
	  vdev_cache.size:								   100000000 (100M) (dflt: "0")

# dedicate lots of RAM for DDT and metadata speed only. Still leaves 50G + L2ARC for everything else which is plenty for single user
	  arc_meta_limit:									85G (dflt: 25% ARC = 30G)

# 351 prevents /dev/random harvesting from 10G NIC's, data and IRQs, widely recommended to prevent slowdown on 10G
# (also listed as kern.random.harvest.mask)
	  harvest_mask (rc):								 351 (dflt: 511)


##########  Tunables seen afterwards, not tried yet  ##########

# Calomel give a formula to calculate optimal value for different network conditions. This was the result.
net.link.ifqmaxlen:						  8192 (dflt: 50)

# Don't understand principles here well enough to guess better values, and don't know how to display bufs in use, so left alone.
# Possibly could help?
# NOTE: Apparently LANs that use Jumbo pkts (like this one) do NOT use the usual sysctls to control some buffer limits
	  kern.ipc.nmbufs:								   300000000
	  kern.ipc.nmbclusters:							  30000000
	  kern.ipc.nmbjumbop:								10000000
	  kern.ipc.nmbjumbo9:								10000000
	  kern.ipc.nmbjumbo16:							   10000000

# More dirty data tuning
	  vfs.zfs.per_txg_dirty_frees_percent:			   30
		
# Allow more concurrent I/O ops: taking advantage of pool structure and HDD quality
	  vfs.zfs.top_maxinflight:						   128 (dflt: 32)
	  vfs.zfs.vdev.async_write_max_active:			   10
	  vfs.zfs.vdev.async_write_min_active:			   1
	  vfs.zfs.vdev.max_pending:						  32

# Alter time between writeouts (hence burstiness), but allows reads and writeouts to occur more sequentially => efficiently.
# When dirty data allowances are very high,  writeout becomes governed by txg timings only, via these tunables
# NOTE: has latency inpact which may or may not be mitigated by extra buffers in other areas.
	  vfs.zfs.txg.synctime_ms:						   2000
	  vfs.zfs.txg.timeout:							   5

# Calomel say up to 64K gives more data in advance for smaller reads without harming IO latency, so why not.
	  vfs.zfs.vdev.cache.max:							32768 (dflt: 16384)


##########  Non-tunables  ##########

I haven't done anything with the NIC other than enabling jumbo frames (MTU 9000). Chelsio T4xx supports a lot of config and queues, perhaps I could get performance gains there. But not until the basic problems are sorted out.

These tunables increase aggressive RAM buffering/caching, streamline disk IO (eg spacing out write bursts and buffering during writeout bursts, increase max concurrent HDD IOPS, and no throttling of L2ARC: also easier to see how network status relates to disk activity) , and more aggressive/efficient TCP/IP ramp-up for 10G networking, at the cost of increased risk of "data in transit" loss and possible burstiness (but also larger caches/buffers to handle it).

They definitely help but it's very unsatisfactory because I don't actually know what issue I'm targeting or a proper "fix", and most of the tunables are 80% guesswork and either not actually needed or may have undesirable side-effects if used outside testing.

Main testing: gradually disabled tunables, working back towards default ("factory") values and watched the effect on the same LAN file write. Each test involved disabling some more tunables, rebooting, then writing a single large (50GB) file via Samba and at times iSCSI, while monitoring over SSH with netstat -an, gstat -x (filtered on physical disks), and tcpdump (port 445 and tcp[14:2] < CHOSEN_WIN_SIZE) in separate windows on 1 second updates. (The tcp word at [14] is the 2 byte window size).

Results so far:

The big picture is that when the tunables got closer to factory, the TCP window really began to show the issue. It repeatedly hits zero or close to it during the write, and I got the impression that traffic resumption seems to coincide with the times that gstat shows ZFS write-out bursts to the HDDs kicks off. Once writeout starts the TCP win almost always bounced back very quickly, at least until I removed all but the last few tunables. So I can avoid the issue by giving everything huge buffers, but that's not ideal.

All tools provided useful results. The attached screenshot shows all 3 stats taken simultaneously when the issue was occurring.

These were my notes, from the point where I could see that disabling tunables back to factory was starting to affect the transfer stats:

Removed: abc_l_var, initcwnd_segments, dirty_data_max, dirty_data_max_max, async_write_active_min_dirty_percent:
Impact: Performance began to hit intermittent slowdowns (gets close to zero then recovers).
tcpdump shows "win 4096, length 37960" on all packets. Transfer completed sensibly.
Removed: async_write_active_max_dirty_percent, dirty_data_max_percent:
Impact: Seeing quite a lot of lower win (200 - 2500), compared to above which were almost all 4096.
Removed: dirty_data_sync:
Impact: Near constant window size concerns - some stretches have win 3000 ~ 4000, but others show it going down to zero or near zero, every few seconds.
Probably pure chance whether or not transfers succeed; this time in testing they did, but my gut feeling looking at the window data ( tcpdump) and large RECV-Q durations ( netstat) was that it was purely a matter of chance, in reality this would fail at times in ongoing use.
Removed: vdev_cache.size:
Impact: This tipped it over the edge. Minute long zero win timeouts; transfers failing.
Stopped testing at this point.
Custom tunables still enabled were: kern.ipc.maxsockbuf, arc_meta_limit, kern.random.harvest.mask

gstat results
gstat revealed an interesting HDD I/O pattern. At almost all times there was also a fairly steady and quite high level of ongoing small read activity happening (looks like 4k random reads comparing IO count and data quantity). This was happening at all times, including during TXG writeout bursts, at rates around 10 and 50 4k reads per second on each HDD. If ZFS is trying to read a lot of small randomly placed metadata blocks at the same time as sequentially flushing/writing out txg's, that would make disk IO pretty inefficient. I can't think how to confirm for sure the type of data being read to see what's going on. I haven't heard of any way to hint to ZFS that it's preferable to prioritise writeouts and avoid issuing these random 4k reads at the same time as a writeout bursts, or to preload (and not de-cache) metadata, but both of these would be worth playing with if it's doable.

netstat results
netstat showed that the TCP zero window was due to the RECV-Q filling up. At the same time as `tcpdump` and `gstat` reported the issue happening, and the client reported file transfer slowdown, the network receive queue/buffer (`RECV-Q`) sharply rose to a high value of about 1.9M for a comparatively long time, often several seconds and sometimes longer. This is almost certainly the reason for the TCP zero win, although no idea why this happens. At other times RECV-Q is quite low and routinely (until I disabled more custom tunables) almost always zero.

On the assumption that a larger RECV-Q would allow traffic to buffer anyway at network level and be cleared once socket emptying resumes, I tried to set this higher and see what would happen if I gave it say 10 MB rather than 1.9 MB. But I couldn't find a tunable which resulted in a larger value.

(Hints appreciated and I need to remember it's using 9k not 1.5k for packet bufs).

It would probably be nothing more than a kludgy workaround, but I'm curious if it spreads network buffer reads enough to avoid ZeroWin timeouts, or just delays them. RECV-Q returns to 0 extremely quickly after most "stalls" and then doesn't congest again until the next burst tens of seconds later, so a larger netbuffer might allow data to still be received at line speed during a burst and emptied in the second or two afterwards.

tcpdump results:
tcpdump showed the issue wasn't a client artefact and the packets were issued by the server. The zero windows in tcpdump coincided with the sharp rise of RECV-Q from zero(ish) to 1.9M in netstat, and its end also seemed to correlate with starting of disk writeout bursts in gstat. tcpdump also showed that the networking side seemed to be behaving - RST and ACK was reasonable, TCP win reduced for real reasons and recovered when it could, and the dropped connections were not isolated random incidents but a result of ongoing issues where TCP win was dropping, and where this got more frequent and ongoing the closer I brought the tunables back to "factory".

In other words the visible symptom was a "tip of iceberg" thing from the occasions where it was low enough long enough, but it was also being dropped often and regularly on occasions where it recovered fast enough not to cause a timeout. Reassuringly it corroborates a systemic issue with buffer emptying rather than some weird network problem!

Back to ZFS processing of disk IO:
In ZFS, as one TXG is written, the next should be filling. I'm not sure if what I'm seeing, matches that. Instead, data acceptance slows and quite often comes to a complete halt during flushing/write activity, and the NIC resumes accepting data (and RECV-Q drops away) right as the disks start chattering, and about the same time that `gstat` shows the start of high levels of write activity. The issue doesn't happen with network-only activity (`iperf` in either direction).

The fact this happens during LAN file write write activity only, not on pure network activity, has some kind of correlation with TXG/disk write-out, and can be improved/fixed by increasing ZFS sysctls, all suggests that somehow, the default factory ZFS settings (and perhaps combined with network settings that don't choose optimal defaults when 10G is detected?) are causing congestion to build up, eventually causing data to not be pulled from the network buffers which triggers the issue.

Request for help to figure next troubleshooting steps

Clearly my system needs help to figure out the precise issue, and the exact tunables that aren't optimal, and fix the unsuitable defaults. I'd like to use actual command output, not haphazard "stabs in the dark", to eliminate / confirm the exact "bottleneck point", so I can fix the problem directly and optimally and not have to keep custom tunables without good cause and without a way to know if they are needed or need changing in future. The system has plenty of RAM if anything needs bigger buffers to cope with periodic writeout/congestion, or can be configured to slow down data receipt without having the TCP window crash so low.

This kind of config issue needs one to work through a bunch of "possibles" or "candidate problems" to find exactly what is bottlenecking. It would help a lot to have a good idea of potential "underlying candidate issues" I should specifically check for, and useful commands to do this, so I can eliminate things that aren't the issue and narrow it down.

At this point, what would help most is a list of issues to eliminate+check and commands to use for the purpose, and any new insights into the data from people who know more about FreeBSD/networking/ZFS than I do (which is most people!). More specifically, suggestions what metrics/commands/outputs are relevant to methodically determine which aspect of ZFS/networking is having the issue and why the server isn't handling it gracefully - rather than just guessing semi-randomly at sysctls without really understanding the issue. I don't mind a long-ish list of monitoring commands/sysctls to try, and instructions what I'm looking for in them.

I didn't bring this here until Dru's inquiry (very very thank you Dru...), because it's a lot to digest and I didn't want to impose on others. But I could really do with a hand identifying next steps.

Screenshot of simultaneous output from tcpdump + netstat + gstat + Windows file transfer, during issue:

dlavigne · Feb 7, 2018

Please report your findings at bugs.freenas.org and post the ticket number here.

bigphil · Feb 7, 2018

I wonder if this could be something more simple? You didn't mention anything about your network setup, curious as to the layout. You've done quite a bit of troubleshooting, but have you tried the following:

-direct connection from client to server to remove any possible switch issues?
-different NIC on client and server?
-does the same happend on a 1Gb connection vs the 10Gb connection?
-what type of 10Gb cabling and/or modules are you using?
-T420 firmware updated?
-does the same issue happen over NFS?

I know these seem like trivial things, but it may just be that simple.

Stilez · Mar 8, 2018

dlavigne said:
Please report your findings at bugs.freenas.org and post the ticket number here.

I will once I've done what I can. Unfortunately I have a lot of time pressure so detailed testing isn't really possible as would be needed for a bug report. I will once I'm past this busy time, as this needs solving :)

bigphil said:
I wonder if this could be something more simple? You didn't mention anything about your network setup, curious as to the layout. You've done quite a bit of troubleshooting, but have you tried the following:

-direct connection from client to server to remove any possible switch issues?
-different NIC on client and server?
-does the same happend on a 1Gb connection vs the 10Gb connection?
-what type of 10Gb cabling and/or modules are you using?
-T420 firmware updated?
-does the same issue happen over NFS?

I know these seem like trivial things, but it may just be that simple.

Good points all. Quick answers - see comment to Dru above:
- direct: think Ive tried, would have to retry to be sure
- different NIC - yes, swapped cards, PCIe slots, transceivers etc. Fully reinstalled Win 8.1 and FreeNAS 11.1 as well, to be sure.
- 1G would be too slow to show the issue, it's a congestion issue and won't happen if the LAN isn't trying to handle a high enough packet/data rate. 1G should be fine and seems to be.
- Chelsio T4 + Finisar SG10 + OM2 all around. Switch is Netgear 10G managed
-T4 firmware - yes, checked.
- NFS no idea, don't use it. Could test but seems little point. Same happens over iSCSI, which eliminates Samba, which would be the main aim of trying NFS.

Stilez · Mar 8, 2018

Interesting update on this:

Enabling the Samba module "aio_pthread" (and aio read/write size = 1 for async + aio max threads / aio_pthread:aio num threads = 1024) seems to help. It still cycles down to zero bytes, then pauses, then cycles back up repeatedly, but at least it seems to let me send multiple large (40+ GB) files without dying midway through the transfer due to zero window - presumably because this mode at least sends "wait" or other packets as well, to avoid timeout? Not a fix but seems to help as a workaround.

Samba docs on this module seems to say that it can help a lot in some scenarios, as the usual module can have issues.

Stilez · Apr 4, 2018

@dlavigne -

I have an interesting update for you on this issue, which you said to report on redmine.

There may be a chance that the problem might be down to this upstream ZFS bug related to dirty data flushing when dedup is enabled, although not confirmed. I emailed the reporter of the bug who said it sounded like this issue.

He says it would affect writes to dedup datasets specifically, and that it could cause them to behave as described in this thread - namely to not flush properly until they eventually get flushed "all at once" (and repeat).

When I was originally posting this thread and updates to it, I didn't test if the issue was dedup specific, but most of my data is highly dup', so it's possible. But I thought I saw the same behaviour to non-dedup datasets as well (albeit in a pool with dedup datasets). Anyhow, I thought I'd give a heads up in case you can nudge this bug along? :)

Zarovzky · Apr 17, 2018

Stilez said:
@dlavigne -

I have an interesting update for you on this issue, which you said to report on redmine.

There may be a chance that the problem might be down to this upstream ZFS bug related to dirty data flushing when dedup is enabled, although not confirmed. I emailed the reporter of the bug who said it sounded like this issue.

He says it would affect writes to dedup datasets specifically, and that it could cause them to behave as described in this thread - namely to not flush properly until they eventually get flushed "all at once" (and repeat).

When I was originally posting this thread and updates to it, I didn't test if the issue was dedup specific, but most of my data is highly dup', so it's possible. But I thought I saw the same behaviour to non-dedup datasets as well (albeit in a pool with dedup datasets). Anyhow, I thought I'd give a heads up in case you can nudge this bug along? :)

Hi, do you know if this bug had been solved? I think I have this problem too, It didn't happen in version 10, it started happening on 11+. I'm currently on U4.

I have spikes of 100% load on the nic right before it dies, no ping, anything...needs restart to get back on.

I will have to try these tuneables u mentioned see if it gets any better, even though my conditions are a bit different.

Stilez · Apr 24, 2018

Zarovzky said:
Hi, do you know if this bug had been solved? I think I have this problem too, It didn't happen in version 10, it started happening on 11+. I'm currently on U4.
I have spikes of 100% load on the nic right before it dies, no ping, anything...needs restart to get back on.
I will have to try these tuneables u mentioned see if it gets any better, even though my conditions are a bit different.

I would NOT try "random tunables that someone you don't know on the internet used for their NAS" unless you've done some checking first to see if the cause might be the same, or you have some understanding of why those tunables might be relevant. There can be many reasons for these issues.

With that said, I'll try to help, at least with some info.

The option I've mentioned is one I've found helpful for one situation:

IF you are using deduplication (I am, because my data is suitable for it. Most people should not!)
AND
IF you have a fast network that can send data faster than deduplication can deduplicate and write it (I'm using 10 gigabit not 1 gigabit)
AND
You find that writes to the NAS time out or "hang", but reads do not,
AND
The issue isn't a "new" issue after changing FreeNAS version (the bug has been present for several years so it would have affected previous versions of FreeNAS)
THEN
There is a possibility that the bug I referred to above might be the issue, in which case the workaround I'm (cautiously!) using MIGHT help.

IF your problem matches all of these criteria, then the issue MIGHT be that deduplication isn't handling its buffers correctly on writing (known ZFS bug), and the data is coming in fast enough that instead of buffers gracefully switching, they fill up to the end and then blam! - they tell the network to stop until it's flushed, which would plausibly explain the Zero Window symptoms and timeouts.

I tried that samba setting (which seems to help) because I understood that the issue seems to be due to ZFS incorrect network buffer management, and the system running out of buffer space becaue ZFS wasn't emptying them fast enough.

But I had to test a lot of things to be sure that was what was happening.

For example, using tcpdump and netstat/iostat/gstat to watch timings and buffer sizes when it happened, and changing buffer related tunables temporarily to see what effect it had on symptoms.

When I understood more about what was happening, I looked for ways to help the buffers not become full, even in situations where ZFS wasn't emptying them at the right time. I tried tunables to handle buffers without sync, tunables to use larger buffers, and tunables to change the pattern and conditions when flushing to disk was done. I didn't understand all of the tunables fully, but I did understand them enough to have a sense which might help, and why, and to test them methodically.

Of all of those, using Samba's aio_pthread worked best as a workaround. But I knew what kind of issue I was having first, and that guided me to a suitable work round. Fixing within Samba's configuration was "good enough", because it was only an issue if network buffers were involved, so it could only affect writes I did with Samba.

It wasn't just a guess, and neither should you just guess for your setup.

In your case it's probably not the same, because the ZFS bug is unchanged from FreeNAS v10.

IceBoosteR · Apr 24, 2018

@Stilez I do very appreciate in your posting and the same on the bugticket. You have provided a ton of useful information and you dig a lot deeper as normal consumers and FreeNAS user would do.
I would wish that this problem get resolved in the next release, and the people at iX, FreeBSD or openZFS or whereever this problems comes from and relates to, do react to this.
I do not know exactly if I had the same problem some time ago https://forums.freenas.org/index.php?threads/deduplication-is-hitting-performance-way-too-hard.57618/#post-405839
but I believe so. Issues with SMB and iSCSI.
I have never digged deeper, but saw the same behaviour as you may did. With my system upgrade to 11 (and more hardware) I saw the same before I have upgraded my main system. In a nutshell: CPU looked fine, more as anough RAM to use and disks were ideling and network dropped from GBit speed to nothing. I thought that my system is not "strong" enough to handle dedup, but with your testing, rsults and experience I may think that there is an implentation issue. Don't bother me if I am wrong and my system IS not powerful enough... But I would like to see that this issue get fixed very soon.
Even if I cannot help you here, pushing the post at the top may help.

anodos · Apr 24, 2018

Stilez said:
Of all of those, using Samba's aio_pthread worked best as a workaround.

FYI, vfs_aio_pthread is a no-op on FreeBSD:

Code:

static struct vfs_fn_pointers vfs_aio_pthread_fns = {
#if defined(HAVE_OPENAT) && defined(USE_LINUX_THREAD_CREDENTIALS)
.open_fn = aio_pthread_open_fn,
#endif
};

So you're actually using samba's internal AIO code path.

Stilez · May 13, 2018

anodos said:
FYI, vfs_aio_pthread is a no-op on FreeBSD:

Code:
static struct vfs_fn_pointers vfs_aio_pthread_fns = { #if defined(HAVE_OPENAT) && defined(USE_LINUX_THREAD_CREDENTIALS) .open_fn = aio_pthread_open_fn, #endif };

So you're actually using samba's internal AIO code path.

I was mostly at a loss on any resolution. Because the issue seems to stem from a known ZFS dirty cache propagation issue, there's not a lot that I could do other than try different things that sound like they might cause data from network to ZFS to disk to cope to the point of not dying. That seemed to make a difference (file transfers consistently completed despite pauses, rather than consistently dying midway) - but if it's a no-op then I'm at a loss what I've done, because I thought I changed one thing at a time. Interesting and thank you!

Stilez · Jan 8, 2019

Cause of this issue now know. It's interesting! And a problem!
No imminent solution to the underlying problem, but hoping for workarounds.

See updated 1st post for details. Full info is on redmine, linked from there.

Stilez · Jan 8, 2019

dlavigne said:
Were you able to pinpoint the cause for this?

Yes, finally. Took teaching myself basic dtrace to do it. Not a cure though. Its a severe unreported side effect of a known headache with ZFS spacemaps.
See 1st post/redmine updates.

Redcoat · Jan 8, 2019

Stilez said:
Yes, finally. Took teaching myself basic dtrace to do it. Not a cure though. Its a severe unreported side effect of a known headache with ZFS spacemaps.
See 1st post/redmine updates.

Should your update be "2019-01-08"?

Stilez · Jan 8, 2019

Redcoat said:
Should your update be "2019-01-08"?

Yes! Fixed thank you.

Important Announcement for The TrueNAS Community.

Samba + iSCSI failing - NAS sends "ZeroWindow" timeouts on simple workloads most of the time

Stilez

Guru

dlavigne

Guest

Stilez

Guru

dlavigne

Guest

bigphil

Patron

Stilez

Guru

Stilez

Guru

Stilez

Guru

Zarovzky

Cadet

Stilez

Guru

IceBoosteR

Guru

anodos

Sambassador

Stilez

Guru

Stilez

Guru

Stilez

Guru

Redcoat

MVP

Stilez

Guru

Similar threads

Important Announcement for The TrueNAS Community.