Swap with 9.10


Stux

MVP
Found this post, where someone reports FreeNAS crashing on heavy data transfers. The crash happens after it starts swapping... unnecessarily, perhaps... which is the calling card of this ARC/VM/UMA conflict:

https://forums.freenas.org/index.php?threads/freenas-crashing-on-heavy-data-transfers.41851/

Someone points out the FreeBSD bug thread I've been trawling

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

This is filed against 10.0; I don't know whether the same applies to the 9.x branches, or what FreeNAS itself has changed.

@jgreco even drops in to wonder if this is part of the infamous bug 1531, Performance Suckage
https://bugs.freenas.org/issues/1531

It shouldn't do that, obviously. The swapping isn't a great thing but it is fairly normal for some modest amount of swapout to occur over time. The ~4GB that unused bits of the FreeNAS middleware seems to like to occupy is the usual target. This is because there's a lot of stuff on a NAS that isn't used by your *particular* configuration.

Is there any chance that when it "crashes", it recovers over time? It's possible that you're running into some variation of the issues in bug 1531 relating to transaction group writes, which are supposed to be addressed by the new write throttle mechanism, but if you're maybe catching it before it is able to measure and adjust, it's very possible you could create a situation where the system might go catatonic for ... I'm just going to guess at 30-180 seconds. In such a case, what's actually happening is that one transaction group is being flushed to disk and another full transaction group has been created in the meantime. At that point, ZFS *must* pause, because it isn't committing to disk quickly enough.

I understand that Bug 1531 is actually because the write buffers get overloaded...

Anywho, I'll trawl through the FreeNAS bug tracker
 

Stux

MVP
There is this bug, #17672, Panic During hot disk removal
https://bugs.freenas.org/issues/17672

This is my concern. It happens because there is swap on the system. My script helps to mitigate the problem... but I think there is an underlying issue in FreeBSD causing the unnecessary swapping in the first place.

And this bug,
System crashes due to 1 drive failure - Non-mirrored Swap on hard drives is bad idea
https://bugs.freenas.org/issues/11617

The solution is to wait for FreeNAS 10...

And this bug,
L2ARC cache larger than RAM+swap + write error in zpool
https://bugs.freenas.org/issues/8054

And this feature request to work around the issue, closed years ago as working correctly (setting swap to zero is not a solution):
Option to disable usage of swap partitions
https://bugs.freenas.org/issues/1709

Another request for Mirrored Swap
https://bugs.freenas.org/issues/208
 

Stux

MVP
So, I found some revisions dealing with this patch

https://bugs.freenas.org/projects/f...ions/f2543cb01cf389f602fb09b5ecdead5b9354a916
and then:
https://bugs.freenas.org/projects/f...ions/502601a54088ae2edc419343befadcf273f55be7


Refactor ZFS ARC reclaim logic to be more VM cooperative

Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could lead to large amounts of unused RAM; e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

According to this, and Karl's original research, arc_free_target needs to equal v_free_target... which this checkin says it does... but not on my system

Code:
# sysctl vfs.zfs.arc_free_target vm.v_free_target
vfs.zfs.arc_free_target: 56375
vm.v_free_target: 173129
 

Stux

MVP
So, looking into this a bit further
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

# sysctl vfs.zfs.arc_free_target vm.v_free_target
vfs.zfs.arc_free_target: 56375
vm.v_free_target: 173129


I noticed that they're not equal. According to the FreeBSD bug report, if they're not equal, then ARC will fight with the VM pager, and all hell will eventually break loose.

I believe they're in 4KB pages. So,

vfs.zfs.arc_free_target: 56375 = 230MB
vm.v_free_target: 173129 = 709MB

230MB certainly seems like the free space that the ARC is aiming for. And if the VM is panicking and dropping stuff to try to maintain 709MB, then eventually it will run out of options and page... and I expect it will page about 400MB... which is what I'm seeing.
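
As a quick sanity check on the conversion (taking 1 MB as 10^6 bytes, which matches the figures above; the gap between the two targets is roughly what one might expect to see paged out):

Code:
# Both sysctls are in 4 KiB pages
echo "$(( 56375  * 4096 / 1000000 )) MB"            # arc_free_target ~ 230 MB
echo "$(( 173129 * 4096 / 1000000 )) MB"            # v_free_target   ~ 709 MB
echo "$(( (173129 - 56375) * 4096 / 1000000 )) MB"  # gap             ~ 478 MB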

So, I set arc_free_target = v_free_target, and it could be a coincidence, but I've been having trouble with resilvering stalling... and what do you know... as soon as I pressed return the scrub kicked back up to 190MB/s, which is my disk's sequential speed.

[Attachment: ada3 stalled while silvering.png]


You can see the 3-minute stall where *nothing* was happening just about to scroll off on the left. The burst seemed to happen just as I pressed return on this sysctl command:

Code:
# sysctl vfs.zfs.arc_free_target=173129
vfs.zfs.arc_free_target: 56375 -> 173129
# zpool status -v tank
...
	   6.24T scanned out of 7.16T at 200M/s, 1h19m to go


I wonder if this will fix it?
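
If this does turn out to be the fix, a minimal sketch of keeping the two values aligned, assuming it gets re-run after each boot (from a shell, cron, or a post-init task); purely illustrative:

Code:
#!/bin/sh
# Align the ARC's free-page reclaim target with the VM pager's target.
# Both sysctls are in 4 KiB pages; the value resets on reboot.
target="$(sysctl -n vm.v_free_target)"
current="$(sysctl -n vfs.zfs.arc_free_target)"
if [ "${current}" -ne "${target}" ]; then
    sysctl vfs.zfs.arc_free_target="${target}"
fi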
 

Stux

MVP
BTW, my 8GB system suffers severe stalls while scrubbing/resilvering (i.e. down to 1MB/s). I thought it was a disk problem and was going to run badblocks once I finished transferring the data to the new system...

Code:
sysctl vfs.zfs.arc_free_target vm.v_free_target
vfs.zfs.arc_free_target: 14135
vm.v_free_target: 43336


same thing:
vfs.zfs.arc_free_target: 14135 = 58MB
vm.v_free_target: 43336 = 177MB
 

Stux

MVP
Found this comment
https://bz-attachments.freebsd.org/attachment.cgi?id=174254

In the patch for FreeBSD11
Code:
/*
+ * When arc is initialized, perform the following:
+ *
+ * 1. If we are in the "memory is low enough to wake the pager" zone,
+ *    reap the kernel UMA caches once per wakeup_delay period (500ms default)
+ *    AND wake the pager up (so it can demote pages from inactive to cache to
+ *    ultimately the free list.)
+ *
+ * 2. If we're below VM's free_target in free RAM reap *one* UMA zone per
+ *    time period (500ms).
+ *
+ */


This seems to confirm that the UMA free pages are marked as "Inactive", and the solution here is to re-mark them as "Cache", which is a type of page that can simply be dropped when space is needed.

The issue I think I'm having now has moved beyond "Unnecessary Swap" to "Too Much Inactive Memory".
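
A quick way to watch where the pages sit from the shell (these vm.stats counters are present in the FreeBSD 10.x base that 9.10 uses; all values are 4 KiB pages):

Code:
# Inactive is where the "UMA free" pages pile up;
# Cache pages are the ones that can be dropped immediately.
sysctl vm.stats.vm.v_active_count vm.stats.vm.v_inactive_count \
       vm.stats.vm.v_cache_count vm.stats.vm.v_free_count \
       vm.stats.vm.v_wire_count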
 

MrToddsFriends

Documentation Browser
Some info from my 32GB system running FreeNAS-9.10.1 (d989edd):

Code:
scr@blunzn:~ % sysctl vfs.zfs.arc_free_target vm.v_free_target
vfs.zfs.arc_free_target: 56529
vm.v_free_target: 173594


(173594 - 56529) * 4 kBytes / 1024 = 457.28 MBytes

According to the FreeNAS GUI reporting, max swap utilization was
- 303.7M from week 34 until now, running FreeNAS-9.10.1 (d989edd),
- 499.8M from week 13 until now, with several 9.10 versions in use.
 

rs225

Guru
I'm glad to see some attention on this problem.

This is why I smirk every time anybody tells somebody their system is broken because it doesn't have 8GB of RAM. The problem has always been this bug. I think it also touched on somebody else's problem where their VirtualBox would bomb out if some cache setting wasn't set on their virtual disks. I think I commented about it.

I don't think this problem could be a cause of corruption, or else the FreeBSD PR would probably mention that. However, I continue to keep an eye out for a bug that does cause corruption, because I think there is too much of it being reported that never has what I feel is a 'good' explanation.
 

Stux

MVP
This is why I smirk every time anybody tells somebody their system is broken because it doesn't have 8GB of RAM. The problem has always been this bug. I think it also touched on somebody else's problem where their VirtualBox would bomb out if some cache setting wasn't set on their virtual disks. I think I commented about it.

I'm seeing the other part of the FreeBSD bug report now. Namely, Inactive RAM is growing continuously over time, forcing the ARC to shrink. Eventually the ARC won't be able to shrink anymore...

[Attachment: inactive shrinking arc.png]


The purple line is Wired, which is a proxy for ARC. The grey above the purple is Inactive, which is a proxy for UMA free (but not actually Free).

ARC is slowly shrinking. It's now 10.8GB, on a 32GB system which has 16.5GB of "Inactive" RAM!

The Inactive "growth" coincided with an rsync occurring to an iSCSI volume. It's the random-sized blocks which cause it.

I'm going to let this system grind itself into the dust before rebooting it and seeing if it repros.
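
To watch the squeeze as it happens, a rough logging loop like this should do; kstat.zfs.misc.arcstats.size is in bytes and v_inactive_count is in 4 KiB pages (the interval and output format are arbitrary):

Code:
#!/bin/sh
# Log ARC size vs Inactive memory once a minute to see the ARC being squeezed.
while :; do
    printf '%s arc_bytes=%s inactive_pages=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(sysctl -n kstat.zfs.misc.arcstats.size)" \
        "$(sysctl -n vm.stats.vm.v_inactive_count)"
    sleep 60
done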
 

rs225

Guru
I think it is mentioned in the PR, but I think Karl (or someone) did theorize that the problem is more likely when you have more than one ZFS block size in use, as each one uses a different UMA(?) pool.
 

Stux

MVP
I think it is mentioned in the PR, but I think Karl (or someone) did theorize that the problem is more likely when you have more than one ZFS block size in use, as each one uses a different UMA(?) pool.

Compression is enabled, and doesn't that result in essentially random sizes?
 

rs225

Guru
Compression is enabled, and doesn't that result in essentially random sizes?

Yes, but I don't think it was as bad as that. I think it allocates from UMA equal to the max blocksize, pre-compression. So the more variety of active blocksizes in your system, the bigger problem you may have with this UMA issue.

It looks like any improvements in 11.0 will show up in FreeNAS, and beyond that, it would require convincing FreeNAS to apply patches that FreeBSD has declined to incorporate so far.

So if you have a configuration workaround, that is probably the best for now. Does setting vfs.zfs.arc_max work to avoid the problem?
 

Stux

MVP
vfs.zfs.zio.use_uma = 0

should do it. Has to be done very early.
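
If "very early" means a loader tunable rather than a runtime sysctl (which is my understanding), on stock FreeBSD that would be a line in /boot/loader.conf; in FreeNAS, presumably a Tunable of type "Loader" in the GUI:

Code:
# /boot/loader.conf -- read at boot, before ZFS sets up its allocators
vfs.zfs.zio.use_uma="0"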
 

Dice

Wizard
So, I set arc_free_target = v_free_target, and it could be a coincidence, but I've been having trouble with resilvering stalling... and what do you know... as soon as I pressed return the scrub kicked back up to 190MB/s, which is my disk's sequential speed.
vfs.zfs.zio.use_uma = 0

I've only glanced through this thread. I wonder what your conclusions are on how to 'fix' this as of now? In the bug report you filed, the comment by the dev indicated this is not going to be fixed soon.
Forgive me, would you mind sharing your conclusions on what the important settings are? Are the settings set via the CLI, or as tunables in the GUI?
 

Stux

MVP
The important setting to test is

vfs.zfs.zio.use_uma = 0

According to the FreeBSD bug report, this should work around the issue. The side-effect will be slightly slower memory allocations in ZFS, which shouldn't be a problem on a normal home-grade NAS.

Two problems:

1) I haven't had time to test this yet.
2) I haven't actually worked out how to set this tunable yet :)

The best way to trigger the swap usage is to run a complicated rsync while doing a scrub.
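
For example (the pool name and paths here are placeholders):

Code:
# Stress-test sketch: run a scrub and a metadata-heavy rsync at the same time
zpool scrub tank &
rsync -aHAX --numeric-ids /mnt/tank/source/ /mnt/tank/dest/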

In the meantime, my page_in script works around the immediate problem. The other problem is that Inactive RAM continues to climb over time, causing Wired RAM (i.e. ARC) to decrease, which lowers performance and eventually... will require/cause a restart.
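
For anyone curious, the core of the page_in idea is just cycling each in-use swap device off and back on, which forces its contents back into RAM. A rough sketch, not the exact script (the device selection and logging are illustrative):

Code:
#!/bin/sh
# For each swap device with pages in use, swapoff pulls its contents back
# into RAM, then swapon re-enables it.
swapinfo -k | awk '$1 ~ /^\/dev\// && $3 > 0 { print $1, $3 }' |
while read dev used; do
    echo "$(date '+%Y-%m-%d %H:%M:%S'): Paging in ${used} KiB on ${dev}"
    swapoff "${dev}" && swapon "${dev}"
done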
 

Stux

MVP
An update, though: the last swap usage on my systems was on October 7th, which I believe is the day I upgraded to U2.

Code:
2016-10-07 14:20:00: Paging in 118120 Bytes on /dev/ada2p1.eli
2016-10-07 14:20:06: Paging in 116712 Bytes on /dev/ada4p1.eli
2016-10-07 14:20:12: Paging in 117808 Bytes on /dev/ada1p1.eli
2016-10-07 14:20:18: Paging in 118640 Bytes on /dev/ada0p1.eli
2016-10-07 14:20:23: Paging in 121248 Bytes on /dev/ada3p1.eli


(On my small system I check for swap every 10 minutes and get an email if there is any to be paged in.)

It's possible that U2 fixed the issue. It's also possible (although perhaps less likely) that the systems haven't been worked hard enough to trigger it.
 

Mlovelace

Guru
It's possible that U2 fixed the issue.
I haven't seen the inactive memory climb on this NAS since the U2 update, nor have I noticed swap being used, both of which were happening before the update. The system gets about 4TB of data written to it a day, so it's moderately active.
[Attachment: nasmem.JPG]
 

Dice

Wizard
Encouraging news. 4TB a day - moderately active? ...jeeez :oops:
 

Mlovelace

Guru
4TB a day - moderately active? ...jeeez :oops:
It's in an enterprise environment, so it gets one of our VMware cluster backups via Veeam, as well as a few MSSQL db backups and their associated transaction logs, which are then snapshotted and replicated to a "sister" NAS at our DR site. Luckily we only retain 90 days of active data and archive the rest; otherwise I'd have to grow the pools significantly. :)
 