Pool imported from Core to Scale has permanent I/O

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Hi all,

I have transitioned from Core to SCALE using a fresh install, but moved the pools over by export/import. Ever since importing it into SCALE, my pool (2x2 striped mirrors plus two mirrored NVMe log devices) has shown constant I/O of ~35MiB/s:
Code:
root@truenas:~# zpool iostat zpool
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zpool       5.68T  1.57T    113  1.00K  33.8M  35.5M


Initially I thought it must be scrubbing, but after a day I had had enough and tried to stop the scrub using "zpool scrub -s zpool", only to be told no scrub was in progress. The pool seems clean:

Code:
root@truenas:~# zpool status zpool
  pool: zpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 07:27:39 with 0 errors on Sun Dec 10 07:27:40 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        zpool                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            9f99eb8f-273c-11e9-aed7-0007430c2eaa  ONLINE       0     0     0
            a102ecc6-273c-11e9-aed7-0007430c2eaa  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            63f35a47-95b0-11ea-80a2-0007430c16cc  ONLINE       0     0     0
            d5dec381-955e-11ea-80a2-0007430c16cc  ONLINE       0     0     0
        logs
          mirror-30                               ONLINE       0     0     0
            2a2bb767-88ea-41f3-ae22-616961e17e9d  ONLINE       0     0     0
            93d304ba-b2e0-4c25-9d7b-5c02fa091b87  ONLINE       0     0     0

errors: No known data errors


So I don't think a resilver is in progress either (and should have finished by now). Any ideas what might be causing this I/O?

Best of thanks in advance,

Kai.
 
Joined
Oct 22, 2019
Messages
3,641
Probably K3s and/or the middleware, even if you're not using "Apps".

This has come up before in the forums.

One thing about Core (FreeBSD-based, without K3s) is that it's more idle and quiet than SCALE, if your system is not transferring data. Even if you use jails under Core, the only activity is from the processes themselves (only if they're actually doing something.) Otherwise, there is no overhead I/O or CPU usage. This means you can have several jails all running at once, and not even a blip of I/O or CPU usage if no jail process is currently handling data. Jails may require some hands-on command-line understanding, but they're sadly overlooked. :frown:
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
I'm seeing 30MiB/s and more, mostly writes, though reads and writes are now roughly balanced. Surely this can't be middleware? I'm not running any apps at all.

@HoneyBadger while iotop is not present, htop is, and it has a page for I/O (press Tab to get there); hitting F2 lets you add new columns. It turns out /sbin/init has read 2660GiB and written 466GiB over the last 21 or so hours of uptime, which works out to roughly 6MiB/s of writes. The next biggest process is /usr/sbin/smbd, but it has written only ~4GiB, which is plausible given what I've been doing. Neither explains a constant 30MiB/s stream of data going to my pool.
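For anyone else without iotop: the same numbers can be pulled straight out of procfs. A rough sketch of a one-off "top writers" list (run as root; this is my own approximation of iotop, not a TrueNAS tool):

Code:
# List the processes with the most cumulative bytes written to storage
# since they started, using the per-process counters in /proc/<pid>/io.
for p in /proc/[0-9]*; do
    w=$(awk '/^write_bytes/ {print $2}' "$p/io" 2>/dev/null) || continue
    [ -n "$w" ] && printf '%12s  %s\n' "$w" "$(tr '\0' ' ' < "$p/cmdline" | head -c 60)"
done | sort -rn | head -n 15
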

In terms of what's actually actively writing, /usr/sbin/netdata -D seems most active (but doesn't come anywhere near the 30MiB/s being written).

Any idea what /sbin/init or /usr/sbin/netdata is doing?

Can this have anything to do with me using nspawn (jailmaker)?

TIA

Kai.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
How would I identify k3s?
It's still netdata at the top, a couple of systemd things, and then a bunch of middlewared entries, but they have literally zero throughput;

[screenshot of the htop I/O view]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The process will literally be /usr/local/bin/k3s server similar to this:

[screenshot of htop showing the k3s server process]


If you haven't set up apps though, you shouldn't see it.
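A quick way to check from the shell (assuming a stock SCALE install) is just to look for the process:

Code:
# Prints the PID and full command line of any running k3s processes;
# no output means k3s isn't running.
pgrep -af k3s
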

Your htop also looks a bit different from mine - your IO_R and IO_W look like my RCHAR/WCHAR columns.
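If I'm reading procfs right, those two families of counters aren't the same thing. A sketch using PID 1, as in your example:

Code:
# rchar/wchar count every read()/write() syscall, including data served
# from the page cache or pipes; read_bytes/write_bytes are what actually
# hit the block layer, which is the number that matters for pool I/O.
grep -E '^(rchar|wchar|read_bytes|write_bytes)' /proc/1/io
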
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
... I think IO_R/IO_W and RCHAR/WCHAR are essentially equivalent ... might be wrong though ...

In any case, I switched most services off, rebooted, switched them back on, and now it's calmed down a lot;
[screenshot]

However, SMB transfers, which were what caused me to look for the culprit in the first place, are still way too low:

[screenshot of SMB transfer speed]


This is less than half of what I was getting in Core on the same hardware (2x2 striped mirrors of spinning disks with two mirrored Optanes as SLOG/ZIL). When I remove the SLOG it becomes really evident that something is wrong;

[screenshot]


Will have to go away and do some hunting...
 
Joined
Oct 22, 2019
Messages
3,641
Are you enforcing "sync" writes, rather than the default of "standard" (which uses "async" for SMB)?
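You can double-check what each dataset is actually set to with something like:

Code:
# Shows the effective sync setting (standard/always/disabled) per dataset
zfs get -r -t filesystem,volume sync zpool
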
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Yes, sync=always on all datasets, same as before in Core, but 244MiB/s with the SLOG and 20MiB/s without it is way too low. Like I said, in Core I was getting >600MiB/s over SMB (with the SLOG), and ~1GiB/s with sync=standard. iperf3 also saturates my network (~1.09GiB/s).
Small point of info:
[screenshot]

Not great but also kinda what I'd expect from only 2 striped mirrors...
Ah wait scrap that ... compression ...
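To take both SMB and compression out of the picture, a local sync-write test on the server should show the raw pool behaviour; a sketch along these lines (paths made up, and assuming fio is available):

Code:
# 128K sequential writes with O_SYNC, so every write has to hit the
# SLOG/ZIL; fio's default buffers aren't trivially compressible.
mkdir -p /mnt/zpool/fio-test
fio --name=synctest --directory=/mnt/zpool/fio-test \
    --ioengine=psync --rw=write --bs=128k --size=4g --sync=1
rm -rf /mnt/zpool/fio-test
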
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Here's something more relevant...
[screenshot]

This really is much slower than Core.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
... or so I thought ...

The quickest way to confirm pool speed was to reinstall Core again, this time on bare metal...
[screenshot]

So something changed in the pool. I checked the disks, and the only thing I found is a few failed SMART tests on one of the disks at a lifetime of 27644 hrs; all tests since then have passed (and no anomalies noticed either). Its current age is 39277 hrs. The failing attribute was raw_read_error_rate, and no other values indicate errors (including reallocated sectors and CRC errors). That was just before I moved the pool from a "normal" enclosure to my current DL380P G8.
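For reference, that came from the SMART attribute table and self-test log, i.e. something like this (device node made up):

Code:
# Full SMART attributes plus the self-test history for one disk
smartctl -a /dev/sdX
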

Where else can I look for changes that might have reduced my throughput?

TIA

Kai.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
So @HoneyBadger, I remembered we were playing around with the vfs.zfs.dirty* sysctls a while back, and I wanted to check how this stacks up against the values from back then (same hardware then and now)... Then:

Code:
truenas# dtrace -s dirty.d zpool
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  74566                 none:txg-syncing    0MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used
  4  74566                 none:txg-syncing    0MB of 10240MB used
  1  74566                 none:txg-syncing 1232MB of 10240MB used
  3  74566                 none:txg-syncing 3768MB of 10240MB used
  3  74566                 none:txg-syncing 3344MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used

[screenshot]


and now:

Code:
[root@truenas /mnt/zpool/data/preserve/truenas]# dtrace -s dirty.d zpool
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
 17  82249                 none:txg-syncing 1240MB of    0MB used
 17  82249                 none:txg-syncing 4491MB of    0MB used
 15  82249                 none:txg-syncing 3009MB of    0MB used
  1  82249                 none:txg-syncing 1280MB of    0MB used


The "of 0MB used" immediately jumps off the page to me.

However sysctls are the same:
Code:
[root@truenas /mnt/zpool]# sysctl -a | grep -i zfs.dirty
vfs.zfs.dirty_data_sync_percent: 20
vfs.zfs.dirty_data_max_max: 4294967296
vfs.zfs.dirty_data_max: 10737418240
vfs.zfs.dirty_data_max_max_percent: 25
vfs.zfs.dirty_data_max_percent: 10
[root@truenas /mnt/zpool]#


Is this significant?
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Also just for completeness;
Code:
[root@truenas /mnt/zpool]# zpool iostat zpool
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zpool       4.88T  2.37T      2    382  28.0K  88.4M
[root@truenas /mnt/zpool]#

So it's definitely doing something in the background. This is a cleanly installed Core, no apps, no VMs, nothing, just the SMB share.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
So this time I thought I'd check in on the I/O activity every so often... Not at regular intervals, but I think the trend is what matters...
Code:
truenas# zpool iostat zpool
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zpool       4.88T  2.37T      2    382  28.0K  88.4M
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G     36      0  4.13M  2.70K
zpool       4.88T  2.37T     10    279  62.1K  62.4M
----------  -----  -----  -----  -----  -----  -----
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G     26      0  3.05M  2.00K
zpool       4.88T  2.37T      7    211  45.9K  46.1M
----------  -----  -----  -----  -----  -----  -----
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G     14      0  1.64M  1.20K
zpool       4.88T  2.37T      4    127  24.6K  24.9M
----------  -----  -----  -----  -----  -----  -----
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G     12      0  1.38M  1.09K
zpool       4.88T  2.37T      3    111  20.8K  21.0M
----------  -----  -----  -----  -----  -----  -----
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G      9      0  1.07M    869
zpool       4.88T  2.37T      2     93  16.1K  16.4M
----------  -----  -----  -----  -----  -----  -----
truenas# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
boot-pool   11.1G   205G      8      0   976K    771
zpool       4.88T  2.37T      2     85  14.3K  14.6M
----------  -----  -----  -----  -----  -----  -----

... overnight break ...

truenas# zpool iostat -v zpool
                                                  capacity     operations     bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
zpool                                           4.88T  2.37T      0     45  4.46K  4.73M
  mirror-0                                      2.33T  1.29T      0      4  2.21K  1.23M
    gptid/9f99eb8f-273c-11e9-aed7-0007430c2eaa      -      -      0      2  1.12K   627K
    gptid/a102ecc6-273c-11e9-aed7-0007430c2eaa      -      -      0      1  1.09K   627K
  mirror-1                                      2.55T  1.08T      0      4  2.23K  1.54M
    da4p2                                           -      -      0      2  1.13K   786K
    gptid/d5dec381-955e-11ea-80a2-0007430c16cc      -      -      0      2  1.10K   786K
logs                                                -      -      -      -      -      -
  mirror-32                                     1.86M  27.0G      0     37     28  1.97M
    gptid/796f1120-ab4a-11ee-a4db-ecb1d77d85a0      -      -      0     18     14  1007K
    gptid/7970f053-ab4a-11ee-a4db-ecb1d77d85a0      -      -      0     18     14  1007K
----------------------------------------------  -----  -----  -----  -----  -----  -----


Looks to me like it's scanning the whole thing, from the fast sectors to the slow ones (yes, spinning rust). I just did a rough estimate, and if it really is scanning the entire surface it'll take at least 44 hours.
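To keep watching the trend without re-running the command by hand, zpool iostat can also sample on an interval:

Code:
# Per-vdev stats for zpool, refreshed every 10 seconds (Ctrl-C to stop)
zpool iostat -v zpool 10
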

Code:
truenas# zpool status zpool
  pool: zpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 07:27:39 with 0 errors on Sat Dec  9 22:27:40 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/9f99eb8f-273c-11e9-aed7-0007430c2eaa  ONLINE       0     0     0
            gptid/a102ecc6-273c-11e9-aed7-0007430c2eaa  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            da4p2                                       ONLINE       0     0     0
            gptid/d5dec381-955e-11ea-80a2-0007430c16cc  ONLINE       0     0     0
        logs
          mirror-32                                     ONLINE       0     0     0
            gptid/796f1120-ab4a-11ee-a4db-ecb1d77d85a0  ONLINE       0     0     0
            gptid/7970f053-ab4a-11ee-a4db-ecb1d77d85a0  ONLINE       0     0     0

errors: No known data errors


So I don't think it's doing a scrub (which normally takes less than 8 hours). I don't think it's resilvering. Is there anything in ZFS that might cause this kind of behaviour?
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Some more info... This time back in Scale 23.10...

Code:
root@truenas:~# date
Fri Jan  5 11:57:12 CET 2024
root@truenas:~# uptime
 11:57:18 up  1:52,  2 users,  load average: 0.05, 0.05, 0.05
root@truenas:~# zpool events zpool
TIME                           CLASS
Jan  5 2024 10:05:13.037640239 sysevent.fs.zfs.history_event
Jan  5 2024 10:05:13.601640252 sysevent.fs.zfs.config_sync
Jan  5 2024 10:05:13.605640252 sysevent.fs.zfs.pool_import
Jan  5 2024 10:05:13.729640255 sysevent.fs.zfs.history_event
Jan  5 2024 10:05:14.093640263 sysevent.fs.zfs.config_sync
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand


Any and all info on whether this is relevant would be much appreciated.
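I also figured the pool's internal history might show what keeps writing; something along these lines:

Code:
# -i includes internally logged events, -l adds user/host/timestamps
zpool history -il zpool | tail -n 20
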

Kai.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
And finally something I found on Reddit: it seems zed may, in some circumstances, cause sysevent.fs.zfs.config_sync events. Could that be what's happening here?

I'm beginning to pull out what few hairs I have left, especially as this behaviour now shows up in both Core and SCALE... Any ideas, @HoneyBadger?

K
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So @HoneyBadger, I remembered we were playing around with the vfs.zfs.dirty* sysctls a while back, and I wanted to check how this stacks up against the values from back then (same hardware then and now)...

The "of 0MB used" immediately jumps off the page to me.

Is this significant?

No, I believe the kernel symbol we're trying to read there stopped being exported for some reason. It's still reporting the correct "X MB" amount; we just have to go to the sysctl (or module parameter) to grab the "out of Y MB" part.
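On SCALE that ceiling is exposed as a module parameter rather than a sysctl, so the equivalent checks would be roughly:

Code:
# SCALE (Linux OpenZFS): dirty-data ceiling in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# Core (FreeBSD): the sysctl you already listed
sysctl vfs.zfs.dirty_data_max
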

Does zpool events indicate config_sync activity going on?
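You can also leave it following in the foreground and see whether new events line up with the write bursts:

Code:
# -v prints full event payloads, -f keeps following new events
zpool events -vf zpool
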
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Yes @HoneyBadger ;
Code:
root@truenas:~# zpool events zpool
TIME                           CLASS
Jan  5 2024 10:05:13.037640239 sysevent.fs.zfs.history_event
Jan  5 2024 10:05:13.601640252 sysevent.fs.zfs.config_sync
Jan  5 2024 10:05:13.605640252 sysevent.fs.zfs.pool_import
Jan  5 2024 10:05:13.729640255 sysevent.fs.zfs.history_event
Jan  5 2024 10:05:14.093640263 sysevent.fs.zfs.config_sync
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand
Jan  5 2024 10:05:19.597640388 sysevent.fs.zfs.vdev_autoexpand


That's straight after a reboot. Ever since the config_sync, the I/O has been going on...
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Hello all, and especially @HoneyBadger,
so I thought I'd try to get the pool to speed up doing whatever it thinks is necessary, so I stuck my old RMS-200 back in and replaced the two Optanes I was running as SLOG with it. It's still doing _something_ in the background (~61MiB/s), but in the meantime my NFS and SMB speeds are back to normal. In fact, they're a little better now, even with the background I/O going on;

[screenshot]


Remember, this is with sync=always... I'm going to keep it running for a few days, see whether the background traffic stops at some stage, and report back after the weekend...

@HoneyBadger - best of thanks as always for your input. I do hope this will fix itself now.

Kai.
 