Hi All
I know this one is super common and it's because of IO, I'm hoping I have a unique issue though...
Hardware is a S451-3R1 Gigabyte NAS:
Linux version 6.1.63-production+truenas (Cobian)
AMD EPYC 7352 24-Core Processor
GIGABYTE S451-Z30-00
48GB RAM (8GB sticks)
34 x Toshiba 16TB HDD (TOSHIBA_MG08ACA16TE) SAS
2 x Toshiba SSD (boot disks, RAID 1)
Intel 10Gb NICs
I have a 430TB ZFS pool. It originally had dedupe and LZ4 compression enabled; I've since disabled both. It stores mostly VERY large video files, but it also has a lot of small files, sometimes 270k+ of them in a single subtree.

After some deletions from the pool I end up with this in syslog:
Code:
[47609.082625] INFO: task txg_sync:14870 blocked for more than 120 seconds.
[47609.090097]       Tainted: P           OE      6.1.63-production+truenas #2
[47609.097721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47609.106210] task:txg_sync        state:D stack:0     pid:14870 ppid:2      flags:0x00004000
[47609.115234] Call Trace:
[47609.118340]  <TASK>
[47609.121110]  __schedule+0x2ed/0x860
[47609.125265]  schedule+0x5a/0xb0
[47609.129052]  schedule_timeout+0x94/0x150
[47609.133621]  ? __bpf_trace_tick_stop+0x10/0x10
[47609.138716]  io_schedule_timeout+0x4c/0x80
[47609.143278]  __cv_timedwait_common+0x12a/0x160 [spl]
[47609.148708]  ? cpuusage_read+0x10/0x10
[47609.152908]  __cv_timedwait_io+0x15/0x20 [spl]
[47609.157816]  zio_wait+0x10b/0x220 [zfs]
[47609.162343]  spa_sync_frees+0x3a/0x70 [zfs]
[47609.167201]  spa_sync_iterate_to_convergence+0x10a/0x200 [zfs]
[47609.173705]  spa_sync+0x306/0x5d0 [zfs]
[47609.178055]  txg_sync_thread+0x1e4/0x250 [zfs]
[47609.182943]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[47609.188435]  ? sigorsets+0x10/0x10 [spl]
[47609.192661]  thread_generic_wrapper+0x5a/0x70 [spl]
[47609.197850]  kthread+0xe9/0x110
[47609.201291]  ? kthread_complete_and_exit+0x20/0x20
[47609.206385]  ret_from_fork+0x22/0x30
[47609.210262]  </TASK>
[48954.437910] systemd-journald[1454]: Data hash table of /var/log/journal/e6a7eed86d154d0f8f649912be0a1a89/system.journal has a fill level at 75.0 (8533 of 11377 items, 6553600 file size, 768 bytes per hash table item), suggesting rotation.
[48954.462014] systemd-journald[1454]: /var/log/journal/e6a7eed86d154d0f8f649912be0a1a89/system.journal: Journal header limits reached or header out-of-date, rotating.
[51475.585424] INFO: task rm:307616 blocked for more than 120 seconds.
[51475.592424]       Tainted: P           OE      6.1.63-production+truenas #2
[51475.599998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[51475.608433] task:rm              state:D stack:0     pid:307616 ppid:1      flags:0x00004006
[51475.617506] Call Trace:
[51475.620555]  <TASK>
[51475.623271]  __schedule+0x2ed/0x860
[51475.627379]  schedule+0x5a/0xb0
[51475.631139]  io_schedule+0x42/0x70
[51475.635168]  cv_wait_common+0xaa/0x130 [spl]
[51475.640084]  ? cpuusage_read+0x10/0x10
[51475.644498]  txg_wait_open+0x89/0xd0 [zfs]
[51475.649326]  dmu_free_long_range_impl+0x1e6/0x340 [zfs]
[51475.655237]  dmu_free_long_range+0x78/0xc0 [zfs]
[51475.660522]  zfs_rmnode+0x34c/0x440 [zfs]
[51475.665211]  zfs_inactive+0x127/0x200 [zfs]
[51475.670063]  zpl_evict_inode+0x36/0x50 [zfs]
[51475.675003]  evict+0xd0/0x1d0
[51475.678279]  do_unlinkat+0x170/0x320
[51475.682152]  __x64_sys_unlinkat+0x33/0x60
[51475.686464]  do_syscall_64+0x5b/0xc0
[51475.690343]  ? ksys_write+0x6b/0xf0
[51475.694128]  ? syscall_exit_to_user_mode+0x27/0x40
[51475.699215]  ? do_syscall_64+0x67/0xc0
[51475.703266]  ? syscall_exit_to_user_mode+0x27/0x40
[51475.708369]  ? do_syscall_64+0x67/0xc0
[51475.712423]  ? do_syscall_64+0x67/0xc0
[51475.716477]  ? syscall_exit_to_user_mode+0x27/0x40
[51475.721570]  ? do_syscall_64+0x67/0xc0
[51475.725616]  ? do_syscall_64+0x67/0xc0
[51475.729650]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[51475.734987] RIP: 0033:0x7ffa28b2a8d7
[51475.738850] RSP: 002b:00007ffc1d642c28 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[51475.746711] RAX: ffffffffffffffda RBX: 000055f7be946c40 RCX: 00007ffa28b2a8d7
[51475.754133] RDX: 0000000000000000 RSI: 000055f7be946d40 RDI: 0000000000000004
[51475.761561] RBP: 000055f7be93ad00 R08: 0000000000000003 R09: 0000000000000000
[51475.768991] R10: f88dd7b6b1ebeedc R11: 0000000000000246 R12: 0000000000000000
[51475.776416] R13: 00007ffc1d642e10 R14: 0000000000000003 R15: 000055f7be946c40
[51475.783850]  </TASK>
[51596.413619] INFO: task rm:307616 blocked for more than 241 seconds.
[51717.245807] INFO: task rm:307616 blocked for more than 362 seconds.
(the same rm call trace repeats verbatim at 241 and 362 seconds; snipped)
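For anyone who wants to watch this happen live, a rough sketch (assuming the pool is named Data and the OpenZFS kstats are mounted in the usual place under /proc/spl):

```shell
# Watch transaction-group progress; a txg sitting in the syncing
# state for minutes lines up with the txg_sync hang above.
watch -n 5 'tail -n 5 /proc/spl/kstat/zfs/Data/txgs'

# Count the hung-task warnings as they pile up in the kernel log
dmesg | grep -c 'blocked for more than'
```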
You get the idea. Once these start appearing, the system eventually crashes without ever freeing the space, and upon reboot all the deleted data is back from the dead, like something out of Groundhog Day.
Some pool info for you:
Code:
zpool status Data
  pool: Data
 state: ONLINE
  scan: scrub in progress since Wed Jan  3 19:55:49 2024
        6.15T / 430T scanned at 122M/s, 5.49T / 430T issued at 109M/s
        0B repaired, 1.28% done, 47 days 07:50:45 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        Data                                      ONLINE       0     0     0
          raidz3-0                                ONLINE       0     0     0
            a5217421-9292-4482-a634-f9bdbf83640f  ONLINE       0     0     0
            036df555-b043-4cd6-b8d7-c8bf1ab68bfa  ONLINE       0     0     0
            661e54d8-b4ea-4dc5-a07a-53ebb484b4cd  ONLINE       0     0     0
            a533bcc7-1c45-4282-b24e-0a8daedc13f9  ONLINE       0     0     0
            faa15b16-9a04-4510-9d87-20723f997b7e  ONLINE       0     0     0
            07e992b0-c90b-4a8b-b75f-389ef1facbe5  ONLINE       0     0     0
            dd2f1c12-d43d-4c13-8e08-c7a6100e5dad  ONLINE       0     0     0
            e6dbad60-ee3c-4c78-8602-715e9a7b065d  ONLINE       0     0     0
            1a0ae9f0-44d9-441d-ad8d-a799157c5305  ONLINE       0     0     0
            b73f6c5c-3a0c-42f6-b717-19d1f432da16  ONLINE       0     0     0
            8fc6ea7f-4b49-45b6-9531-fa43edf9c3f6  ONLINE       0     0     0
            9965c5fe-ade4-4c5c-858f-60cad764c035  ONLINE       0     0     0
            922e7bd7-bfa5-4e2d-99ad-69ce155fd445  ONLINE       0     0     0
            84177458-c1c6-4980-a613-d8ba65b9bd9c  ONLINE       0     0     0
            da75589a-b10f-4fbf-a948-dd34f052a519  ONLINE       0     0     0
            7deb555a-eadb-4b28-9c0c-7474d60352a2  ONLINE       0     0     0
            cd4bbf1f-eb23-4435-8591-dba7fd39c2e0  ONLINE       0     0     0
            b0b91c2d-1889-409f-8939-b88f3cdb491e  ONLINE       0     0     0
            ffc8409c-f867-444e-8c10-63a5440b46a6  ONLINE       0     0     0
            9673ab81-0003-43b0-898f-f0e8f630bfd5  ONLINE       0     0     0
            c330017d-62ce-485d-965f-1dcac77f53eb  ONLINE       0     0     0
            2d6c2228-5031-402f-8557-94a36e791dcf  ONLINE       0     0     0
            efbdae32-a34c-46e8-8801-5d04811f92d7  ONLINE       0     0     0
            5a19087b-7354-4e66-b07e-f298d1c621b2  ONLINE       0     0     0
            fa023577-eb85-44dc-b6c7-fe2c1fb05e72  ONLINE       0     0     0
            23672161-c8e4-4b27-b1b3-ce6506f78bb9  ONLINE       0     0     0
            689c43d7-a9ca-4287-87ce-19750d79c79f  ONLINE       0     0     0
            21c7bb84-a641-4959-9a73-028178e2ab9f  ONLINE       0     0     0
            543addb4-2aa1-4e2a-b8d4-5c4e8a6c60a8  ONLINE       0     0     0
            4feb278b-9af9-493a-a7a3-a2a708d79357  ONLINE       0     0     0
            b31f7595-f3ba-4525-8775-66f600696adb  ONLINE       0     0     0
            74789bb2-a9fd-4fa4-98f1-b33e7594ffb3  ONLINE       0     0     0
            db0fa131-d0aa-4252-af47-accfc25d6dcb  ONLINE       0     0     0
            0cef4b94-4f58-420b-84d7-ac59c5087580  ONLINE       0     0     0
            7a6132c4-c64f-4c16-b148-539effaa6fde  ONLINE       0     0     0
            dbf1d058-15fe-4a92-9903-2f921a63b33e  ONLINE       0     0     0

errors: No known data errors

zpool iostat Data
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Data         421T   103T  16.1K    741   258M  16.9M

zpool list Data
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Data   524T   421T   103T        -         -    22%    80%  1.54x    ONLINE  /mnt
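Since zpool list still shows a 1.54x DEDUP ratio, my understanding is that freeing previously deduped blocks has to go through the dedup table, and with 48GB of RAM against ~421T allocated, the DDT almost certainly doesn't fit in ARC. A sketch of how the table can be sized up (read-only commands, as I understand them):

```shell
# DDT histogram plus estimated in-core size for the pool's dedup table
zpool status -D Data

# More detailed DDT statistics from zdb (can be slow on a pool this size)
zdb -DD Data
```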
The pool is under a little duress at the moment because a scrub is running; I cannot cancel it, as trying to do so hangs the system.
Now for the fun bit. Sometimes I can delete a straight TB and it'll be 100% fine. Other times files look like they've gone, but then comes the dreaded crash and, voila, they're back after the reboot.
I don't really want to nuke the pool, and I have a feeling I wouldn't be able to anyway. I have disabled dedupe and compression going forward, as the pool is now a Restic target for a NAS1-to-NAS2 backup setup (I realised rsync wasn't really a backup).
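For context, the NAS1-to-NAS2 Restic job is roughly this shape (repo path and hostname below are placeholders, not my real ones):

```shell
# NAS1 pushes into a restic repository on this pool over SSH
restic -r sftp:nas2:/mnt/Data/restic-repo backup /mnt/tank/media

# Pruning old snapshots is what deletes pack files in the repo
restic -r sftp:nas2:/mnt/Data/restic-repo forget \
    --keep-daily 7 --keep-weekly 4 --prune
```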
Any thoughts, ideas, or advice for me?