I've been trying to figure this out for a while now and still can't, so any help on where to look next would be appreciated.
Here's the problem statement:
When running a replication task (ssh+netcat) there are long pauses during the transfer. I can see the network spike up to around 100+ MB/s, stay consistent, then drop to nothing. It stays in this paused state for a random amount of time, typically around 30 seconds or more and then wakes up and is transferring at full speed again for all of 5 seconds.
There are no errors, and the CPU on the destination TrueNAS is churning away doing something. When I run iostat on the destination side, I can see all 15 drives are busy at around 80-90% utilization, primarily with writes. When transfer kicks in again for roughly 5 seconds, the utilization on all 15 drives jumps to almost 100% on each drive.
I assume TrueNAS is doing something that I'm not knowledgeable enough to notice/detect... but what is it?? Sometimes it will run for minutes at a time without issue, but mostly it just does what I described above. Running zpool status shows no errors and it's not scrubbing or resilvering... the only indication that it's doing something is the CPU utilization and the iostat is showing all 15 drives churning away on something.
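To put numbers on the pause pattern rather than eyeballing the network graph, here is a minimal sketch. It assumes you have logged one throughput sample per second in MB/s from whatever monitor you use. The function name `find_stalls`, the 1 MB/s idle threshold, and the 5-sample minimum are all hypothetical choices for illustration, and the sample data is made up to mirror the pattern described above, not measured.

```python
from typing import List, Tuple


def find_stalls(
    samples_mb_s: List[float],
    idle_below: float = 1.0,  # assumed threshold: below this counts as "paused"
    min_len: int = 5,         # assumed minimum run length (in samples) to report
) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of at least `min_len`
    consecutive samples whose rate is below `idle_below`."""
    stalls: List[Tuple[int, int]] = []
    run_start = None
    for i, rate in enumerate(samples_mb_s):
        if rate < idle_below:
            # Start a new idle run if we are not already inside one.
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The idle run just ended; keep it only if it was long enough.
            if i - run_start >= min_len:
                stalls.append((run_start, i - run_start))
            run_start = None
    # Handle an idle run that lasts until the end of the log.
    if run_start is not None and len(samples_mb_s) - run_start >= min_len:
        stalls.append((run_start, len(samples_mb_s) - run_start))
    return stalls


if __name__ == "__main__":
    # Made-up samples: 5 s of transfer, 30 s pause, 5 s burst, 40 s pause.
    log = [105.0] * 5 + [0.0] * 30 + [110.0] * 5 + [0.0] * 40
    for start, length in find_stalls(log):
        print(f"stall starting at t={start}s lasting {length}s")
```

Lining up the stall start times against timestamps from iostat on the destination would show whether the pauses begin exactly when the drives hit their busiest periods.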
Here's what is staying the same:
- The ZFS pools are the same since the beginning. One is 15 x 3TB SAS drives, and the other is 15 x 2TB SAS drives.
- ZFS pools are running raidz2
- ZFS pools are sitting at around 58% utilized on the array with 3TB drives (source), and 78% utilized on the array with 2TB drives (destination)
- Both pools are in separate storage shelves - KTN-STL3
- Both pools have a mix of datasets that have some lz4 compression, encryption, and deduplication (dedupe is only covering 2 TB of data)
Here's what I've changed:
- New servers (DL560, DL380, DL360p, SuperMicro 8x????) including virtualizing with Proxmox
- Adjusted memory from 8GB to 24GB
- Tried a replacement KTN-STL3
- Swapped out SAS controllers (SAS2008, SAS2308, SAS2208, and whatever the HP SAS controller is)
- Have tried TrueNAS Core and SCALE
- Replaced all network cables, switches, network cards
- Changed boot device from HDD to SSD to USB