Incredibly slow zfs send/receive, system otherwise seems idle, no other tasks or scrubs, what's happening?

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I started a simple backup like I've done many times. Both pools are local, the backup pool is empty, and it's the usual zfs send -R | zfs receive of a current snapshot.

The first 18 TB ran normally, taking just under an hour per TB, with gstat showing the target disks pounding away at around 300 MB/sec of writes. I went away and came back later. It has now been stuck between 22.0 TB and 22.4 TB for about a day, and gstat is showing something like *KB* of writes per second, not hundreds of MB.

Only 2 source disks are being read, out of the 14 or so in the source pool, and they're just 0.3% and 0.5% busy respectively.

The server is otherwise, as far as I can tell, totally idle. zpool status -v shows no issues or errors.

50 minutes for each of the first 18 TB, then suddenly today almost no speed at all.

Where do I begin to diagnose and fix whatever's up?
 
Joined
May 10, 2017
Messages
838
Not a fix, but the same happened to me a couple of times, both during large duplications. After the first time it got stuck I added the resume option on the receive side (-s), so the next time I just aborted and resumed from where it was.
 

Stilez

I'm using the -s option already, so I could. It's literally idle for seconds at a time: nothing zfs-sending or receiving, no other tasks going on, no background maintenance. It's replicated 400 KB in the last hour.

The destination is a non-redundant pool this time, for reasons, and that's OK. It comprises an HDD for data and an SSD for special (metadata). The HDD is a new enterprise drive, and zpool iostat -vw shows latencies of up to 137 seconds. I wonder if that's it. An HDD issue, maybe? Run smartctl to test it out?

Alternatively, do I kill the send/recv, reboot and resume with the token? Did that help you?

What I really want is a way to diagnose what's causing it. It's really strange. If I had a clue what would help for diagnostics, or what was up, I'd file a bug report. But I don't know where to begin.
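
For anyone hitting the same kind of stall, here is a minimal sketch of a first diagnostic pass using the zpool commands already mentioned in this thread. The function name diagnose_pool and the pool name in the usage line are made up for illustration, and it assumes an OpenZFS version whose zpool iostat supports -l (average latencies) as well as the -w histograms used above.

```shell
#!/bin/sh
# Sketch only: show where a stalled zfs receive might be waiting.
# diagnose_pool is a hypothetical helper; pass the destination pool name.

diagnose_pool() {
    pool=$1

    echo "== pool health =="
    zpool status -v "$pool"

    # Average wait/service latency per vdev, 3 samples taken 5 seconds apart.
    # The first sample is the average since boot, so watch the later ones.
    echo "== per-vdev average latency =="
    zpool iostat -v -l "$pool" 5 3

    # Latency histograms per vdev; a single disk with entries in the
    # multi-second buckets stands out against the rest.
    echo "== per-vdev latency histograms =="
    zpool iostat -v -w "$pool"
}

# Usage (pool name is an example): diagnose_pool backup
```

If one vdev shows latencies in seconds while the others sit in the millisecond range, the disk behind it is the natural next thing to check, rather than the send/receive itself.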
 

Stilez

"Yes, and IIRC didn't need to reboot, just aborted the send with CTRL+C and resumed."
I can try, and hope it'll work. I've never used the resume token before, but it can't be that hard. But I wonder what the hell causes it, and how to diagnose it better.
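
Since the resume token comes up here, a rough sketch of how a resume could go, assuming the receive was started with -s as described above. The helper names and the dataset name in the usage line are hypothetical. The token is read from the receive_resume_token property on the partially received dataset and handed to zfs send -t. As far as I know the token only covers the one dataset that was in flight, so after a -R replication the remaining datasets would still need a follow-up send.

```shell
#!/bin/sh
# Sketch only: resume an interrupted `zfs send | zfs receive -s`.
# Function names are hypothetical; pass the dataset that was being received.

# Print the resume token stored on the partially received dataset.
# ZFS prints "-" when there is no resumable state.
get_resume_token() {
    zfs get -H -o value receive_resume_token "$1"
}

# Restart the stream from where it stopped, or say there is nothing to resume.
resume_receive() {
    dest=$1
    token=$(get_resume_token "$dest")
    if [ -z "$token" ] || [ "$token" = "-" ]; then
        echo "no resume token on $dest" >&2
        return 1
    fi
    # Keep -s so the resumed receive can itself be resumed if it stalls again.
    zfs send -t "$token" | zfs receive -s "$dest"
}

# Usage (dataset name is an example): resume_receive backup/data
```

To give up on the partial receive instead, zfs receive -A on the same dataset discards the saved state.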
 

Stilez

The disk fails the long smartctl test shortly after starting, with read errors. I'm contacting the supplier and running badblocks -w -t random to confirm whether it's something the drive's firmware can fix. So far no errors from that program. It's unclear yet whether it's DOA, but at least the issue seems clear.
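
For reference, a sketch of the checks described in this post, wrapped in small helper functions. The function names and the device path in the usage lines are invented for the example. Note that badblocks -w overwrites the entire disk, so the wrapper below refuses to run unless told explicitly to erase; that guard is my own addition, not part of badblocks.

```shell
#!/bin/sh
# Sketch only: SMART checks and a destructive surface test on a suspect disk.
# Helper names and device paths are hypothetical.

# Start the drive's extended self-test; it runs inside the drive in the background.
start_long_test() {
    smartctl -t long "$1"
}

# Show overall health, the attribute table and the self-test log.
show_smart_results() {
    smartctl -H -A -l selftest "$1"
}

# DESTRUCTIVE: write random patterns across the whole disk and read them back.
# Only runs when the second argument is the literal word "erase".
wipe_test() {
    if [ "$2" != "erase" ]; then
        echo "refusing: pass 'erase' as the second argument to wipe $1" >&2
        return 1
    fi
    badblocks -w -t random -s -v "$1"
}

# Usage (device path is an example):
#   start_long_test /dev/da1
#   show_smart_results /dev/da1
#   wipe_test /dev/da1 erase
```

A self-test log entry that ends in a read failure, as happened here, is usually enough on its own to take to the supplier.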
 