Hi all,
*UPDATE/READ_ME_FIRST*: This server's hardware profile and pool setup are a "drop-in replacement" for an existing "ZFS replication server" that we ran for 6 years. Its sizing and pool layout fit our needs; on the previous version, with the same usage, we never saw scrub speeds or replication issues like the ones below. Scrubs always ran at between 250 and 489 MB/s.
We have set up a server as follows:
- A PowerEdge R630
- 2x Intel Xeon E5-2695 v4
- 256 GB RAM
We use an internal card for the system (on SSD), and an LSI SAS3008 card to access the drives in a Dell PowerVault MD1400 drive bay.
We use 12x Toshiba MG07SCA12TEY drives (12 TB SAS) in a raidz2 pool.
We use this host to sync remote datasets from other hosts. Syncs run daily, we have a lot of different datasets (roughly 40), and we have been using it for around 8 months without issues.
Currently the pool is about 44% used, with 0% fragmentation according to zpool list.
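For reference, those usage and fragmentation figures come straight from zpool list; a minimal invocation would look like this (the pool name "tank" is an assumption, substitute your own):

```shell
# Show capacity, fragmentation, and health for one pool
# (pool name 'tank' is an assumption; replace with your pool's name).
zpool list -o name,size,alloc,free,cap,frag,health tank
```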
But suddenly scrubs have become extremely slow, and replications fail while a scrub is running.
2 weeks ago I received a lot of replication error statuses. When I tried to relaunch the syncs, I got error messages like this:

* Replication "[...]" failed: client_loop: ssh_packet_write_poll: Connection to [...] port 22: Broken pipe
cannot receive resume stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved. A resuming stream can be generated on
the sending system by running: zfs send -t [...]

I logged into the server via the command line and spotted that the pool was scrubbing, at a really slow pace (40-50 MB/s). I assumed that for some reason the pool did not have enough IO for both the scrub and the replications, so I decided to pause the scrub with "zpool scrub -p".
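For anyone hitting the same thing, the pause/resume cycle I ended up using is just this (the pool name "tank" is a placeholder; use your own):

```shell
# Pause the in-progress scrub so replication gets the pool's IO
# (pool name 'tank' is an assumption; use your own pool name).
zpool scrub -p tank

# ...run the replication jobs, then resume the scrub where it left off.
# Running 'zpool scrub' on a pool with a paused scrub resumes it:
zpool scrub tank

# Check scrub progress and the current scan rate:
zpool status tank
```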
After pausing the scrub, all replication jobs went through without any issue; they retrieved all the remaining snapshots and we were good.
I then unpaused the scrub and it continued, but still at a really slow pace (40-50 MB/s).
A week later, same problem: the same scrub was still ongoing and the replications ended with errors. I paused the scrub again, relaunched my replications, and we were back to running as before. I suspect it may be related to the local snapshots that are taken on this host every weekend on those replicated datasets.
But why is this scrub so slow and taking so much time?
I looked at dmesg: no errors or warnings about anything related to the pool. All drives look OK in their SMART status, no errors spotted on them. I'm running out of ideas and can't see where this could be coming from. If someone has an idea, I would really appreciate it ^^'
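For completeness, the health checks mentioned above were along these lines (the device path and pool name are assumptions; adjust for your system and repeat per drive):

```shell
# Kernel log: look for SCSI/SAS resets, timeouts, or link errors
# that would not show up as ZFS checksum errors.
dmesg | grep -iE 'error|reset|timeout'

# Per-drive SMART health summary (device path '/dev/sda' is an
# assumption; run once per disk in the pool).
smartctl -H /dev/sda

# Per-vdev read/write/checksum error counters and any faulted devices
# (pool name 'tank' is an assumption).
zpool status -v tank
```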
Cheers all