Can't figure out why the scrub is so slow / replication failing during a slow scrub

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
Hi all,


*UPDATE/READ_ME_FIRST*: This server's hardware profile & pool setup is a "drop-in replacement" for an existing "ZFS replication server" that we ran for 6 years. Its sizing & pool setup fit our needs; on the previous version, with our specific usage, we never had scrubbing speed or replication issues like those described below. Scrubs always oscillated between 250 MB/s and 489 MB/s.

We have set up a server as follows:
- A PowerEdge R630
- 2x Intel Xeon E5-2695 v4
- 256 GB RAM

We use an internal card for the system (on SSD), and an LSI SAS3008 card to access the drives in a Dell PowerVault MD1400 drive bay.
We use 12x TOSHIBA MG07SCA12TEY (12 TB SAS drives) in a raidz2 pool.
We use this host to sync remote datasets from other hosts; syncs are made daily. We have a lot of different datasets (roughly 40), and we've been using this setup for around 8 months without issues ...
Currently, the pool is about 44% used, with 0% fragmentation according to zpool list.
But suddenly, it seems that scrubs are getting super slow, and replications are failing during the scrub.
2 weeks ago I received a lot of replication error statuses ... when I tried to relaunch the syncs, I got error messages like this:
* Replication "[...]" failed: client_loop:
ssh_packet_write_poll: Connection to [...] port 22: Broken pipe
cannot receive resume stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved. A resuming stream can be generated on
the sending system by running: zfs send -t [...]
I logged into the server via the command line, and I spotted that the pool was scrubbing, at a really slow pace (40-50 MB/s). I assumed that for some reason the pool did not have enough I/O for both the scrub and the replications ... so I decided to pause the scrub with "zpool scrub -p".
After pausing the scrub, all replication jobs went fine, without any issue; they retrieved all the remaining snapshots and we were good.
I then decided to unpause the scrub, so it continued scrubbing ... but still ... at a really slow pace (40-50 MB/s).
A week later, same problem: the same scrub was still ongoing, and the replications ended with errors ... I paused the scrub again, relaunched my replications, and we are again back to running as before ... I suspect it may be related to the fact that local snapshots are taken on this host every weekend on those replicated datasets ...
But why is this scrub so slow and taking so much time?
I looked at dmesg: no errors/warnings about anything related to the pool; all drives look OK in their SMART status, no errors spotted on them ... I'm running out of ideas and don't see where this could be coming from ... if someone has an idea ... I would really appreciate it ^^'
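For reference, here is roughly what I ran while checking (the pool name "tank" below is just a placeholder, not our real pool name):

zpool status -v tank       # scrub progress/speed and per-disk error counters
zpool list tank            # the capacity and fragmentation figures quoted above
zpool iostat -v tank 5     # per-disk IOPS and bandwidth while the scrub runs
smartctl -a /dev/sdX       # SMART status of each member disk (one per drive)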

Cheers all
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Reading all blocks, calculating parity and checking it is an IOPS-intensive workload, and you're asking a pool with limited IOPS to do that for a fair amount of data (~60 TB).

It gets more and more problematic the closer your pool gets to full.

Pool geometry better matched to your expected performance is the only way to "fix" that.
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
Well, I could understand this argument IF the pool were nearly full ...
We are at 44% ... there are 12 drives to spread the IOPS ... If you have some documentation that would explain why scrubbing would be at 40-50 MB/s ...
Writes and replications are done over the network at around 100-120 MB/s per replication ... which is totally respectable/normal performance ...
Really sorry @sretalla ... but I really do not see this as a viable and logical explanation ...

That reminds me of one other fact that may explain the slowness => the pool and datasets are encrypted ...
If encryption brings scrubbing performance down that much ... while keeping reads and writes at normal levels ... I really think some work/enhancement should be done on scrubbing in an encrypted environment ... OR at least a BIG/HUGE warning should be added at pool creation about the performance drop (or even simply drop encryption support ... because IF encryption is indeed the culprit ... then it seems totally unusable ...)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
there are 12 drives to spread the IOPS
No, there are not.

The IOPS capability of a RAIDZ2 VDEV is the same as the IOPS of one single member disk.
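As a rough back-of-the-envelope comparison (assuming ~200 random IOPS per HDD, purely an illustrative figure):

# 1x 12-wide RAIDZ2 vdev  -> ~200 IOPS for the whole pool
# 2x  6-wide RAIDZ2 vdevs -> ~400 IOPS
# 6x  2-way mirrors       -> ~1200 IOPS (more for reads)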
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Writes and replications are done over the network at around 100-120 MB/s per replication
Throughput and IOPS are completely different things. RAIDZ2 can have fairly high throughput, but cannot perform well on tasks that require high IOPS.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
12-wide RAIDZ2 is wide. Not too wide for comfort, but wide enough that you need to pay attention and make sure you're using large blocks (mostly) and that fragmentation is kept in check.
We are at 44% ... there are 12 drives to spread the IOPS
Wrong: A RAIDZp vdev has approximately the IOPS of a single disk, as all (well, all minus p) disks are needed to read each block.
That reminds me of one other fact that may explain the slowness => the pool and datasets are encrypted ...
Doesn't seem likely, with a pair of Xeon E5s. Slow compression (like gzip or higher levels of zstd) might however cause trouble.
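If you want to check what the datasets are actually set to, something along these lines will show it (the dataset path is a placeholder):

zfs get recordsize,compression,compressratio,encryption tank/dataset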
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
... so 2 things ... :
1- if you reply to my post ... the minimum would be to approve it first ^^'
2- one SAS drive at 12 Gb/s ... giving 40-50 MB/s? Are you sure about that math? Yes, those 12 Gb/s are theoretical ... but even in "real world usage" ... one SATA 2/3 drive can be accessed at 150-200 MB/s ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
12-wide RAIDZ2 is wide. Not too wide for comfort, but wide enough that you need to pay attention and make sure you're using large blocks (mostly) and that fragmentation is kept in check.

Wrong: A RAIDZp vdev has approximately the IOPS of a single disk, as all (well, all minus p) disks are needed to read each block.

Doesn't seem likely, with a pair of Xeon E5s. Slow compression (like gzip or higher levels of zstd) might however cause trouble.
Again ... I was talking about the IOPS being spread ... this is not wrong ... I did not say they are "cumulated" or that they get "summed" in any way ... sorry, but when you write to a RAIDZn pool ... the IOPS are spread across all the drives ...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
one SAS drive at 12 Gb/s ... giving 40-50 MB/s? Are you sure about that math? Yes, those 12 Gb/s are theoretical ... but even in "real world usage" ... one SATA 2/3 drive can be accessed at 150-200 MB/s ...
If you can do 200 IOPS (an average HDD) and each block you're seeking is less than the maximum possible size (not all blocks are 100% full... actually most are not in a pool with lots of small files), then let's say at best, 128K is your block size and on average you're getting 100K of data from each block... 20MB/s (100K x 200).

sorry, but when you write to a RAIDZn pool ... the IOPS are spread across all the drives ...
No need to apologize, but you're wrong.

If you need to read one block from a pool that's one IOP at the pool level, but each disk needs to give an IOP to reach that result, so it cancels out to one IOP being possible even if all disks give one.

You may be misunderstanding that when large and sequential blocks are being requested in a read operation, ZFS is able to request blocks in a way that is optimal and you will often see much more throughput than when doing lots of small reads/writes or IOPS intensive work (like a scrub).
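To make that arithmetic explicit (the numbers are the illustrative ones above, not measurements from your pool):

# 200 IOPS x ~100 KiB effectively read per block
echo $((200 * 100))        # = 20000 KiB/s, i.e. roughly 20 MB/s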
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
If you can do 200 IOPS (an average HDD) and each block you're seeking is less than the maximum possible size (not all blocks are 100% full... actually most are not in a pool with lots of small files), then let's say at best, 128K is your block size and on average you're getting 100K of data from each block... 20MB/s (100K x 200).


No need to apologize, but you're wrong.

If you need to read one block from a pool that's one IOP at the pool level, but each disk needs to give an IOP to reach that result, so it cancels out to one IOP being possible even if all disks give one.

You may be misunderstanding that when large and sequential blocks are being requested in a read operation, ZFS is able to request blocks in a way that is optimal and you will often see much more throughput than when doing lots of small reads/writes or IOPS intensive work (like a scrub).
Oh ... so my other Solaris server ... composed of 3 striped raidz2 vdevs of 12x 4 TB drives (so 3x 12x 4 TB SAS 6 Gb/s), scrubbing at 200 MB/s ... is a miracle? ^^' By the way, that one uses gzip ... but no encryption ...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Are you trying to say that 100 x 200 is 200,000?

I think you'll find it's 20,000.

20MB/s
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Are you trying to say that 100 x 200 is 200,000?

I think you'll find it's 20,000.

20MB/s
OK, I saw your other post now... you're almost off the training wheels and your posts won't need approval soon. That makes more sense.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
so my other Solaris server ... composed of 3 striped raidz2 vdevs of 12x 4 TB drives (so 3x 12x 4 TB SAS 6 Gb/s), scrubbing at 200 MB/s ... is a miracle?
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
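Same back-of-the-envelope math, assuming ~200 IOPS per vdev and the illustrative per-block figures from before:

# 3 vdevs x 200 IOPS x ~100 KiB per block (128K recordsize)  -> ~60 MB/s
echo $((3 * 200 * 100))    # KiB/s
# 3 vdevs x 200 IOPS x ~1000 KiB per block (1M recordsize)   -> ~600 MB/s
echo $((3 * 200 * 1000))   # KiB/s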
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
But let's put the throughput calculations aside ...

Why are the replications going wrong during a scrub? We are used to doing the same thing (replicating production datasets) on the old Solaris server, where we have our own zfs send/receive scripts ... and we never had issues replicating during scrubs ... why does replication fail while a scrub is in progress?
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
3 times the IOPS? On a raid0 (striped)? Connected via SAS 6 Gb/s? hmmm ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
By the way ... we started with 1x 12x 4 TB drives (one drive bay) and we increased the pool each time it reached 90% use ... and never noticed any increase in scrubbing time ... but anyway ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
By the way ... we started with 1x 12x 4 TB drives (one drive bay) and we increased the pool each time it reached 90% use ... and never noticed any increase in scrubbing time ... but anyway ...
But I have to admit, since we never had the replication errors during scrubs ... well ... we never really looked at it ... ^^"
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
and never noticed any increase in scrubbing time
Were you looking?

composed of 3 striped raidz2 vdevs of 12x 4 TB drives
3 times the IOPS? On a raid0 (striped)? Connected via SAS 6 Gb/s?
I think you're confused. Or are we not talking about the same pool?

why does replication fail while a scrub is in progress?
Don't know. No evidence given so far provides a valid reason other than your pool isn't coping with the workload (which is demanding more IOPS than it has) as far as I can see.
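If you want to gather that evidence, watching the pool while a scrub and a replication run at the same time would show whether the disks are saturated on operations rather than bandwidth; something along these lines (pool name is a placeholder):

zpool iostat -v tank 5     # per-vdev/per-disk operations vs. bandwidth
zpool iostat -l tank 5     # per-disk latencies (OpenZFS; not on old Solaris)
zpool status tank          # scrub rate and estimated time remaining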
 