Can't figure out why the scrub is so slow / replication failing during a slow scrub

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
Hi all,


*UPDATE/READ_ME_FIRST*: This server's hardware profile & pool setup is a "drop-in replacement" for an existing "ZFS replication server" that we ran for 6 years. Its sizing & pool setup fit our needs; on the previous version, with our specific usage, we never had scrubbing speed or replication issues like those described below. Scrubs always oscillated between 250 MB/s and 489 MB/s.

We have set up a server as follows:
- A PowerEdge R630
- 2x Intel Xeon E5-2695 v4
- 256 GB RAM

We use an internal card for the system (on SSD), and an LSI SAS3008 card to access the drives in a Dell PowerVault MD1400 drive bay.
We use 12x TOSHIBA MG07SCA12TEY (12 TB SAS drives) in a raidz2 pool.
We use this host to sync remote datasets from other hosts; syncs are made daily. We have a lot of different datasets (roughly 40), and we've been using this setup for around 8 months without issues ...
Currently, the pool is about 44% used, with 0% fragmentation according to zpool list.
But suddenly, it seems that scrubs are getting super slow, and replications are failing during the scrub.
2 weeks ago I received a lot of replication error statuses ... when I tried to relaunch the syncs, I got error messages like this:
* Replication "[...]" failed: client_loop:
ssh_packet_write_poll: Connection to [...] port 22: Broken pipe
cannot receive resume stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved. A resuming stream can be generated on
the sending system by running: zfs send -t [...]
I logged into the server via the command line, and I spotted that the pool was scrubbing, at a really slow pace (40-50 MB/s). I assumed that for some reason the pool did not have enough I/O for both the scrub and the replications ... so I decided to pause the scrub with "zpool scrub -p".
After pausing the scrub, all replication jobs went fine, without any issue; they retrieved all the remaining snapshots and we were good.
I then decided to unpause the scrub, so it continued scrubbing ... but still ... at a really slow pace (40-50 MB/s).
A week later, same problem: the same scrub was still ongoing, and the replications ended with errors ... I paused the scrub again, relaunched my replications, and we are again back to running as before ... I suspect it may be related to the fact that local snapshots are taken on this host every weekend on those replicated datasets ...
But why is this scrub so slow and taking so much time?
I looked at dmesg: no errors/warnings about anything related to the pool; all drives look OK in their SMART status, no errors spotted on them ... I'm running out of ideas and don't see where this could be coming from ... if someone has an idea ... I would really appreciate it ^^'
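For reference, here is roughly what I ran while checking (the pool name "tank" below is just a placeholder, not our real pool name):

zpool status -v tank       # scrub progress/speed and per-disk error counters
zpool list tank            # the capacity and fragmentation figures quoted above
zpool iostat -v tank 5     # per-disk IOPS and bandwidth while the scrub runs
smartctl -a /dev/sdX       # SMART status of each member disk (one per drive)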

Cheers all
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Reading all blocks, calculating parity and checking it is an IOPS-intensive workload, and you're asking a pool with limited IOPS to do that for a fair amount of data (~60 TB).

It gets more and more problematic the closer your pool gets to full.

Pool geometry better matched to your expected performance is the only way to "fix" that.
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
Well, I could understand this argument IF the pool were nearly full ...
We are at 44% ... there are 12 drives to spread the IOPS ... If you have some documentation that would explain why scrubbing would be at 40-50 MB/s ...
Writes and replications are done over the network at around 100-120 MB/s per replication ... which is totally respectable/normal performance ...
Really sorry @sretalla ... but I really do not see this as a viable and logical explanation ...

That reminds me of one other fact that may explain the slowness => the pool and datasets are encrypted ...
If encryption brings scrubbing performance down that much ... while keeping reads and writes at normal levels ... I really think some work/enhancement should be done on scrubbing in an encrypted environment ... OR at least a BIG/HUGE warning should be added at pool creation about the performance drop (or even simply drop encryption support ... because IF encryption is indeed the culprit ... then it seems totally unusable ...)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
there are 12 drives to spread the IOPS
No, there are not.

The IOPS capability of a RAIDZ2 VDEV is the same as the IOPS of one single member disk.
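As a rough back-of-the-envelope comparison (assuming ~200 random IOPS per HDD, purely an illustrative figure):

# 1x 12-wide RAIDZ2 vdev  -> ~200 IOPS for the whole pool
# 2x  6-wide RAIDZ2 vdevs -> ~400 IOPS
# 6x  2-way mirrors       -> ~1200 IOPS (more for reads)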
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Writes and replications are done over the network at around 100-120 MB/s per replication
Throughput and IOPS are completely different things. RAIDZ2 can have fairly high throughput, but cannot perform well on tasks that require high IOPS.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
12-wide RAIDZ2 is wide. Not too wide for comfort, but wide enough that you need to pay attention and make sure you're using large blocks (mostly) and that fragmentation is kept in check.
We are at 44% ... there are 12 drives to spread the IOPS
Wrong: A RAIDZp vdev has approximately the IOPS of a single disk, as all (well, all minus p) disks are needed to read each block.
That reminds me of one other fact that may explain the slowness => the pool and datasets are encrypted ...
Doesn't seem likely, with a pair of Xeon E5s. Slow compression (like gzip or higher levels of zstd) might however cause trouble.
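If you want to check what the datasets are actually set to, something along these lines will show it (the dataset path is a placeholder):

zfs get recordsize,compression,compressratio,encryption tank/dataset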
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
... so 2 things ... :
1- if you reply to my post ... the minimum would be to approve it first ^^'
2- one SAS drive at 12 Gb/s ... giving 40-50 MB/s? Are you sure about that math? Yes, those 12 Gb/s are theoretical ... but even in "real world usage" ... one SATA 2/3 drive can be accessed at 150-200 MB/s ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
12-wide RAIDZ2 is wide. Not too wide for comfort, but wide enough that you need to pay attention and make sure you're using large blocks (mostly) and that fragmentation is kept in check.

Wrong: A RAIDZp vdev has approximately the IOPS of a single disk, as all (well, all minus p) disks are needed to read each block.

Doesn't seem likely, with a pair of Xeon E5s. Slow compression (like gzip or higher levels of zstd) might however cause trouble.
Again ... I was talking about the IOPS being spread ... this is not wrong ... I did not say they are "cumulated" or that they get "summed" in any way ... sorry, but when you write to a RAIDZn pool ... the IOPS are spread across all the drives ...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
one SAS drive at 12 Gb/s ... giving 40-50 MB/s? Are you sure about that math? Yes, those 12 Gb/s are theoretical ... but even in "real world usage" ... one SATA 2/3 drive can be accessed at 150-200 MB/s ...
If you can do 200 IOPS (an average HDD) and each block you're seeking is less than the maximum possible size (not all blocks are 100% full... actually most are not in a pool with lots of small files), then let's say at best, 128K is your block size and on average you're getting 100K of data from each block... 20MB/s (100K x 200).

sorry, but when you write to a RAIDZn pool ... the IOPS are spread across all the drives ...
No need to apologize, but you're wrong.

If you need to read one block from a pool that's one IOP at the pool level, but each disk needs to give an IOP to reach that result, so it cancels out to one IOP being possible even if all disks give one.

You may be misunderstanding that when large and sequential blocks are being requested in a read operation, ZFS is able to request blocks in a way that is optimal and you will often see much more throughput than when doing lots of small reads/writes or IOPS intensive work (like a scrub).
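To make that arithmetic explicit (the numbers are the illustrative ones above, not measurements from your pool):

# 200 IOPS x ~100 KiB effectively read per block
echo $((200 * 100))        # = 20000 KiB/s, i.e. roughly 20 MB/s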
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
If you can do 200 IOPS (an average HDD) and each block you're seeking is less than the maximum possible size (not all blocks are 100% full... actually most are not in a pool with lots of small files), then let's say at best, 128K is your block size and on average you're getting 100K of data from each block... 20MB/s (100K x 200).


No need to apologize, but you're wrong.

If you need to read one block from a pool that's one IOP at the pool level, but each disk needs to give an IOP to reach that result, so it cancels out to one IOP being possible even if all disks give one.

You may be misunderstanding that when large and sequential blocks are being requested in a read operation, ZFS is able to request blocks in a way that is optimal and you will often see much more throughput than when doing lots of small reads/writes or IOPS intensive work (like a scrub).
Oh ... so my other Solaris server ... composed of 3 striped raidz2 vdevs of 12x 4 TB drives (so 3x 12x 4 TB SAS 6 Gb/s), scrubbing at 200 MB/s ... is a miracle? ^^' By the way, that one uses gzip ... but no encryption ...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Are you trying to say that 100 x 200 is 200,000?

I think you'll find it's 20,000.

20MB/s
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Are you trying to say that 100 x 200 is 200,000?

I think you'll find it's 20,000.

20MB/s
OK, I saw your other post now... you're almost off the training wheels and your posts won't need approval soon. That makes more sense.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
so my other Solaris server ... composed of 3 striped raidz2 vdevs of 12x 4 TB drives (so 3x 12x 4 TB SAS 6 Gb/s), scrubbing at 200 MB/s ... is a miracle?
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
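Same back-of-the-envelope math, assuming ~200 IOPS per vdev and the illustrative per-block figures from before:

# 3 vdevs x 200 IOPS x ~100 KiB per block (128K recordsize)  -> ~60 MB/s
echo $((3 * 200 * 100))    # KiB/s
# 3 vdevs x 200 IOPS x ~1000 KiB per block (1M recordsize)   -> ~600 MB/s
echo $((3 * 200 * 1000))   # KiB/s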
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
But let's put the throughput calculations aside ...

Why are the replications going wrong during a scrub? We are used to doing the same thing (replicating production datasets) on the old Solaris server, where we have our own zfs send/receive scripts ... and we never had issues replicating during scrubs ... why does replication fail while a scrub is in progress?
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
3 times the IOPS? On a raid0 (striped)? Connected via SAS 6 Gb/s? hmmm ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
No, you get 3x the IOPS. I don't know what your block size is on that pool, but larger block/record sizes can deliver vastly different results (a 1 MB recordsize is more or less able to get 10x the data with the same number of IOPS if it's dealing with large files).
By the way ... we started with 1x 12x 4 TB drives (one drive bay) and we increased the pool each time it reached 90% use ... and never noticed any increase in scrubbing time ... but anyway ...
 

Ulysse_31

Dabbler
Joined
Aug 22, 2023
Messages
49
By the way ... we started with 1x 12x 4 TB drives (one drive bay) and we increased the pool each time it reached 90% use ... and never noticed any increase in scrubbing time ... but anyway ...
But I have to admit, since we never had the replication errors during scrubs ... well ... we never really looked at it ... ^^"
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
and never noticed any increase in scrubbing time
Were you looking?

composed of 3 striped raidz2 vdevs of 12x 4 TB drives
3 times the IOPS? On a raid0 (striped)? Connected via SAS 6 Gb/s?
I think you're confused. Or are we not talking about the same pool?

why does replication fail while a scrub is in progress?
Don't know. No evidence given so far provides a valid reason other than your pool isn't coping with the workload (which is demanding more IOPS than it has) as far as I can see.
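If you want to gather that evidence, watching the pool while a scrub and a replication run at the same time would show whether the disks are saturated on operations rather than bandwidth; something along these lines (pool name is a placeholder):

zpool iostat -v tank 5     # per-vdev/per-disk operations vs. bandwidth
zpool iostat -l tank 5     # per-disk latencies (OpenZFS; not on old Solaris)
zpool status tank          # scrub rate and estimated time remaining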
 