Upgraded to 11.1, scrub destroys performance

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
I recently upgraded our storage servers to 11.1-RELEASE. Everything worked fine until the first of February, when our twice-monthly scrub started. Scrubs now appear to use all available disk I/O and bandwidth regardless of the load on the box. Even logging into the box was sluggish, and commands hung so badly that I eventually had to cancel the scrub completely. I then upgraded to 11.1-U1, which did the same thing.

I couldn't find any other complaints about this so I had to scrounge around for a fix. I ended up setting vfs.zfs.scrub_delay to 4 and that helped a lot.
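
For anyone landing here with the same problem, a minimal sketch of changing the tunable at runtime with sysctl. The helper name set_scrub_delay is made up for illustration, and as far as I know a value set this way does not survive a reboot unless it is also saved as a sysctl tunable.

```shell
# Hypothetical helper (not part of FreeNAS): validate the value, then set
# vfs.zfs.scrub_delay on the running system. Needs root.
set_scrub_delay() {
    delay="$1"
    case "$delay" in
        ''|*[!0-9]*)
            echo "usage: set_scrub_delay <non-negative integer>" >&2
            return 1
            ;;
    esac
    sysctl vfs.zfs.scrub_delay="$delay"
}

# Example (commented out so that sourcing this file changes nothing):
#   sysctl vfs.zfs.scrub_delay     # show the current value
#   set_scrub_delay 4              # the value that helped in this thread
```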

I'm just curious why scrub was changed, or whether it has something to do with my autotune settings (which were generated on install of 9.10) or something else. Also, do I need to change anything else? It was difficult to find anything about slowing DOWN scrubs, since most people want to speed them up.
 

sef

Guest
We ported over a fairly major rewrite of the scan code; this was in 11.0, but there were a couple of bugs that got fixed in 11.1.

The scan code in 11.1 is much faster (at least in most cases). It works by scanning each transaction group, sorting the logical block addresses, and then issuing I/Os for them in that order. Less seeking, faster performance. Unless you happen to do other I/O at the same time, which can result in lots of seeks that destroy the performance of both the scrub and the other I/O. So, we also added back some of the tuning variables (vfs.zfs.scrub_delay being one of them; there's a similar one, resilver_delay, for resilvers). Setting it to 4 is what we would have recommended as the first step.
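
To illustrate why sorting helps, here is a toy sketch, not the actual ZFS scan code. It sums the distance between consecutive block offsets, which stands in for head travel on a spinning disk. The total_seek helper and the example offsets are invented for illustration.

```shell
# Hypothetical helper: read one block offset per line on stdin and print
# the total distance travelled when visiting them in the given order.
total_seek() {
    awk '
        NR > 1 { d = $1 - prev; if (d < 0) d = -d; total += d }
        { prev = $1 }
        END { print total + 0 }
    '
}

# Example (made-up offsets):
#   printf '900\n100\n500\n200\n' | total_seek             # prints 1500
#   printf '900\n100\n500\n200\n' | sort -n | total_seek   # prints 800
```

Competing I/O effectively shuffles the order again, which is where the extra seeks come from.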

You can also pause a scrub now, if you want: zpool scrub -p ${POOL} pauses it, and zpool scrub ${POOL} resumes it.
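
A minimal sketch of wrapping those two commands, for example to pause a scrub during busy hours. The function names are made up for illustration, and the pool name "tank" in the example is a placeholder.

```shell
# Hypothetical wrappers around the zpool commands quoted above. Need root.
pause_scrub() {
    [ -n "$1" ] || { echo "usage: pause_scrub <pool>" >&2; return 1; }
    zpool scrub -p "$1"
}

resume_scrub() {
    [ -n "$1" ] || { echo "usage: resume_scrub <pool>" >&2; return 1; }
    # Starting a scrub on a pool with a paused scrub resumes it.
    zpool scrub "$1"
}

# Example (commented out so that sourcing this file changes nothing):
#   pause_scrub tank
#   resume_scrub tank
```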
 

Meyers

Awesome, thanks for the confirmation.
 

Meyers

sef said: "Setting it to 4 is what we would have recommended as the first step."

I ended up setting vfs.zfs.scrub_delay to 20. This worked well until I upgraded to 11.2. The site is really sensitive to latency, and for some reason the first scrub after that upgrade caused enough latency to effectively take us down, so I set vfs.zfs.scrub_delay to 40, which seems to be OK for us.

Is there anything else I need to be doing here? This doesn't really appear to have affected scrub completion time so I'm happy with it set to 40.

Also, I followed the recommendations of the forums to do a scrub on the 1st and 15th. Could we set our scrubs to only run once a month? To me twice a month seems overkill. My gut says once is plenty but I wanted to verify.
 