System almost hangs during scrub or resilver

Status
Not open for further replies.

pbo10

Cadet
Joined
Nov 1, 2018
Messages
6
Hi

Sorry for the long post, but I wanted to include as much information as possible. I'm new to FreeNAS and have only just finished building my first system. I'm currently testing everything, and almost everything works well, but something doesn't seem right under heavy workloads: it's very easy to overload the system, and I'm not sure why. File transfer speeds are good, but if I run a scrub on the main pool, for example, the system almost completely hangs: the web GUI becomes unresponsive, the SMB shares drop off, and the single jail I have set up, which runs a MySQL server, stops responding.

I understand that a scrub puts a lot of stress on the system, but surely not to the point that everything else stops working altogether. The scrub does keep running in the background, and if I wait a few hours everything comes back to life.

The specs of the system are:
Intel Xeon E3 1225
32GB ECC Memory
Booting from a 120GB SSD

Pool 1
Mirror 2x 500GB SSD

Pool 2
RAID-Z2 with 8 disks, currently a mix of 1TB and 8TB drives (only because I was testing with the drives I had available; I plan to switch them all to 8TB later)

Pool 1 is not doing much at this point. It's only there because I already had the drives spare and thought I might as well use them to run jails and back up some important files from my main pool. The issue happens when I run a scrub on the main pool (pool 2), which is fairly small for now: because of the 1TB drives in it, its total capacity is only 5TB, and I've loaded about 2.5TB of data onto it so far.

The scrub itself doesn't take too long: around 2 hours with the amount of data on there now, which doesn't seem too bad. But am I crazy to expect the rest of the system to keep running? Surely everything shouldn't grind to a halt every time a scrub runs?

Since this is a new setup for me, I still have it attached to a monitor, and I noticed some error messages on the screen regarding the UI, which you can see below. They start about 3 minutes after the scrub starts. I don't know whether these errors are relevant, or whether they only appear because the system is running so slowly that other system processes time out and fail.

Is there something obvious I'm missing, or what should I be looking at to troubleshoot this? One thing I have noticed is that even now, while idle, memory usage is at 30GB and 1GB of swap is in use. Is that normal for a system doing almost nothing?

Code:
Nov 15 19:40:46 freenas uwsgi: [middleware.notifier:177] Popen()ing: zpool scrub HDD
Nov 15 19:43:36 freenas uwsgi: [raven.base.Client:265] Configuring Raven for host: https://sentry.ixsystems.com
Nov 15 19:43:46 freenas uwsgi: [freeadmin.views:210] UI crash exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py", line 42, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 244, in _legacy_get_response
    response = middleware_method(request)
  File "./freenasUI/freeadmin/middleware.py", line 296, in process_request
    user = AuthTokenBackend().authenticate(header[len(self.HEADER_PREFIX):])
  File "./freenasUI/middleware/auth.py", line 8, in authenticate
    with client as c:
  File "./freenasUI/middleware/client.py", line 20, in __enter__
    local.client = Client()
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 320, in __init__
    raise e
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 316, in __init__
    raise ClientException('Failed connection handshake')
middlewared.client.client.ClientException: Failed connection handshake
Nov 15 19:43:48 freenas uwsgi: [middlewared.logger.CrashReporting:102] Sending a crash report...
Nov 15 19:43:52 freenas uwsgi: [raven.base.Client:716] Sending message of length 4230 to https://sentry.ixsystems.com/api/2/store/
Nov 15 19:52:04 freenas uwsgi: [raven.base.Client:265] Configuring Raven for host: https://sentry.ixsystems.com
Nov 15 19:52:06 freenas uwsgi: [freeadmin.views:210] UI crash exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py", line 42, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 244, in _legacy_get_response
    response = middleware_method(request)
  File "./freenasUI/freeadmin/middleware.py", line 296, in process_request
    user = AuthTokenBackend().authenticate(header[len(self.HEADER_PREFIX):])
  File "./freenasUI/middleware/auth.py", line 8, in authenticate
    with client as c:
  File "./freenasUI/middleware/client.py", line 20, in __enter__
    local.client = Client()
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 320, in __init__
    raise e
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 313, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 170, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.6/site-packages/ws4py/client/__init__.py", line 215, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
While I don't have the issue you describe of the system becoming unresponsive (or at least noticeably lagging), I do set vfs.zfs.scrub_delay=50 as a sysctl tunable. There is a corresponding one for resilver as well. I don't remember the default value.

The idea is that the value is the number of ticks to wait between scrub or resilver I/Os when the pool is not idle.

I suggest you try it. Note that the trade-off for less interference with normal pool operation is that the scrub or resilver will take longer to complete. You can read about these tunables here: https://www.freebsd.org/doc/handbook/zfs-advanced.html
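
To put a rough number on that trade-off, here is a small Python sketch of the arithmetic. It assumes kern.hz is at the usual FreeBSD default of 1000 ticks per second, and it compares the suggested value of 50 against 4, which is the scrub_delay default as I recall it from the FreeBSD Handbook (check your own system with `sysctl vfs.zfs.scrub_delay`). The function name is made up for illustration, and the result is only an approximate cap that applies while the pool is busy with other I/O.

```python
# Rough estimate of the scrub I/O cap implied by vfs.zfs.scrub_delay.
# Assumption: kern.hz is 1000 ticks per second (verify with `sysctl kern.hz`).

def scrub_iops_cap(scrub_delay_ticks: int, kern_hz: int = 1000) -> float:
    """Return the approximate max scrub I/Os per second while the pool is busy."""
    if scrub_delay_ticks <= 0:
        # A delay of 0 means no throttling, so there is no cap.
        return float("inf")
    # One I/O is allowed every `scrub_delay_ticks` ticks.
    return kern_hz / scrub_delay_ticks


if __name__ == "__main__":
    for delay in (4, 50):
        print(f"scrub_delay={delay}: about {scrub_iops_cap(delay):.0f} IOPS")
```

Under those assumptions a delay of 4 allows roughly 250 scrub I/Os per second and a delay of 50 roughly 20, which is why the scrub is much gentler on other workloads but takes longer to finish.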
 