FreeNAS 11.1 - Very Slow, Resilvering extreme slow etc.


Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
For the past few weeks my FreeNAS box has run very poorly: directories load slowly, browsing the file system over SMB is painful, and my Plex and Emby jails were at times unresponsive; even the FreeNAS GUI would time out and throw errors. I didn't see anything detrimental in the disk performance reporting, latency- or usage/busy-wise.

I did have 1 of 2 disks fail in the mirror running my jails, so I pulled that disk and it is now rebuilding:

Code:
  pool: vol03
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 25 15:31:20 2017
		63.6G scanned at 5.56M/s, 18.2M issued at 1.59K/s, 83.8G total
		0 resilvered, 0.02% done, no estimated completion time
config:

		NAME											STATE	 READ WRITE CKSUM
		vol03										   ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/78ad0495-e9b2-11e7-968a-003048cd885c  ONLINE	   0	 0	 2
			gptid/ece3495b-1649-11e7-84d6-003048cd885c  ONLINE	   0	 0	 0


But look at the speed, 1.59K/s, and it keeps dropping.

gstat is pretty much idle; I can't see a busy disk or anything else that would cause this. Latency across all disks is very low as well.

My hardware

SuperMicro X8DTN+ w/ 2x Intel Xeon Six Core E5645 2.4Ghz and 48GB ECC DDR3
IBM M1015 46M0861 RAID Controller

FreeNAS is installed on 2 mirrored USB sticks; the other volumes are RAIDZ2 and solid, and the jail volume is a mirror.

What am I missing, or what else can I look at? Last night, when I was starting the replacement/resilvering of the drive, the web UI would time out while navigating the system.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
I guess I should expand: I can't even reboot FreeNAS via the WebGUI or CLI, and this has been an issue since before the resilver. I get timeout errors upon issuing a reboot, or even when attempting to load plugins or other areas of the FreeNAS UI.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
This looks like it could be related to https://forums.freenas.org/index.php?threads/problems-after-pool-scrub.60131/unread

which also reports scrubs slowing toward zero.

Yeah, it's lower now; sadly I can't start my jails on this volume either.

Code:
pool: vol03
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 25 15:31:20 2017
		63.6G scanned at 4.48M/s, 136K issued at 9/s, 83.8G total
		0 resilvered, 0.00% done, no estimated completion time
config:

		NAME											STATE	 READ WRITE CKSUM
		vol03										   ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/78ad0495-e9b2-11e7-968a-003048cd885c  ONLINE	   0	 0	 0
			gptid/ece3495b-1649-11e7-84d6-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors



IO is pretty much nonexistent, it seems:

Code:

zpool iostat -v vol03 1

										   capacity	 operations	bandwidth
pool									alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
vol03								   84.3G  32.7G	  0	  0	  0	  0
  mirror								84.3G  32.7G	  0	  0	  0	  0
	gptid/78ad0495-e9b2-11e7-968a-003048cd885c	  -	  -	  0	  0	  0	  0
	gptid/ece3495b-1649-11e7-84d6-003048cd885c	  -	  -	  0	  0	  0	  0
--------------------------------------  -----  -----  -----  -----  -----  -----

										   capacity	 operations	bandwidth
pool									alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
vol03								   84.3G  32.7G	  0	  0	  0	  0
  mirror								84.3G  32.7G	  0	  0	  0	  0
	gptid/78ad0495-e9b2-11e7-968a-003048cd885c	  -	  -	  0	  0	  0	  0
	gptid/ece3495b-1649-11e7-84d6-003048cd885c	  -	  -	  0	  0	  0	  0
--------------------------------------  -----  -----  -----  -----  -----  -----

										   capacity	 operations	bandwidth
pool									alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
vol03								   84.3G  32.7G	  0	  0	  0	  0
  mirror								84.3G  32.7G	  0	  0	  0	  0
	gptid/78ad0495-e9b2-11e7-968a-003048cd885c	  -	  -	  0	  0	  0	  0
	gptid/ece3495b-1649-11e7-84d6-003048cd885c	  -	  -	  0	  0	  0	  0
--------------------------------------  -----  -----  -----  -----  -----  -----



da6 is the existing SSD, the first half of the mirror.
da11 is the new SSD I put in the server yesterday. IO is pretty much nothing while watching this:


Code:
dT: 1.016s  w: 1.000s
 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da0
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da1
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da2
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da3
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da4
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da5
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da6
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da7
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da8
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da9
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da10
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da11
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da12
	0	  0	  0	  0	0.0	  0	  0	0.0	0.0| da13


At one point, while attempting to start the Plex plugin (the jails reside on this volume), the WebGUI became unresponsive after about 30 minutes. I attempted to restart nginx and my SSH session just hung. Attempting to reboot FreeNAS itself via the WebGUI, and even shutdown -r now, causes the system to hang but not actually reboot.

A bit of background on the SSD mirror that I am resilvering now:
One of the SSDs was starting to fail, with high latency and constant disk-busy issues (1k-2k+ ms). Around this point I also noticed that SMB browsing on any of the volumes on my box was taking an absurd amount of time (30s-1min at times for directories to populate). A reboot would resolve this; I wasn't swapping at the time, and the only item that seemed out of place was the failing SSD in the mirror. This mirror only hosts my jails, one for Emby and one for Plex, that's all. FreeNAS itself is installed on 2 mirrored SanDisk Cruzer USB sticks (16GB), labeled as da12 and da13 above.

I pulled the failing SSD about 2 weeks ago, hoping the system would stabilize with that SSD removed and would fall back to the remaining SSD in the mirror, but managing my box still just seemed off and slow.

Last Friday I upgraded to the current build, 11.1, from 11.0-U4. This process took about 4 hours to complete (if not longer).

I did skim /var/log/messages and debug.log, but nothing really jumped out at my untrained eye.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
Code:
last pid: 71989;  load averages:  0.17,  0.14,  0.16								  up 0+06:50:54  22:21:33
86 processes:  1 running, 84 sleeping, 1 zombie
CPU:  0.1% user,  0.0% nice,  0.1% system,  0.0% interrupt, 99.7% idle
Mem: 310M Active, 932M Inact, 8768M Wired, 37G Free
ARC: 6317M Total, 225M MFU, 5940M MRU, 28M Anon, 95M Header, 28M Other
	 6017M Compressed, 22G Uncompressed, 3.76:1 Ratio
Swap: 10G Total, 10G Free

  PID USERNAME	THR PRI NICE   SIZE	RES STATE   C   TIME	WCPU COMMAND
12966 root		  1  20	0   174M   144M select 20   2:12   2.47% smbd
 5720 root		 12  20	0 98144K 20440K nanslp  1   1:44   2.04% collectd
71985 root		  1  20	0 10256K  4944K CPU0	0   0:00   0.16% top
 5646 root		 46  52	0 45976K 25960K uwait   2   0:51   0.16% consul
 4354 root		  1 -52   r0  3584K  3648K nanslp 18   0:06   0.09% watchdogd
 4709 root		  1  20	0  6340K  2160K select 23   0:06   0.06% powerd
12945 root		  1  20	0   128M   101M select  5   0:03   0.06% smbd
  228 root		 24  21	0   215M   149M kqread 10   5:21   0.06% python3.6
31081 root		  1  20	0 13772K  8216K select 16   0:00   0.01% sshd
 8323 root		  6  20	0   184M 62732K select 22   0:09   0.01% python2.7
 7182 root		  6  20	0   188M 60408K select  6   0:09   0.00% python2.7
 4706 root		  1  20	0 12608K 12728K select 12   0:01   0.00% ntpd
 5614 root		  1  20	0   152M   118M kqread  3   0:11   0.00% uwsgi
 9452 root		  9  38	0 31724K 12804K uwait  13   0:00   0.00% consul
 9462 root		 15  20	0   231M   159M umtxn  10   4:58   0.00% uwsgi
12986 root		  1  25	0 92300K 65320K select  6   3:28   0.00% winbindd
 5618 root		  1  52	0   106M 88376K select 17   0:52   0.00% python3.6
13954 root		  1  20	0   175M   147M select  9   0:37   0.00% smbd
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I was also encountering trouble during a scrub, though the scrub appeared to be proceeding quite a bit faster than you're reporting. Sharing services would become unresponsive, I wasn't able to stop the scrub ( zpool scrub -s tank didn't return to a command prompt after two hours), and reboots would fail as well--I had to do a hard power off. Reported https://redmine.ixsystems.com/issues/27437 and reverted to 11.0-U4 for the time being.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
I was also encountering trouble during a scrub, though the scrub appeared to be proceeding quite a bit faster than you're reporting. Sharing services would become unresponsive, I wasn't able to stop the scrub ( zpool scrub -s tank didn't return to a command prompt after two hours), and reboots would fail as well--I had to do a hard power off. Reported https://redmine.ixsystems.com/issues/27437 and reverted to 11.0-U4 for the time being.


I'm attempting this now. I first attempted to change the boot environment back to 11.0-U4 from the GUI and got a timeout, yet the GUI reported 11.0-U4 would be active on reboot. The reboot failed and required a hard power down, and on boot I was still on 11.1.

I reviewed the GUI and it stated 11.0-U4 would load on reboot, so I am attempting another reboot now. Thus far a soft power down is still not working. Thank you for the heads up; from the sounds of it, once I get it to boot 11.0-U4 I should be good.
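
For what it's worth, if the GUI keeps timing out, the active boot environment can also be switched from the shell with beadm (assuming it's present on 11.1, as on stock FreeBSD). A rough sketch; the BE name below is just an example, use whatever beadm list actually shows on your box:

Code:
# List the boot environments, then mark the 11.0-U4 one active for the next boot.
# "11.0-U4" is a placeholder name; copy it exactly from the 'beadm list' output.
beadm list
beadm activate 11.0-U4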
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
After 2 reboots it still wanted to go to 11.1, so I just chose the boot manager from the console at boot and selected 11.0-U4.
I got a kernel panic on the first boot; on the second boot all appears to be well (I have yet to dig into it), but zpool status shows:


Code:
  pool: vol03
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 25 15:31:20 2017
		33.9G scanned out of 83.8G at 112M/s, 0h7m to go
		33.9G resilvered, 40.43% done
config:

		NAME											STATE	 READ WRITE CKSUM
		vol03										   ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/78ad0495-e9b2-11e7-968a-003048cd885c  ONLINE	   0	 0	 0  (resilvering)
			gptid/ece3495b-1649-11e7-84d6-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors
root@trump:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h14m with 0 errors on Thu Dec 21 04:00:20 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		freenas-boot									ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/52a06777-188e-11e6-b700-003048cd885c  ONLINE	   0	 0	 0
			gptid/20b73088-1893-11e6-8f2c-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: vol01
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 14h31m with 0 errors on Sun Nov 19 14:31:47 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		vol01										   ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/bf2fea4c-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0
			gptid/bfe5d67b-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0
			gptid/c09a3ca8-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0
			gptid/c1514137-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0
			gptid/c2058842-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0
			gptid/c2b2ff43-c3be-11e5-ba57-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: vol02
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 8h38m with 0 errors on Sun Dec  3 08:38:22 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		vol02										   ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/2d76670d-5217-11e7-84c9-003048cd885c  ONLINE	   0	 0	 0
			gptid/1b3155d6-1a4c-11e7-854b-003048cd885c  ONLINE	   0	 0	 0
			gptid/1bbaa125-1a4c-11e7-854b-003048cd885c  ONLINE	   0	 0	 0
			gptid/1c2389f5-1a4c-11e7-854b-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: vol03
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 25 15:31:20 2017
		34.5G scanned out of 83.8G at 112M/s, 0h7m to go
		34.5G resilvered, 41.20% done
config:

		NAME											STATE	 READ WRITE CKSUM
		vol03										   ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/78ad0495-e9b2-11e7-968a-003048cd885c  ONLINE	   0	 0	 0  (resilvering)
			gptid/ece3495b-1649-11e7-84d6-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors



Heck, even in the time it took to write this out, it's now at 68% resilvered. Both the Plex and Emby plugins are functional again, etc. This looks to be an issue with 11.1.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
Code:
  pool: vol03
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: resilvered 83.6G in 47h11m with 0 errors on Wed Dec 27 14:43:18 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		vol03										   ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/78ad0495-e9b2-11e7-968a-003048cd885c  ONLINE	   0	 0	 0
			gptid/ece3495b-1649-11e7-84d6-003048cd885c  ONLINE	   0	 0	 0

errors: No known data errors
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There are a few things to address here:

The new scrub/resilver code first reads metadata into a buffer in RAM so that data reads can then be issued as large contiguous blocks. Only this last part is accounted for in the completion percentage, so it'll start off slow, especially because the buffer is large and filled with metadata that is slow to read.
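
If I'm reading the new zpool status output right, you can watch the two phases separately: "scanned" tracks the metadata/sorting pass, while "issued" tracks the actual data reads that drive the completion percentage. A rough loop like this (pool name taken from the output earlier in the thread) makes a stall in the issue phase obvious:

Code:
# Poll the scan line once a minute; "scanned" climbing while "issued" stays
# near zero means the pool is still stuck in the metadata pass.
while sleep 60; do
    zpool status vol03 | grep -E 'scanned|issued|done'
done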

I wasn't able to stop the scrub ( zpool scrub -s tank didn't return to a command prompt after two hours), and reboots would fail as well--I had to do a hard power off.
This is troubling. I imagine it could be RAM exhaustion, which might benefit from better tuning of the scrub buffer size.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
On my system?
Seems crazy, doesn't it? I guess it might happen if the buffer is accidentally set to something absurd.

It's the only immediate suspect I have. Let's see what the devs come up with.

If you can, try having a terminal open with vmstat -w 5 and start a scrub. That should give a general idea of what's going on.
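
A rough way to capture that, plus the ARC size and scan progress alongside it, might be something like the sketch below; the pool name and log paths are just placeholders:

Code:
# Log paging/memory activity plus ARC size and scan progress while a scrub runs.
vmstat -w 5 > /tmp/vmstat-scrub.log &
zpool scrub vol03
while sleep 60; do
    date
    sysctl kstat.zfs.misc.arcstats.size
    zpool status vol03 | grep -E 'scanned|done'
done >> /tmp/scrub-progress.log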
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
Seems crazy, doesn't it? I guess it might happen if the buffer is accidentally set to something absurd.

It's the only immediate suspect I have. Let's see what the devs come up with.

If you can, try having a terminal open with vmstat -w 5 and start a scrub. That should give a general idea of what's going on.

In my case I was resilvering a mirror after replacing a failed drive. I let it sit for nearly 2 days before reverting back to 11.0-U4 this afternoon, and the status was still 0.00% complete with no ETA.
I know scrubs and/or resilvering/initializing an array can take several hours, at times days. But on day 2, with zpool status still at 0.00% on only ~83 GB of SSD storage, plus not being able to start plugins, WebGUI timeouts, and other abnormalities, I had to pull the trigger and get my box back to a functional state. =/

I did get it rolled back to 11.0-U4, and the resilver then took place and finished in 2-3 hours. I'd have to upgrade to 11.1 again and run a scrub on that volume if you wanted to see my stats as well.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
In my case I was resilvering a mirror after replacing a failed drive. I let it sit for nearly 2 days before reverting back to 11.0-U4 this afternoon, and the status was still 0.00% complete with no ETA.
I know scrubs and/or resilvering/initializing an array can take several hours, at times days. But on day 2, with zpool status still at 0.00% on only ~83 GB of SSD storage, plus not being able to start plugins, WebGUI timeouts, and other abnormalities, I had to pull the trigger and get my box back to a functional state. =/
Yes, that is indeed a problem and definitely not expected.
 

Xelas

Explorer
Joined
Sep 10, 2013
Messages
97
This is troubling news. I've replaced 3 of my 6 x 3TB WD Red drives over the last 3 years, and the last one was just 2-3 months ago while on 11.0. Looks like I'm screwed if a drive fails before this is fixed? The resilver took, IIRC, about 10 hours or so, with the volume being about 50% full and some of it VERY fragmented. For the first 1-2 hours it barely crawled along at 10-20MB/s, and then it would pick up and fly along at 250-300MB/s for the next 6-9 hours or so.
With SSDs, especially in a simple mirror, I'd expect a resilver to just FLY.

Eric - are you truly still running FreeNAS 9.10, or have you just not updated your signature lately?

EDIT: Palmore, danb35, what do you have set for "Resilver Priority"? This is a new option in 11.1.
Also, do you still have autotune settings from old installs enabled? I've found that disabling them (from when they were enabled back in the FreeNAS 9.x days) actually leads to better performance on my system, but, of course, YMMV.
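
If anyone wants to check whether stale autotune values are still in effect without digging through the GUI, a couple of the usual suspects can be read straight from sysctl. The names below are examples of tunables autotune has historically set; System -> Tunables in the GUI is the authoritative list:

Code:
# Values autotune commonly pinned back in the 9.x days; compare against System -> Tunables.
sysctl vfs.zfs.arc_max vfs.zfs.arc_min vm.kmem_size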
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
danb35, what do you have set for "Resilver Priority"?
I hadn't touched that setting, so it should be whatever the default is. I've reverted to 11.0-U4 for the time being though.
do you still have autotune settings from old installs enabled?
I hadn't used autotune.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, the Resilver Priority thing could also have some weird oversight causing trouble...
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
I did not adjust or touch the Resilver Priority either; I did see the tab, but have not clicked it as of yet, lol. I was more worried about the WebGUI timeouts and didn't want to land myself somewhere that'd put me in a bad spot. I too rolled back to 11.0-U4. The resilver went lightning fast at that point.

As for autotune, I'll be honest, I don't recall enabling it (which doesn't mean I didn't run it some night half asleep); my system was originally built on 9.10-STABLE and upgraded from there.
 

Palmore

Dabbler
Joined
Dec 26, 2017
Messages
11
I just recalled your comment about fragmentation, but I can't remember whether this value is free-space fragmentation:

Code:
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot							14.5G  7.68G  6.82G		 -	  -	52%  1.00x  ONLINE  -
  mirror								14.5G  7.68G  6.82G		 -	  -	52%
	gptid/52a06777-188e-11e6-b700-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/20b73088-1893-11e6-8f2c-003048cd885c	  -	  -	  -		 -	  -	  -
vol01								   27.2T  23.3T  3.96T		 -	20%	85%  1.00x  ONLINE  /mnt
  raidz2								27.2T  23.3T  3.96T		 -	20%	85%
	gptid/bf2fea4c-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/bfe5d67b-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/c09a3ca8-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/c1514137-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/c2058842-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/c2b2ff43-c3be-11e5-ba57-003048cd885c	  -	  -	  -		 -	  -	  -
vol02								   18.1T  10.6T  7.56T		 -	27%	58%  1.00x  ONLINE  /mnt
  raidz1								18.1T  10.6T  7.56T		 -	27%	58%
	gptid/2d76670d-5217-11e7-84c9-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/1b3155d6-1a4c-11e7-854b-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/1bbaa125-1a4c-11e7-854b-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/1c2389f5-1a4c-11e7-854b-003048cd885c	  -	  -	  -		 -	  -	  -
vol03									117G  84.6G  32.4G		 -	77%	72%  1.00x  ONLINE  /mnt
  mirror								 117G  84.6G  32.4G		 -	77%	72%
	gptid/78ad0495-e9b2-11e7-968a-003048cd885c	  -	  -	  -		 -	  -	  -
	gptid/ece3495b-1649-11e7-84d6-003048cd885c	  -	  -	  -		 -	  -	  -
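
For reference, as I understand it the FRAG column in zpool list reports fragmentation of the pool's remaining free space, not file-level fragmentation (see zpool(8)). It can be pulled on its own with something like:

Code:
# Query just the space/fragmentation properties for the SSD mirror.
zpool list -o name,size,allocated,free,fragmentation,capacity vol03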
 