Bad performance during the night and morning

Status
Not open for further replies.

rwesterh

Cadet
Joined
Jul 7, 2014
Messages
6
Good morning!

I'm having an interesting issue with the FreeNAS box at work.
During the day it performs wonderfully: 50 users can read/write from the CIFS shares, 2 VMs run snugly on the (dedicated) NFS share, and everything is nice and well.
However, during the night (it starts at about 4:20) performance suddenly plummets.
It takes 30 seconds to browse directories, opening a file takes about a minute, and the VMs tend to crash or stop responding.
Then, at about 10:00, this is over and everything runs ever so smoothly.

My problem is, I can't find out why this is happening.

A bit about my configuration:
I have a Dell 2950 (Intel Xeon E5450, 32 GB ECC RAM) with an MD1000 attached to it.
The MD1000 is split into two pools:
4x 15k SAS disks, 300 GB each, ZFS mirrored. This one is dedicated to running two low-usage VMs.
11x 2 TB Dell Enterprise SATA disks, 7200 rpm, in a RAID-Z3 pool. This one hosts all user data, shares, etc. and is the one slowing down.

During the slow period, I can see (in top) the smbd processes stuck in a ZIO->IO state (the column is cut off and I couldn't find out how to widen it. I haven't been working with BSD-like systems that long.)

I hope someone can give me some advice.

Edit:
I forgot to mention: I'm using FreeNAS-9.2.1.6, but this has been happening since the 9.2.1.1 release. The latest version is a fresh install on a new USB stick, because I thought the old one was the issue.

Also, here is a pretty graph of what happened last night, when I had a cron job write 5 GB via dd and measured the time it took.
https://cdn.fland.re/public.php?service=files&t=00cd4a853b139b3c1d96800b96d5adcf
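
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a timed write. It is not the cron job used above; the function name `timed_write`, the block size, and the demo file name are assumptions for illustration. Like dd from /dev/zero, it writes zeros, so on a dataset with compression enabled the numbers may look better than real traffic would.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size=1024 * 1024):
    """Write total_bytes of zeros to path and return the elapsed seconds."""
    block = b"\0" * block_size
    remaining = total_bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = block[:min(block_size, remaining)]
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    return time.monotonic() - start


if __name__ == "__main__":
    # Small demo size in a temporary directory; the test above used 5 GB.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "ddtest.bin")  # hypothetical file name
        secs = timed_write(target, 8 * 1024 * 1024)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("{} wrote 8 MiB in {:.3f}s".format(stamp, secs))
```

Run from cron every few minutes with the output appended to a log, this would give the same kind of time-versus-duration curve as the graph.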

Greetings from the Netherlands,
Rene
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Haha. Your graph is drawn on a whiteboard! I can totally see that on a screen too.

Anyway, are you doing scrubs nightly or have high network traffic? This sounds like you've scheduled some kind of maintenance that is causing the problems and as soon as the maintenance is over it's back to normal.
 

rwesterh

Cadet
Joined
Jul 7, 2014
Messages
6
Hey cyberjock, thanks for the response.

As far as I can tell there are no jobs running at that particular time. There is a scrub scheduled for every 30 days on a Sunday, but that's it.
However, there are a bunch of servers starting their backup jobs at that time. But when I disable those jobs, the problem still happens.
This has been happening for a few months now, and it's becoming a bit of an issue with the coworkers.

Is there anything I can use to see what runs at 4:00? I don't think dumping the output of ps -aux and reading through it manually would be helpful.
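
One way to narrow down "what runs at 4:00" is to filter crontab entries by their hour field. The sketch below is a rough illustration, not a full cron parser: it only understands `*`, single values, ranges, lists and `/step` in the hour column, and the function names and sample script paths are made up. It also only covers cron; services that keep their own schedule will not show up here.

```python
def hour_matches(field, hour):
    """Return True if a crontab hour field such as '*', '4', '2-6', '*/4'
    or '1,4' includes the given hour (0-23)."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            lo, hi = 0, 23
        elif "-" in rng:
            lo, hi = (int(x) for x in rng.split("-"))
        else:
            lo = hi = int(rng)
        if lo <= hour <= hi and (hour - lo) % step == 0:
            return True
    return False


def jobs_at_hour(crontab_text, hour):
    """Return the crontab lines whose hour field includes `hour`."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # e.g. environment lines such as SHELL=/bin/sh
        try:
            if hour_matches(fields[1], hour):
                hits.append(line)
        except ValueError:
            continue  # not a plain schedule line (e.g. an @daily shortcut)
    return hits


if __name__ == "__main__":
    # Hypothetical system-crontab content, for illustration only.
    sample = """SHELL=/bin/sh
0 4 * * * root /usr/local/bin/nightly.sh
30 */6 * * * root /usr/local/bin/every6.sh
15 2-5 * * 7 root /usr/local/bin/weekly.sh
"""
    for job in jobs_at_hour(sample, 4):
        print(job)
```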
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you can do "systat -if" to see what is going on with network traffic. You can also look at top to see if a particular service is actually doing a bunch of work.

Other than that it's a needle in a haystack and you'll just need to start ruling things out until you can identify the cause. :/
 

rwesterh

Cadet
Joined
Jul 7, 2014
Messages
6
Sounds like I have to stay up for a night then, haha.
At the moment (11:21) there is nothing interesting. The reporting screen in the web UI shows no high load either; in fact, traffic during the night is lower than usual.
I understand it's tough to figure out, that's why I ended up asking here.

Top isn't showing anything useful either. It's relatively quiet and it seems the server is, like, waiting for something to happen.
When people come in, the smbd processes are stuck in either 'select' or 'zio->i' states until they suddenly get served again, and it kinda worries me.
Any idea what that zio->i could be?

Edit: I upgraded to the new version last Thursday. It went fine for a few days, then it started again.
Could it be some cache filling up or something?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
zio->i could be basically anything. That just tells you there is ZFS activity (duh!).

As for if it could be cache, the list of things it could be is probably like 10 orders of magnitude longer than the list of things it couldn't be. Narrowing down the problem when it is actually happening is your key to success.
 

rwesterh

Cadet
Joined
Jul 7, 2014
Messages
6
I assumed it would be ZFS activity, but what worries me is that it stays that way for ages.
But I gather I can't be helped for now. *saddened sigh*
I'll leave a monitor open for tomorrow morning with disk I/O, CPU info, network I/O and memory data and see what happens there.
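
Instead of watching a monitor overnight, the same data could be captured to a file for reading afterwards. This is only a sketch: the function name `snapshot` and the log format are assumptions, and which commands to pass in (for example top in batch mode or zpool iostat) is left to the reader.

```python
import subprocess
import time


def snapshot(commands, log_path):
    """Run each command once and append its timestamped output to log_path.

    `commands` is a list of argument lists, e.g. [["zpool", "iostat"]].
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        for cmd in commands:
            label = " ".join(cmd)
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                )
            except OSError as exc:
                # Command missing or not executable: note it and carry on.
                log.write("=== {} {} (failed: {})\n".format(stamp, label, exc))
                continue
            log.write("=== {} {} (exit {})\n".format(stamp, label, result.returncode))
            log.write(result.stdout)
            log.write("\n")
```

Calling `snapshot` once a minute (from a loop with `time.sleep(60)`, or from cron) across the 4:00 to 10:00 window would leave a log that can be lined up against the slowdown.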

Thank you for the quick support!
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
If you go to the reporting tab in the UI, you can scroll the graphs back to the time range when the problem occurs. I would look at CPU, network and disk activity.

The IO-ZIO state would imply that smbd is waiting for the ZFS layer to do something, but that's just a symptom.

A 2950 isn't new, but it should perform well enough.
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
Which HBA do you use?
And is it really an HBA, or do you use a Dell PERC and export each disk manually?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Could there be something running on one of your Windows clients that is causing the activity? A scan, backup, indexing, etc?
 

rwesterh

Cadet
Joined
Jul 7, 2014
Messages
6
Good question about the PERC controller, but it's a SAS pass-through, no RAID whatsoever.

But I fixed it!
And I'm ashamed to say I overlooked it this whole time: there was a Long SMART test scheduled. At 4 am. Apparently that takes a long time on 15 disks.
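
For anyone landing here with the same symptoms: smartd schedules self-tests with a `-s` regular expression that is matched against strings of the form T/MM/DD/d/HH (test type, month, day of month, weekday with 1 = Monday, hour). The sketch below checks which test types such an expression would trigger at a given time. The schedule shown is a made-up example, not the one from this server, and requiring the whole string to match is an assumption of this sketch.

```python
import re
from datetime import datetime


def smartd_tests_due(schedule_regex, when):
    """Return the test type letters (L, S, C, O) whose T/MM/DD/d/HH string
    matches schedule_regex at the datetime `when`."""
    due = []
    for test_type in "LSCO":  # Long, Short, Conveyance, Offline
        stamp = "{}/{:02d}/{:02d}/{}/{:02d}".format(
            test_type, when.month, when.day, when.isoweekday(), when.hour
        )
        # Assumption: the expression has to cover the whole string.
        if re.fullmatch(schedule_regex, stamp):
            due.append(test_type)
    return due


if __name__ == "__main__":
    # Hypothetical schedule: long test every Sunday at 04:00.
    schedule = "L/../../7/04"
    print(smartd_tests_due(schedule, datetime(2014, 7, 13, 4, 0)))
    print(smartd_tests_due(schedule, datetime(2014, 7, 13, 10, 0)))
```

Looping this over the 24 hours of a day gives a quick list of when long tests would start, which is easier to compare against a slowdown window than reading the expression by eye.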

Thanks everyone for supporting me, keep up the good work. Coffee and cookies for all!
 