FreeNAS server bogs down and crawls

Void64 · Nov 10, 2018

First..

Cisco UCS C240M4
Build FreeNAS-11.1-U6
Platform Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Memory 65387MB

Storage is a standard LSI SAS/RAID controller. No raid in use. w/ Flexcache module. Just using it as an HBA. (mrsas driver)

ZPool config: (12 x 8TB enterprise SAS drives, two raidz2 vdevs in one pool)

Network wise we have two link agg connections. One pair of links are 1Gbps (each) lagg (failover). (intel igb driver)

Second set of links is link agg LACP using Mellanox (mce driver) 25Gbps network adapters to Nexus 93180-EX switches. (50Gpbs port channel). Yes, I know it's probably way overkill, but why not.

First I will state I have this hardware running on other FreeBSD servers (not FreeNAS) and have never had any driver or hardware issue with similar configuration. This seems to be purely random issue.

The issue: At random times the server will just bog down to a crawl. It's so bad that the LACPDU's start timing out on the LACP connection causing the links to flap. When this happens, ALL network connections suffer the same issue. It doesn't seem to be a network issue because there is very little network activity going on and it doesn't matter which port/switch I come in on. If I manage to SSH into the server, it's really sluggish, long delays between keystrokes, etc.

The only kernel messages I see are the "mce" links showing LACP flaps. (because the server is so bogged down it's timing out on the switch side)

- No scrub going on at the time...

More info on the storage side:

root@nas0:~ # zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
freenas-boot 29.5G 4.15G 25.4G - - 14% 1.00x ONLINE -
tank 87T 2.50T 84.5T - 1% 2% 1.00x ONLINE /mnt

- Looking at the zfs history there was nothing going on at any time we experience this. (no snapshots, etc)

Watching "zpool iostat -v tank 1" during the time this is happening shows very little activity at all. (only couple of hundred K of write, very little read)

Simiarly I'm not seeing anything strange (high utilization) from system vmstat, iostat or netstat.

THe problem seems to clear itself up after some time few minutes, hard to say... Cannot really nail down when it starts.

I tried looking for anything, like a cron or run away process while this was happening, but could not see any. Are there any other processes (like some type of house keeping or garbage collection) that runs periodically that may be causing this?

BigDave · Nov 10, 2018

When ever I read about problems like this with EOL/older/used hardware having
random issues that come and go, I tend to think;

POWER SUPPLIES
COMPONENT OVERHEATING

As power supplies age, what was once enough juice to run
your configuration, has now become too weak. Just a thought.

Void64 · Nov 10, 2018

Definately not a power issue. This is a 6 mo old Cisco UCS with dual 1200 watt power supplies powered by two 10kw 208v datacenter PDUs. Only server in the rack this os happening too.

The CIMC has a million sensors in it as well, if it were a power issue, something would of lit up.

I’m beginning to think this may be related to NFSv4 linux clients banging on the box about the same time. Anyone else using heavy NFSv4 with Linux clients? Should be noted that all my linux NFS mounts are over IPv6 only as well...

Elliot Dierksen · Nov 10, 2018

What does the config look like on the switch side? Is it a VPC port channel? If it isn't VPC, does the problem persist when only 1 link is connected? My gut is that there is a difference of opinion between FreeNAS and the switch(es) in question.

Void64 · Nov 10, 2018

Elliot Dierksen said:
What does the config look like on the switch side? Is it a VPC port channel? If it isn't VPC, does the problem persist when only 1 link is connected? My gut is that there is a difference of opinion between FreeNAS and the switch(es) in question.

The Mellanox cards are in a VPC. The Intel cards are just failover teaming. Network seems fine, it totally seems as if the server is spin locking or IO bound, but I dont see it. Reproducing the problem proces elusive. It does seem to be IO related on the Linux clients. Linux is using NFSv4.1 by default whole BSD is still using v3.

A hunch is NFSv4 and locking. I have fallen back to NFSv3 on the linux clients.

Elliot Dierksen · Nov 10, 2018

Void64 said:
Linux is using NFSv4.1 by default whole BSD is still using v3.

I have not taken the NFSv4 plunge yet, but I can't help but wonder if v3 and v4 have problems playing nice together. Nothing to back that up, just a stray thought.

Void64 · Nov 10, 2018

Elliot Dierksen said:
I have not taken the NFSv4 plunge yet, but I can't help but wonder if v3 and v4 have problems playing nice together. Nothing to back that up, just a stray thought.

I meant to say my BSD clients are using v3, while Linux used 4.1 by default. The server side is supporting both depending on what the client is connecting with.

Elliot Dierksen · Nov 10, 2018

Void64 said:
I meant to say my BSD clients are using v3, while Linux used 4.1 by default. The server side is supporting both depending on what the client is connecting with.

That is what I thought you meant. I'll be interested to hear if it works better when all the NFS clients are using v3.

Important Announcement for the TrueNAS Community.

FreeNAS server bogs down and crawls

Void64

Cadet

BigDave

FreeNAS Enthusiast

Void64

Cadet

Elliot Dierksen

Guru

Void64

Cadet

Elliot Dierksen

Guru

Void64

Cadet

Elliot Dierksen

Guru

Similar threads