Void64
First, the hardware:
Cisco UCS C240M4
Build FreeNAS-11.1-U6
Platform Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Memory 65387MB
Storage is a standard LSI SAS/RAID controller with the Flexcache module (mrsas driver). No RAID in use; it's just acting as an HBA.
ZPool config: (12 x 8TB enterprise SAS drives, two raidz2 vdevs in one pool)
Network-wise, we have two link-aggregation connections. The first pair of links is a 1Gbps (each) lagg in failover mode (Intel igb driver).
The second set of links is an LACP lagg using Mellanox 25Gbps adapters (mce driver) to Nexus 93180-EX switches, for a 50Gbps port channel. Yes, I know it's probably way overkill, but why not.
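For reference, on a plain FreeBSD box this layout would look roughly like the rc.conf sketch below; the interface names and addresses are placeholders, and on FreeNAS the laggs are of course configured through the GUI/config database rather than rc.conf:

ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_mce0="up"
ifconfig_mce1="up"
cloned_interfaces="lagg0 lagg1"
# 2 x 1Gbps failover lagg on the Intel ports
ifconfig_lagg0="laggproto failover laggport igb0 laggport igb1 192.0.2.10/24"
# 2 x 25Gbps LACP lagg on the Mellanox ports (the 50Gbps port channel)
ifconfig_lagg1="laggproto lacp laggport mce0 laggport mce1 198.51.100.10/24"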
First, I will state that I have this hardware running on other FreeBSD servers (not FreeNAS) in a similar configuration and have never had a driver or hardware issue. This seems to be a purely random problem.
The issue: at random times the server bogs down to a crawl. It gets so bad that LACPDUs start timing out on the LACP connection, causing the links to flap. When this happens, ALL network connections suffer equally. It doesn't seem to be a network issue: there is very little network activity at the time, and it doesn't matter which port/switch I come in on. If I manage to SSH into the server, it's really sluggish, with long delays between keystrokes, etc.
The only kernel messages I see are the "mce" links showing LACP flaps (because the server is so bogged down that it's timing out on the switch side).
- No scrub going on at the time...
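In case it helps: next time it happens I'll try to capture kernel-thread, interrupt, and LACP state with the stock FreeBSD tools, something like:

root@nas0:~ # top -SHIz      # system/kernel threads, idle hidden
root@nas0:~ # vmstat -i      # per-device interrupt counts
root@nas0:~ # ifconfig lagg1 # per-port LACP state flags on the 25G lagg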
More info on the storage side:
root@nas0:~ # zpool list
NAME          SIZE   ALLOC  FREE   EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  29.5G  4.15G  25.4G  -         -     14%  1.00x  ONLINE  -
tank          87T    2.50T  84.5T  -         1%    2%   1.00x  ONLINE  /mnt
- Looking at the zfs history, there was nothing going on at any of the times we experienced this (no snapshots, etc.).
Watching "zpool iostat -v tank 1" during the time this is happening shows very little activity at all. (only couple of hundred K of write, very little read)
Similarly, I'm not seeing anything strange (high utilization) from vmstat, iostat, or netstat.
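Concretely, the sort of thing I was watching (all stock FreeBSD utilities; exact flags approximate):

root@nas0:~ # vmstat 1     # CPU, memory, and fault activity per second
root@nas0:~ # iostat -x 1  # extended per-device disk statistics
root@nas0:~ # netstat -w 1 # packets/bytes in and out per second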
The problem seems to clear itself up after some time (a few minutes? hard to say)... I can't really nail down when it starts.
I tried looking for anything like a cron job or runaway process while this was happening, but couldn't see any. Are there any other processes (some type of housekeeping or garbage collection) that run periodically and might be causing this?
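For the record, these are the places I checked for scheduled jobs; on FreeBSD/FreeNAS the usual suspects would be the periodic(8) runs and cron:

root@nas0:~ # grep periodic /etc/crontab # daily/weekly/monthly schedule
root@nas0:~ # ls /etc/periodic/daily     # the individual housekeeping scripts
root@nas0:~ # crontab -l                 # anything scheduled for root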