Network Services Drop

Status
Not open for further replies.

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
yeah I don't care what the topic is.. you have 8000+ messages on anything, you know your #$%^ lol
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I never had problems with the network dropping until recently.
It's entirely possible that something else is causing your network drops, but it's important to eliminate the obvious before exploring more exotic possibilities.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's entirely possible that something else is causing your network drops, but it's important to eliminate the obvious before exploring more exotic possibilities.

What he's describing is *SO* much like the filer just being overwhelmed and things timing out, I'd say the likelihood of it being something else is remote. There's no harm in setting up an ongoing ping session somewhere to monitor the filer, but with 8GB and L2ARC and iSCSI and RAIDZ, everything I've seen over the years screams "totally inadequate system." I never really appreciated just how bad it was until I had actually tried a whole bunch of combinations, at which point my opinion on RAM roughly doubled and my disdain for RAIDZ grew to the point where I simply wouldn't suggest anyone use it except on massively large systems with large numbers of vdevs, or maybe extremely light usage.
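For anyone who wants that ongoing ping session, below is a minimal monitoring sketch in Python; the filer address, probe interval, and log file name are placeholders, not anything from this thread.

Code:
#!/usr/bin/env python3
"""Log a line whenever the filer stops answering pings (rough sketch)."""
import subprocess
import time
from datetime import datetime

FILER = "192.168.1.10"    # placeholder: your FreeNAS box's address
INTERVAL = 5              # seconds between probes
LOGFILE = "filer-ping.log"

while True:
    try:
        # One ping per probe; -c 1 works on both Linux and FreeBSD ping.
        ok = subprocess.run(
            ["ping", "-c", "1", FILER],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            timeout=5,
        ).returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    if not ok:
        with open(LOGFILE, "a") as f:
            f.write(f"{datetime.now().isoformat()} no reply from {FILER}\n")
    time.sleep(INTERVAL)
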
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
yeah I don't care what the topic is.. you have 8000+ messages on anything, you know your #$%^ lol

Thanks for the compliment, but I feel compelled to point out that at five years, that only works out to maybe four or five messages a day.

He has A LOT of input on EVERYTHING.

It's more like I sit around and have time to spout crap while waiting for other things I'm doing to finish. :smile:

Anyways, the sad/bad news is that iSCSI on ZFS tends to consume ungodly amounts of resources in order to perform well, but once you do commit that, it pretty much kicks the crap out of a conventional hard drive based array. There isn't a ton of middle ground, either. If you do it right with ARC and L2ARC on the read side, and keeping a good amount of free space and multiple vdevs on the write side, it moves along very nicely. But as an example, in order to get 7TB of good quality usable space on the VM filer here, I'm looking at a 24-bay system stuffed with 2TB 2.5" laptop-ish hard drives, with 128GB of RAM and 1TB of L2ARC. That's 48TB of HDD, 1TB of flash, and 128GB of RAM to get 7TB of nonsucky HDD storage.
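For what it's worth, the arithmetic behind those numbers works out roughly as sketched below; the ~30% occupancy figure is back-solved from the 7TB target, not something stated in the post, so treat it as an assumption.

Code:
# Rough sketch of the sizing math above (occupancy figure is an assumption).
bays, disk_tb = 24, 2.0
raw_tb = bays * disk_tb              # 48 TB of raw spinning disk
mirrored_tb = raw_tb / 2             # 24 TB of pool capacity with 2-way mirrors
occupancy = 0.30                     # stay well under ~50% full for block storage
usable_tb = mirrored_tb * occupancy  # ~7 TB of "nonsucky" usable space
print(f"{raw_tb:.0f} TB raw -> {mirrored_tb:.0f} TB mirrored -> ~{usable_tb:.1f} TB usable")
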
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
It's entirely possible that something else is causing your network drops, but it's important to eliminate the obvious before exploring more exotic possibilities.
Well, the network dropped this morning even after removing that cache drive the other day. So at least we know for certain that wasn't the culprit. I did grab a debug right after the network services came back online. Would anyone be willing to take a peek at it?

https://www.dropbox.com/s/mi0eqo91zk0vjhk/debug-skooge-20160212072457..tgz?dl=0


Thanks!
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
What he's describing is *SO* much like the filer just being overwhelmed and things timing out, I'd say the likelihood of it being something else is remote. There's no harm in setting up an ongoing ping session somewhere to monitor the filer, but with 8GB and L2ARC and iSCSI and RAIDZ, everything I've seen over the years screams "totally inadequate system." I never really appreciated just how bad it was until I had actually tried a whole bunch of combinations, at which point my opinion on RAM roughly doubled and my disdain for RAIDZ grew to the point where I simply wouldn't suggest anyone use it except on massively large systems with large numbers of vdevs, or maybe extremely light usage.
With your opinions of RAIDZ, what recommendations would you have now? A different OS? Hardware RAID?
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
Thanks for the compliment, but I feel compelled to point out that at five years, that only works out to maybe four or five messages a day.



It's more like I sit around and have time to spout crap while waiting for other things I'm doing to finish. :)

Anyways, the sad/bad news is that iSCSI on ZFS tends to consume ungodly amounts of resources in order to perform well, but once you do commit that, it pretty much kicks the crap out of a conventional hard drive based array. There isn't a ton of middle ground, either. If you do it right with ARC and L2ARC on the read side, and keeping a good amount of free space and multiple vdevs on the write side, it moves along very nicely. But as an example, in order to get 7TB of good quality usable space on the VM filer here, I'm looking at a 24-bay system stuffed with 2TB 2.5" laptop-ish hard drives, with 128GB of RAM and 1TB of L2ARC. That's 48TB of HDD, 1TB of flash, and 128GB of RAM to get 7TB of nonsucky HDD storage.
I haven't seen a lot of talk about CPU, so I am guessing the big resource hog is the RAM. For my (what is now obviously small) install, if I throw more memory at it, it seems the general consensus is that's probably the best way to improve my system.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
With your opinions of RAIDZ, what recommendations would you have now? A different OS? Hardware RAID?
If you study discussions related to block storage on FreeNAS (of which iSCSI is one example) in these forums, you'll see the same recommendations repeated:
  1. Server grade hardware = necessary.
  2. More vdevs = better (higher IOPS).
  3. More RAM = better.
  4. More capacity = better (don't use more than 50% of pool capacity).
For #2, you use striped mirrors. The others should be self-explanatory.
if I throw more memory at it, it seems the general consensus is that's probably the best way to improve my system.
This is the most cost-effective single improvement you can make to any FreeNAS system.
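To put rough numbers on recommendations 2 and 4, here's a sketch comparing the two layouts for a hypothetical 12-disk pool. The per-disk IOPS figure and the one-vdev-gives-roughly-one-disk-of-random-IOPS rule of thumb are assumptions, not something from the posts above.

Code:
# Hypothetical 12 x 2 TB pool, laid out two ways (rules of thumb, not benchmarks).
disks, disk_tb, disk_iops = 12, 2.0, 125   # ~100-150 random IOPS per HDD

# Striped 2-way mirrors: 6 vdevs.
mirror_vdevs = disks // 2
mirror_tb = mirror_vdevs * disk_tb         # 12 TB of pool capacity
mirror_iops = mirror_vdevs * disk_iops     # ~750 random IOPS (one disk's worth per vdev)

# Two 6-disk RAIDZ2 vdevs.
raidz_vdevs, width, parity = 2, 6, 2
raidz_tb = raidz_vdevs * (width - parity) * disk_tb   # 16 TB of pool capacity
raidz_iops = raidz_vdevs * disk_iops                  # ~250 random IOPS

print(f"mirrors: {mirror_tb:.0f} TB, ~{mirror_iops:.0f} IOPS")
print(f"raidz2:  {raidz_tb:.0f} TB, ~{raidz_iops:.0f} IOPS")

RAIDZ2 wins on raw capacity, but for block storage the vdev count (and therefore random IOPS) is what keeps iSCSI responsive, and recommendation 4 still says to use only about half of whichever capacity you end up with.
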
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
If you study discussions related to block storage on FreeNAS (of which iSCSI is one example) in these forums, you'll see the same recommendations repeated:
  1. Server grade hardware = necessary.
  2. More vdevs = better (higher IOPS).
  3. More RAM = better.
  4. More capacity = better (don't use more than 50% of pool capacity).
For #2, you use striped mirrors. The others should be self-explanatory.

This is the most cost-effective single improvement you can make to any FreeNAS system.
Robert,

1. I have server hardware (obviously not nearly enough lol)
2. I am pretty sure I have striped RAIDZ; is there a better config?
3. This is becoming PAINFULLY obvious :)


Thanks for the feedback :)
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Robert,

1. I have server hardware (obviously not nearly enough lol)
2. I am pretty sure I have striped RAIDZ; is there a better config?
3. This is becoming PAINFULLY obvious :)


Thanks for the feedback :)
Mirrors for iSCSI, not RAIDZ, unless you have an all-flash pool.
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
Mirrors for iSCSI, not RAIDZ, unless you have an all-flash pool.
ah, I see.

Anyone see anything related to the network drop in today's debug? I have an Intel PRO/1000 I can throw in that box, but figured the two onboard Intel NICs would work.
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
They're Intel, which is a preferred brand. I can't comment on whether the specific chipsets (i210 + i217LM) have issues.
So far I tried one onboard NIC, then switched to the other. The next step will be to try my PCI card and see if the problem goes away. I have also been reading A LOT of other forums and guides, and I think a boost in memory is going to be coming out of the tax return :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, the network dropped this morning even after removing that cache drive the other day. So at least we know for certain that wasn't the culprit.

Well, no, we don't; the amount of RAM is still insufficient, and the lack of RAM (or ARC, to be more precise) is a big issue. I *guarantee* that the L2ARC was part of the problem; removing it freed up a small amount of RAM, but not enough to fix the problem (and it wasn't expected to).

So I think what you're still running into is that ZFS has the potential to go catatonic for periods if it is under-resourced. It is essentially trying very hard to do too much. This can especially happen in the following cases:

1) A full-ish pool (> 50-60% full) with a lot of fragmentation ("zpool list" frag > 15-20%) and then you throw a massive write task at it. Note that the numbers I give here are at best guesses, but I've been fighting this issue for years so they're vaguely educated guesses.

2) A pool (especially a fragmented one) where you cause it to become extremely busy with reads, and the pool lacks the capacity to maintain the requested IOPS. This is where ARC and L2ARC are important.
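If you want to check your own pool against those rough thresholds, something like the sketch below works; the 50% and 15% cutoffs are just the guesses above, and the parsing assumes stock `zpool list` output.

Code:
#!/usr/bin/env python3
"""Flag pools that exceed the rough occupancy/fragmentation guesses above."""
import subprocess

out = subprocess.run(
    ["zpool", "list", "-H", "-o", "name,capacity,fragmentation"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    name, cap, frag = line.split("\t")
    cap = int(cap.rstrip("%"))
    frag = int(frag.rstrip("%")) if frag not in ("-", "") else 0
    warnings = []
    if cap > 50:
        warnings.append(f"occupancy {cap}% (> ~50%)")
    if frag > 15:
        warnings.append(f"fragmentation {frag}% (> ~15%)")
    print(f"{name}: " + ("; ".join(warnings) if warnings else "within the rough thresholds"))
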

With your opinions of RAIDZ, what recommendations would you have now? A different OS? Hardware RAID?

Instead of RAIDZ, mirrors. A different OS will "solve" the problem by being much less aggressive about what it is trying to do. Ditto with hardware RAID. The problem is, you'll lose performance too.

See, the thing is that if you provide it with resources, ZFS is capable of doing AMAZING performance tricks with a hard disk. Your typical hard disk is capable of maybe 100-150 IOPS, which translates to around 600KB/sec. Look at what ZFS can do to that:

[Attached chart: delphix-small.png, throughput vs. pool occupancy]

With 10% occupancy, it is capable of TEN TIMES the throughput on random IOPS workloads. That drops as the disk fills, but even around 50%, it is still faster than disk. And if you provide resources like plenty of ARC and L2ARC, then reads become hella-fast.

But it isn't actually magic. It's an exchange of one kind of resource for another.
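The 600KB/sec figure a few paragraphs up assumes small random I/Os; here is the back-of-the-envelope version, with 4KiB per I/O as the assumed size (my assumption, not stated in the post).

Code:
# Back-of-the-envelope: raw HDD random throughput vs. the ~10x claim at low occupancy.
io_kib = 4                          # assumed random I/O size
for iops in (100, 150):
    raw_kib_s = iops * io_kib       # 400-600 KiB/sec from a bare disk
    print(f"{iops} IOPS x {io_kib} KiB = {raw_kib_s} KiB/s raw, "
          f"~{raw_kib_s * 10} KiB/s at ~10% pool occupancy")
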
 

Jason Brunk

Dabbler
Joined
Jan 1, 2016
Messages
28
Looks like in addition to the memory I will be doing some house cleaning as well to free up more space. :)
 

dvg_lab

Dabbler
Joined
Jan 27, 2015
Messages
11
I have the same issue on my setup. I've found that everything works fine while ESXi runs 5.5.0 Update 2 (build 2403361), but when I install 5.5.0 Update 3 or even the latest update (build 5230635), I see the same thing:
Code:
WARNING: 10.250.101.32 (iqn.1998-01.com.vmware:esxi02-59532b10): no ping reply (NOP-Out) after 5 seconds; dropping connection


and there really is no ping reply. My setup has a lot of RAM and server-grade hardware: an Asus server with dual Xeon E5520s, 192GB of ECC RAM, and 8x 4TB Seagate HDDs.
It also has a dual-port 10G Ethernet card installed:
Code:
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> port 0xe880-0xe89f mem 0xf9e80000-0xf9efffff,0xf9f78000-0xf9f7bfff irq 30 at device 0.0 on pci8

I'm using both ports in multipathing mode for high availability; ALUA works in Most Recently Used path selection mode.
iSCSI is running on FreeNAS-9.10.2-U6 (561f0d7a1).

I even have a second FreeNAS setup, on Supermicro server hardware with 48GB of RAM, that shows the same behavior.

So I've tried most of the magic that I could, but the only trick that works is rolling the ESXi update back to build 2403361 (Update 2). I actually have two ESXi servers running on HP DL380 G8 hardware and one on a DL380 G9; if I downgrade the ESXi version on any of the servers, everything works like a charm, but if I install the patches on any of them, it starts throwing NOP-Out errors after 8-12 hours. I also have a VMware service contract, but I can't use it: FreeNAS isn't certified, so they can't help me.
 

dvg_lab

Dabbler
Joined
Jan 27, 2015
Messages
11
The network cards on the DL380 G8 are HP 571SFP+ cards based on the SolarFlare Solarstorm SFN5162F chip (model SFC9020)... Yesterday I updated the driver to the latest version, but everything stayed the same. Interestingly, when FreeNAS reports the NOP-Out, the NIC disconnects entirely; I don't even see pings between the esxi1 and esxi2 hosts on that interface while it's in the NOP-Out state. Ping statistics show about 30-40% packet loss.
I don't know; maybe try changing the Solarstorm to Intel?
 