Locking up periodically and gaps in report info

steve6341 · Dec 10, 2013

Running FreeNAS 9.1.1 from an 8GB USB drive (Kingston DTSE9H/8GBZ). The system consists of an Intel S5000PSL motherboard with 2 Xeon E5345 processors and 32GB of Kingston ECC RAM. I have a 3ware RAID controller (9650SE-12ML) with 6 x 4TB (Seagate ST4000NM0033) hard drives connected. These drives are in one raidz2 volume giving me roughly 14.5TB. This is all in a SuperMicro CSE-846TQ-R900B chassis with 900-watt redundant power supplies. I am using one of the on-board gigabit Ethernet ports with a static IP; it appears to be an 82563EB chipset. There is one Unix (NFS) share and one Windows (CIFS) share on the system. There are a handful of CentOS 6.4 machines that access the NFS share and one Windows 2008 machine that touches the CIFS share. Not very high volume in my opinion, maybe 250mb per hour being read or written (mostly written).

Here is my problem: Several times per day the system becomes unresponsive. You cannot read or write to the shares from Windows or CentOS. The web interface is sometimes accessible during this time but usually is not. You can ping the IP address of the system during this time. The only way I have found to get it to start back up (other than rebooting) is to start an SSH session with the system from PuTTY. You don't have to log in, just start the SSH session so that it prompts for a username. Then all of the shares come back, the processes that are reading and writing to the shares from either Windows or CentOS continue, and the web interface comes up. The other peculiar thing is that if you look at the 'Reporting' tab in the web interface, there are gaps in all of the graphs for the time period that the system was unavailable.

The issue seems very similar to the problem raised here: http://forums.freenas.org/threads/nfs-dies-under-load.14346/ but there really is no good solution mentioned and it does not seem to be the exact same issue.

I'm not sure where to go from here. The hardware is mostly new. The used parts were pulled from a machine that was previously running Windows without any problems. The system is in a large data center that is climate controlled and has very reliable power. I'm not sure what to do other than try running an older release like 8.3.2.

I would appreciate any suggestions. Please let me know if I can provide any other information that may help.

joeschmuck · Dec 10, 2013

So you can typically restore system operation by opening an SSH window using PuTTY? That is odd, very odd. Curious what TOP reports.

steve6341 · Dec 10, 2013

Here is a screenshot from the TOP command just a few minutes ago:

I know it's strange and I wouldn't believe it if someone else told me it was happening to them. Here is also a screenshot from the reporting page. FreeNAS locked up twice in the last hour. You can see the gaps in the graphs.

cyberjock · Dec 10, 2013

Those images are broken for me.

joeschmuck · Dec 10, 2013

Images show up fine for me. The gaps look like the system had been shut down for between 2 and almost 4 minutes at a time, roughly 25 minutes apart. And 'nfsd' has the overall highest percentage of usage, nothing else comes close.
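For anyone who wants to measure gaps like these rather than eyeball them from the graphs, here is a minimal sketch. It assumes you can get the sample times out as plain epoch seconds; the `find_gaps` name, the 10 second sample interval, and the sample data are all made up for illustration and are not taken from FreeNAS.

```python
def find_gaps(timestamps, expected_interval_s, tolerance=1.5):
    """Return (gap_start, gap_end, gap_seconds) for every pair of consecutive
    samples that are further apart than tolerance * expected_interval_s."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        if delta > expected_interval_s * tolerance:
            gaps.append((previous, current, delta))
    return gaps


# Hypothetical samples taken every 10 s with one outage of about 3 minutes.
samples = [0, 10, 20, 30, 210, 220, 230]
gaps = find_gaps(samples, expected_interval_s=10)
for start, end, seconds in gaps:
    print(f"no samples between t={start} and t={end} ({seconds} s)")
```

The tolerance factor is only there so that ordinary jitter in the sampling interval is not reported as an outage.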

cyberjock · Dec 10, 2013

I just tried to open it and I get the error "connection to the server was reset"

DrKK · Dec 10, 2013

Yeah steve, so this is EXACTLY what things would look like if you shut the system down for those "lock up" periods.

That's very odd indeed. I hope someone has an idea better than turning off each service one at a time until the undesired behavior stops...

joeschmuck · Dec 11, 2013

cyberjock said:
I just tried to open it and I get the error "connection to the server was reset"

Did you try another browser?

@Steve, try turning off the services you use, starting with NFS, and see if that fixes the problem. This might be the only quick way to isolate the problem. Even if you do locate the problem I would also replace your boot flash drive with a different one, load up 9.1.1 from scratch, and perform a minimal configuration on it. If it works fine then deliberately start adding things back until you either find the problem or it just works. One last thing: the very first thing I would do is rule out hardware failure. I'm not sure how failing ECC RAM would show itself, but maybe you could run some stress tests on your system. Run MemTest86+ for 3 full passes, and run a CPU stress test as well to ensure something isn't happening there. And are you overclocking anything? If so, stop. And do not take offense at some of my suggestions. I have no idea what level of knowledge anyone out here has, so I assume they don't know anything; it's just a safe way to go and it can save a lot of time.

steve6341 · Dec 13, 2013

Some more information. I'm hoping someone can help me figure this out. I ran MemTest86+ for about 12 hours. It was able to get through 5 passes without any errors. I updated the firmware on the motherboard to the latest available from Intel. I replaced the Kingston USB drive with a SanDisk 4GB and started using a different USB port. I updated FreeNAS to 9.2.0-RC-x64. I am still having the same issues with lock ups that unlock when you start an SSH session.

One new thing that I noticed: The system time is all over the place. (I noticed this before any of the above changes.) I can sync it but the log will repeatedly show time resets of a few seconds several times per hour. The resets will get larger and larger until it finally fails some sanity check (over 1000 seconds) and the log indicates that you have to manually sync the time. I am not sure if this is causing my lock ups, is caused by the lock ups, or is some other unrelated problem. The time jumps seem to coincide with the gaps in the reporting graphs. Every time the time "jumps", though, the system is not locked up?? I have found that we have short bursts of high NFS usage during the day. TOP can report that the usage is around 80~90%. The machine has 8 cores so I'm not sure if that is a real problem. Can NFS usage possibly slow the system down enough to affect the clock/time?
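The "sanity check" at 1000 seconds matches ntpd's default panic threshold. As a rough sketch of how ntpd treats an offset of a given size, assuming stock defaults (step threshold 0.128 s, panic threshold 1000 s) and ignoring the stepout timer and the -g option; the `classify_offset` helper is made up for illustration and is not part of ntpd:

```python
STEP_THRESHOLD_S = 0.128    # assumed ntpd default: larger offsets are stepped
PANIC_THRESHOLD_S = 1000.0  # assumed ntpd default: larger offsets make ntpd give up


def classify_offset(offset_s):
    """Roughly classify what ntpd does with a clock offset given in seconds."""
    magnitude = abs(offset_s)
    if magnitude > PANIC_THRESHOLD_S:
        return "panic"  # ntpd logs an error and exits; the clock must be set by hand
    if magnitude > STEP_THRESHOLD_S:
        return "step"   # clock is jumped; logged as "time reset"
    return "slew"       # clock frequency is adjusted gradually, no jump


for offset in (0.05, 3.2, 1200.0):
    print(offset, classify_offset(offset))
```

So resets of a few seconds that keep growing would show up as repeated steps until one of them crosses the panic threshold, which fits the log messages described above.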

Any additional help or suggestions would be greatly appreciated.

joeschmuck · Dec 13, 2013

It only took 12 hours to get 5 complete passes? Damn, that is fast for 32GB of ECC RAM. It takes me overnight to get 3 passes. Guess the server quality has its virtues.

Have you turned off the NFS service to see if the problem stops? In fact I'd turn off just about every service you can and see if you still have the problem. Sure, you won't be sharing files, but does your system freeze up at all? The priority should be on finding the source of the problem. The other thing is you could have a faulty MB, but if you never had this problem before then why would you have it now? If you mounted this MB into a different case maybe you induced a failure. The possible reasons are too many to go over, so I seriously suggest you just turn off your services and see if it works fine. You could also disconnect all your drives, pop in a boot-from-CD OS like Ubuntu and see how that runs for a few days, and see if the clock shows the same problem.

Unfortunately you are on your own right now until you can isolate the issue a little more.

TimAM · Dec 16, 2013

Hi, I found this post as I'm just about to start building a new FreeNAS server using the same board.

It's interesting you say the time is drifting - that could cause some weird behavior - what time server are you using? (I don't remember ever specifying one during setup)
http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/network-ntp.html
Perhaps the system is somehow correcting for a drift in time until it becomes so large that the 'validity' of foreign requests established when the time was right is no longer accepted. Creating a new SSH connection could 'refresh' this to real current time. If you SSH in, does that session stop when the file transfers stop? If you type 'date' at the prompt, is that time drifting relative to the real time?

I think that makes sense in a really weird way.

Also, can I ask what sort of read/write performance you're getting from the system for continuous NFS/AFP reads and writes?

steve6341 · Dec 17, 2013

I'm using the default NTP servers that are loaded with FreeNAS: 0.freebsd.pool.ntp.org, 1.freebsd.pool.ntp.org, and 2.freebsd.pool.ntp.org. I have also tried setting up a cron job to run every minute pointing to a Windows 2008 server that is acting as a time server on our network. The time still goes out of sync, quickly. During the periods where the system is locked up it seems like the cron job does not run.
Not sure what you mean by 'real current time' but the SSH doesn't change the system time. The SSH session kick starts the server but the time remains out of sync. Yes, the SSH session will eventually die (along with the file transfers) if I actually log into SSH. The SSH will die even while running something like the TOP command. The date command yields the same date and time that the FreeNAS web interface shows, several hours or days behind depending on the last sync. And my read/write performance... not impressive, at all. But that could be related to the other problems I am having so I wouldn't make any decisions based on the performance of my setup.

I have turned off NFS and CIFS locks up. I have turned off CIFS and NFS locks up. I have turned off SSH and... well, you get it. I am going to try a new motherboard, processor, and RAM: Intel DH87RL, i7-4770, and 32GB of RAM. If that works, I will concede that somehow the S5000PSL board is bad. If it doesn't work I will have to give FreeNAS version 8.3.2 a shot.

rm-r · Dec 18, 2013

I have the same thing as OP - gaps in graphs, i haven't lost connection though..... I'll check my time drifts...

rm-r · Dec 20, 2013

hi guys,

i literally just experienced this issue again - lost complete connectivity to the FreeNAS: SSH, file shares, PBI web UIs - EVERYTHING! and have gaps in my graphs

logs show this

Code:

Dec 20 19:45:22 NAS last message repeated 3 times
Dec 20 19:50:26 NAS ntpd[2175]: time reset +299.968169 s

note the 5 minute jump..... that's how long it was out for! so seems like the delay was waiting to catch up to the time?
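If anyone else wants to check their own logs for the same pattern, here is a minimal sketch that pulls the ntpd "time reset" entries out of syslog-style lines. It only assumes the line format shown in the log above; the `find_time_resets` name and the `min_abs_offset` parameter are made up for illustration.

```python
import re

# Matches syslog lines such as:
#   Dec 20 19:50:26 NAS ntpd[2175]: time reset +299.968169 s
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"ntpd\[\d+\]:\s+time reset\s+(?P<offset>[+-]?\d+(?:\.\d+)?)\s+s"
)


def find_time_resets(lines, min_abs_offset=0.0):
    """Return (timestamp, offset_seconds) for every ntpd clock step in lines."""
    resets = []
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match is None:
            continue  # not an ntpd "time reset" line
        offset = float(match.group("offset"))
        if abs(offset) >= min_abs_offset:
            resets.append((match.group("stamp"), offset))
    return resets


sample = [
    "Dec 20 19:45:22 NAS last message repeated 3 times",
    "Dec 20 19:50:26 NAS ntpd[2175]: time reset +299.968169 s",
]
resets = find_time_resets(sample)
for stamp, offset in resets:
    print(f"{stamp}: clock stepped by {offset:+.1f} s")
```

To run it against a real log, pass the open file object instead of `sample`; raising `min_abs_offset` filters out the small resets so only the big jumps are listed.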

rm-r · Dec 20, 2013

also uptime remains the same (over a day)

cyberjock · Dec 20, 2013

The only thing I can think of is perhaps some hardware failure(or hardware incompatibility).

PierceIt · Dec 25, 2013

I am having the time drift issue on my 9.1.1 build too. Sometimes I can see in the shell preview at the bottom of the web console that the time is successfully reset due to drift (sometimes it resets every couple of minutes), but if I leave the server for half a day or so it must eventually stop resetting, so the time drifts by hours and CIFS stops working. I can't reset permissions, etc.

Either dropping into the shell and manually setting the time or making a change to one of the NTP server settings forces the time to reset and everything starts working correctly again.

I've been googling this today and found several posts like this one about time drift and FreeNAS 9.1.1

Unfortunately, this is a new build for me and my first time with FreeNAS, so I can't say if a previous version of FreeNAS would have had this same issue on this hardware or not.

if anyone finds out more on the time drift issue - please post here

PierceIt · Dec 25, 2013

(i should have also mentioned that I have the same "gap in reporting chart images" as well)

rm-r · Dec 29, 2013

Right so some testing.....

i updated to 9.2.2 via the gui update
- this caused this issue to happen every few minutes, system was unusable
i then got a new USB stick
- installed the full 9.2.0 on there and i still have this issue. that's without importing the pools, jails etc - literally just turning the machine on after installing with the USB image, setting a password and setting my time zone and adding the output to the footer.
Both my NICs are Realtek
- the one on the Mother board and the PCI card. I have tried them separately on the LAN (not lagged)

This issue does seem to coincide with a DHCP renew in the logs but I can't be sure

My hardware is listed below in my signature

do you other guys in this thread also have realtek NICs?

Do any more experienced users have suggestions of things to try? logs to check? i'm going to have to go back to windows unless I find a solution..... please save me!

DrKK · Dec 29, 2013

"Every 5 minutes" is a sacred thing in FreeBSD. There's a cron job (it sysexec's the atrun thing) every 5 minutes in the cron tab, and there's also something that goes with the pbid (push-button-installer daemon) that is also set to (usually) do something every 5 minutes.

I'm just giving that as a data point for someone with a black belt in this stuff to maybe connect some dots, I don't know if it's related.
