Locking up periodically and gaps in report info

Dusan · Jan 13, 2014

cyberjock said:
Well, since autotune is for 128GB+ systems... duh!?

I'm not arguing, but are you sure about that? Did any FN developer confirm that statement?
I did check the autotune script when I was setting up my system and I did enable it on my 16GB system because:

It sets some sysctls that make sense even with much less than 128GB of memory -- kern.ipc.maxsockbuf = 2097152, net.inet.tcp.recvbuf_max = 2097152 and net.inet.tcp.sendbuf_max = 2097152.
When calculating the kernel memory & ARC sizes it actually considers low memory scenarios. For example, when calculating vfs.zfs.arc_max it would normally set it to 9/10 of vm.kmem_size. However, on a memory-challenged system it makes sure that the kernel gets at least 1GB and userland at least 2.5GB, even if that means that the ARC will be smaller than 9/10 of kmem (but not less than 1GB).
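
The sizing rule described above can be illustrated with a rough shell sketch. This is not the actual autotune script. The function name `arc_max_mb`, the choice to work in whole megabytes, and especially the way kmem is derived from total RAM (RAM minus the userland reserve) are assumptions made only for illustration. Only the 9/10 ratio, the 1GB kernel minimum, the 2.5GB userland minimum and the 1GB ARC floor come from the post.

```sh
#!/bin/sh
# Hypothetical sketch of the rule described in the post, NOT the real autotune code.
USERLAND_MIN_MB=2560   # "userland at least 2.5GB"
KERNEL_MIN_MB=1024     # "kernel gets at least 1GB"
ARC_FLOOR_MB=1024      # ARC "not less than 1GB"

# arc_max_mb RAM_MB -> prints a candidate ARC limit in MB
arc_max_mb() {
    ram_mb=$1
    # Assumption: kmem is whatever is left after the userland reserve.
    kmem_mb=$(( ram_mb - USERLAND_MIN_MB ))
    # Normal case: ARC is 9/10 of kmem (integer division).
    arc_mb=$(( kmem_mb * 9 / 10 ))
    # Low-memory case: leave at least KERNEL_MIN_MB of kmem to the kernel.
    limit_mb=$(( kmem_mb - KERNEL_MIN_MB ))
    if [ "$arc_mb" -gt "$limit_mb" ]; then
        arc_mb=$limit_mb
    fi
    # Never go below the ARC floor.
    if [ "$arc_mb" -lt "$ARC_FLOOR_MB" ]; then
        arc_mb=$ARC_FLOOR_MB
    fi
    echo "$arc_mb"
}
```

Under these assumptions, `arc_max_mb 16384` stays on the 9/10 path, while `arc_max_mb 8192` is capped by the kernel reserve and `arc_max_mb 4096` falls back to the 1GB floor.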

cyberjock · Jan 13, 2014

I talked to jpaetzel in IRC about it, and I've changed the note slightly.

Basically, the whole autotune feature was designed because ZFS was being too aggressive with RAM usage once you got to 128GB+ of RAM. Autotune was the solution. Unfortunately, it didn't really work as well as had been hoped, and the 9.x changes have made this feature almost useless.

He said that the autotune feature has been left to bitrot since 8.3 because it's really unnecessary. He also said that it should always cause a performance drop by artificially limiting the ARC size. He said that with 9.2 the feature really shouldn't be necessary and that autotune is slated to be removed in favor of a better system. Basically, autotune is somewhat useful for systems that don't have enough RAM and keep crashing. Here's what the current note says (taken right out of jpaetzel's comment in IRC):

NOTE: Autotune tries to keep ZFS from causing hangs by allocating memory too aggressively. Autotune helps with system stability if the system exhibits instability with default tuning values. This should be used if you get hangs with the default settings. In theory this will always slow the system down by capping the ARC.

david kennedy · Jan 13, 2014

For what it is worth, I have the same gaps in my logs (DELL C1100 setup; it's a test setup before I move it to another system).

Anyhow, there were a lot of questions about the NICs, so here are the specs:

Network: Intel® 82576 – 2 x Gb Ethernet
Processors: Intel Xeon CPU Quad-Core L5520 2.26GHz 8MB 5.86 GT/s QPI SLBFA
Motherboard: Intel Custom
Hard Disk: Up to 4 SAS/SATA Drives
Memory: RDIMM ECC DDR3 1066MHz 24GB (6 x 4GB)

As I mentioned, it's more of a proof of concept right now. It boots off an 8GB flash drive and has a single Seagate 3TB disk in one of the hot-swap slots.

What is also "interesting" is that the graph history fails to survive a reboot (it always starts off blank).

Everything else runs fine including the jails and apps.

Nate · Jan 13, 2014

Dusan said:
I don't think your case is related. The syslog you posted contains the entire boot sequence. This means your system crashed and restarted and the gap you see is caused by collectd not running (it starts at the end of the boot process). The other cases reported in this thread are about time drifts & lockups, but I think no one reported a restart.
So, the question you should be asking is not why you see gaps in the graphs, but why is your system rebooting?

I just had another episode and I have nothing new in my log this time, but with the same gaps in my reports as shown in this thread. It seems I am having the same problem as others in here.

jpaetzel · Jan 14, 2014

A dmesg output would be useful

rm-r · Jan 14, 2014

sorry i took my machine apart last night ready for the new motherboard, cpu and ram that will arrive this week. once back up i'll post a dmesg if the error still happens - can anyone else provide a dmesg?

rm-r · Jan 14, 2014

dmesg output would be useful

mine is on page 2 of this thread (and now in the bug https://bugs.freenas.org/issues/3889)

perlguy9 · Jan 14, 2014

I also have an older Intel SP5000L motherboard and am experiencing strange time gaps and clock problems on 9.2.0 (with autotune enabled -- I'll turn it off).

perlguy9 · Jan 14, 2014

I'm downgrading to 8.3.2 to see if the problems go away..

rm-r · Jan 14, 2014

I'm downgrading to 8.3.2 to see if the problems go away..

worked for me - but couldn't then import the pool as ZFS version was too new...

rm-r · Jan 15, 2014

Can anyone also do this to help? (from the bug report)

Alright, this is what we should do.
Grab this:
git clone https://github.com/alfredperlstein/eagleeye.git
then cd eagleeye/src and run install.sh
You should create a ZFS dataset for this to live on.
It will grab a bunch of stats every 5 seconds. After you have another one of those blackout periods upload the eagleeye results dir to this ticket.

perlguy9 · Jan 15, 2014

My downgrade to 8.3.2 wasn't successful, for a variety of reasons, so I can probably grab this data.

rm-r · Jan 15, 2014

awesome, thank you - my machine is in pieces at the moment (literally)

rm-r · Jan 16, 2014

ok - took some doing but i got there - note you need to extract the statmatic.tgz before running the ./install.sh!

of course i have had no issues in 4 hours.... also unable to attach my main pool (as the discs are attached to my new motherboard now) - but i have a single disc in use in there... the logs are huge - up to 120MB already, so keep that in mind when creating a dataset!

rough steps i took...

Code:

cd into the dataset you want to store the logs on
create a directory
git clone https://github.com/alfredperlstein/eagleeye.git
cd eagleeye/src
tar xzf statmatic.tgz
chmod +x statmatic.sh install.sh
./install.sh
follow the prompts
./statmatic.sh
start logging!

i'm going to leave mine overnight with a PC looping a large MKV file to stream off it... let's hope i have something by the morning...
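
For anyone sizing a dataset for these logs, the figure above (about 120MB after about 4 hours) gives a rough growth rate. The sketch below simply extrapolates it linearly. The function name `estimate_log_mb` is made up for illustration, and real growth will depend on the hardware and on what the collector captures.

```sh
#!/bin/sh
# Rough estimate only, based on the single data point quoted in this post.
OBSERVED_MB=120      # log size seen so far
OBSERVED_HOURS=4     # time it took to reach that size

# estimate_log_mb HOURS -> prints the expected log size in MB, assuming linear growth
estimate_log_mb() {
    echo $(( $1 * OBSERVED_MB / OBSERVED_HOURS ))
}
```

For example, `estimate_log_mb 24` gives the expected size after one day and `estimate_log_mb 168` after one week, so a blackout that only shows up every few days may need several GB of space.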

rm-r · Jan 16, 2014

ok, so i put the machine back together last night and it ran for about 4 hours, then seems to have crashed - when i woke up this morning it was on the error screen saying "this is a zfs volume - not boot" (or similar), so she crashed and burned - i hadn't configured the bios to boot from usb all the time. i have the logs in my dropbox - can i email a link to them to you? i'd rather not post them publicly as i'm not sure of the contents. thanks

cyberjock · Jan 16, 2014

PM me the link...

rm-r · Jan 16, 2014

done cyberjock - thanks

cyberjock · Jan 16, 2014

Here's random stuff I found that interests me...

Your NIC = Realtek... crap in my opinion. Could be the cause, but not particularly likely.
Your CPU = AMD... crap in my opinion. Could be related to the cause (very likely based on prior users and the info below).

dmesg:

ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND (20110527/psargs-392)
ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20110527/nsinit-380)
...
amdsbwd0: <AMD SB8xx Watchdog Timer> at iomem 0xfec000f0-0xfec000f3,0xfec000f4-0xfec000f7 on isa0
...
umass0: <vendor 0x1005 USB FLASH DRIVE, class 0/0, rev 2.00/1.00, addr 2> on usbus3
...
pid 1751 (vmware-checkvm), uid 0: exited on signal 10

That info from dmesg tells me the following:

Your ACPI support is not compatible with FreeNAS/FreeBSD. You could try rebooting the server and choose the menu option for acpi=disabled.
Your watchdog timer may be related to the problem. Watchdog timers reboot servers when they misbehave.
Your USB flash drive appears to be some crap no-name brand. You should be using a name brand ONLY. (This tends to corroborate what I'm going to say in a minute.. keep reading)
For some reason some VMware check tool exited. I don't think this is normal, as your system should have booted up, recognized it's not in a VM, then never loaded anything related. Could be wrong though.
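
The same kind of first-pass check can be done on any dmesg output with a small filter. The sketch below is only an illustration: the function name `scan_dmesg`, the labels and the pattern list are assumptions built from the four lines quoted above, not a complete list of things worth looking for.

```sh
#!/bin/sh
# Sketch: flag the kinds of dmesg lines called out in this post.
# Reads dmesg text on stdin and prints each matching line with a short label.
scan_dmesg() {
    while IFS= read -r line; do
        case "$line" in
            "ACPI Error:"*|"ACPI Exception:"*) echo "acpi: $line" ;;
            *"Watchdog Timer"*)                echo "watchdog: $line" ;;
            umass*)                            echo "usb: $line" ;;
            *"exited on signal"*)              echo "crash: $line" ;;
        esac
    done
}
```

It can be fed live output with `dmesg | scan_dmesg`, or a saved copy with `scan_dmesg < dmesg.txt`. A match only marks a line as worth reading; it does not prove that line is the cause.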

Normally, if your system tries to boot from your zpool disks you'll get that stupid warning that says "this is a data disk". The fact that you woke up to that tells me that (1) the server rebooted itself and (2) the USB stick was not detected or (3) was not set as bootable by default in the BIOS or (4) is otherwise having problems.

So I'd get a new name-brand USB stick and install FreeNAS on it. Import your config file, then boot up with ACPI disabled. See if that helps at all. I will warn you that some AMD boards have hardware that isn't compatible with FreeBSD/FreeNAS and the only solution is to get rid of that board. Unfortunately, since I avoid AMD like the plague, I can't really provide much help on how to prove it's the motherboard other than using a different one.

rm-r · Jan 16, 2014

Thanks cyberjock

Yes you are correct on many points

I have ordered super micro, Intel cpu, ecc ram – all here apart from ram…. Can’t wait….

It’s an apacer usb – popular here in NZ – but I have also bought a sandisk for the new setup already

I was just running this to help out the other users (as I have the new kit mentioned above), so I hadn’t changed the BIOS permanently to boot from USB, so on the crash it tried to boot off the ZFS disc – hopefully my issues will go away once the new box is up….

perlguy9 · Jan 18, 2014

I'm setting up the stats collector now..
