Locking up periodically and gaps in report info

Status
Not open for further replies.

Dusan (Guru; joined Jan 29, 2013; 1,165 messages)
Well, since autotune is for 128GB+ systems... duh!?
I'm not arguing, but are you sure about that? Did any FN developer confirm that statement?
I did check the autotune script when I was setting up my system, and I enabled it on my 16GB system because:
  1. It sets some sysctls that make sense even with much less than 128GB of memory: kern.ipc.maxsockbuf = 2097152, net.inet.tcp.recvbuf_max = 2097152 and net.inet.tcp.sendbuf_max = 2097152.
  2. When calculating the kernel memory and ARC sizes, it actually considers low-memory scenarios. For example, when calculating vfs.zfs.arc_max it would normally set it to 9/10 of vm.kmem_size. However, on a memory-constrained system it makes sure that the kernel gets at least 1GB and userland at least 2.5GB, even if that means the ARC will be smaller than 9/10 of kmem (but not less than 1GB).
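To make point 2 concrete, here is a hypothetical Python reconstruction of the sizing rule as described above; it is not the actual autotune script. The names `estimate_arc_max` and `sysctl_conf_lines`, the use of GiB units, and the assumption that the 1GB kernel and 2.5GB userland reservations are subtracted from physical RAM are all assumptions. The three network sysctls from point 1 are included as constants (2097152 bytes is 2 MiB).

```python
# Hypothetical reconstruction of the autotune rules described above;
# NOT the actual FreeNAS autotune source.
GiB = 1024 ** 3

# Point 1: socket-buffer sysctls the script sets (2097152 bytes = 2 MiB).
NET_SYSCTLS = {
    "kern.ipc.maxsockbuf": 2097152,
    "net.inet.tcp.recvbuf_max": 2097152,
    "net.inet.tcp.sendbuf_max": 2097152,
}

KERNEL_MIN = 1 * GiB            # kernel keeps at least 1 GB
USERLAND_MIN = int(2.5 * GiB)   # userland keeps at least 2.5 GB
ARC_FLOOR = 1 * GiB             # ARC never drops below 1 GB


def estimate_arc_max(physmem: int, kmem_size: int) -> int:
    """Return an estimated vfs.zfs.arc_max in bytes.

    Assumption: the kernel and userland reservations come out of physical
    memory, which caps the ARC below 9/10 of kmem on small systems.
    """
    arc_max = kmem_size * 9 // 10                               # normal case
    arc_max = min(arc_max, physmem - KERNEL_MIN - USERLAND_MIN)  # reservations
    return max(arc_max, ARC_FLOOR)                              # 1 GB floor


def sysctl_conf_lines(settings: dict) -> list:
    """Render settings in sysctl.conf 'name=value' form."""
    return [f"{name}={value}" for name, value in settings.items()]


if __name__ == "__main__":
    for gb in (4, 16, 128):
        arc = estimate_arc_max(gb * GiB, gb * GiB)
        print(f"{gb} GB RAM -> arc_max ~ {arc / GiB:.1f} GB")
    print("\n".join(sysctl_conf_lines(NET_SYSCTLS)))
```

If you treat kmem_size as equal to installed RAM (another simplification), the sketch gives about 1.0 GB of ARC on a 4 GB box, 12.5 GB on 16 GB and 115.2 GB on 128 GB: the reservations cut into the 9/10 figure on the 4 GB and 16 GB systems but not at 128 GB.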
 

cyberjock (Inactive Account; joined Mar 25, 2012; 19,526 messages)
I talked to jpaetzel in IRC about it, and I've changed the note slightly.

Basically, the whole autotune feature was designed because ZFS was being too aggressive with RAM usage once you got to 128GB+ of RAM. Autotune was the solution. Unfortunately, it didn't really work as well as had been hoped, and the 9.x changes have made this feature almost useless.

He said that the autotune feature has been left to bitrot since 8.3 because it's really unnecessary. He also said that it should always cause a performance drop by artificially limiting the ARC size. He said that with 9.2 the feature really shouldn't be necessary and that autotune is slated to be removed in favor of a better system. Basically, autotune is somewhat useful for systems that don't have enough RAM and keep crashing. Here's what the current note says (taken straight from jpaetzel's comment in IRC):

NOTE: Autotune tries to keep ZFS from causing hangs by allocating memory too aggressively. Autotune helps with system stability if the system exhibits instability with default tuning values. This should be used if you get hangs with the default settings. In theory this will always slow the system down by capping the ARC.
 

david kennedy (Explorer; joined Dec 19, 2013; 98 messages)
For what it's worth, I have the same gaps in my logs (Dell C1100; it's a test setup before I move it to another system).

Anyhow, there have been a lot of questions about the NICs, so here are the specs:

Network: Intel® 82576, 2 x Gb Ethernet
Processors: Intel Xeon CPU Quad-Core L5520 2.26GHz 8MB 5.86 GT/s QPI SLBFA
Motherboard: Intel Custom
Hard Disk: Up to 4 SAS/SATA Drives
Memory: RDIMM ECC DDR3 1066MHz 24GB (6 x 4GB)

As I mentioned, it's more of a proof of concept right now. It boots off an 8GB flash drive and has a single Seagate 3TB disk in one of the hot-swap slots.

What is also "interesting" is that the graph history does not survive a reboot (it always starts off blank).

Everything else runs fine including the jails and apps.
 

Nate (Dabbler; joined Jan 11, 2014; 20 messages)
I don't think your case is related. The syslog you posted contains the entire boot sequence. This means your system crashed and restarted, and the gap you see is caused by collectd not running (it starts at the end of the boot process). The other cases reported in this thread are about time drifts and lockups, but I don't think anyone reported a restart.
So the question you should be asking is not why you see gaps in the graphs, but why your system is rebooting.
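The distinction drawn here (a reboot leaves a boot sequence in syslog, a lockup does not) can be checked mechanically. Below is a minimal Python sketch, assuming you have already extracted the reporting sample times and the syslog boot times as Unix epoch seconds. The function name `find_gaps`, the `tolerance` parameter and the 10-second default interval (collectd's default `Interval`, which FreeNAS may override) are illustrative assumptions.

```python
from typing import List, Tuple


def find_gaps(samples: List[float], boots: List[float],
              interval: float = 10.0, tolerance: float = 3.0
              ) -> List[Tuple[float, float, str]]:
    """Return (start, end, kind) for every hole in a series of sample times.

    A hole is any spacing larger than tolerance * interval. If a boot
    timestamp falls inside the hole, the machine restarted ("reboot");
    otherwise it stayed up but stopped logging ("lockup" or a stalled
    collector).
    """
    gaps = []
    ordered = sorted(samples)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > tolerance * interval:
            rebooted = any(prev < b < cur for b in boots)
            gaps.append((prev, cur, "reboot" if rebooted else "lockup"))
    return gaps


# Hypothetical data: one silent gap, then one gap containing a boot.
samples = [0, 10, 20, 30, 400, 410, 420, 900, 910]
boots = [850]
for start, end, kind in find_gaps(samples, boots):
    print(f"{start:.0f}s -> {end:.0f}s: {kind}")
```

A gap classified as "reboot" points at the question above (why is the system restarting?), while a "lockup" gap matches the hangs and time drifts reported elsewhere in this thread.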

I just had another episode. There is nothing new in my log this time, but I have the same gaps in my reports as shown in this thread. It seems I am having the same problem as others here.
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
Sorry, I took my machine apart last night, ready for the new motherboard, CPU and RAM that will arrive this week. Once it's back up I'll post a dmesg if the error still happens. Can anyone else provide a dmesg?
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)

perlguy9 (Cadet; joined Dec 6, 2013; 9 messages)
I also have an older Intel SP5000L motherboard and am experiencing strange time gaps and clock problems on 9.2.0 (with autotune enabled; I'll turn it off).
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
I'm downgrading to 8.3.2 to see if the problems go away..
Worked for me, but I then couldn't import the pool because its ZFS version was too new...
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
Can anyone also do this to help? (from the bug report)

Alright, this is what we should do.
Grab this:
git clone https://github.com/alfredperlstein/eagleeye.git
then cd eagleeye/src and run install.sh
You should create a ZFS dataset for this to live on.
It will grab a bunch of stats every 5 seconds. After you have another one of those blackout periods, upload the eagleeye results dir to this ticket.
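The general idea behind a collector like this (take a snapshot on a fixed schedule and write it to disk immediately, so a blackout shows up as missing rows) can be sketched in Python. This is not eagleeye's code: `sample_forever`, `load_probe`, `demo`, the JSON-lines format and the use of `os.getloadavg()` as the probe are illustrative assumptions.

```python
import json
import os
import tempfile
import time
from typing import Callable, Dict, List, Optional


def load_probe() -> Dict[str, float]:
    """Example probe: 1/5/15-minute load averages (Unix only)."""
    one, five, fifteen = os.getloadavg()
    return {"load1": one, "load5": five, "load15": fifteen}


def sample_forever(path: str, probe: Callable[[], Dict[str, float]],
                   interval: float = 5.0,
                   count: Optional[int] = None) -> None:
    """Append one timestamped JSON line per interval to `path`.

    Each line is flushed and fsync'ed right away so the last sample before
    a hang or reboot survives on disk. count=None means run forever.
    """
    taken = 0
    with open(path, "a", encoding="utf-8") as out:
        while count is None or taken < count:
            record = {"ts": time.time(), **probe()}
            out.write(json.dumps(record) + "\n")
            out.flush()
            os.fsync(out.fileno())
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)


def demo(samples: int = 3) -> List[dict]:
    """Take a few samples into a throwaway file and return the parsed rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "samples.jsonl")
        sample_forever(path, load_probe, interval=0.0, count=samples)
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

For a real capture you would point `path` at a file on a dedicated dataset, as the quoted instructions suggest, and leave `count` unset so it keeps running until the next blackout.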
 

perlguy9 (Cadet; joined Dec 6, 2013; 9 messages)
My downgrade to 8.3.2 wasn't successful, for a variety of reasons, so I can probably grab this data.
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
Awesome, thank you. My machine is in pieces at the moment (literally).
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
OK, it took some doing but I got there. Note that you need to extract statmatic.tgz before running ./install.sh!

Of course, I have had no issues in 4 hours... I'm also unable to attach my main pool (the disks are attached to my new motherboard now), but I have a single disk in use in there. The logs are huge, up to 120MB already, so keep that in mind when creating a dataset!
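To put the 120MB-in-4-hours figure in perspective when sizing the dataset, here is a quick projection. It assumes the logs grow linearly at the observed rate, which may not hold if the collector's output varies with load; `project_log_size` is a hypothetical helper.

```python
def project_log_size(observed_mb: float, observed_hours: float,
                     target_hours: float) -> float:
    """Linearly extrapolate log size (MB) from one observed growth sample."""
    rate_mb_per_hour = observed_mb / observed_hours
    return rate_mb_per_hour * target_hours


# Figures from this post: about 120 MB after 4 hours of logging.
for label, hours in (("1 day", 24), ("1 week", 24 * 7)):
    print(f"{label}: ~{project_log_size(120, 4, hours):.0f} MB")
```

At roughly 30MB per hour that works out to about 720MB per day and around 5GB per week, so a multi-day capture needs a few gigabytes of headroom.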

Rough steps I took:

Code:
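cd /mnt/<pool>/<dataset>          # the dataset you want to store the logs on
mkdir eagleeye-logs               # create a directory (any name will do)
cd eagleeye-logs
git clone https://github.com/alfredperlstein/eagleeye.git
cd eagleeye/src
tar -xzf statmatic.tgz            # must be extracted BEFORE running install.sh
chmod +x statmatic.sh install.sh  # assumes statmatic.sh is extracted into this directory
./install.sh                      # follow the prompts
./statmatic.sh                    # start logging!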


I'm going to leave mine running overnight with a PC looping a large MKV file streamed off it... let's hope I have something by morning...
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
OK, so I put the machine back together last night and it ran for about 4 hours, then seems to have crashed. When I woke up this morning it was on the error screen saying "this is a zfs volume - not boot" (or similar), so she crashed and burned. I hadn't configured the BIOS to always boot from USB. I have the logs in my Dropbox; can I email you a link to them? I'd rather not post them publicly as I'm not sure of the contents. Thanks.
 

cyberjock (Inactive Account; joined Mar 25, 2012; 19,526 messages)
PM me the link...
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
Done, cyberjock. Thanks!
 

cyberjock (Inactive Account; joined Mar 25, 2012; 19,526 messages)
Here are some random things I found that interest me:

Your NIC = Realtek... crap in my opinion. Could be the cause, but not particularly likely.
Your CPU = AMD... crap in my opinion. Could be related to the cause (very likely, based on prior users and the info below).

dmesg:

ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND (20110527/psargs-392)
ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20110527/nsinit-380)
...
amdsbwd0: <AMD SB8xx Watchdog Timer> at iomem 0xfec000f0-0xfec000f3,0xfec000f4-0xfec000f7 on isa0
...
umass0: <vendor 0x1005 USB FLASH DRIVE, class 0/0, rev 2.00/1.00, addr 2> on usbus3
...
pid 1751 (vmware-checkvm), uid 0: exited on signal 10

That info from dmesg tells me the following:

Your ACPI support is not compatible with FreeNAS/FreeBSD. You could try rebooting the server and choosing the menu option for acpi=disabled.
Your watchdog timer may be related to the problem. Watchdog timers reboot servers when they misbehave.
Your USB flash drive appears to be some crap no-name brand. You should be using a name brand ONLY. (This tends to corroborate what I'm going to say in a minute... keep reading.)
For some reason, a VMware check tool exited. I don't think this is normal: your system should have booted up, recognized it's not in a VM, and then never loaded anything related. I could be wrong, though.
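Signal numbers in these kernel messages are FreeBSD's, and several differ from Linux's; on FreeBSD, signal 10 is SIGBUS (a bus error, often an invalid memory access). The sketch below parses the `exited on signal` line quoted above and names the signal. The table covers only signals 1 to 15 and is written out by hand because Python's `signal` module reflects the host OS, not FreeBSD; `decode_exit` and the regex are illustrative assumptions based on the single line shown.

```python
import re
from typing import Optional, Tuple

# FreeBSD signal numbers 1-15 (see FreeBSD signal(3)); several differ from
# Linux, e.g. SIGBUS is 10 on FreeBSD but 7 on Linux x86.
FREEBSD_SIGNALS = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 4: "SIGILL", 5: "SIGTRAP",
    6: "SIGABRT", 7: "SIGEMT", 8: "SIGFPE", 9: "SIGKILL", 10: "SIGBUS",
    11: "SIGSEGV", 12: "SIGSYS", 13: "SIGPIPE", 14: "SIGALRM", 15: "SIGTERM",
}

# Matches e.g. "pid 1751 (vmware-checkvm), uid 0: exited on signal 10"
EXIT_RE = re.compile(r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)")


def decode_exit(line: str) -> Optional[Tuple[int, str, str]]:
    """Return (pid, process name, signal name), or None if no match."""
    m = EXIT_RE.search(line)
    if m is None:
        return None
    pid, name, _uid, sig = m.groups()
    return int(pid), name, FREEBSD_SIGNALS.get(int(sig), f"signal {sig}")


print(decode_exit("pid 1751 (vmware-checkvm), uid 0: exited on signal 10"))
```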

Normally, if your system tries to boot from your zpool disks, you'll get that stupid warning that says "this is a data disk". The fact that you woke up to that tells me that (1) the server rebooted itself, and that the USB stick either (2) was not detected, (3) was not set as the default boot device in the BIOS, or (4) is otherwise having problems.

So I'd get a new name-brand USB stick and install FreeNAS on it. Import your config file, then boot up with ACPI disabled. See if that helps at all. I will warn you that some AMD boards have hardware that isn't compatible with FreeBSD/FreeNAS, and the only solution is to get rid of that board. Unfortunately, since I avoid AMD like the plague, I can't really provide much help on how to prove it's the motherboard other than using a different one.
 

rm-r (Contributor; joined Jan 7, 2013; 166 messages)
Thanks, cyberjock.

Yes, you are correct on many points.

I have ordered a Supermicro board, an Intel CPU and ECC RAM; it's all here apart from the RAM... Can't wait...

It's an Apacer USB stick (popular here in NZ), but I have also already bought a SanDisk for the new setup.

I was just running this to help out the other users (as I have the new kit mentioned above), so I hadn't permanently changed the BIOS to boot from USB; after the crash it tried to boot off the ZFS disk. Hopefully my issues will go away once the new box is up...
 