Hi there, I too am experiencing load/box getting "stuck" issues with the newest FreeNAS (9.1.1) that I wasn't experiencing in the 8.* branch (this is a long one, so please bear with me).
First off, both boxes' environments:
=============
====================
(Box 1.) FreeNAS box info 9.1.1
====================
=============
[root@Box00] ~# uname -a
FreeBSD Box00 9.1-STABLE FreeBSD 9.1-STABLE #0 r+16f6355: Tue Aug 27 00:38:40 PDT 2013
root@build.ixsystems.com:/tank/home/jkh/src/freenas/os-base/amd64/tank/home/jkh/src/freenas/FreeBSD/src/sys/FREENAS.amd64 amd64
=============
System load and memory stats of FreeNAS box (16GB of memory - deduplication off)
=============
last pid: 19466; load averages: 0.54, 0.53, 0.51 up 9+06:48:57 10:39:48
26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 5368K Active, 381M Inact, 14G Wired, 763M Buf, 1598M Free
ARC: 11G Total, 1874M MFU, 7886M MRU, 32K Anon, 128M Header, 1730M Other
Swap: 8192M Total, 8192M Free
=============
Hard drive space usage stats of FreeNAS box (1.46TB used out of 5.44TB)
=============
[root@filer00] ~# zpool list
myvol00 5.44T 1.46T 3.98T 26% 1.00x ONLINE /mnt
Error digging: There are no errors in /var/log/messages or dmesg when the problems described further down occur, but I think the load increases on FreeNAS too. The full description follows after the client info.
=============
====================
(Box 2.) FreeNAS client info - KVM environment (tried both Ubuntu and CentOS)
====================
=============
[root@client] ~# uname -a
Linux myboxA01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
=============
System load and memory stats (8GB of memory)
=============
top - 16:35:58 up 15:06, 2 users, load average: 0.06, 0.04, 0.08
Tasks: 147 total, 1 running, 145 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 98.6%id, 0.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7889784k total, 5393792k used, 2495992k free, 112964k buffers
Swap: 8028152k total, 0k used, 8028152k free, 2039944k cached
=============
Mounting the FreeNAS NFS share in /etc/fstab:
I've tried many different ways of mounting, but the issue still arises. Here is my latest mount config:
=============
10.95.1.40:/mnt/myvol00/mydir /mnt/mydirnfs soft,timeo=900,retrans=3,tcp
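For anyone reading those options: as far as I understand nfs(5), timeo is given in tenths of a second, so timeo=900 means 90 seconds per attempt rather than 900. Here is a rough Python sketch that breaks the option string down. The function names are made up for illustration and are not part of any NFS tooling.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict."""
    options = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            # Bare flags such as "soft" or "tcp" carry no value.
            options[key] = True
        elif value.isdigit():
            options[key] = int(value)
        else:
            options[key] = value
    return options


def timeo_seconds(options):
    """Convert timeo from tenths of a second (per nfs(5)) to seconds."""
    return options["timeo"] / 10.0


opts = parse_mount_options("soft,timeo=900,retrans=3,tcp")
print(opts)
print("timeo in seconds:", timeo_seconds(opts))
```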
Description of my problems: (first part - rsync process getting "stuck" but running again when the user issues a command at the shell). I did end up getting through this.
(This first part may be related to the 'dying under load' part.)
OK, it all started when I first installed FreeNAS 9.1.1 in place of the previous 8.* branch. After I installed the latest version and everything was good, I began rsyncing some data back from an external volume that I had backed up to before the install. That volume is attached separately from the main FreeNAS zpool (i.e. an external hard drive or a separate volume on an LSI card). That's when I noticed FreeNAS would get "stuck" during the rsync. Not completely hung, not frozen, but 'stuck' being the key word here. I would begin the rsync, come back a few hours later and run df -ah at the shell expecting much progress to have been made, only to find that maybe 4GB had transferred. That's odd, I thought. I ran top on the box, and the rsync process looked alive and well, but its load was very low, as if it had gone to sleep or something.
So I came back to the shell, ran df -ah a couple more times, and saw that the rsync had increased to 5GB and the process load had kicked up again. OK, strange, I thought. Let me wander off for another couple of hours. Sure enough, I came back a couple of hours later, ran df -ah, and it was only at 8GB. Strange again, I thought. So I rinsed and repeated, and got the same behavior! The rsync only continues if the box sees some kind of user interactivity (a df in this case). So I painstakingly got all the data transferred over a few days by continually running df on the box to keep it 'awake' (weird, huh?).
Once that was all done and my data was back on my fresh new FreeNAS 9.1.1, I chalked the rsync transfer issue up to a fluke, or some weird rsync bug that had nothing to do with FreeNAS. Anyway, I didn't really care; I finally had all my data back in place on a new version of FreeNAS (woohoo!). That was until my real, critical problem started happening:
Description of my problems: (second part - client hanging and dumping the following error messages and kernel trace)
So I installed my client (first Ubuntu 12, and later CentOS 6.4 to rule out any client-OS-specific issues, as I'll explain) and got it all set up with the NFS utilities. Everything mounted great, no problems the first time. "Sweet," I'm thinking. I configured my site and everything came up, awesome! All the data was there! So I went away thinking "all in a day's work, everything is cool". Until I came back about 2 hours later. I loaded the site and everything had slowed to a crawl; it wouldn't load. I logged into the client and ran top: the load was very high, and climbing. However, I saw no process causing this really high load. So I started checking the web server logs - nothing. Then I checked kern.log on Ubuntu and the messages log on CentOS and saw these:
Under load, and seemingly at random, the FreeNAS client gets these:
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648109] INFO: task wget:2373 blocked for more than 120 seconds.
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648347] wget D ffff88021fd14580 0 2373 2372 0x00000000
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648353] ffff880210509b98 0000000000000002 ffff880210509fd8 0000000000014580
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648357] ffff880210509fd8 0000000000014580 ffff880211b85dc0 ffff88021fd14e30
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648360] ffff88021ffb72e8 0000000000000002 ffffffff8113f130 ffff880210509c10
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648364] Call Trace:
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648375] [<ffffffff8113f130>] ? wait_on_page_read+0x60/0x60
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648381] [<ffffffff816f843d>] io_schedule+0x9d/0x130
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648384] [<ffffffff8113f13e>] sleep_on_page+0xe/0x20
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648387] [<ffffffff816f6180>] __wait_on_bit+0x60/0x90
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648390] [<ffffffff8113eeff>] wait_on_page_bit+0x7f/0x90
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648394] [<ffffffff81085560>] ? wake_atomic_t_function+0x40/0x40
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648399] [<ffffffff8114bac1>] ? pagevec_lookup_tag+0x21/0x30
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648402] [<ffffffff8113f011>] filemap_fdatawait_range+0x101/0x190
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648485] [<ffffffff8114ab0e>] ? do_writepages+0x1e/0x40
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648488] [<ffffffff811406e9>] ? __filemap_fdatawrite_range+0x59/0x60
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648491] [<ffffffff811407ff>] filemap_write_and_wait_range+0x3f/0x70
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648519] [<ffffffffa01425b8>] nfs_file_fsync+0x78/0x90 [nfs]
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648525] [<ffffffff811d52fd>] generic_write_sync+0x4d/0x60
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648528] [<ffffffff8114152e>] generic_file_aio_write+0x9e/0xc0
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648537] [<ffffffffa01428b1>] nfs_file_write+0xb1/0x1e0 [nfs]
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648542] [<ffffffff811a6ab0>] do_sync_write+0x80/0xb0
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648545] [<ffffffff811a71ed>] vfs_write+0xbd/0x1e0
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648549] [<ffffffff811a7c29>] SyS_write+0x49/0xa0
Sep 16 02:18:46 myboxA0 kernel: [ 3600.648554] [<ffffffff81702fef>] tracesys+0xe1/0xe6
Sep 16 02:19:25 myboxA0 kernel: [ 3639.264053] nfs: server 10.95.1.40 not responding, still trying
Sep 16 02:19:47 myboxA0 kernel: [ 3661.600063] nfs: server 10.95.1.40 not responding, still trying
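To see which processes get hit and how often, the "blocked for more than 120 seconds" lines can be tallied from the log. This is only a minimal Python sketch: the regex is written against the message format pasted above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... INFO: task wget:2373 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(lines):
    """Count hung-task warnings per process name."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


# Two lines copied from the log above: one hung-task warning, one NFS notice.
sample = [
    "Sep 16 02:18:46 myboxA0 kernel: [ 3600.648109] INFO: task wget:2373 "
    "blocked for more than 120 seconds.",
    "Sep 16 02:19:25 myboxA0 kernel: [ 3639.264053] nfs: server 10.95.1.40 "
    "not responding, still trying",
]
print(count_hung_tasks(sample))
```

On a real client the same function could be fed an open log file (kern.log on Ubuntu, the messages log on CentOS) instead of the sample list.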
So I'm like, "OK, is FreeNAS unresponsive? What gives?" So I SSH from the client to the FreeNAS box. Now here's the interesting part: the second I initiate that SSH connection to FreeNAS, everything comes back! The webpage loads right away and the client load goes back down to normal, but by then the client has already dumped the trace above. Now I'm starting to see shades of the df -ah thing that made the rsync continue. Could I replicate this? Sure enough, it's a continuous cycle: some process that accesses the FreeNAS share a lot hangs because the NFS server 'goes away' (I've seen it happen with wget, nginx, and other processes, and I did a lot of Google searching to make sure it wasn't process-specific). But the moment I try to connect to the server over the network with SSH, everything comes back - literally, the web browser goes from spinning and waiting to an instant load.
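One way to pin down when the server 'goes away', without waiting for the site to stop loading, would be a small probe on the client that times a filesystem call against the NFS mount and logs the slow ones. This is a rough Python sketch under assumptions: the function names, the 10 second interval and the 1 second "slow" threshold are arbitrary choices of mine. While the server is not responding, the call simply blocks until the soft mount gives up, so the elapsed time is itself the measurement.

```python
import os
import time


def probe_once(path):
    """Return how long one statvfs() call on path takes, in seconds."""
    start = time.monotonic()
    os.statvfs(path)  # raises OSError if the mount returns an error
    return time.monotonic() - start


def watch(path, interval=10.0, slow_after=1.0, rounds=None):
    """Print a timestamped line whenever a probe is slow or fails.

    rounds=None keeps probing until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            elapsed = probe_once(path)
            if elapsed >= slow_after:
                print(stamp, "slow probe: %.1fs" % elapsed)
        except OSError as err:
            print(stamp, "probe failed:", err)
        done += 1
        time.sleep(interval)


# Quick local check against the current directory.
print("local probe took %.3fs" % probe_once("."))
```

On the client this would be pointed at the mount, e.g. watch("/mnt/mydirnfs"), and the timestamps compared with the "not responding" lines in the kernel log.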
Summing it all up
I'm not sure if both problems are related, but it seems like FreeNAS is 'going to sleep' or something weird under heavy load. Having to run df to make the file transfer continue, and having to SSH in to bring the NFS server back from 'not responding', both seem like the same kind of behavior. I tried Ubuntu and CentOS in hopes of ruling out an operating-system-specific issue, but this happens on both clients with the exact same setup.
Also, I've googled the "blocked for more than 120 seconds" message and results come up, but they are older.
Any ideas why FreeNAS is getting stuck in these weird scenarios? I really want to stick with FreeNAS, as it's awesome, and having to downgrade back to the 8.* branch would be a pain.