Unstable system, with loss of CIFS share

SinDeus · Dec 23, 2015

Hi all,

After having updated to the latest version of FreeNAS (i.e. FreeNAS-9.3-STABLE-201512121950), my server started showing unprecedented issues.

It has occurred twice since then, and the symptoms are as follows:

Loss of the CIFS share.
Unable to connect to the web interface.
Unable to connect via SSH.

Oddly, though, the 4 jails on the server still respond.

The only way I can get around this is to manually reboot the system.

I tried to investigate, and I found that my /var/log/messages is overloaded with these lines:

Code:

Dec 23 00:02:52 jotunheim nmbd[2385]:   Error = Can't assign requested address
Dec 23 00:02:52 jotunheim nmbd[2385]: [2015/12/23 00:02:52.632332,  0, pid=2385, effective(0, 0), real(0, 0)] ../source3/nmbd/nmbd_subnetdb.c:127(make_subnet)
Dec 23 00:02:52 jotunheim nmbd[2385]:   nmbd_subnetdb:make_subnet()
Dec 23 00:02:52 jotunheim nmbd[2385]:     Failed to open nmb bcast socket on interface 0.255.255.255 for port 137.  Error was Can't assign requested address
Dec 23 00:02:52 jotunheim nmbd[2385]: [2015/12/23 00:02:52.632888,  0, pid=2385, effective(0, 0), real(0, 0)] ../source3/lib/util_sock.c:485(open_socket_in)
Dec 23 00:02:52 jotunheim nmbd[2385]:   bind failed on port 137 socket_addr = 0.255.255.255.

The file itself is pretty large; I am currently putting it aside to avoid unnecessary memory usage.
I'm a bit lost regarding the next step to take... Any advice?

SinDeus · Dec 23, 2015

OK, I did some digging and I found this issue. It seems closely related to mine, since my logs are quite big. So, I renamed my logs (to start fresh), and did what the ticket said: set the Samba log level to 'none'.
I will keep you posted.

SinDeus · Dec 24, 2015

#TriplePostFTW

Well, I got this. Hang on, it's pretty wild.

The Samba daemon was logging (I still need to figure out why) tons of the previously mentioned lines in /var/log/messages.
The messages log grew to a tremendous size.
The FreeNAS web interface showed me pictures like this on the Reports tab:

[image not preserved]

I SSH'd into my box and listed the top processes, which showed this one in first place:

Code:

# top

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
73718 root          1 103    0 89748K 16252K CPU0    0 351:44 100.00% python2.7

To figure out what this ominous python process was, a simple command was sufficient:

Code:

# ps aux

USER         PID  %CPU %MEM    VSZ    RSS TT  STAT STARTED       TIME COMMAND
root       73718 100.0  0.2  89748  16252 ??  R     3:05AM  352:54.16 /usr/local/bin/python /usr/local/bin/telemetry-gather.py /var/log/messages /var/log/messages.0.bz2 /var/log/messages.1.bz2 /var/log/messages.2.bz2 /var/log/messages.3.bz

So, the telemetry-gather utility was trying to parse my messages log files, which were huge.
I killed this friggin' process, and everything went back to normal.
I also, as stated before, set the log level of the CIFS service to 'none'.

I think this telemetry-gather script was taking up more and more memory, so the system had to kill other daemons; hence the loss of the web interface and the CIFS service, among others.
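
For anyone wanting to script the "which process is eating the CPU" step above, here is a minimal Python sketch. It only parses text in the ps aux layout shown in this post. The names PsRow, parse_ps_aux and busiest are made up for illustration, and it assumes that none of the first ten columns contains a space.

```python
from typing import List, NamedTuple, Optional


class PsRow(NamedTuple):
    user: str
    pid: int
    cpu: float
    command: str


def parse_ps_aux(output: str) -> List[PsRow]:
    """Parse text in the `ps aux` layout (USER PID %CPU ... COMMAND)."""
    rows = []
    for line in output.splitlines():
        # Split into at most 11 fields so the last one keeps the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header and any short or blank lines
        rows.append(PsRow(fields[0], int(fields[1]), float(fields[2]), fields[10]))
    return rows


def busiest(rows: List[PsRow]) -> Optional[PsRow]:
    """Return the row with the highest %CPU, or None if there are no rows."""
    return max(rows, key=lambda row: row.cpu, default=None)
```

The input text could come from something like subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout; the command field of the result is what revealed telemetry-gather.py here.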

PS: if admins would like me to revert my triple posts into an edit of the first, just say so.

SweetAndLow · Dec 25, 2015

You should file a bug on that telemetry script. It should be designed so that it does not tank the system when reading a large file.
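
To illustrate the kind of design meant here, below is a minimal Python sketch that reads logs one line at a time, so memory use does not grow with file size, and stops after a line budget. This is not the actual telemetry-gather.py (its source is not shown in this thread); iter_log_lines, count_by_daemon and the max_lines cap are hypothetical names chosen for the example.

```python
import bz2
import re
from collections import Counter
from typing import Iterable, Iterator

# Matches the "daemon[pid]:" part of a syslog line, e.g. "nmbd[2385]:".
TAG_RE = re.compile(r"\s(\S+?)\[\d+\]:")


def iter_log_lines(path: str) -> Iterator[str]:
    """Yield lines one by one, from a plain or .bz2-compressed log file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_by_daemon(lines: Iterable[str], max_lines: int = 1_000_000) -> Counter:
    """Count lines per daemon tag, giving up after max_lines lines."""
    counts: Counter = Counter()
    for number, line in enumerate(lines):
        if number >= max_lines:
            break  # hard cap (an assumption): a runaway log cannot keep us busy forever
        match = TAG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Because only one line and a small counter are held at a time, a multi-gigabyte messages file costs time but not memory, and the cap bounds the time as well.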

SinDeus · Dec 25, 2015

You are absolutely right, I will do so whenever I have enough time.

SinDeus · Jan 3, 2016

Hi again.

After days of struggling, I finally decided to come back here. It seems that the telemetry script is not the only one at fault here. Every single day, my server kind of dies. I lose almost everything: web GUI, SSH, CIFS, FTP. If I log into my router interface, I see that the FreeNAS machine has been offline for XX hours. The only cure is a hard reboot...
The strange thing is: the jails are functioning normally. They are all there, they are all accessible and they are all doing their jobs.

What can I do to pinpoint the problem? I have been searching through my /var/log/messages, without success so far. This morning, the security run output email I received contained these lines:

Code:

> Timecounter "TSC-low" frequency 1297080432 Hz quality 1000
> vboxdrv: fAsync=0 offMin=0x280 offMax=0xe78
> re0: promiscuous mode enabled
> epair0a: Ethernet address: 02:05:74:00:07:0a
> epair0b: Ethernet address: 02:05:74:00:08:0b
> epair1a: Ethernet address: 02:87:aa:00:08:0a
> epair1b: Ethernet address: 02:87:aa:00:09:0b
> epair2a: Ethernet address: 02:65:b7:00:09:0a
> epair2b: Ethernet address: 02:65:b7:00:0a:0b
> epair3a: Ethernet address: 02:a1:ac:00:0a:0a
> epair3b: Ethernet address: 02:a1:ac:00:0b:0b
> epair4a: Ethernet address: 02:38:08:00:0b:0a
> epair4b: Ethernet address: 02:38:08:00:0c:0b
> ugen1.5: <Logitech> at usbus1 (disconnected)
> ukbd0: at uhub3, port 3, addr 5 (disconnected)
> uhid0: at uhub3, port 3, addr 5 (disconnected)
> arp: 192.168.1.24 moved from 02:a1:ac:00:0a:0a to bc:5f:f4:b1:00:e8 on epair3b
> arp: 192.168.1.24 moved from 02:65:b7:00:09:0a to bc:5f:f4:b1:00:e8 on epair2b
> re0: link state changed to DOWN
> re0: link state changed to UP

Is this of any help to anyone? I kind of would like to solve this issue without reinstalling the whole system, but if I have to, I will.

anodos · Jan 4, 2016

SinDeus said:
Hi again.

After days of struggling, I finally decided to come back here. It seems that the telemetry script is not the only one at fault here. Every single day, my server kind of dies. I lose almost everything: web GUI, SSH, CIFS, FTP. If I log into my router interface, I see that the FreeNAS machine has been offline for XX hours. The only cure is a hard reboot...
The strange thing is: the jails are functioning normally. They are all there, they are all accessible and they are all doing their jobs.

What can I do to pinpoint the problem? I have been searching through my /var/log/messages, without success so far. This morning, the security run output email I received contained these lines:

Code:

> Timecounter "TSC-low" frequency 1297080432 Hz quality 1000
> vboxdrv: fAsync=0 offMin=0x280 offMax=0xe78
> re0: promiscuous mode enabled
> epair0a: Ethernet address: 02:05:74:00:07:0a
> epair0b: Ethernet address: 02:05:74:00:08:0b
> epair1a: Ethernet address: 02:87:aa:00:08:0a
> epair1b: Ethernet address: 02:87:aa:00:09:0b
> epair2a: Ethernet address: 02:65:b7:00:09:0a
> epair2b: Ethernet address: 02:65:b7:00:0a:0b
> epair3a: Ethernet address: 02:a1:ac:00:0a:0a
> epair3b: Ethernet address: 02:a1:ac:00:0b:0b
> epair4a: Ethernet address: 02:38:08:00:0b:0a
> epair4b: Ethernet address: 02:38:08:00:0c:0b
> ugen1.5: <Logitech> at usbus1 (disconnected)
> ukbd0: at uhub3, port 3, addr 5 (disconnected)
> uhid0: at uhub3, port 3, addr 5 (disconnected)
> arp: 192.168.1.24 moved from 02:a1:ac:00:0a:0a to bc:5f:f4:b1:00:e8 on epair3b
> arp: 192.168.1.24 moved from 02:65:b7:00:09:0a to bc:5f:f4:b1:00:e8 on epair2b
> re0: link state changed to DOWN
> re0: link state changed to UP

Is this of any help to anyone? I kind of would like to solve this issue without reinstalling the whole system, but if I have to, I will.

Post a debug file. 'System' -> 'advanced' -> 'save debug'.

SinDeus · Jan 4, 2016

anodos said:
Post a debug file. 'System' -> 'advanced' -> 'save debug'.

Didn't know that! It sure will help. I will post it this evening, when I get home. Thanks for the tip.

SweetAndLow · Jan 4, 2016

What hardware are you using? Looks like a Realtek NIC; those are not that great and will cause all kinds of networking problems.

SinDeus · Jan 4, 2016

My motherboard is an ASRock H61MV-ITX, with a Realtek RTL8111E - 10/100/1000 Mbps chipset. The jails are still accessible though... And, I forgot to mention, it is impossible to interact with my server: even if I plug in a keyboard or a screen, nothing shows up or happens.

(I forgot to set up my sig with my hardware, will do it right away)

SweetAndLow · Jan 4, 2016

You don't put your hardware in your signature whenever you have a problem, you put it in the post with the question.

SinDeus · Jan 4, 2016

SweetAndLow said:
You don't put your hardware in your signature whenever you have a problem, you put it in the post with the question.

Yup, I get that; but a hardware description in a signature is always helpful to whoever reads a post. Seeing yours triggered that thought.

SinDeus · Jan 4, 2016

@anodos: OK, I extracted some debug files... but I don't know which one I should be posting. Or should I post the whole .tar.gz thing?

In the meantime, I dug into my /var/log/messages around the time the router marked the FreeNAS server as "offline". I found this:

Code:

Jan  4 06:12:11 jotunheim nmbd[2393]:   STATUS=daemon 'nmbd' finished starting up and ready to serve connectionsPacket send failed to 192.168.1.255(138) ERRNO=No route to host
Jan  4 06:13:11 jotunheim nmbd[2393]: [2016/01/04 06:13:11.249327,  0] ../source3/libsmb/nmblib.c:873(send_udp)
Jan  4 06:13:11 jotunheim nmbd[2393]:   Packet send failed to 192.168.1.255(138) ERRNO=No route to host
Jan  4 06:13:11 jotunheim nmbd[2393]: [2016/01/04 06:13:11.249591,  0] ../source3/nmbd/nmbd.c:361(reload_interfaces)
Jan  4 06:13:11 jotunheim nmbd[2393]:   reload_interfaces: No subnets to listen to. Waiting..

Nothing comes before or after this. Does it ring a bell?
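
In case it helps others doing the same kind of digging, here is a small Python sketch that keeps only the syslog lines within a time window around the moment the box went offline. Assumptions, stated loudly: the log uses the "Mon DD HH:MM:SS" prefix seen above, month names are English, and since that format carries no year, the year is passed in by hand. parse_syslog_time and lines_near are names invented for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def parse_syslog_time(line: str, year: int) -> Optional[datetime]:
    """Read the leading 'Mon DD HH:MM:SS' stamp; return None if there is none."""
    parts = line.split(None, 3)  # splitting on whitespace copes with "Jan  4"
    if len(parts) < 3:
        return None
    try:
        # The year is an assumption supplied by the caller, not found in the log.
        return datetime.strptime(
            f"{parts[0]} {parts[1]} {parts[2]} {year}", "%b %d %H:%M:%S %Y"
        )
    except ValueError:
        return None


def lines_near(
    lines: Iterable[str], when: datetime, window: timedelta, year: int
) -> List[str]:
    """Return the lines stamped within +/- window of the given moment."""
    selected = []
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is not None and abs(stamp - when) <= window:
            selected.append(line)
    return selected
```

For example, lines_near(open("/var/log/messages"), datetime(2016, 1, 4, 6, 13), timedelta(minutes=5), 2016) would keep only the entries from 06:08 to 06:18 that morning.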

anodos · Jan 4, 2016

SinDeus said:
@anodos: OK, I extracted some debug files... but I don't know which one I should be posting. Or should I post the whole .tar.gz thing?

In the meantime, I dug into my /var/log/messages around the time the router marked the FreeNAS server as "offline". I found this:

Code:

Jan  4 06:12:11 jotunheim nmbd[2393]:   STATUS=daemon 'nmbd' finished starting up and ready to serve connectionsPacket send failed to 192.168.1.255(138) ERRNO=No route to host
Jan  4 06:13:11 jotunheim nmbd[2393]: [2016/01/04 06:13:11.249327,  0] ../source3/libsmb/nmblib.c:873(send_udp)
Jan  4 06:13:11 jotunheim nmbd[2393]:   Packet send failed to 192.168.1.255(138) ERRNO=No route to host
Jan  4 06:13:11 jotunheim nmbd[2393]: [2016/01/04 06:13:11.249591,  0] ../source3/nmbd/nmbd.c:361(reload_interfaces)
Jan  4 06:13:11 jotunheim nmbd[2393]:   reload_interfaces: No subnets to listen to. Waiting..

Nothing comes before or after this. Does it ring a bell?

Post the whole tar.gz. The message from nmbd typically indicates your NIC just crapped out.

SinDeus · Jan 4, 2016

Whoa that was fast.

Here it goes! Thank you.

anodos · Jan 4, 2016

A few notes:
1) You have two WD Green drives in your RAIDZ1 array that have extremely high Load Cycle Counts. You will want to use wdidle3 on them to fix that and keep an eye on them going forward. Make sure you keep backups of your important information.
2) Your auth.log indicates that an IP address on your local network and one on an external network are accessing your server at almost the same time. Just FYI.
3) FreeNAS is not up-to-date. Try updating to latest and reproducing the problem.
4) You have multiple jails running simultaneously. Try to reproduce the problem with them all stopped.
5) You have a Realtek NIC. These are well known to behave poorly under load in FreeBSD. Perhaps try adding an Intel gigabit NIC and reproducing the problem.

SinDeus · Jan 4, 2016

First of all, thank you for taking a look!

1) I will do that. I'm in the middle of replacing my 2 TB drives with 4 TB ones, so the problem should then solve itself.
2) Mmh. Weird. I'll look into it, thanks. Edit: that's me alright, from my phone, checking if SSH still works :p
3) I was up-to-date a few days ago, and thought the update had caused the issue, hence the rollback. But it still occurs, so I'll update to the very latest version and stay on it.
4) Will do.
5) I will try to find an Intel NIC that fits on my motherboard (I only have a PCIe 3.0 x16 slot; I wonder if I can put a PCIe x1 card in it? Edit: yes I can!).

Again, thank you so much for your time. I will let it run without jails and see what happens - and post the results here.

SinDeus · Jan 5, 2016

Small update: my FreeNAS server is still up. I stopped all jails last night, after updating to the latest version.
Starting tonight, I will start the jails one by one and see what happens (if nothing has happened by then).

Edit: 2 jails have been up since then, and my server is working flawlessly. I'm starting to think that the last jail I set up, a BittorrentSync one, was messing with it, perhaps driving my NIC crazy. After all, my problems seem to have started around the time of this jail's creation...

Edit 2: so far, so good. The culprit was definitely the BittorrentSync jail, as nothing bad happened while it was off. I should receive my new NIC soon, and will try to start the jail again after installing it.

SinDeus · Jan 17, 2016

Well, I haven't started the BittorrentSync jail for nearly 10 days, thinking it was faulty... but my server went down Friday night. So, I installed a new NIC, and am now waiting for something to happen... or not.

SinDeus · Jan 25, 2016

I really like to reply to myself.

It happened. Again. From what I understand from the gazillion logs I got, Samba somehow loses its cool when the network becomes unstable. Perhaps my router isn't feeling too well once in a while...

Anyway, when it experiences these instabilities, the nmbd daemon enters an infinite loop. So, I would like to disable it (I have no need for network name discovery; the share is fully accessible by IP address).
@anodos I wanted to do exactly the same thing as you in this thread, i.e. disable NetBIOS via an extra parameter in the CIFS configuration. However, nmbd is still started when toggling the CIFS share. Do you have any idea how I can prevent it from running? Apart from killing it from the CLI just after starting Samba...
