Timer Interrupt and Hard Reset?

Status
Not open for further replies.

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
Well, I have an X10SL7-F with the watchdog disabled both in the BIOS and with the jumper, running the latest 9.3 update with email correctly configured, and I regularly check the IPMI log.

I don't have these emails/events so you must have something not correctly configured somewhere ;)

I checked everything, it is disabled. But in typical FreeNAS forum fashion I know I'll be called an idiot, noob, moron, know-nothing anyway. Even after I take screenshots and post them in this thread next week. That just seems to be the way this place operates. (Not directed at you in particular).

Since you also have an X10SL7-F, if it isn't in production (or if you just don't care), try to offline a disk via the GUI in the latest 9.3 release. Then try to online it back into your pool without restarting the server or issuing a replace command (i.e. without resilvering it). I found no online option in the GUI, so I had to drop to the CLI. But onlining the device didn't work reliably. I eventually had to "replace" and "resilver". That is a pretty big issue IMO.

Here is another question for you (sorry for going a bit off topic). I believe the X10SL7-F should support hot swapping via the LSI 2308 in a good case. Yet in FreeNAS/FreeBSD it doesn't seem to work, even on this "server grade hardware". I see this:
Code:
da3 at mps0 bus 0 scbus0 target 11 lun 0
da3: <ATA TOSHIBA MG03ACA3 FL1A> s/n            53L8K729F detached
cam_periph_alloc: attempt to re-allocate valid device da3 rejected flags 0x118 re
daasync: Unable to attach to new device due to status 0x6

It sounds to me like this might be an issue with FreeBSD that they are working on but have not finalized. Getting that from HERE, HERE, and HERE. I'm going to test in Solaris soon as well.

These are just some of the things I'm seeing. This doesn't even go into the bugs I've dealt with for many years using FreeNAS that have been rather serious. I'm not trying to "rain on anyone's parade", like I said I like a lot of what FreeNAS does and offers, I'm just being honest and with that honesty comes the fact I think FreeNAS needs to work on "reliable" software on "server grade hardware" or instead create an HCL. This isn't my first rodeo with ZFS, FreeBSD, Solaris, storage servers, etc., but I'm sure plenty of people around here will take offense or call me names anyway. Fortunately I really don't give 2 shits.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
"I checked everything, it is disabled. But in typical FreeNAS forum fashion I know I'll be called an idiot, noob, moron, know-nothing anyway." No, that's not my intention. It's just that with the same setup I don't have the problem, so chances are it's a misconfiguration on your side (I think you'll agree that when there is a problem in the IT realm, 99% of the time it's a PEBCAK :))

"Since you also have an X10SL7-F, if it isn't in production (or if you just don't care), try to offline a disk via GUI" Yeah, it's a production server (and I don't have the perfect backup system I want yet) so I'll not provoke the devil :p

IIRC when a drive doesn't want to online because it has been in a pool before, just reboot and it should solve the problem ;)

Can you reproduce the problem? If yes, try a reboot to see if that's it ;)

Yeah, hot swap isn't particularly well supported, but IIRC I saw a recent post talking about some big improvements there. Sorry, I don't have the URL, so you'll need to search.

FreeNAS isn't perfect, I'll say that too (for example I really missed the man pages...) but it's the best thing I ever saw for a NAS ;) Note that there is TrueNAS which is the business oriented FreeNAS with a support team, etc. I don't know anything about it but it should be more bug free logically.
 

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
Thanks for the response. It wasn't directed at you; in general I just find this forum very hostile, which is unfortunate. I feel most of us like FreeNAS and just want to see it improve.

I double checked, the watchdog is off in BIOS, and the jumper removed. I took pics but I'm on my cell right now. I should also note that you should not have to remove the jumper. The manual states that the jumper can be connected and that for it to really be on requires turning it on via BIOS. The only thing I can think that could cause this not working would be a supermicro firmware issue. Do you know what firmware you are using?

I understand your apprehension about messing with a production system; luckily I'm doing this on a test system right now. Yes, rebooting will fix the issue as you mentioned. I figured that out a while back. However, that is not really optimal. I've also noticed that trying to online the drive can actually cause all the disks in the pool to start becoming unavailable, locking up the web GUI and eventually the whole system. Reboot and it all comes back fine. No matter how you cut it though, that shouldn't be happening, and doing the same steps on a Solaris-based distro doesn't behave that way. A reboot isn't always an option either.

Hot swap is definitely a FreeBSD issue. Perhaps 10 will work better. I tested it in OmniOS and it worked as expected, so I'm a bit disappointed with FreeBSD on this.

I think FreeNAS does a lot of great stuff. However, we can't live in denial and pretend there aren't bugs that need to be addressed. I'm always of the opinion that if someone takes the time to report an issue, they should be greeted with consideration and not dismissal.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Removed? IIRC it's a 3-pin header with the jumper on 1-2 for enabled and 2-3 for disabled, or the reverse.

FW 1.42, BIOS 2.0 ;)
 

lvi

Cadet
Joined
Feb 23, 2015
Messages
1
I guess I may as well chime in on this too. I'm experiencing the same issue as JayG30: same board, same CPU, and I'm willing to bet the same Crucial memory. Watchdog is disabled in the BIOS and I have the jumper disabled too. The resets are seemingly random. Oddly enough, it has only happened while I'm home near the system, but not necessarily when I'm accessing data or the GUI on it (maybe I just emit a bad aura when I sit near it).

Same FW and BIOS as Bidule0hm.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
"oddly enough it's only done it while I'm home near the system" If that's really the case (not just a placebo effect), the only thing I can think of right now is a badly seated connector that vibrates when you walk near the system. But that is very, very unlikely, so I think it's the placebo effect (a bit like how it always seems you're in the slower queue at the shop checkouts, when of course you know it's completely random) :)

So it must be a random bug (and a tough one).
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
There does seem to be a software watchdog system, though apparently it is off by default. Could this be anything to do with it?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
"I checked everything, it is disabled. But in typical FreeNAS forum fashion I know I'll be called an idiot, noob, moron, know-nothing anyway."

You idiot! Just kidding. ;)

Watchdog timers require two things, hardware and software.

If the hardware is disabled as you say, then it shouldn't be a problem.

I know on the software side the watchdog timer doesn't work for everyone (there's just too many different watchdog timers that exist to support them all). I know it worked on my X9SCM-F. It is a bit odd that the X10SL7 doesn't work for 2 people. I wish I had the actual hardware to test but unfortunately I don't. :(
 

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
So just to come back to this, I can confirm, without a shadow of a doubt, that something is wrong here.

The day I posted in this thread I checked both the BIOS and the jumper. The setting is off in the BIOS and the jumper is removed.
Yesterday morning this is in the event log...

noidea.png


Firmware revision and IPMI revision....
noidea2.png


I didn't run an OmniOS/Solaris derivative for as long, but in that time I did not see anything like this. So it seems to be something related to FreeBSD. I'll see about submitting or commenting on bug reports. I'm not sure what to even run to diagnose such an issue.

I SHOULD NOTE THAT THIS HAS NOT CAUSED MY SERVER TO ACTUALLY REBOOT, HANG, OR DO ANYTHING ELSE STRANGE. JUST THE MESSAGES IN SYSTEM EVENT LOG!
 
Joined
Jun 21, 2014
Messages
3
I have also noticed these events though I have a slightly different case as my system experienced a failure in one of the two power supplies. The system rebooted shortly after experiencing the error. The failed power supply continued to alarm. I have since RMAed the power supply and I am currently operating on one supply.

MB#: SUPERMICRO MBD-X10SL7-F
CPU#: Intel Xeon E3-1231V3
RAM#: 32GB (Samsung DDR3-1600 8GB)x4
Manufacturer Part#: M391B1G73QH0-YK0

Event Log
event_log.jpg

IPMI Revision
ipm_device.jpg
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I've just seen this as well while investigating what happened to my system. At 1:32 UTC, I received an email about my system being degraded (it lost 2 drives out of 4 in one RAIDZ2 vdev of the pool). This is exactly what I am seeing:

Code:
510,System Event,07/22/2015 01:32:48 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
511,System Event,07/22/2015 01:32:49 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset
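
If you export the IPMI event log as CSV, a small filter can pull out just the watchdog entries so they are easier to line up against the time of the pool alert. This is only a rough sketch: it assumes the export uses the same comma-separated layout as the two lines above (ID, type, timestamp, sensor, an empty field, description), and `watchdog_events` is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print timestamp and description for watchdog entries.
# Assumed field layout: id,type,timestamp,sensor,(empty),description
# Reads the files given as arguments, or stdin when none are given.
watchdog_events() {
    awk -F, '$4 ~ /^Watchdog/ { print $3 " | " $6 }' "$@"
}

# Example (the file name is only an illustration):
#   watchdog_events sel_export.csv
```

Matching on the sensor column rather than the whole line avoids picking up unrelated events that merely mention a watchdog in their description.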


For a moment I thought that this might coincide with a storm that was passing through around that time; however, the server is on a UPS with surge protectors on all cables. No UPS event was recorded for that time.

My question on this is - is the loss of the 2 drives from the pool due to the hard reset or was there some other event that has caused the timer interrupt + hdd failure?

Here is the output after swapping one of the drives and starting the resilver:

Code:
  pool: mainsafe
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jul 22 09:33:33 2015
        2.79T scanned out of 16.7T at 222M/s, 18h17m to go
        17.4G resilvered, 16.63% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        mainsafe                                        DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/07f5a19f-dab0-11e2-916c-1c6f65c75eb9  ONLINE       0     0     0
            replacing-1                                 UNAVAIL      0     0     0
              11626242231722143569                      UNAVAIL      0     0     0  was /dev/gptid/08487478-dab0-11e2-916c-1c6f65c75eb9
              110263956467305268                        REMOVED      0     0     0  was /dev/gptid/f4269859-3043-11e5-8df2-0cc47a066d63  (resilvering)
            gptid/0f503227-9bd0-11e4-b172-0cc47a066d63  ONLINE       0     0     0
            gptid/08f226ee-dab0-11e2-916c-1c6f65c75eb9  FAULTED      3     4     0  too many errors
          raidz2-1                                      ONLINE       0     0     0
            gptid/b013e750-7c01-11e4-b461-0cc47a066d63  ONLINE       0     0     0
            gptid/b0dae3d8-7c01-11e4-b461-0cc47a066d63  ONLINE       0     0     0
            gptid/b1a64fb9-7c01-11e4-b461-0cc47a066d63  ONLINE       0     0     0
            gptid/b27216fe-7c01-11e4-b461-0cc47a066d63  ONLINE       0     0     0

errors: No known data errors


I am running FreeNAS-9.3-STABLE-201501162230 on X9SCM-F-O with LSI 9220 flashed to the correct version of IT firmware.
 

FX159

Dabbler
Joined
Feb 4, 2015
Messages
11
Small suggestion... even if it's coming late.

Supermicro boards with IPMI do not only have one but two watchdogs!

All you did was disable the BIOS watchdog, but the IPMI also has an independent watchdog.

At least on Linux, the IPMI watchdog only gets enabled when the "ipmi_watchdog" module is loaded.

As long as the "start_now" parameter of the module is not set to 1, the IPMI watchdog only starts if a watchdog daemon starts polling it.

Had my fair share of fun discovering that on an X10DRi-T board recently.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Small suggestion... even if it's coming late.

Supermicro boards with IPMI do not only have one but two watchdogs!

All you did was disable the BIOS watchdog, but the IPMI also has an independent watchdog.

At least on Linux, the IPMI watchdog only gets enabled when the "ipmi_watchdog" module is loaded.

As long as the "start_now" parameter of the module is not set to 1, the IPMI watchdog only starts if a watchdog daemon starts polling it.

Had my fair share of fun discovering that on an X10DRi-T board recently.

I actually did not have the BIOS watchdog enabled. Does the IPMI module exist on FreeNAS as well?
 

FX159

Dabbler
Joined
Feb 4, 2015
Messages
11
I never ran FreeNAS or FreeBSD on that particular board, but when your BIOS watchdog is disabled and there is still a working watchdog then probably yes...

FreeBSD implements an ipmi driver that makes the ipmi watchdog available to the system.

There should be a device named "/dev/fido" on your system. (You can check that on the console.)

"/dev/fido" can be polled by the watchdogd daemon.

At least that's how it works on vanilla FreeBSD.
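
A quick way to check both pieces from the console, as a rough sketch: it looks for the device node and for a running watchdogd. `check_watchdog_plumbing` is a hypothetical helper name, and the check assumes `pgrep` is available; whether FreeNAS behaves exactly like vanilla FreeBSD here is not something I have verified.

```shell
#!/bin/sh
# Hypothetical helper: report whether the watchdog device node exists and
# whether a watchdogd process is currently running.
# The device path defaults to /dev/fido and can be overridden as $1.
check_watchdog_plumbing() {
    dev="${1:-/dev/fido}"
    if [ -e "$dev" ]; then
        echo "$dev: present"
    else
        echo "$dev: missing"
    fi
    # pgrep -x matches the exact process name only.
    if pgrep -x watchdogd >/dev/null 2>&1; then
        echo "watchdogd: running"
    else
        echo "watchdogd: not running"
    fi
}

# Example:
#   check_watchdog_plumbing
```

If the device is present and watchdogd is running, something on the OS side is feeding a watchdog regardless of the BIOS setting.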
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I never ran FreeNAS or FreeBSD on that particular board, but when your BIOS watchdog is disabled and there is still a working watchdog then probably yes...

FreeBSD implements an ipmi driver that makes the ipmi watchdog available to the system.

There should be a device named "/dev/fido" on your system. (You can check that on the console.)

"/dev/fido" can be polled by the watchdogd daemon.

At least that's how it works on vanilla FreeBSD.

I had another reboot last night, with absolutely nothing amiss in the FreeNAS log (I was running tail -f /var/log/messages). In the IPMI log, I can see the same Watchdog 2 events.

I can also see /dev/fido in my system.

My question is - why is this happening? Is something else wrong with my system that makes the watchdog fire? Or can I just disable it safely and not worry about it? If so, how do I do that?

I can see this info when querying for watchdog:

Code:
[root@freenas] ~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      135 sec
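
To check this from a script instead of reading the output by eye, something like the following could summarise the fields that matter. It is a sketch only: `watchdog_state` is a made-up function name, and it assumes the output keeps the same "Label: value" lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool mc watchdog get` output on stdin and
# print a one-line summary of state, action and initial countdown.
# Exit status: 0 = timer running, 1 = timer stopped, 2 = no state line found.
watchdog_state() {
    awk -F': *' '
        /^Watchdog Timer Is/      { state = $2 }
        /^Watchdog Timer Actions/ { action = $2 }
        /^Initial Countdown/      { initial = $2 }
        END {
            if (state == "") exit 2
            printf "state=%s action=%s initial=%s\n", state, action, initial
            exit ((state ~ /^Started/) ? 0 : 1)
        }'
}

# On the box itself (needs ipmitool and root):
#   ipmitool mc watchdog get | watchdog_state
```

The exit status makes it usable in a condition, for example to log only when the timer is armed with a hard reset action.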
 

FX159

Dabbler
Joined
Feb 4, 2015
Messages
11
Then the watchdog fired...

According to ipmitool and the existing "/dev/fido" device the bmc watchdog is definitely up and running.

You can temporarily disable the ipmi watchdog with the following command "ipmitool mc watchdog off".
After that command you can run "ipmitool mc watchdog get" again, it should say the watchdog is off.

Well... are you 100% sure your system did not freeze? If it froze, the watchdog did exactly what it is supposed to do.
Maybe you have a hardware problem; if it only happens irregularly, the watchdog daemon is probably working as intended.
In your situation the watchdog fires after 137 seconds of not being polled...
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Then the watchdog fired...

According to ipmitool and the existing "/dev/fido" device the bmc watchdog is definitely up and running.

You can temporarily disable the ipmi watchdog with the following command "ipmitool mc watchdog off".
After that command you can run "ipmitool mc watchdog get" again, it should say the watchdog is off.

Well... are you 100% sure your system did not freeze? If it froze, the watchdog did exactly what it is supposed to do.
Maybe you have a hardware problem; if it only happens irregularly, the watchdog daemon is probably working as intended.
In your situation the watchdog fires after 137 seconds of not being polled...

Thank you very much for this - will try it out! I do suspect that something may be amiss though it's hard to determine what exactly - and I would rather have the box freeze to be able to inspect what is visible on the screen, etc.

By the way - would a power supply issue show up in the IPMI tools at all (I'm using a standard PSU), for example as a voltage or other warning? Or could the PSU be the culprit without anything being logged?
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Actually, that did not work - the timer was started again!

Code:
[root@freenas ~]# ipmitool mc watchdog off
Watchdog Timer Shutoff successful -- timer stopped
[root@freenas ~]# while true; do ipmitool mc watchdog get; sleep 10; done 
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      300 sec
Present Countdown:      300 sec
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      130 sec
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      130 sec
 

FX159

Dabbler
Joined
Feb 4, 2015
Messages
11
That could probably be the case... sounds like a reasonable idea.

If your BMC has voltage sensors you may be able to spot a faulty PSU, but the readings of the BMC are probably not 100% accurate.

And of course, system freezes can be caused by a faulty PSU.

It did work; however, I forgot to tell you to kill the watchdogd process. You have to kill the watchdogd process before stopping the IPMI watchdog, otherwise watchdogd will poll the IPMI watchdog again and thereby restart it.
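
The order described above could be sketched like this. Treat it as an assumption-laden example rather than a tested recipe for FreeNAS: `pkill watchdogd` is simply one way to stop the daemon (on vanilla FreeBSD `service watchdogd stop` may be the cleaner route), the helper names `run` and `stop_ipmi_watchdog` are made up, and nothing here survives a reboot or stops the system from starting watchdogd again later.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is only printed,
# otherwise it is executed as given.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop the poller first, then the BMC timer.
stop_ipmi_watchdog() {
    # 1. Stop watchdogd so nothing re-arms the timer (pkill sends SIGTERM).
    run pkill watchdogd
    # 2. Stop the BMC watchdog timer itself.
    run ipmitool mc watchdog off
    # 3. Show the result; "Watchdog Timer Is" should now read "Stopped".
    run ipmitool mc watchdog get
}

# Preview the commands without touching anything:
#   DRY_RUN=1 stop_ipmi_watchdog
```

Running the `get` loop from the earlier post afterwards is a reasonable way to confirm the timer stays stopped.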
 

okgunguy

Explorer
Joined
Aug 4, 2015
Messages
72
I forgot to tell you to kill the watchdogd process. You have to kill the watchdogd process before stopping the IPMI watchdog
What is the command to do this?
 