Spontaneous system crashes with iSCSI LUNs

Status
Not open for further replies.

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Thanks. I'll wait for an official statement before I risk another panic!
 

Paccc

Cadet
Joined
Sep 2, 2013
Messages
8
Well if it makes you feel any better, applying the latest STABLE version for me fixed this bug. Just prior to upgrading, my system was crashing about 3 times a day. I updated it about 4 days ago and haven't had a single crash yet. The system has been under pretty heavy load all week too (due to some database table rebuilds), which typically makes it crash even sooner than usual.

I think it's worth a try if this bug is causing a big issue for you. It was a showstopper issue for me, so I was more than happy to test the latest version, and thankfully it finally fixed it (after about 3 months of tearing my hair out). :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I may or may not get this 100% correct, so forgive me if later I find out I'm wrong.

1. A bug in ctl was identified causing the crashes mentioned in this thread. In particular, ctl was optimizing commands that were invalid commands, eventually leading to a crash. Of course we didn't know this until recently.
2. (I'm not sure of the specifics) but at some point this bug was introduced. I'm guessing it has happened in the last 60 days or so because we've only become aware of it recently. It was also hard to narrow down because of specifics to the customer's design that I'm not going to discuss. Suffice it to say, if we had seen this in FreeNAS first it may have been a faster turnaround to fix the bug, which leads me to #3.
3. I've been pretty busy lately, so I've been slacking on staying up on the forums. I have 18 pages of historical threads to catch up on. :/
4. Since nobody from the FreeNAS community created a bug ticket, when a TrueNAS support ticket was created for a TrueNAS user, the support team submitted a bug ticket. This went into the TrueNAS bug ticket system which is internal to iXsystems.
5. The customer that had the issues worked with us and we eventually issued a fix. That fix went into TrueOS.

TrueOS is basically "FreeBSD" with custom stuff added, removed, or changed that makes the basis for FreeNAS and TrueNAS. It's based on FreeBSD, but has its own fork specifically for things like FreeNAS and TrueNAS.

6. Since the ticket was an internal ticket, and since FreeNAS shares the code with TrueOS, the fix was included in the most recent FreeNAS build. But since the TrueNAS ticket is internal and no FreeNAS equivalent was ever made, there doesn't appear to be a changelog for the bug (or the fix) but it is most certainly there. It's just a matter of tracking everything back appropriately.

At the end of the day there are a few things to keep in mind:

1. If you are getting crashes, and you are convinced it's not a hardware failure, put in a bug ticket and attach a debug file. Let the developers figure out if you have a poor configuration, your hardware is bad, or if this is a genuine problem and needs fixing. Panics should *not* typically be happening if you are sticking to recommended gear and taking all of the advice given by the experienced users here in the forums to heart. In this case nobody seems to have put in a bug ticket, so the FreeNAS community didn't really head this one off, which meant it took longer to identify because TrueNAS users are typically more interested in uptime until we can get a definitive reproduction case, etc. If you are a big customer you may have HA, which means you may have crashes every day and you may not know it because the workloads typically fail over seamlessly. So unless you look at the uptime (which I think is what the TrueNAS user noticed) then you may not know. (yay for TrueNAS HA!)

What made this worse was that we couldn't reproduce this in a lab environment, and while some people saw the issue regularly, the vast majority never had a problem. I personally have been using iSCSI for more than a year, but never hit this bug. And with all of the weird and nasty things I've done to my system on the software side, and doing "all that crap I tell you end-users to never do to your production systems" I'd have hoped to have accidentally hit the problem.

2. Internal tickets for TrueNAS are basically 'hidden' because they often contain customer data. Not that we're trying to hide big juicy bugs and major security holes from the world, but we need a place to do things that may include attaching things that are sensitive to our paying customers. Often the big juicy bugs are first identified by the FreeNAS community, so the FreeNAS ticket tracks the bug to eventual completion. It just didn't for this case because of #4 above.

3. If you are having this particular problem, it is in your best interest to upgrade to the latest STABLE build. If you are still experiencing this issue afterward, iXsystems will almost certainly want to know.

Likewise in the future if you are having panics, it's a good idea to put in a bug ticket, especially if several people can confirm the same cause. That's almost certainly a sign that a bug exists and needs to be squashed ASAP.

Right now we've applied the fix for the TrueNAS customer that had the problem, and they've not had the problem since. Likewise it sounds like nobody that uses FreeNAS has had the problem if they have upgraded to the latest build of FreeNAS. So I think the patch intended to fix the issue has been successful and we can claim victory over the panics. (yay!)

Remember, nothing I say here in the forums or IRC is "official statements from iXsystems" so ymmv.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Thanks for the info. I had created the other thread as a preliminary "ping the community" before creating a new ticket. Just to do a sanity check that I'm not doing something wrong. Sadly, due to the nature of the bug, it wasn't immediately obvious if the problem was me, my rig, or FreeNAS (FreeBSD). I assumed that it wasn't the latter, as I couldn't find anybody else with the same problem. Until I found this thread!

I was having a hard time believing that FreeBSD had iSCSI problems across the board, but apparently there are only certain conditions that tickle the bug.
 

blade5502

Cadet
Joined
May 2, 2012
Messages
9
Are there any updates regarding this bug?

I've had similar issues with the same panic for about half a year while making backups from my ESXi server over iSCSI. I'm running the latest 9.3 stable build.

panic.txt says:
ctl_check_for_blockage: Invalid serialization value 1667590243 for 1 => 14

msgbuf.txt says:
Fatal trap 12: page fault while in kernel mode

I know the page fault error in msgbuf.txt is normally an indicator of bad memory, but I don't think that's the case here as I'm using ECC RAM.
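One quick check on the number in the panic.txt line: a serialization value is presumably meant to be a small index, so 1667590243 looks like stray data rather than a valid entry. A minimal Python sketch follows; the helper name `as_ascii` is made up for illustration, and the little-endian byte order is an assumption based on x86 hardware. It shows that the four bytes of the value are all printable ASCII. That is only a hint that the field held unrelated data when the check ran; on its own it doesn't say whether the cause is the ctl bug or bad RAM.

```python
# Value copied from the panic.txt line quoted above.
value = 1667590243


def as_ascii(n: int, byteorder: str = "little") -> str:
    """Return the 4 bytes of a 32-bit value as text ('.' for non-printables).

    Hypothetical helper for illustration; little-endian is assumed (x86).
    """
    raw = n.to_bytes(4, byteorder)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


print(hex(value))       # 0x63656863
print(as_ascii(value))  # chec
```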

Should I file a bug @redmine?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You're running the latest build? Can you post the build number? I sincerely doubt you are using the latest build, hence I'm asking. :D
 

blade5502

Cadet
Joined
May 2, 2012
Messages
9
I'm running FreeNAS-9.3-STABLE-201602031011, but I'm trying to upgrade to latest 9.10 at this moment to see if this changes anything.
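For anyone comparing builds, here is a rough Python sketch that splits a build string like the one above into its parts. The layout `FreeNAS-<version>-<train>-<YYYYMMDDHHMM>` is my assumption from how this one string looks, not a documented format, and `parse_build` is just a name picked for the example.

```python
import re
from datetime import datetime

# Assumed layout: FreeNAS-<version>-<train>-<YYYYMMDDHHMM>
BUILD_RE = re.compile(
    r"^FreeNAS-(?P<version>[\d.]+)-(?P<train>[A-Z]+)-(?P<stamp>\d{12})$"
)


def parse_build(build: str):
    """Split a build string into (version, train, build datetime)."""
    m = BUILD_RE.match(build.strip())
    if m is None:
        raise ValueError(f"unrecognised build string: {build!r}")
    built = datetime.strptime(m.group("stamp"), "%Y%m%d%H%M")
    return m.group("version"), m.group("train"), built


version, train, built = parse_build("FreeNAS-9.3-STABLE-201602031011")
print(version, train, built.isoformat())  # 9.3 STABLE 2016-02-03T10:11:00
```

If the assumption about the timestamp holds, two builds from the same train can be ordered by comparing the returned datetimes.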

Some weeks ago I happened to notice that zdb -m complains about "space map refcount mismatch: expected 268 != actual 266" in the last line of output, but scrub says that the pool is fine. I've googled around about it and some say this could lead to file corruption, others say it's harmless. Maybe this could be the underlying issue here?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The mismatch is expected if the zpool is mounted. It's normal and should be ignored. Now if the zpool wasn't mounted (and you'd be doing a lot more than just 'zdb -m') and it said that then there would be a problem.

Edit: Just to touch on your issue, you can put a bug ticket in, but if the evidence suggests your RAM is bad, you should probably prepare for that answer. In the meantime you should look at the IPMI logs and see if RAM errors are being detected. It would suck to find out that it really *is* bad RAM despite having ECC RAM. I know we had one case of that earlier this year and the person lost their zpool. :(
 

blade5502

Cadet
Joined
May 2, 2012
Messages
9
IPMI doesn't show any memory errors so all should be fine.

After updating to the latest 9.10 RELEASE version I made some new backups from my ESXi host. No crashes so far, so it seems that the issue is fixed within FreeNAS 9.10 and/or FreeBSD 10.3.
 