Panic crash (randomly and sometimes on import)

Status
Not open for further replies.

Waterb0y

Cadet
Joined
Jan 22, 2017
Messages
5
A few weeks ago my server started randomly rebooting. I thought it was a hardware issue so I replaced the power supply and then mobo / cpu / ram. It still crashes. It will sometimes boot, but will randomly crash at no set time interval (but seemingly during hd read/writes). If I disconnect all drives, it will not crash. I can't get it to complete a scrub without crashing.

I've attached pictures of the crashes. Any ideas or help is appreciated.

System:
FreeNAS 11.0 U4
Supermicro X10SL7-F (flashed IT mode)
8GB ECC RAM
ZPool: 7x2TB, 5x1TB

Crash on pool import.jpg

Crash while running.jpg
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
My advice...

Based on your statements you can get it to the point of running a scrub. If this is true then you should be able to access your data. Start backing up all your data if you haven't already.

Read my linked item below for Hard Drive Troubleshooting and examine the output of your SMART data for each drive. Hopefully you are running SMART Long/Extended Tests; if not, run the test on all your drives and make sure they are fine.
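As a rough sketch of the SMART step, the helper functions below only print the smartctl commands for a list of drives so they can be reviewed before anything runs. The function names and the drive names da0, da1 are placeholders, not taken from this thread; on FreeBSD-based systems `sysctl -n kern.disks` should list the disk names, though it can include devices that are not hard drives.

```shell
#!/bin/sh
# Print (not run) the command that starts a SMART long test on each drive.
# Drive names are passed as arguments; da0, da1 ... are placeholders.
smart_long_cmds() {
    for disk in "$@"; do
        echo "smartctl -t long /dev/$disk"
    done
}

# Print the commands that show the full SMART report once the tests finish.
smart_report_cmds() {
    for disk in "$@"; do
        echo "smartctl -a /dev/$disk"
    done
}

# Example (review the printed lines, then run them as root):
# smart_long_cmds da0 da1 da2
# smart_report_cmds da0 da1 da2
```

A long test can take many hours per drive, so the report commands are only useful after the tests have had time to complete.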

If all that testing goes well and you have backed up your data then I'd power down, disconnect one pool of drives (maybe the 7x2TB pool) and then power up. If that works then you know the issue is with the pool you disconnected. If it still fails then reconnect those 7 drives and disconnect the other pool and power back on. This time I would expect all to go well. If not then I'd suspect something else like your motherboard SATA IT Flash or even a SATA cable.

Also think back: just before the problems started, did you do anything like update FreeNAS, update your BIOS, or make any change at all? This change could have been a week earlier than the first time you noticed the issue.

Have you run the basic RAM & CPU stability tests for completeness?

If you do isolate it down to a specific pool, then I'd destroy the pool and recreate it. If that works without issue then I'd destroy the pool again and run badblocks on it to ensure the drives are actually good. If that passes then recreate the pool and continue forward.
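A minimal sketch of the badblocks step, assuming the pool has already been destroyed and the data is backed up: the write-mode test below erases everything on the drive. The function name, the drive name da3, and the 4096-byte block size are illustrative assumptions, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Print a DESTRUCTIVE badblocks write-mode test command for one drive.
# -w = write-mode test (overwrites all data), -s = show progress,
# -b 4096 = block size in bytes (an assumption, adjust for your drives).
badblocks_cmd() {
    disk="$1"
    if [ -z "$disk" ]; then
        echo "usage: badblocks_cmd <disk>" >&2
        return 1
    fi
    echo "badblocks -b 4096 -ws /dev/$disk"
}

# Example (double-check the drive name before running the printed line as root):
# badblocks_cmd da3
```

Printing the command first is a deliberate choice here, since a wrong drive name in a write-mode test destroys that drive's contents.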

You never stated if the pools are RAIDZ1, RAIDZ2, or Striped. For your problem this is important data to provide.
 

Waterb0y

Cadet
Joined
Jan 22, 2017
Messages
5
Thanks for the reply... this is definitely a tough one to troubleshoot and I appreciate the help. I know this post was slim on details; I was seeing if someone recognized the panic message.

All the details I can remember:

The drives are all in one pool, 2 vdevs. RAIDZ2. Drives are all WD green drives and a few Seagate drives. I had memtested my old system (at build and when I started troubleshooting this issue) and everything was fine, but I haven't tested the new system yet. Just got everything installed a few days ago and since the problems persisted, I figured that the hardware wasn't the problem. I'll stress test them after I figure this out or have to rebuild the pool.

If I disconnect all of the drives the server is stable. The odd part is that it will sometimes boot and import normally and then randomly crash usually within an hour. There is no real pattern to it that I can figure out.

I can force an import in read-only mode using zpool import -o readonly=on -f <poolname>, but now that I'm going over it all again in my head, I don't think I've left it in read-only mode for longer than a few minutes while troubleshooting, so I don't know if that crashes the system. I will try that today while trying to pull some media files and report back.
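For reference, the read-only import followed by a copy could look roughly like the sketch below. It only prints the commands so they can be checked first. The pool name tank and the paths are placeholders that do not come from this thread, and using rsync for the copy is an assumption; any copy tool would do.

```shell
#!/bin/sh
# Print the forced read-only import command for a given pool name.
readonly_import_cmd() {
    echo "zpool import -o readonly=on -f $1"
}

# Print an rsync command that copies one directory tree to another location.
# rsync is an assumption here, not something the thread specifies.
copy_cmd() {
    echo "rsync -a $1/ $2/"
}

# Example with placeholder names (review, then run the printed lines as root):
# readonly_import_cmd tank
# copy_cmd /mnt/tank/media /mnt/backup/media
```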

Thanks for the links, I'll read them, run the SMART tests this weekend and report back. I had them scheduled to run regularly, but reinstalled FreeNAS trying to fix this problem a few weeks ago when I started troubleshooting and haven't set them up yet. Only hard drive issue I'm having is pretty recent (about a week ago). I started getting a critical warning that a drive had uncorrectable and unreadable sectors, but I haven't had a chance to replace it yet. I'm not sure how a resilver would go at this point, and if I disconnect the failing drive I still have the same issues. Not sure if this is related to the error, or just a failing drive making my life difficult at the moment.

The data is mostly photo file backups and my media collection. It isn't anything that I'd be devastated at losing, but it is about 6TB of data and I'd really like to exhaust all possibilities before destroying the pool since I know the data is still there and I can access it for limited periods of time. I have backups of everything except some of the media files.

As for changes,

I thought this problem only started a few weeks ago but looking through my emails to get you a better timeline, I am seeing that it was actually rebooting as early as 2 months ago, possibly before, but only once a week. I was on 11.1 at the time and I guess that I thought that whatever it was would get patched out. Damn, thought it was just a random hardware issue this whole time but now I guess it could have been a bad upgrade as well. Would a bad upgrade cause the problem to get worse weeks later? This pool was in my older system at that time: i5, 8GB non-ECC RAM, LGA1155 EVGA motherboard (don't remember the exact model number), with a SATA expansion card. I tried a downgrade during all of this troubleshooting to v.11.0 U4 and the pool would actually boot on that version, but didn't put 2 and 2 together that the issue may have started with 11.1. It actually ran great for a week or so until the issues came back.

I thought I had a hardware issue so I replaced the power supply. Didn't work. I then thought it might be a SATA controller on the motherboard since the error only occurred when using the disks. I figured it was a good opportunity to replace the older hardware (and non-ECC RAM) and replaced the Mobo / processor (i3) / ram a few days ago. The issue has persisted. The only hardware I haven't replaced are any of the hard drives or cables.

Thanks again for the assistance, it is greatly appreciated.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Waterb0y said:
"I tried a downgrade during all of this troubleshooting to v.11.0 U4 and the pool would actually boot on that version, but didn't put 2 and 2 together that the issue may have started with 11.1. It actually ran great for a week or so until the issues came back."
I don't think I understand: did it work while running 11.0-U4, or did it work for 1 week and then fail on 11.0-U4? Of course, if it was working, then roll back and submit a bug report.

Just to ensure I'm clear, you have one single pool which consists of two vdevs?

I would back up the files you need now. It doesn't sound like a lot, since you have most of it backed up already.

Do not wait to run these basic tests. They can show if there is a stability problem; in your case it could be the power supply.

Your problem sounds a lot like a power supply issue. What is the make/model of your power supply? You should leave all your drives connected to ensure you are pulling the same power as when the system fails, then run MemTest86+ and ensure you run on all CPUs (an option during the test setup). Let this run overnight to see if any failures occur.

You have at least one failing hard drive, and this absolutely could cause you a lot of pain. Typically this wouldn't cause a catastrophic failure; however, if the fault is in the drive's electronics rather than platter wear or damage, this could be it. But you did say that you disconnected the suspect drive and it didn't fix it.

Where is the suspect failing drive, in which vdev does it reside?
 

Waterb0y

Cadet
Joined
Jan 22, 2017
Messages
5
Sorry for the late reply, I was gone all weekend....

Correct, 11.0-U4 worked for a week and then started rebooting again.

I have 1 pool consisting of 2 vdevs.

I'm currently rerunning smart tests on the drives.

My power supply is an EVGA 600 BQ.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I need to ask even though I feel like you previously answered it: did you replace the power supply? You have an odd problem, and the power supply would be my first suspect.

Let us know how the SMART tests turn out.
 

Waterb0y

Cadet
Joined
Jan 22, 2017
Messages
5
I did replace the power supply. I will update when I have the results of the smart tests.

Thanks for the help thus far, it is an odd problem indeed.
 

Waterb0y

Cadet
Joined
Jan 22, 2017
Messages
5
It looks like this problem might be fixed (I stress might). Short smart tests show all drives passed (except the failing drive which I disconnected). All of the drives except the disconnected problem child also passed the long smart tests I ran last night. The server has been up for almost 24 hours now which is longer than it has run in a single stretch for a month now. I'll consider this fixed if I get a few more days of stability.

I disconnected the failing drive from both the motherboard AND power supply. I had previously only disconnected it from the motherboard and I'm now kicking myself for that oversight. I scrubbed my data and all is well. Jails and shares working normally. Email notifications aren't working, but I haven't dug into any major troubleshooting of that issue yet so I'm sure I just missed something setting that up.
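For anyone repeating the scrub check, a small sketch is below. The helper reads `zpool status` output on standard input and succeeds only if it contains the line ZFS normally prints for a pool with no known data errors. The function name and the pool name tank are placeholders, not details from this thread.

```shell
#!/bin/sh
# Read `zpool status` output on stdin; succeed only if ZFS reports
# "errors: No known data errors", the usual summary line for a clean pool.
pool_is_clean() {
    grep -q "errors: No known data errors"
}

# Example with a placeholder pool name "tank" (run as root):
# zpool scrub tank          # start the scrub
# zpool status -v tank      # check progress and per-device error counts
# zpool status -v tank | pool_is_clean && echo "no known data errors"
```

This only checks the summary line; the per-device READ, WRITE and CKSUM counters in the full status output are still worth reading by eye.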

I'm guessing it was an electronics issue with that drive making the problem look like a bad power supply, but I have no way of proving that other than the one change that kept the server from crashing was disconnecting the drive's power cable.

I'll post an update if the situation changes. Thanks for the help.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Glad you are making some movement forward on this problem. You might consider plugging that suspect drive into a different computer to see if you can get the drive to operate, then try to run a SMART Long test on it and examine the results. Badblocks would actually be the test to do so you can really test all the surface media properly. But if there is something wrong and the SMART data can show it, RMA the drive if it's in warranty.
 