FreeNAS continuously rebooting after importing pool


jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
Hi,

I've got a problem whereby FreeNAS appears to continually reboot, shortly after importing a degraded ZFS volume.

I have an HP MicroServer N40L with four 3TB Seagate drives. Everything was working well for a couple of years; I've been keeping it up to date and was running 9.2.1.5. Then I had a drive fail (ada2). I updated my backup with zfs send | zfs recv (which took a few hours and did a lot of I/O) and powered down FreeNAS (though I forgot to detach the failed drive in the WebUI).
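(For reference, the backup is nothing fancy - roughly the following, though the snapshot names and the "backup" destination pool here are just placeholders for my actual setup:)
Code:
# placeholder snapshot names and backup pool -- my real names differ
zfs snapshot -r tank@backup-latest
zfs send -R -i tank@backup-prev tank@backup-latest | zfs recv -Fdu backup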

I replaced the drive with a 4TB drive. Unfortunately, shortly after booting up, the WebUI and shell lock up and the system reboots. It did this in a loop, and seemed to reboot shortly after mounting the pool - I get an email telling me the pool is degraded right before it reboots.

I've tried everything to make it stop rebooting:
1) putting the bad drive back in
2) booting the system with nothing in the ada2 bay
3) running swapoff -a as soon as it boots up
4) starting fresh by re-imaging my USB thumb drive with the latest version

Finally, I re-imaged the USB drive with 9.2.1.6 and it now boots cleanly. However, as soon as I import the ZFS pool (using the web UI or the command line) it reboots shortly afterwards... There's nothing interesting in /var/log/messages. If I don't import the pool it stays up.

Note that I can perform I/O on the raw devices without issue, e.g.:
dd if=/dev/ada3 of=/dev/null
and I did manage to back up the important filesystems before taking it down for the disk replacement - so I *think* the hardware is fine.

As a result, it feels like there's a kernel panic(?) or something similar causing the reboot shortly after the degraded pool is mounted. What's the best way to debug this? Should I expect /var/log/messages to persist across reboots?

Many thanks for any help.

Cheers,
James
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Have you tried a new USB stick, just to be safe? If that doesn't resolve it, I would think something with the pool is causing the issue.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Is your pool RAIDZ1 or RAIDZ2? I will say it sounds like you are running RAIDZ1 and actually have two bad disks: you replaced one disk, and the other failing disk has corrupted data that is causing the system to crash, because the zpool's filesystem structure is irreparably damaged now that you have no more redundancy.

Can you post the SMART data on all of your disks?
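Something like this should collect it all in one go - assuming your four disks are ada0 through ada3 (adjust the device names to match your system):
Code:
# dump the full SMART output for each disk into files you can paste
smartctl -a /dev/ada0 > /tmp/smart_ada0.txt
smartctl -a /dev/ada1 > /tmp/smart_ada1.txt
smartctl -a /dev/ada2 > /tmp/smart_ada2.txt
smartctl -a /dev/ada3 > /tmp/smart_ada3.txt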
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
I've tried a fresh image on a new USB stick. Same issue of immediate reboot soon after import.

I'm using RAIDZ1. It's possible there's bad data, I guess.

All the disks show as OK with no errors logged on smartctl.

smartctl -a /dev/ada0
SMART overall-health self-assessment test result: PASSED
...

I'm just surprised there are no logs and no graceful failure of the pool... I was under the impression it should fault or refuse to import in some way, and that there should be some log of what's failing. Immediately after the import I could ls around the mount point, etc.

Any other ideas on how to proceed?
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
Note I haven't yet managed to do a zpool replace. So it's still running (or trying to run) as degraded RAIDZ-1 with 3 disks.

I have parity memory, and had frequent cron'd scrubs with no errors ever reported. The machine worked fine (without my trying to reboot it) with a single failed disk for a week while I waited for a replacement. It's only since rebooting that importing the pool seems flaky...

Should I be able to catch the crash somehow? Is there some way to get it to dump to the logs?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't care that it says PASSED. That doesn't mean what you think it means. Like I said, please post the entire SMART output. Pastebin is preferred for this kind of thing because it will keep the text formatting.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok.. here's my analysis:

1. ada0 is too warm. 40C should be the maximum temp and it's at 42C.
2. ada1 is also too warm at 42C. It has 4 high-fly writes, but that's not exactly a major problem.
3. ada2 is also too warm at 41C.
4. ada3 is also too warm at 42C.

None of them look bad.

But none of them have any kind of SMART testing schedule, so you are a bad boy.. spank yourself.
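While you sort out a proper schedule in the GUI, you can at least kick tests off by hand - something along these lines (device name is just an example; repeat for each disk, and note a long test takes several hours):
Code:
smartctl -t short /dev/ada0     # quick check, a couple of minutes
smartctl -t long /dev/ada0      # full surface scan, several hours
smartctl -l selftest /dev/ada0  # view the results once a test has finished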

And none of them are actually being kept at a good temperature, so spank yourself again (no you aren't allowed to like it).

And lastly, read my link about how bad RAIDZ1 is as you might be one of those poor souls that is about to lose their data because they chose RAIDZ1.

So what you need to do now is install FreeNAS 9.2.1.6 to a new USB stick (or a spare hard drive). Then after it's installed and booted up go to the command line and try this command:

zpool import -R /mnt <poolname>

If that command crashes your system you have... problems. Report back and we'll see what to do next. :P
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
How much RAM do you have in your server?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Out of curiosity, did you compare the drive serial number from the ada2 error message to the drive you actually pulled? This would rule out that you pulled a good drive by accident.
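A quick way to check without pulling anything else: read the serial number off each disk still in the box and compare against the label on the one you removed (device names are just examples):
Code:
# the "Serial Number" line is what you want to compare
smartctl -i /dev/ada0 | grep -i serial
smartctl -i /dev/ada1 | grep -i serial
smartctl -i /dev/ada3 | grep -i serial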
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
Thanks all.

> But none of them have any kind of SMART testing schedule, so you are a bad boy...

Hrm. I'm pretty sure I did have this configured in the FreeNAS UI - and in fact, I see:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error        00%            44075  -
# 2  Short offline     Completed without error        00%            44003  -
# 3  Short offline     Completed without error        00%            43931  -
# 4  Short offline     Completed without error        00%            43859  -
...

So periodic short tests were configured at some point.

> zpool import -R /mnt <poolname>

I've re-imaged the stick and run zpool import -R /mnt tank (as I have done a few times before). So far it hasn't rebooted yet - progress, maybe! But it was also working for a little while before crashing on previous attempts...

I don't think I've pulled the wrong drive as:

> Out of curiosity, did you compare the drive serial
Code:
[root@freenas] ~# zpool status -v
  pool: tank
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
    the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 4h29m with 0 errors on Sat Jun  7 20:29:14 2014
config:



    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0     0
      raidz1-0                                      DEGRADED     0     0     0
        gptid/3145db7c-981b-11e1-b7cd-e839352ca6e7  ONLINE       0     0     0
        gptid/31c827f4-981b-11e1-b7cd-e839352ca6e7  ONLINE       0     0     0
        8942053745572058482                         UNAVAIL      0     0     0  was /dev/gptid/324d8ef9-981b-11e1-b7cd-e839352ca6e7
        gptid/d1aded6e-7f3e-11e2-875b-e839352ca6e7  ONLINE       0     0     0
errors: No known data errors


Also I can ls -lR /mnt/tank and I see file metadata flying by.

The system has 8GB of memory, and 6GB of swap.

Mem stats look like:
Code:
last pid:  3403;  load averages:  0.00,  0.16,  0.16                                                          up 0+00:09:58  08:58:54
30 processes:  1 running, 29 sleeping
CPU:  0.0% user,  0.0% nice,  0.2% system,  0.4% interrupt, 99.4% idle
Mem: 155M Active, 55M Inact, 586M Wired, 3232K Cache, 64M Buf, 6987M Free
ARC: 151M Total, 470K MFU, 58M MRU, 16K Anon, 4081K Header, 89M Other
Swap: 6144M Total, 6144M Free


It's now been up for a few minutes post import so I'm going to attempt a device replace...
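For the replace I'll probably use the GUI's volume status page, since that takes care of the GPT partitioning and gptid labels; the raw CLI form would be roughly this (the new device name is an assumption on my part):
Code:
# replace the missing member (referenced by the GUID from zpool status)
# with the new 4TB disk; "ada2" is a placeholder for the new device
zpool replace tank 8942053745572058482 ada2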
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
Spoke too soon. As soon as I press View Volumes in the FreeNAS web UI, the system locks up and reboots :(
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Maybe you should run MemTest86 on your machine; you could also have a RAM issue, a power supply issue, etc. It should run fine with 8GB of RAM.

So, did you say you performed a clean install and tried to import the pool? Did you restore your configuration file by chance? I'm asking because you shouldn't if you are troubleshooting this issue.

Is the data in your pool valuable and something you need to get back? If not, you might consider wiping out your pool and starting from scratch.
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
No, I didn't restore the configuration, so the pool doesn't mount on boot-up, which gives me a window for accessing it. It would be a real pain to recover the pool from backup.

Is there any way to find out why the machine is crashing? Shouldn't there be something logged somewhere?

I'll try a memtest too.
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
No issues found by memtest.

I can import the pool fine. When I press View Volumes in the UI it seems to kernel panic.

I've attached a monitor, and the log scrolls past faster than I can read - and the final bit doesn't look particularly interesting:
https://plus.google.com/10134076571...6038199470433431122&oid=101340765714109938245

I've done some googling but haven't yet found a guide for debugging kernel panics on the FreeNAS build.
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
I've disabled rebooting on crash:
Code:
ddb scripts
ddb script kdb.enter.default="watchdog 38 ; textdump set ; capture on ; show allpcpu ; mt ; ps ; alltrace ; textdump dump;"

(I removed the reset command from the end of the default ddb script.)

I've attached a photo of the backtrace:
https://plus.google.com/10134076571...6038209916106553314&oid=101340765714109938245

Looks like it's panicking in dsl_deadlist_open().

Any thoughts on debugging further?
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
Looking at the backtrace, it seems to be listing snapshots. The pool appears to work if I don't do that. I'm updating the last bits of my backup with rsync, which is working ok.
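(The rsync side is nothing special - roughly the following, with the destination host and paths being placeholders for my setup:)
Code:
# top up the backup over ssh; source/destination paths are placeholders
rsync -avh --progress /mnt/tank/important/ backuphost:/mnt/backup/important/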

Given I've never seen a disk error on this pool, I'd be willing to bet that corruption after a single disk failure is a software bug rather than a hardware issue...

It's also a shame that FreeNAS makes it so hard to figure out what's going wrong without a fair bit of hacking. Until now my NAS was headless. Having nothing useful in any logs on a kernel panic means that users with 'random reboots' have no idea what's going on, and even once they discover it's panicking, the error is fairly hard to trap.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I could easily turn that around too...

It's a shame that the server admin went with RAIDZ1 despite all the warnings. (Keep in mind I guessed RAIDZ1 because what you described is exactly what we see when people make the less-than-ideal choice of going with RAIDZ1. It is also why I have the link in my signature trying to get people not to do RAIDZ1.)


Yes, your disks may not have seen an error, but past errors are not an indicator of present or future performance. ZFS also can't fix silent corruption, which is possible here, and you have no way of determining whether there ever was any silent corruption, by the very definition of the corruption... it's silent.

If you've enabled the syslog dataset in 9.2.1.6 you may find the logs in .syslog useful. But considering that your pool isn't mounting, that may or may not even work. I don't think FreeNAS is going to use the pool for its syslogs since you haven't mounted the pool from the WebGUI.

Also, FreeBSD doesn't require "a fair bit of hacking" to find out what is wrong. You just aren't familiar with how to properly debug FreeBSD/FreeNAS. ;)

At this point I'd try a RAM test, as joeschmuck says. Unless the RAM test finds a problem, my guess is that your pool is corrupted, which is what's causing the crashes. This is fairly common with RAIDZ1, which is why we try to push people to RAIDZ2.
 

jblackburn

Dabbler
Joined
Jul 1, 2012
Messages
13
I've done a memtest. No issues found with the RAM.

> This is fairly common in RAIDZ1 which is why we try to push people to RAIDZ2.

It's great pushing people away from RAIDZ1, but then RAIDZ3 would also be better than RAIDZ2... By the same logic you'd have to advise people against a simple mirror, as that's only one disk's worth of redundancy. This machine only supports 4 drives, so I didn't want to lose half the space to parity, and I also maintain a separate backup using zfs send | recv. Irrespective of a second disk failing (which I see no evidence of here), there's a fair bit of anecdotal evidence of ZFS bugs losing people's data.

> Also, FreeBSD doesn't require "a fair bit of hacking" to find out what is wrong. You just aren't familiar with how to properly debug FreeBSD/FreeNAS. ;)

Ok. Where do I find instructions for diagnosing or even finding the root cause of this issue? My comment above shows I had to tweak ddb scripts, which I haven't found documented anywhere on the FreeNAS website or forums. Of course, I could have missed it...

> ZFS also can't fix silent corruption which is possible and you have no way of determining if there ever was any silent corruption by the very definition of the corruption... its silent.

True. And similarly there's no way to tell whether I've hit some edge-case bug in the ZFS code which might now be irreparable, as ZFS doesn't come with an fsck (and zdb appears to be fairly poorly documented).

It's not the end of the world, as I have a backup. However, we've been considering ZFS at work, and this sort of failure mode - which requires a full rebuild of the pool - is not something we've had to deal with in other filesystem / volume manager configurations.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I don't think you ran the MemTest long enough - did you run it for 24 hours? It may not find anything wrong, but it needs to run for a good while in order to weed out any intermittent problems.

If you have a backup of your data, then once you have ruled out RAM as a possible issue, I'd just blow away the pool on each individual drive and start over, but this time create a RAIDZ2. I say this only because you have already spent a lot of time on this problem and it's just quicker to do the recovery.
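The GUI volume manager is the supported way to build the new pool (it takes care of partitioning and the gptid labels), but for reference the plain CLI equivalent would be roughly this - and only after the backup has been verified, since it destroys everything (device names are examples):
Code:
# irreversible: wipes the existing pool
zpool destroy tank
# recreate with double parity across the four disks
zpool create tank raidz2 ada0 ada1 ada2 ada3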
 