FreeNAS installation (11.3-U2) destroyed after power-outage

airflow

Contributor
Joined
May 29, 2014
Messages
111
Hello!

Yesterday my girlfriend tried to recover a seemingly lost piece of bread from our toaster in the kitchen by poking at it with a metal fork. This produced a short-circuit which led to a power-outage in my flat. After power was restored, I found that my FreeNAS installation couldn't boot any more.

The installation boots from two USB sticks as boot devices and has 6 WD HDDs as the main storage pool (Z2) and an NVMe SSD for the jails. The mainboard is an ASRock Rack E3C226D2I with 16GB ECC RAM.

The failing boot-process looks like this: https://www.youtube.com/watch?v=z6c-eteYED4

I tried to boot the installation multiple times with varying USB boot-stick configurations. I removed either of the two USB sticks and booted from each of them separately to rule out a hardware problem there. Booting started from either of them, but it kept crashing. The problem persisted and always looked the same.

After a while I desperately tried an older boot-entry, 11.3-U1. This worked and the system returned to normal! The USB sticks are fine and scrubbing works.

Then I wanted to update this old, working system to 11.3-U2 again. This worked and created a boot-entry 11.3-U2.1. But when booting into this new system, it first ran some update scripts (with database conversion and so on) and then rebooted, after which it again failed to boot and crashed.

So what is this? I then tried to remove all 11.3-U2 boot-partitions on freenas-boot and redo the updates. But this fails, too, as removing the boot-partition "11.3-U2.1" silently does nothing and shows no error message.

I'm at a loss now - could you please help me find and fix the problem here?

Thanks,
airflow
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
No idea how to fix it, but if you can boot into a previous version, why not save the config, and then reinstall FreeNAS on 11.3-U2 from scratch after wiping a USB drive? Seems to me like less effort than trying to figure out how to recover the damage done by the power outage.
 

DarrylM

Dabbler
Joined
Apr 14, 2020
Messages
13
I agree with @mgittelman - boot from prior version and save a config file - this will save your data and your sanity by not wasting time debugging (as I did).

Once you've done that, download a new ISO and get new (not reformatted) boot and install devices (USB keys if that's what you've been running). Create one new bootable USB stick using Win32DiskImager or balenaEtcher and get a second one onto which FreeNAS will be installed. Plug them into the server and give it a go. If that still doesn't work, it's time to look at the BIOS.

The reason I know this is that I experienced a very similar scenario with my freshly installed 11-2 UI (not from a kitchen short but from an electrical storm and a failed UPS). I had to reinstall my BIOS, which appeared to have been whacked. Even after updating the BIOS, I had multiple kernel panics and a number of false starts where the boot errored out in BTX (the BSD bootloader). Maddening when a trouble-free O/S like FreeNAS suddenly gives you all kinds of trouble. I wound up having to change the SATA settings from IDE to AHCI and suddenly the system booted. In theory, this shouldn't have mattered, but it did. I have also been advised to change boot from BIOS to UEFI for my hardware (I haven't tried this yet).

What I am seeing now is that 11-3 takes over 20 minutes to boot. Everything is fine once it's up, but every single message you'd expect to see during boot goes by SO SLOWLY and takes forever. I was (until today) running only 8 GB of RAM and was told that could be part of it. I installed another 8 GB this afternoon and have not seen any improvement. So I will be interested in knowing about your system's performance once you've managed a successful upgrade. FYI - my next step will be attempting to boot from an SSD in the hope that it boots faster.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
Thanks guys for your suggestions.

It looks better now, but I still cannot really say I have completely recovered. Because of all these strange phenomena I decided to go the "new install & load config" route. Still, this didn't work flawlessly and it took several tries.

In one of my tries, the system booted normally, but immediately after booting I got a message on the console which said
ZFS WARNING: Duplicated ZAP entry detected (nos-tun)
after which the system panicked again.

Then in another try it behaved normally, was stable and I could use it. I had already closed the case and gone to sleep. The next morning, I found that the system had rebooted unexpectedly hours later during the night, but it came back up and has been stable since.

My guess is that it must be one of two scenarios: either some part of the hardware is slightly defective and behaving unexpectedly, which causes the OS to crash, or my saved config has some issue which I activate by loading it into the system.

For the moment, I'll leave the system as it is (it has now been stable for over 12 hours under normal use) and I hope the problem has solved itself. :smile:

@DarrylM - your input is appreciated and I will probably play around with those BIOS settings if the problems persist. I will also look for a hardware-diagnosis CD; perhaps that will help me find the culprit.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Did you try to scrub the ZFS pool yet? Once your group of disks is deemed healthy, it can be moved to another box and will be identified there if you need to replatform because of other issues (e.g. motherboard, RAM, CPU).
 

argumentum

Dabbler
Joined
Apr 28, 2020
Messages
17
Yesterday my girlfriend tried to recover a seemingly lost piece of bread from our toaster in the kitchen by poking at it with a metal fork. This produced a short-circuit which led to a power-outage in my flat
That should not have been a big deal. Something else must have happened right at that moment.

In any case, I would run FreeNAS on other hardware if the data has no backup elsewhere. (my 2 cents)
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
some part of the hardware is slightly defective and behaving unexpectedly,

Hi Airflow,

Unfortunately, I think you got it right here... Obviously, your server was not protected by a UPS... Had it been, it would have either stayed up or done a graceful shutdown.

So what kind of electrical protection was on that server at the fatal moment?

Here, my piece of crap Thanatos is my only FreeNAS server not on a UPS (it is meant to be and stay offline...) but still, it is protected by a professional power bar from APC.

If your server was not protected against electrical hazards, I would agree with @argumentum and recommend you try running your pool on another hardware platform.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
Thanks for all your input. Status so far: the system has not rebooted again since the last unexpected reboots. I did not change anything, I just ran a scrub as recommended by @marcevan, which finished fine and returned 0 errors.

I am happy the system seems to be stable now - even if it's unsatisfying to not know what the actual problem was. For me the case is closed so far. If the problem returns, I'll play with BIOS-settings and will do hardware diagnosis/swap.
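
For anyone who wants to check a scrub result from a script rather than by eye, here is a minimal Python sketch. It assumes the `scan:` line of `zpool status` looks like the sample string below; that sample is illustrative and not taken from this thread, and the helper name `scrub_errors` is made up for the example.

```python
import re

# Assumed layout of the "scan:" line printed by `zpool status` after a
# finished scrub. Adjust the pattern if your ZFS version words it differently.
SCAN_RE = re.compile(r"scan:\s+scrub repaired (\S+) in .+ with (\d+) errors")


def scrub_errors(status_text):
    """Return the error count of the last finished scrub, or None if no
    finished-scrub line is found in the given text."""
    match = SCAN_RE.search(status_text)
    return int(match.group(2)) if match else None


# Illustrative sample line (hypothetical values, not output from this system).
sample = "  scan: scrub repaired 0 in 0 days 03:41:07 with 0 errors on Sun May  3 03:41:09 2020"
print(scrub_errors(sample))
```

In practice the text would come from running `zpool status` on the pool in question; a return value of `None` only means that no finished-scrub line matched, not that the pool is unhealthy.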
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Did a quick check for the ZAP warning and it might be related to Deduplication.
 

airflow

Contributor
Joined
May 29, 2014
Messages
111
Just for your entertainment and perhaps even enlightenment, I'll let you know how the story with my problematic FreeNAS continued. Disclaimer: It's resolved now.

  • I wrote before that my system was stable at that point. That was true as long as it was not rebooted or shut down. But the next reboot always ended in multiple unexpected reloads during booting; after about 5 attempts it finally booted completely and *then* it was stable. At the next (intentional) reboot, the same behavior was seen - several unsuccessful boot attempts, then finally the system became stable and everything worked normally.
  • So even though it worked for everyday use (because I normally don't reboot my system, it runs 24/7), I knew something was fishy and I wanted to find and solve the problem.
  • As I suspected the hardware to be faulty, I moved all 6 hard drives to another computer and booted the system there with the same USB boot stick I used on the server. There was one component I couldn't move, however: the internal NVMe flash drive for the jails. In this configuration, the system booted normally and without hiccups.
  • As the jails were missing in that test, I had the idea that perhaps this NVMe drive was faulty and causing the issues. So I moved the 6 drives back and removed only the NVMe drive from the original server, after which there were no unexpected reloads any more. So it must be something with the NVMe drive, I thought.
  • But I was wondering why the SSD pool could always be mounted normally before the unexpected reloads happened at a later point in the boot process. So to double-check, I put the NVMe drive back in, but disabled auto-start for all jails beforehand. In this configuration, the system booted normally without problems. So it had to have something to do with the jails. When I started one of the jails manually after booting, the system sometimes crashed immediately, sometimes not, sometimes only after a few minutes... no consistent behavior here.
  • So it had to be some logical problem within the jails, I thought at that point. Luckily I have build scripts for each of my jails to easily recreate them at will (they create the jail, do updates, build ports, import configs etc. until everything is as before). So I deleted one of the jails and recreated it with the script - and a few seconds into updating the ports tree the whole system crashed again.
  • Can you imagine how pissed I was at that point? What could possibly be the problem here? I don't know why I did what I did then, but out of a mixture of desperation and a lack of better ideas I tried re-installing the operating system to a new USB stick and *not* re-importing the original configuration (because that hadn't worked before). Instead, I just imported the two pools and then configured everything in the FreeNAS GUI from scratch (most of the work and logic is in the jails anyway). This worked flawlessly, the system never crashed again, and all jails run smoothly in their original state just as before. I didn't even have to rebuild them, they all just worked.
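
One way to read the elimination steps above is as a small table of trials. The Python sketch below records which factors were present in each test and whether the system crashed, then looks for the pair of factors that never appears together in a stable run. The factor names and the encoding of each trial are an interpretation of this post for illustration, not output from any FreeNAS tool.

```python
from itertools import combinations

# Each trial: (factors present, did the system crash?).
# Factor names are hypothetical labels chosen for this sketch.
TRIALS = [
    # Starting point: original server, everything in place
    ({"original_hw", "nvme", "jails_running", "old_config"}, True),
    # Disks moved to another computer, NVMe (and thus jails) missing
    ({"old_config"}, False),
    # Back on the original server, NVMe removed
    ({"original_hw", "old_config"}, False),
    # NVMe back in, jail auto-start disabled
    ({"original_hw", "nvme", "old_config"}, False),
    # Jail started manually or recreated from the build script
    ({"original_hw", "nvme", "jails_running", "old_config"}, True),
    # Fresh install, pools imported, config rebuilt by hand
    ({"original_hw", "nvme", "jails_running"}, False),
]


def suspect_pairs(trials):
    """Return factor pairs present in every crash but in no stable run."""
    crashing = [factors for factors, crashed in trials if crashed]
    stable = [factors for factors, crashed in trials if not crashed]
    common = set.intersection(*crashing)
    return [
        pair
        for pair in combinations(sorted(common), 2)
        if not any(set(pair) <= factors for factors in stable)
    ]


print(suspect_pairs(TRIALS))
```

Under this encoding the only pair left is running jails together with the re-imported configuration, which matches the fix described above. It does not explain why that combination crashed the system.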
 