Boot Loop After Power Outage

victorbrca · Jun 30, 2020

Hello everyone. I'm not very familiar with ZFS, but I have run a FreeNAS box at home for the past 10 years (on different hardware).

I've had a couple of power outages this past week, and after one of them my FreeNAS box is stuck in a boot loop. After plugging in a monitor and a keyboard, I saw that it was hitting a panic: Solaris(panic): blkptr at 0x.. has invalid CHECKSUM when trying to import one of my pools:

I have:

1- Booted into single-user mode and tried to import the pool - Had the same error message (Solaris(panic))
2- Booted into single-user mode and tested the import (with -f -F -n) - Also had the same error message (Solaris(panic))
3- Booted into single-user mode and ran zdb -hh <POOL> to try and get a list of TXGs so I could rollback with zdb -AAA -L -t <TXG> -bcdmu <POOL>, however, I get a lot of unrecognized records

Any help would be greatly appreciated.

I have seen requests on another similar thread for the output of zdb -U /data/zfs/zpool.cache -eCC yourpoolname, so here it is (apologies for photos and not text):

============================================
Edit (additional note):

I just ran memtest on the two modules and it looks like both are bad.

Samuel Tai · Jun 30, 2020

Do you have a pool named Backup? If so, this appears to be the pool that's causing the panics. Your pool Volume1 seems to import OK.

victorbrca · Jun 30, 2020

Samuel Tai said:
Do you have a pool named Backup? If so, this appears to be the pool that's causing the panics. Your pool Volume1 seems to import OK.

I do have 2 pools (Backup and Volume1). I'm able to import Backup in single-user mode with no issues. Volume1 generates a panic when I try to import it.

Samuel Tai · Jun 30, 2020

Interesting. In your first screenshot, Volume1 goes through a normal LOADING, LOADED, UNLOADING cycle fine. Backup yields a panic after LOADING.

Samuel Tai · Jun 30, 2020

Unfortunately, it appears that the memory damaged by the power surge had its scrambled contents written to a Volume1 block pointer. You may need to image off the members of the pool and run Klennet to see if you can recover the contents of the pool.

victorbrca · Jun 30, 2020

Maybe that pool had issues as well? I took the first photo when I first discovered the issue (Sat 27th), but I didn't have a spare keyboard to troubleshoot. Today I was able to get a keyboard and spent some time looking into the problem, and then took the other screenshots.

Could all of this be related to the bad memory modules? Should I buy new modules as the first troubleshooting step?

Samuel Tai · Jun 30, 2020

Yes, definitely replace the memory first.

victorbrca · Jun 30, 2020

Samuel Tai said:
Yes, definitely replace the memory first.

Thanks, I will do that. Would you have any link that I could start reading on imaging the pool members? Is that just dd'ing and troubleshooting from the image?

Samuel Tai · Jun 30, 2020

Klennet runs from Windows on dd images.

victorbrca · Sep 29, 2020

I've been putting off tackling this issue due to my inexperience with ZFS, but I need to get the files back and my FreeNAS fixed. I was able to get the full zpool history and I don't see any issues, but I'm guessing that would be normal. Would I be able to roll back to one of these txgs? I see a similar issue here, where a txg rollback was suggested, but there was no confirmation or resolution.

Full file attached

Code:

2020-06-01.08:00:21  zpool scrub Volume1
  history command: ' zpool scrub Volume1'
  history who: 0
  history time: 1590998421
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'errors=0'
  internal_name: 'scan done'
  history txg: 27030016691
  history time: 1591009062
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'pool version 5000; software version 5000/5; uts  11.3-RELEASE-p6 1103000 amd64'
  internal_name: 'open'
  history txg: 27030033565
  history time: 1591316710
  history hostname: ''
unrecognized record:
  history internal str: 'pool version 5000; software version 5000/5; uts  11.3-RELEASE-p6 1103000 amd64'
  internal_name: 'import'
  history txg: 27030033567
  history time: 1591316710
  history hostname: ''
2020-06-05.00:25:16  zpool import 11546197154207157221 Volume1
  history command: ' zpool import 11546197154207157221 Volume1'
  history who: 0
  history time: 1591316716
  history hostname: ''
2020-06-05.00:25:16  zpool set cachefile=/data/zfs/zpool.cache Volume1
  history command: ' zpool set cachefile=/data/zfs/zpool.cache Volume1'
  history who: 0
  history time: 1591316716
  history hostname: ''
unrecognized record:
  history internal str: 'func=1 mintxg=0 maxtxg=27030211166'
  internal_name: 'scan setup'
  history txg: 27030211166
  history time: 1592208000
  history hostname: 'freenas.local'
2020-06-15.08:00:41  zpool scrub Volume1
  history command: ' zpool scrub Volume1'
  history who: 0
  history time: 1592208041
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'errors=0'
  internal_name: 'scan done'
  history txg: 27030213230
  history time: 1592219246
  history hostname: 'freenas.local'
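
To line up the txgs in a history dump like the one above with wall-clock times, a small parser can help. This is a minimal Python sketch and not something from the original post: `parse_history` and `txg_points` are hypothetical helper names, and it assumes the record layout shown in the listing (an unindented header line followed by indented `key: value` fields) and that `history time` is a Unix epoch in seconds.

```python
import re
from datetime import datetime, timezone

# Indented "key: value" field lines, e.g. "  history txg: 27030213230"
FIELD = re.compile(r"^\s+([A-Za-z_ ]+?):\s*(.*)$")

# Two records copied from the listing above, used as sample input.
SAMPLE = """\
2020-06-15.08:00:41  zpool scrub Volume1
  history command: ' zpool scrub Volume1'
  history who: 0
  history time: 1592208041
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'errors=0'
  internal_name: 'scan done'
  history txg: 27030213230
  history time: 1592219246
  history hostname: 'freenas.local'
"""


def parse_history(text):
    """Split the dump into records; each record is a dict of its fields."""
    records, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line (a command or "unrecognized record:") starts a record.
            current = {}
            records.append(current)
            continue
        match = FIELD.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2).strip("'")
    return records


def txg_points(records):
    """Return (txg, internal_name, UTC time) for every record that carries a txg."""
    points = []
    for rec in records:
        if "history txg" in rec and "history time" in rec:
            when = datetime.fromtimestamp(int(rec["history time"]), tz=timezone.utc)
            points.append((int(rec["history txg"]), rec.get("internal_name", ""), when))
    return points
```

Calling `txg_points(parse_history(SAMPLE))` returns a single entry for the 'scan done' record, txg 27030213230 at 2020-06-15 11:07:26 UTC. The list only shows which txgs the history mentions; whether any of them can still be rolled back to is a separate question this sketch does not answer.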

I also spent some time looking at my backup (2x drives set up as a mirror) and I was able to confirm that my rsync scripts were running for the important data (so if needed I can restore from that). However, would that backup be viable, seeing how the data might already have been corrupted before rsync copied it?

victorbrca · Oct 1, 2020

A little update on my issue. I booted into single user mode again and enabled vfs.zfs.recover and vfs.zfs.debug, which allowed me to test the import with a rollback (zpool import -fFn Volume1). The output indicated that only 20 secs of data would be lost, so I ran the import (with -fF) which went through ok.

I manually inspected the last modified files and they were fine. So I rebooted the box and then ran a scrub, which finished without issues and did not have to repair anything:

Code:

Thu Oct  1 15:54:42 EDT 2020
  scan: scrub repaired 0 in 0 days 03:03:55 with 0 errors on Thu Oct  1 15:54:41 2020
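
If the scrub result is going to be checked from a script rather than by eye, the `scan:` line can be parsed. This is a rough Python sketch, assuming the line has the shape shown above; `scrub_clean` is a hypothetical name, and accepting `0B` as well as `0` for the repaired amount is an assumption about other ZFS versions, not something shown in this thread.

```python
import re

# Matches e.g. "scrub repaired 0 in 0 days 03:03:55 with 0 errors"
SCAN = re.compile(r"scrub repaired (\S+) in (.+?) with (\d+) errors")


def scrub_clean(status_text):
    """Return True if the scrub repaired nothing and found no errors,
    False otherwise, or None if no finished-scrub line is present."""
    match = SCAN.search(status_text)
    if match is None:
        return None
    repaired, _duration, errors = match.groups()
    return repaired in ("0", "0B") and int(errors) == 0
```

A `None` result just means the text did not contain a completed scrub line (for example, a scrub still in progress), so it should be treated as "unknown" rather than as a failure.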

Should I run any other checks to confirm that everything is OK? Any other recommendations?

Thanks,
Victor.
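
One further check, given that an rsync backup of the important data exists, is to compare the recovered pool against that backup file by file. The following is a minimal Python sketch under stated assumptions: `sha256_of` and `compare_trees` are hypothetical helper names, and both trees are assumed to be mounted and readable as ordinary directories.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(pool_root, backup_root):
    """Return (differing, missing): relative paths whose contents differ
    between the two trees, and paths present in the pool but not the backup."""
    pool_root, backup_root = Path(pool_root), Path(backup_root)
    differing, missing = [], []
    for src in sorted(p for p in pool_root.rglob("*") if p.is_file()):
        rel = src.relative_to(pool_root)
        dst = backup_root / rel
        if not dst.is_file():
            missing.append(str(rel))
        elif sha256_of(src) != sha256_of(dst):
            differing.append(str(rel))
    return differing, missing
```

A mismatch does not say which copy is the good one, and files modified after the last rsync run will differ legitimately, so the output is a list of files to inspect by hand rather than a verdict.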
