Boot Loop After Power Outage

victorbrca

Dabbler
Joined
Oct 1, 2013
Messages
19
Hello everyone. I'm a seasonal user (not very familiar with ZFS), but I have had a FreeNAS box at home for the past 10 years (different boxes).

I've had a couple of power outages this past week, and after one of them my FreeNAS box got stuck in a boot loop. After plugging in a monitor and a keyboard, I saw that it was hitting a panic, Solaris(panic): blkptr at 0x.. has invalid CHECKSUM, when trying to import one of my pools:

IMG_20200627_192203.jpg



I have:

1- Booted into single-user mode and tried to import the pool - Had the same error message (Solaris(panic))
2- Booted into single-user mode and tested the import (with -f -F -n) - Also had the same error message (Solaris(panic))
3- Booted into single-user mode and ran zdb -hh <POOL> to try to get a list of TXGs so I could roll back with zdb -AAA -L -t <TXG> -bcdmu <POOL>; however, I get a lot of unrecognized records

IMG_20200630_132509.jpg



Any help would be greatly appreciated.

I have seen requests on another similar thread for the output of zdb -U /data/zfs/zpool.cache -eCC yourpoolname, so here it is (apologies for photos and not text):

IMG_20200630_132925.jpg


IMG_20200630_132941.jpg


============================================
Edit (additional note):

I just ran memtest on the two modules and it looks like both are bad.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Do you have a pool named Backup? If so, this appears to be the pool that's causing the panics. Your pool Volume1 seems to import OK.
 

victorbrca

Samuel Tai said: Do you have a pool named Backup? If so, this appears to be the pool that's causing the panics. Your pool Volume1 seems to import OK.

I do have 2 pools (Backup and Volume1). I'm able to import Backup in single-user mode with no issues. Volume1 generates a panic when I try to import it.
 

Samuel Tai

Interesting. In your first screenshot, Volume1 goes through a normal LOADING, LOADED, UNLOADING cycle fine. Backup yields a panic after LOADING.
 

Samuel Tai

Unfortunately, it appears that scrambled contents from the memory damaged in the power surge were written to the Volume1 block pointer. You may need to image the member disks of the pool and run Klennet on the images to see if you can recover the pool's contents.
 

victorbrca

Maybe that pool had issues as well? I took the first photo when I first discovered the issue (Sat 27th), but I didn't have a spare keyboard to troubleshoot. Today I was able to get a keyboard and spent some time looking into the problem, and then took the other screenshots.

Could all of this be related to the bad memory modules? Should I buy new modules as the first troubleshooting step?
 

Samuel Tai

Yes, definitely replace the memory first.
 

Samuel Tai

Klennet runs from Windows on dd images.
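
For anyone following along, imaging one pool member with dd might look roughly like the sketch below. It is a minimal example under stated assumptions, not a tested procedure for this box: `image_disk` is a hypothetical helper name, the device and target paths in the usage comment are placeholders, and the target needs at least as much free space as the source disk.

```shell
# image_disk SRC DST [BLOCK_SIZE]
# Copies SRC to DST block by block.
# conv=noerror,sync keeps going past read errors and pads any short or
# failed block with zeros, so later data stays at the correct offset.
# Side effect: if SRC is not a multiple of BLOCK_SIZE, the image ends
# with a little zero padding.
image_disk() {
    src=$1
    dst=$2
    bs=${3:-1048576}   # 1 MiB, written numerically so BSD and GNU dd both accept it
    dd if="$src" of="$dst" bs="$bs" conv=noerror,sync
}

# Hypothetical usage (device and target paths are placeholders):
#   image_disk /dev/ada0 /mnt/scratch/volume1-disk0.img
```

The idea is to run it once per member disk while the pool is not imported, then work only on the image files so the original disks stay untouched.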
 

victorbrca

I've been putting off tackling this issue because of my inexperience with ZFS, but I need to get the files back and my FreeNAS fixed. I was able to get the full zpool history and I don't see any issues, but I'm guessing that would be normal. Would I be able to roll back to one of these txgs? I see a similar issue here, where a txg rollback was suggested, but there was no confirmation or resolution.

Full file attached
Code:
2020-06-01.08:00:21  zpool scrub Volume1
  history command: ' zpool scrub Volume1'
  history who: 0
  history time: 1590998421
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'errors=0'
  internal_name: 'scan done'
  history txg: 27030016691
  history time: 1591009062
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'pool version 5000; software version 5000/5; uts  11.3-RELEASE-p6 1103000 amd64'
  internal_name: 'open'
  history txg: 27030033565
  history time: 1591316710
  history hostname: ''
unrecognized record:
  history internal str: 'pool version 5000; software version 5000/5; uts  11.3-RELEASE-p6 1103000 amd64'
  internal_name: 'import'
  history txg: 27030033567
  history time: 1591316710
  history hostname: ''
2020-06-05.00:25:16  zpool import 11546197154207157221 Volume1
  history command: ' zpool import 11546197154207157221 Volume1'
  history who: 0
  history time: 1591316716
  history hostname: ''
2020-06-05.00:25:16  zpool set cachefile=/data/zfs/zpool.cache Volume1
  history command: ' zpool set cachefile=/data/zfs/zpool.cache Volume1'
  history who: 0
  history time: 1591316716
  history hostname: ''
unrecognized record:
  history internal str: 'func=1 mintxg=0 maxtxg=27030211166'
  internal_name: 'scan setup'
  history txg: 27030211166
  history time: 1592208000
  history hostname: 'freenas.local'
2020-06-15.08:00:41  zpool scrub Volume1
  history command: ' zpool scrub Volume1'
  history who: 0
  history time: 1592208041
  history hostname: 'freenas.local'
unrecognized record:
  history internal str: 'errors=0'
  internal_name: 'scan done'
  history txg: 27030213230
  history time: 1592219246
  history hostname: 'freenas.local'
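
To make a history dump like the one above easier to scan, the internal records can be boiled down to one line per txg. This is a minimal sketch, assuming the zdb -hh output was saved to a text file in the format shown above; `list_txgs` is a hypothetical helper name, not part of ZFS. Whether any listed txg is still reachable for a rewind depends on which uberblocks survive on disk, so treat the list as orientation only.

```shell
# list_txgs FILE
# Prints "<txg> <epoch-time> <internal_name>" for every internal record.
# Plain command records carry no "history txg:" line and are skipped.
list_txgs() {
    awk -v q="'" '
        /internal_name:/ {
            # Keep only the text between the single quotes.
            name = $0
            i = index(name, q)
            name = substr(name, i + 1)
            j = index(name, q)
            if (j) name = substr(name, 1, j - 1)
        }
        /history txg:/  { txg = $3 }
        /history time:/ {
            if (txg != "") {
                print txg, $3, name
                txg = ""
                name = ""
            }
        }
    ' "$1"
}

# Hypothetical usage:
#   list_txgs zdb.txt
```

For the excerpt above, this prints one line each for the scan done, open, import and scan setup records.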


I also spent some time looking at my backup (two drives set up as a mirror) and I was able to confirm that my rsync scripts were running for the important data (so if needed I can restore from that). However, would that backup be viable, given that the data might already have been corrupted before rsync copied it?
 

Attachments

  • zdb.ecc.txt (4 KB)
  • zdb.txt (117.6 KB)

victorbrca

A little update on my issue. I booted into single-user mode again and enabled vfs.zfs.recover and vfs.zfs.debug, which allowed me to test the import with a rollback (zpool import -fFn Volume1). The output indicated that only 20 seconds of data would be lost, so I ran the import (with -fF), which went through OK.
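
For reference, the sequence described above could be written out roughly as in the sketch below. It assumes the two tunables can be set with sysctl at runtime on this release (an assumption; they may need to be set from the loader prompt instead). The `run`, `preview_rewind` and `apply_rewind` names are hypothetical helpers, and nothing is executed unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# preview_rewind POOL
# Enables the recovery tunables, then asks what a rewind import would
# discard. With -n, zpool import -F only reports and changes nothing.
preview_rewind() {
    run sysctl vfs.zfs.recover=1   # assumption: writable at runtime
    run sysctl vfs.zfs.debug=1     # assumption: writable at runtime
    run zpool import -fFn "$1"
}

# apply_rewind POOL
# Performs the rewind import. Only do this after reviewing the preview.
apply_rewind() {
    run zpool import -fF "$1"
}

# Hypothetical usage:
#   preview_rewind Volume1
#   DRY_RUN=0 preview_rewind Volume1
#   DRY_RUN=0 apply_rewind Volume1
```

Keeping the preview and the real import as separate steps leaves room to check how much data the rewind would discard before committing to it.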

I manually inspected the last modified files and they were fine. So I rebooted the box and then ran a scrub, which finished without issues and did not have to repair anything:

Code:
Thu Oct  1 15:54:42 EDT 2020
  scan: scrub repaired 0 in 0 days 03:03:55 with 0 errors on Thu Oct  1 15:54:41 2020
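
If the same check needs to be repeated after future scrubs, matching the scan line is one simple option. This is a minimal sketch, assuming the scan line keeps the format shown above (newer releases may print a unit after the repaired count, which the pattern allows for); `scrub_is_clean` is a hypothetical helper name.

```shell
# scrub_is_clean
# Reads zpool status text on stdin and succeeds only if the scan line
# reports a finished scrub with nothing repaired and zero errors.
scrub_is_clean() {
    grep -Eq 'scan: scrub repaired 0[A-Za-z]* in .* with 0 errors'
}

# Hypothetical usage:
#   zpool status Volume1 | scrub_is_clean && echo "scrub clean"
```

This only looks at the scrub summary; the per-device read, write and checksum counters in the full zpool status output are still worth reading by eye.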


Should I run any other checks to confirm that everything is OK? Any other recommendations?

Thanks,
Victor.
 