KDB Panic and Restart Loop After One Minute

sduignan

Cadet
Joined
Nov 15, 2021
Messages
3
Apologies if this issue has already been resolved here. I can see a few similar posts, but no resolution that seems to fit my particular case.

I have a FreeNAS system that's been operational for about five years. It's currently running FreeNAS-11.3-U5. I have two pools: the first one (VOLUME_2) was created when the system was first set up, and the second one (VOLUME_3) is only about a year old.

I am having an issue with the first pool (VOLUME_2). For the last ten days, the system has been restarting in a regular loop whenever the VOLUME_2 drives are connected. It only stays up for about three minutes, including two minutes of boot time, so I only get about one minute of access to the system. If I disconnect the drives in the problem pool, the system is stable.

The VOLUME_2 pool contains the iocage dataset for my plugins as well as another dataset (Media). It is a RAIDZ1 of five 4TB WD Red drives. The drives in this pool are connected to the SATA ports on the motherboard.

The only thing that I changed on the system around the time that the issue started was the addition of two new disks and the creation of a temporary pool to make a local copy of the VOLUME_3 pool. The issue first started about a week after I had added these two disks.

I have a backup of most of the data in the Media dataset, but I spent a couple of months reorganising and renaming most of the files and I really don't want to have to dig out and reorganise these files again. All I want is to get the Media dataset up and running again for long enough to sync it to the backup.

I connected a monitor to try to get some idea of what the system was doing during the failure. The error flashes up pretty quickly, so I had to video it to be able to read it, but I think that it says something like:

“fffffffff 8x8800), file: /freenas-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c, line: 281
cpuid = 4
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0456776798”
(see attached screenshots)

Things that I have tried:
BIOS Update
Memtest86+ (no errors)
Checked that all fans and connectors are working
Disabled all periodic snapshot and scrub tasks for the VOLUME_2 pool

System Specs:
OS: FreeNAS-11.3-U5
Motherboard: ASUS P8Z68-V LX
CPU: Intel i7 2700K 3.4GHz
RAM: 4 × 4 GB DDR3 1333MHz Non-ECC
HDDs (VOLUME_2): 5 × 4TB WD Reds in RAIDZ1
HDDs (VOLUME_3): 2 × 6TB WD Reds in MIRROR
PSU: Corsair VS 550 (550W)


I know very little about FreeNAS and FreeBSD; I haven’t had to do much with this system over the years. I was wondering if disconnecting the pool from the system and importing it into a fresh install could help me get access to the Media dataset.

Given that I only have about one minute to work with when the pool is connected, I don’t want to do something that will take more than a minute to undo (if that makes sense).

I’d appreciate any advice or suggestions.

 

Attachments

  • Screenshot_20211115-233022.png
  • Screenshot 2021-11-16 143752.jpg

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
Not sure this will help, but I'd try the following, with the goal of isolating OS corruption or an HD/cable fault. Good luck.
1. Does it reboot if ONLY the Volume_2 disks are connected? What about if one disk in Volume_2 is disconnected?
2. You could try a fresh install to a fresh boot disk (your info does not seem to show the boot disk)?
3. What about a fresh install of 12, again on a different boot disk?
 

sduignan

Thanks, I'll try those suggestions as soon as I can.

It still reboots if only the Volume_2 disks are connected, but I haven't tried disconnecting individual disks in that pool. I will give that a go this evening.

I am waiting on a pair of thumb drives to be delivered so I can create a fresh boot device. My current boot drive is an OCZ Agility SSD.

I have tried upgrading the current system to 12, but that didn't help, so I rolled back to 11.3. Once I get the new drives I will try a fresh install of TrueNAS 12.
 

sduignan

I followed jlpellet's advice and tried disconnecting individual drives in the Volume_2 pool.

When all but one of the drives (ada9p2) were connected, the pool still showed as unavailable and the system was stable. When ada9p2 was connected and one of the other drives was disconnected, the system would boot, recognise that the pool was in a degraded state, and then reboot on a cycle as before.

Disk Removed    Volume_2 Accessible    System Stable
ada9p2          YES (DEGRADED)         NO
ada6p2          NO                     YES
ada5p2          NO                     YES
ada7p2          NO                     YES
ada8p2          NO                     YES

The weird thing is that once I had cycled through disconnecting each of the drives and then connected them all up again, the system recognised the Volume_2 pool and stayed stable. So I now have full access to my pool again, and I have been backing it up since.

When I checked the pool status after reconnecting all the drives, it said that a resilver had finished.

Can anyone help shed some light on this sequence of events? Given that the pool is a RAIDZ1, I would have expected it to read as degraded when any one of the drives was disconnected.
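
To spell out that expectation, here is a minimal Python sketch of how many missing disks a single RAIDZ vdev should tolerate. The function name `expected_raidz_state` is made up for illustration, and the model is deliberately simplified: it only counts missing disks and ignores everything else that can affect pool health.

```python
def expected_raidz_state(total_disks: int, missing_disks: int, parity: int = 1) -> str:
    """Expected state of a pool made of one RAIDZ vdev.

    Hypothetical helper for illustration only. RAIDZ1 has one disk's
    worth of parity, so it should survive exactly one missing disk.
    """
    if not 0 <= missing_disks <= total_disks:
        raise ValueError("missing_disks must be between 0 and total_disks")
    if missing_disks == 0:
        return "ONLINE"
    if missing_disks <= parity:
        return "DEGRADED"   # still readable, but no redundancy left
    return "UNAVAIL"        # not enough disks to reconstruct the data


# Five-disk RAIDZ1, as in the Volume_2 pool.
for missing in range(3):
    print(missing, "missing ->", expected_raidz_state(5, missing))
```

By this simple model, pulling any single one of the five drives should give DEGRADED, which is why the "unavailable" results above look odd to me.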
 

Attachments

  • Screenshot 2021-11-17 201317.jpg
  • Screenshot 2021-11-17 202115.jpg

jlpellet

My thought is that this was a cable problem: a cable that is defective but workable in certain physical configurations. As a result, your plugging and unplugging of drives both reseated the cables AND left them in a slightly altered physical configuration. It's not clear if you disconnected data, power, or both, but I think it likely this was a physical issue corrected or mitigated by disconnecting and reconnecting the drives. Good luck.
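
One rough way to look for evidence of a cable fault is the SMART attribute `UDMA_CRC_Error_Count` (ID 199), which many SATA drives report and which tends to climb when the data link is flaky. The Python sketch below only parses the attribute table text that `smartctl -A /dev/adaN` prints. The function name and the sample text are invented for illustration and are not taken from this system, and not every drive exposes this attribute.

```python
from typing import Optional


def udma_crc_errors(smartctl_output: str) -> Optional[int]:
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None.

    Hypothetical helper: assumes the usual ten-column attribute table
    with the raw value in the last column.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Invented sample in the shape of a smartctl attribute table (not real data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40210
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       17
"""

print(udma_crc_errors(SAMPLE))
```

If the raw count on one drive is nonzero and keeps rising between checks, that would point at the cable or port for that drive rather than at the disk itself.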
 