Boot pool status is DEGRADED

TAC

Contributor
Joined
Feb 16, 2014
Messages
152
The boot pool on my TrueNAS CORE system is a mirrored pair of 40GB Intel SSD 320 series drives. I recently got the following alert:
CRITICAL
Boot pool status is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
See log below.

After a power down and messing with the SSDs I get the alert:
CRITICAL
Boot pool status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


A few questions:
  1. How do I know which mirrored drive is bad?
  2. Can I pull one of the drives (assuming the system still boots), check whether it's ada0 and, if so, put in the new one and resilver? I'll probably replace the other one afterwards.
  3. Any suggestions on what SSD I should use as a replacement? On Newegg I see an Intel 40GB for $90 and a SanDisk 64GB for $30. I'll probably avoid the cheap Dogfish and Biwin drives. lol
Shown below is some of my log file and a few screenshots.

The log file before reboot showed:
...
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Error 5, Retries exhausted
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 38 82 a8 40 04 00 00 00 00 00
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 38 82 a8 40 04 00 00 00 00 00
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Error 5, Retries exhausted
...
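
For anyone sifting through a longer log, here is a minimal Python sketch that tallies the "CAM status" messages per device. The function name and regular expression are my own assumptions based only on the line format shown above; other kernel message formats may need a different pattern.

```python
import re
from collections import Counter

# Matches FreeBSD CAM kernel messages shaped like the excerpt above, e.g.
#   Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
# The pattern is an assumption derived from those lines only.
CAM_LINE = re.compile(r"\((?P<dev>[a-z]+\d+):(?P<bus>[a-z]+\d+):[\d:]+\): (?P<msg>.*)$")
PREFIX = "CAM status: "

def tally_cam_status(log_text):
    """Count 'CAM status' messages per (device, status) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_LINE.search(line)
        if match and match.group("msg").startswith(PREFIX):
            status = match.group("msg")[len(PREFIX):]
            counts[(match.group("dev"), status)] += 1
    return counts

# The log lines quoted above, copied verbatim.
sample = """\
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Error 5, Retries exhausted
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 38 82 a8 40 04 00 00 00 00 00
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 38 82 a8 40 04 00 00 00 00 00
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Feb 13 08:13:13 freenas (ada0:ahcich0:0:0:0): Error 5, Retries exhausted
"""

for (device, status), count in tally_cam_status(sample).items():
    print(device, status, count)
```

A tally like this only shows which device the kernel is complaining about and how often; it does not say whether the drive, the cable, or the controller port is the cause.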



[Screenshot: 1707838475011.png]

[Screenshot: 1707838495308.png]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
1. You can grab the physical serial number from smartctl -a /dev/ada0 and compare that way.
2. I'd definitively confirm which drive is having issues before you pull anything. You're likely to have some challenges booting as well since it looks like the other boot device is on the SCSI/SAS bus (or possibly USB) so you may need to reconfigure your boot device in UEFI/BIOS.
3. There should be a number of used-pull small Intel DC S3500/S3520 drives (80/120G) floating around eBay for less than those two Newegg prices - that would be my suggestion.

As always, make a backup of your system configuration first before making changes, just in case you pull the wrong drive or the second one decides this is when to pack it in as well!
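
As a rough illustration of point 1, the sketch below pulls the serial number out of smartctl -a text so it can be compared with the label on the physical drive. The helper names and the sample text are hypothetical, and the serial in the sample is a made-up placeholder; read_serial assumes smartctl is on the PATH and is run with enough privileges.

```python
import re
import subprocess

def parse_serial(smartctl_text):
    """Return the serial number found in `smartctl -a` text, or None if absent."""
    # Case-insensitive so both "Serial Number:" and "Serial number:" are accepted.
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_text,
                      re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None

def read_serial(device):
    """Run `smartctl -a` on a device path such as /dev/ada0 and return its serial."""
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True)
    return parse_serial(result.stdout)

# Abbreviated, made-up sample text (not real output from any drive in this thread).
sample = "Device Model:     EXAMPLE MODEL\nSerial Number:    EXAMPLE0001\n"
print(parse_serial(sample))
```

On the live system you would call read_serial("/dev/ada0") and then do the same for the other boot device.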
 

TAC

smartctl -a /dev/ada0 reports the serial number as BTPR234301AN040G3, which is the drive listed in Storage / Disks (see screenshot above).

Do you agree that the screenshot above shows ada0p2 and da7p2 as the mirrored SSDs, and that ada0 is having checksum issues?

Going to the bottom of my list of drives, I see the other drive:
[Screenshot: 1707841407149.png]


Is there a place in the GUI that shows the status of the mirrored boot drives? I'm running TrueNAS CORE v12.0-U8.1.
 

TAC

@HoneyBadger, regarding your comment:
"2. I'd definitively confirm which drive is having issues before you pull anything. You're likely to have some challenges booting as well since it looks like the other boot device is on the SCSI/SAS bus (or possibly USB) so you may need to reconfigure your boot device in UEFI/BIOS."
Is this something I should adjust? Per my previous comments, one of my mirrored drives is ada0 and the other is da7. Does this indicate they are on different interfaces?
 

HoneyBadger

TAC said: "Is there a place in the GUI that shows the status of the mirrored boot drives? I'm running TrueNAS CORE v12.0-U8.1."
System -> Boot -> Actions -> Boot Pool Status

TAC said: "Is this something I should adjust? Per my previous comments, one of my mirrored drives is ada0 and the other is da7. Does this indicate they are on different interfaces?"

Yes. ada suggests it's on a SATA controller and da indicates it's on SCSI (probably your M1015). Unless your M1015 was flashed with a boot ROM when you did the IT mode flash (or it still has the factory boot ROM), removing the SATA drive would likely make your system unbootable. Adding the third SSD to the system and using the Replace option in the UI will hopefully prevent this, as it should copy the boot partition over, but I can't validate that against 12.0-U8.1 specifically right now.
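
To make the naming rule concrete, here is a small Python sketch that turns a boot pool member name into a hint about how the disk is attached. The table covers only the two prefixes discussed in this thread, the function name is hypothetical, and the result is a hint rather than proof of where a disk is plugged in.

```python
import re

# Prefix meanings as described in this thread; anything else is reported as unknown.
DRIVER_HINTS = {
    "ada": "ATA/SATA controller",
    "da": "SCSI/SAS HBA (or possibly USB)",
}

def attachment_hint(device):
    """Map a name like 'ada0p2' or 'da7' to a hint about how the disk is attached."""
    # Driver prefix, unit number, and an optional partition suffix such as p2.
    match = re.fullmatch(r"([a-z]+)(\d+)(?:p\d+)?", device)
    if not match:
        return "unknown"
    return DRIVER_HINTS.get(match.group(1), "unknown")

for member in ("ada0p2", "da7p2"):
    print(member, "->", attachment_hint(member))
```

If the two boot pool members report different hints, as they do here, it is worth checking whether the firmware can actually boot from both controllers.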
 

TAC

@HoneyBadger Thanks a lot for your help!

Next time I'm in the box I'll see if I have a port available on the SATA controller. If so, I'll add a new SSD and do a 'replace' on the bad ada0 SSD. I think I remember reflashing the M1015 way back when I installed it, so I might be good. When I first set up this box I was booting off two mirrored 16GB USB jump drives and eventually swapped those out for these (more reliable) Intel SSDs.

Worst case I should be able to get rid of the mirror and boot off the single remaining good SSD, right?

I'd probably be asking for trouble if I took an HD from one of the two vdevs in my pool that's on the SATA controller and moved it to the M1015 (and resilvered), then added a new SSD to the freed SATA port and resilvered the mirrored boot drive. :rolleyes:


From this screenshot it doesn't look like ada0 has completely failed, just checksum issues.
[Screenshot: 1707854263163.png]
 

HoneyBadger

Most of the walkthroughs for crossflashing cards into their LSI-equivalent HBA leave the boot ROM portion as optional, so it tends to get skipped. I can't say for certain about yours, but if you don't see a "Press CTRL-C to enter BIOS" or similar prompt from your LSI HBA, you don't have a boot ROM on there and wouldn't be able to boot from the SSD attached to it.

However, you could probably swap the two drives so that the one with bad checksums is on the HBA and the good one is on the SATA controller. Then boot from SATA, and you could even hot-swap the failed unit out.
 

TAC

So with my existing configuration I do not really have a mirrored boot drive, in that if the drive on the SATA port fails, my system will not boot off the drive connected to the M1015?
 

HoneyBadger

It's mirrored data, but it may not be a true "highly available boot pool". You can see the lengths that some of our community members have gone to in order to create a "truly redundant boot solution" in @jgreco's resource:

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Unfortunately, you can't have nice things unless you're Oxide Computer. They boot directly to an Illumos kernel (no UEFI) with ZFS, NVMe, PCIe, and whatever else they need to get up and running (<32 MB, to fit on the EEPROM traditionally used for the system firmware) and load the rest from NVMe.

Though now that I think about it, if Windows can have ZFS, why can't we have a UEFI driver? Apart from the fact that nobody's done so yet, of course.
Actually, I can answer part of that: firmware images are huge on modern systems, to the point that it's becoming a problem. ZFS would further increase the size of the images.
 

TAC

@HoneyBadger Just an update. I got 3 used Intel SSD boot drives off eBay and was going to swap out one, and maybe both, of my mirrored boot drives, but here is what I found:
  1. I didn't have any more errors on the boot drive until I opened the case and must have slightly moved one of the SATA cables. Then I saw a bunch more errors!
  2. I pulled both cables off my M1015 controller, hit them with contact cleaner and haven't seen ANY complaints from the system since. My log file is really boring!
  3. One of the mirrored drives is on my M1015 card and the other is off a SATA port on the Supermicro MoBo and the system boots just fine.
The problem must have been with the drive cabling from the M1015 and not the drive. I do think I remember having to order an additional cable for the M1015 when I initially set it up.

Maybe it's an issue with my beer consumption, but I find that when I set something up a year ago and it's worked perfectly ever since, I have trouble remembering all the details. lol

My next project will be updating TrueNAS 12.0 to 13 and seeing how much dicking around I'll have to do to get all my jails running. :rolleyes:

Thanks again for all your input.
 