The boot volume state is ONLINE: One or more devices has experienced an unre... (mirrored boot USB)

Status
Not open for further replies.

asimov-solensan

Contributor
Joined
Oct 14, 2016
Messages
113
Hello,

So, today I got home and freenas was not working, at least CIFS shares, seems that jails were working correctly. Since I had no web access I just rebooted and received this message on boot:
Code:
The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


Read some threads and definitely the boot device is corrupted to some extent.

Now the thing is that non of the threads I read the user was working with a mirrored system while I do it (two 64GB USB stcks). I guess that in my case I should be able to just unplug the failing USB, then boot, then rebuild the mirror if I need it.

But I don't know how to check which one is failing, I just have to unplug one of them and try? Can I check exactly which one is failing? Run a fsck from another computer (or livecd or whatever) is recommended or is better to just replace the memory?

Sorry if questions are stupid I'm just getting used to freenas.

Thanks for the help
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have had two different USB boot disks fail on me at two different times and it is always a pain. The USB memory sticks, when they die, don't give any warning and there is no fixing it. You might be able to send it back for warranty replacement but fsck isn't going to make it work again.

The thing to do is power the system off, remove one of the USB sticks and try to boot on the one that is still in there. If it won't boot from that one, power it off and switch drives and try again. Hopefully one of them will let you boot back to where you can access the GUI and your shares will be available. Give that a try and let me know how it goes. I have more advice.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Did you make a backup of your system config?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
If your USBs have activity lights, you can dd from the one you don't want to replace to identify it.

The oft recomend cruiser fits have actuvity lights.
 

asimov-solensan

Contributor
Joined
Oct 14, 2016
Messages
113
Thanks all for your answers.

I don't have a configuration backup, I thought that the point of a mirrored boot device was to avoid manual backups. But anyhow my system boots although the warning showing.

Will do a backup as soon as possible.

@Stux: My memory sticks have got light indeed, but they are installed on an internal USB port and I can't see them easily, I must move the case to another place where cables reach and all and is a bit of a mess.

Now the problem has become more complex. I can boot from both sticks but then the error I got is a degraded pool (obviously).

Code:
The boot volume state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.


So, still can't know which USB is the bad one, and safest procedure will be replace both and load backup, which makes me wonder what's the point of the boot mirror. I'm doing something wrong or is this a design flaw?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Can you post the output of camcontrol devlist, glabel status, and zpool status in [ code ] tags, lets see if there's something there.
 

asimov-solensan

Contributor
Joined
Oct 14, 2016
Messages
113
Thanks for the fast answer I was still with the case opened and all tried this.

Now I think I have got what I needed, let's see:
camcontrol devlist
Code:
<WDC WD20EZRZ-00Z5HB0 80.00A80>  at scbus1 target 0 lun 0 (pass0,ada0)   
<WDC WD20EZRZ-00Z5HB0 80.00A80>  at scbus2 target 0 lun 0 (pass1,ada1)   
<WDC WD20EZRZ-00Z5HB0 80.00A80>  at scbus5 target 0 lun 0 (pass2,ada2)   
< USB DISK 3.0 PMAP>  at scbus7 target 0 lun 0 (pass3,da0)   
< USB DISK 3.0 PMAP>  at scbus8 target 0 lun 0 (pass4,da1) 


Ok, nothing new here.

glabel status:
Code:
  Name  Status  Components   
gptid/7d96d52c-f2f5-11e6-b707-0cc47acf86d3  N/A  ada0p2   
gptid/7c7a9375-f2f5-11e6-b707-0cc47acf86d3  N/A  ada1p2   
gptid/428c0d7e-709d-11e7-904f-0cc47acf86d3  N/A  ada2p2   
gptid/df2e7818-f271-11e6-9ec9-0cc47acf86d3  N/A  da0p1   
gptid/dfaa4b0c-f271-11e6-9ec9-0cc47acf86d3  N/A  da1p1 


Now we have got an identifier for the sticks, that's what I needed.

zpool status:
Code:
  pool: RAID2x3a   
 state: ONLINE   
  scan: scrub repaired 0 in 2h5m with 0 errors on Sun Jul 30 02:05:22 2017   
config:   
   
  NAME  STATE  READ WRITE CKS
UM   
  RAID2x3a  ONLINE  0  0   
 0   
  raidz1-0  ONLINE  0  0   
 0   
  gptid/7c7a9375-f2f5-11e6-b707-0cc47acf86d3  ONLINE  0  0   
 0   
  gptid/7d96d52c-f2f5-11e6-b707-0cc47acf86d3  ONLINE  0  0   
 0   
  gptid/428c0d7e-709d-11e7-904f-0cc47acf86d3  ONLINE  0  0   
 0   
   
errors: No known data errors   
   
  pool: freenas-boot   
 state: ONLINE   
status: One or more devices has experienced an unrecoverable error.  An   
  attempt was made to correct the error.  Applications are unaffected.   
action: Determine if the device needs to be replaced, and clear the errors   
  using 'zpool clear' or replace the device with 'zpool replace'.   
  see: http://illumos.org/msg/ZFS-8000-9P   
  scan: resilvered 287K in 0h0m with 0 errors on Sat Aug 19 10:05:13 2017   
config:   
   
  NAME  STATE  READ WRITE CKSUM   
  freenas-boot  ONLINE  0  0  0   
  mirror-0  ONLINE  0  0  0   
  da0p2  ONLINE  0  0  5   
  da1p2  ONLINE  0  0  0   
   
errors: No known data errors 


And now there is the thing, correct me if I'm wrong but da0p2 is the broken one. Now it's a matter of disconnecting USB and check zpool status again until the one online shows no errors. Is this correct?

Then replace the device at some point.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I would run dd if=/dev/da0p2 of=/dev/null

This Will light up the faulty drive.

Of course, it actually looks like the drive just disconnected for a little bit. It might've just had an 'event'.

I'd make sure you have boot scrubs running (say every 5 days). Clear the errors zpool status clear freenas-boot and keep an eye on it. If nothing happens perhaps it's fine. If something does then junk the stick.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Thanks all for your answers.

I don't have a configuration backup, I thought that the point of a mirrored boot device was to avoid manual backups. But anyhow my system boots although the warning showing.

Will do a backup as soon as possible.

So, still can't know which USB is the bad one, and safest procedure will be replace both and load backup, which makes me wonder what's the point of the boot mirror. I'm doing something wrong or is this a design flaw?
Having drive redundancy is never a replacement for having a backup. Any level of drive redundancy is to reduce the likelihood of needing the backup.
The question of where the problem is, is why @Stux asked for the reports.
Code:
da0p2 ONLINE 0 0 5

This line above, how it ends in 5, that is where the error is. If you are able to access the Web GUI, go ahead and make a backup of the config DB. Here is the part of the manual that talks about it. The button is called "Upload Config" because it is going to save a copy of the system config database where you tell it to save the file. This can be used to recover from a fresh install to new boot media by simply loading the config DB and it will set everything back the way you had it when the DB was saved. It is a good idea to make a backup every time you change any settings. I have a script that automatically makes a backup for me once a week.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
"Save config" downloads the current config to your client system.

"Upload config" allows you to restore a previously saved config.

Many people run a script to periodically email the configs. The config is then a 56KB (tgz) backup of your NAS configuration.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
"Save config" downloads the current config to your client system.

"Upload config" allows you to restore a previously saved config.

Many people run a script to periodically email the configs. The config is then a 56KB (tgz) backup of your NAS configuration.
Yes, that is a much clearer way to say it.

Here is the link to the scripts I use: FreeNAS scripts
 

asimov-solensan

Contributor
Joined
Oct 14, 2016
Messages
113
Yes I have been able to find how to make a backup by myself, and thanks god becaus I really messed up things.

Let me explain the situation. I was happily running freenas with the remaining USB and ordered a replacement for the failing one.

But because I'm massively stupid I formatted the broken stick and tried to rebuild the mirror, just to confirm that the memory was indeed unusable. The process ended in failure as expected and I though that there wasn't any problem, just wait to get the new stick and replace the failing drive.

But then I shut down the server to replace the USB and I'm unable to boot since then. With grub complaining about not finding device:
Code:
grub: no such device: f2e6101ef6feab68


Then enters rescue mode. There I can see four disks hd1 to 3 are my sata disks (I can ls the root) and hd0 is the USB. And I cannot run a ls over this one, it shows two partitions gpt1 and gpt2 but both show just unknown filesystem.

I guess that the OS is still in this memory and it's just that got messed somehow. Is there a way to fix this?

Thanks
 

asimov-solensan

Contributor
Joined
Oct 14, 2016
Messages
113
Hi, all. Well good new is that I got my freenas installation running again. I did exactly what this thread explains:

https://forums.freenas.org/index.ph...er-attempted-replace-of-boot-usb-drive.29610/

Now I'm running freenas as usual and right now is resilvering the new USB to get the boot mirror working again. As soon as it finishes I think that we can consider this whole problem fixed.

Thanks all. I learned a few things and the script to send a backup to mail is a great idea I will configure it in brief.

Freenas is an amazing project and has a great community beneath it. I'm really enjoying it.
 
Joined
May 9, 2017
Messages
11
I'm resurrecting this discussion, as we experienced the exact same failure mode as described by the OP:
  • Our Supermicro freenas appliance was unresponsive on GUI and ssh access, also unresponsive on SMB and AFS services. Weirdly, the VM's were running normally, and were accessible through their own interfaces. Main IP (same as GUI) responded to pings normally.
  • Upon power-cycling the appliance (through the Supermicro IPMI service), we encountered the same error pointing at a boot drive checksum failure. We run a mirrored boot pool, similar as the OP, but with SATA DOM SSD's instead of sticks. While physical failure of, anything really, can't be ruled out, my "gut feeling" tells me it's less likely on our config than on the case above, and there's a non-zero chance this is software-related.
  • Scrubbing the boot volume yielded no further errors, and logs didn't show any other unusual events (temperature spikes, power availability issues, etc). The appliance resumed working normally upon reboot.
Essential configuration:
- Supermicro chassis (4U, 36x bays)
- Supermicro MB X10DRI-T4+ with dual Xeon E5-2620v4, 256GB DDR4 ECC RAM in a 16x16 configuration.
- Dual SuperDOM (Supermicro SATA DOM) SSD-DM064-PHI, mirrored as ZFS boot pool
- I'm not sure to which degree any other equipment could be relevant for the case at hand, will provide happily any other details if the case.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
You should file a ticket.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 
Joined
May 9, 2017
Messages
11
You should file a ticket.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

Nope, I was utterly wrong (along with my "gut"). One of the SATA DOM's unceremoniously died this morning, so it's just as much related to hardware failure as the OP's case.
 
Joined
May 9, 2017
Messages
11
Nope, I was utterly wrong (along with my "gut"). One of the SATA DOM's unceremoniously died this morning, so it's just as much related to hardware failure as the OP's case.

Actually no, I was wrong again. The drive was indeed disconnected by FreeNAS, but it did not fail. Upon power cycling the server (a couple of weeks later, it's a live machine which we can't just reboot randomly for debugging purposes), the drive popped up like nothing happened. We added it to the boot mirror, resilvered, scrubbed, all good. Now, checking the logs, we found as follows:

Code:
Jul  6 04:45:31 jpfserver ahcich8: Timeout on slot 5 port 0
Jul  6 04:45:31 jpfserver ahcich8: is 00000000 cs 00000080 ss 000000e0 rs 000000e0 tfd c0 serr 00000000 cmd 0004c717
Jul  6 04:45:31 jpfserver (ada0:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 bd 94 da 40 04 00 00 01 00 00
Jul  6 04:45:31 jpfserver (ada0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:45:31 jpfserver (ada0:ahcich8:0:0:0): Retrying command
Jul  6 04:46:03 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:46:33 jpfserver ahcich8: Timeout on slot 8 port 0
Jul  6 04:46:33 jpfserver ahcich8: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 80 serr 00000000 cmd 0004c817
Jul  6 04:46:33 jpfserver (aprobe0:ahcich8:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul  6 04:46:33 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:46:33 jpfserver (aprobe0:ahcich8:0:0:0): Retrying command
Jul  6 04:47:04 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:47:34 jpfserver ahcich8: Timeout on slot 9 port 0
Jul  6 04:47:34 jpfserver ahcich8: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0004c917
Jul  6 04:47:34 jpfserver (aprobe0:ahcich8:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul  6 04:47:34 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:47:34 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted
Jul  6 04:48:06 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:48:36 jpfserver ahcich8: Timeout on slot 10 port 0
Jul  6 04:48:36 jpfserver ahcich8: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000 cmd 0004ca17
Jul  6 04:48:36 jpfserver (aprobe0:ahcich8:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul  6 04:48:36 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:48:36 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retry was blocked
Jul  6 04:48:36 jpfserver ada0 at ahcich8 bus 0 scbus8 target 0 lun 0
Jul  6 04:48:36 jpfserver ada0: <SATA SSD S9FM02.1> s/n B45007611E2400144564 detached
Jul  6 04:49:08 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:49:38 jpfserver ahcich8: Timeout on slot 11 port 0
Jul  6 04:49:38 jpfserver ahcich8: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd 80 serr 00000000 cmd 0004cb17
Jul  6 04:49:38 jpfserver (aprobe0:ahcich8:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul  6 04:49:38 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:49:38 jpfserver (aprobe0:ahcich8:0:0:0): Retrying command
Jul  6 04:50:09 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:50:39 jpfserver ahcich8: Timeout on slot 12 port 0
Jul  6 04:50:39 jpfserver ahcich8: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 80 serr 00000000 cmd 0004cc17
Jul  6 04:50:39 jpfserver (aprobe0:ahcich8:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul  6 04:50:39 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:50:39 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted
Jul  6 04:51:26 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:51:26 jpfserver ahcich8: Poll timeout on slot 13 port 0
Jul  6 04:51:26 jpfserver ahcich8: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 80 serr 00000000 cmd 0004cd17
Jul  6 04:51:26 jpfserver (aprobe0:ahcich8:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jul  6 04:51:26 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:51:26 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted
Jul  6 04:52:26 jpfserver ahcich8: Timeout on slot 14 port 0
Jul  6 04:52:26 jpfserver ahcich8: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000 cmd 0004ce17
Jul  6 04:52:26 jpfserver (ada0:ahcich8:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
Jul  6 04:52:26 jpfserver (ada0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request
Jul  6 04:52:26 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:52:26 jpfserver smartd[5263]: Device: /dev/ada0, failed to read SMART Attribute Data
Jul  6 04:52:27 jpfserver xptioctl: pass driver is not in the kernel
Jul  6 04:52:27 jpfserver xptioctl: put "device pass" in your kernel config file
Jul  6 04:53:13 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:53:13 jpfserver ahcich8: Poll timeout on slot 15 port 0
Jul  6 04:53:13 jpfserver ahcich8: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0004cf17
Jul  6 04:53:13 jpfserver (aprobe0:ahcich8:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jul  6 04:53:13 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:53:13 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted
Jul  6 04:53:43 jpfserver ahcich8: Timeout on slot 16 port 0
Jul  6 04:53:43 jpfserver ahcich8: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 80 serr 00000000 cmd 0004d017
Jul  6 04:53:43 jpfserver (ada0:ahcich8:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
Jul  6 04:53:43 jpfserver (ada0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:53:43 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:54:29 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:54:29 jpfserver ahcich8: Poll timeout on slot 17 port 0
Jul  6 04:54:29 jpfserver ahcich8: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 80 serr 00000000 cmd 0004d117
Jul  6 04:54:29 jpfserver (aprobe0:ahcich8:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jul  6 04:54:30 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:54:30 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted
Jul  6 04:55:00 jpfserver ZFS: vdev state changed, pool_guid=17032347544455370488 vdev_guid=4106658276116341525
Jul  6 04:55:00 jpfserver ahcich8: Timeout on slot 18 port 0
Jul  6 04:55:00 jpfserver ahcich8: is 00000000 cs 001c0000 ss 001c0000 rs 001c0000 tfd 80 serr 00000000 cmd 0004d217
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): DSM TRIM. ACB: 06 01 00 00 00 40 00 00 00 00 01 00
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 bd 94 da 40 04 00 00 01 00 00
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 bd 95 da 40 04 00 00 01 00 00
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 bf bd 96 da 40 04 00 00 00 00 00
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): Error 5, Periph was invalidated
Jul  6 04:55:00 jpfserver (ada0:ahcich8:0:0:0): Periph destroyed
Jul  6 04:55:02 jpfserver ZFS: vdev state changed, pool_guid=17032347544455370488 vdev_guid=4106658276116341525
Jul  6 04:55:46 jpfserver ahcich8: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul  6 04:55:46 jpfserver ahcich8: Poll timeout on slot 21 port 0
Jul  6 04:55:46 jpfserver ahcich8: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 80 serr 00000000 cmd 0004d517
Jul  6 04:55:46 jpfserver (aprobe0:ahcich8:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jul  6 04:55:46 jpfserver (aprobe0:ahcich8:0:0:0): CAM status: Command timeout
Jul  6 04:55:46 jpfserver (aprobe0:ahcich8:0:0:0): Error 5, Retries exhausted


We don't exclude hardware issues, and we will look closer to power events and power supply loads, but this may indeed end up as a software issue. I've found somewhat similar reported behavior in this thread, but that was under 9.10.2-U3, and we're on 11.1-U5, so it may be completely unrelated.

I'll postpone submitting a ticket until we have more data, or can replicate this consistently. I'm sure the FreeNAS team has their hands full working on 11.2, and we want to avoid sending them on wild geese chases.
 
Status
Not open for further replies.
Top