LIGISTX
Guru
Joined: Apr 12, 2015
Messages: 525
A few days ago I got a message that looked like a possible loose cable issue. This is the email:
Code:
.local kernel log messages:
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 cd 91 6f e8 00 00 00 08 00 00 length 4096 SMID 761 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 4096
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 cd 91 6f e8 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Retrying command
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 cd 91 6f e8 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
-- End of security output --
The above was on 11.1U6
Two nights ago I decided to update to 11.1U7, and a few hours later that night I got a slew of emails. Not wanting to deal with it before work yesterday, I shut down FreeNAS, then ESXi, and then the server itself.
I replugged all the cables, made sure everything was seated, and brought everything back up; it seemed to be fine. I believe it had to resilver a drive, but not the entire drive, since it finished very quickly. I checked zpool status and it looked fine.
This morning I woke up to more emails. I am not really sure what to check or what is useful here, but the pool is degraded once again. Looking at the security run output from the night I upgraded to 11.1U7, I see similar issues to those above at the very bottom of the emailed output:
Code:
epair0a: promiscuous mode enabled
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6c a0 00 00 00 10 00 00 length 8192 SMID 342 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6c a0 00 00 00 10 00 00
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Retrying command
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6c a0 00 00 00 10 00 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6d a8 00 00 00 08 00 00 length 4096 SMID 178 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6d a8 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Retrying command
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6d a8 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 6d a8 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 aa 38 00 00 00 08 00 00 length 4096 SMID 762 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 aa 38 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Retrying command
(da9:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 94 a2 aa 38 00 00 00 08 00 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
(da9:mps0:0:16:0): WRITE(10). CDB: 2a 00 0a 70 7b e8 00 00 20 00 length 16384 SMID 1021 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
(da9:mps0:0:16:0): WRITE(10). CDB: 2a 00 0a 70 7b e8 00 00 20 00
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Retrying command
(da9:mps0:0:16:0): WRITE(10). CDB: 2a 00 0a 70 7b e8 00 00 20 00
(da9:mps0:0:16:0): CAM status: SCSI Status Error
(da9:mps0:0:16:0): SCSI status: Check Condition
(da9:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:16:0): Retrying command (per sense data)
mps0: mpssas_prepare_remove: Sending reset for target ID 16
da9 at mps0 bus 0 scbus33 target 16 lun 0
da9: <ATA WDC WD40EFRX-68N 0A82> s/n WD-WCC7K2YF4UTL detached
(da9:mps0:0:16:0): WRITE(10). CDB: 2a 00 0a 70 7b e8 00 00 20 00 length 16384 SMID 599 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
(da9:mps0:0:16:0): WRITE(10). CDB: 2a 00 0a 70 7b e8 00 00 20 00
mps0: Unfreezing devq for target ID 16
(da9:mps0:0:16:0): CAM status: CCB request completed with an error
(da9:mps0:0:16:0): Error 5, Periph was invalidated
GEOM_MIRROR: Device swap0: provider da9p1 disconnected.
(da9:mps0:0:16:0): Periph destroyed
-- End of security output --
Today I woke up to these, after the scheduled scrub of the pool started last night:
Code:
.local kernel log messages:
uhub0: 8 ports with 8 removable, self powered
ugen0.2: <VMware VMware Virtual USB Mouse> at usbus0
da2 at mps0 bus 0 scbus33 target 9 lun 0
da2: <ATA WDC WD40EFRX-68N 0A82> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number WD-WCC7K7XF3TTF
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 3815447MB (7814037168 512 byte sectors)
da2: quirks=0x8<4K>
da4 at mps0 bus 0 scbus33 target 11 lun 0
da4: <ATA WDC WD40EFRX-68N 0A82> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number WD-WCC7K7PHXACK
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 3815447MB (7814037168 512 byte sectors)
da4: quirks=0x8<4K>
-- End of security output --
Code:
Checking status of zfs pools:

NAME          SIZE   ALLOC  FREE   EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
freenas-boot  15.9G  2.27G  13.6G  -         -     14%  1.00x  ONLINE    -
xxx           36.2T  20.4T  15.9T  -         4%    56%  1.00x  DEGRADED  /mnt

  pool: xxx
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Feb 7 00:00:07 2019
        12.5T scanned at 1.18G/s, 11.4T issued at 1.08G/s, 20.4T total
        1.27M repaired, 56.16% done, 0 days 02:21:11 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        xxxxxxxx                                        DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/ac8d872a-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/ad4a2436-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  DEGRADED     0     0   373  too many errors  (repairing)
            gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
            gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0

errors: No known data errors
-- End of daily output --
Running zpool status right now, it looks like it is 92% repaired. So do I have a drive issue? Is the controller getting confused? (In one of the logs above, da2 and da4 are listed, but I don't see that log showing any sign of an error, and honestly I don't know what that log is telling me.) Or is something just unhappy, and the scrub/repair it's currently doing may help?
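In case it helps anyone reading later: the per-device counters in the zpool status output above can be pulled out with a quick awk filter. To keep the snippet self-contained here, it runs on a sample line copied from my output; on the live box you would pipe the real `zpool status xxx` into the same awk instead.

```shell
# Print NAME, READ, WRITE, CKSUM for any config line that carries counters.
# Sample line copied from the zpool status output above; live usage would be:
#   zpool status xxx | awk '$3 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { ... }'
printf '%s\n' 'gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee DEGRADED 0 0 373 too many errors (repairing)' |
awk '$3 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { print $1, "read:", $3, "write:", $4, "cksum:", $5 }'
# prints: gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee read: 0 write: 0 cksum: 373
```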
I have been running this box for almost 1.5 years now with very few issues, nothing even close to "oh shit, my data!", and it has been under ESXi for many months without issue. Whatever this is seems to have started a week ago with the initial log message about retrying a command on da9, and now I have whatever issue this may be. I have restarted since that da9 retry issue, so I am not sure if da9 is still the same drive or not; I should have noted down its serial number. Right now, I know da9 is the drive with the serial ending in F4UTL, although I am not sure I see that in any of the above logs, nor am I sure how to check what may be wrong.
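For anyone else trying to tie a daX name back to a serial number: as I understand it, smartctl can report it directly (smartctl and glabel ship with FreeNAS; /dev/da9 is just my device). Since I can't paste live output here, the runnable part below parses a sample smartctl info line:

```shell
# On the live box, the real commands would be:
#   smartctl -i /dev/da9 | grep -i serial   # drive model and serial
#   glabel status | grep da9                # maps gptid labels to daX partitions
# Self-contained sample of the smartctl "Serial Number" line so this runs anywhere:
smart_line='Serial Number:    WD-WCC7K2YF4UTL'
printf '%s\n' "$smart_line" | awk -F': *' '/Serial Number/ { print $2 }'
# prints: WD-WCC7K2YF4UTL
```

That serial matches the "s/n WD-WCC7K2YF4UTL detached" message in the second log above, for whatever that is worth.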
I figure I will let this thing finish its scrub and see what sort of resolution it comes to on its own (I know FreeNAS is pretty good about not destroying data, which is my reasoning for letting it finish). All the other ESXi VMs seem to be fine, so I don't think it's a low-level hardware issue causing data errors, although no other VM would be as sensitive as FreeNAS. I am really only running a couple of Ubuntu VMs doing very simple things, as this is just a little homelab playground; and I am not very advanced, so it's a pretty boring playground.
Any help would be great!