Degraded Pool. Dead Drive?

Aflac_Attack

Cadet
Joined
Apr 21, 2018
Messages
8
I logged into my FreeNAS as I do every so often to make sure everything's running smoothly and I'm greeted with the big yellow Degraded message.
degraded.JPG


Go click on the Alert and it shows:
Alert.JPG


Go check out the Disks page and get:
disks.JPG


Alright so it looks like ada3 is the cause of the degradation. I bring up the log for when the alert was triggered and this is what I get. Don't know what to make of it to be honest.
log1.JPG

log2.JPG


Thinking that either the drive is dead or FreeNAS somehow errored out on me, I gave the system a good ol' restart in the (small) hope that would do it. Alas, it did not.
After the restart, the (bad) drive didn't seem to be brought up at all: instead of 7 active hard drives there are now only 6, ada0-ada5. ada3 now shows as active and working properly, so I'm thinking the ada name is only a temporary operational label that was reassigned to one of the working hard drives after the restart.
restart - disks.JPG


So I have a degraded pool that's missing 1 hard drive. This hard drive was labeled ada3, but after the restart it doesn't seem to have been brought back up at all, and one of the good hard drives took the name ada3.
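Since FreeBSD hands out adaN names in probe order at boot, a more stable handle is the label the pool itself records. The sketch below assumes the pool members appear in `zpool status` as gptid/... labels (as FreeNAS-created pools typically do) and that `glabel status` prints its usual Name / Status / Components columns. The helper name gptid_to_dev and the label in the usage note are made up for illustration.

```shell
# Resolve a gptid label (as shown by `zpool status`) to the partition
# currently backing it, by reading the three-column `glabel status` table.
# Prints nothing if the label is not present on any attached disk.
gptid_to_dev() {
    glabel status | awk -v id="$1" '$1 == id { print $3 }'
}
```

For example, `gptid_to_dev gptid/<label-from-zpool-status>` would print something like ada3p2 if that member is attached, and nothing if the disk is missing.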

What's next?
How do I go about determining what the problem is?
How do I determine which is the affected drive other than unplugging every drive one at a time?
How do I run manual SMART or other testing commands on each drive?
How do I view and/or upload the log files? I can upload the logs from after the restart if that would help diagnose the problem. I'd just need someone to tell me how to do that.
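On the log question, a rough sketch for pulling out only the kernel messages that mention one disk. It assumes the system log lives at /var/log/messages (the FreeBSD default); the helper name disk_log_lines is hypothetical.

```shell
# Print log lines that mention a given disk name as a whole word,
# so that "ada3" does not also match "ada30".
# Usage: disk_log_lines ada3 [logfile]
disk_log_lines() {
    dev=$1
    log=${2:-/var/log/messages}   # assumed default log location
    grep -E "(^|[^[:alnum:]])${dev}([^[:alnum:]]|\$)" "$log"
}
```

Redirecting the result to a file, e.g. `disk_log_lines ada3 > /tmp/ada3.log`, gives something small enough to attach to a post.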

System Specs:
Ryzen R5 1600
ASRock X370 Taichi
16 GB ECC RAM @ 2400 MHz (2x8)
7x 10 TB HDDs in RAIDZ2, all connected by SATA directly to the motherboard. No HBA or RAID card nonsense.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
You can identify the drive by serial number - 2yj1gwgd.
I would first reboot the machine and check to see if the drive is seen in the BIOS.
If not, check the cables.
If it is still not found, plug one of your good drives into the port used by the bad one to check the port, plug the bad drive into a known good port, and act on what you find.
I guess that you don't have SMART tests running? You can see the scheduled tests with smartd -q showtests on the command line.
To check a drive's SMART data, use smartctl -a /dev/adaX, where X is 0, 1, 2, etc.; to start a long self-test, use smartctl -t long /dev/adaX.
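A minimal sketch of matching serial numbers to the current ada names, which helps because those names can move between boots. The helper name list_serials is made up for illustration, and it assumes smartctl -i prints a "Serial Number:" line, as it normally does for SATA drives.

```shell
# Print "<device> <serial>" for each disk name given on the command line.
# Disks that smartctl cannot read are reported with the serial "unknown".
list_serials() {
    for dev in "$@"; do
        serial=$(smartctl -i "/dev/$dev" |
            awk -F': *' '/^Serial Number/ { print $2 }')
        printf '%s %s\n' "$dev" "${serial:-unknown}"
    done
}
```

On FreeBSD, `list_serials $(sysctl -n kern.disks)` should cover every disk the kernel currently sees. Whichever serial from the drive stickers is absent from that list is the one that dropped out. After a long self-test has finished, `smartctl -l selftest /dev/adaX` shows its result.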
 

Aflac_Attack

Cadet
Joined
Apr 21, 2018
Messages
8
Not one to leave forum posts unfinished, here is the update: I managed to resolve the degradation problem I was having and safely reintroduced the "faulty" drive back into the system.

  • I began by removing ada3 2Y1GWGD from the system and running diagnostic tests on it: Surface Test, BadBlocks, and Long SMART. (10TB surface tests take 17 hours...) All tests completed with no issues, so I was sure the hard drive itself was still healthy and functional.
  • Next I tested the other physical components of the system: SATA cables and SATA ports, both of which seemed fully functional.
  • Then I tried various combinations of offlining the drive and bringing it back online, and replacing it with itself, with no success.
  • I tried clearing past SMART data, as I had read that if SMART had thrown an error for whatever reason and it was still in the firmware log, that might prevent the drive from being onlined.
  • I then read about possibly having to clear the partition label on the affected drive in order for the replace to work, so I brought the drive to another system where I was running the installer shell and tried a few variations on the "clear label" command. I kept trying different arguments as I never got any kind of response saying that the label had been cleared, but decided to just give it a shot anyway in the hope that the label had in fact been cleared.
  • I reinstalled the drive in the FreeNAS system but still could not get the disk to Online without a degraded state, so I replaced it and it began resilvering. I was hoping to just be able to pop it back in and bring it back up WITHOUT the resilvering, but that was not the case. Resilvering completed successfully and all 7 drives are back up in the array showing healthy.
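For anyone following the same route from the command line, the steps above correspond roughly to the zpool subcommands sketched here. This is not the exact set of commands used in this thread (the replace was done in FreeNAS itself, and the precise "clear label" invocation is not recorded). The function only prints the commands, the name print_reintroduce_plan is hypothetical, and the pool and member arguments are placeholders.

```shell
# Print, but do not run, one possible command sequence for bringing a
# dropped disk back into a pool. Nothing here touches any disk.
# Usage: print_reintroduce_plan <pool> <member>
print_reintroduce_plan() {
    pool=$1
    member=$2
    # 1. Try to bring the existing member back online as-is.
    echo "zpool online $pool $member"
    # 2. If the pool stays degraded, wipe the stale ZFS label. This
    #    destroys the label, so it is only for a disk not in active use.
    echo "zpool labelclear -f /dev/$member"
    # 3. Replace the member with itself, which starts a resilver.
    echo "zpool replace $pool $member"
    # 4. Watch the resilver until the pool reports healthy again.
    echo "zpool status -v $pool"
}
```

Printing the plan first makes it easy to review each step, or to do the equivalent in the FreeNAS web interface instead, before anything is changed.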
I still have no idea what actually caused ada3 to drop out to begin with, as it wasn't any kind of physical fault, so I can only think it was something wonky with ZFS/FreeNAS itself. But crisis averted I guess, and no need to waste money on another hard drive.

~I'm not sure if there's a way to close this thread or mark this issue as resolved but if an admin can do so, then please do.~