Multiple Drives not capable of SMART self-check

Yaniv

Dabbler
Joined
Feb 5, 2015
Messages
12
Hi!

I've been using FreeNAS on a few different boxes for a few years, but I'm running into an issue with my latest build. It's been running FreeNAS 24x7 for roughly two to three years now.

OS Version:
FreeNAS-11.2-U5
(Build Date: Jun 24, 2019 18:41)
Processor & MB:
Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz (4 cores)
ASUS PRIME Z270-K LGA1151
Memory:
32 GiB ( 4 x 8GB XPG Gammix D10 DDR4 2666MHz (PC4 21300))
Sata Controller:
Ableconn PEX10-SAT 10 Port SATA 6G PCI Express Host Adapter Card - AHCI 6 Gbps SATA III
(the rest are connected directly to the MB)

I can hunt down the rest of the specs as needed, but basically it has 10 x WD Red SATA drives (regular, not Pro) and 5 x Seagate IronWolf drives; all 15 are 4TB.
I have short SMART tests set to run weekly and long tests set to run once a month. The OS runs on 2 x 64GB mirrored thumb drives (Corsair Flash Voyager Vega 64GB).

Lately, I've been getting a lot of email alerts about drives being "not capable of SMART self-check". The server will stay on for a few weeks with no issue, and then the alerts begin at what looks like a random time (it doesn't line up with the monthly long test). I'll get an alert about one drive, then the next day about another, and so on until there are 7 or 8 drives listed in the email. At that point the server can become unresponsive and I can't SSH in. When I can, I check the pool status, and either the pool was degraded and has resilvered, or one of those 7 or 8 drives has been removed. If I restart the server, it reports that the pool is online, all drives are online, and there are no issues.

I'll then manually kick off a full long test of the drive that was removed (I noted the gptid and found the drive info), and see that the drive has passed the long test. I am in no way an expert at reading the results, but my understanding is that as long as the value and worst value are above the threshold, I should be good. I've attached a sample from one of the WD drives and one of the Seagate drives. The most recent drive to be removed is the WD drive.
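
To make my reading of the results concrete, here is a rough Python sketch of the rule I'm applying: flag any attribute whose VALUE or WORST is not above THRESH. The sample table only mimics the layout of `smartctl -A` output; its numbers are made up for illustration and are not taken from my attachments, and the function name is my own.

```python
# Hypothetical sample in the layout printed by `smartctl -A`. The numbers are
# invented for illustration and are NOT taken from the attached files.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22000
194 Temperature_Celsius     0x0022   030   025   040    Old_age   Always       -       45
"""


def attributes_at_or_below_threshold(text):
    """Return names of attributes whose VALUE or WORST is not above THRESH."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        if value <= thresh or worst <= thresh:
            flagged.append(name)
    return flagged


print(attributes_at_or_below_threshold(SAMPLE))
```

With this made-up sample only the temperature row gets flagged. The check looks at the normalized columns only and ignores the raw counters, so I treat it as a sanity check rather than a health verdict.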

The WD one seems OK to me. The Seagate one also seems OK, but is probably close to giving up on me. The drives are all 2-3 years old.

I guess my question is, why do I keep getting these alerts, and how do I fix whatever the issue is? Is it just genuinely that all the drives happen to be going bad so soon, all at similar times?
 

Attachments

  • SEAGATE-05042020.txt (6.6 KB)
  • WD_RED-05032020.txt (5.9 KB)

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey Yaniv,

First thing would be not to wait for days when something goes wrong with your server... Whenever an alert arrives, you should investigate it ASAP.

That way you should still be able to SSH into the server, log in to the WebUI, run the checks and collect more info about the problem.

Next time it happens, don't wait for things to go from bad to worse; investigate ASAP. If you can't find the cause then, collect as much evidence as you can (zpool status, logs, alert messages, ...) and we will work on it from there.
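
If it helps, the evidence gathering can be scripted so that it takes seconds when an alert arrives. The following is only a sketch in Python: the command list, the `/dev/ada0` device name and the function names are placeholders I chose, so swap in whichever device the alert names and whatever else you want captured.

```python
import subprocess
from datetime import datetime

# Placeholder command list: replace /dev/ada0 with the device from the alert.
COMMANDS = [
    ["zpool", "status", "-v"],
    ["smartctl", "-a", "/dev/ada0"],
    ["dmesg"],
]


def run_command(argv):
    """Run one command and return its stdout and stderr as a single string."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # The command is missing or not executable on this machine.
        return "(could not run: {})".format(exc)
    return result.stdout


def collect(commands, runner=run_command):
    """Build one text report with a '$ command' header before each output."""
    parts = ["collected at " + datetime.now().isoformat(timespec="seconds")]
    for argv in commands:
        parts.append("$ " + " ".join(argv))
        parts.append(runner(argv))
    return "\n".join(parts)
```

Calling `collect(COMMANDS)` as root on the server and writing the returned string to a file gives you one timestamped snapshot per incident that you can paste here.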
 

Yaniv

Hi Heracles,

Thanks for the reply, much appreciated. The most recent alert is from today:

FreeNAS @ freenas.local

New alerts:
* Device: /dev/ada5, not capable of SMART self-check

Current alerts:
* Device: /dev/ada5, not capable of SMART self-check
* A system update is available. Go to System -> Update to download and apply the update.
* Pool Media state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Once I got the email, I ran the extended test on ada5. The results are attached to my first post as the Seagate file.
Yesterday the email named ada12; those results are attached as well, as the WD file. It doesn't show up in today's email, presumably because the test cleared the alert for now. No doubt tomorrow's email will be about a different drive.

Current zpool status:

root@freenas[~]# zpool status -v
  pool: Media
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 144M in 0 days 00:32:52 with 0 errors on Sun May 3 10:45:43 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        Media                                           ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/03960005-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/0dc67fc1-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/17eab28f-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/214dbf98-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/2dc1298f-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/3a57f5fd-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/4603f0d8-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/4dd500fa-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/55a98bea-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/5e4661b0-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/66dba58b-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/72901a40-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     3
            gptid/7f011f60-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/8acad059-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0
            gptid/929e1b01-0d3d-11e9-9800-704d7b87f1d2  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:10:22 with 0 errors on Wed Apr 29 03:55:22 2020
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da0p2     ONLINE       0     0     0
            da1p2     ONLINE       0     0     0

errors: No known data errors

The one with a CKSUM count of 3 is actually the ada12 drive from yesterday.
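
Since spotting that one row by eye across 15 drives is easy to get wrong, here is a small Python sketch that pulls out any device row with a nonzero READ, WRITE or CKSUM counter. The sample is an excerpt of the output above; the function name is my own, and rows whose counters are abbreviated (for example with a K suffix) are simply skipped by this version.

```python
# Excerpt of the `zpool status -v` output posted above.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
Media ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
gptid/66dba58b-0d3d-11e9-9800-704d7b87f1d2 ONLINE 0 0 0
gptid/72901a40-0d3d-11e9-9800-704d7b87f1d2 ONLINE 0 0 3
gptid/7f011f60-0d3d-11e9-9800-704d7b87f1d2 ONLINE 0 0 0

errors: No known data errors
"""


def devices_with_errors(zpool_status_text):
    """Return (name, read, write, cksum) for rows with any nonzero counter."""
    hits = []
    for line in zpool_status_text.splitlines():
        fields = line.split()
        # Device rows have exactly five columns: NAME STATE READ WRITE CKSUM.
        if len(fields) != 5:
            continue
        name, _state, *counters = fields
        # Skip the header and any prose line that happens to have five words.
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            hits.append((name, read, write, cksum))
    return hits


for name, read, write, cksum in devices_with_errors(SAMPLE):
    print(name, "READ", read, "WRITE", write, "CKSUM", cksum)
```

On the excerpt this reports only the gptid/72901a40 entry with its 3 checksum errors. Matching a gptid back to an adaN device still has to be done separately, for example with `glabel status`.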

Do you know which logs I should be looking at when this stuff happens? I'm just not sure if this is hardware specific or something in the OS? Thank you for your help.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Ableconn PEX10-SAT 10 Port SATA 6G PCI Express Host Adapter Card - AHCI 6 Gbps SATA III
(the rest are connected directly to the MB)
I can probably say with 100% certainty this is not supported. I didn't bother reading the rest of your post because you didn't bother reading the documentation or hardware guide.

Replace your SATA controller with a supported HBA.
 

Yaniv

Replies like these are probably why people are hesitant to ask for help in forums like this.

You assumed that I have not read the documentation or the hardware guide. I have read both. This is not my first rodeo with FreeNAS. If you had taken the time to actually finish reading the post, you would have seen that I was asking about a specific alert and whether the results on the hard drives are abnormal. I was not asking anyone to diagnose my unsupported controller, or to help with that part of it in any way. If the results are normal and there are no known bugs with this version of the OS, then obviously I would need to take a look at the controller. This post was more about whether I am taking the right troubleshooting steps and, if not, where I could improve before going the hardware route.

I also went through the forum rules before posting. They don't say that if you don't meet the hardware guide, you shouldn't bother posting.
 

Heracles

Hi again Yaniv,

The proper way to debug is to start at the lowest layer and work up to the top. The reason is simply that each layer depends on the ones below it, so a failure in a lower layer can be the source of another failure in a higher layer.

Here, indeed, the lowest layer is hardware. So we have to troubleshoot that one.

Because you are using an exotic controller, we need to rule it out first. One way would be to take all the problematic drives and move them to the motherboard ports. After the move, does the problem follow the drive and show up again once it is on the motherboard, or does it jump from one drive to another while staying on the same controller?
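
To keep track of that test over several weeks, something like the Python sketch below could tally alerts per controller. The device-to-controller mapping is entirely hypothetical, since I do not know how your drives are cabled, and the function names are just for illustration.

```python
from collections import Counter

# HYPOTHETICAL mapping of device to where it is plugged in. Fill this in from
# the actual cabling; the two entries here are guesses for illustration only.
ATTACHMENT = {
    "ada5": "add-in card",
    "ada12": "motherboard",
}


def alerts_per_controller(alerted_devices, attachment):
    """Count alerts per controller; unmapped devices count as 'unknown'."""
    return Counter(attachment.get(dev, "unknown") for dev in alerted_devices)


def summarize(counts):
    """Describe the spread of alerts (a rough heuristic, not a diagnosis)."""
    seen = [controller for controller in counts if controller != "unknown"]
    if not seen:
        return "no mapped devices yet"
    if len(seen) == 1:
        return "all alerts so far are on: " + seen[0]
    return "alerts span more than one controller"


counts = alerts_per_controller(["ada5", "ada12", "ada5"], ATTACHMENT)
print(dict(counts))
print(summarize(counts))
```

If every alert lands on the same controller even after drives have been swapped between ports, that points at the controller; if alerts keep following one drive wherever it is plugged in, that points at the drive.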
 

Yaniv

Hi again,

OK, makes sense, thanks. I am pretty sure that some of the drives with this alert are ones attached directly to the motherboard. I will have to keep an eye on it when it happens again. I'll try what you suggested about moving the drives and go from there. Hopefully that helps narrow down the issue.

Thanks again!
 