Disk confusion

sbeaudoin · Apr 10, 2023

Hi,

I am using TrueNAS Core v13.0 U4 and I really don't understand the information provided in the disk status notifications. I have a drive shelf with 15 disks separated into two pools: Main and Archive. Each one has a hot spare. I also have a USB disk pack with 4 drives used for backup, which I can grab quickly in case of fire.

I receive the following content by email daily and decided to act:

* Device: /dev/da18 [SAT], Self-Test Log error count increased from 0 to 1.
* Device: /dev/da20 [SAT], 16 Currently unreadable (pending) sectors.
* Device: /dev/da20 [SAT], 16 Offline uncorrectable sectors.
* Device: /dev/da18 [SAT], 16 Currently unreadable (pending) sectors.
* Device: /dev/da18 [SAT], 16 Offline uncorrectable sectors.

* Pool Main state is ONLINE: One or more devices has experienced an
unrecoverable error. An attempt was made to correct the error. Applications
are unaffected.

I deduce that disks /dev/da18 and /dev/da20 have problems and that they are in the pool named "Main", as it is the only pool named in this content with an error.
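One way to read such a notification is to group the per-device lines by device name and keep the pool message separate, since the device lines themselves never name a pool. The snippet below is only a minimal sketch: the regular expression and the function name `group_alerts_by_device` are assumptions made for illustration, written against the five alert lines quoted above, and are not part of TrueNAS.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   "* Device: /dev/da18 [SAT], 16 Offline uncorrectable sectors."
ALERT_RE = re.compile(r"Device:\s+(/dev/\w+)\s+\[[^\]]*\],\s+(.*?)\.?\s*$")


def group_alerts_by_device(email_text):
    """Return {device_path: [message, ...]} for every device alert line."""
    grouped = defaultdict(list)
    for line in email_text.splitlines():
        match = ALERT_RE.search(line)
        if match:
            device, message = match.groups()
            grouped[device].append(message)
    return dict(grouped)


# The five device alert lines quoted in the post above.
SAMPLE_EMAIL = """\
* Device: /dev/da18 [SAT], Self-Test Log error count increased from 0 to 1.
* Device: /dev/da20 [SAT], 16 Currently unreadable (pending) sectors.
* Device: /dev/da20 [SAT], 16 Offline uncorrectable sectors.
* Device: /dev/da18 [SAT], 16 Currently unreadable (pending) sectors.
* Device: /dev/da18 [SAT], 16 Offline uncorrectable sectors.
"""

if __name__ == "__main__":
    for device, messages in sorted(group_alerts_by_device(SAMPLE_EMAIL).items()):
        print(device)
        for message in messages:
            print("  -", message)
```

Grouped this way, the device lines only say which device names reported SMART problems; nothing in them ties da18 or da20 to the "Main" pool mentioned in the separate pool message.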

So, I go to the storage>disks section to find DA18.

It shows DA18 is in my "Archive" pool, not my "Main" pool! So I go to the dashboard and look at the Archive pool status: disk w/error says 0, so every disk in Archive is good, including DA18. So why do I receive a daily notification saying it is defective??

Now, the next one, DA20. It is not in my list in storage>disks…

So, I go to the pools screen and ask for a status of each one. DA20 is in the main pool, but as a hot spare and shows 0 read, 0 write and 0 checksum errors. I was expecting 16 read errors… Also, DA3 indicates a checksum error, but is not reported in the email…

Probably the reason it did not show in the disk screen is because it is a hot spare? I go back to the disk screen and see the only hot spare of the main pool I have is DA21!! And the pool is "N/A". Only the label I put for this drive says it is in the main pool.

So, I am really confused… As managing disks is a crucial part of a NAS, I don't understand why I receive either false alarms or information, in the email or on the screens, that is simply wrong and can't be trusted.

So, the questions are :

Did I do something wrong in my configuration to produce this nonsense?
Why does the email report DA18 and DA20 as defective while the screens show no errors?
Why does DA20 become DA21?
Why is the error reported for DA3 on the screen not reported in the email?

Thank you for your help...

joeschmuck · Apr 10, 2023

I will answer #2... The drives have not failed, YET! But this is data you should evaluate as pre-failure data. What strikes me as odd is that both drives have the exact same data. I would investigate this more to find out whether the drives are actually starting to fail. Run a SMART Long test on both.

And #3... Track drives by the serial number, not the drive identifier (da20 or DA21). Why? Because drive identifiers are assigned according to which drive becomes available first. These IDs can change, which is why we promote using the serial number. Drive IDs don't often change, but on some hardware it can happen more often.
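As a rough illustration of tracking by serial number, the sketch below pulls the serial out of the text printed by `smartctl -i`. The helper names (`parse_serial`, `serial_for_device`) and the sample text are hypothetical; the only thing assumed about smartctl is that its identity section contains a "Serial Number:" (or "Serial number:") line.

```python
import re
import shutil
import subprocess

# smartctl prints "Serial Number:" for ATA drives and "Serial number:" for
# SCSI/SAS drives, so the match is case-insensitive.
SERIAL_RE = re.compile(r"^Serial number:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def parse_serial(smartctl_info_text):
    """Return the serial number found in `smartctl -i` output, or None."""
    match = SERIAL_RE.search(smartctl_info_text)
    return match.group(1) if match else None


def serial_for_device(device):
    """Run `smartctl -i <device>` and return its serial number, or None.

    Returns None when smartctl is not installed or prints no serial line.
    """
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_serial(result.stdout)


# Hypothetical excerpt, for illustration only (not taken from this system).
SAMPLE_INFO = """\
Device Model:     EXAMPLE-MODEL-4TB
Serial Number:    EXAMPLE1234
"""

if __name__ == "__main__":
    print(parse_serial(SAMPLE_INFO))
```

Calling `serial_for_device("/dev/da18")` in a root shell on the NAS would then give a serial to match against the label on the physical drive, whatever name the drive gets after the next reboot.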

I think that if you are responsible for this NAS, you should understand more about hard drives, SMART data, and TrueNAS. Read the User Guide, it may help you out some.

And #4... looks like the checksum had one failure (I think that is the checksum column). Even if the drive recovered the data, the checksum count will not clear automatically. Run a scrub; if it passes and the error is still listed, run zpool clear poolname to clear the error.
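To check the counters before and after a scrub without reading the table by eye, one possible sketch is to scan the text of `zpool status` for rows whose READ, WRITE or CKSUM column is non-zero. The function name and the sample layout below are assumptions made for illustration (real TrueNAS output usually lists gptid/... names rather than daX), and very large counters, which zpool may abbreviate (for example with a K suffix), are ignored by this simple pattern.

```python
import re

# A config row looks like: "    da3     ONLINE       0     0     1"
ROW_RE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(zpool_status_text):
    """Return [(name, read, write, cksum), ...] for rows with a non-zero counter."""
    flagged = []
    for line in zpool_status_text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # header, spares without counters, blank lines, etc.
        name = match.group(1)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            flagged.append((name, read, write, cksum))
    return flagged


# Hypothetical status text, for illustration only.
SAMPLE_STATUS = """\
  pool: Main
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\tMain        ONLINE       0     0     0
\t  raidz2-0  ONLINE       0     0     0
\t    da2     ONLINE       0     0     0
\t    da3     ONLINE       0     0     1
\tspares
\t  da20      AVAIL
"""

if __name__ == "__main__":
    for name, read, write, cksum in devices_with_errors(SAMPLE_STATUS):
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

If the list is still non-empty after a clean scrub, that is the point where zpool clear poolname applies.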

sbeaudoin · Apr 10, 2023

Thank you for your kind and rapid answer. Here are some clarifications.

#1 - So, I deduce nothing is wrong with my config.
#2 - I understand they have not failed yet, but still, no error shows on the disk status screens while errors appear in the email, which brings confusion.
#3 - If so, why does the email not give the serial number? As it gives the disk ID and no serial number, and as the disk ID can change, and as the screens do not show the errors, how am I supposed to find out the exact drive that failed or is about to fail, except by running a SMART test on each and every drive? The email should provide the serial number and the right pool (it says the failed drives are in "Main" when the disks are actually in "Archive"), and the screen should show the errors (not zero, again, confusion) on the drive with the corresponding serial number. It should then offer me to run a scrub and to clear the error if the scrub finds nothing.

It is for a home setup and it was "sold" to me as a lot cheaper and as easy to use as a Synology NAS.

Cheaper? It is, for the amount of storage I have. As easy? No. I spent a week putting this setup together, and if I don't want to be confused or not know what to do, I must read the 607-page manual, as the web interface gives wrong or insufficient information and insufficient functions.

joeschmuck · Apr 11, 2023

TrueNAS is fairly easy for an enterprise solution, but not nearly as easy as a Synology, where you drop in some drives and forget about it. Whoever sold you on the idea that TrueNAS is just as easy was not very honest with you, or maybe they just feel it's easy for them so it should be easy for you. It's not terribly difficult either, but if you have no experience with computers and (a little bit of) networking, then you are likely going to need help setting up the NAS. Do not rely on YouTube videos; they are normally outdated quickly and some just provide poor advice.

You should not forget about TrueNAS once you have it set up; it does require you to learn something and to monitor it. If you leave it alone, it will generally run for many years, but then a hard drive failure will occur and, because you forgot all about it 3 to 5 years ago, you panic. We see it here all the time, unfortunately. But TrueNAS is one of the best values out there for a NAS solution for home use. Probably even for business use, but I have no first-hand experience there.

I can't speak to why TrueNAS does not list the serial numbers but the advice I provided will save you grief when you need to replace a drive. Always use the serial number. You can obtain the serial number via the GUI as you have shown above.

I have a script, Multi-Report (see the links in my signature). It consolidates all the important information you should need about your system, from drive health to pool health. A new version, 2.2, will be released very soon, but I'd like to wait until the next version of TrueNAS Scale comes out. But you first need to get your TrueNAS working properly. Maybe it is set up properly now; it sounds like it is. You are just experiencing possible pre-failure of two drives. I say possible because without seeing the output of smartctl -x /dev/da18 and smartctl -x /dev/da20, I can't diagnose whether the drives are in trouble or not. This is the part you will need to learn. You would need to learn it even for a Synology.
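For readers wondering what to look for in that output, the two alert messages ("Currently unreadable (pending) sectors" and "Offline uncorrectable sectors") correspond to SMART attributes 197 and 198 on ATA drives. The sketch below reads their raw values, plus attribute 5 (reallocated sectors), from the classic ten-column attribute table printed by `smartctl -A`; the table printed by `smartctl -x` uses a different column layout and is not handled here. The function name and the sample rows are hypothetical.

```python
# Attribute IDs commonly watched for failing sectors on ATA drives:
# 5 = reallocated, 197 = current pending, 198 = offline uncorrectable.
WATCHED_IDS = {5, 197, 198}


def watched_raw_values(attribute_table_text):
    """Return {attribute_name: raw_value} for the watched SMART attributes.

    Expects the ten-column table of `smartctl -A`:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in attribute_table_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        attr_id = int(fields[0])
        if attr_id in WATCHED_IDS and fields[9].isdigit():
            found[fields[1]] = int(fields[9])
    return found


# Hypothetical rows, for illustration only; the value 16 mirrors the email.
SAMPLE_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36500
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
"""

if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE_TABLE).items():
        print(f"{name}: {raw}")
```

Non-zero raw values on 197 or 198 are the kind of pre-failure data mentioned earlier; whether they keep growing after a long self-test is what matters most.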

BTW, TrueNAS Core is the more mature software, stick with it for now. As a person new to all of this, do not be tempted by Scale. It does have features that will appeal to many people, but if you just want to run a NAS, the Core version is the way to go.

If you have recommendations to improve the TrueNAS software, I would recommend you submit a Jira ticket, be explicit in the details, and provide a recommended solution.

So, you should post the SMART data for those two drives so we can tell you if it's a problem and, if it is, what to do about it. The forum members are here to help where we can. We are not iXsystems employees, just users like you, some of us with lots of experience. I'm not experienced in all things, so I only answer what I feel comfortable with.

Best of luck to you.

WI_Hedgehog · Apr 11, 2023

sbeaudoin said:
It is for a home setup and it was "sold" to me as a lot cheaper and as easy to use as a Synology NAS.

Cheaper? It is, for the amount of storage I have. As easy? No. I spent a week putting this setup together, and if I don't want to be confused or not know what to do, I must read the 607-page manual, as the web interface gives wrong or insufficient information and insufficient functions.

Well, cheaper...that depends on Total Cost of Ownership, including the time you invest in learning any system. It's hard to beat Synology for an easy-to-use home system that you just plug in and click around a bit.

TrueNAS is more like Datacenter-grade middleware sandwiched between hardware and an operating system. It'll haul a truckload of data, but it's no racecar, and you've got to learn how to drive it.

For me, rolling my own OS would be the "cheaper" solution from a hardware perspective, but that's time I don't have. iXsystems did an excellent job with TrueNAS, so it's worth my time in learning TrueNAS and adapting to their method of doing things. They also do a great job with hardware, so that may be worth it to you also. The big gain for me is the education I get by being here--the members are helping me get from "large office" to "datacenter" knowledge, and that's really incredible--there's no-one around me with that level of experience so I'm very grateful.

sbeaudoin · Apr 12, 2023

@joeschmuck, Thanks again for your time. What I conclude is that TrueNAS, as a combination of backend and ZFS, is great and merits all the hype. But it seems to me the functional design of the web interface does not consider real-life scenarios, like the one I am living now. It is only a facade to facilitate some tasks and give the impression that it is an easy-to-manage OS. But you cannot manage it with this interface alone. You must understand the commands and be ready to use a command prompt for anything out of the ordinary, if we consider diagnosing a disk out of the ordinary, which it is not.

A well-designed web interface should offer a way to easily find a defective hard drive and guide you to diagnose it and replace it if needed. All I have is a drive label where a serial number should be provided, a wrong pool, and it leaves you with a silent "good luck with the rest".

Well, my rant is finished, and I will probably use the official way to communicate it to the developers.

I will use the script you suggest and the instructions you provide. I am technically fluent, so I should be able to manage it; I just didn't expect to invest this amount of time to do it.

Thanks again for your help.

@WI_Hedgehog, I understand, in your case, you want an "Open the hood and see how it works" experience. I respect that but it is not the way I wanted to experience it.

WI_Hedgehog · Apr 12, 2023

sbeaudoin said:
A well-designed web interface should offer a way to easily find a defective hard drive and guide you to diagnose it and replace it if needed. All I have is a drive label where a serial number should be provided, a wrong pool, and it leaves you with a silent "good luck with the rest".

I've been able to double-click and drill down to open more information on all drives in my system, including serial number. Do note I'm running all SAS drives on a recommended HBA so that might have something to do with it.

sbeaudoin · Apr 13, 2023

Yes, me too. But the point is: none report an error except the one with the checksum error (which is absent from the email). And as the email does not provide the serial number, only DA18 and DA20, and as I should not use the label to do the correlation, the only way to identify without doubt the one in error is to either use the command line or SMART test all of them.

But anyway, the point of this case is to find how I can identify the defective drives efficiently. I will work on that soon.

Davvo · Apr 13, 2023

You can get the serial of each disk from the webUI.

sbeaudoin · Apr 13, 2023

@Davvo, I know, but this is not the point. The point is: the only information I have from the email is a label, the corresponding disk does not show errors, and many articles say to only use the serial number to identify a drive, information I don't have from the email.

Davvo · Apr 13, 2023

Take a look at joe's Multi-Report script then; it reports the disk serials and not only the labels.
The setup should be simple enough and it will give you an easier way to accurately monitor your disks (which looks like your concern right now).

Keep in mind, however, that TrueNAS isn't a closed-lid solution and you are expected to familiarize yourself with how it works; however, if you follow this forum's advice and use a few tricks (like said script) you can reach that kind of experience.

WI_Hedgehog · Apr 13, 2023

Davvo said:
Take a look at joe's Multi-Report script then; it reports the disk serials and not only the labels.
The setup should be simple enough and it will give you an easier way to accurately monitor your disks (which looks like your concern right now).

Keep in mind, however, that TrueNAS isn't a closed-lid solution and you are expected to familiarize yourself with how it works; however, if you follow this forum's advice and use a few tricks (like said script) you can reach that kind of experience.

I agree, though the email TrueNAS sends should reference Serial Numbers instead of Device Names. That should be submitted as a Feature Request. ( @sbeaudoin is correct )

sbeaudoin · Apr 14, 2023

@Davvo. I installed the script and it took 5 minutes! Great information! But still, no serial numbers, only "GPTID"?? Maybe they are supposed to be in these sections, which report an error for each drive?:

########## NON-SMART status report for da6 drive (N/A : N/A) ##########
SMARTCTL DATA

/dev/xpt0 control device couldn't opened: Permission denied
Unable to get CAM device list
/dev/ada0: Unable to detect device type
Please specify device type with the -d option.

They will possibly disappear once I create the cron job, but I doubt it. I will read the documentation of the script for that.

But still, it reports 0 errors and I still receive TrueNAS emails indicating 16 errors on DA18 and DA20; the last one is from April 9th... I would like to run a scrub as suggested above. I have a task defined for that, but it is due in 2 days and the web interface only gives me options to edit or delete it, no way to launch it?! I will take a look in the manual to find how to launch it from the CLI...

Davvo · Apr 14, 2023

sbeaudoin said:
@Davvo. I installed the script and it took 5 minutes! Great information! But still, no serial numbers, only "GPTID"?? Maybe they are supposed to be in these sections, which report an error for each drive?:

This is what the first part looks like:

What you posted is the second part that includes the raw output from the cli.

Code:

Unable to detect device type
Please specify device type with the -d option.

I usually get this for my USB boot drive; the "Permission denied" might be a configuration/installation issue, as I don't get that one myself.

sbeaudoin said:
But still, it reports 0 errors and I still receive TrueNAS emails indicating 16 errors on DA18 and DA20; the last one is from April 9th... I would like to run a scrub as suggested above. I have a task defined for that, but it is due in 2 days and the web interface only gives me options to edit or delete it, no way to launch it?! I will take a look in the manual to find how to launch it from the CLI...

I say you don't need to rush it, and you should reset the error count and see if they come back.

To reset the error count, you can use the WebUI shell to execute the following: zpool clear poolname.
To manually start a scrub, execute the following: zpool scrub poolname.

As a side note, it is suggested to use something like PuTTY for easier terminal management.

joeschmuck · Apr 14, 2023

@sbeaudoin
You did not post the entire output. You only posted a small section for a single drive (da6) which does not report SMART data and, as my friend @Davvo indicated, it is likely because the interface does not support it. I am more than happy to assist you in using the script, and if there is anything I can improve upon, I most certainly will.

Attached is the output file from my system. I do not have many drives, so I use donor files to run simulations. I will not share the simulation files, as I have told people that I will not share their data, so you get to see my data.

Davvo · Apr 14, 2023

Can you please tell us a bit more about your system hardware, especially how you are connecting all those drives to the motherboard?

TN has a few very specific hardware requirements (some depending on ZFS), and one of those could be the issue here, depending on your configuration.

What's all the noise about HBAs, and why can't I use a RAID controller?

1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with, a server. RAID controllers typically aggregate several disks into a Virtual Disk abstraction of some sort...

www.truenas.com

Multiply your problems with SATA Port Multipliers and cheap SATA controllers

In the last year or two, we've had a resurgence of users asking about SATA Port Multipliers and cheap SATA controllers. Please, do NOT use port multipliers, and use cheap SATA controllers only after extensive research. SATA controllers and SATA...

www.truenas.com

Why you should avoid USB attached drives for data pool disks

This subject has been coming up quite often in the last year or 2. Perhaps because of TrueNAS SCALE has brought forth some attention to TrueNAS in general and ZFS in specific. Please note that this is about USB attached storage for ZFS data...

www.truenas.com

sbeaudoin · Apr 14, 2023

@Davvo, no, the first section describes the pools:

All the other content is text, no other tables. I include the rest of the content in this post: no serial numbers, no other tables.

As I haven't taken the time to read all the instructions of the script yet, it is possible I misread something. I will keep this thread updated.

sbeaudoin · Apr 14, 2023

I will also put my system details in my profile this weekend. Mainly, I use a supported controller in IT mode, connected by two cables to the two controllers of a 15-bay EMC external drive shelf. I also use an external USB disk box with 4 disks for backup, as I want to be able to unplug it and take it with me rapidly in case of fire. But for those disks, the label is "ada0" to "ada4" and the reported errors are for disks "da18" and "da20". So, I don't think the external disk pack is involved.

I will update with the exact models this weekend.

joeschmuck · Apr 14, 2023

@sbeaudoin, there are no more charts because the script did not recognize any more drives. The script only grabs SMART data; if it can't grab SMART data, then it isn't able to pull the data from the drives. This has me thinking of alternate ways to pull data, such as camcontrol, but I'd think smartctl would do that already.

Yes, it would be nice to know your hardware configuration and also the firmware in the HBA.

sbeaudoin · Apr 16, 2023

@Davvo, I updated my profile with the complete TrueNAS system specs. Also, I did the clear and scrub. We should see if the errors come back. If so, I will update this post.
@joeschmuck, except for the USB, I use a controller in IT passthrough mode, so it should be able to grab SMART data, shouldn't it?
