New system build, looking for pool recommendations ssd drives

Richard Kellogg

Dabbler
Joined
Jul 30, 2015
Messages
27
Is this stated as such in the manual? Because if not I would caution that you probably need lanes for connecting some of the on-board stuff.
I am currently not using any of the PCIe slots on the motherboard. With the original firmware, this board was not able to do bifurcation. However, with a newer update (which I have), it can bifurcate x8 to x4/x4. It can also definitely run 5 slots at x8.

I chose to go with enterprise SATA SSDs instead of NVMe SSDs in NVMe-to-PCIe cards for cost reasons. I was able to get new enterprise 4TB SATA Intel 4510 drives for ~$250.
 

Richard Kellogg

Dabbler
Joined
Jul 30, 2015
Messages
27
I'm not sure where to find the error logs. I just saw that TrueNAS said the drive was bad. The attached file is the smartctl output for the "bad" drive. I don't know how to interpret the smartctl output.
Attached is the error log. Note: the bad disk was ada0; later I moved it to a different SATA port, and it now shows up as ada3. It sure looks like there were repeated errors.
 

Attachments

  • error_ada0.txt
    41.2 KB · Views: 23

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Your CRC errors could easily have caused the drive to become DEGRADED, and you likely saw a GUI alert for the CRC errors. The value on the drive will never clear or go back to zero; it only increments as this error occurs. The value is 5 right now. If the value increases, check or replace the SATA data cable for drive S/N: PHYF150105AR3P8EGN, as that is the most likely suspect.

I do not see any other obvious warning signs with this drive.

The next thing you can do is run zpool status -v and see if there are any ZFS errors. If there are, please post the output of that command in code brackets for the next instructions on what to do, but hopefully the report is all ONLINE and no errors.
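A minimal sketch of that check, assuming the usual "pool:" and "state:" lines in the zpool status output. pool_problems is a made-up helper name, not a ZFS command; it only filters text, so it changes nothing on the system.

```shell
# Sketch: list pools whose state is not ONLINE.
# Reads `zpool status` text on stdin; pool_problems is a hypothetical helper.
pool_problems() {
    awk '
        $1 == "pool:"  { pool = $2 }                        # remember the current pool name
        $1 == "state:" && $2 != "ONLINE" { print pool ": " $2 }
    '
}

# Usage on the NAS (prints nothing when every pool is ONLINE):
#   zpool status -v | pool_problems
```

If this prints anything, the full zpool status -v output is still what should be posted; zpool status -x also gives a quick summary of unhealthy pools.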

Good Luck
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Definitely a hardware error of some sort. Definitely needs fixed too. I am also concerned about attribute 174/192, some of those could well be related to 199.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I am also concerned about attribute 174/192, some of those could well be related to 199.
I highly doubt it, completely different things.

As I said above,
If the value increases then check or replace your SATA data cable for drive S/N: PHYF150105AR3P8EGN and that is the most likely suspect.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I highly doubt it, completely different things.

It depends on how the drive accounts for it. I've seen cabling cause 192. He has a hardware problem.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
While I still disagree, I won't say it is impossible for a 192 error to cause a 199 error, stranger things have happened and I definitely do not know it all, and it is okay to disagree. This wouldn't be a learning forum if we didn't challenge each other from time to time.

As for it being a hardware problem, I completely agree for the UDMA CRC Errors, but what hardware specifically? We do not know that just yet, and the log file does not provide that data; neither does the current SMART data. So I will request that the extended SMART data, smartctl -x /dev/ada3, be provided.

The most likely suspect is the SATA cable, either the connection at the motherboard, the connection at the drive, or possibly the cable has just failed (yes, it actually does happen out of the blue). Any one of the three is very likely the issue.

Those 199 UDMA CRC Errors could have occurred at any time; maybe the extended data will show when they occurred. If the data does show these errors happening and if it's not been fairly recent, then I'd say the problem is gone for now. You might be able to link it to when the system was built; these things do happen, especially when someone's hands are in the case bumping cables or just moving the case around. I prefer the better quality locking SATA data cables, and as short as possible for the distance needed.
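To keep an eye on that counter, something along these lines could work. It is only a sketch: smart_raw and crc_increased are hypothetical helper names, and it assumes the usual ATA attribute table layout (ID in the first column, raw value in the tenth).

```shell
# Sketch: print the raw value of one SMART attribute.
# Reads `smartctl -A` (or -x) text on stdin; smart_raw is a hypothetical helper.
smart_raw() {
    # NF >= 10 skips other tables in -x output that also start with a number
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Succeeds only when the new count is higher than the one noted earlier.
crc_increased() {  # usage: crc_increased OLD NEW
    [ "$2" -gt "$1" ]
}

# Usage on the NAS, with 5 being the count seen so far:
#   now=$(smartctl -A /dev/ada3 | smart_raw 199)
#   crc_increased 5 "$now" && echo "CRC errors went up: check the SATA cable"
```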

An easier way to grab data off the NAS, whether you are using the GUI or SSH, is to send an email to yourself. This is very handy so long as your email is already set up in TrueNAS, and it can be used for many things.

Here is how:
smartctl -x /dev/ada3 | mail -s "ada3 SMART" your_email@address.com
"ada3 SMART" = Subject Line Text
your_email@address.com = The email address to send the test data to.

Another example: zpool status -v | mail -s "Zpool Status" your_email@address.com

As for IDs 174 and 192 "Unsafe_Shutdown_Count", that is typically caused by an improper shutdown of the system: the power is removed from the SSD before the system tells the drive to shut down. An easy way to create this issue is to hold down the power button on the computer and force it to power off. Another is the UPS unexpectedly powering off. Or maybe the TrueNAS software is not flagging the drive properly. In this last case I would monitor these values and see if they continue to increase, and if you can tie them to an event, even better.

For comparison, please provide the same extended data for the other drives, as they are the same model. Comparing this data would definitely shed more light on the 174 and 192 issue and on whether it is a single drive issue or a whole system issue.
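Once the outputs are saved to files, the comparison could be done with a small filter like this. It is a sketch only: compare_attrs is a hypothetical helper, the file names are examples, and the column positions assume the standard ATA attribute table.

```shell
# Sketch: show attributes 174, 192 and 199 side by side for several saved
# smartctl outputs. compare_attrs is a hypothetical helper name.
compare_attrs() {
    # Require a full attribute row (10+ fields, name in column 2) so that
    # numbered rows from other tables in `smartctl -x` output are ignored.
    awk '($1 == 174 || $1 == 192 || $1 == 199) && NF >= 10 && $2 ~ /^[A-Za-z]/ {
             print FILENAME, $1, $2, $10
         }' "$@"
}

# Usage, after saving one file per drive:
#   smartctl -x /dev/ada3 > ada3.txt
#   smartctl -x /dev/ada4 > ada4.txt
#   compare_attrs ada3.txt ada4.txt
```

Matching 174/192 values across all drives would point at a whole system cause, while a 199 count on one drive only would point at that drive's cable or port.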

You may find out that a Jira ticket needs to be submitted.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Yeah, my guess was he was not shutting down the system randomly for no reason but could be a bad guess. Comparing to other drives should definitely provide more data. But if his power cable was loose/wonky, you can get those. But I wouldn't expect the same number from other drives in that case. Yes, the errors do not tell you what, so, you just have to start troubleshooting. Swap drive and power connectors with another drive, etc.

FYI, not trying to disagree at all! Just adding something I have seen, nothing more. You have a lot of experience so it's unlikely you will be wrong very often. I have a lot of experience, but not as much on hardware side. I have done a fair amount, but not extensive.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
FYI, not trying to disagree at all! Just adding something I have seen, nothing more. You have a lot of experience so it's unlikely you will be wrong very often. I have a lot of experience, but not as much on hardware side. I have done a fair amount, but not extensive.
I didn't feel you were poking the bear, I just wanted to let you know that I wasn't upset or threatened, all is good. And unfortunately I do get things wrong periodically and I dislike it when that happens, but not often with HDD/SSD, just other topics, and I'm still learning NVMe. I have a little hardware experience; I used to repair hard drives in the early 1980's. Tear them apart, put them back together with some new parts as needed, do all the fine alignments. That was a hydraulically moved head assembly that had 20 heads, only 100 cylinders. In the mid 1980's I could take apart and put back together 5.25" Seagate hard drives and realign the drive. I never needed to replace parts, but things needed a cleaning and, more importantly, the head alignment was never great. They used stepper motors back then. Nowadays you have a voice coil and a signal feedback loop to move the heads where they need to go. I have repaired a modern 3.5" HDD; it was no fun worrying if my data was gone for good or not. About 4 hours later I had my data back and I transferred it to another drive. The repaired drive I tore into, cut open a window in the top of the casing and added a piece of ESD plastic as a clear window, and let others watch as the heads moved around like crazy. It's impressive. Don't get me talking, I ramble on and on. Pizza Time!
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Wow, never did that! You describe yourself as "old man", how "old"? 64 here.
 

Richard Kellogg

Dabbler
Joined
Jul 30, 2015
Messages
27
Thanks so much for the replies.

Attached are the smartctl -x /dev/ada3 results. If I understand this correctly, on 5 occasions a write was occurring and failed its CRC check, and this was at the FreeBSD disk drive level.

What I don't understand is why ZFS would consider 5 bad sector writes enough to degrade the drive. This drive has 4 TB capacity. That is about 8 billion 512-byte sectors. Over the course of 5 years, I would expect any drive to experience some number of write failures, maybe hundreds, and still be considered working. I was under the impression that disk drive drivers remap sectors that fail to spare sectors. But I really do not know if that is correct.

As for SATA cables, mine were used and could be an issue, so I've ordered replacements. However, the PC was on a shelf and not subject to jostling around. In case I forgot to mention it, I'm using TrueNAS CORE 13.0-U5.3.
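The sector figure can be checked with shell arithmetic. This takes the nominal 4 TB at face value as a decimal 4 * 10^12 bytes, which is an assumption; the drive's real usable capacity may differ.

```shell
# 4 TB (decimal, 4 * 10^12 bytes) divided into 512-byte sectors.
capacity_bytes=4000000000000
sector_bytes=512
echo $(( capacity_bytes / sector_bytes ))   # prints 7812500000
```

So the figure is roughly 7.8 billion sectors.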
 

Attachments

  • smartclt-x-ada3.txt
    16.7 KB · Views: 19

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
A drive will remap bad sectors, but these are not bad sectors (attributes 5, 197). And a drive having to remap bad sectors is usually a sign the drive is nearing EOL. For me, I'll take a couple of bad sectors early on in its life (which again these are not), but after that, the drive is being replaced. But that doesn't apply here anyway, as you don't have bad sectors.

Looks like all the CRC errors occurred at the same time. Could be the power supply too. I'll defer to Joe, but if it were me, the first step is to swap cabling with a drive with no errors. That tests several things: the port, and you get to reseat the cabling. See if the error reoccurs and, if it does, whether it follows the drive or not. ZFS definitely did the right thing here. ZFS is meant to protect your data, and "drives" that are failing CRC are not good! In this case, it's most likely not really the drive anyway. And it can't rewrite the data as it doesn't check out; the data is bad and you don't want a drive writing bad data.

IOW, the system is sending one piece of data, the drive receives a different piece of data.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You describe yourself as "old man", how "old"? 64 here
Just turned 62.

Attached is the smart -x /dev/ada3 results.
Also please attach data for a few more of the drives, ada0, ada1, ada2 so we can compare to see if it is a whole system issue or just a single drive issue.

As for the UDMA CRC Errors, the most recent one was 74 hours (3 days) ago, assuming power was applied to the system continuously.

Now I'm going to ask the difficult question, it is a memory one... What were you doing 3 days ago (Feb 29 at about noon) with the NAS machine?

The log data you sent has this:
Feb 29 12:12:20 truenas kernel: pid 34016 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Feb 29 12:12:21 truenas 1 2024-02-29T17:12:21.111679+00:00 truenas.local devd 1897 - - notify_clients: send() failed; dropping unresponsive client
Feb 29 12:12:21 truenas syslog-ng[49548]: syslog-ng shutting down; version='3.35.1'
Feb 29 12:57:44 truenas syslog-ng[2964]: syslog-ng starting up; version='3.35.1'
Feb 29 12:57:44 truenas ---<<BOOT>>---
What this looks like is the system shut down at 12:12 PM Thursday and powered back up at 12:57 that same day. I saw no "shutdown" or "reboot" listed in the log file, which has me concerned. I'm thinking your system may be crashing, or maybe you turn it off with the power button on the case instead of doing it from the GUI. Or maybe 13.0-U5.3 does not record those messages in the log file. If the system is rebooting or shutting itself down, that could be the cause of the improper shutdown being recorded on the SSD.
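A quick way to pull those markers out of the log is a grep filter along these lines. It is a sketch: boot_events is a hypothetical helper name, and the patterns are taken only from the log lines shown above plus the words "shutdown" and "reboot".

```shell
# Sketch: keep only log lines that mark a boot, a syslog-ng start/stop,
# or mention shutdown/reboot. Reads log text on stdin.
boot_events() {
    grep -i -E -e '---<<BOOT>>---|syslog-ng (starting up|shutting down)|shutdown|reboot'
}

# Usage on the NAS:
#   boot_events < /var/log/messages
```

A BOOT marker with no shutdown or reboot line shortly before it would fit a crash or a hard power-off.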

While I don't think this is the case since there is a "shutting down" message, if you are powering off using the case power button, then you may have a setup wrong in the BIOS. In the BIOS you may have the option to select Instant Off, Shutdown, or 4-second Power Off when pressing the power button. You should want the 4-second power off setting. Again, I don't think this is the issue but if it is something you are doing, stop doing it.

I recommend that you update your TrueNAS to version 13.0-U6.1 as I know in the '/var/log/messages' file if a shutdown or reboot was requested, it will be clearly listed. I don't know about the version you have.

When you update the software, Do Not update the ZFS version feature flags on your pool if asked. In the GUI you can just Dismiss it. This will ensure you can roll back to the older version without any problems.

I await (sitting in my recliner) more data. Let's hope it is just a SATA cable, trying to Keep It Simple.
 

Richard Kellogg

Dabbler
Joined
Jul 30, 2015
Messages
27
Attached is the data for 2 other drives that were in the raidz2 pool.

As for what I was doing at 12:12 on the 29th of Feb: this was after I noticed the pool was degraded. I opened the PC and hot-removed the offending drive without shutting down TrueNAS. Then I put the bad drive into a Windows box, deleted the partitions, and created an NTFS Windows disk. I played with it, and it seemed OK. I then moved it back into the TrueNAS box. For this, I powered down TrueNAS. I then had trouble booting: the box would try booting from drives other than the boot drives. So I moved the boot drives to different SATA ports (lowering their boot order). This changed the /dev/adaX assignments.

I do hope this is just a bad SATA cable, as I have replacements on order.
 

Attachments

  • smartctl-x-ada5.txt
    24.7 KB · Views: 25
  • smartctl-x-ada4.txt
    12.5 KB · Views: 17

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Other drives look good, likely cable or SATA port. They have matching unsafe shutdown counts too. You'll simply want to monitor the drives and see if all is well and no repeat once you replace cables. See if Joe agrees.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The extra data provided shows no UDMA CRC Errors for either drive. Odds are it is the data cable.

Keep track of the UDMA CRC Errors; if they start going up, then that is a problem. I saw one person who had tens of thousands of these errors, and a cable fixed it. The physical drive is likely NOT the cause.

Now for the Unsafe Shutdown Count: the last two drives have the exact same value of 39. I did a little research, and I suggest you do the same on the meaning of this SMART value so you can feel better about ignoring it, but that is my advice for now: ignore this value. Hopefully when you get to the current TrueNAS version, the values stop incrementing. That would actually be something good to know.

Now to address the ZFS DEGRADED issue... I can see that you have --
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
You might want to set both of these to a value such as 70 (7.0 seconds, which is a common setting) so that the drive tries to read the data instead of a single failure degrading the pool.
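For reference, smartctl sets these timeouts with -l scterc,READ,WRITE, where the values are in units of 100 ms. The helper below is a hypothetical sketch that only prints the command, so it can be checked before anything is run.

```shell
# Sketch: build (not run) the smartctl command that sets the read and write
# SCT Error Recovery Control timeouts. scterc_cmd is a hypothetical helper.
scterc_cmd() {  # usage: scterc_cmd SECONDS DEVICE
    ds=$(( $1 * 10 ))   # smartctl expects tenths of a second: 7 s -> 70
    echo "smartctl -l scterc,${ds},${ds} $2"
}

# Usage:
#   scterc_cmd 7 /dev/ada3         # shows the command
#   smartctl -l scterc /dev/ada3   # shows the drive's current setting
```

On many drives this setting does not survive a power cycle, so it may need to be reapplied after boot; that is worth verifying on these particular SSDs.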

The last thing to address... Run routine SMART tests. Right now you are not. I recommend a daily SMART Short test on all drives at say 2:00AM, and then a weekly SMART Long test on Sunday at 2:15AM. The tests take 2 minutes or less to run on these drives. This will at least help identify drive failures early, or that is the goal of SMART.
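In TrueNAS the normal place to schedule these is the S.M.A.R.T. Tests task in the GUI. Purely to illustrate the timing described above, here is how the same schedule would look as cron lines; smart_cron_lines is a hypothetical helper that only prints text, and the device names are examples.

```shell
# Sketch: print cron lines for a daily short test at 2:00 AM and a
# weekly long test on Sunday at 2:15 AM, one pair per device.
smart_cron_lines() {
    for dev in "$@"; do
        echo "0 2 * * * smartctl -t short $dev"    # every day, 02:00
        echo "15 2 * * 0 smartctl -t long $dev"    # Sunday (day 0), 02:15
    done
}

# Usage:
#   smart_cron_lines /dev/ada3 /dev/ada4
# A test can also be started by hand and its result read back later:
#   smartctl -t short /dev/ada3
#   smartctl -l selftest /dev/ada3
```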

Let us know how it works out.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What I don't understand is why ZFS would consider 5 bad sector writes = degrade drive. This drive as 4 TB capacity. That is 8 billion, 512 byte sectors. Over the course of 5 years, I would expect any drive would experience some number of write failures, maybe hundreds, and still be considered working. I was under the impression that disk drive drivers remap sectors that fail to spare sectors. But I really do not know if that is correct.
The problem is not the remapping of 5 sectors per se. HDDs reserve a certain capacity to do this under the covers and have been doing so for many years. So when 5 unreadable sectors are reported to the outside world, this means that multiple read attempts failed and a silent remapping was not possible. This is usually because many, many other sectors have been remapped before and the reserved space has been used up, or because something has gone catastrophically wrong.

Either way it is a situation that usually means the drive is in a really bad shape. And ZFS takes a "better safe than sorry" approach, hence the degraded status. You are of course free to ignore this. But the choice of ZFS and TrueNAS usually indicates that people care about their data.
 

Richard Kellogg

Dabbler
Joined
Jul 30, 2015
Messages
27
I purchased new SATA cables and just installed them. And then I upgraded to 13.0-U6.1. It was a bit of an ordeal though. First, I purchased two 32 GB Supermicro SuperDOMs for boot devices, since my motherboard had 2 “orange” SATA ports for the purpose. I thought I could just add them as additional mirrors to my boot pool, then remove the two 240 GB SATA devices I was using for booting, allowing room for 2 additional larger SSD drives. But as you probably already guessed, the mirror failed because the SATA DOMs were smaller than the boot drives.

So I saved the current configuration, downloaded the 13.0-U6.1 loader, and built a virgin 13.0-U6.1. Then I ran into a weird problem. I could not get the BIOS to boot from the correct drive. The boot order in the BIOS did not distinguish among HDD drives. And the SATA DOMs were on sata-4 and sata-5, so the pool disks were being selected to boot, and it failed. I could see what the order among the “HDD drives” was; there was just no means to change it. What finally worked was to remove all drives but the boot drives, then boot, then add the other drives. This changes the order among the HDD drives to put the boot devices first.

Of course it took several hours for me to figure that out. After that, I reloaded the pools and restored the configuration, which all went without any issues.

I will try your other suggestions on setting up weekly SMART tasks soon.

Thanks for your help.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Well that sounded like a fun day. I'm glad you were able to get things up and running again, now let's hope it stays that way.
 