New Drive Errors Drive or??

jengle

Dabbler
Joined
Jan 4, 2023
Messages
26
Greetings,

I think I have the sequence of events pretty close to accurate :wink:. Thanks in advance for any help to another nooob. System config in my signature.

Two days ago I finished the build-out of my first TrueNAS Scale system and let some family members know that they could check it out. To be safe, I decided to replicate my primary pool (Main) to an external drive after loading all the Apps that I wanted and about 9000 photographs. The primary pool also has an SMB share with just under 600,000 files, plus the ix-applications datasets, using 2 TiB of space for everything. During the replication I got 14 read errors. When I checked the status of the replication the next morning the system was frozen - on, but not responsive. I rebooted and the Boot pool was unavailable from Main. I re-installed TrueNAS, and put the pool on Main (as it was before) and imported my two pools, did a scrub (no errors) and a short S.M.A.R.T. test on the drive that had the errors (a brand new Seagate Ironwolf 4 TB drive). One issue was that I now had 2 checksum errors. My system has 4 SATA ports on the motherboard, and a PCIe card with 4 SATA ports. Since the checksum errors could be caused by a bad cable or controller, I moved the faulty disk from the PCIe SATA card to the motherboard (now all of pool Main is on the motherboard SATA ports, and the PCIe card has the two 6 TB drives for the 2nd pool and the boot drive). I also changed the cable for the problematic drive.

Since I didn't know the status of my replication I decided to back up my most critical files on the SMB share (using Backup4all on my desktop). During that process I got errors on the same 4 TB drive that I had moved from the PCIe SATA card to the motherboard SATA port:
  • 9 Read Errors
  • 1 Write Errors
  • 2 Checksum Errors (no change)
I decided to let the backup finish, and during that time I did a little work using the SMB share on the 2nd pool, creating a few test VMs. And lo and behold, I got errors on a new, just-replaced-under-warranty 6 TB Seagate Ironwolf drive on the 2nd pool:
  • 1 Read Errors
  • 5 Write Errors
  • 1 Checksum Errors
Now, I've been working on this system for over a month, and had it loaded up once with files and apps, then tore it down again. No errors. I'm suspicious that I may not have a drive error but something else. The long S.M.A.R.T. test on the 4 TB drive showed no errors that I can see (see attached file). I am currently running a long S.M.A.R.T. test on the 6 TB drive that showed errors during the SMB backup.

I am using non-ECC memory - is there a memory test I can do? I am not getting any ZFS errors.
CPU runs about 30C with moderate activity, and when I was doing some photo processing along with copying files it got up into the mid-40s C. CPU usage never went above about 30 or 35%.

Again, thanks in advance for the advice.
jengle
 

Attachments

  • 2022-02-10_sda_smartctl.txt
    5.6 KB · Views: 123

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't have any suggestions at present. But, could you please supply the output of the following commands, in code tags?
zpool list
zpool status

Also, did you burn in your computer?
And then burn in your drives?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I do not see anything of concern for the hard drive data you provided. @Arwen asked the same things I was going to ask.

Did you realize that you are using a Realtek NIC? This is generally problematic. It doesn't appear you are able to use ECC RAM either.
16GB RAM is not much memory for all the applications you have running, so check the SWAP partition and see how much space you are using and have used. Anything above a few KB is not good; ideally it should never be above zero, and if SWAP is being used you should add more RAM. SWAP is another reason a system can become unstable.
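
For reference, one way to read swap usage on a Linux-based system such as TrueNAS SCALE is from /proc/meminfo. The following is a minimal sketch in Python, assuming the usual SwapTotal and SwapFree fields reported in kB; the function name swap_used_kib and the sample text are made up for illustration and are not taken from the poster's system.

```python
def swap_used_kib(meminfo_text: str) -> int:
    """Return swap in use (KiB) from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])  # first token is the number
    return fields["SwapTotal"] - fields["SwapFree"]


# Illustrative sample only (hypothetical values).
SAMPLE = """MemTotal:       16316412 kB
SwapTotal:       4194300 kB
SwapFree:        4115452 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on hosts without /proc
    print(f"swap in use: {swap_used_kib(text)} KiB")
```

Anything much above zero here would be the signal described above that the box is short on RAM.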
 

jengle

Dabbler
Joined
Jan 4, 2023
Messages
26
Thanks for the quick replies and the comments; I knew this was a minimum specs system so now I'm pricing out a better motherboard, CPU and ECC memory that would also allow VMs. Wasn't quite in the budget yet though. Also thanks for the coaching on doing the correct forum formats.

Code:
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Main       14.5T  4.46T  10.1T        -         -     0%    30%  1.00x  DEGRADED  /mnt
boot-pool   928G  2.64G   925G        -         -     0%     0%  1.00x    ONLINE  -
sixterra   5.45T  1.52T  3.94T        -         -     0%    27%  1.00x    ONLINE  /mnt


Code:
  pool: Main
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0B in 02:30:39 with 0 errors on Thu Feb  9 10:42:43 2023
config:

    NAME                                      STATE     READ WRITE CKSUM
    Main                                      DEGRADED     0     0     0
      raidz2-0                                DEGRADED     0     0     0
        8d9e10df-47c9-4589-8d57-8af7ed114b6c  ONLINE       0     0     0
        f441430a-b8b2-4d4d-aa9f-98e2cf6d1e84  ONLINE       0     0     0
        7e70f78b-e01e-4533-9231-a9e3e9ee45d5  FAULTED      9     1     2  too many errors
        cfe5372a-aa0e-4322-9031-bd13276489fb  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sdg3      ONLINE       0     0     0

errors: No known data errors

  pool: sixterra
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

    NAME                                      STATE     READ WRITE CKSUM
    sixterra                                  ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        2d302038-bbaa-48f0-af40-933190942b68  ONLINE       1     5     1
        3b4ea718-ef20-49bd-9873-158251d1f19e  ONLINE       0     0     0

errors: No known data errors
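
Output like the above can be scanned for devices with non-zero READ/WRITE/CKSUM counters. This is only a sketch in Python under stated assumptions: the function name devices_with_errors is made up, the text is assumed to be plain `zpool status` output with numeric counters, and rows whose counters are abbreviated (for example with a K suffix) are simply skipped.

```python
import re

# Matches config rows such as "  <name>  ONLINE  1  5  1" (a trailing note is allowed).
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    hits = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # headers, status text, or rows this sketch cannot parse
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            hits.append((m.group(1), m.group(2), read, write, cksum))
    return hits


# The sixterra config rows from the output above.
SAMPLE = """\
    NAME                                      STATE     READ WRITE CKSUM
    sixterra                                  ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        2d302038-bbaa-48f0-af40-933190942b68  ONLINE       1     5     1
        3b4ea718-ef20-49bd-9873-158251d1f19e  ONLINE       0     0     0
"""

if __name__ == "__main__":
    for name, state, r, w, c in devices_with_errors(SAMPLE):
        print(f"{name} [{state}] read={r} write={w} cksum={c}")
```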



I did NOT burn in the system (I'll have to google how to do that), nor did I burn in the drives (I read about that, but I had done a short S.M.A.R.T. test with no errors and let the long test slip before fully loading the system). I used the default swap. Swap used: max 78.7 MiB, mean 77.3 MiB, min 76.06 MiB. Swap free: 3.91 GiB max, 3.91 GiB mean and 3.91 GiB min.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't have any suggestions why that drive failed. But, since you have moved it to the on-board SATA, you might clear the errors and replace it with itself. Then see if the errors return.

Oh, and I forgot to ask you to supply the PCIe 4 port disk controller brand & model. It is not in your signature's hardware list. This CAN make a difference. There are some that can overheat, causing errors.
 
Joined
Jun 15, 2022
Messages
674
As a note, I've found that badblocks (which does 4 write-read passes per run, each with a different pattern, meaning two runs would be 8 write-read passes) finds things beyond what a smartctl long test will find.
 

jengle

Dabbler
Joined
Jan 4, 2023
Messages
26
The 4 port SATA card is a JESOT SATA 3.0 card based on the Marvell 88SE9215; 575 ratings at 4.5 on Amazon. I'll clear the errors and let it run for a bit. Thanks @Arwen. Also thanks for the note about badblocks @WI_Hedgehog; I will give it a try.
jengle
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Joined
Jun 15, 2022
Messages
674
Do you have airflow over the hard drives and that JESOT card? smartctl will pull the drive temps if the drives report them. There may be a way to get the temp of the card...that one's a bit more fun. I'd just touch the heatsink and make sure it's not uncomfortable to touch. The drives though, they have to be kept reasonably cool if you want them to last.
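
As a rough illustration of pulling the temperature out of `smartctl -A` text, here is a minimal Python sketch. It assumes an ATA drive that reports attribute 194 (Temperature_Celsius) or 190 (Airflow_Temperature_Cel) in the usual ten-column attribute table; the function name drive_temp_c and the sample rows are made up and are not from the attached smartctl file.

```python
def drive_temp_c(smart_attr_text):
    """Return the temperature in C from ATA attribute 194 (or 190), else None."""
    found = {}
    for line in smart_attr_text.splitlines():
        cols = line.split()
        # Row layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(cols) >= 10 and cols[0] in ("190", "194") and cols[9].isdigit():
            found[cols[0]] = int(cols[9])  # first token of the raw value
    return found.get("194", found.get("190"))


# Illustrative sample only (hypothetical values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   058   040    Old_age   Always       -       36 (Min/Max 30/42)
194 Temperature_Celsius     0x0022   036   045   000    Old_age   Always       -       36 (0 18 0 0 0)
"""

if __name__ == "__main__":
    print("drive temperature:", drive_temp_c(SAMPLE), "C")
```

Drives that do not expose either attribute return None here, which matches the "if the drives report them" caveat.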

Personally, and it's maybe just me, I'll get a SAS controller and stick SATA drives on it (@jgreco got me started down the 'good controller' path). I've read a bunch here on SATA controllers not living up to what you'd hope and expect them to do, and slowly testing a few systems here seems to be proving that out. The thing with the generic ones is they get you most of the way there, and if that's good enough for you who am I to say any different? TrueNAS sure will find weak setups and report them though, that's been an unexpected blessing--no more unexplained data loss. And @jgreco --read that guy's stuff on controllers, along with what other members here have said--they're really great at keeping your data preserved without breaking the bank.

If you are going to use your current setup, perhaps consider doing:
badblocks -o /mnt/jumpdrv/log/sdX.badblocks.txt -b 512 -p 4 -svw /dev/sdX
Use smartctl -a to see if blocksize should be 512 or 4096.
NOTE that's a write test and will wipe all data clean off your drive, never to be recovered, and it'll possibly take 1-2 weeks to complete, depending on how fast your system is.
-p 4 is 4 runs (strictly, badblocks keeps scanning until 4 consecutive runs turn up no new bad blocks, so a healthy drive gets exactly 4). Each run uses four patterns, and for each pattern there is a write pass where it writes the pattern across the whole drive, followed by a read pass where it reads the whole drive and makes sure the pattern is what it should be. So that's four write-reads of the disk per run. Depending on the drive, one write pass might take 16 hours and 4 hours to read it back, so 20 hours per write-read x 4 patterns x 4 runs....
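
To put a number on that, here is the arithmetic as a tiny Python sketch, using the 16-hour write and 4-hour read figures from this post as example inputs (actual times depend on the drive and controller). The function name badblocks_hours is just for illustration.

```python
def badblocks_hours(write_h, read_h, patterns=4, runs=4):
    """Total hours: each run does one write pass and one read pass per pattern."""
    return (write_h + read_h) * patterns * runs


total = badblocks_hours(16, 4)  # example figures from the post
print(total, "hours =", round(total / 24, 1), "days")  # 320 hours = 13.3 days
```

That lands inside the 1-2 week estimate given above for a drive with no bad blocks.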

I do all drives at once by using multiple terminals (CTRL-Alt F1, F2, ...) or running an ISO image of System Rescue, then startx, and opening multiple terminals. (It works great, 9 drives just tested 'peachy' for me)

If your badblocks.txt files are all size 0, no issues. If not, issues.
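
With several drives tested at once, checking the log sizes can be scripted. This is a minimal Python sketch, assuming the logs follow the sdX.badblocks.txt naming used in the command above; the function name nonempty_logs is made up for illustration.

```python
from pathlib import Path


def nonempty_logs(log_dir, pattern="*.badblocks.txt"):
    """Return names of badblocks output files that list at least one bad block."""
    return sorted(
        p.name for p in Path(log_dir).glob(pattern) if p.stat().st_size > 0
    )


# Example call (directory taken from the command above):
#   nonempty_logs("/mnt/jumpdrv/log")
# An empty list means every log is size 0, i.e. no bad blocks were recorded.
```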

If your system runs -p 4 without issue you're probably fine. I think Mr. @jgreco suggested 10, "but ain't nobody got time for dat." (He's probably right, however, because he's always right.)

Hopefully this and the other members' advice helps you get your system tuned to a point you're happy with it and it runs the way you want.
 

jengle

Dabbler
Joined
Jan 4, 2023
Messages
26
Ok, lots to digest here. Questioning how much more I want to invest in this system as it is since there are so many items that are questionable. Thanks again @WI_Hedgehog .

jengle
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I did NOT burn in the system (I'll have to google how to do that), nor did I burn in the drives (I read about that, but I had done a short S.M.A.R.T. with no errors and let it slip doing the long test before fully loading the system).
I can see how it is tempting to get the new "toy" (meant purely as a positive term!) up and running. But brand new HDDs have a likelihood of failure that is similar to drives that have 5-8 years of service under their belt. In other words: There is a relatively high risk of HDDs dying in the first couple of weeks. For truly critical systems I would burn in the HDDs for at least 4 weeks, probably more. I did a bit more than 12 weeks here, and still had a failure after 6 or 7 months.

And of course a good backup is always advisable.
 
Joined
Jun 15, 2022
Messages
674
You callin' me cheap, punk? :smile: ;-)
I might conclude you're tighter than Kirstie Alley in a pair of spandex pants. :wink:

Though I cannot fault your frugality,
for it's saved money for both you and me,
indubitably. :cool:
 
Joined
Jun 15, 2022
Messages
674
Quote of the Day --
"ZFS eats RAM like a fat man at an all-you-can-eat fried chicken buffet." -- Anonymous

THAT WAS NOT 'ANONYMOUS,' and I love fried chicken. I just thought it a bit juvenile for the thread, though patently true...have you seen the TrueNAS console report how ZFS chews through RAM??? Greasy...


(Feel free to undelete the post if you feel it adds to the conversation...)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
THAT WAS NOT 'ANONYMOUS,' and I love fried chicken. I just thought it a bit juvenile for the thread, though patently true...

I'm happy to credit you if you wish. However, since you had deleted this comment, I did not feel right posting an attribution.

The comment was sufficiently spot-on that I couldn't see letting it pass on without other forum members getting a laugh out of it. With more than 17,000 posts, whatever is in my signature is appended to messages all over these forums. :smile: Also, I've been under some pressure to "like stuff" so it doubles as trolling @winnielinnie for a general round of forum fun.
 
Joined
Jun 15, 2022
Messages
674
Joined
Oct 22, 2019
Messages
3,641
@WI_Hedgehog, so close... you came so close... I salute you. Don't give up.

We will see it some day...

Never stop believing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I might conclude you're tighter than Kirstie Alley in a pair of spandex pants. :wink:

I have to assume you're talking more along the lines of "Fat Actress" era Kirstie (and I cannot convince myself that this is defamatory as she was the producer of a series named Fat Actress, perhaps tasteless though) rather than "Vulcan Eyebrows" Kirstie. I always pictured myself as more of a Buck Rogers, oh, wait, yeah, Action Hero Makeover... my conclusion is that it just can't be fun to be in a role where you have molded action figures made of you. Or where you have to wear spandex to begin with.

a bit juvenile for the thread,

It's Friday afternoon, even tech lists like NANOG can get a bit juvenile after a long week.
 
Joined
Jun 15, 2022
Messages
674
NANOG, the secret hiding place of men sporting the facial hair of a Wookie...



Buck Rogers...I dated a woman who looked like Erin Gray...that's a story best shared over a few beers though.
 