ATA error count increased. Can't find the disk it's referring to?

Tha_Reaper

Dabbler
Joined
Dec 2, 2023
Messages
10
I'm a complete beginner with TrueNAS and Linux, but after running my server for 3 months I got an error in TrueNAS:

device: /dev/sdg [SAT], ATA error count increased from 0 to 1.

However, when I look at all the disks in my system, I can't find a /dev/sdg drive.

[Attached image: 1707767256681.png]


sda, sdb, sdc, sde and sdf are my 5 Ultrastar 6 TB disks.
sdd is a 500 GB SSD that runs my apps (Kingston, 2 weeks old).
nvme1n1 is the boot NVMe (sketchy Chinese quality, but I keep a backup of the config files).
nvme0n1 is an L2ARC for the spinning rust (good quality, 2 weeks old).

When I run smartctl -a /dev/sdg I get:
Smartctl open device: /dev/sdg failed: No such device

Am I just too stupid to understand this? Am I looking in the wrong place? Or is the error I received bogus, and should I ignore it?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That's like the second problematic /dev/sdg in a week. Not-so-lucky number 7.

It's possible that it died outright in the meantime. Are you sure there aren't any other mass storage devices tucked away in your server in some forgotten corner?
 

Tha_Reaper

All disks are accounted for. The server also only has 6 SATA connections, which are taken up by my 5 disks and the app-pool SSD.
 

Ericloewe

Hmm... What's the output of ls -l /dev/disk/by-id?
 

Tha_Reaper

I ran some things:

Code:
admin@truenas[~]$ ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTB25L -> ../../sde
lrwxrwxrwx 1 root root 10 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTB25L-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTJ28L -> ../../sdb
lrwxrwxrwx 1 root root 10 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTJ28L-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTJZEL -> ../../sdf
lrwxrwxrwx 1 root root 10 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTJZEL-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTKAUL -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTKAUL-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTL5YL -> ../../sdc
lrwxrwxrwx 1 root root 10 Feb 12 20:02 ata-HGST_HUS726T6TALE604_V9JTL5YL-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 ata-KINGSTON_SA400S37480G_50026B7784D22189 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 12 20:03 ata-KINGSTON_SA400S37480G_50026B7784D22189-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Feb 12 20:02 dm-name-nvme1n1p4 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Feb 12 20:02 dm-uuid-CRYPT-PLAIN-nvme1n1p4 -> ../../dm-0
lrwxrwxrwx 1 root root 13 Feb 12 20:02 nvme-SSD_M.2_PCIe3_128GB_InnovationIT_014762308301403 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-SSD_M.2_PCIe3_128GB_InnovationIT_014762308301403-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 13 Feb 12 20:02 nvme-ShiJi_128GB_M.2-NVMe_2023092600022 -> ../../nvme1n1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-ShiJi_128GB_M.2-NVMe_2023092600022-part1 -> ../../nvme1n1p1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-ShiJi_128GB_M.2-NVMe_2023092600022-part2 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-ShiJi_128GB_M.2-NVMe_2023092600022-part3 -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-ShiJi_128GB_M.2-NVMe_2023092600022-part4 -> ../../nvme1n1p4
lrwxrwxrwx 1 root root 13 Feb 12 20:02 nvme-nvme.126f-303134373632333038333031343033-5353445f4d2e325f50434965335f31323847425f496e6e6f766174696f6e4954-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-nvme.126f-303134373632333038333031343033-5353445f4d2e325f50434965335f31323847425f496e6e6f766174696f6e4954-00000001-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 13 Feb 12 20:02 nvme-nvme.126f-32303233303932363030303232-5368694a69203132384742204d2e322d4e564d65-00000001 -> ../../nvme1n1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-nvme.126f-32303233303932363030303232-5368694a69203132384742204d2e322d4e564d65-00000001-part1 -> ../../nvme1n1p1
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-nvme.126f-32303233303932363030303232-5368694a69203132384742204d2e322d4e564d65-00000001-part2 -> ../../nvme1n1p2
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-nvme.126f-32303233303932363030303232-5368694a69203132384742204d2e322d4e564d65-00000001-part3 -> ../../nvme1n1p3
lrwxrwxrwx 1 root root 15 Feb 12 20:02 nvme-nvme.126f-32303233303932363030303232-5368694a69203132384742204d2e322d4e564d65-00000001-part4 -> ../../nvme1n1p4
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x5000cca0bde74178 -> ../../sde
lrwxrwxrwx 1 root root 10 Feb 12 20:02 wwn-0x5000cca0bde74178-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x5000cca0bde75440 -> ../../sdb
lrwxrwxrwx 1 root root 10 Feb 12 20:02 wwn-0x5000cca0bde75440-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x5000cca0bde757a9 -> ../../sdf
lrwxrwxrwx 1 root root 10 Feb 12 20:02 wwn-0x5000cca0bde757a9-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x5000cca0bde7590a -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 12 20:02 wwn-0x5000cca0bde7590a-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x5000cca0bde75c34 -> ../../sdc
lrwxrwxrwx 1 root root 10 Feb 12 20:02 wwn-0x5000cca0bde75c34-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Feb 12 20:02 wwn-0x50026b7784d22189 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 12 20:03 wwn-0x50026b7784d22189-part1 -> ../../sdd1



Code:
admin@truenas[~]$ lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda             8:0    0   5.5T  0 disk
└─sda1          8:1    0   5.5T  0 part
sdb             8:16   0   5.5T  0 disk
└─sdb1          8:17   0   5.5T  0 part
sdc             8:32   0   5.5T  0 disk
└─sdc1          8:33   0   5.5T  0 part
sdd             8:48   0 447.1G  0 disk
└─sdd1          8:49   0 447.1G  0 part
sde             8:64   0   5.5T  0 disk
└─sde1          8:65   0   5.5T  0 part
sdf             8:80   0   5.5T  0 disk
└─sdf1          8:81   0   5.5T  0 part
nvme1n1       259:0    0 119.2G  0 disk
├─nvme1n1p1   259:1    0     1M  0 part
├─nvme1n1p2   259:2    0   512M  0 part
├─nvme1n1p3   259:3    0 102.7G  0 part
└─nvme1n1p4   259:4    0    16G  0 part
  └─nvme1n1p4 253:0    0    16G  0 crypt [SWAP]
nvme0n1       259:5    0 119.2G  0 disk
└─nvme0n1p1   259:6    0 119.2G  0 part


Code:
admin@truenas[~]$ sudo zpool status
  pool: Klemmers
 state: ONLINE
  scan: scrub repaired 0B in 00:44:25 with 0 errors on Mon Jan  1 00:44:27 2024
config:


        NAME                                      STATE     READ WRITE CKSUM
        Klemmers                                  ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            bce74d2d-d157-4e36-acb4-653dab1a9c2d  ONLINE       0     0     0
            813b3319-1c1c-4d0a-8fb5-60d3f229b865  ONLINE       0     0     0
            c35c7d97-a9da-4847-8aae-999913c68a1d  ONLINE       0     0     0
            b1d95636-17c1-415f-bbb7-816ad2594aba  ONLINE       0     0     0
            61381e0c-5359-43d3-b1a7-870d1d1a9cd5  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0


errors: No known data errors


  pool: apps
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Sun Jan 21 00:00:01 2024
config:


        NAME                                    STATE     READ WRITE CKSUM
        apps                                    ONLINE       0     0     0
          1acc454d-fce4-4ed1-842b-b792957ff6e1  ONLINE       0     0     0


errors: No known data errors


  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:24 with 0 errors on Mon Feb 12 03:45:25 2024
config:


        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme1n1p3  ONLINE       0     0     0


errors: No known data errors


To me this all looks good... and there's no sign of /dev/sdg.
 

Ericloewe

Bizarre. No USB flash drives that were detached?
 

Tha_Reaper

No. I fiddled around with a USB thumb drive 4 weeks ago, but I disconnected it and rebooted afterwards. That would not explain the error showing up just now.

Thanks a lot for trying to help me out here, by the way. I forgot to add that to my earlier messages, and I can't seem to edit them (yet?).

EDIT: Now I can edit!
I just remembered that I have a UPS connected via USB. I guess not, but could that cause this kind of error?
 

Ericloewe

Not unless they're doing something silly like presenting a USB mass storage device to provide a Windows driver. Some old 3G adapters did something like that, but with a virtual CD drive.
 

Tha_Reaper

I wouldn't totally rule out me doing something stupid, but I don't think that happened.

I did a full scrub of both my pools, which finished without errors.

I also ran an extended SMART test, which turned up something on /dev/sdf, but I don't know what to make of it. I honestly think it's harmless (and it's on a different disk than the one in the original error):

Code:
admin@truenas[~]$ sudo smartctl -a /dev/sdf
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726T6TALE604
Serial Number:    V9JTJZEL
LU WWN Device Id: 5 000cca 0bde757a9
Firmware Version: VKGNW4B0
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb 13 09:58:14 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 681) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   152   152   024    Pre-fail  Always       -       397 (Average 396)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2408
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       109
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       109
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 14/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2


SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.


Error 2 occurred at disk power-on lifetime: 2407 hours (100 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0


  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a8 08 bd 94 40 08  23d+21:51:05.439  READ FPDMA QUEUED
  60 00 48 08 c5 94 40 08  23d+21:51:05.356  READ FPDMA QUEUED
  60 00 a0 08 b5 94 40 08  23d+21:51:05.356  READ FPDMA QUEUED
  60 00 40 08 ad 94 40 08  23d+21:51:05.335  READ FPDMA QUEUED
  60 00 98 08 a5 94 40 08  23d+21:51:05.335  READ FPDMA QUEUED


Error 1 occurred at disk power-on lifetime: 2392 hours (99 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0


  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 40 20 00 42 3d 40 08  23d+06:48:49.801  READ FPDMA QUEUED
  60 40 18 c0 41 3d 40 08  23d+06:48:49.763  READ FPDMA QUEUED
  60 40 58 c0 3f 3d 40 08  23d+06:48:49.760  READ FPDMA QUEUED
  60 40 10 80 3f 3d 40 08  23d+06:48:48.825  READ FPDMA QUEUED
  60 40 08 40 3f 3d 40 08  23d+06:48:48.824  READ FPDMA QUEUED


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2406         -
# 2  Extended offline    Completed without error       00%      2121         -
# 3  Extended offline    Completed without error       00%      1378         -
# 4  Extended offline    Completed without error       00%      1124         -
# 5  Extended offline    Completed without error       00%       646         -
# 6  Extended offline    Completed without error       00%        30         -
# 7  Short offline       Completed without error       00%         7         -
# 8  Extended offline    Aborted by host               90%         5         -


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So you could be looking at a SATA cable or connection issue with this drive. Keep an eye on UDMA_CRC_Error_Count (attribute 199). If it increments, you need to examine the connection. Normally these errors come from neither the controller nor the hard drive. If you examine the SMART data from all your drives and see a non-zero value on the others as well, then it could be the motherboard chipset or a bad batch of SATA cables. If you have the drives mounted in a removable tray system, that could be suspect as well.
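
A rough sketch of how that counter could be checked across all drives in one go. It assumes smartctl is installed, that the drives show up as /dev/sdX, and that the attribute table has the same column layout as the output posted above (raw value in the tenth column). The function names are made up for illustration.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` output supplied on stdin.
# Prints nothing if the attribute is not present.
crc_raw_value() {
    awk '$1 == 199 { print $10 }'
}

# Hypothetical helper: report attribute 199 for every /dev/sdX disk.
# Assumes sudo is available and configured for smartctl.
report_crc_counts() {
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue
        count=$(sudo smartctl -A "$dev" | crc_raw_value)
        printf '%s UDMA_CRC_Error_Count=%s\n' "$dev" "${count:-n/a}"
    done
}
```

Calling `report_crc_counts` now and again a few weeks later gives two snapshots to compare. The raw value is a lifetime counter, so what matters is whether it keeps climbing.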

As for the sdg issue... I suspect this was actually sdf. When the system powers up it assigns drives as it sees fit (normally by which one is ready first), but it is odd to skip a letter. That is the only thing that makes some sort of sense. Keep an eye on things for a while.
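
One way to test that guess is to search the system log for the name. This is only a sketch: it assumes the journal still holds entries from around the time of the alert, and the function name is made up for illustration.

```shell
#!/bin/sh
# Keep only log lines that mention the given device name as a whole word,
# e.g. "[sdg]" or "/dev/sdg", but not "sdg1" or "sdga".
# Reads log text on stdin; the device name is the first argument.
lines_for_device() {
    grep -w -- "$1"
}
```

For example, `sudo journalctl --no-pager | lines_for_device sdg` would list every logged line that mentions sdg. If a disk was briefly detected under that name, the kernel messages around those lines should show which serial number or port it belonged to.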
 

Ericloewe

Very few SMART tests have been run, which isn't great, but the disk isn't reporting anything beyond some errors on the interface (possibly due to a bad or loose cable).
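
To see at a glance how many tests a drive has logged, something like the following could tally the self-test log by type. It is a minimal sketch that assumes the log format shown in the output above; the function name is made up for illustration.

```shell
#!/bin/sh
# Tally entries in a SMART self-test log by test type (Short, Extended, ...).
# Reads `smartctl -l selftest` output on stdin and counts every logged
# entry, including aborted ones. Output order is not guaranteed.
selftest_tally() {
    awk '/^#[ ]*[0-9]+/ {
        # Drop the "# 1" / "#10" entry number so $1 becomes the test type.
        sub(/^#[ ]*[0-9]+[ ]*/, "")
        n[$1]++
    }
    END { for (t in n) printf "%s=%d\n", t, n[t] }'
}
```

Usage would be along the lines of `sudo smartctl -l selftest /dev/sdf | selftest_tally`. Comparing the totals against Power_On_Hours gives a feel for how often tests actually run.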
 

Tha_Reaper

Thanks. The error didn't happen during boot, but after the system had been up for 20 days, so the timing is a bit weird.
You make a good point about the connection issue. The SATA cables are cheap ones from AliExpress, and I use a Jonsbo N2 case, so removable drive trays as well.
I'll keep an eye out and stop worrying for the time being. Thanks to both of you for your time. I'll report back if anything changes.
 