New SSD Faulted state

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
Hello, I purchased two used SSDs and configured them as a RAID 1 mirror. However, I have noticed several times that one of the drives shows a faulted state.
SAMSUNG_MZ7WD960HAGP-00003
I detached the drive and ran the following command:
badblocks -wsv /dev/sdb
The SMART tests also do not show any problems. The drives have about 50k hours on them, but the badblocks test found nothing either. :/
1695842328065.png

1695842368593.png

I have changed the SATA cable several times, but the problem persists. Sometimes, when I transfer something, the speed drops to zero and hangs for a minute. When I check the log, it shows me write and read errors on this device only sdb.
 

Attachments

  • 1695838268367.png (28.1 KB)
  • 1695838383024.png (170.6 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Wow - using -w on an SSD is a good way of doing the drive no good at all.
The drives should be decent, being mixed-load enterprise drives.

Hardware list please, as per forum rules.
 
  • Like
Reactions: vii

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
badblocks -wsv /dev/sdb
@NugentS beat me to it, but I echo that thought. I cringed when I saw this was being run on an SSD. You should not run commands that you do not fully understand. What you have done is use up some of your SSD's life: you wrote a lot of test data to the drive, and every write causes an erase cycle, which is what wears out a drive. In general (some drives are better, some worse) there are about 3000 erase cycles per block, and once a block has been erased too many times, it fails.
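For anyone reading later, a rough sketch of why one `-w` run costs so much, plus the read-only alternatives that cost nothing; the ~3000-cycle budget is a generic assumption, not a spec for this particular drive:

```shell
# Read-only alternatives (no erase cycles consumed):
#   badblocks -sv /dev/sdb      # badblocks without -w or -n is read-only
#   smartctl -t long /dev/sdb   # the drive's own extended self-test
# For scale: a single 'badblocks -wsv' run writes the full drive 4 times
# (test patterns 0xaa, 0x55, 0xff, 0x00), i.e. roughly 4 P/E cycles per
# block out of an assumed ~3000-cycle budget.
passes=4
budget=3000
echo "one -w run uses about $passes of ~$budget cycles per block"
```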

Since you say the drives have over 50K hours, you should post the SMART results as well so we can tell you whether the drive has much life left in it.
 
  • Like
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
@NugentS beat me to it, but I echo that thought. I cringed when I saw this was being run on an SSD. You should not run commands that you do not fully understand. What you have done is use up some of your SSD's life: you wrote a lot of test data to the drive, and every write causes an erase cycle, which is what wears out a drive. In general (some drives are better, some worse) there are about 3000 erase cycles per block, and once a block has been erased too many times, it fails.

Since you say the drives have over 50K hours, you should post the SMART results as well so we can tell you whether the drive has much life left in it.

I messed up; I'm new to this stuff, so I will be more careful in the future.
Is there any way to clear the faulted state and keep the drive operating? I used it for an AI project to cache things temporarily. But I don't know why it became faulted before I ran this command. That's why I took the risk; I didn't think there was anything I could do once it became faulted. :(
This is the first time I've used this command. Thanks for warning me about the risk.

root@truenas[~]# smartctl -l selftest /dev/sdb -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7WD960HAGP-00003
Serial Number:    S186NYADB02278
LU WWN Device Id: 5 002538 500106dc4
Firmware Version: DXM8DW3Q
User Capacity:    880,180,674,560 bytes [880 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 28 12:42:11 2023 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (53956) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  70) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       59810
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       273
177 Wear_Leveling_Count     0x0013   098   098   005    Pre-fail  Always       -       285
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       20992
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   046   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       262
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       500649527541

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%          7        -
# 2  Short offline       Completed without error       00%          7        -
# 3  Offline             Completed without error       00%          7        -
# 4  Extended offline    Completed without error       00%         48        -
# 5  Extended offline    Completed without error       00%         48        -
# 6  Short offline       Completed without error       00%         29        -
# 7  Short offline       Completed without error       00%         29        -
# 8  Short offline       Completed without error       00%         24        -
# 9  Short offline       Completed without error       00%         24        -
#10  Short offline       Completed without error       00%         36        -
#11  Short offline       Completed without error       00%         36        -
#12  Short offline       Completed without error       00%         12        -
#13  Short offline       Completed without error       00%         12        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@truenas[~]#
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I repeat - hardware list please - full specs as per forum rules

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
I repeat - hardware list please - full specs as per forum rules
  • CPU: i7 4770
  • Motherboard: ASUS Z87-PLUS
  • Storage: 2 x 1TB HDD (RAID 1 mirror) + 2 x 1TB SSD (RAID 1 mirror)
  • Boot (OS): 2 x 128GB SSD
  • PCIe 10Gb Network card
    The motherboard + CPU is new; it was in a closed store for years.

 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Your SSD is telling us that it has 98% life left and no failed blocks; not bad for an SSD with almost 60K hours and 233TB written.
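For anyone following along, those figures fall straight out of the SMART attributes posted above (sector size 512 bytes per the smartctl header):

```shell
# 177 Wear_Leveling_Count: normalized VALUE=098 -> about 2% of rated wear used
# 241 Total_LBAs_Written raw value, times 512-byte sectors:
lbas=500649527541
bytes=$((lbas * 512))
tib=$((bytes / 1024 / 1024 / 1024 / 1024))
echo "total written: ${tib} TiB"   # ~233 TiB
```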
 
  • Like
Reactions: vii

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
[pedant] the motherboard + CPU is not new - it's quite old - just unused [/pedant]

Can you post the result of a "zpool status -v" please
 
  • Like
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
1695955863859.png

This screenshot was taken on 20/9, when the problem first appeared. After that, I ran the badblocks test. Below is the latest zpool status; I have run zpool clear and restarted my NAS many times.
  pool: storage.data.charlotte
 state: ONLINE
  scan: scrub repaired 0B in 00:15:31 with 0 errors on Thu Sep 28 20:08:12 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        storage.data.charlotte      ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            sdc2                    ONLINE       0     0     0
            sdb2                    ONLINE       0     0     0
All config
root@truenas[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Sep 29 06:15:36 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdd3    ONLINE       0     0     0

errors: No known data errors

  pool: storage.data.charlotte
 state: ONLINE
  scan: scrub repaired 0B in 00:15:31 with 0 errors on Thu Sep 28 20:08:12 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage.data.charlotte                    ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            6a9a0390-9d77-4044-a208-da8c22dac41f  ONLINE       0     0     0
            a12a2e2d-bcb1-42cc-9ec2-097bb96e37b6  ONLINE       0     0     0

errors: No known data errors

  pool: storage.data.violet
 state: ONLINE
  scan: scrub repaired 0B in 02:46:26 with 0 errors on Wed Sep 27 19:02:33 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage.data.violet                       ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d64750ec-0007-4a5b-8228-8b2c21e3b66f  ONLINE       0     0     0
            0a56a742-40c4-45b4-926c-31a70ad99ae8  ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I need to learn something new @NugentS: why is the zpool status output flipping back and forth between the drive/dev ID (sdb/sdc) and the gptid?

Of course I do not see anything obviously wrong with the OP's system, all looks good right now with the exception of my question above.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.

@vii you appear to have 6 disks in total. Which ports on the motherboard are you using?

Intel® Z87 chipset :
6 x SATA 6Gb/s port(s), yellow
Support Raid 0, 1, 5, 10
Supports Intel® Dynamic Storage Accelerator, Intel® Smart Response Technology, Intel® Rapid Start Technology, Intel® Smart Connect Technology*3
ASMedia® ASM1061 controller : *4
2 x SATA 6Gb/s port(s), dark brown
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.
I was looking specifically at the charlotte pool, it changed from sdb/sdc to gptid. Just something I don't recall seeing in the past.
 

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
I was looking specifically at the charlotte pool, it changed from sdb/sdc to gptid. Just something I don't recall seeing in the past.
Yeah, I use the -L flag with the zpool status -v command. It makes it easy to identify which drive has the problems and errors.

zpool status -v -L
Screenshot 2023-09-29 164316.png
Screenshot 2023-09-29 164333.png
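For reference, -L just resolves the /dev/disk/by-partuuid symlinks that ZFS prints by default, so vdevs show up as plain device names; the grep pattern below is only an example using a UUID from this thread:

```shell
# zpool status -v -L    # vdevs shown as sdb2/sdc2 instead of partuuids
# The same mapping can be done by hand:
#   lsblk -o NAME,PARTUUID | grep 6a9a0390
flag="-L"
echo "zpool status -v ${flag} resolves symlinks to real device names"
```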

@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.

@vii you appear to have 6 disks in total. Which ports on the motherboard are you using?

Intel® Z87 chipset :
6 x SATA 6Gb/s port(s), yellow
Support Raid 0, 1, 5, 10
Supports Intel® Dynamic Storage Accelerator, Intel® Smart Response Technology, Intel® Rapid Start Technology, Intel® Smart Connect Technology*3
ASMedia® ASM1061 controller : *4
2 x SATA 6Gb/s port(s), dark brown

The devices are connected to the motherboard's six main SATA ports; the two additional ports come from the ASMedia controller. I am using all six main ports (1-6). I tried the two extra ports before, but they were extremely slow.
This is their order: sdb is on port 2; before the badblocks test it was on port 4. I swapped them to make sure the port was not the problem.
1695995359238.png

I thought the problem might be with the SSD itself; perhaps something in its controller causes it to drop out momentarily.
I think the best thing to do is to use the good SSD as a cache for the HDDs to speed things up. Maybe that is the best option.
Thanks to you guys for helping me with this case. I have learned a lot.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
How much memory do you have?
A cache (L2ARC) probably won't help, depending on how many users you have, how much memory you have, and how you use things.
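A rough community rule of thumb (an assumption, not a hard limit): L2ARC headers consume RAM, so keep L2ARC to roughly 5x the ARC size, and with 12 GB of RAM that caps it quite low:

```shell
# Assumed sizing heuristic: L2ARC <= ~5x ARC; ARC here guessed at ~half of RAM.
ram_gb=12
arc_gb=$((ram_gb / 2))
l2arc_max_gb=$((arc_gb * 5))
echo "suggested L2ARC ceiling: ${l2arc_max_gb} GB"   # ~30 GB with 12 GB RAM
```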
 
  • Like
Reactions: vii

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
yeah i use the -L flag with the zpool status -v command. It makes it easy to identify which drive has the problem and errors.
Der, I should have known something like that. Thanks.
 
  • Love
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
How much memory do you have?
A cache (L2ARC) probably won't help, depending on how many users you have, how much memory you have, and how you use things.
1696009338304.png

Two 2GB RAM sticks of the same brand and two 4GB RAM sticks of the same brand, totaling 12 GB, all DDR3 1333. Only two users use the NAS, and we don't work at the same time. I have set up Pi-hole for local DNS management. My requirement is just to be able to read data at a minimum of 500 MB/s. Additionally, I want to back up this data using RAID 1. All AI/ML projects are executed on my PC.
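As a sanity check on the 500 MB/s target, simple line-rate arithmetic (theoretical maxima, ignoring protocol overhead):

```shell
# 10 GbE = 10,000 Mbit/s; divide by 8 for MB/s
nic_mbps=$((10 * 1000 / 8))
# SATA 6 Gb/s uses 8b/10b encoding, so divide by 10 rather than 8
sata_mbps=$((6 * 1000 / 10))
echo "10GbE: ${nic_mbps} MB/s; SATA III: ~${sata_mbps} MB/s per drive"
```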
 