New SSD Faulted state

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
Hello, I purchased two used SSDs and configured them as a RAID 1 mirror. However, I have noticed several times that one of the drives shows a faulted state.
SAMSUNG_MZ7WD960HAGP-00003
I detached the drive and ran the following command:
badblocks -wsv /dev/sdb
The SMART tests also do not show any problems. The drives have about 50k hours on them, but the badblocks test found nothing either. :/
1695842328065.png

1695842368593.png

I have changed the SATA cable several times, but the problem persists. Sometimes, when I transfer something, the speed drops to zero and hangs for a minute. When I check the log, it shows me write and read errors on this device only sdb.
 

Attachments

  • 1695838268367.png (28.1 KB)
  • 1695838383024.png (170.6 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Wow - using -w on an SSD is a good way of doing the drive no good at all.
The drives should be decent, being mixed-load enterprise drives.

Hardware list please, as per forum rules.
 
  • Like
Reactions: vii

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
badblocks -wsv /dev/sdb
@NugentS beat me to it, but I echo that thought. I cringed when I saw this was being run on an SSD. You should not run commands that you do not fully understand. What you have done is use up some of your SSD's life: you wrote a lot of test data to the drive, and every write causes an erase cycle, which is what wears out a drive. In general (some drives are better, some worse) there are about 3000 erase cycles per block, and once a block has been erased too many times, it fails.
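For anyone reading later, a rough sketch of why one `-w` run costs so much, plus the read-only alternatives that cost nothing; the ~3000-cycle budget is a generic assumption, not a spec for this particular drive:

```shell
# Read-only alternatives (no erase cycles consumed):
#   badblocks -sv /dev/sdb      # badblocks without -w or -n is read-only
#   smartctl -t long /dev/sdb   # the drive's own extended self-test
# For scale: a single 'badblocks -wsv' run writes the full drive 4 times
# (test patterns 0xaa, 0x55, 0xff, 0x00), i.e. roughly 4 P/E cycles per
# block out of an assumed ~3000-cycle budget.
passes=4
budget=3000
echo "one -w run uses about $passes of ~$budget cycles per block"
```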

Since you say the drives have over 50K hours, you should post the SMART results as well so we can tell you whether the drive has much life left in it.
 
  • Like
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
@NugentS beat me to it, but I echo that thought. I cringed when I saw this was being run on an SSD. You should not run commands that you do not fully understand. What you have done is use up some of your SSD's life: you wrote a lot of test data to the drive, and every write causes an erase cycle, which is what wears out a drive. In general (some drives are better, some worse) there are about 3000 erase cycles per block, and once a block has been erased too many times, it fails.

Since you say the drives have over 50K hours, you should post the SMART results as well so we can tell you whether the drive has much life left in it.

I messed up; I'm new to this stuff, so I will be more careful in the future.
Is there any way to clear the faulted state and keep the drive operating? I used it for an AI project to cache things temporarily. But I don't know why it became faulted before I ran this command. That's why I took the risk; I didn't think there was anything I could do once it became faulted. :(
This is the first time I've used this command. Thanks for warning me about the risk.

root@truenas[~]# smartctl -l selftest /dev/sdb -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7WD960HAGP-00003
Serial Number:    S186NYADB02278
LU WWN Device Id: 5 002538 500106dc4
Firmware Version: DXM8DW3Q
User Capacity:    880,180,674,560 bytes [880 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 28 12:42:11 2023 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (53956) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  70) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       59810
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       273
177 Wear_Leveling_Count     0x0013   098   098   005    Pre-fail  Always       -       285
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       20992
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   046   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       262
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       500649527541

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%          7        -
# 2  Short offline       Completed without error       00%          7        -
# 3  Offline             Completed without error       00%          7        -
# 4  Extended offline    Completed without error       00%         48        -
# 5  Extended offline    Completed without error       00%         48        -
# 6  Short offline       Completed without error       00%         29        -
# 7  Short offline       Completed without error       00%         29        -
# 8  Short offline       Completed without error       00%         24        -
# 9  Short offline       Completed without error       00%         24        -
#10  Short offline       Completed without error       00%         36        -
#11  Short offline       Completed without error       00%         36        -
#12  Short offline       Completed without error       00%         12        -
#13  Short offline       Completed without error       00%         12        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@truenas[~]#
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I repeat - hardware list please - full specs as per forum rules

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
I repeat - hardware list please - full specs as per forum rules
  • CPU: i7 4770
  • Motherboard: ASUS Z87-PLUS
  • Storage: 2 x 1TB HDD (RAID 1 mirror) + 2 x 1TB SSD (RAID 1 mirror)
  • Boot (OS): 2 x 128GB SSD
  • PCIe 10Gb Network card
    The motherboard + CPU is new; it was in a closed store for years.

 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Your SSD is telling us that it has 98% life left and no failed blocks; not bad for an SSD with almost 60K hours and 233TB written.
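For anyone following along, those figures fall straight out of the SMART attributes posted above (sector size 512 bytes per the smartctl header):

```shell
# 177 Wear_Leveling_Count: normalized VALUE=098 -> about 2% of rated wear used
# 241 Total_LBAs_Written raw value, times 512-byte sectors:
lbas=500649527541
bytes=$((lbas * 512))
tib=$((bytes / 1024 / 1024 / 1024 / 1024))
echo "total written: ${tib} TiB"   # ~233 TiB
```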
 
  • Like
Reactions: vii

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
[pedant] the motherboard + CPU is not new - it's quite old - just unused [/pedant]

Can you post the result of a "zpool status -v" please
 
  • Like
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
1695955863859.png

This screenshot was taken on 20/9, when the problem first appeared. After that, I ran the badblocks test. Below is the latest zpool status; I have run zpool clear and restarted my NAS many times.
  pool: storage.data.charlotte
 state: ONLINE
  scan: scrub repaired 0B in 00:15:31 with 0 errors on Thu Sep 28 20:08:12 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        storage.data.charlotte      ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            sdc2                    ONLINE       0     0     0
            sdb2                    ONLINE       0     0     0
All config
root@truenas[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Sep 29 06:15:36 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdd3    ONLINE       0     0     0

errors: No known data errors

  pool: storage.data.charlotte
 state: ONLINE
  scan: scrub repaired 0B in 00:15:31 with 0 errors on Thu Sep 28 20:08:12 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage.data.charlotte                    ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            6a9a0390-9d77-4044-a208-da8c22dac41f  ONLINE       0     0     0
            a12a2e2d-bcb1-42cc-9ec2-097bb96e37b6  ONLINE       0     0     0

errors: No known data errors

  pool: storage.data.violet
 state: ONLINE
  scan: scrub repaired 0B in 02:46:26 with 0 errors on Wed Sep 27 19:02:33 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage.data.violet                       ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d64750ec-0007-4a5b-8228-8b2c21e3b66f  ONLINE       0     0     0
            0a56a742-40c4-45b4-926c-31a70ad99ae8  ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I need to learn something new @NugentS: why is the zpool status output flipping back and forth between the drive/dev ID (sdb/sdc) and the gptid?

Of course I do not see anything obviously wrong with the OP's system, all looks good right now with the exception of my question above.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.

@vii you appear to have 6 disks in total. Which ports on the motherboard are you using?

Intel® Z87 chipset :
6 x SATA 6Gb/s port(s), yellow
Support Raid 0, 1, 5, 10
Supports Intel® Dynamic Storage Accelerator, Intel® Smart Response Technology, Intel® Rapid Start Technology, Intel® Smart Connect Technology*3
ASMedia® ASM1061 controller : *4
2 x SATA 6Gb/s port(s), dark brown
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.
I was looking specifically at the charlotte pool, it changed from sdb/sdc to gptid. Just something I don't recall seeing in the past.
 

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
I was looking specifically at the charlotte pool, it changed from sdb/sdc to gptid. Just something I don't recall seeing in the past.
Yeah, I use the -L flag with the zpool status -v command. It makes it easy to identify which drive has the problems and errors.

zpool status -v -L
Screenshot 2023-09-29 164316.png
Screenshot 2023-09-29 164333.png
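For reference, -L just resolves the /dev/disk/by-partuuid symlinks that ZFS prints by default, so vdevs show up as plain device names; the grep pattern below is only an example using a UUID from this thread:

```shell
# zpool status -v -L    # vdevs shown as sdb2/sdc2 instead of partuuids
# The same mapping can be done by hand:
#   lsblk -o NAME,PARTUUID | grep 6a9a0390
flag="-L"
echo "zpool status -v ${flag} resolves symlinks to real device names"
```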

@joeschmuck - mine does the same. The boot-pool comes up as sdp3 & sdo3, whereas the data pools come up as gptid. Those drives are enterprise 5 DWPD units - mixed load - so they should have excellent endurance.

@vii you appear to have 6 disks in total. Which ports on the motherboard are you using?

Intel® Z87 chipset :
6 x SATA 6Gb/s port(s), yellow
Support Raid 0, 1, 5, 10
Supports Intel® Dynamic Storage Accelerator, Intel® Smart Response Technology, Intel® Rapid Start Technology, Intel® Smart Connect Technology*3
ASMedia® ASM1061 controller : *4
2 x SATA 6Gb/s port(s), dark brown

The devices are connected to the motherboard's six main SATA ports; the two additional ports come from the ASMedia controller. I am using all six main ports (1-6). I tried the two extra ports before, but they were extremely slow.
This is their order: sdb is on port 2; before the badblocks test it was on port 4. I swapped them to make sure the port was not the problem.
1695995359238.png

I thought the problem might be with the SSD itself; perhaps something in its controller causes it to drop out momentarily.
I think the best thing to do is to use the good SSD as a cache for the HDDs to speed things up. Maybe that is the best option.
Thanks to you guys for helping me with this case. I have learned a lot.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
How much memory do you have?
A cache (L2ARC) probably won't help, depending on how many users you have, how much memory you have, and how you use things.
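A rough community rule of thumb (an assumption, not a hard limit): L2ARC headers consume RAM, so keep L2ARC to roughly 5x the ARC size, and with 12 GB of RAM that caps it quite low:

```shell
# Assumed sizing heuristic: L2ARC <= ~5x ARC; ARC here guessed at ~half of RAM.
ram_gb=12
arc_gb=$((ram_gb / 2))
l2arc_max_gb=$((arc_gb * 5))
echo "suggested L2ARC ceiling: ${l2arc_max_gb} GB"   # ~30 GB with 12 GB RAM
```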
 
  • Like
Reactions: vii

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
yeah i use the -L flag with the zpool status -v command. It makes it easy to identify which drive has the problem and errors.
Der, I should have known something like that. Thanks.
 
  • Love
Reactions: vii

vii

Dabbler
Joined
Aug 17, 2023
Messages
13
How much memory do you have?
A cache (L2ARC) probably won't help, depending on how many users you have, how much memory you have, and how you use things.
1696009338304.png

Two 2GB RAM sticks of the same brand and two 4GB RAM sticks of the same brand, totaling 12 GB, all DDR3 1333. Only two users use the NAS, and we don't work at the same time. I have set up Pi-hole for local DNS management. My requirement is just to be able to read data at a minimum of 500 MB/s. Additionally, I want to back up this data using RAID 1. All AI/ML projects are executed on my PC.
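As a sanity check on the 500 MB/s target, simple line-rate arithmetic (theoretical maxima, ignoring protocol overhead):

```shell
# 10 GbE = 10,000 Mbit/s; divide by 8 for MB/s
nic_mbps=$((10 * 1000 / 8))
# SATA 6 Gb/s uses 8b/10b encoding, so divide by 10 rather than 8
sata_mbps=$((6 * 1000 / 10))
echo "10GbE: ${nic_mbps} MB/s; SATA III: ~${sata_mbps} MB/s per drive"
```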
 