Vrakfall
Hello!
So, a few days ago (let's not talk about the many months I shamefully spent ignoring the errors) I saw that one of my 3 main drives had stopped working completely. By that I mean there hasn't been a single read or write on it for a long time and the volume reports it as `unavailable`. Fortunately, it's part of a RAIDZ1 pool (the ZFS counterpart of RAID 5), so its loss isn't a huge problem yet. By the way, my server is on `FreeNAS-9.10.2-U4 (27ae72978)`.
The first messages I encountered were these two:
Code:
Device: /dev/ada1, 3 Currently unreadable (pending) sectors
The volume mainVolume (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Indeed, when I check the pool status, that specific drive is unavailable.
I then read a few things and tried to follow the instructions in this first link, with additional help from this second link.
I checked the short SMART test results but didn't see anything that looked interesting to me, so I then ran a long SMART test, which gave me the following output after a few hours:
Code:
~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00D8PB0
Serial Number:    WD-WMC4N0694041
LU WWN Device Id: 5 0014ee 6ae8c1170
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Jun 9 16:54:07 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 401) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0027   212   174   021    Pre-fail  Always       -       4366
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       250
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26864
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       249
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       69
193 Load_Cycle_Count        0x0032   146   146   000    Old_age   Always       -       163776
194 Temperature_Celsius     0x0022   117   099   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     26825         -
# 2  Extended offline    Completed without error       00%     25466         -
# 3  Extended offline    Completed without error       00%     25458         -
# 4  Extended offline    Completed without error       00%     25450         -
# 5  Extended offline    Completed without error       00%     25442         -
# 6  Extended offline    Completed without error       00%     25434         -
# 7  Extended offline    Completed without error       00%     25426         -
# 8  Extended offline    Completed without error       00%     25418         -
# 9  Extended offline    Completed without error       00%     25410         -
#10  Extended offline    Completed without error       00%     25402         -
#11  Extended offline    Completed without error       00%     25395         -
#12  Extended offline    Completed without error       00%     25378         -
#13  Extended offline    Completed without error       00%     25370         -
#14  Extended offline    Interrupted (host reset)      70%     25360         -
#15  Extended offline    Completed without error       00%     25358         -
#16  Extended offline    Completed without error       00%     25350         -
#17  Extended offline    Completed without error       00%     25340         -
#18  Extended offline    Completed without error       00%     25314         -
#19  Extended offline    Completed without error       00%     25306         -
#20  Extended offline    Completed without error       00%     25299         -
#21  Extended offline    Completed without error       00%     25291         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Don't be misled by the `10.3` in the smartctl banner: the server is not running FreeNAS 10. The banner reads `FreeBSD 10.3-STABLE`, which, as far as I can tell, is simply the FreeBSD release that FreeNAS 9.10 is built on.
I might be misreading the output, but from what I understand it reports no errors: every extended self-test completed without error, so there is no LBA of a failing sector that I could force-write. I don't know what to make of that. I know my drives are old (one is a bit newer because it showed sector errors a few months after I bought it, so I used the warranty to claim a replacement) and they will probably fail before long, this one especially, even if it turns out to be fixable. I guess I'll buy a new one soon enough. But, from these outputs, do you think my drive is "full kaput" and I should replace it ASAP, or can I do something to mark the faulty sectors as bad and keep using it? Or is it just a ZFS error that could have been caused by a power failure at my house, so that I only need to wipe the disk completely and replace it with itself in the pool (and let it resilver)? I haven't tried that last solution yet as I wanted some advice first.
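One thing worth cross-checking against that reading is the attribute table itself, not only the self-test log. The following is a minimal Python sketch, not something that was run on the server: it copies four rows from the smartctl output above and reports which sector-related counters have a non-zero raw value. The `WATCHED_IDS` selection, the regular expression, and the function names are assumptions made for illustration; smartctl defines none of them.

```python
import re

# Four rows copied from the attribute table in the smartctl output above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs I chose to watch for media problems (an assumption, not an official list).
WATCHED_IDS = {5, 196, 197, 198}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_id: {"name", "value", "worst", "thresh", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            attr_id, name, value, worst, thresh, raw = match.groups()
            attrs[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs


def nonzero_watched(attrs):
    """Return {name: raw} for watched attributes whose raw value is above zero."""
    return {
        a["name"]: a["raw"]
        for attr_id, a in sorted(attrs.items())
        if attr_id in WATCHED_IDS and a["raw"] > 0
    }


attrs = parse_attributes(SMART_ROWS)
flagged = nonzero_watched(attrs)
print(flagged)
```

With the rows above this prints `{'Current_Pending_Sector': 3}`, which matches the `3 Currently unreadable (pending) sectors` alert even though the self-test log shows no failing LBA.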
Also, I ran a zpool scrub just to try but nothing really new came out of it. Here's the output:
Code:
~# zpool status -v mainVolume
  pool: mainVolume
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 23h32m with 0 errors on Fri Jun 9 14:05:02 2017
config:
        NAME                      STATE     READ WRITE CKSUM
        mainVolume                DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            4384224381238940287   UNAVAIL      0     0     0  was /dev/gptid/########.eli
            gptid/########.eli    ONLINE       0     0     0
            gptid/########.eli    ONLINE       0     0     0
errors: No known data errors
(I hid the gptids in the output above.)
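To make the pool layout easier to read, here is a minimal Python sketch that parses the `config:` table from the output above and lists the devices ZFS cannot use. It works on a copy of the text with the gptids still masked; `UNUSABLE_STATES` and the function names are hypothetical helpers of mine, not part of any ZFS tool.

```python
# The config table copied from the zpool status output above (gptids masked).
ZPOOL_CONFIG = """\
        NAME                      STATE     READ WRITE CKSUM
        mainVolume                DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            4384224381238940287   UNAVAIL      0     0     0  was /dev/gptid/########.eli
            gptid/########.eli    ONLINE       0     0     0
            gptid/########.eli    ONLINE       0     0     0
"""

# Vdev states in which a device contributes nothing to the pool
# (my own grouping for this sketch).
UNUSABLE_STATES = {"UNAVAIL", "FAULTED", "OFFLINE", "REMOVED"}


def parse_config(text):
    """Return one dict per row: name, state, and (read, write, cksum) counters."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip the header and anything that is not a table row
        name, state, read, write, cksum = fields[:5]
        rows.append(
            {"name": name, "state": state, "errors": (int(read), int(write), int(cksum))}
        )
    return rows


def unusable_devices(rows):
    """Return the names of rows whose state means the device cannot be used."""
    return [row["name"] for row in rows if row["state"] in UNUSABLE_STATES]


rows = parse_config(ZPOOL_CONFIG)
missing = unusable_devices(rows)
print(missing)
```

With the table above, only the numeric GUID row is reported, and every READ/WRITE/CKSUM counter is zero, which fits the "could not be opened" wording: the device is missing from the pool, not returning bad data.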
So, what do you think about this?
Thank you in advance!
P.S.: Don't hesitate to be technical with me.
P.S.2: I hid the gptids because I wasn't sure whether it's safe to let them leak publicly. Is it? And are they needed for debugging? I don't think so, but I can always be wrong!
Edit:
Here's the full spec of that machine, as asked. It's an old workstation that I recycled, adding only new drives. It's a very cheap setup as I only had a small budget (which hasn't changed over time...), but I'm very happy with the results so far, as it only cost me 3 disks and a bit of RAM.
- FreeNAS 9.10.2-U4
- Re-purposed Dell Precision 380 Workstation (PWS380):
- Motherboard: CN-0CJ774-70821-65SI0XE
- Socket: LGA775
- Chipset: Intel 955X
- CPU: Intel Pentium 4 HT 650 @3.4GHz (1 core) (Found by semi-guess, see the following cpuid)
Code:
~# cpuid
 eax in    eax      ebx      ecx      edx
00000000 00000005 756e6547 6c65746e 49656e69
00000001 00000f4a 01020800 0000649d bfebfbff
00000002 605b5001 00000000 00000000 007d7040
00000003 00000000 00000000 00000000 00000000
00000004 00004121 01c0003f 0000001f 00000000
00000005 00000040 00000040 00000000 00000000
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 20202020 20202020 20202020 6e492020
80000003 286c6574 50202952 69746e65 52286d75
80000004 20342029 20555043 30342e33 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 08006040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003024 00000000 00000000 00000000
Vendor ID: "GenuineIntel"; CPUID level 5
Intel-specific functions:
Version 00000f4a:
Type 0 - Original OEM
Family 15 - Pentium 4
Model 4 - Intel Pentium 4 processor (generic) or newer
Stepping 10
Reserved 0
Extended brand string: " Intel(R) Pentium(R) 4 CPU 3.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 1
Hyper threading siblings: 2
Feature flags set 1 (CPUID.01H:EDX): bfebfbff:
FPU Floating Point Unit
VME Virtual 8086 Mode Enhancements
DE Debugging Extensions
PSE Page Size Extensions
TSC Time Stamp Counter
MSR Model Specific Registers
PAE Physical Address Extension
MCE Machine Check Exception
CX8 COMPXCHG8B Instruction
APIC On-chip Advanced Programmable Interrupt Controller present and enabled
SEP Fast System Call
MTRR Memory Type Range Registers
PGE PTE Global Flag
MCA Machine Check Architecture
CMOV Conditional Move and Compare Instructions
FGPAT Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH CFLUSH instruction
DS Debug store
ACPI Thermal Monitor and Clock Ctrl
MMX MMX instruction set
FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
SSE Streaming SIMD Extensions instruction set
SSE2 SSE2 extensions
SS Self Snoop
HT Hyper Threading
TM Thermal monitor
31 Pending Break Enable
Feature flags set 2 (CPUID.01H:ECX): 0000649d:
SSE3 SSE3 extensions
DTES64 64-bit debug store
MONITOR MONITOR/MWAIT instructions
DS-CPL CPL Qualified Debug Store
EST Enhanced Intel SpeedStep Technology
CNXT-ID L1 Context ID
CX16 CMPXCHG16B
xTPR Send Task Priority messages
Extended feature flags set 1 (CPUID.80000001H:EDX): 20100800
SYSCALL SYSCALL/SYSRET instructions
XD-bit Execution Disable bit
EM64T Intel Extended Memory 64 Technology
Extended feature flags set 2 (CPUID.80000001H:ECX): 00000001
LAHF LAHF/SAHF available in IA-32e mode
Old-styled TLB and cache info:
50: Instruction TLB: 4KB, 2MB or 4MB pages, fully assoc., 64 entries
5b: Data TLB: 4KB or 4MB pages, fully assoc., 64 entries
60: 1st-level data cache: 16-KB, 8-way set associative, sectored cache, 64-byte line size
40: No 2nd-level cache, or if 2nd-level cache exists, no 3rd-level cache
70: Trace cache: 12K-micro-op, 8-way set assoc
7d: 2nd-level cache: 2-MB, 8-way set associative, 64-byte line size
Processor serial: 0000-0F4A-0000-0000-0000-0000
Deterministic Cache Parameters:
index=0: eax=00004121 ebx=01c0003f ecx=0000001f edx=00000000
> Data cache, level 1, self initializing
> 32 sets, 8 ways, 1 partitions, line size 64
> full size 16384 bytes
> shared between up to 2 threads
index=1: eax=00004143 ebx=01c0103f ecx=000007ff edx=00000000
> Unified cache, level 2, self initializing
> 2048 sets, 8 ways, 2 partitions, line size 64
> full size 2097152 bytes
> shared between up to 2 threads
- RAM: 2x512MB + 2x2048MB (Both DDR2 533MHz, Dual-channeled)
Code:
~# dmidecode --type memory
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.3 present.
Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 4 GB
        Error Information Handle: No Error
        Number Of Devices: 4
Handle 0x1100, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 512 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_1
        Bank Locator: Not Specified
        Type: DDR
        Type Detail: Synchronous
        Speed: 533 MHz
        Manufacturer: CE00000000000000
        Serial Number: F8165BCF
        Asset Tag: Not Specified
        Part Number: M3 78T6553CZ3-CD5
Handle 0x1101, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_3
        Bank Locator: Not Specified
        Type: DDR
        Type Detail: Synchronous
        Speed: 533 MHz
        Manufacturer: 0000000000000000
        Serial Number: 00000006
        Asset Tag: Not Specified
        Part Number: V01D2LF2GB18818867
Handle 0x1102, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 512 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_2
        Bank Locator: Not Specified
        Type: DDR
        Type Detail: Synchronous
        Speed: 533 MHz
        Manufacturer: CE00000000000000
        Serial Number: F8165BD6
        Asset Tag: Not Specified
        Part Number: M3 78T6553CZ3-CD5
Handle 0x1103, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_4
        Bank Locator: Not Specified
        Type: DDR
        Type Detail: Synchronous
        Speed: 533 MHz
        Manufacturer: 0000000000000000
        Serial Number: 00000006
        Asset Tag: Not Specified
        Part Number: V01D2LF2GB18818867
- Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)
- SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
- The old disks were removed and I put these in instead:
- mainVolume (RAIDz1): 3x3TB Western Digital WDC WD30EZRX-00D8PB0 (ATA/ATAPI-9 SATA 3.x)
- downloadVolume (Temporary volume I don't care about losing): 1x1TB Seagate ST31000340AS (ATA/ATAPI-8 SATA 2.x)
- UPS: Eaton Protection Station 500 (2.2V 250W/500VA). I'll change it soon enough, since it sometimes doesn't even hold during simple power failures and it has no USB feature...
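As a small sanity check on the memory figures in the dmidecode output above, this minimal Python sketch adds up the `Size:` lines and compares the total with the reported `Maximum Capacity`. The excerpt is trimmed to the relevant lines, and the variable names are my own; whether the board can actually address all installed memory is not something this output alone can confirm.

```python
import re

# Relevant lines trimmed from the dmidecode --type memory output above.
DMIDECODE_EXCERPT = """\
Maximum Capacity: 4 GB
Locator: DIMM_1
Size: 512 MB
Locator: DIMM_3
Size: 2048 MB
Locator: DIMM_2
Size: 512 MB
Locator: DIMM_4
Size: 2048 MB
"""

# Every "Size: N MB" line describes one installed module.
module_sizes_mb = [
    int(size)
    for size in re.findall(r"^\s*Size:\s+(\d+)\s+MB", DMIDECODE_EXCERPT, re.M)
]
installed_mb = sum(module_sizes_mb)

# "Maximum Capacity: N GB" is what the memory array claims to support.
max_match = re.search(r"Maximum Capacity:\s+(\d+)\s+GB", DMIDECODE_EXCERPT)
max_mb = int(max_match.group(1)) * 1024

print(f"installed: {installed_mb} MB, reported maximum: {max_mb} MB")
print(f"difference: {installed_mb - max_mb} MB")
```

The four modules add up to 5120 MB, which is 1024 MB more than the 4 GB maximum that dmidecode reports for the memory array.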