Getting Critical ZFS State DEGRADED

Status
Not open for further replies.

Nick Longo

Dabbler
Joined
Jan 30, 2015
Messages
17
Looking for some assistance before I make a mistake and bring the whole thing down. I've been investigating the cause of my DEGRADED volume. I appear to have one drive showing as "REMOVED" from the volume.

This is the output of "zpool status -v":
------------------------------------------------------------------------
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da1p2       ONLINE       0     0     0

errors: No known data errors

  pool: shared
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        shared                                          ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/43c32367-a8fb-11e4-b4eb-001676cda656  ONLINE       0     0     0
            gptid/4424473f-a8fb-11e4-b4eb-001676cda656  ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/bedc8ea7-c2cb-11e4-afcd-b8ac6fd983df  ONLINE       0     0     0
            gptid/bfc96bfd-c2cb-11e4-afcd-b8ac6fd983df  ONLINE       0     0     0
            5638978837161857912                         REMOVED      0     0     0  was /dev/gptid/c0c855bf-c2cb-11e4-afcd-b8ac6fd983df

errors: No known data errors
------------------------------------------------------------------------
I have two data pools; the pool "storage" is degraded because of the third disk, listed at the bottom as "REMOVED".
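For anyone who wants to catch this condition from a script rather than by eye, here is a minimal Python sketch. It is not part of FreeNAS, and the function name `find_unhealthy_devices` is made up for illustration. It assumes the plain-text layout of the `zpool status` output above and simply reports every config row whose state is anything other than ONLINE.

```python
# States that a pool or vdev row can show in `zpool status`.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}


def find_unhealthy_devices(status_text):
    """Return (pool, name, state) for every config row that is not ONLINE.

    Assumes a "pool: NAME" line introduces each pool and that config rows
    have the form "NAME STATE READ WRITE CKSUM [note]".
    """
    problems = []
    pool = None
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "pool:":
            pool = fields[1]
        elif len(fields) >= 5 and fields[1] in STATES and fields[1] != "ONLINE":
            problems.append((pool, fields[0], fields[1]))
    return problems


# Rows copied from the "storage" pool in the output above.
sample = """\
pool: storage
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
storage DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
gptid/bedc8ea7-c2cb-11e4-afcd-b8ac6fd983df ONLINE 0 0 0
gptid/bfc96bfd-c2cb-11e4-afcd-b8ac6fd983df ONLINE 0 0 0
5638978837161857912 REMOVED 0 0 0 was /dev/gptid/c0c855bf-c2cb-11e4-afcd-b8ac6fd983df
"""

for pool, name, state in find_unhealthy_devices(sample):
    print(pool, name, state)
```

With the sample rows this lists the pool itself, the raidz1-0 vdev, and the member with GUID 5638978837161857912, which is the one that actually needs attention. Splitting on whitespace keeps the sketch independent of how the columns are indented.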

The drive is still visible in "camcontrol devlist", though:
-----------------------------------------------------------------------------------
<WDC WD1002FAEX-00Z3A0 05.01D05> at scbus0 target 0 lun 0 (ada0,pass0)
<HL-DT-ST DVDRWBD CT10N A105> at scbus1 target 0 lun 0 (cd0,pass1)
<ST31000528AS CC3E> at scbus2 target 0 lun 0 (ada1,pass2)
<WDC WD30EZRX-00AZ6B0 80.00A80> at scbus2 target 1 lun 0 (ada2,pass3)
<WDC WD30EZRX-00DC0B0 80.00A80> at scbus2 target 2 lun 0 (ada3,pass4)
<WDC WD30EZRX-00SPEB0 80.00A80> at scbus2 target 3 lun 0 (pass5,ada4)
<Port Multiplier 575f197b 000e> at scbus2 target 15 lun 0 (pass6,pmp0)
<Generic- Multi-Card 1.00> at scbus7 target 0 lun 0 (da0,pass7)
<hp v221w 1638> at scbus8 target 0 lun 0 (da1,pass8)
----------------------------------------------------------------------------------------------------
I believe it is the drive linked to "ada4".
The reason I believe this is that the output of "gpart show" indicates "scheme not found for geom ada4". This is where I start to wonder what the right step is to resolve this without potentially bringing down the whole array.

Below is the output of "smartctl -q noserial -a /dev/ada4". It doesn't seem to indicate anything obvious to me.
----------------------------------------------------------------------------------------------------
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00SPEB0
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 22 11:10:32 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (41280) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 414) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       19
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       425
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4162
194 Temperature_Celsius     0x0022   114   106   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----------------------------------------------------------------------------------------------------

Any help/guidance is appreciated. Things had been working well for a few weeks, but this is a new setup with a Mediasonic PROBOX 4-bay enclosure connected via eSATA. The 3TB drives are new as well and have been running for about 3 weeks.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Oh boy... A port multiplier? That's the single worst thing you could do short of hardware RAID. Those things are unreliable as hell.
In any case, ada4 is quickly becoming a paperweight. 16 pending sectors is quite bad. Replace it ASAP.
Also, you obviously never ran SMART tests on these drives, so do yourself a favor and set them up.

In the future, please use CODE tags to make CLI stuff easier to parse and please don't use a smaller typeface.
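The pending-sector count pointed out above is easy to miss in a long attribute table. As a rough illustration only, the Python sketch below picks the sector-health attributes out of `smartctl -a` rows and reports any with a non-zero raw value. The helper name `flag_sector_problems` and the choice of watched attributes are assumptions made for this example, not a smartmontools feature, and it assumes the raw value in the tenth column is a plain integer, as it is in the rows posted in this thread.

```python
# Attributes whose raw value is generally expected to stay at zero.
# The selection is an assumption for this sketch, not an official list.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def flag_sector_problems(attribute_rows):
    """Return {attribute_name: raw_value} for watched attributes above zero.

    Assumes rows shaped like the SMART attribute table in this thread:
    attribute name in the second column, raw value in the tenth.
    """
    flagged = {}
    for line in attribute_rows.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                flagged[fields[1]] = raw
    return flagged


# Rows copied from the smartctl output for ada4 earlier in the thread.
rows = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

print(flag_sector_problems(rows))
```

For the ada4 rows this flags only Current_Pending_Sector with a raw value of 16, even though the overall-health self-assessment on the same drive still reads PASSED.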
 

Nick Longo

Dabbler
Joined
Jan 30, 2015
Messages
17
Thanks. I had heard that the self-tests (the long test especially) can put unnecessary stress on the drive if run too often. What is a typical "rotation" for automatically running tests? Is something like daily "short" tests and weekly "long" tests typical?

I missed recognizing the sector issue.

I noticed you have listed using the Icy Dock FatCage - do you have it in a large tower? I also purchased the "red" drives this time.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
A typical rotation is a long test every two weeks with a scrub on the other weeks, with optional short tests every day.
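To make that rotation concrete, here is a small Python sketch that decides what would run on a given date. Several details are assumptions that the post does not specify: the weekly slot is taken to be Sunday, even ISO weeks get the long test, odd ISO weeks get the scrub, and the function name `tasks_for` is invented for the example. In FreeNAS itself the schedule would be configured as S.M.A.R.T. test and scrub tasks rather than scripted like this.

```python
from datetime import date


def tasks_for(day, weekly_day=6):
    """Return the maintenance tasks for a given date.

    Assumptions (not from the thread): the weekly slot is Sunday
    (weekday() == 6); even ISO weeks run the long SMART test and
    odd ISO weeks run the scrub.
    """
    tasks = ["short SMART test"]  # the optional daily short test
    if day.weekday() == weekly_day:
        iso_week = day.isocalendar()[1]
        tasks.append("long SMART test" if iso_week % 2 == 0 else "scrub")
    return tasks


print(tasks_for(date(2015, 3, 22)))  # a Sunday in an even ISO week
print(tasks_for(date(2015, 3, 29)))  # the following Sunday
print(tasks_for(date(2015, 3, 23)))  # a Monday
```

Alternating on ISO week parity is only one simple way to interleave the two jobs. Around a year boundary the parity can repeat (week 53 followed by week 1), so a real schedule might count weeks from a fixed start date instead.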

Regarding the chassis, I wouldn't call it "large", it's an average-sized chassis. I looked around a bit for the most 5.25" bays I could reasonably get.
 

Lucky Sidz

Dabbler
Joined
Aug 25, 2014
Messages
11
Hi

Could you please help me solve this problem? I keep getting the alert "The volume RSDATADRIVE (ZFS) status is DEGRADED".
Please help me before I lose any data...

Any help/guidance is appreciated.
ss1.png

ss2.png
ss3.png
ss4.png
ss5.png
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Post the same output the OP did, and also read the forum rules so you know what information to provide. Have you read the manual? You should!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
...and you've somehow managed to provide screenshots of everything under the Storage page except what would be helpful: the volume status page.
 

Lucky Sidz

Dabbler
Joined
Aug 25, 2014
Messages
11
Hi DNB

Thanks for the suggestion... believe me, it was helpful. I could see the volume status report, so I replaced the appropriate disk and applied the "replace" command.
 

Lucky Sidz

Dabbler
Joined
Aug 25, 2014
Messages
11
Hi, sorry for the inconvenience. Next time I will read the documentation before posting my issues...

Regards
Sidz
 