Which device has experienced an error?

Status
Not open for further replies.

Bostjan

Contributor
Joined
Mar 24, 2014
Messages
122
When I run zpool status I get:
One or more devices has experienced an error resulting in data corruption. Applications may be affected.

How and where can I find out which device it is?



---------------------------------------------------------------------
1st run of
# zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 6h27m with 0 errors on Sun Apr 26 06:27:32 2015
config:

NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/5635b592-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
    gptid/571c8555-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/57c8d0a5-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
    gptid/589247dc-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/a0de1680-d712-11e4-88b5-d050991b6427  ONLINE       0     0     0
    gptid/a1c305d7-d712-11e4-88b5-d050991b6427  ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:

tank/.system:<0x9a>
<0x1c5>:<0xcf1c>

pool: freenas-boot
state: ONLINE
scan: none requested
config:

NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0

errors: No known data errors




---------------------------------------------------------------------
Then I ran
zpool scrub tank


---------------------------------------------------------------------
After the scrub
# zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 6h9m with 0 errors on Tue May 19 03:32:53 2015
config:

NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/5635b592-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
    gptid/571c8555-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/57c8d0a5-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
    gptid/589247dc-2b8b-11e4-8622-d050991b6427  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/a0de1680-d712-11e4-88b5-d050991b6427  ONLINE       0     0     0
    gptid/a1c305d7-d712-11e4-88b5-d050991b6427  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

tank/.system:<0x9a>
<0x1c5>:<0xcf1c>

pool: freenas-boot
state: ONLINE
scan: none requested
config:

NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0

errors: No known data errors
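
For anyone who only wants the list of damaged objects out of output like the above, here is a minimal sketch. It assumes the output layout shown in this post. The function name permanent_errors is made up for illustration and is not part of ZFS. It is written for sh, so if your login shell is csh, run it under sh.

```shell
# Hypothetical helper: print the entries listed under
# "errors: Permanent errors have been detected ..." for each pool.
# Reads `zpool status -v` output on stdin.
permanent_errors() {
    awk '
        /^[ \t]*pool:/                    { pool = $2; in_list = 0; next }
        /^[ \t]*errors: Permanent errors/ { in_list = 1; next }
        /^[ \t]*errors:/                  { in_list = 0; next }
        in_list && NF > 0 {
            sub(/^[ \t]+/, "")            # drop leading indentation
            print pool ": " $0
        }
    '
}

# Usage (on the NAS itself):
#   zpool status -v | permanent_errors
```

Note that entries such as <0x1c5>:<0xcf1c> are object IDs rather than file names, so the list names the damaged objects and not the disk that caused the damage.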
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Did you reboot since the error happened?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Doesn't this error mean the data is completely bad and can't be fixed? You could probably delete the .system folder and reboot. After the reboot the folder should be recreated automatically. I would test this before doing it on a live system.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
SweetAndLow said:
Doesn't this error mean the data is completely bad and can't be fixed? You could probably delete the .system folder and reboot. After the reboot the folder should be recreated automatically. I would test this before doing it on a live system.

Yeah. I think the error counters are zero because they were reset by a reboot or something similar. The file itself is corrupted; there is no direct way around it.

Since it's "just" the .system dataset, it's probably fixable, though.
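
On the error counters mentioned above: when they have not been reset, the READ, WRITE and CKSUM columns are the place that points at a specific device. A minimal sketch for picking out the non-zero rows follows. It assumes the column layout shown earlier in the thread, and the function name nonzero_error_counters is hypothetical.

```shell
# Hypothetical helper: list pool members whose READ, WRITE or CKSUM
# counter is non-zero. Reads `zpool status` output on stdin.
nonzero_error_counters() {
    awk '
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes]
        # Requiring numeric counters skips lines such as
        # "errors: No known data errors", which also has five fields.
        NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Usage:
#   zpool status | nonzero_error_counters
```

With the output posted in this thread it prints nothing, because every counter is 0.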
 

Bostjan

Contributor
Joined
Mar 24, 2014
Messages
122
It was rebooted in between.


- How do I access the directory tank/.system?

- How do I delete it?

- What is in this .system dataset? Is it really safe to delete it?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
First of all, post the output of smartctl -a /dev/adaX for all drives.
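
A minimal sketch for collecting that output in one go. The drive names in the usage line are an assumption (six disks named ada0 to ada5), so adjust them to the disks actually present. smart_report_all and the SMARTCTL override variable are made-up helpers, not part of smartmontools. Written for sh; if your login shell is csh, run it under sh.

```shell
# Command to run; override SMARTCTL to point at another binary if needed.
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: print `smartctl -a` for every disk named on the
# command line, with a separator so the reports can be told apart.
smart_report_all() {
    for disk in "$@"; do
        echo "===== /dev/$disk ====="
        # smartctl can return a non-zero status even when it prints a
        # report, so report the status instead of aborting the loop.
        "$SMARTCTL" -a "/dev/$disk" || echo "(smartctl exited with status $?)"
    done
}

# Usage (drive list is an example only):
#   smart_report_all ada0 ada1 ada2 ada3 ada4 ada5 > smart-report.txt
```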
 

Bostjan

Contributor
Joined
Mar 24, 2014
Messages
122
Do you mean this?

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WCC4N7ZFRN50
LU WWN Device Id: 5 0014ee 2b5faa069
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed May 20 21:25:21 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39060) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 392) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   170   170   021    Pre-fail  Always       -       6475
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1050
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       39882
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That drive seems ok. That leaves the other one.

Big problem, however: You haven't run any SMART tests. It's not much of a wonder you ended up in this position.

What bugs me is that the disk says it's ok, which disagrees with the fact that corruption could not be fixed...

Which leads me to the question "What exactly is your hardware?"
 

Bostjan

Contributor
Joined
Mar 24, 2014
Messages
122
Yes, I have run SMART tests, haven't I? I have set SMART tests to run in the GUI. Is this correct?
[Attached screenshot: FoPBtqU.png]


This is the smartctl -a /dev/adaX output:
https://www.dropbox.com/s/7cu70knr8w604ff/smartctl -a -dev-adaX.txt?dl=0

FreeNAS 9.3 Stable
ASRock C2750D4I
1*8 GB ECC RAM
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

ada4 and ada5 do not have SMART tests scheduled. Just edit your existing settings to add those two drives. You have to do this whenever you replace or add drives, for now at least.

ada1 is very dubious at the moment (18 spin retries, lots of command timeouts).

Still, for corruption on a mirror to be unrepairable, I'd expect at least two dying drives, both in the same vdev. This is all rather odd, but the SMART test results for ada4 and ada5 might help. You can run them manually with smartctl -t long /dev/adaX.
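
For reference, a minimal sketch of starting the long tests on both drives and reading the results afterwards. smartctl -t long comes from the post above, and smartctl -l selftest prints the self-test log once a test has finished. The function names and the SMARTCTL override variable are hypothetical.

```shell
# Command to run; override SMARTCTL to point at another binary if needed.
SMARTCTL="${SMARTCTL:-smartctl}"

# Start an extended (long) self-test on each named disk.
# The test runs inside the drive, so this returns immediately.
start_long_tests() {
    for disk in "$@"; do
        "$SMARTCTL" -t long "/dev/$disk" || echo "could not start test on $disk" >&2
    done
}

# Once the estimated time has passed, print each disk's self-test log.
show_selftest_logs() {
    for disk in "$@"; do
        echo "===== /dev/$disk ====="
        "$SMARTCTL" -l selftest "/dev/$disk" || true
    done
}

# Usage:
#   start_long_tests ada4 ada5
#   ... several hours later ...
#   show_selftest_logs ada4 ada5
```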
 

Bostjan

Contributor
Joined
Mar 24, 2014
Messages
122
I was also reading all of this smartctl -a /dev/adaX output, but I didn't find any useful information.
Can you please tell me which lines gave you all this information?


I have now manually run smartctl -t long /dev/ada4 and it told me that it would take 392 minutes to finish. Is that normal? Can I use FreeNAS normally in the meantime? Can I also run the SMART test for ada5 at the same time?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Bostjan said:
Is that normal? Can I use FreeNAS normally in the meantime? Can I also run the SMART test for ada5 at the same time?

Yes to all 3 questions ;)
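
The 392 minutes is the drive's own estimate, printed in the "Extended self-test routine recommended polling time" lines of the smartctl output posted earlier. A minimal sketch for pulling that figure out and converting it to hours, assuming that two-line layout; the function name extended_test_estimate is made up.

```shell
# Hypothetical helper: read `smartctl -c` (or `smartctl -a`) output on
# stdin and print the drive's estimate for the extended self-test.
extended_test_estimate() {
    awk '
        /^Extended self-test routine/ { want = 1; next }
        want {
            gsub(/[^0-9]/, "")        # keep only the digits, e.g. "392"
            if ($0 != "")
                printf "%d min (%dh%02dm)\n", $0, int($0 / 60), $0 % 60
            want = 0
        }
    '
}

# Usage:
#   smartctl -c /dev/ada4 | extended_test_estimate
```

For the drive shown in this thread that works out to about six and a half hours.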
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Bostjan said:
I was also reading all of this smartctl -a /dev/adaX output, but I didn't find any useful information.
Can you please tell me which lines gave you all this information?

The lines labeled Spin_Retry_Count and Command_Timeout.
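
A minimal sketch for pulling just those lines out of a long report. Spin_Retry_Count and Command_Timeout are the two named above; the other attribute names in the filter are an added selection of commonly watched ones, not something from this thread. It assumes the ten-column attribute table that smartctl -a prints, with the raw value in column 10. The function name smart_key_attributes is hypothetical.

```shell
# Hypothetical helper: print the name and raw value of a few SMART
# attributes. Reads `smartctl -a` output on stdin.
smart_key_attributes() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Spin_Retry_Count|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
            print $2, $10    # attribute name, raw value
        }
    '
}

# Usage:
#   smartctl -a /dev/ada1 | smart_key_attributes
```

Not every drive reports every attribute. The WD Green output posted earlier in the thread has no Command_Timeout row at all, for example.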
 