One or more devices has experienced an error resulting in data corruption.

Status
Not open for further replies.

signtrigger

Cadet
Joined
Sep 9, 2017
Messages
5
Hi friends,
I'm looking for some help with the error mentioned in the subject. My setup is quite new, and I only found out later that ECC RAM was supposed to be used. However, I've run Memtest: 4 passes and zero errors!
Here are a couple of outputs for further analysis:
zpool status -v
Code:
  pool: Trigger-Vol
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 576K in 3h20m with 2 errors on Sun Sep  3 03:20:31 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Trigger-Vol                                     ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a94d8ead-5e80-11e7-9e07-708bcda20143  ONLINE       0     0     0
            gptid/a9f15727-5e80-11e7-9e07-708bcda20143  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Trigger-Vol/Games-Setup:<0xd>
        Trigger-Vol/Games-Setup:<0x18c3e>

  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h1m with 1 errors on Sun Sep  3 03:55:20 2017
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //usr/local/sbin/pkg-static


smartctl -a /dev/ada0

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E6VU32KZ
LU WWN Device Id: 5 0014ee 2b8a68660
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Sep 19 10:19:34 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (50760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 508) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   176   172   021    Pre-fail  Always       -       8158
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       162
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       52
194 Temperature_Celsius     0x0022   113   103   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       151         -
# 2  Short offline       Completed without error       00%       142         -
# 3  Short offline       Completed without error       00%       140         -
# 4  Short offline       Completed without error       00%       140         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


smartctl -a /dev/ada1

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA400
Serial Number:    17SHKTBHFSAA
LU WWN Device Id: 5 000039 78bf039fb
Firmware Version: FP2A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Sep 19 10:20:48 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 479) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       6831
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       162
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       64
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       44 (Min/Max 25/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       148
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       577
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       161         -
# 2  Short offline       Completed without error       00%       142         -
# 3  Short offline       Completed without error       00%       140         -
# 4  Short offline       Completed without error       00%       139         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
					  

 
Joined
Apr 9, 2015
Messages
1,258
Zero errors in Memtest doesn't mean a whole lot right now. There could have been one or two fleeting errors, caused by any number of things, while some data sat in RAM before being written.
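A back-of-the-envelope way to see why a clean Memtest run is weak evidence against an intermittent fault: if errors arrive independently at some average rate λ per hour (a Poisson assumption), the chance that a T-hour test sees none is exp(-λ·T). The error rate and test length below are hypothetical numbers chosen for illustration, and the sketch assumes Memtest would catch any error that happens while it runs.

```python
import math

def p_clean_run(errors_per_hour: float, test_hours: float) -> float:
    """Probability of observing zero errors during a test of the given length,
    assuming errors arrive independently at a constant average rate (Poisson)."""
    return math.exp(-errors_per_hour * test_hours)

# Hypothetical: a marginal module that flips roughly one bit per week of uptime.
rate = 1 / (7 * 24)

# Hypothetical: 4 Memtest passes taking about 8 hours in total.
print(f"P(clean 8 h test)       = {p_clean_run(rate, 8):.3f}")
# Probability of at least one error over 30 days of NAS uptime.
print(f"P(>=1 error in 30 days) = {1 - p_clean_run(rate, 30 * 24):.3f}")
```

With these made-up numbers, the 8-hour test comes back clean about 95% of the time, while the same module would produce at least one error over a month of uptime about 98.6% of the time.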

You are using a gaming board as well as a desktop CPU. Hopefully you have backups of the corrupted data so you can restore it. I honestly wouldn't even attempt to run a scrub at this point, with the data corrupted and non-ECC RAM, since there is no telling what other damage it could cause.

I'm sure it's not what you were hoping to hear, but once corruption creeps in, it may be permanent.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Most likely this file "Games-Setup:" is simply corrupt.

As for your boot disk: save your config, reinstall, then re-upload the config. That should fix it.

Remove the file Games-Setup: (or whatever it is) and continue.

Think about ECC. The file was probably corrupted because of the lack of ECC. The same thing would've happened on a normal filesystem, but you wouldn't know about the corruption.
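A toy model of the difference described here: a plain block store hands back whatever bytes the disk returns, while a ZFS-style store records a checksum at write time and refuses to return data that no longer matches it. SHA-256 stands in for ZFS's default fletcher4 checksum, and the class and file names are hypothetical.

```python
import hashlib

class PlainStore:
    """Stores raw blocks; has no way to notice if the bytes change underneath it."""
    def __init__(self):
        self.blocks = {}

    def write(self, key, data: bytes):
        self.blocks[key] = bytearray(data)

    def read(self, key) -> bytes:
        return bytes(self.blocks[key])

class ChecksumStore(PlainStore):
    """Also records a checksum at write time and verifies it on every read."""
    def __init__(self):
        super().__init__()
        self.sums = {}

    def write(self, key, data: bytes):
        super().write(key, data)
        self.sums[key] = hashlib.sha256(data).digest()

    def read(self, key) -> bytes:
        data = super().read(key)
        if hashlib.sha256(data).digest() != self.sums[key]:
            raise OSError(f"checksum mismatch in block {key!r}")
        return data

def flip_bit(store, key, bit=0):
    """Simulate silent corruption of one stored bit."""
    store.blocks[key][bit // 8] ^= 1 << (bit % 8)

plain, zfs_like = PlainStore(), ChecksumStore()
for s in (plain, zfs_like):
    s.write("setup.exe", b"game installer bytes")
    flip_bit(s, "setup.exe", bit=3)

print(plain.read("setup.exe"))  # corrupted bytes, returned without complaint
try:
    zfs_like.read("setup.exe")
except OSError as e:
    print("detected:", e)
```

Both stores hold the same damaged bytes; only the checksumming one can tell.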
 

signtrigger

Cadet
Joined
Sep 9, 2017
Messages
5
> Zero errors in Memtest doesn't mean a whole lot right now. It could have been one or two fleeting errors that happened for who knows how many reasons while some data was stored before being written.

Okay. I only ran it because I read in other posts that RAM is usually the culprit.

> You are using a gaming board as well as a Desktop CPU. Hopefully you have backups of the corrupted data so you can restore. I honestly wouldn't even attempt to run a scrub at this point with the data corrupted and non-ECC RAM since there is no telling what other damage it could cause.

No, I don't really have a backup of that data, but those 2 files aren't a concern. Are we saying this kind of data corruption never happens with ECC RAM?

> I am sure it's not the information you are hoping to hear but once corruption creeps in it may be permanent.

Not a problem. I'm just wondering what to do next. Is there a way to get the system back to normal without having to delete the pool (meaning deleting all the data I have on this box) and rebuild it? Is there a way to find out whether other files are corrupted? Can I continue using the system now that those 2 corrupted files have been deleted?
 

signtrigger

Cadet
Joined
Sep 9, 2017
Messages
5
> And your boot disk. Save your config, and reinstall, then re-upload. That should fix your boot disk.

Yep, fixing the boot drive shouldn't be difficult. The boot drive got corrupted after an upgrade I did last week.

> Think about ECC. The file was probably corrupted because of lack of ECC. The same thing would've happened on a normal Filesystem, but you wouldn't know about the corruption.
So, with ECC, data corruption never happens? I might need to do more reading on how the traditional file systems in Windows and Linux, e.g. NTFS or ext4, deal with this problem. Nobody talks about ECC with these file systems. Does it never happen, do they just not alert the user, or do they have a mechanism to fix it silently?
 

signtrigger

Cadet
Joined
Sep 9, 2017
Messages
5
> Remove the file Games-Setup:, or whatever it is... and continue...

Actually, the zpool status command showed two files inside that dataset (Games-Setup is a dataset), and I deleted them in the hope of magically fixing the issue. Since then, the command only shows hex output (possibly a reference to those locations or something).
Now that the files are deleted, is it safe to continue using the system? Or would deleting the dataset itself get rid of the error? Or do I need to delete the whole pool?
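As far as I know, `zpool status -v` falls back to printing `dataset:<0xN>` when it can no longer map an error to a path (for example, because the file was deleted); the hex number is the object number inside that dataset. A small sketch, assuming the entry format shown in the first post, that splits such entries into dataset name and decimal object number. The function name is hypothetical.

```python
import re

# Matches "pool/dataset:<0xHEX>" entries; ordinary file paths do not match.
ENTRY = re.compile(r"^(?P<dataset>[^:\s]+):<0x(?P<obj>[0-9a-fA-F]+)>$")

def parse_error_entries(lines):
    """Return (dataset, object_number) pairs for unresolved error entries."""
    result = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            result.append((m.group("dataset"), int(m.group("obj"), 16)))
    return result

errors = [
    "Trigger-Vol/Games-Setup:<0xd>",
    "Trigger-Vol/Games-Setup:<0x18c3e>",
    "//usr/local/sbin/pkg-static",  # a resolvable path, so it is skipped
]
for dataset, obj in parse_error_entries(errors):
    print(dataset, obj)
```

For the two Trigger-Vol entries above, this yields object numbers 13 and 101438.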
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
If you scrub the pool and there are no errors, then you are good.
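A minimal sketch of that check, assuming the `zpool status -v` output format shown in the first post. Newer OpenZFS releases print the scan line slightly differently (e.g. `repaired 0B in 00:00:01`), which the regex also tolerates. The scrub itself is started with `zpool scrub Trigger-Vol` and runs in the background; the helper names below are hypothetical.

```python
import re
import subprocess

SCAN = re.compile(r"scan: scrub repaired (?P<repaired>\S+) in \S+ with (?P<errors>\d+) errors")

def pool_status(pool: str) -> str:
    """Capture `zpool status -v <pool>` (needs the zpool CLI, usually as root)."""
    return subprocess.run(["zpool", "status", "-v", pool],
                          capture_output=True, text=True, check=True).stdout

def scrub_is_clean(status_text: str) -> bool:
    """True if the last completed scrub found 0 errors and none are listed.

    A scrub that is still running, or was never run, does not match the
    regex and is reported as not clean ("not verified yet").
    """
    m = SCAN.search(status_text)
    if m is None:
        return False
    return int(m.group("errors")) == 0 and "errors: No known data errors" in status_text
```

Once the scrub has finished, `scrub_is_clean(pool_status("Trigger-Vol"))` would return True only if the pool came back clean.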

You may need to delete the dataset. You can copy files out first.

Yes. No corruption with ECC. ZFS protects against disk- and I/O-based corruption, not corruption that happens in memory.

Most filesystems don't protect at all.
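To make the in-memory point concrete: ZFS computes the checksum from the buffer in RAM, so a bit flip before that moment gets checksummed along with the data and looks valid from then on. A flip after checksumming, but before the buffer is written, goes to both halves of a mirror and shows up as an error that the mirror cannot repair, which is one way a mirror can end up with permanent errors like the ones above. This is a toy model; SHA-256 stands in for ZFS's checksum, and the function names are hypothetical.

```python
import hashlib

def write_to_mirror(data: bytes, flip_before_checksum=False, flip_after_checksum=False):
    """Toy ZFS-style write to a two-way mirror; returns (disk copies, stored checksum)."""
    buf = bytearray(data)              # the block as it sits in (non-ECC) RAM
    if flip_before_checksum:
        buf[0] ^= 0x01                 # RAM error before the checksum is computed
    checksum = hashlib.sha256(buf).digest()
    if flip_after_checksum:
        buf[0] ^= 0x01                 # RAM error after checksumming, before the write
    copies = [bytes(buf), bytes(buf)]  # the same buffer is sent to both mirror disks
    return copies, checksum

def verify(copies, checksum):
    """Return the first copy that matches the checksum, or None if none do."""
    for c in copies:
        if hashlib.sha256(c).digest() == checksum:
            return c
    return None

original = b"important data"
silent = verify(*write_to_mirror(original, flip_before_checksum=True))
print(silent is not None, silent == original)  # True False: wrong data, passes checks
print(verify(*write_to_mirror(original, flip_after_checksum=True)))  # None: both copies bad
```

ECC RAM addresses exactly the window these two flips occupy, which ZFS's on-disk checksums cannot see into.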
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Unless it's caused by something other than RAM.

But other than CPU/register corruption (which is ECC-protected too), the rest is detected by ZFS or by some other error check further down the line, for example UDMA CRC errors.
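Roughly how the UDMA CRC case works, as I understand it: each transfer over the drive link carries a CRC that the receiving end checks. A mismatch is counted (SMART attribute 199, UDMA_CRC_Error_Count, which is 0 on both drives in the first post) and the transfer is retried. The sketch models that detect-and-retry loop with `zlib.crc32` as a stand-in for the link's own CRC and a hypothetical noisy cable; it illustrates the idea, not how a SATA controller is implemented.

```python
import random
import zlib

def noisy_link(frame: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Hypothetical cable: with probability error_rate, flips one random bit."""
    out = bytearray(frame)
    if rng.random() < error_rate:
        bit = rng.randrange(len(out) * 8)
        out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)

def transfer(payload: bytes, rng: random.Random, error_rate: float = 0.3, max_tries: int = 10):
    """Send payload with a CRC and retry on mismatch.

    Returns (received_data, crc_error_count). The CRC value itself is assumed
    to arrive intact, which keeps the toy model short.
    """
    expected = zlib.crc32(payload)
    crc_errors = 0
    for _ in range(max_tries):
        received = noisy_link(payload, rng, error_rate)
        if zlib.crc32(received) == expected:
            return received, crc_errors
        crc_errors += 1  # loosely analogous to SMART attribute 199 counting up
    raise OSError("giving up: every attempt failed its CRC check")

payload = b"sector contents " * 32
data, errors = transfer(payload, random.Random(1))
print(data == payload, errors)
```

Because a CRC catches every single-bit flip, the data that gets through is always intact; a flaky cable shows up as a rising error count and retries, not as silent corruption.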

But yes. If your computer loses the plot and starts corrupting all data streams written to HDs, there is not much ZFS can do. But it tries very hard ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
> But, other than CPU/Register corruption (which is ECC protected too), the rest is detected by ZFS, or some other error correction later. For example UDMA CRC errors.
>
> But yes. If your computer loses the plot and starts corrupting all data streams writing to HDs, there is not much ZFS can do. But it tries very hard ;)

My point was just that drives crapping out beyond ZFS' capacity to repair are still a problem. With ECC, basically everything else is reliable.
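The redundancy point can be sketched as a toy self-healing read. With a mirror, a block that fails its checksum on one disk is rewritten from the other copy (roughly what "scrub repaired 576K" reports). If every copy is bad, or there is only one copy as on the single-disk freenas-boot pool, all ZFS can do is report a permanent error. The function name is hypothetical and SHA-256 stands in for ZFS's checksum.

```python
import hashlib

def self_healing_read(disks, key, checksum):
    """Read a block from a list of mirror disks (dicts), repairing bad copies.

    Returns (data, repaired_count); raises OSError if no copy is valid.
    """
    good = None
    for disk in disks:
        if hashlib.sha256(disk[key]).digest() == checksum:
            good = disk[key]
            break
    if good is None:
        raise OSError(f"permanent error: no valid copy of {key!r}")
    repaired = 0
    for disk in disks:
        if disk[key] != good:
            disk[key] = good  # rewrite the damaged copy from the good one
            repaired += 1
    return good, repaired

block = b"game setup chunk"
cs = hashlib.sha256(block).digest()

mirror = [{"b1": b"game setup chunX"}, {"b1": block}]  # one disk returned bad data
print(self_healing_read(mirror, "b1", cs))             # (b'game setup chunk', 1)

single = [{"b1": b"game setup chunX"}]                 # no redundancy, like freenas-boot
try:
    self_healing_read(single, "b1", cs)
except OSError as e:
    print(e)
```

Drive failures that leave no good copy anywhere are exactly the case Ericloewe describes as beyond ZFS's ability to repair.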
 

signtrigger

Cadet
Joined
Sep 9, 2017
Messages
5
Thank you guys for the quick help on this issue. With my current hardware config, I believe FreeNAS is overkill, since this OS expects server-grade hardware. For now, I know what my next steps are for dealing with such issues in the future.
Thanks again. :)
 