Questions regarding write errors and disk replacement.

CompuGlobalHyperMegaNet · Apr 2, 2017

I feel a little silly posting this but please bear with me. I've had a really stressful few days and this is the first time I've encountered any issues with my storage pool, so I'm second-guessing everything and having trouble keeping focus. To cap it off, it appears my online backup has expired (*hangs head in shame*), thanks spam filter! The point is that I was pretty damn stressed before these storage issues and they've only added to the load.

I'd appreciate some hand holding (so to speak) if someone would be so kind.

====================================

I've just received a couple of emails from my server. The first was received at 20:03 and said-

Code:

The volume tank (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The second was received at 20:04 (one minute after the first) and said-

Code:

The volume tank (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I checked the pool status using-

Code:

# zpool status

I also checked the disks via the GUI. One disk had around 524 write errors (I'm kicking myself for not taking accurate notes) but aside from that, there were no errors or checksum errors reported on any of the disks.

I then cleared the errors using-

Code:

zpool clear tank

and initiated a Scrub, which is currently about 40% done with no errors.
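
Since `zpool clear` resets the per-device error counters, it can help to record them first. What follows is only a minimal sketch, assuming the output of `zpool status` has been saved to a string; the function name and the sample text are hypothetical (the 524 write errors are placed on da5p2 purely for illustration), and very large counts that ZFS abbreviates (for example with a K suffix) are not handled.

```python
import re

# A device row of `zpool status`: name, state, then READ, WRITE and CKSUM counters.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, "state:" line, blank line, etc.
        counts = tuple(int(match.group(i)) for i in (3, 4, 5))
        if any(counts):
            found[match.group(1)] = counts
    return found


# Hypothetical sample, not the real output from this server.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da4p2   ONLINE       0     0     0
            da5p2   ONLINE       0   524     0
"""

if __name__ == "__main__":
    print(devices_with_errors(SAMPLE))
```

Saving the result to a file before running `zpool clear` keeps an accurate note of which device reported what.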

The system had been off for a few days before being booted around 19:30 today (the same day as the above emails/errors). The write errors may have occurred whilst I was transferring around 20GB of data to the FreeNAS server or whilst I was hashing the data.

My hardware can be found in my sig.

=========================================

So my questions are as follows.

1. If the scrub completes and there are no errors, what should my course of action be?

2. If there are errors reported, should I power down the system whilst I wait for the replacement disk to arrive and pass its burn-in test, or should I leave the system running?

3. Is there something else I should do before replacing the disk?

4. Once the scrub is complete, should I run a Long S.M.A.R.T. test on the drive in question?

5. What can cause write errors?

P.S. Please forgive me if I'm waffling or asking dumb questions. It's just been one of those weeks...

=====================================

[EDIT] The scrub is now complete and no errors are being reported. I've just run "zpool status" and I get the following-

Code:

[root@freenas ~]# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Mar 27 03:45:04 2017
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h3m with 0 errors on Mon Apr  3 00:34:41 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/fc861721-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fcea84f6-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fd5098de-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fdb415bc-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fe17c521-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fe7c3f44-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            da5p2                                       ONLINE       0     0     0
            gptid/ff4aa622-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0

As you can see, the problematic disk (da5p2) is being shown by its name rather than its gptid like the others. Why is this?
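
A pool member listed by device name is easy to overlook in a long listing. The following is a minimal sketch, assuming saved `zpool status` text, that lists members not referenced as `gptid/...`; the function name is hypothetical and the sample is a shortened copy of the output above. It only detects the situation and does not explain its cause. On FreeBSD, `glabel status` shows which gptid label belongs to which partition.

```python
import re

# Grouping rows such as raidz2-0 or mirror-1 are not disks themselves.
VDEV_GROUP = re.compile(r"^(raidz[123]?|mirror|spares|logs|cache)(-\d+)?$")


def non_gptid_members(status_text):
    """Return pool members in `zpool status` text that are not named gptid/..."""
    members = []
    in_config = False
    expect_pool_row = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config, expect_pool_row = True, True
            continue
        if not fields or fields[0] == "errors:":
            in_config = False  # a blank line ends the device table
            continue
        if not in_config:
            continue
        if expect_pool_row:  # first row under the header is the pool itself
            expect_pool_row = False
            continue
        name = fields[0]
        if VDEV_GROUP.match(name) or name.startswith("gptid/"):
            continue
        members.append(name)
    return members


# Shortened copy of the tank listing from this post.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/fe7c3f44-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            da5p2                                       ONLINE       0     0     0
            gptid/ff4aa622-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(non_gptid_members(SAMPLE))
```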

Am I safe to shut the system down and fall into bed with a stiff drink?

nojohnny101 · Apr 2, 2017

What is the smart output for the problem drive?

CompuGlobalHyperMegaNet · Apr 2, 2017

Thanks for the reply!

Here's the SMART status. You'll notice that a long SMART test is currently running with 70% remaining, as I figured that would be the next logical course of action.

Code:

[root@freenas] ~# smartctl -a /dev/da5
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	WD-WMC4N0D6FYER
LU WWN Device Id: 5 0014ee 003deb111
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Apr  3 05:40:04 2017 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 247) Self-test routine in progress...
										70% of test remaining.
Total time to complete Offline
data collection:				(40500) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 406) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   4
  3 Spin_Up_Time			0x0027   174   173   021	Pre-fail  Always	   -	   6283
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   140
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   578
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   140
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   138
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   25
194 Temperature_Celsius	 0x0022   121   113   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   356		 -
# 2  Extended offline	Completed without error	   00%	   157		 -
# 3  Extended offline	Interrupted (host reset)	  90%	   150		 -
# 4  Extended offline	Interrupted (host reset)	  80%	   150		 -
# 5  Extended offline	Completed without error	   00%		66		 -
# 6  Conveyance offline  Completed without error	   00%		59		 -
# 7  Short offline	   Completed without error	   00%		59		 -
# 8  Extended offline	Completed without error	   00%		 7		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

CompuGlobalHyperMegaNet · Apr 3, 2017

Now the Long S.M.A.R.T. test has finished, here's the output of "smartctl -a /dev/da5".

Code:

[root@freenas] ~# smartctl -a /dev/da5
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD30EFRX-68EUZN0
Serial Number:  WD-WMC4N0D6FYER
LU WWN Device Id: 5 0014ee 003deb111
Firmware Version: 82.00A82
User Capacity:  3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Mon Apr  3 16:00:49 2017 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (40500) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 406) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   174   173   021    Pre-fail  Always       -       6283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       140
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       589
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       138
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       584         -
# 2  Short offline       Completed without error       00%       356         -
# 3  Extended offline    Completed without error       00%       157         -
# 4  Extended offline    Interrupted (host reset)      90%       150         -
# 5  Extended offline    Interrupted (host reset)      80%       150         -
# 6  Extended offline    Completed without error       00%        66         -
# 7  Conveyance offline  Completed without error       00%        59         -
# 8  Short offline       Completed without error       00%        59         -
# 9  Extended offline    Completed without error       00%         7         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And here's the output of "zpool status tank".

Code:

[root@freenas] ~# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h3m with 0 errors on Mon Apr  3 00:34:41 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/fc861721-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fcea84f6-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fd5098de-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fdb415bc-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fe17c521-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            gptid/fe7c3f44-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0
            da5p2                                       ONLINE       0     0     0
            gptid/ff4aa622-41e7-11e5-b1be-002590f5a510  ONLINE       0     0     0

errors: No known data errors

nojohnny101 · Apr 3, 2017

Looks to me like you have a failed drive.

There is a great hard drive troubleshooting guide that tells you what relevant data you should be looking for in the smart test results to determine a drive failure or not.

Yours seems to fit those criteria. More information here

Looks like RMA is your next step.

CompuGlobalHyperMegaNet · Apr 4, 2017

nojohnny101 said:
Looks to me like you have a failed drive.

There is a great hard drive troubleshooting guide that tells you what relevant data you should be looking for in the smart test results to determine a drive failure or not.

Yours seems to fit those criteria. More information here

To be honest, I thought all was fine and dandy at first. At the risk of sounding like a broken record, I've got so much going on at the moment and for whatever reason, I completely missed the ID# 200 errors. So sincerely, thanks for the reply or else I may have just carried on oblivious whilst I deal with other stuff.
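
A change like this is easy to miss when comparing two long `smartctl -a` dumps by eye. The following is a minimal sketch, assuming both dumps have been saved as text, that reports attributes whose raw value differs between them; the function names are hypothetical, and the sample rows are copied from the two attribute tables earlier in this thread.

```python
def parse_attributes(smart_text):
    """Map attribute ID -> (name, raw value) from a smartctl attribute table."""
    attrs = {}
    for line in smart_text.splitlines():
        f = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG in column 3.
        if len(f) >= 10 and f[0].isdigit() and f[2].startswith("0x"):
            attrs[int(f[0])] = (f[1], f[9])
    return attrs


def changed_raw_values(before_text, after_text):
    """Return (id, name, raw_before, raw_after) for attributes that changed."""
    before = parse_attributes(before_text)
    after = parse_attributes(after_text)
    changes = []
    for attr_id, (name, raw_after) in sorted(after.items()):
        raw_before = before.get(attr_id, (name, None))[1]
        if raw_before != raw_after:
            changes.append((attr_id, name, raw_before, raw_after))
    return changes


# Rows taken from the first dump (test still running).
BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       578
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""

# The same rows from the second dump (after the long test finished).
AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       589
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3
"""

if __name__ == "__main__":
    for change in changed_raw_values(BEFORE, AFTER):
        print(change)
```

Power_On_Hours is expected to change between runs; the interesting line is the one for ID# 200.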

nojohnny101 said:
Looks like RMA is your next step.

I'm gonna start the process as soon as I've posted this reply (and see whether WD accept RMAs for disks like mine). I have a couple of follow-up questions though.

Long story and I won't bore anyone with it, but I had to shut the server down the other day. The server is still off and before I reboot it, I'd like to know whether it's a bad idea to make a backup before resilvering. Thankfully, I'm pretty sure I have a local backup of all the really important stuff, so the spam filter (and my own inattentiveness) hasn't screwed me too hard, but there's about 2 to 3TB of data that would be time-consuming/annoying to replace (music, movies and the like). Is there some reason I shouldn't back up said data before removing or replacing the disk?

The plan is to either back up the data to the various disks I already have, via LAN, or to buy a couple of large disks, create a new vdev in a separate pool and then back up to that using the method described here - https://forums.freenas.org/index.php?threads/how-to-copy-one-pool-into-another.16653/.

nojohnny101 · Apr 4, 2017

No problem, glad to help. So many on here have helped me more times than I can count so glad I could give back.

As far as backing up before resilvering, yes, backing up is always a good idea. In fact one of the "party lines" the veterans on here tout (and rightfully so) is that RAIDZ (or for that matter, any form of RAID) is NOT a backup. It is simply parity. I know budgets don't always allow, but a backup should be a top priority and honestly be planned from the beginning by anyone who cares about their data.

You can accomplish this many ways. Some are cheaper than others and each has its advantages and disadvantages, but you seem like a smart guy, so you can figure those out. The most common methods are:
1) as you said, build another pool inside your current machine, then snapshot and replicate to that.
2) buy a big external USB drive, mount it on a computer that is connected to the same LAN as your FreeNAS server, then just back up whatever you deem necessary to the external USB drive.
3) back up to a paid cloud service. Options are endless here; searching on here alone will yield you many ways/services to accomplish this.
4) build another budget FreeNAS box, do the initial replication to it while it is local, then move it offsite and continue to replicate to it (this option I personally believe has the most advantages compared to the others, namely you control the data [no cloud storage provider], it is offsite, which protects against catastrophic failures [robbery, fire, lightning strike, etc.], and your backup has the added benefits of FN [bit rot protection, ECC memory, parity, etc.]).
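
For option 1, the snapshot-and-replicate step can be sketched as follows. This is only an illustration: it builds the command strings and prints them rather than running anything. The target pool name `backup` and the snapshot name are hypothetical, and `zfs receive -F` rolls back the target dataset before receiving, so the target should be one that holds nothing worth keeping.

```python
import shlex


def replication_commands(source_pool, target_pool, snapshot_name):
    """Build (not run) commands for a recursive snapshot and full replication."""
    snapshot = f"{source_pool}@{snapshot_name}"
    target = f"{target_pool}/{source_pool}"
    return [
        # -r snapshots every dataset in the source pool.
        f"zfs snapshot -r {shlex.quote(snapshot)}",
        # -R sends the snapshot with all descendant datasets and properties.
        f"zfs send -R {shlex.quote(snapshot)} | zfs receive -F {shlex.quote(target)}",
    ]


if __name__ == "__main__":
    # "backup" and the snapshot name are placeholders, not names from this thread.
    for command in replication_commands("tank", "backup", "manual-2017-04-04"):
        print(command)
```

Printing the commands first gives a chance to check the pool and dataset names before anything is typed into a root shell.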

Good luck! Post back with more questions, this is a great community!

CompuGlobalHyperMegaNet · Apr 4, 2017

nojohnny101 said:
No problem, glad to help. So many on here have helped me more times than I can count so glad I could give back.

Same here. The amount of knowledge these forums contain is really invaluable. I've been lurking for about three years now and despite what some people might say on other forums, I've found the users here to be nothing but helpful, welcoming and friendly... I've lost count of the number of times users of this forum have been labeled elitist neckbeards just for having the gall to point out why some new user's plan for FreeNAS is terrible... It's the same with the pfSense forum. I remember one thread in particular where a pfSense newbie was insistent that it was a good idea for pfSense to add NAS functionality and despite numerous very knowledgeable users pointing out just why it was a terrible idea on many levels, the OP just wouldn't hear any of it.

The argument ultimately boiled down to the OP thinking that the other users were just being closed-minded because they couldn't understand that some people would like having data stored on their UTM, and that since he didn't care about the privacy of his data, where was the harm?...

[EDIT] I almost forgot! The OP also used power savings as an argument in favour of his idea and basically painted it as anyone not agreeing with his pfSense + FreeNAS idea (or faSense as I call it) not caring about the environment.

Face meet palm.

nojohnny101 said:
In fact one of the "party lines" the veterans on here tout (and rightfully so) is that RAIDZ (or for that matter, any form of RAID) is NOT a backup. It is simply parity. I know budgets don't always allow, but a backup should be a top priority and honestly be planned from the beginning by anyone who cares about their data.

*snip

4) build another budget FreeNAS box, do the initial replication to it while it is local, then move it offsite and continue to replicate to it (this option I personally believe has the most advantages compared to the others, namely you control the data [no cloud storage provider], it is offsite, which protects against catastrophic failures [robbery, fire, lightning strike, etc.], and your backup has the added benefits of FN [bit rot protection, ECC memory, parity, etc.]).

I did have CrashPlan up until recently and as I said in my previous post, (I'm almost certain that) all my important data is backed up locally but I'm aware of the shortcomings of backing up to non-ZFS, non-redundant disks.

I've had to do this all on a tight budget but as you can see from my spec, I try my hardest to do things by the book. The primary reason behind using FreeNAS has always been to preserve the integrity of my data but I know that in order to comply with the 3-2-1 Rule, I really need another copy of my data and a bunch of disks in cold storage just doesn't cut it.

That's why my long-term plan has always been to get a second system, when I could afford it, and have it at a relative's house... I guess it's not the best idea to wait for the ideal time though. I should probably take this as a sign to (check my spam folder more often and) just bite the bullet and set up another system.

I could get an HPE ProLiant Gen8 for £110 + the cost of 8GB of ECC and the disks. I looked at one a few months back (with using it as a second FreeNAS server in mind) but I will have to do some more research first... I do know that Louwrentius has one though (I'll have to have a look through his post history).

nojohnny101 said:
Good luck! Post back with more questions, this is a great community!

Thank you and I will, I'm sure.

nojohnny101 · Apr 4, 2017

Sounds like you have done proper research and are coming into this better informed than most.

I will just point you to a thread that was a discussion about the necessary specs for a backup box.
https://forums.freenas.org/index.ph...storage-device-for-freenas-nas-backups.51214/
