Critical error! What to do next?

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, it's always safest to do the burn-in on a completely different system--that way there's no way you inadvertently trash one of your data drives. I don't do that, though; I have spare bays in my FreeNAS server, so if I need to burn in a disk, I just do it there.

The steps you've posted look correct, except that there's no need to reboot the server before kicking off the last long SMART test. I like to add the -v flag to badblocks (badblocks -b 4096 -wsv /dev/ada3) so that I get verbose output alongside the -s progress indicator as it runs.
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
Thanks again! I read about the reboot in your second linked post on burning in a disk, but I will do as you describe, including the -v. Fingers crossed now...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
"Then I shut down the system and remove the hdd?"
Unless your system is able to handle hot-swapping drives.
"Or do I need to put it OFFLINE before that?"
Once the replace is done, the system will have already offlined the drive and you can just pull it out.
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
Ok guys, so far I received the new disk and mounted it in the FreeNAS case. I have run:

smartctl -t short /dev/ada3 (no results, so then it is good, right?)
smartctl -t conveyance /dev/ada3 (also no results)
and started the long test with smartctl -t long /dev/ada3 (needs 470 minutes!).

Then tomorrow I will start my morning with:

Code:
tmux

Code:
badblocks -b 4096 -wsv /dev/ada3

I am hoping that badblocks will be finished by the time I get home from work. So far so good, I guess.
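For reference, the sequence described in this thread can be written down as a small dry-run script. This is only a sketch: burn_in_plan is a made-up helper name, it prints the commands rather than running them, and it assumes each test is allowed to finish before the next one is started. Keep in mind that badblocks -w overwrites the whole disk, so the device name must be the new, empty drive.

```shell
#!/bin/sh
# Dry-run sketch: print the burn-in commands in order instead of running them.
# burn_in_plan is a hypothetical helper name, not a FreeNAS tool.
# WARNING: badblocks -w destroys all data on the target disk.
burn_in_plan() {
    disk="$1"   # e.g. /dev/ada3; double-check that this is the NEW disk
    echo "smartctl -t short $disk"
    echo "smartctl -t conveyance $disk"
    echo "smartctl -t long $disk"
    echo "badblocks -b 4096 -wsv $disk"
    echo "smartctl -t long $disk"
}

burn_in_plan /dev/ada3
```

Printing the plan first makes it easy to check the device name before anything destructive is typed.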
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
@danb35 the Fester tutorial says to remove the volume. I do not need to do this, right? The new disk is not part of the volume, is it?
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
Don't bet on it; expect a few days.

OK, thanks for the heads-up. But by using the tmux command I can shut down my PC, right? I will leave it in tonight for the long test. But no results after running the first tests means no issues?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
"The new disk is not part of the volume, is it?"
No, it isn't.
"But by using the tmux command I can shutdown my PC right?"
Yes, you can even detach the tmux session (Ctrl-B, D) as soon as you start it.
"But no results after running the first tests means no issues?"
Probably; the results of the SMART tests are logged with the disk--do smartctl -a /dev/ada3 (or whatever) to see them.
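If only the self-test results are of interest, smartctl -l selftest prints just that log. Below is a minimal sketch of checking it automatically. selftests_clean is a hypothetical helper, and the sample lines are typed in by hand in the format smartctl prints, so treat this as an illustration rather than a tested monitoring script.

```shell
#!/bin/sh
# selftests_clean is a hypothetical helper: it reads self-test log lines
# (as printed by "smartctl -l selftest <disk>") on stdin and succeeds only
# if at least one test is listed and every one completed without error.
selftests_clean() {
    awk '/^# *[0-9]+ / { n++; if ($0 !~ /Completed without error/) bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Hand-typed sample lines; on the NAS one would instead run:
#   smartctl -l selftest /dev/ada3 | selftests_clean
sample='# 1  Extended offline    Completed without error       00%         8         -
# 2  Conveyance offline  Completed without error       00%         0         -
# 3  Short offline       Completed without error       00%         0         -'

if printf '%s\n' "$sample" | selftests_clean; then
    echo "all self-tests passed"
else
    echo "check the self-test log"
fi
```

A test that is still running would also be reported as "check the self-test log" by this sketch, since its status line does not say "Completed without error".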
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
@danb35 OK, I will go on as described before: wait for the long test to finish, then start the badblocks command.
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
After the long test, this is what I get from smartctl -a /dev/ada3. I see RAW values that are not 0. Can someone please tell me if this is OK so far?

Code:
=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68N32N0
Serial Number:	WD-WCC7K2XS6YU3
LU WWN Device Id: 5 0014ee 2ba60ca65
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu May  3 06:19:37 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(44340) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 470) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   100   253   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   100   253   021	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   1
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   8
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   1
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   0
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   3
194 Temperature_Celsius	 0x0022   123   119   000	Old_age   Always	   -	   27
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   253   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%		 8		 -
# 2  Conveyance offline  Completed without error	   00%		 0		 -
# 3  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Ooh, that is hard to read. Please edit your post and add code tags around the output.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
A value of 0 in the RAW_VALUE column of attribute Raw_Read_Error_Rate is fine.
Code:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   100   253   051	Pre-fail  Always	   -	   0

About adding code tags: when doing it manually, enclose the section in question in a pair of
[ CODE ], [ /CODE ]
tags (without the extra white space), or simply select it and use the "code" button provided by the forum software.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I see RAW value not being 0.
Which RAW value are you concerned about? Some of them should be 0, others shouldn't.
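To make "some should be 0" concrete, here is a rough sketch that looks only at the sector-health attributes. It assumes that IDs 5, 197 and 198 are the ones that should stay at 0 on a healthy disk; that choice, the helper name nonzero_critical and the hand-typed sample rows are all illustrative assumptions, not an official rule.

```shell
#!/bin/sh
# nonzero_critical is a hypothetical helper: it reads attribute rows (as
# printed by "smartctl -A <disk>") on stdin and prints any of the three
# sector-health attributes whose RAW_VALUE (last column) is not 0.
# Assumption: IDs 5, 197 and 198 are the ones to watch; attributes such as
# Power_On_Hours or Temperature_Celsius are expected to be non-zero.
nonzero_critical() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $NF != 0 { print $2 "=" $NF }'
}

# Hand-typed sample rows; on the NAS one would instead run:
#   smartctl -A /dev/ada3 | nonzero_critical
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

bad=$(printf '%s\n' "$sample" | nonzero_critical)
if [ -z "$bad" ]; then
    echo "no reallocated, pending or uncorrectable sectors"
else
    echo "investigate: $bad"
fi
```

Power_On_Hours is 8 in the sample and is deliberately ignored, which is the point: a non-zero raw value is only worrying for certain attributes.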
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
Which RAW value are you concerned about? Some of them should be 0, others shouldn't.

I thought that the instructions were that all should be 0.

Code:
[root@freenas] ~# badblocks -b 4096 -wsv /dev/ada3
Checking for bad blocks in read-write mode
From block 0 to 976754645
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
^Bd^R0.24% done, 0:54 elapsed. (0/0/0 errors)
^Bdone
Reading and comparing:  74.80% done, 12:55:34 elapsed. (0/0/0 errors)


So far so good I guess, right?
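A quick sanity check on the "From block 0 to 976754645" line: with -b 4096, that block range should add up to the User Capacity that smartctl reported for this disk (4,000,787,030,016 bytes). A small sketch of the arithmetic, assuming a shell with 64-bit integer arithmetic:

```shell
#!/bin/sh
# Cross-check: blocks are numbered from 0, so the count is last_block + 1.
last_block=976754645     # from the badblocks output
block_size=4096          # the -b value passed to badblocks
capacity=4000787030016   # "User Capacity" in bytes from smartctl

covered=$(( (last_block + 1) * block_size ))
echo "$covered"
if [ "$covered" = "$capacity" ]; then
    echo "badblocks covers the whole disk"
fi
```

The two numbers match, so the test range spans the entire drive.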
 

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
Based on the percentage, I thought that badblocks was nearly done; it showed 95%. But now it seems to have started a new session or something, since it says:

Code:
Reading and comparing: done
Testing with pattern 0x55:   9.44% done, 16:04:47 elapsed. (0/0/0 errors)


Is it supposed to be like this, @danb35? Also, if this is going to take more than a day (or 2)... I did not think this one through enough, I guess. I use the FreeNAS as my main data source, so all my files, music and so on are on there.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
I thought that the instructions were that all should be 0.

For WD Red HDDs a value of 0 in the RAW_VALUE column of attribute Raw_Read_Error_Rate is fine. Other attributes, for example Temperature_Celsius, of course show different values, as you can see in the output you just posted. Seagate HDDs, unfortunately, tend to show high numeric values in Raw_Read_Error_Rate even when completely healthy.

So far so good I guess, right? [...] But now it seems to have started a new session or something since it says

Looks o.k., just let it continue. badblocks cycles through four different test patterns.
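For a rough idea of how long the whole run will take, the progress lines can be extrapolated. The sketch below assumes that badblocks -w uses its four default patterns (0xaa, 0x55, 0xff, 0x00), that each pattern is a write pass followed by a read-and-compare pass (8 passes in total), and that every pass takes about equally long. estimate_total_hours is a hypothetical helper name, and the result is only a ballpark figure.

```shell
#!/bin/sh
# estimate_total_hours is a hypothetical helper. Assumptions: 4 patterns,
# each with a write pass and a read-and-compare pass (8 passes in total),
# and all passes take about equally long.
# Arguments: elapsed time as H:MM:SS, number of passes already finished,
# percent done of the current pass.
estimate_total_hours() {
    echo "$1 $2 $3" | awk '{
        split($1, t, ":")
        elapsed = t[1] * 3600 + t[2] * 60 + t[3]
        done_passes = $2 + $3 / 100
        printf "%.1f\n", elapsed * 8 / done_passes / 3600
    }'
}

# Figures from the thread: 16:04:47 elapsed, 2 passes finished (write and
# compare for 0xaa), 9.44% into the write pass for 0x55.
estimate_total_hours 16:04:47 2 9.44
```

With the figures posted above this comes out at roughly 61 hours, so "a few days" is a realistic expectation for a 4 TB disk.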
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338

JFD

Dabbler
Joined
Jul 25, 2016
Messages
46
After days of waiting, badblocks finished without errors. The long test gave me this:

Code:
smartctl -A /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   100   253   021	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   1
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   81
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   1
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   0
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   5
194 Temperature_Celsius	 0x0022   122   116   000	Old_age   Always	   -	   28
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0



I think everything is OK, and I can start the replacement part now, right?

Sorry for being a noob here ;-) and thanks for all the help. I guess having done this once, makes me more confident to be able to handle similar future events :)
 