Several Critical Alerts lately

monkeybutt · Dec 15, 2017

Been getting some troubling Critical Alerts lately.

Code:

Device: /dev/ada3, 1 Currently unreadable (pending) sectors

Code:

Device: /dev/ada3, Self-Test Log error count increased from 7 to 8

Also getting this every so often in my security run output.

Code:

freenas.local kernel log messages:
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c8 01 40 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada3:ahcich3:0:0:0): RES: 41 10 c8 01 40 40 00 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 30 9e fb 40 ae 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada3:ahcich3:0:0:0): RES: 41 10 30 9e fb 40 ae 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e0 e0 17 fc 40 ae 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada3:ahcich3:0:0:0): RES: 41 10 e0 17 fc 40 ae 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 01 40 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada3:ahcich3:0:0:0): RES: 41 10 90 01 40 40 00 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command

Code:

freenas.local kernel log messages:
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 e0 e8 e9 40 dc 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 e0 e8 e9 40 dc 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 e8 50 2a 40 45 01 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 e8 50 2a 40 45 01 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 90 ce 67 40 db 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 90 ce 67 40 db 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 f0 1f 41 40 dd 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 f0 1f 41 40 dd 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 90 f5 75 40 dd 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 90 f5 75 40 dd 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 90 76 01 40 dd 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 90 76 01 40 dd 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
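
For anyone reading along who is not used to these kernel messages: the "ATA status" and "error" values are bit masks. Below is a minimal sketch of decoding them, assuming the standard ATA status/error register bit layout. The names `STATUS_BITS`, `ERROR_BITS` and `decode` are illustrative only and are not part of FreeBSD or smartmontools.

```python
# Bit names for the ATA Status and Error registers (standard ATA layout).
# Only the bits seen in the log above are exercised; several other bits are
# obsolete or command-dependent in newer ATA revisions and are left out.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode(value, table):
    """Return the names of all bits set in `value`, highest bit first."""
    return [name for mask, name in sorted(table.items(), reverse=True) if value & mask]


print(decode(0x41, STATUS_BITS))  # status 41 from the log
print(decode(0x10, ERROR_BITS))   # error 10 on the WRITE commands
print(decode(0x40, ERROR_BITS))   # error 40 on the READ commands
```

UNC on the reads means the drive could not return correct data for the sector, and IDNF on the writes means the drive could not find the requested address. Both are reported by the drive itself.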

Am I about to lose a drive?

EDIT: SMART test results for this drive:

Code:

[root@freenas] ~# smartctl -a /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4E2ZRALPS
LU WWN Device Id: 5 0014ee 20dcbd767
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Dec 15 09:39:01 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 113) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				(50580) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 506) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   107
  3 Spin_Up_Time			0x0027   177   174   021	Pre-fail  Always	   -	   8141
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   18
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   089   089   000	Old_age   Always	   -	   8659
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   18
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   6
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   111
194 Temperature_Celsius	 0x0022   123   112   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   10%	  8581		 28010584
# 2  Extended offline	Completed: read failure	   90%	  8485		 36709488
# 3  Short offline	   Completed: read failure	   10%	  8422		 36706080
# 4  Short offline	   Completed: read failure	   90%	  8206		 36709704
# 5  Extended offline	Completed: read failure	   90%	  8110		 36709704
# 6  Short offline	   Completed: read failure	   90%	  8038		 36709704
# 7  Short offline	   Completed: read failure	   90%	  7870		 36709704
# 8  Extended offline	Completed: read failure	   90%	  7774		 36709704
# 9  Short offline	   Completed without error	   00%	  7703		 -
#10  Extended offline	Completed: read failure	   90%	  7670		 36709704
#11  Short offline	   Completed without error	   00%	  7462		 -
#12  Extended offline	Completed without error	   00%	  7375		 -
#13  Short offline	   Completed without error	   00%	  7294		 -
#14  Short offline	   Completed without error	   00%	  7126		 -
#15  Extended offline	Completed without error	   00%	  7040		 -
#16  Short offline	   Completed without error	   00%	  6958		 -
#17  Short offline	   Completed without error	   00%	  6743		 -
#18  Extended offline	Completed without error	   00%	  6656		 -
#19  Short offline	   Completed without error	   00%	  6575		 -
#20  Short offline	   Completed without error	   00%	  6407		 -
#21  Extended offline	Completed without error	   00%	  6320		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@freenas] ~#

millst · Dec 15, 2017

Probably. I think it's also possible that there is a problem with the SATA controller, but that is far less likely. You've already lost a sector per the SMART test. Time to RMA.
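
To check the attributes that usually matter for this kind of alert without reading the whole table, here is a minimal sketch. It assumes the `smartctl -a` attribute layout shown in the first post (ten whitespace-separated columns, raw value last). `parse_attributes` and `SAMPLE` are illustrative names, and the sample rows are copied from that output.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for rows of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


attrs = parse_attributes(SAMPLE)
for attr_id in (5, 197, 198):
    name, raw = attrs[attr_id]
    print(f"{attr_id:>3} {name}: {raw}")
```

In this output only attribute 197 is non-zero, which matches the "1 Currently unreadable (pending) sectors" alert.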

-tm

danb35 · Dec 15, 2017

millst said:
I think it's also possible that there is a problem with the SATA controller

I'd say highly unlikely. If the "self-test log error count" is increasing, the disk is failing SMART self-tests. Those have nothing to do with the SATA controller.

Johnnie Black · Dec 15, 2017

monkeybutt said:
Been getting some troubling Critical Alerts lately.

You should have received the first warning over a month ago, when it failed its first SMART extended test. I would recommend acting sooner when you receive a warning like that; sometimes things can go downhill pretty fast.

monkeybutt · Dec 15, 2017

Johnnie Black said:
You should have received the first warning over a month ago, when it failed its first SMART extended test. I would recommend acting sooner when you receive a warning like that; sometimes things can go downhill pretty fast.

Understood. Been traveling a lot lately, just haven't been able to get to it.

tvsjr · Dec 15, 2017

All those read failures, occurring at multiple locations on the disk? It's dead, Jim. RMA/replace. Especially if you travel a lot and aren't there to lay hands on it too often.
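
One way to see the "multiple locations" point is to pull the LBA_of_first_error column out of the self-test log. Below is a minimal sketch, assuming the log layout shown in the first post. `failed_lbas` is an illustrative name, and the sample lines are a subset copied from that output.

```python
import re

SAMPLE = """\
# 1  Short offline       Completed: read failure       10%      8581         28010584
# 2  Extended offline    Completed: read failure       90%      8485         36709488
# 3  Short offline       Completed: read failure       10%      8422         36706080
# 4  Short offline       Completed: read failure       90%      8206         36709704
# 9  Short offline       Completed without error       00%      7703         -
#10  Extended offline    Completed: read failure       90%      7670         36709704
"""

# Entry number, anything up to "read failure", remaining %, lifetime hours, LBA.
ROW = re.compile(r"^#\s*(\d+)\s+.*read failure\s+(\d+)%\s+(\d+)\s+(\d+)\s*$")


def failed_lbas(text):
    """Return (lifetime_hours, lba) for every self-test that ended in a read failure."""
    hits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hits.append((int(m.group(3)), int(m.group(4))))
    return hits


hits = failed_lbas(SAMPLE)
distinct = sorted({lba for _, lba in hits})
print(len(hits), "failed tests,", len(distinct), "distinct LBAs:", distinct)
```

Across the full log in the first post, the failures land on four different LBAs, which fits a drive that is degrading rather than one isolated bad sector.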

I'm surprised to see a Red dead at that age.

monkeybutt · Dec 15, 2017

tvsjr said:
All those read failures, occurring at multiple locations on the disk? It's dead, Jim. RMA/replace. Especially if you travel a lot and aren't there to lay hands on it too often.

I'm surprised to see a Red dead at that age.

Dang. Thanks for confirmation. I am very surprised too.

fabiob · Dec 16, 2017

tvsjr said:
I'm surprised to see a Red dead at that age.

I got UREs like that on a 3TB Red at about 1000 hrs. MTBF: 1 million hours.

Ericloewe · Dec 16, 2017

fabiob said:
I got UREs like that on a 3TB Red at about 1000 hrs. MTBF: 1 million hours.

Your mistake was assuming the MTBF number means anything useful.

fabiob · Dec 16, 2017

Ericloewe said:
Your mistake was assuming the MTBF number means anything useful.

Well, actually it's WD's mistake; we just RMA :D

wblock · Dec 16, 2017

The "M" in MTBF is "mean", not "maximum". (Makes me think of the Simpsons: "To protect mother earth, each copy contains a certain percentage of recycled paper." "And what percent is that?" "Zero. Zero's a percent.")
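
To put a rough number on that: under the common simplifying assumption of a constant failure rate (an exponential lifetime model, which real drives only loosely follow), an MTBF figure converts to a failure probability over a period as P = 1 - exp(-hours / MTBF). Below is a short sketch using the 1 million hour figure and the 1000-hour age quoted above, and assuming 24x7 operation. `failure_probability` is an illustrative name.

```python
import math

HOURS_PER_YEAR = 8760  # 24x7 operation, an assumption


def failure_probability(mtbf_hours, hours):
    """Probability of failing within `hours`, assuming a constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)


afr = failure_probability(1_000_000, HOURS_PER_YEAR)
early = failure_probability(1_000_000, 1000)
print(f"Annualized failure rate: {afr:.2%}")
print(f"Chance of failing within the first 1000 hours: {early:.2%}")
```

So even taken at face value, a 1 million hour MTBF describes a population average, not a promise about how long any single drive will last.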

millst · Dec 16, 2017

I think you meant it's not "minimum" time between failures.

-tm

tvsjr · Dec 16, 2017

wblock said:
The "M" in MTBF is "mean", not "maximum". (Makes me think of the Simpsons: "To protect mother earth, each copy contains a certain percentage of recycled paper." "And what percent is that?" "Zero. Zero's a percent.")

And I'm more and more convinced that MTBF is synonymous with SWAG... with varying levels of S.

monkeybutt · Dec 17, 2017

Thanks for the input everyone. Already replaced the drive, and started RMA process.

Xelas · Dec 20, 2017

I've had 6 x 3TB WD Reds in my system for 3 years, and have replaced 3 of them already. I'm not a fan of the Red series now.

tvsjr · Dec 21, 2017

Xelas said:
I've had 6 x 3TB WD Reds in my system for 3 years, and have replaced 3 of them already. I'm not a fan of the Red series now.

What temperature are you running your drives at? That's a huge failure rate.

Xelas · Dec 21, 2017

tvsjr said:
What temperature are you running your drives at? That's a huge failure rate.

They are fairly well ventilated. The drive cages sit behind two 12cm fans and the system sits in my office, so the temps can't be high.
My case is a Lian Li PC-V354
http://www.silentpcreview.com/article1153-page2.html


tvsjr · Dec 21, 2017

Xelas said:
They are fairly well ventilated. The drive cages sit behind two 12cm fans and the system sits in my office, so the temps can't be high.
My case is a Lian Li PC-V354
http://www.silentpcreview.com/article1153-page2.html

That means you don't actually know. What speeds are the fans running at? Are they plugged with debris/dust?
SSH into the box and issue a

Code:

smartctl -x /dev/<whatever_drive>

and look for the Temperature_Celsius value. Cut and paste the results here, in CODE tags.

Xelas · Dec 21, 2017

tvsjr said:
That means you don't actually know. What speeds are the fans running at? Are they plugged with debris/dust?
SSH into the box and issue a
Code:
smartctl -x /dev/<whatever_drive>
and look for the Temperature_Celsius value. Cut and paste the results here, in CODE tags.

Trust me, I take good care of it. It gets shut down every 6-9 months or so and thoroughly dusted (in a safe way - I've been in IT for 30+ years, I know what I'm doing). I get all the bunnies that collect in the nooks and crannies in the drive cages, using a very slightly damp paper towel. I check the dust filters fairly often and notice if they look like they may be a bit dusty - they get a light vacuum every month or so as needed.

Code:

root@NAS:~ # smartctl -x /dev/da1 | grep -i 'model'
Model Family:	 Western Digital Se
Device Model:	 WDC WD3000F9YZ-09N20L1

root@NAS:~ # smartctl -x /dev/da2 | grep -i 'model'
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68AX9N0

root@NAS:~ # smartctl -x /dev/da3 | grep -i 'model'
Model Family:	 Western Digital Se
Device Model:	 WDC WD3000F9YZ-09N20L1

root@NAS:~ # smartctl -x /dev/da4 | grep -i 'model'
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68AX9N0

root@NAS:~ # smartctl -x /dev/da5 | grep -i 'model'
Model Family:	 Samsung based SSDs
Device Model:	 Samsung SSD 840 PRO Series

root@NAS:~ # smartctl -x /dev/da6 | grep -i 'model'
Model Family:	 Western Digital Gold
Device Model:	 WDC WD6002FRYZ-01WD5B0

root@NAS:~ # smartctl -x /dev/da7 | grep -i 'model'
Model Family:	 Samsung based SSDs
Device Model:	 SAMSUNG MZ7KM240HAGR-0E005

root@NAS:~ # smartctl -x /dev/da8 | grep -i 'model'
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68AX9N0

The pool (and the drives in question) are da1, da2, da3, da4, da6, da8
The WD SE drives were both what I got back when I RMA'd the Red drives. I don't know why, but I won't look a gift horse in the mouth.
The WD Gold is another replacement I bought to replace the 3rd failed drive - the most recent (~ 3-4 months). It's actually 6TB (all others are 3TB), so I'm using it at half capacity, but I'll be transitioning all drives to this model over the next few months to expand the pool. I also want a bit better performance, so I'm transitioning to 7200 rpm as well.

There was an "incident" where my daughter taped a picture that she drew over the front of the server, blocking the intake fans. This happened very recently (WAY after the failures), and is what led to the higher drive temps being locked in for posterity. Even so, they never went dangerously high.

Temps aren't relevant to this. I think I may have just gotten a bad batch of drives.

Here are the relevant smartctl excerpts:
da1 (WD SE):

Code:

0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==														
0x05  0x008  1			  40  ---  Current Temperature																			
0x05  0x010  1			  39  ---  Average Short Term Temperature																
0x05  0x018  1			  40  ---  Average Long Term Temperature																
0x05  0x020  1			  55  ---  Highest Temperature																			
0x05  0x028  1			  30  ---  Lowest Temperature																			
0x05  0x030  1			  53  ---  Highest Average Short Term Temperature														
0x05  0x038  1			  37  ---  Lowest Average Short Term Temperature														
0x05  0x040  1			  50  ---  Highest Average Long Term Temperature														
0x05  0x048  1			  40  ---  Lowest Average Long Term Temperature														
0x05  0x050  4			   0  ---  Time in Over-Temperature																	
0x05  0x058  1			  60  ---  Specified Maximum Operating Temperature														
0x05  0x060  4			   0  ---  Time in Under-Temperature																	
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature

da2 (WD Red)

Code:

Current Temperature:					37 Celsius
Power Cycle Min/Max Temperature:	 34/40 Celsius
Lifetime	Min/Max Temperature:	 23/54 Celsius
Under/Over Temperature Limit Count:   0/0

da3 (WD SE):

Code:

0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1			  43  ---  Current Temperature
0x05  0x010  1			  42  ---  Average Short Term Temperature
0x05  0x018  1			  42  ---  Average Long Term Temperature
0x05  0x020  1			  56  ---  Highest Temperature
0x05  0x028  1			  31  ---  Lowest Temperature
0x05  0x030  1			  54  ---  Highest Average Short Term Temperature
0x05  0x038  1			  39  ---  Lowest Average Short Term Temperature
0x05  0x040  1			  51  ---  Highest Average Long Term Temperature
0x05  0x048  1			  42  ---  Lowest Average Long Term Temperature
0x05  0x050  4			   0  ---  Time in Over-Temperature
0x05  0x058  1			  60  ---  Specified Maximum Operating Temperature
0x05  0x060  4			   0  ---  Time in Under-Temperature
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature

da4 (WD Red)

Code:

Current Temperature:					37 Celsius
Power Cycle Min/Max Temperature:	 35/41 Celsius
Lifetime	Min/Max Temperature:	 22/54 Celsius
Under/Over Temperature Limit Count:   0/0

da6 (WD Gold)

Code:

0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1			  43  ---  Current Temperature
0x05  0x010  1			  41  N--  Average Short Term Temperature
0x05  0x018  1			  44  N--  Average Long Term Temperature
0x05  0x020  1			  55  ---  Highest Temperature
0x05  0x028  1			  25  ---  Lowest Temperature
0x05  0x030  1			  52  N--  Highest Average Short Term Temperature
0x05  0x038  1			  25  N--  Lowest Average Short Term Temperature
0x05  0x040  1			  48  N--  Highest Average Long Term Temperature
0x05  0x048  1			  25  N--  Lowest Average Long Term Temperature
0x05  0x050  4			   0  ---  Time in Over-Temperature
0x05  0x058  1			  60  ---  Specified Maximum Operating Temperature
0x05  0x060  4			   0  ---  Time in Under-Temperature

da8 (WD Red)

Code:

Current Temperature:					37 Celsius
Power Cycle Min/Max Temperature:	 34/40 Celsius
Lifetime	Min/Max Temperature:	 22/53 Celsius
Under/Over Temperature Limit Count:   0/0
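
Since the excerpts above come in two different layouts, here is a minimal sketch for pulling the lifetime highest temperature out of either one. It assumes exactly the two formats posted above (the SCT status block for the Reds, the device statistics table for the Se and Gold drives). `highest_temperature` and the sample strings are illustrative, with values copied from the da2 and da1 excerpts.

```python
import re

# Layout 1: SCT status block (WD Red excerpts).
SCT_LIFETIME = re.compile(r"Lifetime\s+Min/Max Temperature:\s+(-?\d+)/(-?\d+)\s+Celsius")
# Layout 2: device statistics table, page 0x05 offset 0x020 (WD Se / Gold excerpts).
STATS_HIGHEST = re.compile(
    r"^0x05\s+0x020\s+\d+\s+(-?\d+)\s+\S+\s+Highest Temperature\s*$", re.M
)


def highest_temperature(text):
    """Return the lifetime highest temperature in Celsius, or None if not found."""
    m = SCT_LIFETIME.search(text)
    if m:
        return int(m.group(2))
    m = STATS_HIGHEST.search(text)
    if m:
        return int(m.group(1))
    return None


DA2 = "Lifetime    Min/Max Temperature:     23/54 Celsius"
DA1 = "0x05  0x020  1              55  ---  Highest Temperature"

print("da2:", highest_temperature(DA2), "C")
print("da1:", highest_temperature(DA1), "C")
```

The Se and Gold tables also report a "Specified Maximum Operating Temperature" of 60, so the recorded peaks of 53 to 56 sit below that figure, though not by much.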
