HP MicroServer Gen8 with 4 x 8TB Raid-Z1 - CAM status: Command timeout, Error 5, Retries exhausted?

Status
Not open for further replies.

victorhooi

Contributor
Joined
Mar 16, 2012
Messages
184
I have an HP MicroServer Gen8, running a 4 x 8TB RAID-Z1 ZFS array.

I recently noticed the machine wasn't accessible via the FreeNAS WebUI.

I checked the iLO and saw this:

Code:
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
(ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Unconditionally Re-queue Request
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
GEOM_MIRROR: Device swap1: provider ada1p1 disconnected.
GEOM_MIRROR: Device swap1: provider destroyed.
GEOM_MIRROR: Device swap1 destroyed.
GEOM_ELI: Device mirror/swap1.eli destroyed.
GEOM_ELI: Detached mirror/swap1.eli on last close.
(ada1:ahcich1:0:0:0): Periph destroyed
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 26 port 0
ahcich1: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 80 serr 00000000 cmd 0004fa17
(aprobe0:ahcich1:0:0:0): SOFT RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ipmi0: KCS: Failed to start write
ugen2.2: <BMC Virtual Keyboard> at usbus2
ukbd0 on uhub2
ukbd0: <Virtual Keyboard > on usbus2
kbd2 at ukbd0
ums0 on uhub2
ums0: <Virtual Mouse> on usbus2


I OCR-ed the above - original screenshot is here - http://i.imgur.com/lLz8evk.png.
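
For anyone sifting through a longer console log, here is a minimal Python sketch (an illustration, not something that was run on this machine) that groups CAM messages by peripheral and channel, so it is easier to see that everything points at ahcich1. The regular expression and the function name `group_cam_messages` are my own assumptions about the line layout shown above.

```python
import re
from collections import defaultdict

# Matches kernel console lines such as
# "(ada1:ahcich1:0:0:0): CAM status: Command timeout"
CAM_LINE = re.compile(
    r"^\((?P<periph>\w+):(?P<channel>\w+):\d+:\d+:\d+\): (?P<message>.+)$"
)

def group_cam_messages(log_text):
    """Return {(periph, channel): [messages]} for every CAM-style line."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = CAM_LINE.match(line.strip())
        if match:
            key = (match.group("periph"), match.group("channel"))
            grouped[key].append(match.group("message"))
    return dict(grouped)

# A few lines copied from the console output in this post.
SAMPLE = """\
(ada1:ahcich1:0:0:0): CAM status: Unconditionally Re-queue Request
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
"""

for (periph, channel), messages in group_cam_messages(SAMPLE).items():
    print(periph, channel, messages)
```

Lines without the parenthesised device prefix (such as the AHCI reset message) are skipped on purpose; they would need a second pattern.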

I reset the machine via iLO. On bootup, I saw this:

Q7rDgth.png

HL8jNZV.png


When FreeNAS started booting up - my "datastore" zpool was in status unknown.

ocWunV8.png


When I go into disks - I only see 2 disks - when I would expect to see 4?

NeNnxCq.png


Any ideas what's going on?

I should also mention these are Seagate 8TB Archive drives. I recently replaced one of them after it suddenly failed. Also, when the iLO hung the first time, I updated the HP iLO firmware from 2.50 to 2.54.

Regards,
Victor
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Off the cuff I'd say you have two drives which either do not have power, the SATA connector is bad for both, or you have two hard drive failures. You will need to physically check the machine to see what is going on.

EDIT: In my signature is my Hard Drive Troubleshooting Guide, if you find the drives are spinning and the BIOS recognizes all the drives.
 

victorhooi

Contributor
Joined
Mar 16, 2012
Messages
184
I rebooted the machine - and all the drives came back.

Then one of the drives disappeared:

qgpyKTa.png

5yZh8Gk.png



I'm doing an rclone right now - but it's going very slowly (400-500 KB/s upload) as my upstream bandwidth is quite constrained (I live in Australia).
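
To put that upload rate in perspective, here is a rough back-of-the-envelope sketch in Python. The 1 TB figure is only an illustrative amount (I have not stated how much data is on the pool), and the function name `upload_days` is made up for this example.

```python
def upload_days(total_bytes, rate_bytes_per_s):
    """Days needed to upload total_bytes at a steady rate."""
    return total_bytes / rate_bytes_per_s / 86400

ONE_TB = 10**12  # decimal terabyte, an illustrative amount only

for rate_kb in (400, 500):
    days = upload_days(ONE_TB, rate_kb * 1000)
    print(f"{rate_kb} KB/s: {days:.1f} days per TB")
```

At these rates every terabyte takes roughly 23 to 29 days, which is why a faster uplink is tempting.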

I can shut down the machine and try a different network with a faster upload - not sure if rebooting the machine is risky though?

I did obtain SMART results - see below.

I haven't tried the long test yet - should I try that, considering that I've lost one drive in a RAID-Z1, and I'm currently running rclone to upload files?

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Archive HDD
Device Model:	 ST8000AS0002-1NA17Z
Serial Number:	Z840JLGV
LU WWN Device Id: 5 000c50 091436f5f
Firmware Version: RT17
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5980 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sat Aug 12 11:15:59 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
 was completed without error.
 Auto Offline Data Collection: Enabled.
Self-test execution status:	 ( 0) The previous self-test routine completed
 without error or no self-test has ever
 been run.
Total time to complete Offline
data collection:  (	0) seconds.
Offline data collection
capabilities:   (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
 Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
 power-saving mode.
 Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
 General Purpose Logging supported.
Short self-test routine
recommended polling time:   ( 1) minutes.
Extended self-test routine
recommended polling time:   ( 937) minutes.
Conveyance self-test routine
recommended polling time:   ( 2) minutes.
SCT capabilities:		 (0x30b5) SCT Status supported.
 SCT Feature Control supported.
 SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		 FLAG	 VALUE WORST THRESH TYPE	 UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate	 0x000f 117 099 006	Pre-fail Always	 -	 153485320
3 Spin_Up_Time			0x0003 091 091 000	Pre-fail Always	 -	 0
4 Start_Stop_Count		0x0032 100 100 020	Old_age Always	 -	 20
5 Reallocated_Sector_Ct 0x0033 100 100 010	Pre-fail Always	 -	 0
7 Seek_Error_Rate		 0x000f 074 060 030	Pre-fail Always	 -	 60562704978
9 Power_On_Hours		 0x0032 090 090 000	Old_age Always	 -	 8841
10 Spin_Retry_Count		0x0013 100 100 097	Pre-fail Always	 -	 0
12 Power_Cycle_Count	 0x0032 100 100 020	Old_age Always	 -	 20
183 Runtime_Bad_Block	 0x0032 100 100 000	Old_age Always	 -	 0
184 End-to-End_Error		0x0032 100 100 099	Old_age Always	 -	 0
187 Reported_Uncorrect	 0x0032 100 100 000	Old_age Always	 -	 0
188 Command_Timeout		 0x0032 100 100 000	Old_age Always	 -	 0
189 High_Fly_Writes		 0x003a 100 100 000	Old_age Always	 -	 0
190 Airflow_Temperature_Cel 0x0022 063 055 045	Old_age Always	 -	 37 (Min/Max 26/38)
191 G-Sense_Error_Rate	 0x0032 100 100 000	Old_age Always	 -	 0
192 Power-Off_Retract_Count 0x0032 092 092 000	Old_age Always	 -	 16160
193 Load_Cycle_Count		0x0032 088 088 000	Old_age Always	 -	 24492
194 Temperature_Celsius	 0x0022 037 045 000	Old_age Always	 -	 37 (0 23 0 0 0)
195 Hardware_ECC_Recovered 0x001a 117 099 000	Old_age Always	 -	 153485320
197 Current_Pending_Sector 0x0012 100 100 000	Old_age Always	 -	 0
198 Offline_Uncorrectable 0x0010 100 100 000	Old_age Offline	 -	 0
199 UDMA_CRC_Error_Count	0x003e 200 200 000	Old_age Always	 -	 0
240 Head_Flying_Hours	 0x0000 100 253 000	Old_age Offline	 -	 8144 (216 195 0)
241 Total_LBAs_Written	 0x0000 100 253 000	Old_age Offline	 -	 17015412210
242 Total_LBAs_Read		 0x0000 100 253 000	Old_age Offline	 -	 435513238950

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
	1		0		0 Not_testing
	2		0		0 Not_testing
	3		0		0 Not_testing
	4		0		0 Not_testing
	5		0		0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:	 ST8000DM004-2CX188
Serial Number:	WG801RDF
LU WWN Device Id: 5 000c50 0a9b71cbd
Firmware Version: 0001
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5425 rpm
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sat Aug 12 11:16:17 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
 was completed without error.
 Auto Offline Data Collection: Enabled.
Self-test execution status:	 ( 0) The previous self-test routine completed
 without error or no self-test has ever
 been run.
Total time to complete Offline
data collection:  (	0) seconds.
Offline data collection
capabilities:   (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
 Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
 power-saving mode.
 Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
 General Purpose Logging supported.
Short self-test routine
recommended polling time:   ( 1) minutes.
Extended self-test routine
recommended polling time:   ( 967) minutes.
Conveyance self-test routine
recommended polling time:   ( 2) minutes.
SCT capabilities:		 (0x30a5) SCT Status supported.
 SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		 FLAG	 VALUE WORST THRESH TYPE	 UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate	 0x000f 100 064 006	Pre-fail Always	 -	 1438384
3 Spin_Up_Time			0x0003 094 094 000	Pre-fail Always	 -	 0
4 Start_Stop_Count		0x0032 100 100 020	Old_age Always	 -	 6
5 Reallocated_Sector_Ct 0x0033 100 100 010	Pre-fail Always	 -	 0
7 Seek_Error_Rate		 0x000f 079 060 045	Pre-fail Always	 -	 74291490
9 Power_On_Hours		 0x0032 100 100 000	Old_age Always	 -	 464 (169 60 0)
10 Spin_Retry_Count		0x0013 100 100 097	Pre-fail Always	 -	 0
12 Power_Cycle_Count	 0x0032 100 100 020	Old_age Always	 -	 6
183 Runtime_Bad_Block	 0x0032 100 100 000	Old_age Always	 -	 0
184 End-to-End_Error		0x0032 100 100 099	Old_age Always	 -	 0
187 Reported_Uncorrect	 0x0032 100 100 000	Old_age Always	 -	 0
188 Command_Timeout		 0x0032 100 100 000	Old_age Always	 -	 0
189 High_Fly_Writes		 0x003a 100 100 000	Old_age Always	 -	 0
190 Airflow_Temperature_Cel 0x0022 064 050 040	Old_age Always	 -	 36 (Min/Max 26/36)
191 G-Sense_Error_Rate	 0x0032 100 100 000	Old_age Always	 -	 0
192 Power-Off_Retract_Count 0x0032 100 100 000	Old_age Always	 -	 18
193 Load_Cycle_Count		0x0032 100 100 000	Old_age Always	 -	 27
194 Temperature_Celsius	 0x0022 036 050 000	Old_age Always	 -	 36 (0 22 0 0 0)
195 Hardware_ECC_Recovered 0x001a 100 064 000	Old_age Always	 -	 1438384
197 Current_Pending_Sector 0x0012 100 100 000	Old_age Always	 -	 0
198 Offline_Uncorrectable 0x0010 100 100 000	Old_age Offline	 -	 0
199 UDMA_CRC_Error_Count	0x003e 200 200 000	Old_age Always	 -	 0
240 Head_Flying_Hours	 0x0000 100 253 000	Old_age Offline	 -	 460 (243 114 0)
241 Total_LBAs_Written	 0x0000 100 253 000	Old_age Offline	 -	 12742712041
242 Total_LBAs_Read		 0x0000 100 253 000	Old_age Offline	 -	 148776659

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
	1		0		0 Not_testing
	2		0		0 Not_testing
	3		0		0 Not_testing
	4		0		0 Not_testing
	5		0		0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/ada2: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Archive HDD
Device Model:	 ST8000AS0002-1NA17Z
Serial Number:	Z840L2JP
LU WWN Device Id: 5 000c50 0918234a4
Firmware Version: RT17
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5980 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Sat Aug 12 11:16:24 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
 was completed without error.
 Auto Offline Data Collection: Enabled.
Self-test execution status:	 ( 0) The previous self-test routine completed
 without error or no self-test has ever
 been run.
Total time to complete Offline
data collection:  (	0) seconds.
Offline data collection
capabilities:   (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
 Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
 power-saving mode.
 Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
 General Purpose Logging supported.
Short self-test routine
recommended polling time:   ( 1) minutes.
Extended self-test routine
recommended polling time:   ( 926) minutes.
Conveyance self-test routine
recommended polling time:   ( 2) minutes.
SCT capabilities:		 (0x30b5) SCT Status supported.
 SCT Feature Control supported.
 SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		 FLAG	 VALUE WORST THRESH TYPE	 UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate	 0x000f 119 099 006	Pre-fail Always	 -	 221648120
3 Spin_Up_Time			0x0003 091 091 000	Pre-fail Always	 -	 0
4 Start_Stop_Count		0x0032 100 100 020	Old_age Always	 -	 20
5 Reallocated_Sector_Ct 0x0033 100 100 010	Pre-fail Always	 -	 0
7 Seek_Error_Rate		 0x000f 069 060 030	Pre-fail Always	 -	 219447791193
9 Power_On_Hours		 0x0032 090 090 000	Old_age Always	 -	 8842
10 Spin_Retry_Count		0x0013 100 100 097	Pre-fail Always	 -	 0
12 Power_Cycle_Count	 0x0032 100 100 020	Old_age Always	 -	 20
183 Runtime_Bad_Block	 0x0032 100 100 000	Old_age Always	 -	 0
184 End-to-End_Error		0x0032 100 100 099	Old_age Always	 -	 0
187 Reported_Uncorrect	 0x0032 100 100 000	Old_age Always	 -	 0
188 Command_Timeout		 0x0032 100 100 000	Old_age Always	 -	 0
189 High_Fly_Writes		 0x003a 100 100 000	Old_age Always	 -	 0
190 Airflow_Temperature_Cel 0x0022 062 056 045	Old_age Always	 -	 38 (Min/Max 27/38)
191 G-Sense_Error_Rate	 0x0032 100 100 000	Old_age Always	 -	 0
192 Power-Off_Retract_Count 0x0032 093 093 000	Old_age Always	 -	 15512
193 Load_Cycle_Count		0x0032 089 089 000	Old_age Always	 -	 23809
194 Temperature_Celsius	 0x0022 038 044 000	Old_age Always	 -	 38 (0 21 0 0 0)
195 Hardware_ECC_Recovered 0x001a 119 099 000	Old_age Always	 -	 221648120
197 Current_Pending_Sector 0x0012 100 100 000	Old_age Always	 -	 0
198 Offline_Uncorrectable 0x0010 100 100 000	Old_age Offline	 -	 0
199 UDMA_CRC_Error_Count	0x003e 200 200 000	Old_age Always	 -	 0
240 Head_Flying_Hours	 0x0000 100 253 000	Old_age Offline	 -	 8105 (70 171 0)
241 Total_LBAs_Written	 0x0000 100 253 000	Old_age Offline	 -	 17156296077
242 Total_LBAs_Read		 0x0000 100 253 000	Old_age Offline	 -	 150835704870

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description	Status				 Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline	Completed without error	 00%	 7822		 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
	1		0		0 Not_testing
	2		0		0 Not_testing
	3		0		0 Not_testing
	4		0		0 Not_testing
	5		0		0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You do not have any data for the drive serial number Z840KXMC.

I did not see anything wrong with the SMART data provided, other than that you do not test your hard drives. I would recommend you set up routine SMART short and long tests in FreeNAS. Routine testing may have warned you about this failure early enough to avoid it.
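
As an illustration only (this is not a tool from the thread), a quick check of the attribute table could be sketched in Python like this. The watch list is an assumption: a handful of attributes whose raw value normally stays at zero on a healthy drive. The function name `nonzero_watched` is hypothetical.

```python
# Attributes whose raw value should normally stay at zero on a healthy drive.
# This watch list is an assumption for illustration, not an official standard.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged

# Rows copied from the smartctl output posted above (whitespace normalised).
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(nonzero_watched(SAMPLE))  # prints {} for the rows quoted here
```

An empty result only says these counters are zero; it is no substitute for actually running the short and long tests.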

My advice is to power down, move the SATA cable from the drive serial number Z840KXMC to a different SATA port. Power up again and as soon as possible grab the SMART data.

Here are the likely causes of your failure:
1) Failed hard drive
2) Failed SATA cable (yes, they can go bad just sitting there)
3) Failed SATA controller

So you want to work this backwards:
1) Move the SATA cable to a different SATA port
2) Replace the SATA cable
3) Replace the hard drive

You can also try to boot up your system using something like FreeBSD Live or Ubuntu Live (I like Ubuntu) and see if you can grab the SMART data. You can also remove that suspect hard drive and attach it to another computer to grab the SMART data. There are many ways to go about it.
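
Whichever way you boot, grabbing the data from every drive in one pass might look like the Python sketch below. It simply shells out to `smartctl -a`; the helper names `smart_report` and `collect` are made up for this example, and the device paths you pass in are up to you (on FreeBSD, `camcontrol devlist` shows what is attached).

```python
import subprocess

def smart_report(device, run=subprocess.run):
    """Run `smartctl -a` on one device and return (exit_code, output_text)."""
    result = run(["smartctl", "-a", device], capture_output=True, text=True)
    return result.returncode, result.stdout

def collect(devices, run=subprocess.run):
    """Map each device path to its smartctl exit code and output."""
    return {dev: smart_report(dev, run) for dev in devices}
```

For example, `collect(["/dev/ada0", "/dev/ada1", "/dev/ada2", "/dev/ada3"])` returns one entry per drive. A non-zero exit code from smartctl means something needs a closer look, such as a device that could not be opened. The `run` parameter exists only so the function can be exercised without real hardware.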

victorhooi said: "I rebooted the machine - and all the drives came back. Then one of the drives disappeared:"

How much time are we talking about before it disappeared? 5 minutes, 30 minutes, longer?
 

muctl

Cadet
Joined
Aug 8, 2017
Messages
3
I have the same problem. MicroServer Gen8, 2 Seagate 4TB drives.
The drives are initialized by the BIOS, but during FreeNAS bootup the second drive gives the "command timeout" error, and the drive then disappears from the BIOS at reboot. After a few immediate power cycles the drive is back in normal operation.
SMART shows no errors on either drive.
I noticed that one drive is ACS-2 and the other ACS-3... perhaps a problem with the FreeNAS/BSD SATA driver?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Start your own thread. Provide your FreeNAS version, hardware specs, drive layout and SMART data.

 