SOLVED Various SCSI sense errors during scrubbing

tobiasbp · Jun 1, 2017

Update with solution highlights:
The reason for the errors turned out to be WD Red drives of this type:
WD60EFRX-68L0BN1

This type of WD Red also throws (a few) errors, but only when drives of type WD60EFRX-68L0BN1 are in the zpool:
WD60EFRX-68MYMN1

Conclusion:

Avoid drives of type WD60EFRX-68L0BN1.
Problems may only occur on RAIDs with a large number of drives (I have 24).
WD60EFRX-68L0BN1 (problematic disk) is the newer version of WD60EFRX-68MYMN1. It seems drive performance was degraded in the newer version.

Original post:
I have recently set up a new system (My 4th FreeNAS) with 24 6TB WD drives in 12 mirrors. Specs are at the end of this post.

The problem:
During scrubbing, I always get errors (which are fixed by ZFS). I understand that this is not uncommon, but it happens on every back-to-back scrub I run on the pool.

Looking at dmesg, I see SCSI sense errors. The errors are not always the same, and not always on the same drives (see errors at the end).

SMART data reveals nothing obvious to be wrong with the drives.

I have changed drives, but the errors appear to occur on random disks in the pool. The pool is a mix of older and brand new drives, and drives of all ages are affected, including the replacements. I have a hard time believing that all my drives are bad, so I have started looking elsewhere.

Errors occur on drives on both backplanes (See specs).

I have done the following without resolving the problem:

Changed SATA cables
Changed disks
Ran memtest on RAM
Updated firmware on IBM ServeRAID M1015 (See specs)
Moved/reseated the M1015
Used a RocketRAID HBA instead of the M1015.
Used the SATA connector on the motherboard instead of an HBA.

This is what I am considering trying:

Upgrade firmware on backplanes (Don't know how)
Change motherboard
Only run on a single power supply to see if one is bad.
Upgrade to FreeNAS 11 (Hoping for a software issue in the OS as the cause)

I have seen similar threads on the forum, but no solution seems to be identified.

In general: How (un)acceptable is it for SCSI errors to occur?

Thank you for your time,
Tobias

Example SCSI errors

With IBM M1015 HBA

Code:

(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 44 58 00 00 d0 00 length 106496 SMID 367 terminated ioc 804b scsi 0 state 0 xfer 0
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 44 58 00 00 d0 00
(da17:mps0:0:33:0): CAM status: CCB request completed with an error
(da17:mps0:0:33:0): Retrying command
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 43 c0 00 00 98 00
(da17:mps0:0:33:0): CAM status: SCSI Status Error
(da17:mps0:0:33:0): SCSI status: Check Condition
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): Info: 0x7c3e43c0
(da17:mps0:0:33:0): Error 5, Unretryable error
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 ce 40 00 00 b8 00 length 94208 SMID 899 terminated ioc 804b scsi 0 state 0 xfer 0
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 ce 40 00 00 b8 00
(da17:mps0:0:33:0): CAM status: CCB request completed with an error
(da17:mps0:0:33:0): Retrying command
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 cd 88 00 00 b8 00
(da17:mps0:0:33:0): CAM status: SCSI Status Error
(da17:mps0:0:33:0): SCSI status: Check Condition
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): Info: 0x7f09cd88
(da17:mps0:0:33:0): Error 5, Unretryable error

(da4:mps0:0:19:0): WRITE(10). CDB: 2a 00 01 a9 72 98 00 00 20 00
(da4:mps0:0:19:0): CAM status: SCSI Status Error
(da4:mps0:0:19:0): SCSI status: Check Condition
(da4:mps0:0:19:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da4:mps0:0:19:0): Info: 0x1a97298
(da4:mps0:0:19:0): Error 22, Unretryable error

(da10:mps0:0:25:0): WRITE(10). CDB: 2a 00 0d 2a 0c c0 00 00 08 00
(da10:mps0:0:25:0): CAM status: SCSI Status Error
(da10:mps0:0:25:0): SCSI status: Check Condition
(da10:mps0:0:25:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da10:mps0:0:25:0): Info: 0xd2a0cc0
(da10:mps0:0:25:0): Error 22, Unretryable error
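
For anyone reading the CDB lines above: the bytes can be decoded by hand. The following is a minimal Python sketch, assuming the standard READ/WRITE(10) and READ/WRITE(16) CDB layouts and 512-byte logical sectors. The function name and the nominal 6 TB sector count are illustrative assumptions, not something taken from the system. Under those assumptions, the block addresses flagged as "Logical block address out of range" sit far inside a 6 TB drive, which hints that the sense data may not describe a real addressing problem.

```python
# Nominal capacity of a 6 TB drive in 512-byte sectors (assumption, not
# read from the actual drives).
NOMINAL_6TB_SECTORS = 6 * 10**12 // 512

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x88: "READ(16)", 0x8A: "WRITE(16)"}


def decode_rw_cdb(cdb_hex, sector_bytes=512):
    """Decode a READ/WRITE (10) or (16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    opcode = cdb[0]
    if opcode in (0x28, 0x2A) and len(cdb) == 10:
        lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
        blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    elif opcode in (0x88, 0x8A) and len(cdb) == 16:
        lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
        blocks = int.from_bytes(cdb[10:14], "big") # bytes 10-13: transfer length
    else:
        raise ValueError("unsupported opcode or CDB length: %r" % cdb_hex)
    return {
        "op": OPCODES[opcode],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * sector_bytes,
    }


# First READ(10) from the M1015 log: 208 blocks * 512 bytes = 106496,
# which matches "length 106496" printed by the driver.
read_example = decode_rw_cdb("28 00 7c 3e 44 58 00 00 d0 00")

# WRITE(10) on da4 that came back as "Logical block address out of range".
write_example = decode_rw_cdb("2a 00 01 a9 72 98 00 00 20 00")
in_range = write_example["lba"] + write_example["blocks"] < NOMINAL_6TB_SECTORS

print(read_example)
print(write_example, "within nominal 6 TB capacity:", in_range)
```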

With RocketRAID HBA

So, these are not SCSI errors. What is the implication of that?
hpt27xx0: <odin> mem 0xdfb40000-0xdfb5ffff,0xdfb00000-0xdfb3ffff irq 32 at device 0.0 on pci3

Code:

interrupt storm detected on "irq32:"; throttling interrupt source
interrupt storm detected on "irq32:"; throttling interrupt source
interrupt storm detected on "irq32:"; throttling interrupt source
hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0x8c382950,LBA[4-7]=0x1.
hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xcafaba8,LBA[4-7]=0x0.

With on board SATA

Code:

isci: 1496151335:247708 ISCI Sending reset to device on controller 0 domain 0 CAM index 17
isci: 1496151335:248864 ISCI isci: bus=1 target=11 lun=0 cdb[0]=35 terminated
(da13:isci0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:isci0:0:17:0): CAM status: CCB request terminated by the host
(da13:isci0:0:17:0): Retrying command
isci: 1496183958:044159 ISCI isci: bus=1 target=1b lun=0 cdb[0]=28 terminated
(da23:isci0:0:27:0): READ(10). CDB: 28 00 e4 54 28 70 00 01 00 00
(da23:isci0:0:27:0): CAM status: SCSI Status Error
(da23:isci0:0:27:0): SCSI status: Check Condition
(da23:isci0:0:27:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da23:isci0:0:27:0): Info: 0xe4542870
(da23:isci0:0:27:0): Retrying command (per sense data)
(da23:isci0:0:27:0): READ(10). CDB: 28 00 e4 54 27 70 00 01 00 00
(da23:isci0:0:27:0): CAM status: CCB request terminated by the host
(da23:isci0:0:27:0): Retrying command
isci: 1496198442:738629 ISCI isci: bus=1 target=f lun=0 cdb[0]=88 terminated
(da11:isci0:0:15:0): READ(16). CDB: 88 00 00 00 00 01 2a 0c 10 c8 00 00 00 c8 00 00
(da11:isci0:0:15:0): CAM status: SCSI Status Error
(da11:isci0:0:15:0): SCSI status: Check Condition
(da11:isci0:0:15:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da11:isci0:0:15:0): Retrying command (per sense data)
(da11:isci0:0:15:0): READ(16). CDB: 88 00 00 00 00 01 2a 0c 12 90 00 00 00 e0 00 00
(da11:isci0:0:15:0): CAM status: CCB request terminated by the host
(da11:isci0:0:15:0): Retrying command
isci: 1496198442:993016 ISCI isci: bus=1 target=18 lun=0 cdb[0]=88 terminated
(da20:isci0:0:24:0): READ(16). CDB: 88 00 00 00 00 01 2c 6f 84 18 00 00 01 00 00 00
(da20:isci0:0:24:0): CAM status: SCSI Status Error
(da20:isci0:0:24:0): SCSI status: Check Condition
(da20:isci0:0:24:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da20:isci0:0:24:0): Retrying command (per sense data)
(da20:isci0:0:24:0): READ(16). CDB: 88 00 00 00 00 01 2c 6f 80 40 00 00 01 00 00 00
(da20:isci0:0:24:0): CAM status: CCB request terminated by the host
(da20:isci0:0:24:0): Retrying command
isci: 1496220206:601979 ISCI isci: bus=1 target=15 lun=0 cdb[0]=88 terminated
(da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 5f c1 70 78 00 00 01 00 00 00
(da17:isci0:0:21:0): CAM status: SCSI Status Error
(da17:isci0:0:21:0): SCSI status: Check Condition
(da17:isci0:0:21:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da17:isci0:0:21:0): Retrying command (per sense data)
(da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 5f c1 71 78 00 00 00 d0 00 00
(da17:isci0:0:21:0): CAM status: CCB request terminated by the host
(da17:isci0:0:21:0): Retrying command
(da12:isci0:0:16:0): WRITE(10). CDB: 2a 00 29 71 99 a0 00 00 08 00
(da12:isci0:0:16:0): CAM status: SCSI Status Error
(da12:isci0:0:16:0): SCSI status: Check Condition
(da12:isci0:0:16:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da12:isci0:0:16:0): Error 22, Unretryable error
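
To check whether the errors cluster on particular drives or controllers, the "SCSI sense" lines can be tallied per device. This is a minimal Python sketch, assuming the log format shown above; the function name is hypothetical and the sample is a handful of lines copied from the logs in this post, not the full dmesg output.

```python
import re
from collections import Counter

# Matches lines like:
# (da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
SENSE_RE = re.compile(
    r"\((da\d+):(\w+):\d+:\d+:\d+\): SCSI sense: ([A-Z ]+) asc:"
)


def tally_sense_errors(log_text):
    """Count SCSI sense errors per (device, controller, sense key)."""
    counts = Counter()
    for match in SENSE_RE.finditer(log_text):
        device, controller, sense_key = match.groups()
        counts[(device, controller, sense_key.strip())] += 1
    return counts


sample_log = """\
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mps0:0:19:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da23:isci0:0:27:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
"""

counts = tally_sense_errors(sample_log)
for (device, controller, sense_key), n in sorted(counts.items()):
    print(device, controller, sense_key, n)
```

Feeding it the whole of /var/log/messages after each scrub would give a per-drive count that is easier to compare between test runs than the raw log.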

Tech specs

Code:

Motherboard:
Supermicro X9DRi-LN4F+
https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRi-LN4F_.cfm

CPU:
2x E5-2650 Oct Core

RAM:
128GB ECC RAM

Chassis:
Supermicro SuperChassis 847E16-R1400LPB
http://www.supermicro.com.tw/products/chassis/4U/847/SC847E16-R1400LPB

Power:
Dual 1400W

SATA backplanes:
SAS2-826EL1
SAS2-846EL1

HBA:
IBM ServeRAID M1015
mps0: Firmware: 20.00.07.00, Driver: 21.01.00.00-fbsd

Storage:
24 6TB drives in 12 mirrors

OS:
FreeNAS-9.10.2-U4 (27ae72978)

Morpheus187 · Jun 1, 2017

You could also test the following (if possible):

- Test with just 2 disks plugged in to see if the error recurs during a scrub; maybe the power is unstable with all disks connected.
- As you suggested, try running on just one power supply.
- If possible, connect the disks without using the backplanes.
- Run SMART tests and see if they report anything.

As you describe it, the problem seems to be systematic rather than a single disk or a single cable. It's unlikely that ALL cables are faulty, or all disks, or both backplanes. I would bet on the power supply: power is the only thing that affects everything and could lead to such strange errors. Have you connected a UPS? (You could test one PSU on the UPS and the other on normal power, and vice versa.)

tobiasbp · Jun 1, 2017

Thank you for your suggestions. I have removed half of my drives, thus running 12 degraded mirrors (See below).

I will connect the machine to a UPS after this (fewer drives) test.

The machine is currently not connected to a UPS.

The degraded zpool is now being scrubbed. Do errors still occur? We will see.

Code:

  pool: ultraman
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: scrub in progress since Thu Jun  1 14:17:58 2017
		22.0G scanned out of 37.5T at 536M/s, 20h22m to go
		0 repaired, 0.06% done
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										DEGRADED	 0	 0	 0
	  mirror-0									  DEGRADED	 0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		2182083406617386173						 REMOVED	  0	 0	 0  was /dev/gptid/6e71919e-1618-11e7-a3b7-0025901ef244
	  mirror-1									  DEGRADED	 0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		17769119138064296099						REMOVED	  0	 0	 0  was /dev/gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244
	  mirror-2									  DEGRADED	 0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		11172241336233623768						REMOVED	  0	 0	 0  was /dev/gptid/3d414933-3a05-11e7-af73-0025901ef244
	  mirror-3									  DEGRADED	 0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		16893817231742477117						REMOVED	  0	 0	 0  was /dev/gptid/9e899578-183c-11e7-ae9d-0025901ef244
	  mirror-4									  DEGRADED	 0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		10314217277093796089						REMOVED	  0	 0	 0  was /dev/gptid/4342a98c-184c-11e7-ae9d-0025901ef244
	  mirror-5									  DEGRADED	 0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		7446801437149806016						 REMOVED	  0	 0	 0  was /dev/gptid/a2851364-184c-11e7-ae9d-0025901ef244
	  mirror-6									  DEGRADED	 0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		13826095765754444936						REMOVED	  0	 0	 0  was /dev/gptid/0e9ed582-184d-11e7-ae9d-0025901ef244
	  mirror-7									  DEGRADED	 0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		15144760684540839910						REMOVED	  0	 0	 0  was /dev/gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244
	  mirror-8									  DEGRADED	 0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		12474442587277789598						REMOVED	  0	 0	 0  was /dev/gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244
	  mirror-9									  DEGRADED	 0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		15352642184540883744						REMOVED	  0	 0	 0  was /dev/gptid/90b48d70-184d-11e7-ae9d-0025901ef244
	  mirror-10									 DEGRADED	 0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		9506731245968758733						 REMOVED	  0	 0	 0  was /dev/gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244
	  mirror-11									 DEGRADED	 0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		3534930770157901362						 REMOVED	  0	 0	 0  was /dev/gptid/83d890bc-3a08-11e7-af73-0025901ef244

errors: No known data errors

Stux · Jun 1, 2017

Sounds like PSU to me, or the power wiring.

How are the drives powered?

You shouldn't be getting any scsi errors.

Ericloewe · Jun 1, 2017

tobiasbp said:
With RocketRAID HBA

So, these are not SCSI errors. What is the implication of that?
hpt27xx0: <odin> mem 0xdfb40000-0xdfb5ffff,0xdfb00000-0xdfb3ffff irq 32 at device 0.0 on pci3

Crap driver. Don't use it.

tobiasbp said:
With on board SATA

That's not SATA, those are all SCSI-attached drives (SATA on an SAS HBA, USB or whatever).

Stux said:
Sounds like PSU to me, or the power wiring.

Yeah, it sounds like bad power.

tobiasbp · Jun 1, 2017

Stux said:
How are the drives powered?

They are hooked up to the backplane. I don't know how to answer in more detail.

This is the chassis:
http://www.supermicro.com.tw/products/chassis/4U/847/SC847E16-R1400LPB

Stux said:
You shouldn't be getting any scsi errors.

Good to know. I was wondering.

Thank you.

DrKK · Jun 1, 2017

Definitely the PSU.

tobiasbp · Jun 5, 2017

I'm now back, and can report on the result of the scrub of the zpool of 12 degraded mirrors.

The backplane is connected to the motherboard's on-board connector.

There was a single SCSI error during the scrub:

Code:

(da12:isci0:0:16:0): WRITE(10). CDB: 2a 00 0a 6b cb 98 00 00 08 00
(da12:isci0:0:16:0): CAM status: SCSI Status Error
(da12:isci0:0:16:0): SCSI status: Check Condition
(da12:isci0:0:16:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da12:isci0:0:16:0): Error 22, Unretryable error

Highlights from zpool status:

Code:

  pool: ultraman
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 12K in 0h0m with 0 errors on Sat Jun  3 05:28:46 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										DEGRADED	 0	 0	 0

...
...
	  mirror-6									  DEGRADED	 0	 1	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
		13826095765754444936						REMOVED	  0	 0	 0  was /dev/gptid/0e9ed582-184d-11e7-ae9d-0025901ef244
...
...

The disk with the write error in mirror-6 (0dcbcccd-184d-11e7-ae9d-0025901ef244) is the disk da12 with the SCSI error.
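
Spotting the one non-zero counter in a 24-disk pool by eye is error-prone, so the config table can be filtered mechanically. A minimal Python sketch follows, assuming the column layout shown above (NAME, STATE, READ, WRITE, CKSUM) and plain integer counters; the function name is hypothetical, and abbreviated counters such as "1.2K" would be skipped by this version.

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with a non-zero counter."""
    found = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] == "NAME":
            continue
        name, state, read, write, cksum = parts[:5]
        # Skip anything that is not a config row with integer counters.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        if int(read) or int(write) or int(cksum):
            found.append((name, state, int(read), int(write), int(cksum)))
    return found


# Excerpt of the zpool status output from this post.
sample_status = """\
  scan: resilvered 12K in 0h0m with 0 errors on Sat Jun  3 05:28:46 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    ultraman                                        DEGRADED     0     0     0
      mirror-6                                      DEGRADED     0     1     0
        gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE       0     1     0
        13826095765754444936                        REMOVED      0     0     0  was /dev/gptid/0e9ed582-184d-11e7-ae9d-0025901ef244
"""

flagged = devices_with_errors(sample_status)
for row in flagged:
    print(row)
```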

What can I tell from this? I'm not sure... I was hoping that the lower load (12 disks instead of 24) would result in no SCSI errors, which would have indicated a problem with the power supplies.

What should my next test be? Could just running the server off a UPS make a difference? Would it be better to run off a single internal power supply (each in turn) to try to weed out the bad one?

Any suggestions/thoughts welcome.

Thanks,
Tobias

tobiasbp · Jun 6, 2017

I have pulled out one power supply and started a new scrub of the pool. The pool is still degraded (12 disks instead of 24).

The backplane is connected to the motherboard directly (No HBA in use).

Power levels look good to me:

Code:

[***] ~# ipmitool sdr

CPU1 Temp		| 54 degrees C	  | ok
CPU2 Temp		| 52 degrees C	  | ok
System Temp	  | 36 degrees C	  | ok
Peripheral Temp  | 42 degrees C	  | ok
PCH Temp		 | 46 degrees C	  | ok
P1-DIMMA1 TEMP   | 44 degrees C	  | ok
P1-DIMMA2 TEMP   | 44 degrees C	  | ok
P1-DIMMA3 TEMP   | no reading		| ns
P1-DIMMB1 TEMP   | 42 degrees C	  | ok
P1-DIMMB2 TEMP   | 42 degrees C	  | ok
P1-DIMMB3 TEMP   | no reading		| ns
P1-DIMMC1 TEMP   | 39 degrees C	  | ok
P1-DIMMC2 TEMP   | 39 degrees C	  | ok
P1-DIMMC3 TEMP   | no reading		| ns
P1-DIMMD1 TEMP   | 40 degrees C	  | ok
P1-DIMMD2 TEMP   | 40 degrees C	  | ok
P1-DIMMD3 TEMP   | no reading		| ns
P2-DIMME1 TEMP   | 43 degrees C	  | ok
P2-DIMME2 TEMP   | 43 degrees C	  | ok
P2-DIMME3 TEMP   | no reading		| ns
P2-DIMMF1 TEMP   | 44 degrees C	  | ok
P2-DIMMF2 TEMP   | 45 degrees C	  | ok
P2-DIMMF3 TEMP   | no reading		| ns
P2-DIMMG1 TEMP   | 45 degrees C	  | ok
P2-DIMMG2 TEMP   | 47 degrees C	  | ok
P2-DIMMG3 TEMP   | no reading		| ns
P2-DIMMH1 TEMP   | 46 degrees C	  | ok
P2-DIMMH2 TEMP   | 46 degrees C	  | ok
P2-DIMMH3 TEMP   | no reading		| ns
FAN1			 | 3900 RPM		  | ok
FAN2			 | 4425 RPM		  | ok
FAN3			 | 4350 RPM		  | ok
FAN4			 | 4500 RPM		  | ok
FAN5			 | no reading		| ns
FAN6			 | no reading		| ns
FANA			 | no reading		| ns
FANB			 | no reading		| ns
VTT			  | 1.04 Volts		| ok
CPU1 Vcore	   | 0.93 Volts		| ok
CPU2 Vcore	   | 0.93 Volts		| ok
VDIMM AB		 | 1.47 Volts		| ok
VDIMM CD		 | 1.49 Volts		| ok
VDIMM EF		 | 1.49 Volts		| ok
VDIMM GH		 | 1.49 Volts		| ok
+1.1 V		   | 1.09 Volts		| ok
+1.5 V		   | 1.47 Volts		| ok
3.3V			 | 3.31 Volts		| ok
+3.3VSB		  | 3.31 Volts		| ok
5V			   | 4.99 Volts		| ok
+5VSB			| 4.99 Volts		| ok
12V			  | 11.98 Volts	   | ok
VBAT			 | 2.98 Volts		| ok
HDD Status	   | no reading		| ns
Chassis Intru	| 0x00			  | ok
PS1 Status	   | 0x00			  | ok
PS2 Status	   | 0x01			  | ok
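
The rail readings can be turned into a percentage deviation from nominal to make "looks good" a bit more concrete. This is a minimal Python sketch, assuming the three-column `ipmitool sdr` layout shown above and a tolerance of plus or minus 5 percent (the usual ATX figure, used here as an assumption); the function name and the nominal-voltage table are illustrative. Note that BMC sensors are sampled slowly, so short dips under load would not necessarily show up in these numbers.

```python
# Nominal voltages for the main rails, keyed by the sensor names in the output.
NOMINAL = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}
TOLERANCE_PERCENT = 5.0  # assumed tolerance band


def rail_deviation(sdr_text, nominal=NOMINAL):
    """Return {sensor name: deviation from nominal in percent}."""
    result = {}
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3 or fields[0] not in nominal:
            continue
        reading = fields[1].split()
        if len(reading) != 2 or reading[1] != "Volts":
            continue  # e.g. "no reading"
        measured = float(reading[0])
        expected = nominal[fields[0]]
        result[fields[0]] = (measured - expected) / expected * 100.0
    return result


# Lines copied from the ipmitool sdr output in this post.
sample_sdr = """\
3.3V             | 3.31 Volts        | ok
+3.3VSB          | 3.31 Volts        | ok
5V               | 4.99 Volts        | ok
12V              | 11.98 Volts       | ok
HDD Status       | no reading        | ns
"""

deviations = rail_deviation(sample_sdr)
for name, percent in deviations.items():
    status = "ok" if abs(percent) <= TOLERANCE_PERCENT else "OUT OF BAND"
    print("%-5s %+.2f%% %s" % (name, percent, status))
```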

tvsjr · Jun 6, 2017

Common threads are your PSU and your backplane. Your system is nearly identical to mine (see sig... nearly identical) so it should work.

I did update my backplane firmware once upon a time chasing another issue, but it wasn't this issue. The updates have to come from Supermicro directly... I don't believe they publish them.

It also seems very weird to me that you're losing one drive in every mirror... that you don't have one mirror that drops both drives, while another drops no drives.

You've got 24 drives listed here, which I assume means you're plugging them all into the 24-port front backplane. Have you tried plugging just 12 into the rear backplane, to see if there's a difference? I suppose it's possible that there's something physically janky on the backplane... maybe a trace has been partially cut so you're getting excessive voltage drop, etc.

tobiasbp · Jun 6, 2017

tvsjr said:
It also seems very weird to me that you're losing one drive in every mirror... that you don't have one mirror that drops both drives, while another drops no drives.

I pulled out one drive from each mirror in order to see what happens when scrubbing 12 disks instead of 24.

tobiasbp · Jun 6, 2017

Scrubbing the degraded pool of 12 mirrors (now 12 disks) with only one of my PSUs connected threw no SCSI errors in the log. I have started another scrub to see if I can consistently scrub the pool with no SCSI errors while running only on the current PSU. If I can, I will reattach the other PSU and see if SCSI errors start appearing again.

I have ordered a new PSU. I would like to have a spare under any circumstances.

In the log, there was this entry:
isci: 1496775261:847504 ISCI Sending reset to device on controller 0 domain 0 CAM index 28
isci: 1496775261:848518 ISCI isci: bus=1 target=1c lun=0 cdb[0]=1c terminated

What is the implication of it? Looks to me like it concerns the "<LSI CORP SAS2X36 0717>".

Code:

camcontrol devlist

<ATA WDC WD6002FRYZ-0 1M02>		at scbus1 target 4 lun 0 (pass0,da0)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 5 lun 0 (pass1,da1)
<ATA WDC WD6002FRYZ-0 1M02>		at scbus1 target 6 lun 0 (pass2,da2)
<ATA WDC WD6002FFWX-6 0A83>		at scbus1 target 9 lun 0 (pass5,da5)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 11 lun 0 (pass7,da7)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 12 lun 0 (pass8,da8)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 14 lun 0 (pass10,da10)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 16 lun 0 (pass12,da12)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 18 lun 0 (pass14,da14)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 21 lun 0 (pass17,da17)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 23 lun 0 (pass19,da19)
<ATA WDC WD60EFRX-68M 0A82>		at scbus1 target 25 lun 0 (pass21,da21)
<LSI CORP SAS2X36 0717>			at scbus1 target 28 lun 0 (pass24,ses0)
<Kingston DataTraveler 2.0 PMAP>   at scbus9 target 0 lun 0 (pass25,da24)
<Kingston DataTraveler 2.0 PMAP>   at scbus10 target 0 lun 0 (pass26,da25)
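
Since the eventual culprit turned out to be one specific drive model, it helps to group the attached devices by their inquiry string. Below is a minimal Python sketch, assuming the `camcontrol devlist` line format shown above; the function name is hypothetical. The inquiry string truncates the model number, so "WD60EFRX-68M" here corresponds to the WD60EFRX-68MYMN1 prefix, and the full model would have to come from something like `smartctl -i` on each device.

```python
import re
from collections import defaultdict

# Matches lines like:
# <ATA WDC WD60EFRX-68M 0A82>   at scbus1 target 5 lun 0 (pass1,da1)
DEVLIST_RE = re.compile(
    r"<(?P<inquiry>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<names>[^)]*)\)"
)


def group_by_model(devlist_text):
    """Map each inquiry string to the peripheral names (da0, ses0, ...) using it."""
    groups = defaultdict(list)
    for match in DEVLIST_RE.finditer(devlist_text):
        names = match.group("names").split(",")
        # Drop the passN pass-through device, keep the da/ses name.
        peripherals = [name for name in names if not name.startswith("pass")]
        groups[match.group("inquiry").strip()].extend(peripherals)
    return dict(groups)


# A few lines copied from the camcontrol devlist output in this post.
sample_devlist = """\
<ATA WDC WD6002FRYZ-0 1M02>     at scbus1 target 4 lun 0 (pass0,da0)
<ATA WDC WD60EFRX-68M 0A82>     at scbus1 target 5 lun 0 (pass1,da1)
<ATA WDC WD6002FFWX-6 0A83>     at scbus1 target 9 lun 0 (pass5,da5)
<ATA WDC WD60EFRX-68M 0A82>     at scbus1 target 11 lun 0 (pass7,da7)
<LSI CORP SAS2X36 0717>         at scbus1 target 28 lun 0 (pass24,ses0)
"""

models = group_by_model(sample_devlist)
for inquiry, devices in models.items():
    print(inquiry, "->", ", ".join(devices))
```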

tobiasbp · Jun 8, 2017

The second scrub, with 12 disks and only one PSU connected, finished with a single SCSI error. No errors were reported in the pool.

I was hoping for no SCSI errors.

Code:

Jun  7 23:04:40 ultraman isci: 1496869480:224844 ISCI isci: bus=1 target=15 lun=0 cdb[0]=88 terminated
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 57 04 15 48 00 00 01 00 00 00
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): CAM status: SCSI Status Error
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): SCSI status: Check Condition
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): Retrying command (per sense data)
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 57 04 19 48 00 00 01 00 00 00
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): CAM status: CCB request terminated by the host
Jun  7 23:04:40 ultraman (da17:isci0:0:21:0): Retrying command

I have swapped PSUs. The one I ran on during the previous two scrubs has been taken out of the machine. I am now running on the other PSU (the one not attached during the two previous scrubs).

I will now scrub the degraded pool (12 disks) twice. This is a test of the second PSU which came with the machine.

Stux · Jun 8, 2017

Which leaves backplane.

You have another right? Can you put the 12 drives on the other backplane?

rs225 · Jun 8, 2017

Don't forget contacting the backplane maker. They may have a very good list of potential causes.

tobiasbp · Jun 9, 2017

Stux said:
Which leaves backplane.

You have another right? Can you put the 12 drives on the other backplane?

I have previously seen the errors on drives on both backplanes. Thus, I have moved all the drives to a single backplane trying to minimize the number of components involved.

tobiasbp · Jun 9, 2017

Scrubbing the pool with my 2nd PSU only, finished with no SCSI errors. This was the first scrub using only that PSU.

I will scrub again. If it finishes without errors, I feel it would indicate that the source of the SCSI errors is the 1st PSU (the one currently not connected).

tobiasbp · Jun 9, 2017

tvsjr said:
I did update my backplane firmware once upon a time chasing another issue, but it wasn't this issue.

How can you tell what firmware a backplane is running?

tobiasbp · Jun 10, 2017

2nd scrub of the degraded pool with only my 2nd PSU connected completed without SCSI errors. Looks like the 1st PSU could be the culprit. I will scrub again with the current configuration.

tobiasbp · Jun 11, 2017

3rd scrub of the degraded pool with only my 2nd PSU connected completed without SCSI errors. It feels like the cause of the problems was the 1st PSU.

I'll scrub again.
