Update with solution highlights:
The reason for the errors turned out to be WD Red drives of this type:
WD60EFRX-68L0BN1
This type of WD Red also throws (a few) errors, but only when drives of type WD60EFRX-68L0BN1 are in the zpool:
WD60EFRX-68MYMN1
Conclusion:
- Avoid drives of type WD60EFRX-68L0BN1.
- Problems may only occur on RAIDs with a large number of drives (I have 24).
- WD60EFRX-68L0BN1 (the problematic disk) is the newer version of WD60EFRX-68MYMN1. It seems that drive performance was degraded in the newer version.
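To check which drives in a pool belong to the affected type, something along these lines may help. It is only a sketch: it assumes the text of `smartctl -i` has already been captured for each drive, and the function names and the "avoid"/"watch"/"ok" labels are made up for illustration. The sample text is illustrative, not output from this system.

```python
import re

# Model strings taken from the conclusion above.
PROBLEM_MODEL = "WD60EFRX-68L0BN1"    # throws errors
SECONDARY_MODEL = "WD60EFRX-68MYMN1"  # only errors when the model above is in the pool


def model_from_smartctl(info_text):
    """Return the 'Device Model' value from `smartctl -i` text, or None if absent."""
    match = re.search(r"^Device Model:\s*(.+)$", info_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def classify(model):
    """Map a model string to a hypothetical label: 'avoid', 'watch', 'ok' or 'unknown'."""
    if model is None:
        return "unknown"
    if PROBLEM_MODEL in model:
        return "avoid"
    if SECONDARY_MODEL in model:
        return "watch"
    return "ok"


# Illustrative input only (not captured from the system in this post).
sample = "Device Model:     WDC WD60EFRX-68L0BN1\nUser Capacity:    (omitted)\n"
print(classify(model_from_smartctl(sample)))
```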
Original post:
I have recently set up a new system (My 4th FreeNAS) with 24 6TB WD drives in 12 mirrors. Specs are at the end of this post.
The problem:
During scrubbing, I always get errors (which are fixed by ZFS). I understand that this is not uncommon, but it happens on every back-to-back scrub I run on the pool.
Looking at dmesg, I see SCSI sense errors. The errors are not always the same, and not always on the same drives (see the errors at the end).
SMART data reveals nothing obvious to be wrong with the drives.
I have replaced drives, but the errors appear on random disks in the pool. The pool is a mix of older and brand-new drives, and drives of all ages are affected, including the replacements. I have a hard time believing that all my drives are bad, so I have started looking elsewhere.
Errors occur on drives on both backplanes (See specs).
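To see whether the errors really are spread across drives, the sense lines in dmesg can be tallied per device. This is a rough sketch, not a tested tool: it assumes the dmesg text has been saved to a string, and the function name and regular expression are my own, written against the log format shown at the end of this post.

```python
import re
from collections import Counter

# Matches lines such as:
# (da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
SENSE_RE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: ([A-Z ]+?) asc:")


def tally_sense_errors(dmesg_text):
    """Count SCSI sense errors per (device, sense key) pair."""
    counts = Counter()
    for device, sense_key in SENSE_RE.findall(dmesg_text):
        counts[(device, sense_key)] += 1
    return counts


# A few lines copied from the M1015 log in this post.
sample = """\
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mps0:0:19:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
"""

for (device, sense_key), count in sorted(tally_sense_errors(sample).items()):
    print(device, sense_key, count)
```

Running this over the output of successive scrubs would show whether the same devices keep turning up or whether the errors move around.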
I have done the following without resolving the problem:
- Changed SATA cables
- Changed disks
- Ran memtest on the RAM
- Updated firmware on the IBM ServeRAID M1015 (See specs)
- Moved/reseated the M1015
- Used a RocketRAID HBA instead of the M1015.
- Used the SATA connectors on the motherboard instead of an HBA.
This is what I am considering trying:
- Upgrade firmware on backplanes (Don't know how)
- Change motherboard
- Only run on a single power supply to see if one is bad.
- Upgrade to FreeNAS 11 (Hoping for a software issue in the OS as the cause)
I have seen similar threads on the forum, but no solution seems to have been identified.
In general: How (un)acceptable is it for SCSI errors to occur?
Thank you for your time,
Tobias
Example SCSI errors
With IBM M1015 HBA
Code:
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 44 58 00 00 d0 00 length 106496 SMID 367 terminated ioc 804b scsi 0 state 0 xfer 0
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 44 58 00 00 d0 00
(da17:mps0:0:33:0): CAM status: CCB request completed with an error
(da17:mps0:0:33:0): Retrying command
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7c 3e 43 c0 00 00 98 00
(da17:mps0:0:33:0): CAM status: SCSI Status Error
(da17:mps0:0:33:0): SCSI status: Check Condition
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): Info: 0x7c3e43c0
(da17:mps0:0:33:0): Error 5, Unretryable error
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 ce 40 00 00 b8 00 length 94208 SMID 899 terminated ioc 804b scsi 0 state 0 xfer 0
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 ce 40 00 00 b8 00
(da17:mps0:0:33:0): CAM status: CCB request completed with an error
(da17:mps0:0:33:0): Retrying command
(da17:mps0:0:33:0): READ(10). CDB: 28 00 7f 09 cd 88 00 00 b8 00
(da17:mps0:0:33:0): CAM status: SCSI Status Error
(da17:mps0:0:33:0): SCSI status: Check Condition
(da17:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da17:mps0:0:33:0): Info: 0x7f09cd88
(da17:mps0:0:33:0): Error 5, Unretryable error
(da4:mps0:0:19:0): WRITE(10). CDB: 2a 00 01 a9 72 98 00 00 20 00
(da4:mps0:0:19:0): CAM status: SCSI Status Error
(da4:mps0:0:19:0): SCSI status: Check Condition
(da4:mps0:0:19:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da4:mps0:0:19:0): Info: 0x1a97298
(da4:mps0:0:19:0): Error 22, Unretryable error
(da10:mps0:0:25:0): WRITE(10). CDB: 2a 00 0d 2a 0c c0 00 00 08 00
(da10:mps0:0:25:0): CAM status: SCSI Status Error
(da10:mps0:0:25:0): SCSI status: Check Condition
(da10:mps0:0:25:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da10:mps0:0:25:0): Info: 0xd2a0cc0
(da10:mps0:0:25:0): Error 22, Unretryable error
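The CDB bytes in these lines can be decoded to see which block address and how many blocks each failed command covered. Below is a minimal sketch based on the standard SCSI READ/WRITE(10) and READ/WRITE(16) layouts; the function name is made up for illustration, and only these four opcodes are handled.

```python
def decode_rw_cdb(cdb_hex):
    """Decode a READ/WRITE(10) or READ/WRITE(16) CDB given as a hex string.

    Returns (lba, blocks). Raises ValueError for any other opcode.
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between the hex bytes are accepted
    opcode = cdb[0]
    if opcode in (0x28, 0x2A):  # READ(10), WRITE(10)
        lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: 32-bit LBA
        blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    elif opcode in (0x88, 0x8A):  # READ(16), WRITE(16)
        lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: 64-bit LBA
        blocks = int.from_bytes(cdb[10:14], "big") # bytes 10-13: transfer length
    else:
        raise ValueError(f"unsupported opcode {opcode:#04x}")
    return lba, blocks


# First READ(10) from the log above; the log reports "length 106496" for it.
lba, blocks = decode_rw_cdb("28 00 7c 3e 44 58 00 00 d0 00")
print(hex(lba), blocks, blocks * 512)
```

For the first command this gives 208 blocks, and 208 × 512 bytes is the 106496 reported in the log, so the drives are being addressed in 512-byte logical blocks here. The retried command's LBA (0x7c3e43c0) matches the `Info:` field of the MEDIUM ERROR that follows it.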
With RocketRAID HBA
So, these are not SCSI errors. What is the implication of that?
hpt27xx0: <odin> mem 0xdfb40000-0xdfb5ffff,0xdfb00000-0xdfb3ffff irq 32 at device 0.0 on pci3
Code:
interrupt storm detected on "irq32:"; throttling interrupt source
interrupt storm detected on "irq32:"; throttling interrupt source
interrupt storm detected on "irq32:"; throttling interrupt source
hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0x8c382950,LBA[4-7]=0x1.
hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xcafaba8,LBA[4-7]=0x0.
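On the question of what these are if not SCSI errors: a "task file error" looks like the raw ATA status and error registers rather than SCSI sense data. The sketch below decodes them using the conventional ATA bit names. I am assuming that `StatusReg` and `ErrReg` in the hpt27xx message are those standard registers. Only the bits whose meaning is stable across ATA revisions are named; anything else is printed as a raw bit value.

```python
# Conventional ATA status register bits (subset).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
# Conventional ATA error register bits (subset).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_bits(value, names):
    """Return the names of the bits set in an 8-bit register value, high bit first."""
    flags = []
    for shift in range(7, -1, -1):
        bit = 1 << shift
        if value & bit:
            flags.append(names.get(bit, f"bit{bit:#04x}"))
    return flags


# Values from the hpt27xx lines above.
print(decode_bits(0x41, STATUS_BITS))  # status register
print(decode_bits(0x04, ERROR_BITS))   # error register
```

Under that assumption, 0x41 is DRDY plus ERR and 0x4 is ABRT: the drive was ready, flagged an error, and aborted the command. That would be the ATA-level view of the same kind of failure the other controllers report as SCSI sense data, not a different problem.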
With onboard SATA
Code:
isci: 1496151335:247708 ISCI Sending reset to device on controller 0 domain 0 CAM index 17
isci: 1496151335:248864 ISCI isci: bus=1 target=11 lun=0 cdb[0]=35 terminated
(da13:isci0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:isci0:0:17:0): CAM status: CCB request terminated by the host
(da13:isci0:0:17:0): Retrying command
isci: 1496183958:044159 ISCI isci: bus=1 target=1b lun=0 cdb[0]=28 terminated
(da23:isci0:0:27:0): READ(10). CDB: 28 00 e4 54 28 70 00 01 00 00
(da23:isci0:0:27:0): CAM status: SCSI Status Error
(da23:isci0:0:27:0): SCSI status: Check Condition
(da23:isci0:0:27:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da23:isci0:0:27:0): Info: 0xe4542870
(da23:isci0:0:27:0): Retrying command (per sense data)
(da23:isci0:0:27:0): READ(10). CDB: 28 00 e4 54 27 70 00 01 00 00
(da23:isci0:0:27:0): CAM status: CCB request terminated by the host
(da23:isci0:0:27:0): Retrying command
isci: 1496198442:738629 ISCI isci: bus=1 target=f lun=0 cdb[0]=88 terminated
(da11:isci0:0:15:0): READ(16). CDB: 88 00 00 00 00 01 2a 0c 10 c8 00 00 00 c8 00 00
(da11:isci0:0:15:0): CAM status: SCSI Status Error
(da11:isci0:0:15:0): SCSI status: Check Condition
(da11:isci0:0:15:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da11:isci0:0:15:0): Retrying command (per sense data)
(da11:isci0:0:15:0): READ(16). CDB: 88 00 00 00 00 01 2a 0c 12 90 00 00 00 e0 00 00
(da11:isci0:0:15:0): CAM status: CCB request terminated by the host
(da11:isci0:0:15:0): Retrying command
isci: 1496198442:993016 ISCI isci: bus=1 target=18 lun=0 cdb[0]=88 terminated
(da20:isci0:0:24:0): READ(16). CDB: 88 00 00 00 00 01 2c 6f 84 18 00 00 01 00 00 00
(da20:isci0:0:24:0): CAM status: SCSI Status Error
(da20:isci0:0:24:0): SCSI status: Check Condition
(da20:isci0:0:24:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da20:isci0:0:24:0): Retrying command (per sense data)
(da20:isci0:0:24:0): READ(16). CDB: 88 00 00 00 00 01 2c 6f 80 40 00 00 01 00 00 00
(da20:isci0:0:24:0): CAM status: CCB request terminated by the host
(da20:isci0:0:24:0): Retrying command
isci: 1496220206:601979 ISCI isci: bus=1 target=15 lun=0 cdb[0]=88 terminated
(da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 5f c1 70 78 00 00 01 00 00 00
(da17:isci0:0:21:0): CAM status: SCSI Status Error
(da17:isci0:0:21:0): SCSI status: Check Condition
(da17:isci0:0:21:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da17:isci0:0:21:0): Retrying command (per sense data)
(da17:isci0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 5f c1 71 78 00 00 00 d0 00 00
(da17:isci0:0:21:0): CAM status: CCB request terminated by the host
(da17:isci0:0:21:0): Retrying command
(da12:isci0:0:16:0): WRITE(10). CDB: 2a 00 29 71 99 a0 00 00 08 00
(da12:isci0:0:16:0): CAM status: SCSI Status Error
(da12:isci0:0:16:0): SCSI status: Check Condition
(da12:isci0:0:16:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da12:isci0:0:16:0): Error 22, Unretryable error
Tech specs
Code:
Motherboard: Supermicro X9DRi-LN4F+ https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRi-LN4F_.cfm
CPU: 2x E5-2650 (octa-core)
RAM: 128GB ECC RAM
Chassis: Supermicro SuperChassis 847E16-R1400LPB http://www.supermicro.com.tw/products/chassis/4U/847/SC847E16-R1400LPB
Power: Dual 1400W
SATA backplanes: SAS2-826EL1 SAS2-846EL1
HBA: IBM ServeRAID M1015
mps0: Firmware: 20.00.07.00, Driver: 21.01.00.00-fbsd
Storage: 24 6TB drives in 12 mirrors
OS: FreeNAS-9.10.2-U4 (27ae72978)