Unstable Drive Connections (Lots of I/O Errors)

Joined
May 5, 2021
Messages
5
We've just received some iXsystems servers (not TrueNAS Enterprise hardware) that we're loading TrueNAS/FreeNAS Core onto. This is for a lab environment, so it isn't really necessary to splurge on the high-end gear yet. However, we're having some problems. I'm really skeptical that it's hardware related, because why would iXsystems sell gear that won't work out of the box with TrueNAS (even if it's not TrueNAS Enterprise gear)?

Controller Hardware Specs (iX-2224R-E1CR24L-IXN):
Motherboard: Supermicro X11DPH-T
CPU: 2x Intel Gold 6226R
RAM: 256GB
HBA: 3x 9305-16e w/16.00.12.00 firmware

Expander Specs (iXC-4072DJ-IXN):
3x - 72 bay 2.5" SAS slots filled with Samsung 850 EVO 2TB SATA SSDs

We've arranged the cabling such that each enclosure is connected via 2x SFF-8644 connections (1 meter/3.3 feet cables) to each HBA. We aren't interested in multi-pathing and the 9305-16e controllers have two individual SAS controller cores, so if you connect too many cables it causes standard SATA drives to show up via multiple paths and can cause some strange behavior (not good).

What we're running into is random I/O errors. It doesn't seem to have any rhyme or reason. It affects all 3 HBAs and drives in all 3 enclosures that previously were working flawlessly. When the I/O errors show up the ZFS pool starts faulting drives and eventually takes the array offline. The drives don't even have to be in an array to get I/O errors or even under any load. Sometimes just a S.M.A.R.T. check is enough to cause an error, then FreeNAS/TrueNAS marks it as not capable of S.M.A.R.T.

Here's what we've tried:
TrueNAS 12.2
TrueNAS 11.1
TrueNAS 11.0 (Kernel Panics booting the installer)
Changing Cables
Changing arrangement of cabling
Changing Drives

The ONLY thing that comes to mind is we got a bad batch of cables from our Amazon seller?

I've attached a debug - it's a brand new install and there's nothing on here of any security or importance.

Here's some of the errors:
(da52:mpr2:0:799:0): CAM status: SCSI Status Error
(da52:mpr2:0:799:0): SCSI status: Check Condition
(da52:mpr2:0:799:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da52:mpr2:0:799:0): Retrying command (per sense data)
(da72:mpr2:0:819:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00 length 114688 SMID 870 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da72:mpr2:0:819:0): READ(10). CDB: 28 00 e8 e0 84 20 00 00 e0 00 length 114688 SMID 796 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da72:mpr2:0:819:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da72:mpr2:0:819:0): READ(6). CDB: 08 00 02 20 e0 00 length 114688 SMID 803 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 105480
(da72:mpr2:0:819:0): CAM status: CCB request completed with an error
(da72:mpr2:0:819:0): Retrying command
(da72:mpr2:0:819:0): READ(10). CDB: 28 00 e8 e0 84 20 00 00 e0 00
(da72:mpr2:0:819:0): CAM status: CCB request completed with an error
(da72:mpr2:0:819:0): Retrying command
(da72:mpr2:0:819:0): READ(6). CDB: 08 00 02 20 e0 00
(da72:mpr2:0:819:0): CAM status: CCB request completed with an error
(da72:mpr2:0:819:0): Retrying command
(da72:mpr2:0:819:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da72:mpr2:0:819:0): CAM status: SCSI Status Error
(da72:mpr2:0:819:0): SCSI status: Check Condition
(da72:mpr2:0:819:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da72:mpr2:0:819:0): Retrying command (per sense data)
(da106:mpr2:0:977:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00 length 114688 SMID 962 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da106:mpr2:0:977:0): READ(10). CDB: 28 00 e8 e0 84 20 00 00 e0 00 length 114688 SMID 960 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 80904
(da106:mpr2:0:977:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da106:mpr2:0:977:0): CAM status: CCB request completed with an error
(da106:mpr2:0:977:0): Retrying command
(da106:mpr2:0:977:0): READ(10). CDB: 28 00 e8 e0 84 20 00 00 e0 00
(da106:mpr2:0:977:0): CAM status: CCB request completed with an error
(da106:mpr2:0:977:0): Retrying command
(da106:mpr2:0:977:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da106:mpr2:0:977:0): CAM status: SCSI Status Error
(da106:mpr2:0:977:0): SCSI status: Check Condition
(da106:mpr2:0:977:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da106:mpr2:0:977:0): Retrying command (per sense data)
(da108:mpr2:0:979:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00 length 114688 SMID 287 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 39944
(da108:mpr2:0:979:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da108:mpr2:0:979:0): CAM status: CCB request completed with an error
(da108:mpr2:0:979:0): Retrying command
(da108:mpr2:0:979:0): READ(10). CDB: 28 00 e8 e0 86 20 00 00 e0 00
(da108:mpr2:0:979:0): CAM status: SCSI Status Error
(da108:mpr2:0:979:0): SCSI status: Check Condition
(da108:mpr2:0:979:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da108:mpr2:0:979:0): Retrying command (per sense data)
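
For anyone reading along, the CDB bytes in these lines say which blocks each failed read was asking for. Below is a minimal Python sketch that decodes the READ(6) and READ(10) CDBs as they appear in the log. The function name `decode_read_cdb` is made up for illustration, and the 512-byte logical block size is an assumption (it does match the "length" values printed by the driver).

```python
def decode_read_cdb(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a READ(6) or READ(10) CDB given as space-separated hex bytes.

    block_size is an assumed logical sector size, not read from the drive.
    """
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode == 0x28 and len(cdb) == 10:
        # READ(10): 32-bit LBA in bytes 2-5, 16-bit block count in bytes 7-8
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
    elif opcode == 0x08 and len(cdb) == 6:
        # READ(6): 21-bit LBA in bytes 1-3, 8-bit block count in byte 4
        lba = int.from_bytes(cdb[1:4], "big") & 0x1FFFFF
        blocks = cdb[4] or 256  # a count of 0 means 256 blocks in READ(6)
    else:
        raise ValueError(f"not a READ(6)/READ(10) CDB: {cdb_hex!r}")
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


if __name__ == "__main__":
    for cdb in ("28 00 e8 e0 86 20 00 00 e0 00", "08 00 02 20 e0 00"):
        print(cdb, "->", decode_read_cdb(cdb))
```

Under that assumption, `28 00 e8 e0 86 20 00 00 e0 00` is a 224-block (114688-byte) read at LBA 3907028512, which is about 2.0 TB into the disk, so very close to the end of a 2 TB drive. Reads that close to the end of the disk may be label or partition-table probing rather than pool traffic, but that is a guess from the addresses alone.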
 

Attachments

  • debug-freenas-20210505162446.tgz
    231.7 KB

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
TrueNAS 12.2
TrueNAS 11.1
TrueNAS 11.0 (Kernel Panics booting the installer)
Something seems odd here...

You mentioned that you're using CORE and FreeNAS, so 11.0 and 11.1 should be FreeNAS. Why wouldn't you have tried 11.3? (the current/latest FreeNAS)

Also, 12.0-U2 exists, but not 12.2 (and anyway, 12.0-U3.1 is out, so you should be using that).

Is it always the same drives, or always the same controller? (I see only mpr2 listed against the problems in the output you attached.)
 
Joined
May 5, 2021
Messages
5
Sorry, it had been a long day when I typed this up. The first version we tried was TrueNAS Core 12.0U2. We then tried FreeNAS 11.1U7 because I found FreeBSD Bug Report #224496, and when that didn't work we tried 11.0U4. These were tried to rule out a software problem in a specific version.

To answer your question: no, it's not always the same drives. A lot of them show up on mpr2, but as soon as we unplug that chassis we see errors on mpr3... I'll try to capture more errors today.
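
One way to check whether the errors cluster on a particular controller or drive is to tally the CAM error lines by device path. This is only a rough sketch: the function name `tally_errors` is hypothetical, and it counts only lines containing the "CCB request completed with an error" message seen in the excerpts above.

```python
import re
from collections import Counter

# Matches the CAM path prefix, e.g. "(da72:mpr2:0:819:0):"
PATH_RE = re.compile(r"\((da\d+):(mpr\d+):\d+:\d+:\d+\)")
ERROR_TEXT = "CAM status: CCB request completed with an error"


def tally_errors(log_text: str):
    """Count CAM error lines per controller (mprN) and per device (daN)."""
    per_controller = Counter()
    per_device = Counter()
    for line in log_text.splitlines():
        if ERROR_TEXT not in line:
            continue
        match = PATH_RE.search(line)
        if match:
            device, controller = match.groups()
            per_controller[controller] += 1
            per_device[device] += 1
    return per_controller, per_device


if __name__ == "__main__":
    sample = """\
(da72:mpr2:0:819:0): CAM status: CCB request completed with an error
(da72:mpr2:0:819:0): Retrying command
(da106:mpr2:0:977:0): CAM status: CCB request completed with an error
(da15:mpr1:0:79:0): CAM status: CCB request completed with an error
(da15:mpr1:0:79:0): CAM status: SCSI Status Error
"""
    controllers, devices = tally_errors(sample)
    print(controllers.most_common())
    print(devices.most_common())
```

Feeding it the contents of the system log (for example `/var/log/messages` on FreeBSD) would show at a glance whether the errors follow one HBA, one shelf's worth of drives, or are spread evenly.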
 
Joined
May 5, 2021
Messages
5
Sorry to double post, but here's a couple more from the system just sitting idle...

(da15:mpr1:0:79:0): READ(10). CDB: 28 00 00 00 00 00 00 01 00 00 length 131072 SMID 947 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 89096
(da15:mpr1:0:79:0): READ(10). CDB: 28 00 00 00 00 00 00 01 00 00
(da15:mpr1:0:79:0): CAM status: CCB request completed with an error
(da15:mpr1:0:79:0): Retrying command
(da15:mpr1:0:79:0): READ(10). CDB: 28 00 00 00 00 00 00 01 00 00
(da15:mpr1:0:79:0): CAM status: SCSI Status Error
(da15:mpr1:0:79:0): SCSI status: Check Condition
(da15:mpr1:0:79:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da15:mpr1:0:79:0): Retrying command (per sense data)
(da38:mpr1:0:121:0): READ(10). CDB: 28 00 e8 e0 87 00 00 01 00 00 length 131072 SMID 344 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 89096
(da38:mpr1:0:121:0): READ(10). CDB: 28 00 e8 e0 87 00 00 01 00 00
(da38:mpr1:0:121:0): CAM status: CCB request completed with an error
(da38:mpr1:0:121:0): Retrying command
(da38:mpr1:0:121:0): READ(10). CDB: 28 00 e8 e0 87 00 00 01 00 00
(da38:mpr1:0:121:0): CAM status: SCSI Status Error
(da38:mpr1:0:121:0): SCSI status: Check Condition
(da38:mpr1:0:121:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da38:mpr1:0:121:0): Retrying command (per sense data)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Can you run with only one card and see if the errors can be produced?

I'm wondering if you've got issues with contention for PCIe Lanes based on the slots where they are installed.
 
Joined
May 5, 2021
Messages
5
You may be on to something. We went through shelf by shelf, plugging and unplugging - everything is normal till shelf #3 gets added. So either that card, cables, or shelf is bad and is affecting all of the system when online. We've added all of the drives from the other two shelves, built a giant array and we're beating it up and getting no errors.

So, I lied. It's behaving better... still getting some errors but not nearly as many. One of the strangest things was when we unplugged one of the expansion chassis SAS cables the system hard reset, then it hard reset again when we plugged the SAS cable back in.

We're going to try your suggestion now. We're also now on TrueNAS Core 12.0U3.

Okay, we tried this. We removed the two extra HBAs, consolidated down to one, and connected all 3 enclosures. Still getting errors:

(da160:mpr1:0:980:0): CAM status: CCB request completed with an error
(da160:mpr1:0:980:0): Retrying command, 3 more tries remain
(da160:mpr1:0:980:0): READ(10). CDB: 28 00 01 18 35 10 00 00 08 00
(da160:mpr1:0:980:0): CAM status: CCB request completed with an error
(da160:mpr1:0:980:0): Retrying command, 3 more tries remain
(da160:mpr1:0:980:0): READ(10). CDB: 28 00 01 18 35 08 00 00 08 00
(da160:mpr1:0:980:0): CAM status: CCB request completed with an error
(da160:mpr1:0:980:0): Retrying command, 3 more tries remain
(da170:mpr1:0:990:0): CAM status: CCB request completed with an error
(da170:mpr1:0:990:0): Retrying command, 3 more tries remain
(da170:mpr1:0:990:0): READ(10). CDB: 28 00 01 18 4b a0 00 00 08 00
(da170:mpr1:0:990:0): CAM status: CCB request completed with an error
(da170:mpr1:0:990:0): Retrying command, 3 more tries remain
(da170:mpr1:0:990:0): READ(10). CDB: 28 00 01 18 4b 78 00 00 28 00
(da170:mpr1:0:990:0): CAM status: CCB request completed with an error
(da170:mpr1:0:990:0): Retrying command, 3 more tries remain
(da170:mpr1:0:990:0): READ(10). CDB: 28 00 01 18 4b a0 00 00 08 00
(da160:mpr1:0:980:0): READ(10). CDB: 28 00 01 18 35 10 00 00 08 00
(da170:mpr1:0:990:0): CAM status: SCSI Status Error
(da170:mpr1:0:990:0): SCSI status: Check Condition
(da170:mpr1:0:990:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da170:mpr1:0:990:0): Retrying command (per sense data)
(da160:mpr1:0:980:0): CAM status: SCSI Status Error
(da160:mpr1:0:980:0): SCSI status: Check Condition
(da160:mpr1:0:980:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da160:mpr1:0:980:0): Retrying command (per sense data)
mpr1: Controller reported scsi ioc terminated tgt 938 SMID 1041 loginfo 31110e03
(da137:mpr1:0:938:0): READ(10). CDB: 28 00 01 2d f3 08 00 00 18 00
mpr1: Controller reported scsi ioc terminated tgt 938 SMID 1065 loginfo 31110e03
mpr1: Controller reported scsi ioc terminated tgt 938 SMID 523 loginfo 31110e03
(da137:mpr1:0:938:0): CAM status: CCB request completed with an error
(da137:mpr1:0:938:0): Retrying command, 3 more tries remain
(da137:mpr1:0:938:0): READ(10). CDB: 28 00 01 2d f3 38 00 00 20 00
(da137:mpr1:0:938:0): CAM status: CCB request completed with an error
(da137:mpr1:0:938:0): Retrying command, 3 more tries remain
(da137:mpr1:0:938:0): READ(10). CDB: 28 00 01 2d f3 20 00 00 18 00
(da137:mpr1:0:938:0): CAM status: CCB request completed with an error
(da137:mpr1:0:938:0): Retrying command, 3 more tries remain
(da137:mpr1:0:938:0): READ(10). CDB: 28 00 01 2d f3 38 00 00 20 00
(da137:mpr1:0:938:0): CAM status: SCSI Status Error
(da137:mpr1:0:938:0): SCSI status: Check Condition
(da137:mpr1:0:938:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da137:mpr1:0:938:0): Retrying command (per sense data)
mpr1: Controller reported scsi ioc terminated tgt 985 SMID 981 loginfo 3112010a
(da165:mpr1:0:985:0): READ(10). CDB: 28 00 01 56 13 c0 00 00 20 00
mpr1: Controller reported scsi ioc terminated tgt 985 SMID 777 loginfo 3112010a
mpr1: Controller reported scsi ioc terminated tgt 985 SMID 1533 loginfo 3112010a
(da165:mpr1:0:985:0): CAM status: CCB request completed with an error
(da165:mpr1:0:985:0): Retrying command, 3 more tries remain
(da165:mpr1:0:985:0): READ(10). CDB: 28 00 01 56 13 90 00 00 30 00
(da165:mpr1:0:985:0): CAM status: CCB request completed with an error
(da165:mpr1:0:985:0): Retrying command, 3 more tries remain
(da165:mpr1:0:985:0): READ(10). CDB: 28 00 01 56 13 80 00 00 10 00
(da165:mpr1:0:985:0): CAM status: CCB request completed with an error
(da165:mpr1:0:985:0): Retrying command, 3 more tries remain
(da165:mpr1:0:985:0): READ(10). CDB: 28 00 01 56 13 c0 00 00 20 00
(da165:mpr1:0:985:0): CAM status: SCSI Status Error
(da165:mpr1:0:985:0): SCSI status: Check Condition
(da165:mpr1:0:985:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da165:mpr1:0:985:0): Retrying command (per sense data)
mpr1: Controller reported scsi ioc terminated tgt 805 SMID 416 loginfo 31110e03
(da109:mpr1:0:805:0): READ(10). CDB: 28 00 01 31 e8 30 00 00 40 00
mpr1: Controller reported scsi ioc terminated tgt 805 SMID 1923 loginfo 31110e03
mpr1: Controller reported scsi ioc terminated tgt 805 SMID 1178 loginfo 31110e03
(da109:mpr1:0:805:0): CAM status: CCB request completed with an error
(da109:mpr1:0:805:0): Retrying command, 3 more tries remain
(da109:mpr1:0:805:0): READ(10). CDB: 28 00 01 31 e8 78 00 00 08 00
(da109:mpr1:0:805:0): CAM status: CCB request completed with an error
(da109:mpr1:0:805:0): Retrying command, 3 more tries remain
(da109:mpr1:0:805:0): READ(10). CDB: 28 00 01 31 e8 70 00 00 08 00
(da109:mpr1:0:805:0): CAM status: CCB request completed with an error
(da109:mpr1:0:805:0): Retrying command, 3 more tries remain
(da109:mpr1:0:805:0): READ(10). CDB: 28 00 01 31 e8 78 00 00 08 00
(da109:mpr1:0:805:0): CAM status: SCSI Status Error
(da109:mpr1:0:805:0): SCSI status: Check Condition
(da109:mpr1:0:805:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da109:mpr1:0:805:0): Retrying command (per sense data)
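
The `ioc 804b` and `loginfo` values in these lines are packed bit fields, and splitting them makes it easier to compare them against vendor documentation or driver source. The sketch below assumes the field layout used in the LSI/Broadcom MPI headers as they appear in the Linux mpt3sas driver (a 4-bit bus type, 4-bit originator, 8-bit code and 16-bit subcode for loginfo, and a 0x8000 "log info available" flag on the IOC status). The function names are made up for illustration, and the sketch deliberately does not try to name what each code means.

```python
IOCSTATUS_FLAG_LOG_INFO_AVAILABLE = 0x8000  # assumed from the MPI2 headers
IOCSTATUS_MASK = 0x7FFF                     # assumed from the MPI2 headers


def decode_ioc_status(ioc: int) -> dict:
    """Split an IOC status word into its status value and log-info flag."""
    return {
        "status": ioc & IOCSTATUS_MASK,
        "log_info_available": bool(ioc & IOCSTATUS_FLAG_LOG_INFO_AVAILABLE),
    }


def decode_loginfo(loginfo: int) -> dict:
    """Split a 32-bit loginfo word into its four fields (assumed layout)."""
    return {
        "bus_type": (loginfo >> 28) & 0xF,
        "originator": (loginfo >> 24) & 0xF,
        "code": (loginfo >> 16) & 0xFF,
        "subcode": loginfo & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_ioc_status(0x804B))
    for value in (0x31110E03, 0x3112010A):
        fields = decode_loginfo(value)
        print(hex(value), {k: hex(v) for k, v in fields.items()})
```

With that layout, `804b` is status 0x4b with the log-info flag set, which lines up with the "scsi ioc terminated" wording in the log. Both loginfo values have bus type 3 and originator 1 (SAS and the protocol layer, if the mpt3sas interpretation applies here) and differ only in code and subcode (0x11/0x0e03 versus 0x12/0x010a).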
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Okay we tried this. We removed the two extra HBAs, consolidated down to one, and connected all 3 enclosures. Still getting errors
OK, so that rules out problems with cards competing for lanes with each other. Maybe try that one card in different slots.
 