Errors when using both SAS connectors on the same HBA card for front and rear backplanes

tim7415

Cadet
Joined
Feb 22, 2017
Messages
7
Hello,

I'm building a new FreeNAS server with the following hardware (from eBay):

CSE-847E16-R1K28LPB - chassis (36 hdd bays)
BPN-SAS2-846EL1 - front backplane
BPN-SAS2-826EL1 - rear backplane
X9DRH-iF - mobo
128 GB ECC RAM
LSI 9211-8i
PWS-1K28P-SQ - power supplies

When I connect both backplanes to the two mini-SAS connectors on the LSI 9211-8i I get the following errors from the backplane connected to the second connector:

SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

I get this error when running "badblocks -ws" on a single HDD connected to the backplane. I also see a brief freeze up of writes, but then it continues without errors. I'm running FreeNAS 9.10.2 U1. I get the error regardless of whether I'm only using one of the backplanes or both at the same time.

Things I have tried so far:

1. Swapping the SAS cables for the front and rear backplanes makes the error show up on the other backplane. It seems the connector on top always works and the one on the bottom is giving me errors.

2. Tried four different PCI-E 8x slots with the same results.

3. Tried another LSI 9211-8i card with the same results.

4. Tried firmware P19 and P20(7) with the same results.

5. Tried an M1015 crossflashed to 9211 IT mode P20(7) with the same results.

6. I tried using different HDD bays on both backplanes and different HDDs with the same results. Smart data looks clean on both HDDs.

7. Now I'm running two LSI 9211-8i cards with the backplanes connected to the first mini-SAS connector of each card, and it has been running without any errors.

Given that both backplanes work when plugged in to the top port or to different cards, I assume the backplanes are fine. I did run memtest for a few days with no errors, so memory shouldn't be an issue. I tried 3 different HBA cards, so I'm guessing the fault is not there. I have 2 redundant power supplies with plenty of power, so they are probably not the root cause. I am running out of ideas. I don't have another mobo to swap in and try.

Should a single LSI 9211-8i be able to support both backplanes? Are there any compatibility issues between the LSI and my mobo? I couldn't find anything online. Do you think it is safe to proceed with using two separate LSI cards - one for each backplane? I sort of don't trust it much, because it may just be making the problem less likely to occur, but not fixing it.

Thanks!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Should a single LSI 9211-8i be able to support both backplanes?
Yes, of course.

My first guess is that the backplanes might benefit from a firmware update. Update them both to whatever is latest and try again.
 

tim7415

Cadet
Joined
Feb 22, 2017
Messages
7
I updated both backplanes to 55.14.18.0 (from 55.14.11.0) and that didn't help. Still getting errors from the backplane on the second mini-SAS connector of the 9211. In fact, now when both backplanes are in use it seems there are more reset errors than before, but that is only a subjective observation.

This is the complete error message:
(da1:mps0:0:10:0): WRITE(10). CDB: 2a 00 01 cf 23 00 00 01 00 00
(da1:mps0:0:10:0): CAM status: CCB request completed with an error
(da1:mps0:0:10:0): Retrying command
(da1:mps0:0:10:0): WRITE(10). CDB: 2a 00 01 cf 23 00 00 01 00 00
(da1:mps0:0:10:0): CAM status: SCSI Status Error
(da1:mps0:0:10:0): SCSI status: Check Condition
(da1:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:10:0): Retrying command (per sense data)

Errors are always from da1, which is connected to the rear backplane, which is on the second 9211 mini-SAS.
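
For reference, the CDB bytes in the log above can be decoded by hand to see which write was interrupted. This is a minimal Python sketch, assuming the standard 10-byte WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def decode_write10(cdb_hex: str) -> dict:
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": cdb[0],                           # 0x2a is WRITE(10)
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # transfer length in blocks
    }


# CDB copied from the error message above.
info = decode_write10("2a 00 01 cf 23 00 00 01 00 00")
print(info)
```

For the CDB in the log this gives LBA 30352128 and a length of 256 blocks, i.e. a 128 KiB write with 512-byte sectors. The retried command carries the same CDB, so it is the same write being reissued after the reset.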

Here are all the devices, in case something obviously wrong pops out from the logs:

mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf600000-0xdf603fff,0xdf580000-0xdf5bffff irq 32 at device 0.0 on pci2
mps0: Firmware: 20.00.07.00, Driver: 21.01.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
ses0 at mps0 bus 0 scbus0 target 9 lun 0
ses0: <LSI SAS2X36 0e12> Fixed Enclosure Services SPC-3 SCSI device
ses0: Serial Number
ses0: 600.000MB/s transfers
ses0: Command Queueing enabled
ses0: SCSI-3 ENC Device
ses1 at mps0 bus 0 scbus0 target 11 lun 0
ses1: <LSI SAS2X28 0e12> Fixed Enclosure Services SPC-3 SCSI device
ses1: Serial Number
ses1: 600.000MB/s transfers
ses1: Command Queueing enabled
ses1: SCSI-3 ENC Device
da0 at mps0 bus 0 scbus0 target 8 lun 0
da0: <ATA Hitachi HUS72403 A5F0> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number XXXX
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 2861588MB (5860533168 512 byte sectors)
da1 at mps0 bus 0 scbus0 target 10 lun 0
da1: <ATA Hitachi HUS72403 A5F0> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number XXXX
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 2861588MB (5860533168 512 byte sectors)
ses1: da1,pass2: Element descriptor: 'Slot 12'
ses1: da1,pass2: SAS Device Slot Element: 1 Phys at Slot 11
ses1: phy 0: SATA device
ses1: phy 0: parent 50030480002xxxxx addr 50030480002xxxxx
ses0: da0,pass0: Element descriptor: 'Slot 18'
ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 17
ses0: phy 0: SATA device
ses0: phy 0: parent 50030480002xxxxx addr 50030480002xxxxx
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What's the output of sas2flash -listall?
 

tim7415

Cadet
Joined
Feb 22, 2017
Messages
7
Code:
[root@freenas ~]# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:02:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.


By the way, I ran some tests under Windows and got similar results. The tests worked fine but I get errors in the event log when using the backplane attached to the second mini-SAS connector. I guess that means it must be a hardware issue.
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
I'm totally pulling this out of thin air, but try re-flashing an HBA without the BIOS.
 

tim7415

Cadet
Joined
Feb 22, 2017
Messages
7
I tried flashing without the BIOS and it didn't make any difference.

I noticed that if I use both ports on the LSI - e.g. by running badblocks on HDDs on both ports - then the errors from the second port are much more frequent. If I only use the second port, then errors are not as frequent. And, of course, there are no errors when using the first port only. I'm thinking that if this were a power issue, then I would see the errors consistently on all ports and/or on a single backplane. But I only see the errors on the backplane connected to the second HBA port.
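
To turn "more frequent" into a number, the reset events could be counted per device from the kernel log. The following is a rough Python sketch, assuming the log lines have the `(daN:mpsN:bus:target:lun): message` shape shown in the error message earlier in the thread. The helper name `count_unit_attention` and the sample text are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as "(da1:mps0:0:10:0): SCSI sense: UNIT ATTENTION ..."
LINE_RE = re.compile(r"^\((?P<dev>\w+):(?P<ctrl>\w+):\d+:\d+:\d+\): (?P<msg>.*)$")


def count_unit_attention(log_text: str) -> Counter:
    """Count UNIT ATTENTION sense messages per (controller, device) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and "UNIT ATTENTION" in m.group("msg"):
            counts[(m.group("ctrl"), m.group("dev"))] += 1
    return counts


# Sample taken from the error message posted earlier in this thread.
sample = """\
(da1:mps0:0:10:0): WRITE(10). CDB: 2a 00 01 cf 23 00 00 01 00 00
(da1:mps0:0:10:0): CAM status: CCB request completed with an error
(da1:mps0:0:10:0): Retrying command
(da1:mps0:0:10:0): WRITE(10). CDB: 2a 00 01 cf 23 00 00 01 00 00
(da1:mps0:0:10:0): CAM status: SCSI Status Error
(da1:mps0:0:10:0): SCSI status: Check Condition
(da1:mps0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:10:0): Retrying command (per sense data)
"""

counts = count_unit_attention(sample)
for (ctrl, dev), n in sorted(counts.items()):
    print(f"{ctrl}/{dev}: {n} reset event(s)")
```

Running the same function over the log from a one-port test and a two-port test of equal duration would give comparable counts. Where the kernel messages are stored (for example a syslog file) depends on the system and is an assumption to check locally.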

Could this be caused by a motherboard BIOS misconfiguration?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Could this be caused by a motherboard BIOS misconfiguration?
No, the system BIOS has no interactions with PCI devices while the OS is running. On a modern OS, that is.
 

tim7415

Cadet
Joined
Feb 22, 2017
Messages
7
I have been running badblocks on 10 disks in both backplanes for a few days without any errors. But this is using two LSI cards and each backplane is connected to the first mini-SAS connector of the corresponding card.

I am at a loss. Can't figure out why using the second port doesn't work.
 

timbiotic

Cadet
Joined
Oct 7, 2015
Messages
2
Did you solve this? I’m having the same issues. I thought it might be the cables, but I think it’s the two ports.
 

Doc Chacha

Dabbler
Joined
Sep 18, 2016
Messages
28
I know necromancy is bad.

But I'm curious to know if anyone found something about this since 2019.

I'm running TrueNAS 12.0-U8, with an LSI 9211-8i flashed to P20 IT mode, and behind that, 2 HD boxes with 4 drives in each, connected with an SFF8088 cable to the LSI card.
I keep getting the same kind of error messages as described in the top post, only on device da0.
I tried cable switching, many ways, drive switching (always da0 showing errors, whatever HD is there), and tried changing the way the cables cross each other (magnetic interaction hypothesis)... It's not a cable problem... It's not a HD problem... When I change the cables, the da0 slot changes from one box to the other, so it's not a box problem.

Nothing seems to work to get around this strange behaviour.

So, if anyone has any idea to help a lonely necromancer....
;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are the backplanes running the latest firmware? If they are, there's not a whole lot else you can try. Maybe check if the card is being cooled properly, including new thermal paste...

Although, wait a minute...
2 HD boxes with 4 drives in each, connected with an SFF8088
You could simply be exceeding the capabilities of the SATA physical layer, which is a 1-meter cable inside a chassis... And even then, with some devices, it's a bit of a stretch. And it has to account for backplanes, gender changers, etc.
SAS gets around this with higher-voltage signalling.
 

Doc Chacha

Dabbler
Joined
Sep 18, 2016
Messages
28
You could simply be exceeding the capabilities of the SATA physical layer, which is a 1-meter cable inside a chassis..(...)
That's a hit.
The total length of cable was around 1.5 meters (with a gender changer in the middle). After your answer, I thought it was worth testing a shorter setup.
Replacing the thing with a 1 meter SFF8088 to SFF8087 cable made the error go away.
Of course, from a technical point of view, having 2 cables going out of my server case is not ideal. But I solved my problem.

Many thanks to you!
 