LSI mrsas0: Initiaiting OCR because of FW fault! - Reset failed, killing adapter.

svoltaire · Aug 18, 2019

FreeNAS 11.2 - FreeBSD 11.2-STABLE

For the last few weeks, I have had issues with some disks that I had to replace. The replacements went fine, but now the controller and its disks intermittently become unavailable.
The LSI card is an M5016 flashed with the latest LSI firmware.
The problem seems to happen under heavy load (a resilver plus something else). I see it with the mrsas driver, and also with mfi (on different FreeNAS 11.x versions).
FreeNAS runs as a VM under ESXi 6.7 with PCI passthrough; this setup worked properly for years.
The drives are all HGST (6TB, and 3 disks are now 10TB).

dmesg:

Code:

AVAGO MegaRAID SAS FreeBSD mrsas driver version: 06.712.04.00-fbsd
mrsas0: <AVAGO Thunderbolt SAS Controller> port 0x4000-0x40ff mem 0xfd4fc000-0xfd4fffff,0xfd480000-0xfd4bffff irq 18 at device 0.0 on pci3
mrsas0: Using MSI-X with 4 number of vectors
mrsas0: FW supports <16> MSIX vector,Online CPU 4 Current MSIX <4>
mrsas0: MSI-x interrupts setup success
...
mrsas0: Initiaiting OCR because of FW fault!

When everything is ok, I have:

Code:

root@freenas:~ # camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual SATA Hard Drive 00000001>  at scbus3 target 0 lun 0 (pass1,ada0)
<VMware Virtual SATA Hard Drive 00000001>  at scbus4 target 0 lun 0 (pass2,ada1)
<IBM ServeRAID M5016 3.46>         at scbus33 target 0 lun 0 (pass3,da0)
<IBM ServeRAID M5016 3.46>         at scbus33 target 1 lun 0 (pass4,da1)
<IBM ServeRAID M5016 3.46>         at scbus33 target 2 lun 0 (pass5,da2)
<IBM ServeRAID M5016 3.46>         at scbus33 target 3 lun 0 (pass6,da3)
<IBM ServeRAID M5016 3.46>         at scbus33 target 4 lun 0 (pass7,da4)
<IBM ServeRAID M5016 3.46>         at scbus33 target 5 lun 0 (pass8,da5)

When the issue happens:

Code:

mrsas0: Initiaiting OCR because of FW fault!
mrsas0: Reset failed, killing adapter.
(da2:mrsas0:0:2:0): Invalidating pack
(da3:mrsas0:0:3:0): Invalidating pack
(da4:mrsas0:0:4:0): Invalidating pack
(da0:mrsas0:0:0:0): Invalidating pack
da2 at mrsas0 bus 0 scbus33 target 2 lun 0
da2: <IBM ServeRAID M5016 3.46> s/n 00b60db33f2a0400ff40f83f04b00506 detached
da3 at mrsas0 bus 0 scbus33 target 3 lun 0
da3: <IBM ServeRAID M5016 3.46> s/n 000e21c530a17800ff40f83f04b00506 detached
da4 at mrsas0 bus 0 scbus33 target 4 lun 0
da4: <IBM ServeRAID M5016 3.46> s/n 0002ef5b597a65892440f83f04b00506 detached
da0 at mrsas0 bus 0 scbus33 target 0 lun 0
da0: <IBM ServeRAID M5016 3.46> s/n 00c6f493401b41d32440f83f04b00506 detached
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
GEOM_MIRROR: Device swap0: provider da3p1 disconnected.
(da1:mrsas0:0:1:0): Invalidating pack
da1 at mrsas0 bus 0 scbus33 target 1 lun 0
da1: <IBM ServeRAID M5016 3.46> s/n 00095d1603310000ff40f83f04b00506 detached
GEOM_MIRROR: Device swap1: provider da2p1 disconnected.
GEOM_MIRROR: Device swap1: provider da4p1 disconnected.
GEOM_MIRROR: Device swap1: provider destroyed.
GEOM_MIRROR: Device swap1 destroyed.
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
GEOM_ELI: Device mirror/swap1.eli destroyed.
(da0:mrsas0:0:0:0): Periph destroyed
(da1:mrsas0:0:1:0): Periph destroyed
(da2:mrsas0:0:2:0): Periph destroyed
(da3:mrsas0:0:3:0): Periph destroyed
(da4:mrsas0:0:4:0): Periph destroyed

and camcontrol reports:

Code:

root@freenas:~ # camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual SATA Hard Drive 00000001>  at scbus3 target 0 lun 0 (pass1,ada0)
<VMware Virtual SATA Hard Drive 00000001>  at scbus4 target 0 lun 0 (pass2,ada1)
<IBM ServeRAID M5016 3.46>         at scbus33 target 5 lun 0 (pass8,da5)
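
To make the difference between the two listings explicit, here is a minimal sketch that parses two saved `camcontrol devlist` outputs and reports which bus/target/lun entries disappeared. The function names (`parse_devlist`, `missing_devices`) and the regular expression are my own illustration, not part of any FreeBSD tool; the sample text is copied from the listings above.

```python
import re

# Matches one line of `camcontrol devlist` output, for example:
# <IBM ServeRAID M5016 3.46>   at scbus33 target 0 lun 0 (pass3,da0)
DEVLIST_LINE = re.compile(
    r"^<(?P<model>.*)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<periphs>[^)]*)\)$"
)


def parse_devlist(text):
    """Map (bus, target, lun) to the sorted peripheral names on that line."""
    devices = {}
    for raw in text.splitlines():
        match = DEVLIST_LINE.match(raw.strip())
        if match is None:
            continue  # skip prompts, blank lines, and anything unrecognized
        key = (int(match["bus"]), int(match["target"]), int(match["lun"]))
        devices[key] = sorted(match["periphs"].split(","))
    return devices


def missing_devices(before, after):
    """Return the (bus, target, lun) keys present in `before` but not in `after`."""
    return sorted(parse_devlist(before).keys() - parse_devlist(after).keys())


BEFORE = """\
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<IBM ServeRAID M5016 3.46>         at scbus33 target 0 lun 0 (pass3,da0)
<IBM ServeRAID M5016 3.46>         at scbus33 target 1 lun 0 (pass4,da1)
<IBM ServeRAID M5016 3.46>         at scbus33 target 2 lun 0 (pass5,da2)
<IBM ServeRAID M5016 3.46>         at scbus33 target 3 lun 0 (pass6,da3)
<IBM ServeRAID M5016 3.46>         at scbus33 target 4 lun 0 (pass7,da4)
<IBM ServeRAID M5016 3.46>         at scbus33 target 5 lun 0 (pass8,da5)
"""

AFTER = """\
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<IBM ServeRAID M5016 3.46>         at scbus33 target 5 lun 0 (pass8,da5)
"""

for bus, target, lun in missing_devices(BEFORE, AFTER):
    print(f"missing: scbus{bus} target {target} lun {lun}")
```

With the listings from this post, targets 0 through 4 on scbus33 are reported as missing, while target 5 stays attached.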

the pool still reports resilvering:

Code:

errors: No known data errors

  pool: voluz
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Aug 18 19:01:46 2019
    3.66G scanned at 3.66G/s, 0 issued at 0/s, 23.2T total
    0 resilvered, 0.00% done, no estimated completion time
config:

    NAME                                              STATE     READ WRITE CKSUM
    voluz                                             DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        replacing-0                                   DEGRADED     0     0     0
          18047640745310007856                        UNAVAIL      0     0     0  was /dev/gptid/bad99e6b-c804-11e5-8416-000c29b3c3eb
          gptid/70f08df2-b2f3-11e9-bdf3-000c29b3c3eb  ONLINE       0     0     0
        gptid/bb6ea102-c804-11e5-8416-000c29b3c3eb    ONLINE       0     0     0
        gptid/89a4c1fc-69bd-11e9-8b8e-000c29b3c3eb    ONLINE       0     0     0
        gptid/bc8a0c7d-c804-11e5-8416-000c29b3c3eb    ONLINE       0     0     0
        gptid/035b4da6-86ee-11e9-b16d-000c29b3c3eb    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x312>:<0x1>
        voluz/vm-iscsi/osxboot:<0x1>

I get an 'error message' at boot (see attached), and I am not sure whether it is a real error. See the vdev_geom_open message; it seems truncated, but I am not sure where the full message would be stored.

When the issue happens, commands such as the one below do not make the drives show up again:
camcontrol reset all

storcli /c0 show... reports that the controller is not found.

If I restart the VM, everything works fine again (until the problem recurs).
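
Since the failure leaves recognizable lines in dmesg, a small check could flag it before the next manual reboot. This is only a sketch under assumptions: it works on dmesg text that has already been saved to a string, the marker strings are copied from the log above (including the driver's own spelling of "Initiaiting"), and the function name `summarize_dmesg` is hypothetical.

```python
import re

# Marker strings copied verbatim from the mrsas lines in the dmesg output.
OCR_MARKER = "Initiaiting OCR because of FW fault!"
KILL_MARKER = "Reset failed, killing adapter."

# Matches detach lines such as:
# da2: <IBM ServeRAID M5016 3.46> s/n 00b60db33f2a0400ff40f83f04b00506 detached
DETACHED = re.compile(r"^(?P<disk>da\d+): <.*> s/n \S+ detached$")


def summarize_dmesg(text):
    """Report whether the adapter was killed and which da devices detached."""
    ocr_started = False
    adapter_killed = False
    detached = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("mrsas"):
            if OCR_MARKER in line:
                ocr_started = True
            if KILL_MARKER in line:
                adapter_killed = True
        match = DETACHED.match(line)
        if match:
            detached.append(match.group("disk"))
    return {
        "ocr_started": ocr_started,
        "adapter_killed": adapter_killed,
        "detached": detached,
    }


SAMPLE = """\
mrsas0: Initiaiting OCR because of FW fault!
mrsas0: Reset failed, killing adapter.
(da2:mrsas0:0:2:0): Invalidating pack
da2 at mrsas0 bus 0 scbus33 target 2 lun 0
da2: <IBM ServeRAID M5016 3.46> s/n 00b60db33f2a0400ff40f83f04b00506 detached
"""

print(summarize_dmesg(SAMPLE))
```

The OCR line alone also appears in the normal boot log above, so only the "Reset failed, killing adapter." line is treated as the sign that the controller is gone.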

On other forums I read that the temperature of the LSI card could be the cause, but it is surprising that it works like a charm again as soon as I reboot.
It looks to me like a bug in the firmware or the mrsas driver when the reset happens under heavy load (there seem to be some bug reports about it, but my dmesg looks different).
A debug file is available; I am not sure how to upload it, however.

Do you have any recommendation?

svoltaire · Aug 18, 2019

debug data if that helps

jgreco · Aug 18, 2019

FreeNAS isn't expected to work well with a RAID controller. Pull the RAID controller and replace it with an LSI 9211 HBA flashed to IT firmware 20.00.07.00.

https://www.ixsystems.com/community...ide-to-not-completely-losing-your-data.12714/

https://www.ixsystems.com/community/threads/confused-about-that-lsi-card-join-the-crowd.11901/

Because you've created a frankenstein with this, you will probably need to rebuild your pool. No happy solutions for you and no idea how you got this far with such a setup.
