mps driver probe problems with 9.10.2 and 9200-8e controller to EXP3000 DAS

Status
Not open for further replies.

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Hello!

This is my first post here.

I wanted to get some advice on how to proceed with a problem I've encountered. I've been running FreeNAS bare-metal for 2 years now on a Lenovo TS440 with no problems.

I'm trying to get a second FreeNAS server running on an IBM x3650 M3 (eBay) with VMware 6.0. The controller is an LSI 9200-8e (Amazon) in PCI passthrough, connected to an external IBM EXP3000 drive array with 12 x Seagate Constellation 1TB SAS drives (also eBay).

The problem arises at boot time when it is probing the drives (I think). I get a "Fatal trap 12" after a few timeout probe attempts. It's hard for me to paste the errors, since the vSphere client does not give me any text copy. I can get snapshots on my Mac though. (Note -- I'm using the vSphere client on a Windows 10 box, and connecting using RDP from my Mac because I know the Windows client controls PCI passthrough more correctly than the Web interface.)

If I turn off the drive array, the VM will boot up fine and does appear to recognize the controller card. It assigns the MPS driver to the controller instance. (So this implies the problem lies with the drives or the SAS expander in the array.)

Since this is all virtual, I did some testing with different operating systems. I didn't write down all my results, but if I recall correctly, FreeNAS 9.3 and FreeBSD 10 behaved the same. Then I tried OmniOS, and that worked fine. Just recently I dropped the OmniOS VM and installed napp-it from the VM OVA file. With this system, I can see the drives, create all sorts of vdevs and it all seems to work. OmniOS uses the "mpt_sas" driver for the controller and "sd" for the drives.

Now. I've also tried moving the card over to my TS440 to eliminate VMware as the culprit. Again, this part of the troubleshooting was not documented well, so by memory, I *think* the fatal trap was the same, but it certainly did not boot. I could re-do this test if it's helpful.

In all my searching, I don't see any folks having problems with the 9200-8e. I can't really find much one way or the other for the EXP3000. So that got me wondering whether I might have a counterfeit card? I bought off amazon.com from a 3rd party vendor. The card came in plain plastic with no box or papers, so it might have simply been one of a bulk pack, or it might be more sinister. But then the question becomes what is different between the FreeBSD and illumos drivers and is it something worth changing on the FreeBSD side?

OK. I'm done writing now. I'll put my flame-suit on and see what comes back!

Regards
Michael
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What is the firmware version on the card? You want P20.00.07.
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Hey Eric,

Good question -- I forgot to include that information. The card came with P16 loaded and I did flash it up to P20 in my testing. It behaved the same under both revs. It does display P20 IT mode in the WebBIOS utility and using sas2flash on VMware.

I guess what I'm asking is two questions -- first, is there anything simple/obvious that I've missed that's worth trying? Second, what kind of information should I be gathering for the bug report, and how can I capture the information needed if I can't actually get the system to boot (does the failed boot actually log any info somewhere?).

EDIT -- added sas2flash output:
Code:
[root@vmware:~] /opt/lsi/bin/sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

    Adapter Selected is a LSI SAS: SAS2008(B2)  
    Controller Number              : 0
    Controller                     : SAS2008(B2)  
    PCI Address                    : 00:24:00:00
    SAS Address                    : 500605b-0-0660-01e0
    NVDATA Version (Default)       : 14.01.00.07
    NVDATA Version (Persistent)    : 14.01.00.07
    Firmware Product ID            : 0x2213 (IT)
    Firmware Version               : 20.00.07.00
    NVDATA Vendor                  : LSI
    NVDATA Product ID              : SAS9200-8e
    BIOS Version                   : 07.39.02.00
    UEFI BSD Version               : N/A
    FCODE Version                  : N/A
    Board Name                     : SAS9200-8e
    Board Assembly                 : H3-25260-02F
    Board Tracer Number            : SP33605766

    Finished Processing Commands Successfully.
    Exiting SAS2Flash.



Cheers
Michael
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Before filing a bug report, check with LSI support to see if they can identify your card as genuine.
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Well -- Joseph at LSI was nice to chat with. He verified this is a known good board from their supply chain, but originally shipped September 2015, so warranty expired already. I wasn't expecting warranty, so that's cool.

He recommended taking the updated driver from the LSI website for FreeBSD and trying to load that under FreeNAS. I guess I should muddle through that process and see if that helps. It will be a useful exercise.

Any other next steps?

UPDATE: I reconfigured the system with the 9200-8e configured as PCI passthrough to the FreeNAS 9.10.2 vm. I have NOT updated the FreeNAS software yet, so this is vanilla from the iso image. Here's the dmesg output with the DAS turned off. You can see the virtual drive recognized as mpt0 and the 9200 as mps0:

Code:
mpt0: <LSILogic 1030 Ultra4 Adapter> port 0x1400-0x14ff mem 0xfeba0000-0xfebbffff,0xfebc0000-0xfebdffff irq 17 at device 16.0 on pci0
mpt0: MPI Version=1.2.0.0
pcib2: <ACPI PCI-PCI bridge> at device 17.0 on pci0
pci2: <ACPI PCI bus> on pcib2
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0x2000-0x203f mem 0xfd5c0000-0xfd5dffff,0xfdff0000-0xfdffffff irq 18 at device 0.0 on pci2
em0: Ethernet address: 00:0c:29:72:2a:ab
pcib3: <ACPI PCI-PCI bridge> at device 21.0 on pci0
pci3: <ACPI PCI bus> on pcib3
mps0: <Avago Technologies (LSI) SAS2008> port 0x4000-0x40ff mem 0xfd4fc000-0xfd4fffff,0xfd480000-0xfd4bffff irq 18 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.01.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That was fast.

He recommended taking the updated driver from the LSI website for FreeBSD
  • The driver is the latest, kept updated by LSI/Avago/Broadcom employees
  • There's no "load a driver willy-nilly" on FreeNAS
UPDATE: I reconfigured the system with the 9200-8e configured as PCI passthrough to the FreeNAS 9.10.2 vm. I have NOT updated the FreeNAS software yet, so this is vanilla from the iso image. Here's the dmesg output with the DAS turned off. You can see the virtual drive recognized as mpt0 and the 9200 as mps0:
Wait, it works unless you attach the external drives? What happens if you hot-plug the thing? Any chance there's a firmware update for the expander in the thing?
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
That was fast.
Yeah -- working from home with a lot of free time today.

  • The driver is the latest, kept updated by LSI/Avago/Broadcom employees
  • There's no "load a driver willy-nilly" on FreeNAS
Yep -- I think the driver on their support site is actually older than the v21 reported in dmesg. And I understand that even if drivers are dynamically loaded, there are still immense compatibility issues with changing them around. (But done with care, I do think it can be part of a brute-force troubleshooting method :)

Wait, it works unless you attach the external drives? What happens if you hot-plug the thing? Any chance there's a firmware update for the expander in the thing?
Very good call. Looks like IBM has a firmware upgrade to version DS_ESM_v1.9A for the ESM cards in the DAS. It's a bootable CD-ROM which I'll have to burn. In the meantime I'll see if I can query the chassis from the WebBIOS at boot time.

And I'll go capture what I can with a hot-start of the DAS.

Thanks for your pointers!
Michael
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Hey folks,

Sorry for taking a while to reply. Busy.

Anyway. I did test out powering up the disk array while the FreeNAS VM was running. I'm not sure how to capture the text, though -- it scrolled by on the VMRC app so quickly. It appeared to be a whole bunch of tracebacks, followed by a reboot. Then it went through the same boot up procedure I described earlier -- it got to the point where it started doing some SCSI probes, and after three probe attempts, I got a Fatal trap 12.

I'll try to upload the screenshot of the trap here. (Upload File? Haven't used this interface before. Did I get that right?)

So -- any suggestions on how I can capture the console output as text?

Thanks for any guidance. Sorry for being such a n00b here. (Despite the fact that I started using FreeBSD in the early/mid 90s...)

Regards
Michael
 

Attachments

  • Screen Shot 2017-02-16 at 22.21.45.png
    362.3 KB

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'll try to upload the screenshot of the trap here. (Upload File? Haven't used this interface before. Did i get that right?)
It's ok, but you can just drag and drop or copy-paste images into the forum and XenForo will deal with them appropriately.

So -- any suggestions on how I can capture the console output as text?
Connect via SSH and set the buffer to something long enough. It might be best to then paste it into a file and upload the file.

That said, it probably won't have anything particularly interesting. This is still without the updated firmware, right?
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Well. Two steps forward, two steps backward.

I identified the EXP3000 firmware update on the IBM support website, but after registering, creating a device inventory and signing multiple agreements, their tool simply won't let me download the file without a support contract! And as far as I can tell, there is no way to query the current firmware revision loaded on the EXP3000, so I don't even know if I'm up to date. Makes me very tempted to replace this with the similar (slightly newer) Lenovo SA120. At least I know Lenovo will let me download firmware without a support contract!

So -- I have not upgraded the firmware, and I do not know the current firmware revision.

I also spent a bunch of time trying to capture the console output on the FreeNAS vm. Tried to configure syslog-ng to write to a local file, but every time I modify the /etc/local/syslog-ng.conf file, it gets rewritten from somewhere else. I tried editing the version under /conf/base as well, but same result. Anyone know how I can make permanent changes to the syslog-ng configuration?

I also tried setting up a serial port copied to a file under VMware. I got the VMware part working, but I'm struggling to get the /boot/loader.conf config correct to copy console output to /dev/ttyu0 as well as /dev/console. Any suggestions on that one? I verified that /dev/ttyu0 does indeed send the output to the VMware filesystem.

At this point, I'm tempted to buy a NEW SA120 from amazon, strip out the drives from this EXP3000, and re-sell it back on eBay. Or just use OmniOS/napp-it instead of FreeNAS. Pity really -- I quite like FreeNAS.
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Well -- I spent a few hours doing careful troubleshooting tonight, and I've made significant progress in understanding this. It really seems to be a timing issue -- it's like FreeNAS/FreeBSD can't handle too many SCSI timeouts concurrently.

This might be long, but I'll try to summarize as best I can.

First, I moved off the x3650 M3 and put the 9200-8e card into an x4 slot in my TS440 server. Initially, I tested using FreeNAS bare-metal. It booted fine with the card installed but with no connection to the EXP3000. Then, when I attached the enclosure, I'd get the same behavior as before. The good thing is that I was able to capture the fatal trap in this environment, since the /var/log directory is actually logging to a ZFS pool (whereas on the x3650 I had no other pool defined). I decided to switch to ESXi, though, and I haven't rebooted back into bare-metal, so I can't post that file just yet.

Now. With ESXi booting on the TS440 and with the 9200-8e in passthrough to FreeNAS 9.10.2, I can get it to boot fine with the exp3k powered down. If I turn on the exp3k, it takes three probes for FreeNAS to crash. But I was able to switch off the exp3k quickly and NOT cause the full trap/reboot scenario. Here's what I could capture in /var/log/messages with these "brief" powered connections:
Code:
Mar  1 21:57:00 freenas da1 at mps0 bus 0 scbus3 target 12 lun 0
Mar  1 21:57:00 freenas da1: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da1: Serial Number 9WK2GG6P0000C1176QT1
Mar  1 21:57:00 freenas da1: 300.000MB/s transfers
Mar  1 21:57:00 freenas da1: Command Queueing enabled
Mar  1 21:57:00 freenas da1: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da2 at mps0 bus 0 scbus3 target 11 lun 0
Mar  1 21:57:00 freenas da2: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da2: Serial Number 9WK2KXFS0000C118873A
Mar  1 21:57:00 freenas da2: 300.000MB/s transfers
Mar  1 21:57:00 freenas da2: Command Queueing enabled
Mar  1 21:57:00 freenas da2: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da3 at mps0 bus 0 scbus3 target 18 lun 0
Mar  1 21:57:00 freenas da3: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da3: Serial Number 9WK2KYZM0000C11888DJ
Mar  1 21:57:00 freenas da3: 300.000MB/s transfers
Mar  1 21:57:00 freenas da3: Command Queueing enabled
Mar  1 21:57:00 freenas da3: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da4 at mps0 bus 0 scbus3 target 9 lun 0
Mar  1 21:57:00 freenas da4: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da4: Serial Number 9WK2KZ780000C1185HL6
Mar  1 21:57:00 freenas da4: 300.000MB/s transfers
Mar  1 21:57:00 freenas da4: Command Queueing enabled
Mar  1 21:57:00 freenas da4: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da5 at mps0 bus 0 scbus3 target 14 lun 0
Mar  1 21:57:00 freenas da5: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da5: Serial Number 9WK2KZ870000C11888F4
Mar  1 21:57:00 freenas da5: 300.000MB/s transfers
Mar  1 21:57:00 freenas da5: Command Queueing enabled
Mar  1 21:57:00 freenas da5: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da6 at mps0 bus 0 scbus3 target 8 lun 0
Mar  1 21:57:00 freenas da6: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da6: Serial Number 9WK2K45A0000C1185FPN
Mar  1 21:57:00 freenas da6: 300.000MB/s transfers
Mar  1 21:57:00 freenas da6: Command Queueing enabled
Mar  1 21:57:00 freenas da6: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da7 at mps0 bus 0 scbus3 target 16 lun 0
Mar  1 21:57:00 freenas da7: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da7: Serial Number 9WK2K4MS0000C1185HCR
Mar  1 21:57:00 freenas da7: 300.000MB/s transfers
Mar  1 21:57:00 freenas da7: Command Queueing enabled
Mar  1 21:57:00 freenas da7: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da8 at mps0 bus 0 scbus3 target 10 lun 0
Mar  1 21:57:00 freenas da8: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da8: Serial Number 9WK2K4BY0000C1185GP1
Mar  1 21:57:00 freenas da8: 300.000MB/s transfers
Mar  1 21:57:00 freenas da8: Command Queueing enabled
Mar  1 21:57:00 freenas da8: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da9 at mps0 bus 0 scbus3 target 17 lun 0
Mar  1 21:57:00 freenas da9: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da9: Serial Number 9WK2K4W80000C118B1RH
Mar  1 21:57:00 freenas da9: 300.000MB/s transfers
Mar  1 21:57:00 freenas da9: Command Queueing enabled
Mar  1 21:57:00 freenas da9: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da10 at mps0 bus 0 scbus3 target 15 lun 0
Mar  1 21:57:00 freenas da10: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da10: Serial Number 9WK55STP0000920363LR
Mar  1 21:57:00 freenas da10: 300.000MB/s transfers
Mar  1 21:57:00 freenas da10: Command Queueing enabled
Mar  1 21:57:00 freenas da10: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da11 at mps0 bus 0 scbus3 target 19 lun 0
Mar  1 21:57:00 freenas da11: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da11: Serial Number 9WK55SLF0000C203EEDB
Mar  1 21:57:00 freenas da11: 300.000MB/s transfers
Mar  1 21:57:00 freenas da11: Command Queueing enabled
Mar  1 21:57:00 freenas da11: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:00 freenas da12 at mps0 bus 0 scbus3 target 13 lun 0
Mar  1 21:57:00 freenas da12: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 21:57:00 freenas da12: Serial Number 9WK2KXFK0000C1188660
Mar  1 21:57:00 freenas da12: 300.000MB/s transfers
Mar  1 21:57:00 freenas da12: Command Queueing enabled
Mar  1 21:57:00 freenas da12: 953869MB (1953525168 512 byte sectors)
Mar  1 21:57:01 freenas (probe12:mps0:0:20:0): INQUIRY. CDB: 12 01 83 00 fc 00
Mar  1 21:57:01 freenas (probe12:mps0:0:20:0): CAM status: SCSI Status Error
Mar  1 21:57:01 freenas (probe12:mps0:0:20:0): SCSI status: Busy
Mar  1 21:57:01 freenas (probe12:mps0:0:20:0): Retrying command

Then I switched it off, and we see some detach messages. Note that it does actually recognize all 12 disks.
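
For anyone who wants to check a capture like this without counting by eye, here is a minimal Python sketch. It assumes the syslog line format shown above; the names `summarize`, `ATTACH` and `BUSY` are made up for this illustration, and the sample holds only three lines copied from the capture. It collects the attached devices and tallies the targets that answered an INQUIRY with Busy. In the capture above that is target 20, the same target that attaches as ses0 later in this post.

```python
import re
from collections import Counter

# Attach lines look like: "da1 at mps0 bus 0 scbus3 target 12 lun 0"
ATTACH = re.compile(r"\b((?:da|ses)\d+) at \w+ bus \d+ scbus\d+ target (\d+) lun (\d+)")
# Busy probe lines look like: "(probe12:mps0:0:20:0): SCSI status: Busy"
BUSY = re.compile(r"\(probe\d+:\w+:\d+:(\d+):(\d+)\): SCSI status: Busy")


def summarize(lines):
    """Return ({device: target}, Counter of Busy replies keyed by (target, lun))."""
    attached = {}
    busy = Counter()
    for line in lines:
        m = ATTACH.search(line)
        if m:
            attached[m.group(1)] = int(m.group(2))
            continue
        m = BUSY.search(line)
        if m:
            busy[(int(m.group(1)), int(m.group(2)))] += 1
    return attached, busy


# Three lines copied from the capture above; a real run would read the log file.
sample = [
    "Mar  1 21:57:00 freenas da1 at mps0 bus 0 scbus3 target 12 lun 0",
    "Mar  1 21:57:00 freenas da2 at mps0 bus 0 scbus3 target 11 lun 0",
    "Mar  1 21:57:01 freenas (probe12:mps0:0:20:0): SCSI status: Busy",
]
attached, busy = summarize(sample)
print(sorted(attached.items()))
print(dict(busy))
```

To run it against a saved log, pass `open(path)` (any path to a copy of /var/log/messages) instead of `sample`.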

Next, I decided to try pulling all the drives out of the enclosure. This was really helpful.

If I tried to boot up with the enclosure powered on, attached, but empty, I would STILL get the 3 probes followed by the trap. But if I disconnected it and allowed the system to boot first, THEN attached the cable at the HBA, I was able to get the enclosure recognized:

Code:
Mar  1 22:48:58 freenas ses0 at mps0 bus 0 scbus3 target 20 lun 0
Mar  1 22:48:58 freenas ses0: <IBM-ESXS EXP3000 01C1> Fixed Enclosure Services SPC-2 SCSI device
Mar  1 22:48:58 freenas ses0: 300.000MB/s transfers
Mar  1 22:48:58 freenas ses0: Command Queueing enabled
Mar  1 22:48:58 freenas ses0: SCSI-3 ENC Device
Mar  1 22:48:58 freenas (probe0:mps0:0:20:1): INQUIRY. CDB: 12 00 00 00 24 00 
Mar  1 22:48:58 freenas (probe0:mps0:0:20:1): CAM status: SCSI Status Error
Mar  1 22:48:58 freenas (probe0:mps0:0:20:1): SCSI status: Busy
Mar  1 22:48:58 freenas (probe0:mps0:0:20:1): Retrying command
Mar  1 22:48:59 freenas (probe0:mps0:0:20:2): INQUIRY. CDB: 12 00 00 00 24 00 
Mar  1 22:48:59 freenas (probe0:mps0:0:20:2): CAM status: SCSI Status Error
Mar  1 22:48:59 freenas (probe0:mps0:0:20:2): SCSI status: Busy
Mar  1 22:48:59 freenas (probe0:mps0:0:20:2): Retrying command
Mar  1 22:49:01 freenas (probe0:mps0:0:20:2): INQUIRY. CDB: 12 00 00 00 24 00 
Mar  1 22:49:01 freenas (probe0:mps0:0:20:2): CAM status: SCSI Status Error
Mar  1 22:49:01 freenas (probe0:mps0:0:20:2): SCSI status: Busy
Mar  1 22:49:01 freenas (probe0:mps0:0:20:2): Retrying command

Note that this time, we see the ENCLOSURE itself recognized, whereas earlier it was just the drives.

Awesome -- we now have an enclosure talking! Let's try inserting the drives, one by one.

We start with disk 1:
Code:
Mar  1 22:54:53 freenas da1 at mps0 bus 0 scbus3 target 12 lun 0
Mar  1 22:54:53 freenas da1: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 22:54:53 freenas da1: Serial Number 9WK2GG6P0000C1176QT1
Mar  1 22:54:53 freenas da1: 300.000MB/s transfers
Mar  1 22:54:53 freenas da1: Command Queueing enabled
Mar  1 22:54:53 freenas da1: 953869MB (1953525168 512 byte sectors)
Mar  1 22:54:54 freenas ses0: da1,pass3: Element descriptor: 'SLOT 01 '
Mar  1 22:54:54 freenas ses0: da1,pass3: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Mar  1 22:54:54 freenas ses0:  phy 0: SAS device type 1 id 0
Mar  1 22:54:54 freenas ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
Mar  1 22:54:54 freenas ses0:  phy 0: parent 500a0b8755a83000 addr 5000c50025b96325


Excellent! Now, let's try attaching the second SFF8088 cable (allowing multipath to work)...
Code:
Mar  1 22:56:54 freenas da2 at mps0 bus 0 scbus3 target 25 lun 0
Mar  1 22:56:54 freenas da2: <IBM-ESXS ST31000424SS BC2D> Fixed Direct Access SPC-3 SCSI device
Mar  1 22:56:54 freenas da2: Serial Number 9WK2GG6P0000C1176QT1
Mar  1 22:56:54 freenas da2: 300.000MB/s transfers
Mar  1 22:56:54 freenas da2: Command Queueing enabled
Mar  1 22:56:54 freenas da2: 953869MB (1953525168 512 byte sectors)
Mar  1 22:56:57 freenas ses1 at mps0 bus 0 scbus3 target 33 lun 0
Mar  1 22:56:57 freenas ses1: <IBM-ESXS EXP3000 01C1> Fixed Enclosure Services SPC-2 SCSI device
Mar  1 22:56:57 freenas ses1: 300.000MB/s transfers
Mar  1 22:56:57 freenas ses1: Command Queueing enabled
Mar  1 22:56:57 freenas ses1: SCSI-3 ENC Device
Mar  1 22:56:57 freenas (probe0:mps0:0:33:1): INQUIRY. CDB: 12 00 00 00 24 00 
Mar  1 22:56:57 freenas (probe0:mps0:0:33:1): CAM status: SCSI Status Error
Mar  1 22:56:57 freenas (probe0:mps0:0:33:1): SCSI status: Busy
Mar  1 22:56:57 freenas (probe0:mps0:0:33:1): Retrying command
Mar  1 22:56:57 freenas ses1: da2,pass4: Element descriptor: 'SLOT 01 '
Mar  1 22:56:57 freenas ses1: da2,pass4: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Mar  1 22:56:57 freenas ses1:  phy 0: SAS device type 1 id 1
Mar  1 22:56:57 freenas ses1:  phy 0: protocols: Initiator( None ) Target( SSP )
Mar  1 22:56:57 freenas ses1:  phy 0: parent 500a0b8755be0000 addr 5000c50025b96326

OK, this is good -- we see the second enclosure instance and the second instance of disk 1 via the secondary path.
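
The two paths to disk 1 can be spotted in the log because da1 and da2 report the same serial number. A rough Python sketch of that check follows, assuming the "daN: Serial Number ..." line format from the captures above; `paths_by_serial` and `multipath_candidates` are hypothetical helper names, not part of FreeNAS or FreeBSD.

```python
import re
from collections import defaultdict

# Serial lines look like: "da1: Serial Number 9WK2GG6P0000C1176QT1"
SERIAL = re.compile(r"\b(da\d+): Serial Number (\S+)")


def paths_by_serial(lines):
    """Group device names by the serial number they report."""
    groups = defaultdict(list)
    for line in lines:
        m = SERIAL.search(line)
        if m:
            groups[m.group(2)].append(m.group(1))
    return dict(groups)


def multipath_candidates(groups):
    """Keep only serials seen on more than one device node (likely two paths)."""
    return {serial: devs for serial, devs in groups.items() if len(devs) > 1}


# Two lines copied from the captures above (first cable, then second cable).
sample = [
    "Mar  1 22:54:53 freenas da1: Serial Number 9WK2GG6P0000C1176QT1",
    "Mar  1 22:56:54 freenas da2: Serial Number 9WK2GG6P0000C1176QT1",
]
candidates = multipath_candidates(paths_by_serial(sample))
print(candidates)
```

A serial that shows up on only one device node after both cables are attached would be a drive the second path has not found, which may be worth checking before configuring multipath.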

I kept inserting drives one-by-one until all 12 were inserted and recognized.

There was a slight problem here with multipath -- it was recognized for some drives but not all. I ended up deleting the multipath configs and also removing the second cable later on.

With all drives attached, I was able to use the "Volume Import" option on the Storage page. But because of the multipath problems, it wasn't actually working right.

I ended up deleting the imported volume and trying to re-configure a zpool with 6 mirror vdevs. It seemed to be working fine. Of course, I can't reboot the system, since it would never boot up.


I wonder if there is any way to extend the SCSI command timeout?

Have I shared enough info here? (Sorry -- I'm really tired, so it's hard to write a good precis of the investigation)

Regards
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Definitely add that information to the bug report, it may help the devs figure out the problem.
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
It looks like there have been some changes to SCSI timeout handling in newer revs of FreeBSD, so I'm going to load up the latest FreeNAS and FreeBSD images and see what it looks like with those running.

As I mentioned earlier, I know that OmniOS handles this enclosure successfully, so I'm fairly convinced this is a software issue in the FreeBSD kernel.

I think the driver tries to talk to the enclosure, sends a bunch of INQUIRY messages, but gets no response or maybe gets some kind of SCSI nak (I'm not a SCSI protocol expert). Then we see a whole bunch of tracebacks, ultimately followed by the page fault trap 12. I think it's trying to clean up state information from the failed INQUIRYs but something's getting corrupted -- maybe a pointer corruption, or maybe it's a failure to free structures correctly.

Unfortunately, I just don't have the time to trawl through the mps driver code myself to try to find this.

But first.. Let's see if latest code is better.
 

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Results using FreeNAS 10 BETA 2, updated to NIGHTLY build...

Same result. I can get the enclosure to be recognized if I pull all drives and only attach it after the system has booted. Then I can engage all drives and create vdevs. Multipath is working nicely in FreeNAS 10 -- it picked up all drives correctly. Still had to use the dd trick to identify which slot was which. Curiously, the IBM EXP3000 identifies its slots as 1..4 on top row, 5..8 on middle row and 9..12 on bottom row. Multipath allocated the numbers in columns -- disk 0..2 in first column (leftmost column, drive 0 at top and 2 at bottom), then 3..5, 6..8 and 9..11.
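
The two numbering schemes can be converted with a little arithmetic. This is only a sketch in Python, assuming the layout is exactly as described above: 3 rows of 4 bays, enclosure slots counted 1..12 along the rows, multipath disks counted 0..11 down the columns. The function names are invented for the illustration.

```python
ROWS, COLS = 3, 4  # EXP3000 front panel as described: 3 rows of 4 bays


def slot_to_multipath_index(slot):
    """Enclosure slot (1..12, row by row) -> multipath disk number (0..11, column by column)."""
    if not 1 <= slot <= ROWS * COLS:
        raise ValueError("slot must be between 1 and 12")
    row, col = divmod(slot - 1, COLS)
    return col * ROWS + row


def multipath_index_to_slot(index):
    """Multipath disk number (0..11) -> enclosure slot (1..12)."""
    if not 0 <= index < ROWS * COLS:
        raise ValueError("index must be between 0 and 11")
    col, row = divmod(index, ROWS)
    return row * COLS + col + 1


# Print the panel as the enclosure numbers it, with the multipath number beside each slot.
for row in range(ROWS):
    slots = [row * COLS + col + 1 for col in range(COLS)]
    print("  ".join(f"slot {s:2d} = disk {slot_to_multipath_index(s):2d}" for s in slots))
```

With this mapping, the leftmost column (slots 1, 5 and 9) comes out as disks 0, 1 and 2, matching the description above.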

If I reboot the system after setting everything up, I get the probes followed by the crash. FreeNAS 10 seems to give more useful information on the console -- here's what I see for the trap..
[Image: console screenshot of the trap, see attachment below]

It's more obvious here that it's a data structure integrity issue.

I guess I have enough information for a bug report now. I will have to start writing one up. And maybe give FreeBSD 12.0-current a go just in case.

Thanks for your help Eric -- any other suggestions?
 

Attachments

  • Screen Shot 2017-03-02 at 12.36.54.png
    438 KB

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

miku

Dabbler
Joined
Feb 15, 2017
Messages
10
Hey folks,

Thought I'd update this thread with more information.

I decided to hold off on the bug report so that I could try the release version of FreeNAS Corral. It also exhibits the behavior, but I did finally work out how to change the kern.cam.scsi_delay value by simply editing the GRUB options at boot time. Setting it to 60000ms actually allowed FreeNAS Corral to boot successfully in an ESXi VM (ONCE only...). I was able to create pools, destroy them, test them out. Finally, I needed to reboot the system, but even with the larger SCSI delay, I had the same panic on reboot. Tried power cycling the host system (x3650 M3), but that didn't help. Tried extending the SCSI delay to 150000ms, no help. I have a suspicion this is not simply a delay problem in FreeBSD -- I think there is genuinely some kind of bug in the CAM queue management.

At this point, I thought it might be useful to test out Linux as well, knowing that OmniOS boots fine. I decided to use Ubuntu Server 16.10 for the test. And here I had success again. A bit more work to load the ZoL package and get everything up and running, but everything is working smoothly. While I do miss the FreeNAS GUI for graphical monitoring, I'm OK with going CLI for all config/control. Setting up snapshots and replication is more of a pain on Ubuntu, but not impossible. The really neat thing with Ubuntu is that I set up my Plex server using a Docker container instead of a separate vm. I imagine Corral will be very popular for this improvement.

So I now have a heterogeneous environment, with FreeNAS serving my personal files, and Ubuntu serving as my media archive. They both send readonly snapshots to each other for backup. I actually like having different systems backing each other up. Safety in diversity (but it doubles my threat/bug surface and power consumption too, I guess).

I also ordered an LSI 9217-8i card off eBay to try out. Maybe it will behave differently. For now, I'm shelving this investigation.
 