Nightmarish experience installing LSI HBA, advice needed

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
My first experience installing an LSI HBA yesterday was borderline catastrophic and I wanted to get some help diagnosing the problem before I make another attempt, as I nearly lost my entire primary pool in the process. (Sorry for the length, the situation is a bit complicated.)

To set the stage, I picked up an HP H220/9207-8i pre-flashed to IT mode from Art of Server on eBay, and a pair of these cables. My FreeNAS box has an encrypted (yes, I know) RAIDZ1 pool, 4x6TB, previously connected to the motherboard's SATA controller.

Installed the card, connected the four drives to one of the HBA's ports, and because I wanted to test both ports and cables to double-check that everything was good, also connected four old SSDs to the HBA's other port.

On first boot, the HBA's BIOS screen sat at the "Initializing... \" animation for a very long time, what felt like 5-ish minutes, before dumping its report of connected drives faster than I could read it and finally booting the machine. (Note: I didn't enter the configuration utility here, but my impression per Art of Server is that the card should have been essentially JBOD out of the box, as he handles all the flashing before shipment.)

The FreeNAS boot threw a ton of hardware/read errors, and when I got to the web UI, sure enough only three of the hard drives and three of the SSDs were visible (putting the pool into a degraded state, obviously). To try to rule out a bad cable, I shut down and swapped the SATA connectors between the hard drive that had failed to connect and one of the good ones. Powering back on, the drives had swapped places: the previously good drive failed to register, and the one that had originally failed to connect was now connected -- but it had become UNAVAIL per zpool status.

At this point I panicked, pulled the HBA, and reconnected the four drives to the motherboard's SATA ports. By some stroke of luck or divine intervention, three of the drives did come back up as ONLINE, but the fourth remained UNAVAIL. If a second drive had gone UNAVAIL this would be a much sadder and angrier post, as I didn't have a backup of this pool (in my defense, setting up a permanent backup is the end goal of getting this HBA in the first place). I was unable to easily "zpool online" the unavailable drive, so before risking further banging on a degraded array, I grabbed a pair of big USB hard drives and am currently backing up the entire pool. (Rather than trying to zpool replace the corrupted drive after the backup is done, I may take this opportunity to simply destroy the pool, recreate it unencrypted, and restore the data, since running an encrypted pool has been a lot of hassle and worry for little benefit so far.)

Does anyone have any insight into this behavior with the HBA? Did I make any obvious mistakes in the installation process? The results of swapping the SATA connectors would suggest that the cable is at fault, but it's hard to believe that both cables in the package are bad in the exact same way, and further, I'm confused as to why a bad cable could have been so destructive to one of my drives. I'm not sure how to proceed here, whether to order another set of cables or contact the HBA seller, though I will reinstall the card and continue troubleshooting once I'm done backing up the pool and there's no risk of total data loss.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I know it'll sound harsh but this is why you don't do hardware testing on your live ZFS pool. Learn that lesson well!

One possible culprit here, and something you haven't explained: how are the drives powered? Be certain that you're not using SATA power splitters, especially for the HDD's (they're probably good enough for the SSD's).

Pull your pool drives and set them aside. Focus on just testing with the LSI BIOS. Yes, the LSI BIOS will take an eff-ton of time trying to make sense of what it is seeing out there if there is something wrong -- that long hang is itself a sign of trouble. It is also why I like having the LSI BIOS burned onto the card: you can do lots of detective work without having to risk your pool or ZFS's sanity.

Validate one SFF8087 at a time. Make sure you are firmly seating the SFF8087 until it latches (the little annoying metal clippy thing). The connector is somewhat delicate so do not roughhouse with it. Failure to seat is one thing that can cause a lot of issues. I've got some right angle 8087's I'm wrestling with this morning with a Dell controller and I've taken to sanding down the edge of the PERC PCB a bit to get them to clip in.

Get one working cable and one working port with four drives showing up. Then switch HBA ports and see if it works on the other one. Then switch the cable, and repeat, until you isolate any problem. Don't be embarrassed if the problem is the human. It happens.
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Thanks for the response. I absolutely should have backed up the pool before now, but had been waiting till I could connect the backup directly to the NAS rather than having to transfer everything over the network, which would take (and currently is taking) ages. Feeling very fortunate that I didn't lose anything though so I'm putting up with the wait.

I've got all four drives (they're WD Reds) plugged into a single 4x SATA power cable run from the PSU -- is that what you mean by splitter? If that's a bad idea, I can certainly rewire something like a 2x2 configuration, though for what it's worth the drives/pool haven't given me any trouble prior to installing the LSI card.

That's good advice about isolating the adapter for testing. Is there anything I should look for in the card's configuration utility (or is there another control mode I don't know about)? This is what Art of Server lists as being on the card when he ships it out -- is there anything objectionable here?
  1. LSI Avago IT mode firmware version P20 (20.00.07.00)
  2. MPTSAS2 BIOS ROM flashed version 07.39.02.00
  3. MPTSAS2 UEFI ROM flashed version 07.27.01.01
Lastly just to expedite my troubleshooting a bit, what are best practices for hot-swapping SATA drives (both spinning and solid state) while the machine is powered on? If it's safe to attach and remove the drives to the HBA and motherboard's ports without having to shut down every time, that would make things a lot quicker. And if so, what's the best way to monitor the status of the drives in real time?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've got all four drives (they're WD Reds) plugged into a single 4x SATA power cable run from the PSU -- is that what you mean by splitter?

No, a splitter goes from one male SATA power to two female. The problem is that the SATA power connector is just barely sufficient for a single hard drive. The older style (often called "Molex") was designed for much higher ratings because they used to spin full height 5.25" HDD's, and even today those connectors are able to handle more current.

Whether or not it is wise to plug 4 HDD's into a single power lead is still a good question, and my general feeling is that it tends to be a bad idea if you happen to have synchronous spinup. The power load on the 12V rail can suddenly spike to 8-10A for four drives (a typical 3.5" drive pulls roughly 2-2.5A at 12V during spinup).

If that's a bad idea, I can certainly rewire something like a 2x2 configuration, though for what it's worth the drives/pool haven't given me any trouble prior to installing the LSI card.

It's not always easy to spot brownout behavior, but it can be doing damage to the drives without you being aware. The thing I was thinking was that the HBA seemed to be freaking out at what it was seeing from the drives, and typically that is a sign of drive failure or other hardware issues with the drives.

That's good advice about isolating the adapter for testing. Is there anything I should look for in the card's configuration utility (or is there another control mode I don't know about)? This is what Art of Server lists as being on the card when he ships it out -- is there anything objectionable here?
  1. LSI Avago IT mode firmware version P20 (20.00.07.00)
  2. MPTSAS2 BIOS ROM flashed version 07.39.02.00
  3. MPTSAS2 UEFI ROM flashed version 07.27.01.01

I've never really seen any issue with mismatched BIOS/firmware. The important bit is the driver/firmware combination when FreeNAS is running.
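If you want to verify what's actually flashed from the running system, one way to do it (assuming the LSI sas2flash utility is available on your build; FreeNAS ships it):

  sas2flash -listall       # one-line summary per SAS2 controller found
  sas2flash -list          # full detail: firmware, BIOS, and UEFI ROM versions

The firmware version reported there should match the P20 (20.00.07.00) the seller advertised.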

Normally if I'm going to be doing a lot of dinking around with disks on an HBA, I'll load FreeBSD or FreeNAS in singleuser mode and then you can use "camcontrol devlist" and "camcontrol rescan" to probe what's out there.
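For example (this is stock FreeBSD camcontrol, so it works the same under FreeNAS):

  camcontrol devlist       # list every device the CAM layer currently sees
  camcontrol rescan all    # re-probe all buses after attaching or detaching a drive

If a drive you just attached still doesn't show up in devlist after a rescan, go look at cabling and power before suspecting anything at the ZFS level.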

Lastly just to expedite my troubleshooting a bit, what are best practices for hot-swapping SATA drives (both spinning and solid state) while the machine is powered on? If it's safe to attach and remove the drives to the HBA and motherboard's ports without having to shut down every time, that would make things a lot quicker. And if so, what's the best way to monitor the status of the drives in real time?

Connecting and disconnecting the data is absolutely fine.

The problem is power. If you can get a separate power lead to each HDD, that's far preferable, because a sudden attach of a drive can cause a brownout to the other drive(s) on that power lead. Be very careful to keep the connector dead straight when doing power operations, as the design of the connector intends for safe attach and detach through use of different length contacts. If you mash it on sideways that can be bad.

SSD's have substantially lower power requirements so there probably aren't brownout considerations there.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi rudds,

Did I make any obvious mistakes in the installation process?

Well... You did...

As @jgreco said, you never do something like this on a live pool, even less so when that pool has no backups.

The second big one: when you noticed your pool was damaged but not destroyed, you kept hammering on it and risked destroying it for good.

Your RAID-Z1 pool can survive the loss of only a single drive. Even then, once that drive is lost, recovery will only go correctly if everything else is perfect: with no redundancy left, there is nothing to detect and fix any further errors.

So after you lost one drive, you took one of the remaining drives and moved it to where you knew there was a good chance it would not mount anymore. So now that one drive is lost, let's do our best to lose a second one and push that RAID-Z1 beyond recovery.

What you should have done is back up that pool ASAP. Once the data are safe, you try to recover the failed drive in the safest way possible for the remaining drives. Clearly, that includes not touching whatever is still working.

So you should have moved the 4th drive to the onboard controller. Once it was back online, you resilver the pool. Once the pool is healthy again, you should have shut down your server before moving the remaining drives back to the onboard controller. Don't do it live, because that would degrade the pool and require a new resilver for each drive.
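A rough sketch of the commands involved (the pool name "tank" and the gptid label here are made up; check zpool status for your real ones):

  zpool status tank              # identify which member is UNAVAIL
  zpool online tank gptid/xxxx   # after the drive is back on a known-good port
  zpool status tank              # watch the resilver run to completion

And if the drive itself is dead, "zpool replace" with a fresh disk is the follow-up, but only once the backup is done.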

Once all the drives are back onboard, you can test with empty drives and keep your pool safe.

But doing your best to lose a second drive was clearly a no-go...
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Well, thankfully I now have a full backup of my entire storage pool, so any further testing I do here won't cause a complete disaster.

One question: in the HBA's configuration utility, the card is ringing up as an "H220" -- if the LSI firmware was installed successfully, should it be showing as a "9207-8i" instead? Or does that name string not change with a cross-flash? What I'm trying to do is rule out the possibility that the firmware didn't get updated properly -- eyeballing the PCB of the HP card in my hand against photos of an actual LSI-branded 9207-8i, they appear to be identical cards beyond the logos on the board.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's fine for it to show up as an HP card as long as the firmware running is LSI. What you want to look for is something like this in your dmesg:

mps0: <Avago Technologies (LSI) SAS2008> port 0x4000-0x40ff mem 0xfd3fc000-0xfd3fffff,0xfd380000-0xfd3bffff irq 18 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
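
On a live system you can pull just those lines with something like:

  dmesg | grep -i mps      # controller probe, firmware, and driver version lines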
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Quick update: After ordering a second set of cables and doing a tremendous amount of testing, I've determined the H220 has to be at fault here and have shipped it back to the seller.

As a replacement I found the Supermicro AOC-S2308L-L8E in stock for a good price from a nearby seller, so I ordered one that should be here tomorrow. The seller's page didn't mention this info and I didn't see anyone mention it in my searches on this forum (just a lot of praise for the card), but I'm seeing some disagreements in old Reddit posts about whether this card will work in a non-Supermicro motherboard. I'm currently using an Asus board in my FreeNAS machine so this is obviously a concern for me.

Can anyone weigh in on this? If it's true, is it possible to flash this card to a firmware that will work in any motherboard?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Anything's possible. The Dell H310's have a known issue where the SMBus interferes with the host. See, for example, this link.
 

MisterPi

Cadet
Joined
Mar 15, 2020
Messages
8
This is a very interesting thread for me. I just received a "Genuine LSI 6Gbps SAS HBA LSI 9200-8i = (9211-8I) IT Mode ZFS FreeNAS unRAID US" and the necessary cables and was wondering about the best way to move my 4 drive pool onto the new controller. I contacted the seller and asked whether the HBA had been flashed IT Mode, and the response was Yes.

Following the advice above, I will install the card and test it and the cables with spare drives that are not a part of my main (only) pool.
Once I am convinced that my HBA will reliably control the drives, is there a "best practice" for moving the pool from 4 of the 8 motherboard SATA ports to the new HBA?


System: Gigabyte GA-F2A88XM-D3HP, AMD A6-7400K & 24GB of memory. I have 4 3TB disks, 1 VDEV, RaidZ2. All pretty generic. I just installed a second 120GB SSD so I have a mirrored boot pool. My use is as a home server, potentially with Plex (installed but not yet running).
 
Joined
Dec 29, 2014
Messages
1,135
FreeNAS identifies drives by the gptid. I would suggest backing up your pool first, but you should be able to move the connections without needing to make any configuration changes.
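If you want to see that mapping for yourself before moving anything, two standard FreeBSD commands will show it (the output will of course be specific to your system):

  zpool status       # pool members are listed by gptid, not by daX/adaX device name
  glabel status      # maps each gptid label to whatever device node it currently sits on

Since the pool tracks its members by gptid, it doesn't matter which controller or port a given disk lands on after the move.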
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
jgreco said:
Whether or not it is wise to plug 4 HDD's into a single power lead is still a good question, and my general feeling is that it tends to be a bad idea if you happen to have synchronous spinup. The power load on the 12V can suddenly be spiking at 8-10A for four drives.
The ATX 1.3 power supply specification (the revision where SATA was added) and newer call for 18 AWG wires to the SATA connector (section 4.2.2.5, table 20).
18 AWG copper wire is rated at 16A continuous.

But I know where you're coming from. I can look at the current ratings of the connectors and tell you there's nothing electrically wrong with using a splitter, but I'm still not comfortable doing so.
 

MisterPi

Cadet
Joined
Mar 15, 2020
Messages
8
18 AWG copper wire is rated at 16A continuous.
18 AWG has a resistance of 6.5 ohms per 1000 ft. If we generously estimate 3' of wire with all of the drives clustered at the far end drawing 16A, the net voltage drop across the wire is 0.3V, or about 2.6%. The bigger problem is that all of those drives draw a time-varying current, which makes for a variable voltage drop. Nevertheless, probably not an issue in real life, because 12V is for the motors, not the R/W electronics.
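In shell terms, the same sums (sticking with the figures above):

  # 6.5 ohm per 1000 ft = 0.0065 ohm/ft; 3 ft of wire carrying 16A
  echo 'scale=3; 0.0065 * 3 * 16' | bc         # -> .312 volts dropped across the wire
  echo 'scale=3; 0.312 / 12 * 100' | bc        # -> 2.600, i.e. about 2.6% of the 12V rail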

When my power supply ran out of SATA power connectors, I started using the molex-to-dual SATA power splitters and things seem to be working well so far. But I'm new so there is still time for things to go all pear-shaped.
 

MisterPi

Cadet
Joined
Mar 15, 2020
Messages
8
Apropos new HBA:
I plugged in the HBA and rebooted. When it came up, I saw the LSI prompts and disabled boot support, since I will boot off the motherboard. Booted into FreeNAS, everything was OK. Shut down.
Connected the two cables and put 1 drive on Port 0-P1. Booted, everything OK, FreeNAS saw it correctly as da0 (unused). Shut down.
Repeated with one drive on P2 & P3, then with a drive on P4 & Port 1-P1, then with drives on Port 1 P2, P3 & P4. Each time, FreeNAS saw the drives correctly.

I think that says both ports and both cables work. I think that also says I have the correct firmware for FreeNAS (no RAID prompts or anything).
Next step: Get 4 drives to form another pool on the HBA and copy the data (~1.5TB?) to the new pool and see how that goes. If that works, I will move the "old" drives to the HBA.
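For the copy itself, ZFS replication is usually cleaner than a file-level copy. A minimal sketch, with made-up pool names "oldpool" and "newpool" (substitute your own):

  zfs snapshot -r oldpool@migrate                          # recursive snapshot of every dataset
  zfs send -R oldpool@migrate | zfs receive -F newpool     # replicate datasets and properties

-R on the send preserves the whole dataset hierarchy and its properties; -F lets the receive roll over the freshly created, still-empty target pool.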
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
MisterPi said:
18AWG has a resistance of 6.5Ohms/1000ft and we generously estimate 3' of wire with all of the drives clustered at one end drawing 16A, the net voltage drop across the wire is 0.3v or about 2.6%.
And this is accounted for as well. The ATX power supply specifications state the voltage must be within +/-5%, but the SATA specification says the device must accept +/-10%. Meaning, at worst case, if your PSU is putting out the minimum acceptable 11.4V and you drop 0.3V over the length of the cables, you're left with 11.1V at the last drive in the chain, still above the 10.8V minimum at which the drive must operate normally.
 