Idle reboot bug: aacraid0: COMMAND 0xfffffe00007bdfa0 TIMEOUT AFTER 3857 SECONDS, shutting down controller...done

SilverNetworks
Greetings iXsystems Community,

I've experienced what I believe to be a bug affecting FreeNAS/TrueNAS Core 11.x/12.x systems. When a system is completely idle (zero disk activity for a period of time), it reports the following error and then automatically reboots:

[Screenshot: console output showing the COMMAND TIMEOUT error and controller shutdown]


I also see the following in /var/log/messages just prior to the reboot:

Code:
aacraid0: COMMAND 0xfffffe00007bdfa0 TIMEOUT AFTER XXXX SECONDS
aacraid0: shutting down controller...done
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: New comm. interface type2 enabled
aacraid0: Power Management enabled


The issue does not occur if there is disk activity. Storage servers with constant, heavy disk activity are not affected, and we see high uptimes, e.g.:

2:07AM up 284 days, 3:40, 1 user, load averages: 3.59, 4.00, 3.15

To troubleshoot the issue, I've built a test system with no pools created. If I leave the system idle, it reboots after anywhere from roughly 1 to 72 hours. This issue affects Adaptec 7 and 8 series controllers, using either the in-box driver or Adaptec's own driver (e.g. https://storage.microsemi.com/en-us/speed/raid/aac/unix/aacraid_freebsd_b56008_tgz.php).

This test system contains the following:

Code:
TrueNAS 12.0 BETA
Supermicro X9SRL-F Motherboard
Intel Xeon E5-2603 CPU
32GB ECC RAM
Adaptec 71605 Controller (HBA/RAW Pass Through mode)
KINGSTON SVP200S360G (boot disk, not connected to Adaptec controller)
Disks connected to Adaptec controller:
INTEL SSDSC2BB24
INTEL SSDSC2BB24
WDC WD30EFRX-68A
WDC WD20EFRX-68E
WDC WD30EFRX-68E
WDC WD30EFRX-68A
WDC WD30EFRX-68A


Please let me know what I can do or what other information is needed to help fix this issue.

Cheers,
Greg
 
SilverNetworks
Greetings again,

Adaptec/Microsemi have dropped support for 7 and 8 series controllers, so I decided to take a crack at solving this myself.

The issue is triggered by the following code in aacraid.c:

Code:
/*
 * Check for commands that have been outstanding for a suspiciously long time,
 * and complain about them.
 */
static void
aac_timeout(struct aac_softc *sc)
{
    struct aac_command *cm;
    time_t deadline;
    int timedout;

    fwprintf(sc, HBA_FLAGS_DBG_FUNCTION_ENTRY_B, "");
    /*
     * Traverse the busy command list, bitch about late commands once
     * only.
     */
    timedout = 0;
    deadline = time_uptime - AAC_CMD_TIMEOUT;
    TAILQ_FOREACH(cm, &sc->aac_busy, cm_link) {
        if (cm->cm_timestamp < deadline) {
            device_printf(sc->aac_dev,
                      "COMMAND %p TIMEOUT AFTER %d SECONDS\n",
                      cm, (int)(time_uptime-cm->cm_timestamp));
            AAC_PRINT_FIB(sc, cm->cm_fib);
            timedout++;
        }
    }

    if (timedout)
        aac_reset_adapter(sc);
    aacraid_print_queues(sc);
}


To mitigate/mask the issue, I've compiled a new driver with the following code update:

Code:
    if (timedout) {
        /*
         * Resetting the adapter causes FreeNAS 11.x/12.x to panic with a
         * "command not in queue" error. Let's not reset the adapter and
         * carry on as if nothing happened.
         * This is the command we are leaving out:
         * aac_reset_adapter(sc);
         */
    }
    /* The braces above keep aacraid_print_queues() from silently
     * becoming the body of the now-empty if statement. */
    aacraid_print_queues(sc);


This results in the "command timeout" message still being displayed, but the system no longer reboots:

[Screenshot: console output showing the timeout message, with the system still running]


Why commands sit in the queue for extended periods is unclear to me, but they do seem to clear out on their own eventually, with no harmful effects.

Attached are the new driver source code and compiled drivers for FreeBSD 11.2, 11.3, 11.4, 12.0 and 12.1. I've completed some basic testing with these drivers, and so far everything is working just fine. All of the standard "use at your own risk" and "YMMV" disclaimers apply if you choose to use the attached drivers.
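
If you'd rather build from source than trust my binaries, the process is roughly the standard FreeBSD driver-module workflow. This is only a sketch: the release version below is an example and must match the kernel your FreeNAS/TrueNAS build is based on, and note that loading a module will not override an aacraid driver that is compiled into the kernel itself.

Code:
# Fetch the kernel sources matching your release (example: 12.1-RELEASE)
fetch https://download.freebsd.org/ftp/releases/amd64/12.1-RELEASE/src.txz
tar -C / -xzf src.txz

# Drop the patched source over the stock driver
cp aacraid.c /usr/src/sys/dev/aacraid/aacraid.c

# Build the module
make -C /usr/src/sys/modules/aacraid

# Install it and have the loader pick it up at boot
cp /usr/src/sys/modules/aacraid/aacraid.ko /boot/modules/
echo 'aacraid_load="YES"' >> /boot/loader.conf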

Cheers,
Greg
 

mevans336

Dabbler
Hi, I am also facing this issue and would really rather not give up this very nice Adaptec card, although I have an LSI on the way. Can you provide instructions for how to compile this? I'm used to Linux with makefiles and ./configure, so when presented with just .c files, I'm a bit stuck.
 

Ericloewe

Server Wrangler
Moderator
That is way too much effort for something that may or may not work and doesn't have the real-world validation that the LSI HBAs have.
You're better off selling the card, unless you're specifically looking to get into driver development.
 

mevans336

Dabbler
That is way too much effort for something that may or may not work and doesn't have the real-world validation that the LSI HBAs have.
You're better off selling the card, unless you're specifically looking to get into driver development.
Thanks, but your post isn't very helpful. I'd like to solve this issue with this card.
 

jgreco

Resident Grinch
Thanks, but your post isn't very helpful. I'd like to solve this issue with this card.

The point being made is that you're not really going to solve the issue with that card. There are very few disk controllers that can actually handle the punishment that ZFS throws at them, and even with those controllers, you have to be using exact versions of firmware that are known to work correctly. See more detail at

https://www.truenas.com/community/r...bas-and-why-cant-i-use-a-raid-controller.139/

Even if you were a UNIX kernel developer who was an expert in hacking on device drivers, the basic problem is that your typical RAID card is designed for a different purpose than what ZFS needs, so you might fix the one issue noted above but are very likely to run into other issues. Worse, there are likely hidden problems that you will NOT run into until they manifest under adverse conditions like a disk failure, and then, if your I/O subsystem falls apart while ZFS is busy cramming massive I/O at a resilver and data on disk becomes corrupted, you're screwed.

You are not the first person who has desperately wanted to use an alternative controller and tried to find ways to make it work, and you won't be the last, because it is a natural desire. Unfortunately, it just doesn't work out in practice.

So while you might be able to make that card appear to work "better" with specific patches for specific issues, the overall experience people have had with the Adaptec cards is that they aren't really at the 100.00000% correct level that ZFS needs to be able to do its thing. The Adaptec cards may not be quite as bad as the CISS-based cards, for example, but that isn't really the measuring stick that needs to be used, alas.

The LSI HBA cards in IT mode with the 20.00.07.00 firmware have BILLIONS and BILLIONS of aggregate run-hours on them without issue. The only controllers known to work correctly are the LSI 6Gbps, LSI 12Gbps, Intel AHCI SATA, and Intel SCU's, plus some non-Intel AHCI controllers that properly implement AHCI.

Please note that forum members cannot always fix problems in the manner which you desire, so @Ericloewe's post was perhaps terse, but it did contain the correct answer.
 

mdedetrich

Dabbler
I'm gonna throw in my 2 cents here.

I am currently using a SAS-72405 card (which uses this driver), but it's set to HBA mode, which means that all RAID/caching/multiplexing functionality is disabled and it just passes the disks directly through to the OS. I have experienced the problem set out in this thread: if the disks are not being used, the aacraid driver locks up the entire system by calling aac_reset_adapter(sc);. This has happened to me twice, in the following situations:

1. The disks sitting behind the HBA are unmounted (and hence get no reads/writes); this triggers the timeout and just locks up the system.
2. A drive sitting behind the HBA fails a S.M.A.R.T. check; since the drive has failed, no reads or writes happen on it, which also causes the timeout.

Apart from these issues, at least this specific HBA has been solid (even managing to bring a RAID-Z2 pool back whole after the system had a critical shutdown while concurrent writes were happening). Of course I cannot claim this means the card in general is rock solid, but I have to disagree with @jgreco's assessment, because it creates a chicken-and-egg situation: a card cannot achieve the aforementioned rock-hard stability while the upstream driver doesn't get bugfixes/patches, and people are dissuaded from using the cards because the drivers don't receive the necessary fixes, so the card never reaches the state where it can run for "BILLIONS OF HOURS WITH NO PROBLEMS".

Because of this, I am in the process of posting the fix by @SilverNetworks upstream to FreeBSD. While it may not be a perfect fix, it is at least a massive improvement on the current situation, because from what I have experienced, the `aac_reset_adapter(sc);` call when timedout triggers seems to be a workaround/hack for when the driver "thinks" something is "wrong" (and by "thinks" I mean the disk is not getting any I/O, which is itself an incorrect definition of "wrong"; there are plenty of valid reasons why that can occur).

@SilverNetworks, I would like to attribute credit for the upstream fix to you. I have created a PR on GitHub to validate that it passes CI (continuous integration); if you have a name/email, that would be great so I can credit you properly.
 

jgreco

Resident Grinch
Apart from these issues, at least this specific HBA has been solid (even managing to bring a RAID-Z2 pool back whole after the system had a critical shutdown while concurrent writes were happening). Of course I cannot claim this means the card in general is rock solid, but I have to disagree with @jgreco's assessment, because it creates a chicken-and-egg situation: a card cannot achieve the aforementioned rock-hard stability while the upstream driver doesn't get bugfixes/patches, and people are dissuaded from using the cards because the drivers don't receive the necessary fixes, so the card never reaches the state where it can run for "BILLIONS OF HOURS WITH NO PROBLEMS".

That's true, but overall the problem has been that people have repeatedly said things like this over the years, proceeded on their way for a while, and then come back having hit a snag or problem or significant disaster. This is absolutely the case with CISS-based and Areca-based cards, and there is no particular reason to think there aren't other issues lurking.

It is going to be very difficult to get the sort of exhaustive testing for a different card, especially a card that is no longer being produced or supported.

Additionally, since commenting out the card reset (rather than fixing the underlying issue causing the panic when the card is reset) is clearly not a proper fix, you have potentially introduced new failure modes into the driver: someone saw a case where it was necessary to reset the card when a timeout occurred, and you have removed that behavior without really understanding the bigger picture of how error recovery for this card was supposed to work.

Since the driver seems to be authored by some FreeBSD driver guys, the better thing to do here would have been to contact them or file a bug report on the FreeBSD bugs site, and see if you can scare a correct fix out of them.
 

mdedetrich

Dabbler
That's true, but overall the problem has been that people have repeatedly said things like this over the years, proceeded on their way for a while, and then come back having hit a snag or problem or significant disaster. This is absolutely the case with CISS-based and Areca-based cards, and there is no particular reason to think there aren't other issues lurking.
The more pertinent question is whether it's a software or hardware issue. This is one of the beauties of open source: even if the manufacturer no longer sells the product and hence stops caring about it, the community can improve the driver until it reaches a stable state.

I do suspect that these issues are driver-related, but don't hold me to that. I can, however, replicate @SilverNetworks' findings, and with the patch I haven't had any issues.

Since the driver seems to be authored by some FreeBSD driver guys, the better thing to do here would have been to contact them or file a bug report on the FreeBSD bugs site, and see if you can scare a correct fix out of them.

Already have. I posted a pull request and am getting feedback on it. I suspect that in the end it's going to be a judgment call on which is worse: locking up the entire FreeBSD system, or dropping a hacky workaround that potentially solves some rare problem (because from reading the code, that's what it looks like).
 

Heracles

Wizard
People always say they accept the risk until bad things happen....

Here, you cannot see any problem, so you assume there are none. The day you are proven wrong, you will end up in trouble. There will be no point blaming us, the hardware, or anything else: your data will be gone.

The reboot was put in place for a reason. Why was it there? You may very well have unleashed a much bigger problem by bypassing that workaround. Bigger does not mean it will happen as often or more often; it does not even mean you will experience it at all. But the fact is that you have no guarantee, no safety, in that regard.

Here, we can help you ensure your data are safe. Using such a controller with a modified driver is not safe at all. Risk your data if that is what you wish to do; hope you have extra backups in that case. Expose your data to as much risk as you wish. Just do not complain when the shit hits the fan.
 

jgreco

Resident Grinch
The more pertinent question is whether it's a software or hardware issue.

Doesn't seem like a hardware issue. Multiple people are observing it, and the card doesn't actually seem to need a reset in the specific case you've stumbled upon. That's a pretty compelling argument for a software fix, which could be either in the driver or in the card firmware.

Still, it's unclear why you wouldn't just get a cheap HBA known to work correctly rather than a RAID controller that you've shown isn't behaving correctly. In general, the goal of people running FreeNAS is to run ZFS to provide highly reliable file storage. If you don't care about the reliability of the storage, then running random cards is probably fine, but experience says that's a bit of an edge case.
 

mdedetrich

Dabbler
Still, it's unclear why you wouldn't just get a cheap HBA known to work correctly rather than a RAID controller that you've shown isn't behaving correctly. In general, the goal of people running FreeNAS is to run ZFS to provide highly reliable file storage. If you don't care about the reliability of the storage, then running random cards is probably fine, but experience says that's a bit of an edge case.

To answer your question, at the time this was the only HBA card I could use. I was trying to save costs, and afaik this is the only 24-port SAS HBA you can buy, and the motherboard I am using has limited PCIe slots (when taking into account other things like NICs). Ironically, since I am using a 24-bay server chassis, it was also the recommended HBA to get.

The card is also much cheaper than any LSI card (which I know is what's recommended) if you need a 24-port HBA. I will likely try to upgrade to a more recommended setup in the future. If a 24-port LSI card existed I definitely would have gotten it, but afaik it doesn't.

As I said before, we are using the card in direct HBA mode, so it's also important to distinguish between cases where people use the card in HBA mode (which you should absolutely do) and cases otherwise.

The reboot was in place for a reason at first. Why was it there ? You may very well have unleashed a much bigger problem by bypassing that workaround. Bigger does not mean that it will happen as often or more. Does not mean you will experience it at all. But the fact is that you do not have any warranty, any security about that.

Maybe, but at least when the card is running in HBA mode, the reset triggered by the timeout hasn't occurred at all apart from the cases I mentioned earlier, where it shouldn't occur. It could be that this code is relevant for other modes (i.e. if you are using the card as an actual RAID card), which I am not.

Even in the original author's code you can see that this check is considered low-priority and it's fine if it doesn't always fire, i.e.:


Code:
        /*
         * While we're here, check to see if any commands are stuck.
         * This is pretty low-priority, so it's ok if it doesn't
         * always fire.
         */
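
For context, in the stock sources that comment sits in the driver's service thread, which only runs the stuck-command check when its sleep times out with nothing else to wake it, i.e. when the controller has been idle for a full interval. Roughly (a from-memory sketch of the stock driver's shape, not an exact excerpt; details may differ):

Code:
/* In the aacraid service thread: sleep up to AAC_PERIODIC_INTERVAL
 * (20) seconds waiting for work. */
retval = msleep(sc->aifthread, &sc->aac_io_lock, PRIBIO,
    "aacraid_aifthd", AAC_PERIODIC_INTERVAL * hz);
...
/*
 * While we're here, check to see if any commands are stuck.
 * This is pretty low-priority, so it's ok if it doesn't
 * always fire.
 */
/* EWOULDBLOCK means the sleep timed out, i.e. nothing happened for
 * the whole interval -- so the check only runs when the card is idle. */
if (retval == EWOULDBLOCK)
    aac_timeout(sc);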
 

jgreco

Resident Grinch
To answer your question, at the time this was the only HBA card I could use.

That seems unlikely. The LSI HBA's have been around since the late oughties.

I was trying to save costs, and afaik this is the only 24-port SAS HBA you can buy, and the motherboard I am using has limited PCIe slots (when taking into account other things like NICs). Ironically, since I am using a 24-bay server chassis, it was also the recommended HBA to get.

I don't know whose recommendation that would have been.

The card is also much cheaper than any LSI card (which I know is what's recommended) if you need a 24-port HBA.

An LSI 9211-8i card can be had for $25-$30, and a 24-bay-capable SAS expander from $80-$145, so that's $110-$175 for the LSI solution.

I just checked yours on eBay, for about $200-$300.

To my mind, that puts the LSI solution as significantly cheaper, just a bit more than half the cost of the Adaptec...

Now it's fine if you've never worked with something like an SAS expander before. You don't actually need to put them in a PCI slot, and they can be mounted in a random available spot in the chassis. This is the normal way to get additional SAS support for HDD's. We have an SAS primer article if this is an area of unfamiliarity.

As I said before, we are using the card in direct HBA mode, so it's also important to distinguish between cases where people use the card in HBA mode (which you should absolutely do) and cases otherwise.

Yes, it is. However, note that as the author of the "RAID card" sticky, the reason I've written that is because there are sharp edges at many layers, and having observed people trying to use other random cards for more than a decade on these forums, the success rate is pretty low. It's actually quite difficult to make a determination that something actually works correctly, because (for example) you never really know when a drive freaking out in an unusual way will cause the controller to freak out or other badness to happen, and unless you actually run through all the possible weird edge cases, you can't really know it is truly stable. This is something you really only get with craptons of run-hours under the belt over thousands of installs, and even then, things like the reliability of the LSI HBA is only demonstrated real-world stability rather than an actual proof of correctness. So please don't make the mistake of thinking I don't understand what you're saying, I do, and I understand the difficulties involved from a number of angles.
 

mdedetrich

Dabbler
That seems unlikely. The LSI HBA's have been around since the late oughties.

To clarify, when I say I couldn't use other options, I mean that I am using a motherboard that has only a single PCIe slot available for an HBA (I am reusing existing hardware).

I don't know whose recommendation that would have been.

The manufacturer of the 24-bay 4U server chassis I bought

An LSI 9211-8i card can be had for $25-$30, and a 24-bay-capable SAS expander from $80-$145, so that's $110-$175 for the LSI solution.

I just checked yours on eBay, for about $200-$300.

Not where I am from: a SAS-9305-16i goes for 460 euros (see https://www.ebay.de/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=sas-9305-16i&_sacat=0).

An LSI 9211-8i typically goes for 120 euros (see https://www.ebay.de/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=LSI+9211-8i&_sacat=0), and that is without the expander.

To my mind, that puts the LSI solution as significantly cheaper, just a bit more than half the cost of the Adaptec...

I can get an Adaptec SAS-72405 for 150 euros on a good day (see https://www.ebay.de/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=sas-72405&_sacat=0), which is far cheaper than an LSI plus an expander such as a RES2SV240 (which also costs around 150 euros).

In America, sure, it can be cheaper, but here in Europe/Germany, LSI cards are much more expensive for whatever reason.

Now it's fine if you've never worked with something like an SAS expander before. You don't actually need to put them in a PCI slot, and they can be mounted in a random available spot in the chassis. This is the normal way to get additional SAS support for HDD's. We have an SAS primer article if this is an area of unfamiliarity.

Thanks, I didn't realize this (I was under the impression you had to actually mount the card in a PCIe slot), but in any case it's still more expensive and I would have to find some spot to cram the card into. There is also the bandwidth limit of 6 Gb/s; this is too low for the use case I am putting the drives to (we do need 12 Gbit or higher, as I am writing to the pool at 10 Gb/s with an Intel 10 GbE NIC).

This is something you really only get with craptons of run-hours under the belt over thousands of installs, and even then, things like the reliability of the LSI HBA is only demonstrated real-world stability rather than an actual proof of correctness. So please don't make the mistake of thinking I don't understand what you're saying, I do, and I understand the difficulties involved from a number of angles.

Sure, I mean at the end of the day from briefly skimming the code it appears that when you run the card in direct HBA mode its just passing through the SCSI commands directly to the controller (as it should be doing). Of course at this point you are relying on the firmware being correct.

I really cannot comment on correctness here, or on what your definition of it is. Before, you implied that correctness is real-world usage with a large cumulative number of hours demonstrating there aren't problems, but now you are talking about mathematical proof of correctness, a.k.a. formal verification, which in software engineering plus hardware is a completely different kettle of fish. Having dealt with correctness/formal reasoning in my profession, I can guarantee you that no driver, or FreeBSD itself for that matter, can claim mathematical correctness, especially if it's programmed in C; you have to use a very limited subset of C with a theorem prover, or just another language.

In any case, I agree that on the surface the LSI cards definitely are more stable, but I don't think it's accurate to say that this is due to some mathematical proof; rather, the cards have had so much real-world usage with the same firmware that, speaking in terms of probability, it's highly unlikely there are bugs left in the system.
 

Ericloewe

Server Wrangler
Moderator
There is also the bandwidth limit of 6 Gb/s;
6 Gb/s times 8 lanes for 48 Gb/s. Should be plenty for any realistic 10 GbE usage, no?
 

jgreco

Resident Grinch
The manufacturer of the 24-bay 4U server chassis I bought

Generally someone with a vested interest in your buying something they have available is not a good source for fair recommendations. Mmm.


So you're comparing a decade-old, discontinued Adaptec 6Gbps card to the pricing on a much newer, current-production LSI 12Gbps card?

How about a more fair comparison? See

https://www.ebay.de/itm/294300897375?epid=1764111100&hash=item4485b3505f:g:FI0AAOSw3ExhAWOG

I'm guessing that's 34 euro.

An LSI 9211-8i typically goes for 120 euros (see https://www.ebay.de/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=LSI+9211-8i&_sacat=0), and that is without the expander.

No it doesn't. A quarter of that. Just quoted above.

I can get an Adaptec SAS-72405 for 150 euros on a good day (see https://www.ebay.de/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=sas-72405&_sacat=0), which is far cheaper than an LSI plus an expander such as a RES2SV240 (which also costs around 150 euros).

And I can beat 150 euro easily with my first eBay searches.

https://www.ebay.de/itm/184956740817?hash=item2b1047e4d1:g:nKEAAOSwFv1g-B3u

92 euros plus 34 euros = 126 euros. I know you guys have all sorts of weird issues over there with shipping and VAT and stuff, so forgive me if I'm missing things that are obvious to you.

Thanks, I didn't realize this (I was under the impression you had to actually mount the card in a PCIe slot), but in any case it's still more expensive and I would have to find some spot to cram the card into. There is also the bandwidth limit of 6 Gb/s; this is too low for the use case I am putting the drives to (we do need 12 Gbit or higher, as I am writing to the pool at 10 Gb/s with an Intel 10 GbE NIC).

No, a 6Gbps SAS lane is 6Gbps per lane. An SFF8087 multilane cable has four lanes and handles 24Gbps of traffic. You can even team two of them up for 48Gbps.

So, look, I'm basically just trying to educate so that you know how to optimize next time.

1) Yes, the 24-port HBA is attractive from the point of view of being a single-card solution; it just has the downsides of being a poorly supported card, and your one fix may not be the only sharp edge.

2) The better solution would have been to get a backplane with a SAS expander built in. The upside is that you get the simplest cabling option: a single (or double) SFF8087 from the HBA to the backplane.

3) You've engineered yourself into a hard spot by having only a single PCIe slot available and needing to drive 24 SAS lanes from it. A 24 port SAS controller is certainly the attractive fix.

Sure, I mean at the end of the day from briefly skimming the code it appears that when you run the card in direct HBA mode its just passing through the SCSI commands directly to the controller (as it should be doing). Of course at this point you are relying on the firmware being correct.

Well, and also on the driver being correct. The patch suggested above comments out a bit of the driver, and is a reasonable example.

Having been dealing with this for years: basically, the problem with all HBA/RAID cards is that you run into cases where the host driver and card firmware manage to get out of sync somehow.

The experience with FreeBSD and FreeNAS is that even the LSIs can suffer from this, which is why it is necessary to flash a very particular version of the controller firmware that exactly matches what the host driver expects. Such events are probably LESS likely when the things the card firmware does for HBA duty are simpler; we know for certain that many RAID cards operating in full RAID mode can fail in spectacular ways for a variety of underlying issues, but this seems to be mostly a matter of degree of complexity.

I really cannot comment on correctness here, or on what your definition of it is. Before, you implied that correctness is real-world usage with a large cumulative number of hours demonstrating there aren't problems, but now you are talking about mathematical proof of correctness, a.k.a. formal verification, which in software engineering plus hardware is a completely different kettle of fish. Having dealt with correctness/formal reasoning in my profession, I can guarantee you that no driver, or FreeBSD itself for that matter, can claim mathematical correctness, especially if it's programmed in C; you have to use a very limited subset of C with a theorem prover, or just another language.

I hail from the medical electronics world, where there's a pragmatic mix of ways to validate gear.

The problem with RAID controllers (or the lobotomized RAID controllers known as HBAs) is that we're not privy to the source code for the firmware, and this basically means that you need to focus on the results.

For me, this means that I trust a card with demonstrated billions of aggregate run-hours a lot more than I trust the Adaptec card of the guy who is coming in with complaints of how it isn't stable in an otherwise problem-free hardware platform.

Of course, this does not guarantee that the LSI is correct. It's the classic de Havilland Comet problem. :smile:
 

mdedetrich

Dabbler
6 Gb/s times 8 lanes for 48 Gb/s. Should be plenty for any realistic 10 GbE usage, no?
Ah, I thought it was 6 Gb/s for the whole controller, not for each PCIe lane; in that case it's fine then


Generally someone with a vested interest in your buying something they have available is not a good source for fair recommendations. Mmm.

To be fair, they were expecting you to use the card for RAID rather than as an HBA with ZFS


So you're comparing a decade-old, discontinued Adaptec 6Gbps card to the pricing on a much newer, current-production LSI 12Gbps card?

How about a more fair comparison? See

https://www.ebay.de/itm/294300897375?epid=1764111100&hash=item4485b3505f:g:FI0AAOSw3ExhAWOG

I'm guessing that's 34 euro.



No it doesn't. A quarter of that. Just quoted above.



And I can beat 150 euro easily with my first eBay searches.

https://www.ebay.de/itm/184956740817?hash=item2b1047e4d1:g:nKEAAOSwFv1g-B3u

92 euros plus 34 euros = 126 euros. I know you guys have all sorts of weird issues over there with shipping and VAT and stuff, so forgive me if I'm missing things that are obvious to you.

The main issue is the sellers; we have gotten bad/weird cards before. You can get stuff really cheap, but the results you posted don't look like the highest quality. We also have time constraints, so we actually found a local seller (via eBay) from whom we can get the cards pretty much instantly

No, a 6Gbps SAS lane is 6Gbps per lane. An SFF8087 multilane cable has four lanes and handles 24Gbps of traffic. You can even team two of them up for 48Gbps.

So, look, I'm basically just trying to educate so that you know how to optimize next time.

Thanks for telling me, I will know better next time. Didn't realize it was 8 Gb/s per PCIe lane

1) Yes, the 24-port HBA is attractive from the point of view of being a single-card solution; it just has the downsides of being a poorly supported card, and your one fix may not be the only sharp edge.

We will see, I guess; so far the only issue has been this idle timeout one

2) The better solution would have been to get a backplane with a SAS expander built in. The upside is that you get the simplest cabling option: a single (or double) SFF8087 from the HBA to the backplane.

The 4U chassis I got didn't come with an expander backplane; instead it just has 6 SFF-8087 connectors on the backplane (which is why the manufacturer recommended the Adaptec card)

3) You've engineered yourself into a hard spot by having only a single PCIe slot available and needing to drive 24 SAS lanes from it. A 24 port SAS controller is certainly the attractive fix.

Again, agreed, but it is what it is



Well, and also on the driver being correct. The patch suggested above comments out a bit of the driver, and is a reasonable example.

Having been dealing with this for years: basically, the problem with all HBA/RAID cards is that you run into cases where the host driver and card firmware manage to get out of sync somehow.

The experience with FreeBSD and FreeNAS is that even the LSIs can suffer from this, which is why it is necessary to flash a very particular version of the controller firmware that exactly matches what the host driver expects. Such events are probably LESS likely when the things the card firmware does for HBA duty are simpler; we know for certain that many RAID cards operating in full RAID mode can fail in spectacular ways for a variety of underlying issues, but this seems to be mostly a matter of degree of complexity.

Yup. All I can say is that with this card you can disable RAID mode completely, so that should reduce the set of cases you have to handle.



I hail from the medical electronics world, where there's a pragmatic mix of ways to validate gear.

The problem with RAID controllers (or the lobotomized RAID controllers known as HBAs) is that we're not privy to the source code for the firmware, and this basically means that you need to focus on the results.

For me, this means that I trust a card with demonstrated billions of aggregate run-hours a lot more than I trust the Adaptec card of the guy who is coming in with complaints of how it isn't stable in an otherwise problem-free hardware platform.

Of course, this does not guarantee that the LSI is correct. It's the classic de Havilland Comet problem. :smile:

Agreed! I mean, I did do some research specifically on the Adaptec card with FreeBSD, and although some people said it's not completely supported, they also said that as long as you put it in HBA mode nobody had any problems, apart from this specific one.

Unfortunately I didn't know that with an expander you don't have to plug it into PCIe (this is my first time doing homelab server systems with commodity hardware), but at least I know better for next time. I have no idea what kind of system design gives you a PCI card that doesn't actually need to be slotted into PCIe, but ¯\_(ツ)_/¯


Thanks for spending the time posting the info! If anything has come out of this, at least we may have made the aacraid driver more stable, which is an improvement over nothing.
 

Ericloewe

Server Wrangler
Moderator
Didn't realize it was 8 Gb/s per PCIe lane
Not a PCIe lane, a SAS lane, of which most controllers have 8. Technically, for the PCIe 2.0 cards, you're bottlenecked slightly by the PCIe interface (8 × ~4 Gb/s), but it's still plenty for most cases. Nothing a PCIe 3.0 card won't fix, of course.
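
Spelling out the arithmetic (rough numbers; PCIe 2.0 is ~4 Gb/s per lane after 8b/10b encoding overhead):

Code:
SAS side:   8 lanes × 6 Gb/s             = 48 Gb/s
PCIe side:  8 lanes × ~4 Gb/s (PCIe 2.0) = ~32 Gb/s  <- the slight bottleneck
10 GbE:     10 Gb/s                      -> plenty of headroom either way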

By the way, forum tip: To quote someone, highlight the text and a "Reply" button will show up and allow for easy quoting.
 

bsdimp

Cadet
Can someone with this problem try this patch and report back the results?
It comments out the completion of busy commands when resetting the controller, since the posted screen dump suggests that after we do that, the controller sends completions back for those commands. It also speeds up the checks from every 20s of idleness to every second of idleness. That may be too aggressive, but the TIMEOUT that was reported took almost an hour to fire.

Warner

P.S. No clue if it will attach here, because the forum didn't like aacraid.diff, so I uploaded it here (as a .txt) and to my project http area.
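
To sketch the second half of that change (illustrative only; the attached aacraid.txt is the authoritative diff, and the macro name here follows the stock driver headers):

Code:
/* Check for stuck commands after every second of idleness instead of
 * every 20 seconds, so a timed-out command is noticed promptly rather
 * than potentially an hour later. */
#define AAC_PERIODIC_INTERVAL	1	/* seconds; was 20 */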
 

Attachments

  • aacraid.txt (1.1 KB)