Idle reboot bug: aacraid0: COMMAND 0xfffffe00007bdfa0 TIMEOUT AFTER 3857 SECONDS, shutting down controller...done - RELOADED

Joined
Aug 9, 2020
Messages
6
Greetings iXsystems Community,

This is an update to my original post over at: https://www.truenas.com/community/t...-seconds-shutting-down-controller-done.86761/

First, I would like to thank bsdimp for investing time and effort into attempting to resolve this issue properly. My apologies for not following up sooner and testing bsdimp’s patch.

For those interested, I’ve been running my original patched version since 2020 on a number of different systems and have not had any issues. One such system is running a 48x8TB HDD RAID-Z3 array, and is under very heavy load for days or weeks at a time.

Since https://github.com/openzfs/zfs/issues/15526, I’ve been keeping an eye out for a patched version of TrueNAS Core. Now that 13.0-U6.1 has been released to fix the aforementioned bug, I thought it would be timely to compile some new drivers with the “idle reboot bug” fix.

Attached are three compiled versions accompanied by their source code and diffs:

3.2-10-131001: My patch, as per the previous post, where we simply don’t reset the controller and pretend like everything is fine

3.2-10-131002: bsdimp’s version of the patch, with driver version updated to “131002”

3.2-10-131003: bsdimp’s version, but I left the timeout check at the default 20 seconds and updated the driver version to “131003”

Installing the driver is straight-forward:
  1. check current driver version: arcconf getconfig 1 ad|grep Driver
  2. copy desired aacraid.ko to /boot/kernel/
  3. add the following /boot/loader.conf:
  4. aacraid_load="YES"
  5. reboot
  6. verify the new driver is loaded: arcconf getconfig 1 ad|grep Driver

The old driver should report something like:

Driver : 3.2-10 (1)

The new driver should report something like:

Driver : 3.2-10 (131003)


Note:
  1. All of the standard "use at your risk" and "ymmv" disclaimers apply if you choose to use the attached drivers.
  2. Driver source came from FreeBSD 13.1-RELEASE
  3. All drivers were compiled on FreeBSD 13.1-RELEASE
  4. I've not confirmed if the idle timeout bug occurs in TrueNAS Core 13.x with the in-box driver since I don't have any in production yet. I've created these drivers as 'safety net' fixes, but perhaps someone could confirm if they have seen this issue in modern versions of TureNAS Core.

Diffs for quick reference:

3.2-10-131001:

aacraid.c.diff


2188c2188,2193
< aac_reset_adapter(sc);
---
> /*
> * Resetting the adapter causes FreeNAS 11.x/12.x to panic with a "command not in queue" error.
> * Let's not reset the adapter and carry on as if nothing happened.
> * This is the command we are leaving out:
> * aac_reset_adapter(sc);
> */

aacraid_var.h.diff

54c54
< # define AAC_DRIVER_BUILD 1
---
> # define AAC_DRIVER_BUILD 131001


3.2-10-131002:

aacraid.c.diff


3799a3800
> #if 0 /* Controller seems to return these at least some of the time after we reset */
3817c3818
<
---
> #endif

aacraid_var.h.diff

54c54
< # define AAC_DRIVER_BUILD 1
---
> # define AAC_DRIVER_BUILD 131002
115c115
< #define AAC_PERIODIC_INTERVAL20/* seconds */
---
> #define AAC_PERIODIC_INTERVAL1/* seconds */

3.2-10-131003:

aacraid.c.diff


3799a3800
> #if 0 /* Controller seems to return these at least some of the time after we reset */
3817c3818
<
---
> #endif

aacraid_var.h.diff

54c54
< # define AAC_DRIVER_BUILD 1
---
> # define AAC_DRIVER_BUILD 131003

Cheers,
Greg

PS: I just noticed that my comment in my patch references "FreeNAS 11.x/12.x." Perhaps I'll remember to update this on the next run.
 

Attachments

  • 13.1 aacraid.zip
    435.6 KB · Views: 50
Joined
Aug 9, 2020
Messages
6
I've completed some testing with the new drivers in 13.0U-6.1 with the following results:

3.2-10-131001: Stable. Same behaviour as in 12.x and 11.x systems.

3.2-10-131002: Unstable. Once the controller is reset, the driver keeps resetting the controller in an unending cycle. Eventually the controller fails to reset. No IO seems to be possible after the first controller reset. Some examples of what shows up on the console/in the logs:

Code:
aacraid0: COMMAND 0xfffffe003a7bfde0 TIMEOUT AFTER 5728 SECONDS
aacraid0: shutting down controller...done
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: New comm. interface type2 enabled
aacraid0: Power Management enabled
aacraid0: COMMAND 0xfffffe003a7c0740 TIMEOUT AFTER 11307 SECONDS
aacraid0: shutting down controller...done
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: New comm. interface type2 enabled
aacraid0: Power Management enabled
aacraid0: COMMAND 0xfffffe003a7c1190 TIMEOUT AFTER 5644 SECONDS
aacraid0: shutting down controller...done
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: New comm. interface type2 enabled
aacraid0: Power Management enabled
...


Code:
aacraid0: COMMAND 0xfffffe003a7b3cc0 TIMEOUT AFTER 8239 SECONDS
aacraid0: shutting down controller...done
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: New comm. interface type2 enabled
aacraid0: Power Management enabled
aacraid0: COMMAND 0xfffffe003a7b3cc0 TIMEOUT AFTER 8464 SECONDS
aacraid0: FATAL: controller not coming ready, status bcc20100
aacraid0: COMMAND 0xfffffe003a7b3cc0 TIMEOUT AFTER 8851 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3db0 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3bd0 TIMEOUT AFTER 407 SECONDS
aacraid0: FATAL: controller not coming ready, status bcc20100
aacraid0: COMMAND 0xfffffe003a7b3cc0 TIMEOUT AFTER 9238 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3db0 TIMEOUT AFTER 794 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3bd0 TIMEOUT AFTER 794 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3720 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3d38 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b4008 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b40f8 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b4170 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3f90 TIMEOUT AFTER 407 SECONDS
aacraid0: COMMAND 0xfffffe003a7b3e28 TIMEOUT AFTER 407 SECONDS
...


Code:
Dec 11 04:48:05 freenas aacraid0: COMMAND 0xfffffe0000fb3b10 TIMEOUT AFTER 6180 SECONDS
Dec 11 04:48:06 freenas aacraid0: shutting down controller...done
Dec 11 04:48:30 freenas aacraid0: Enable Raw I/O
Dec 11 04:48:30 freenas aacraid0: Enable 64-bit array
Dec 11 04:48:30 freenas aacraid0: New comm. interface type2 enabled
Dec 11 04:48:30 freenas aacraid0: Power Management enabled
Dec 11 04:48:40 freenas aacraid0: COMMAND 0xfffffe0000fb3de0 TIMEOUT AFTER 6215 SECONDS
Dec 11 04:48:56 freenas aacraid0: shutting down controller...done
Dec 11 04:49:19 freenas aacraid0: Enable Raw I/O
Dec 11 04:49:19 freenas aacraid0: Enable 64-bit array
Dec 11 04:49:19 freenas aacraid0: New comm. interface type2 enabled
Dec 11 04:49:19 freenas aacraid0: Power Management enabled
Dec 11 04:49:29 freenas aacraid0: COMMAND 0xfffffe0000fb3de0 TIMEOUT AFTER 6263 SECONDS
Dec 11 04:49:45 freenas aacraid0: shutting down controller...done
Dec 11 04:50:07 freenas aacraid0: Enable Raw I/O
Dec 11 04:50:07 freenas aacraid0: Enable 64-bit array
Dec 11 04:50:07 freenas aacraid0: New comm. interface type2 enabled
Dec 11 04:50:07 freenas aacraid0: Power Management enabled
...



3.2-10-131003: Unstable. Same as 3.2-10-131002.

While bsdimp's version of the patch prevents a kernel panic, it can leave the system in an unusable state, requiring a hard reset to recover.

Cheers,
Greg
 
Joined
Aug 9, 2020
Messages
6
And I forgot to add these events:

Code:
Dec 11 06:51:26 freenas aacraid0: COMMAND 0xfffffe0000fb3de0 TIMEOUT AFTER 13580 SECONDS
Dec 11 06:51:26 freenas aacraid0: COMMAND 0xfffffe0000fbbdd8 TIMEOUT AFTER 775 SECONDS
Dec 11 06:54:26 freenas aacraid0: shutting down controller...FAILED.


Cheers,
Greg
 
Top