After reboot, HBA keeps getting reset and pool becomes unavailable

morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
Hello.

I have a serious problem, and I could not find a similar case on the forum.

When I create a new pool and reboot the server, something during the boot process causes an HBA reset, and because of the reset my data pool becomes unavailable.

[screenshot: failedpool.PNG]


I checked the logs and saw that the md RAID also fails during boot:

[screenshot: failereason.PNG]



I created a ZFS pool manually and rebooted; everything was fine. (No auto-import activated.)
Is there any way to deactivate swap on data pools?
What should I do?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I'm not sure why you're blaming the swap partition... if the disk goes offline, of course the swap on that disk is also taken offline.

With that many disks offline, it's too many for swap to be involved anyway (10 maximum, I think).

I suggest it may be the firmware of the HBA needing an update... what hardware and firmware is it?
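
If it helps, something like this should report the card model and firmware version (a rough sketch; sas3flash has to be installed separately, and the dmesg filter just matches what the mpt3sas driver normally logs at boot):

Code:
# report adapter, firmware and BIOS versions via the Broadcom/LSI tool
sas3flash -list

# or pull what the driver logged during boot
dmesg | grep -iE 'mpt3sas|fwversion'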
 

morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
I'm not sure why you're blaming the swap partition... if the disk goes offline, of course the swap on that disk is also taken offline.

With that many disks offline, it's too many for swap to be involved anyway (10 maximum, I think).

I suggest it may be the firmware of the HBA needing an update... what hardware and firmware is it?

Then why don't I have any problem if I create the pool manually?
I checked the dmesg logs from the start of the TrueNAS boot sequence, and the first thing that triggers the HBA task aborts looks like the md mounts.
Maybe it's not and I'm mistaken, but with a manually created pool I have no problem.
I also have no problem if I create the pool from the dashboard and do not reboot the server.
I heavily used the pool with randrw workloads and did not see any problem.

That's why I don't think this is related to the HBA card or its firmware, but I will check the version and get back to you.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Then why don't I have any problem if I create the pool manually?
I don't have enough information about your system to tell you what's different other than the swap partitions not being created (which you can put a stop to if you really want by setting that number from 2 to 0 using the API call... midclt call system.advanced.update '{"swapondrive": 0}')
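
For example (a sketch of the middleware calls; I believe the setting only affects pools created after the change, and jq is just used here to pick out the field):

Code:
# check the current per-disk swap size in GiB (default is 2)
midclt call system.advanced.config | jq .swapondrive

# set it to 0 so newly created pools get no swap partition
midclt call system.advanced.update '{"swapondrive": 0}'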
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
With no actual system information all we can do is guess.
I'm thinking maybe a power issue?
Does this still happen if you import the pool with by-id instead of device names?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Please list your HBA, brand & model.

Some disk expansion cards may have staggered disk spin up. If it takes too long, TrueNAS' pool import may occur before all the drives are ready. We need the brand & model of your HBAs.


As for why you did not have problems on a manually created pool, did you make the exact same pool configuration, with the exact same disks, without the swap, from the command line?

Comparing apples to oranges does not work in all cases...
 

morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
I don't have an external JBOD. My HBA cards are internal.

Server: 2 x Cisco UCS S3260 (two separate servers, same config, same result on both)

HBA: 2x UCS S3260 Dual Pass Through Controller based on Broadcom 3316 ROC - UCS-S3260-DHBA
mpt3sas_cm0: LSISAS3316: FWVersion(13.00.08.00), ChipRevision(0x01), BiosVersion(15.00.06.00)
mpt3sas_cm1: LSISAS3316: FWVersion(13.00.08.00), ChipRevision(0x01), BiosVersion(15.00.06.00)

Code:
admin@st01[~]$ lsscsi -s
[0:0:0:0]    enclosu Cisco    C3260            2     -               -
[0:0:1:0]    enclosu Cisco    C3260            2     -               -
[1:0:0:0]    disk    ATA      Samsung SSD 870  2B6Q  /dev/sda    500GB (OS  drive from different sata controller)
[2:0:0:0]    disk    ATA      Samsung SSD 870  2B6Q  /dev/sdb    500GB (OS  drive from different sata controller)
[5:0:0:0]    enclosu AHCI     SGPIO Enclosure  2.00  -               -
[6:0:0:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdc   4.00TB
[6:0:1:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdd   4.00TB
[6:0:2:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sde   4.00TB
[6:0:3:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdf   4.00TB
[6:0:4:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdg   4.00TB
[6:0:5:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdh   4.00TB
[6:0:6:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdi   4.00TB
[6:0:7:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdj   4.00TB
[6:0:8:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdk   4.00TB
[6:0:9:0]    disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdl   4.00TB
[6:0:10:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdm   4.00TB
[6:0:11:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdn   4.00TB
[6:0:12:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdo   4.00TB
[6:0:13:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdp   4.00TB
[6:0:14:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdq   4.00TB
[6:0:15:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdr   4.00TB
[6:0:16:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sds   4.00TB
[6:0:17:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdt   4.00TB
[6:0:18:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdu   4.00TB
[6:0:19:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdv   4.00TB
[6:0:20:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdw   4.00TB
[6:0:21:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdx   4.00TB
[6:0:22:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdy   4.00TB
[6:0:23:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdz   4.00TB
[6:0:24:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaa  4.00TB
[6:0:25:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdab  4.00TB
[6:0:26:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdac  4.00TB
[6:0:27:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdad  4.00TB
[6:0:28:0]   enclosu Cisco    C3260            2     -               -
[6:0:29:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdae  4.00TB
[6:0:30:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaf  4.00TB
[6:0:31:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdag  4.00TB
[6:0:32:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdah  4.00TB
[6:0:33:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdai  4.00TB
[6:0:34:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaj  4.00TB
[6:0:35:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdak  4.00TB
[6:0:36:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdal  4.00TB
[6:0:37:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdam  4.00TB
[6:0:38:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdan  4.00TB
[6:0:39:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdao  4.00TB
[6:0:40:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdap  4.00TB
[6:0:41:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaq  4.00TB
[6:0:42:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdar  4.00TB
[6:0:43:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdas  4.00TB
[6:0:44:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdat  4.00TB
[6:0:45:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdau  4.00TB
[6:0:46:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdav  4.00TB
[6:0:47:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaw  4.00TB
[6:0:48:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdax  4.00TB
[6:0:49:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sday  4.00TB
[6:0:50:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdaz  4.00TB
[6:0:51:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdba  4.00TB
[6:0:52:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdbb  4.00TB
[6:0:53:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdbc  4.00TB
[6:0:54:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdbd  4.00TB
[6:0:55:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdbe  4.00TB
[6:0:56:0]   disk    TOSHIBA  MG04SCA40EN      5705  /dev/sdbf  4.00TB



I will share the dmesg output in my next reply.
I'm planning to share three outputs:
1- Create a new pool from the dashboard, then run:
zpool status
dmesg -T > dmesg-output-after-creating-pool-from-dashboard.txt
2- Reboot the server, get into the shell, wait for the crash, then run:
zpool status
dmesg -T > dmesg-output-after-reboot-server.txt
3- Reboot the server, boot with the initial-install selection at the boot menu, then run:
zpool import $poolname
zpool status
dmesg -T > dmesg-output-after-initial-install-boot.txt

Do you need anything else?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
TrueNAS & ZFS do not support hardware RAID controllers:
Most of the time when such a controller is replaced with a plain HBA, the problems go away.
 

morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
TrueNAS & ZFS do not support hardware RAID controllers:
Most of the time when such a controller is replaced with a plain HBA, the problems go away.

You are mistaken; my HBA card is in IT mode, so it is not a RAID card.
I have played around with HBA cards before and upgraded their firmware with sasXflash or ircu.
My firmware is old, I accept that, but I have used this version before and had no problem.

As I said before, I don't have any problem with these scenarios:
1- I can create any pool without any problem via the TrueNAS SCALE dashboard.
2- I wrote 10 TB with 10 parallel fio workers, had no problem, and did not get any task abort or timeout response from mpt3sas.
3- When I reboot the server, the problem starts!
- After the Linux kernel boot sequence, the TrueNAS services start.
- First TrueNAS tries to mount the md swap partition, and it is successful. No problem...
- After or before that (I'm not sure yet), TrueNAS imports the ZFS pool.
- A few seconds later the md swap mount fails, and TrueNAS or mdadm starts a check while the drives are being used by mdadm (I think this is the cause of the timeouts), because I start to see mpt3sas task abort logs.
- After too many mpt3sas task aborts due to drive timeouts, the LSI firmware decides to reset the HBA card and we lose all of the drives for 5-10 seconds.
- ZFS suspends the pool as expected and does not continue automatically. A lot of devices become unavailable to the pool even though they still exist in the kernel.
- After the HBA reset, when I type zpool clear, I get all the drives back but start to see the task aborts again; a few seconds later the HBA resets again. This turns into an endless loop. I cannot clear the pool, and I cannot destroy or export it.
- A few minutes later the Linux kernel crashes and does not respond to new processes, but I can still use the shell via KVM. (I'm not sure about the crash; I have to check the dmesg output one more time.)

At this point, I reset the server via IPMI and chose the second option in the TrueNAS boot menu (initial install, if I remember correctly):
1- The kernel starts without any issue, task abort, or other problem.
2- There is no mdadm service and no automatic md mount on the initial-install image.
3- There is no auto pool import.
4- When I run zpool import $poolname, the import is successful and I don't see any task abort or timeout.
5- The pool is healthy, the test data is 100% correct, and I'm able to write data and saturate the IOPS without any issue.

So I'm 100% sure the root cause is something in the TrueNAS boot sequence.
I have not checked the sequence yet because I was busy. Tomorrow I'm going to find it and cut its head off.
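
A rough sketch of what I plan to run to pin down the ordering (assuming the usual systemd/mdadm tooling is present on SCALE):

Code:
# is an md resync of the swap mirrors running when the task aborts start?
cat /proc/mdstat

# boot-time ordering of md assembly, swap setup, pool import and HBA errors
journalctl -b | grep -iE 'md[0-9]|swap|mpt3sas|zfs' > boot-sequence.txt

# correlate with the driver messages
dmesg -T | grep -iE 'mpt3sas|task abort|reset' > mpt3sas-events.txt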

I really wanted to use TrueNAS CORE for better stability, but the driver for my network card does not exist there.
So I'm now thinking about which choice will be better for me in this case:
1- Fix the SCALE HBA reset issue.
2- Do not trust SCALE if it has these kinds of simple and illogical problems; use TrueNAS CORE and port in the network driver.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You are mistaken; my HBA card is in IT mode, so it is not a RAID card.
Did you flash it with IT firmware? If you didn't, it is you who is mistaken. A RAID controller in "IT mode" is still unsupported.
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Perhaps the output of sas3flash -list can clarify the current status of your cards?

Edit: And you really should import the pool using IDs instead of device names, since device names can change on every single boot.
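
For example, something along these lines (a sketch; substitute your actual pool name):

Code:
# export first if the pool is currently imported with /dev/sdX names
zpool export poolname

# re-import using stable identifiers
zpool import -d /dev/disk/by-id poolname
# or, to match what the SCALE GUI uses:
zpool import -d /dev/disk/by-partuuid poolname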
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It is possible that I am wrong.

And it is possible on some LSI RAID cards to flash IT firmware. I don't know about your particular ones.

As @neofusion requests, we can confirm true IT firmware with the output of the command.
 

morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
It is possible that I am wrong.

And it is possible on some LSI RAID cards to flash IT firmware. I don't know about your particular ones.

As @neofusion requests, we can confirm true IT firmware with the output of the command.

I understand your concern, and after checking my LSI card I started to feel uncomfortable because it is an OEM card, so one way or another it is better to have new, original firmware.
But from my experience: I had a problem changing the firmware on a Dell OEM card, and the sasflash tool didn't let me upgrade.
Of course there are other ways, but there is also a possibility of failure in this game.

I decided to change the firmware. I found this, and it looks like the card is a 9361-16i.

[screenshot: 1688791769125.png]

I searched for the firmware as "9361-16i" and "SAS3316" but couldn't find it. I think I'm very tired; I should sleep and try again tomorrow :)



Did you flash it with IT firmware? If you didn't, it is you who is mistaken. A RAID controller in "IT mode" is still unsupported.

I did not flash it; it is the default firmware on a brand-new card.

My storage server: UCS S3260, firmware version 4.2(3a) (Secure).
I only have one compute unit, and the unit's BIOS version is S3X60M5.4.2.3b.0.1016222320.

The HBA is the internal built-in default card:
UCS S3260 Dual Pass Through Controller based on Broadcom 3316 ROC
mpt3sas_cm0: LSISAS3316: FWVersion(13.00.08.00), ChipRevision(0x01), BiosVersion(15.00.06.00)

As you can see in the screenshots below, the card firmware is IT.
CARD 1:
[screenshot: 1688771921873.png]

CARD 2:
[screenshot: 1688771801933.png]


To see the HBA cards in their BIOS, I have to enable UEFI.
I use legacy boot because TrueNAS SCALE has a boot-stuck issue with UEFI.

I'm starting to understand your concerns. I stopped dealing with HBAs 3-4 years ago, and I always used standard LSI cards.
Those cards could not carry both IR and IT modes and switch between them, so if a card was IT, it was IT.
But since you insisted, I checked the HBA vendor's website:
This fourth-generation SAS RAID-on-Chip (ROC), based on its Fusion-MPT architecture, integrates the latest enhancements in SAS and PCIe technology. For OEMs and server vendors looking to build enterprise-class data protection and throughput into their entry to mid-range servers, the SAS 3316 ROC is an ideal, cost-effective choice.

After reading this I decided to flash new firmware.
Do you have a recommended version and a link, please?
 


morphin

Dabbler
Joined
Jun 27, 2023
Messages
31
Perhaps the output of sas3flash -list can clarify the current status of your cards?

Hi neofusion, the output:

[screenshot: 1688786522169.png]


[screenshot: 1688786928344.png]

Edit: And you really should import the pool using IDs instead of device names, since device names can change on every single boot.

I don't want to import manually, so it is not my concern. I imported the pool manually only for testing, when booting with the second option in the TrueNAS GRUB boot menu (initial install), so it will not hurt and does not matter for testing. And it is working as expected.
The problem is that a TrueNAS backend service causes the HBA reset during the boot sequence. :)

Don't worry about this type of thing; I have a lot of experience with ZFS and Linux.
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Hi neofusion, the output:

[screenshots attached]

I don't want to import manually, so it is not my concern. I imported the pool manually only for testing, when booting with the second option in the TrueNAS GRUB boot menu (initial install), so it will not hurt and does not matter for testing. And it is working as expected.
The problem is that a TrueNAS backend service causes the HBA reset during the boot sequence. :)

Don't worry about this type of thing; I have a lot of experience with ZFS and Linux.
Thank you.
It's just that the GUI shouldn't have imported the pool with device names like /dev/sdc; compare to this example that I did using the GUI:
Code:
root@tnas[~]# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:39:52 with 0 errors on Mon Jun 19 07:39:53 2023
config:

    NAME                                      STATE     READ WRITE CKSUM
    tank                                      ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        cd4db217-58f9-45d3-8b28-f18a88940b0e  ONLINE       0     0     0
        e892aeae-d461-4efd-a482-ec1efee4c125  ONLINE       0     0     0

errors: No known data errors

Take note of how ZFS refers to the devices by their IDs instead of device names.

That suggests that you have imported the pool manually in a less than ideal way, using device names.
It's probably not related to your other issues but it might make troubleshooting harder in the future if you need to find a specific disk.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
For testing purposes I don't think it matters a whit if imported with default names or -d /dev/disk/by-partuuid. Import is import. Good idea to export and then import using the webui after testing, or at least using -d /dev/disk/by-partuuid. Not that I like partuuids, they are a PITA to cross reference, but at least they are stable unlike normal disk names.

Manually creating the pools often means manually partitioning the disks.

You don't say whether you have disabled the swap partition slicing that Scale does on disks when creating pools, chopping by default 2GB off the disks.

That 2GB is then mirrored 2x or 3x to create md devices, and an encrypted device-mapper volume is created over the top and added as a swap device. These swap devices have to sync up on every boot, causing an IO/CPU spike, possibly enough to interrupt pool import by using up most of the IOPS and a chunk of CPU for the crypto.

Manual pool create, no swap partitions, so TNS systemd swap service doesn't create and sync crypto swap devices.
Webui pool create, swap partitions, systemd service creates and syncs crypto swap devices. IO/CPU spike.

Possible mechanism.

You could try turning off the swap partition slicing (IIRC system->swap, and there is a setting on the right somewhere). Then use the webui to create a pool and see what happens on import: whether it behaves more like a manual pool, or you get the same problem, in which case this is probably not the issue.
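
Something like this should show whether those swap mirrors exist and are resyncing after a reboot (just a sketch, assuming the standard Linux tools on SCALE):

Code:
# are md mirrors built from the 2GB slices, and are they resyncing?
cat /proc/mdstat

# is encrypted swap active on top of them?
swapon --show

# did the pool disks actually get a swap partition?
lsblk -o NAME,SIZE,PARTTYPENAME /dev/sdc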
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
For testing purposes I don't think it matters a whit if imported with default names or -d /dev/disk/by-partuuid. Import is import. Good idea to export and then import using the webui after testing, or at least using -d /dev/disk/by-partuuid. Not that I like partuuids, they are a PITA to cross reference, but at least they are stable unlike normal disk names.

Manually creating the pools often means manually partitioning the disks.

You don't say whether you have disabled the swap partition slicing that Scale does on disks when creating pools, chopping by default 2GB off the disks.

That 2GB is then mirrored 2x or 3x to create md devices, and an encrypted device-mapper volume is created over the top and added as a swap device. These swap devices have to sync up on every boot, causing an IO/CPU spike, possibly enough to interrupt pool import by using up most of the IOPS and a chunk of CPU for the crypto.

Manual pool create, no swap partitions, so TNS systemd swap service doesn't create and sync crypto swap devices.
Webui pool create, swap partitions, systemd service creates and syncs crypto swap devices. IO/CPU spike.

Possible mechanism.

You could try turning off the swap partition slicing (IIRC system->swap, and there is a setting on the right somewhere). Then use the webui to create a pool and see what happens on import: whether it behaves more like a manual pool, or you get the same problem, in which case this is probably not the issue.
I would use the GUI or, failing that, add the -d /dev/disk/by-id flag, because that's at least generated based on the hardware serial number and is persistent. There are cases where the partuuid could change (if using MBR).

Edit: Could you elaborate on why you would need to partition disks manually when making the pool?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Thank you for the clarification on the SAS controllers. I agree, they look like IT mode, (aka not IR or full RAID mode).

It is possible that you need a different version of firmware. TrueNAS has had a history of preferring specific versions of firmware on LSI SAS HBAs for reliable operation. I don't follow this item, and newer 12Gbps SAS cards are less well covered here in the forums. Sorry I can't help on this issue.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I would use the GUI or, failing that, add the -d /dev/disk/by-id flag, because that's at least generated based on the hardware serial number and is persistent. There are cases where the partuuid could change (if using MBR).

Edit: Could you elaborate on why you would need to partition disks manually when making the pool?
I also generally prefer by-partlabel, by-id, or by-vdev; however, the TNS webui imports using by-partuuid and partitions using GPT, so it's not important enough to bother fighting the tide. Note also that he was just importing for testing, so it matters even less.

Manual vs. system partitioning comes in if you just want to partition the disks yourself and ensure that there is no swap partition, rather than updating the webui swap size to 0 as I mentioned above and then using the webui to partition the disks and hoping it omits the swap. Ultimately it should not matter, but it can be useful for testing too: manual partitioning means you know what the partition layout is, whereas with the webui I would have to go and check again to make sure it was what I expected. Manual pool creation can also be required if you want to tweak the pool config differently from what TNS does on creation; it's easier to create and export manually, then import via the webui. As I said above, I still don't know that swap and swap partitions are the problem, but I was talking to someone else on Discord who had a huge IO/CPU spike on startup and an import failure; that could be the same person, of course.

Generally I like the extra bite that TN takes out of the disks as it is a way, per the documentation, to adjust disk sizes when there is a slight mismatch. I'm not sure if just changing the partition type from 8200 to something else would be sufficient for the systemd service not to auto setup swap on the devices.
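
For reference, a manual layout without the swap slice might look roughly like this (a sketch using sgdisk with the conventional bf01 ZFS partition type; /dev/sdX and the pool layout are placeholders, and zapping is destructive):

Code:
# wipe the old partition table (destroys all data on the disk!)
sgdisk --zap-all /dev/sdX

# one partition spanning the whole disk, typed for ZFS, no 8200 swap slice
sgdisk -n1:0:0 -t1:bf01 -c1:zfs-data /dev/sdX

# then build the pool on the partitions and let the webui import it afterwards
zpool create -f tank raidz2 /dev/sdX1 /dev/sdY1 /dev/sdZ1 /dev/sdW1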
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I'm not sure if just changing the partition type from 8200 to something else would be sufficient for the systemd service not to auto setup swap on the devices.
It's not systemd but the TrueNAS middleware that sets up swap at boot time. It goes a long way to try to establish a redundancy level for swap identical to that of the ZFS pool. E.g., if you have a single RAIDZ2 pool, it will create 3-way mirror swap devices, so any two disks can fail.
 