Tunable needed to detect more than 16 NVMe drives on AMD Epyc?

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
As part of the buildout of a high-performance AMD Epyc based TrueNAS Core build, I ran into an issue where only 16 of the 24 NVMe drives were being detected by the OS. The server's specs are as follows:
  • Dell PowerEdge R7525
  • 2 x AMD Epyc 7H12 CPUs, 128 cores / 256 threads total @ 2.6 GHz
  • 1TB DDR4 Registered ECC RAM @ 3200 MT/s
  • 24 x 30.72TB Micron 9400 Pro U.3 NVMe drives
  • 2 x Chelsio T62100-LP-CR dual port 100GbE network adapters
  • TrueNAS Core 13.5
After some work with a third party consultant, we were able to make a change to a tunable that allowed all 24 drives to be detected. For whatever reason, an update from TrueNAS Core 13.4 to 13.5 removed the tunable and the system suddenly could only detect 16 drives. Adding the tunable back and rebooting allowed all the drives to be detected again.

The tunable is "hw.nvme.num_io_queues=64". We played with different numbers and discovered that anything above 4 allowed the drives to be detected. A quick google search of that tunable doesn't actually turn up much in the way of useful information.

Does anyone know what this tunable is and why the default setting in TrueNAS Core wouldn't allow more than 16 NVMe drives to be detected?

***Edit***
I think I just figured out why the tunable was removed on upgrade. The tunable is actually set in the bootloader and I'm guessing the update either modifies or entirely replaces the loader.conf file?
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Have you seen a performance impact from different settings of the tunable?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
This tunable does not seem to be documented in the nvme driver man page.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Have you seen a performance impact from different settings of the tunable?
As near as we can tell, there is no performance impact. The only change we could see was that the system was able to detect and use all 24 drives.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You need to set tunables in the UI, otherwise they will be lost at each update. As for the question of what that tunable actually does - I guess Warner Losh would be the person to contact. I'm not going to post another person's contact details, but you can find him on Twitter, he has an @freebsd.org email address, and there are the mailing lists.

In case I meet him in the regular bhyve production users call that starts in 10 minutes, I'll ask him :wink:
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
You need to set tunables in the UI, otherwise they will be lost at each update. As for the question of what that tunable actually does - I guess Warner Losh would be the person to contact. I'm not going to post another person's contact details, but you can find him on Twitter, he has an @freebsd.org email address, and there are the mailing lists.

In case I meet him in the regular bhyve production users call that starts in 10 minutes, I'll ask him :wink:
We did set the tunable in the UI.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
We did set the tunable in the UI.
That is surprising - a tunable set in the UI should definitely survive any update. Did you file an issue in JIRA?

P.S. I did not meet Warner tonight.
 

bsdimp

Cadet
Joined
Aug 8, 2021
Messages
3
> hw.nvme.num_io_queues=64

OK. Normally, we try to have one I/O queue per core. On high core count machines, or with *lots* of cards that have a high number of queues, this can exhaust the MSI-X message slots. When the driver can't allocate one, it fails (which would appear as a failure to detect the drive). By limiting the number of I/O queues, you are making sure this resource isn't exhausted.
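
To put rough numbers on that for the box in this thread (illustrative arithmetic from the specs above, not actual allocation counts):
Code:
        # 2 x Epyc 7H12 => 256 hardware threads visible to the kernel (mp_ncpus)
        # Default of one I/O queue per CPU, plus one admin queue, per drive:
        #   24 drives x (256 + 1) = 6,168 interrupt vectors requested
        # With hw.nvme.num_io_queues=64:
        #   24 drives x (64 + 1) = 1,560 vectors requested
        # Each drive is also capped by its own MSI-X table size (see the code below).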

hw.nvme.min_cpus_per_ioq is also a good tunable to set. It has the driver assign at least N CPUs to each of the I/O queues that are allocated. It's how we deal with the large core count machines that we have at work. Here's the logic that's implemented:
Code:
        /*
         * Try to allocate one MSI-X per core for I/O queues, plus one
         * for admin queue, but accept single shared MSI-X if have to.
         * Fall back to MSI if can't get any MSI-X.
         */
        num_io_queues = mp_ncpus;
        TUNABLE_INT_FETCH("hw.nvme.num_io_queues", &num_io_queues);
        if (num_io_queues < 1 || num_io_queues > mp_ncpus)
                num_io_queues = mp_ncpus;

        per_cpu_io_queues = 1;
        TUNABLE_INT_FETCH("hw.nvme.per_cpu_io_queues", &per_cpu_io_queues);
        if (per_cpu_io_queues == 0)
                num_io_queues = 1;

        min_cpus_per_ioq = smp_threads_per_core;
        TUNABLE_INT_FETCH("hw.nvme.min_cpus_per_ioq", &min_cpus_per_ioq);
        if (min_cpus_per_ioq > 1) {
                num_io_queues = min(num_io_queues,
                    max(1, mp_ncpus / min_cpus_per_ioq));
        }

        num_io_queues = min(num_io_queues, max(1, pci_msix_count(dev) - 1));


They should all be documented in nvme(4), but hw.nvme.num_io_queues appears to be missing.
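
As a rough sketch of how that plays out on a 256-thread machine like this one (the value here is purely illustrative, not a recommendation):
Code:
        # loader.conf-style sketch: let the driver size the queues itself,
        # but require at least 8 CPUs per I/O queue
        hw.nvme.min_cpus_per_ioq="8"
        # Per the logic above: num_io_queues = min(256, max(1, 256 / 8)) = 32
        # per drive, before the per-device cap of pci_msix_count(dev) - 1.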


Warner
 