Tunable needed to detect more than 16 NVMe drives on AMD Epyc?

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
As part of the buildout of a high-performance AMD Epyc based TrueNAS Core build, I ran into an issue where only 16 of the 24 NVMe drives were being detected by the OS. The server's specs are as follows:
  • Dell PowerEdge R7525
  • 2 x AMD Epyc 7H12 CPUs, 128 cores / 256 threads total @ 2.6 GHz
  • 1TB DDR4 Registered ECC RAM @ 3200 MT/s
  • 24 x 30.72TB Micron 9400 Pro U.3 NVMe drives
  • 2 x Chelsio T62100-LP-CR dual port 100GbE network adapters
  • TrueNAS Core 13.5
After some work with a third party consultant, we were able to make a change to a tunable that allowed all 24 drives to be detected. For whatever reason, an update from TrueNAS Core 13.4 to 13.5 removed the tunable and the system suddenly could only detect 16 drives. Adding the tunable back and rebooting allowed all the drives to be detected again.

The tunable is "hw.nvme.num_io_queues=64". We played with different numbers and discovered that anything above 4 allowed the drives to be detected. A quick google search of that tunable doesn't actually turn up much in the way of useful information.

Does anyone know what this tunable is and why the default setting in TrueNAS Core wouldn't allow more than 16 NVMe drives to be detected?

***Edit***
I think I just figured out why the tunable was removed on upgrade. The tunable is actually set in the bootloader and I'm guessing the update either modifies or entirely replaces the loader.conf file?
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Have you seen a performance impact from different settings of the tunable?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
This tunable does not seem to be documented in the nvme driver man page.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Have you seen a performance impact from different settings of the tunable?
As near as we can tell, there is no performance impact. The only change we could see was that the system was able to detect and use all 24 drives.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You need to set tunables in the UI, otherwise they will be lost at each update. As for the question of what that tunable actually does - I guess Warner Losh would be the person to contact. I'm not going to post another person's contact details, but you can find him on Twitter, he has an @freebsd.org email address, and there are the mailing lists.

In case I meet him in the regular bhyve production users call that starts in 10 minutes, I'll ask him :wink:
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
You need to set tunables in the UI, otherwise they will be lost at each update. As for the question of what that tunable actually does - I guess Warner Losh would be the person to contact. I'm not going to post another person's contact details, but you can find him on Twitter, he has an @freebsd.org email address, and there are the mailing lists.

In case I meet him in the regular bhyve production users call that starts in 10 minutes, I'll ask him :wink:
We did set the tunable in the UI.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
We did set the tunable in the UI.
That is surprising - a tunable set in the UI should definitely survive any update. Did you file an issue in JIRA?

P.S. I did not meet Warner tonight.
 

bsdimp

Cadet
Joined
Aug 8, 2021
Messages
3
> hw.nvme.num_io_queues=64

OK. Normally, we try to have one I/O queue per core. On high core count machines, or with *lots* of cards that have a high number of queues, this can exhaust the MSI-X message slots. When the driver can't allocate one, it fails (which would appear as a failure to detect the drive). By limiting the number of I/O queues, you are making sure this resource isn't exhausted.
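
To put rough numbers on that for the box in this thread (illustrative arithmetic from the specs above, not actual allocation counts):
Code:
        # 2 x Epyc 7H12 => 256 hardware threads visible to the kernel (mp_ncpus)
        # Default of one I/O queue per CPU, plus one admin queue, per drive:
        #   24 drives x (256 + 1) = 6,168 interrupt vectors requested
        # With hw.nvme.num_io_queues=64:
        #   24 drives x (64 + 1) = 1,560 vectors requested
        # Each drive is also capped by its own MSI-X table size (see the code below).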

hw.nvme.min_cpus_per_ioq is also a good tunable to set. It has the driver assign at least N CPUs to each of the I/O queues that are allocated. It's how we deal with the large core count machines that we have at work. Here's the logic that's implemented:
Code:
        /*
         * Try to allocate one MSI-X per core for I/O queues, plus one
         * for admin queue, but accept single shared MSI-X if have to.
         * Fall back to MSI if can't get any MSI-X.
         */
        num_io_queues = mp_ncpus;
        TUNABLE_INT_FETCH("hw.nvme.num_io_queues", &num_io_queues);
        if (num_io_queues < 1 || num_io_queues > mp_ncpus)
                num_io_queues = mp_ncpus;

        per_cpu_io_queues = 1;
        TUNABLE_INT_FETCH("hw.nvme.per_cpu_io_queues", &per_cpu_io_queues);
        if (per_cpu_io_queues == 0)
                num_io_queues = 1;

        min_cpus_per_ioq = smp_threads_per_core;
        TUNABLE_INT_FETCH("hw.nvme.min_cpus_per_ioq", &min_cpus_per_ioq);
        if (min_cpus_per_ioq > 1) {
                num_io_queues = min(num_io_queues,
                    max(1, mp_ncpus / min_cpus_per_ioq));
        }

        num_io_queues = min(num_io_queues, max(1, pci_msix_count(dev) - 1));


They should all be documented in nvme(4), but hw.nvme.num_io_queues appears to be missing.
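
As a rough sketch of how that plays out on a 256-thread machine like this one (the value here is purely illustrative, not a recommendation):
Code:
        # loader.conf-style sketch: let the driver size the queues itself,
        # but require at least 8 CPUs per I/O queue
        hw.nvme.min_cpus_per_ioq="8"
        # Per the logic above: num_io_queues = min(256, max(1, 256 / 8)) = 32
        # per drive, before the per-device cap of pci_msix_count(dev) - 1.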


Warner
 