Partition an NVMe drive as L2ARC for multiple pools

ilmarmors

Dabbler
Joined
Dec 27, 2014
Messages
25
I have a warm-storage server with two pools: 1) 3 vdevs of 11 disks each in RAIDZ2 (long term, usage close to write once, read many; one dataset with 1M recordsize) and 2) 1 vdev of 2 disks in a mirror (scratch space for incoming data upload, processing and preparation for ingest into the long-term pool; multiple datasets with 128K recordsize). The server has 128GB RAM and an Intel OPAL D7-P4610 1.6T NVMe for L2ARC.

I have FreeNAS-11.3-U5 installed. I can probably migrate to TrueNAS-12.0 when U2 is out, or I might upgrade to TrueNAS-12.0-U1 sooner if there is a good reason to do so.

My main goal is to keep as much filesystem metadata in the L2ARC as possible. I don't have a random-access load - mainly rsync, find, du and streaming of files (which are usually 1-30MB). The biggest gain I have seen is when the ARC has the filesystem metadata cached, but not everything fits in ARC (RAM), and during normal use the metadata in ARC gets displaced by the file data itself.

The ideal solution would be a shared L2ARC, where ZFS takes care of balancing L2ARC usage among multiple pools, but that is not currently possible and probably won't be for some time: https://github.com/openzfs/zfs/issues/9859

The FreeNAS interface only allows adding a whole device as L2ARC to a pool. I have only one NVMe drive, and for reasons beyond my control I won't be able to upgrade this particular server.

How safe or dangerous is the following workaround: partition the NVMe disk manually and add the individual partitions as L2ARC cache to different pools? In my case I created two partitions - 128G for the scratch pool and the remainder for the big tank pool:

root@freenas# gpart create -s GPT /dev/nvd0
nvd0 created
root@freenas# gpart add -t freebsd-zfs -a 1m -l l2arca -s 128G /dev/nvd0
nvd0p1 added
root@freenas# gpart add -t freebsd-zfs -a 1m -l l2arcb /dev/nvd0
nvd0p2 added
root@freenas# zpool add scratch cache nvd0p1
root@freenas# zpool add tank cache nvd0p2


Running zpool status shows nvd0p1 and nvd0p2 under the cache sections of the scratch and tank pools respectively.

On the pool status page the FreeNAS UI shows /dev/nvd0p1 and /dev/nvd0p2 in the cache sections of the respective pools, instead of nvd0 (the full device, which is what can be attached to a single pool via the UI).

Are there any downsides to the approach I took? Something that might bite me down the road? Things I should not forget and should remember during upgrades, config restores or anything else in the future?

Is there any way to let ZFS know that I prefer caching filesystem metadata in the L2ARC? If yes, what is the correct way to configure that? I would like to avoid long sequential file reads or writes evicting metadata from the L2ARC.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is there any way to let ZFS know that I prefer caching filesystem metadata in the L2ARC? If yes, what is the correct way to configure that?
You can elect to have it store only metadata... if that's interesting, check here:

Also of interest would be this one (once you upgrade to 12)
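
As a rough sketch, this is presumably the secondarycache dataset property; the pool/dataset names below are just placeholders for your own:

root@freenas# zfs set secondarycache=metadata tank/mydataset   # cache only metadata from this dataset in L2ARC
root@freenas# zfs set secondarycache=metadata tank             # or set it at the pool root; children inherit it
root@freenas# zfs get secondarycache tank                      # verify the current value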
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
L2ARC is duplicate data, so if it blows up, gets corrupted, or whatever, the file system can go back to the pool for the missing data.

I found a metadata-only L2ARC to be a huge benefit for rsync operations. By default, it took three passes for the L2ARC cache to get “hot” with metadata and maximize its benefit. As of TrueNAS 12, the L2ARC can be made persistent.

Another option in TrueNAS 12 is setting up a special VDEV for metadata. However, unlike L2ARC, that sVDEV is essential for the pool. Hence, the sVDEV hardware / configuration should be designed to match the redundancy of your pool. For example, I will use a 3-way mirror of identical Intel SSDs for my sVDEV. Sizing the sVDEV properly is also important.
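
If memory serves, L2ARC persistence in TrueNAS 12 is governed by the l2arc_rebuild_enabled tunable (exposed on FreeBSD as a vfs.zfs.l2arc sysctl) and is on by default; something along these lines should confirm it:

root@freenas# sysctl vfs.zfs.l2arc.rebuild_enabled   # 1 = L2ARC contents are rebuilt (persist) across reboots
# if it ever needs changing, setting it as a tunable in the UI (System -> Tunables) is preferable to the shell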
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I'd recommend not using nvd0p1 as the device name to add to the pool but gptid/<rawuuid-of-nvd0p1> instead. You can get this with gpart list nvd0.
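
For example, roughly like this (the rawuuid values below are made up; use the ones gpart list nvd0 actually reports for your partitions):

root@freenas# zpool remove scratch nvd0p1      # cache devices can be removed safely
root@freenas# zpool remove tank nvd0p2
root@freenas# gpart list nvd0 | grep rawuuid
   rawuuid: 3b2f1c9a-0d55-11eb-aaaa-0cc47a000001
   rawuuid: 3f7d2e84-0d55-11eb-aaaa-0cc47a000002
root@freenas# zpool add scratch cache gptid/3b2f1c9a-0d55-11eb-aaaa-0cc47a000001
root@freenas# zpool add tank cache gptid/3f7d2e84-0d55-11eb-aaaa-0cc47a000002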
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Can this be done via the GUI with those UUIDs as well? I ask since I agree with your approach but want to stick to the GUI as much as possible, due to the repeated warnings here not to drop into the shell for this sort of stuff.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Nope. Only the entire disk can be added via the UI. But doing so would lead to
  • a partition table with a single partition being created
  • the rawuuid of that partition being used to insert the disk as a cache into the pool
So you already dropped into the shell, didn't you? I only advise using the identifiers TrueNAS would use if it had a feature to partition a disk. TN always uses the UUIDs for vdevs.

Look at the output of zpool status. With the exception of the boot pool it's gptid/something throughout.
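
Purely illustrative (the identifier is invented), a cache section added by gptid then looks something like:

        cache
          gptid/3f7d2e84-0d55-11eb-aaaa-0cc47a000002  ONLINE       0     0     0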
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Having been bitten by changing ada# designations in the past, I concur with using UUIDs whenever possible. It's the only foolproof way, but it's too bad it has to be done via the shell, since that increases the likelihood that I somehow screw it up... :smile:
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Like omitting the "cache" keyword, which would lead to an additional unmirrored vdev ...
zpool checkpoint is your friend in this case: take one before you do anything else.
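
For reference, roughly like this (the pool name is just an example):

root@freenas# zpool checkpoint tank                      # take a checkpoint before the risky zpool add
root@freenas# zpool checkpoint -d tank                   # everything checks out: discard the checkpoint
# if it went wrong instead, export and rewind to the checkpoint:
root@freenas# zpool export tank
root@freenas# zpool import --rewind-to-checkpoint tank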
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Yup.
I will have to educate myself on the "how" of adding a small-files as well as a metadata sVDEV on separate partitions via the shell, then. That's for a different thread. :smile:

Good thing my plan was to nuke the current pool once I have multiple backups set up. That should make verifying the correct setup a lot easier.

Also, I reckon the GUI will show the results as expected once I've set it all up via the shell?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Be aware that you must provide some resiliency for metadata vdevs, i.e. at least mirror them, and check the write endurance of the devices you are planning to use. These are not cache devices. If the metadata special vdev is lost, the pool is toast.
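
As a rough sketch only (device identifiers are placeholders, and as discussed above gptid labels would be used in practice; the dataset name is an example):

root@freenas# zpool add tank special mirror gptid/<uuid-ssd1> gptid/<uuid-ssd2> gptid/<uuid-ssd3>   # mirrored metadata sVDEV
root@freenas# zfs set special_small_blocks=32K tank/mydataset   # optional: also store small blocks on the sVDEV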
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Based on my inquiries here, a 3-way mirror consisting of S3610s seemed to fit the bill for my use case (a pool with largely dormant data). I also have a cold spare on hand. All of these SSDs have been burned in.

Other use cases may need something more robust!
 