NVDIMMS and TrueNAS 12

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
A summary for NVDIMMs would be:
1) Power is lost or the system crashes - no data loss
2) The NVDIMM dies - the SLOG is no longer used, so system performance drops
3) Power is lost and the NVDIMM dies at the same time (very unlikely) - 5 seconds of data loss

Mirror the NVDIMMs...and then data loss is extremely unlikely except for fire, sprinklers etc.
The TrueNAS M-Series uses mirrored NVDIMMs. The second NVDIMM is on the other controller.
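
The summary above can be written out as a small decision function. This is only an illustrative sketch: `slog_outcome` and its parameters are hypothetical names, not part of TrueNAS or ZFS, and the returned strings simply restate the three cases listed above.

```python
def slog_outcome(system_down: bool, slog_devices: int, slog_failed: int) -> str:
    """Return the expected outcome for in-flight sync writes.

    system_down  -- the server lost power or crashed
    slog_devices -- devices in the log vdev (1 = single NVDIMM, 2 = mirror)
    slog_failed  -- how many of those devices died in the same event
    """
    if slog_devices < 1 or not 0 <= slog_failed <= slog_devices:
        raise ValueError("need >= 1 SLOG device and 0 <= failed <= devices")

    # At least one copy of the log is still readable.
    slog_survives = slog_failed < slog_devices

    if system_down and slog_survives:
        return "no data loss"                       # case 1
    if system_down:
        return "about 5 seconds of data loss"       # case 3
    if not slog_survives:
        return "no data loss, reduced performance"  # case 2
    return "no data loss"                           # nothing went wrong


print(slog_outcome(system_down=True, slog_devices=1, slog_failed=1))
print(slog_outcome(system_down=True, slog_devices=2, slog_failed=1))
```

With a mirror, case 3 only occurs when both NVDIMMs fail in the same event as the crash.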
 

alexr

Explorer
Joined
Apr 14, 2016
Messages
59
Right, but there's still only one NVDIMM. If it dies at the same time as the server does (e.g. a power surge), that's still lost data.
We don't have redundant DRAM -- we're relying on ECC to work properly.

A whole DIMM could short out just as easily as an NVDIMM could fail. That seems unlikely.

Either way, not trying to sell anybody on NVDIMMs, just providing the fixes I needed to apply when updating to TrueNAS 12.0-U7.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
A whole DIMM could short out just as easily as an NVDIMM could fail. That seems unlikely.
Sure, but the system will likely panic rather quickly. It's not really comparable, unless you're running it in memory mode.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
We don't have redundant DRAM -- we're relying on ECC to work properly.

A whole DIMM could short out just as easily as an NVDIMM could fail. That seems unlikely.

Either way, not trying to sell anybody on NVDIMMs, just providing the fixes I needed to apply when updating to TrueNAS 12.0-U7.

ECC is a form of redundancy itself.

It's not about the device type or speed but your original statement:

this device is battery-backed to absolve the need for a mirror

That's only the means by which that device achieves its PLP (power-loss protection) of in-flight data. Others write direct-to-NAND (Optane) or use onboard supercapacitors.

Having or not having PLP is a completely distinct issue from "complete device failure" which is the purpose of the UI warning to mirror your SLOG.

Consider it a sliding scale.

Async writes are "0% safety" - you're guaranteed to lose pending data in case of any failure.
Sync with single SLOG is "99.9% safety" - if you lose both the server and your SLOG device simultaneously, such as via a power surge that kills both your HBA and your SLOG, then you'll lose data.
Sync with mirrored SLOG is "99.999% safety" - now you have to lose the server, and TWO SLOG devices, all simultaneously, in order to lose data. As @morganL points out, you're usually at "physical environment" levels of failure - fire, water, or aliens.
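
To make the sliding scale concrete, here is a rough back-of-the-envelope sketch. It assumes each SLOG device dies in the same event as the server with some independent probability `p_device_dies`; that independence is an assumption (a power surge is a correlated failure, so real odds are worse), the 0.001 value is only picked to line up with the illustrative "99.9%" figure, and `loss_given_crash` is a hypothetical helper, not anything from ZFS or TrueNAS.

```python
def loss_given_crash(p_device_dies: float, slog_devices: int, sync: bool = True) -> float:
    """Probability of losing in-flight writes, given that the server went down.

    p_device_dies -- assumed chance that one SLOG device dies in the same event
    slog_devices  -- devices in the log vdev (1 = single, 2 = mirror)
    sync          -- False models async writes, where pending data lives only in RAM
    """
    if not sync:
        return 1.0  # async: pending writes are always lost when the server dies
    if slog_devices < 1:
        raise ValueError("this sketch only models pools with a SLOG")
    if not 0.0 <= p_device_dies <= 1.0:
        raise ValueError("p_device_dies must be a probability")
    # Data is lost only if every log device dies (independence assumed).
    return p_device_dies ** slog_devices


P = 0.001  # illustrative value only
print("async:        ", loss_given_crash(P, 1, sync=False))
print("single SLOG:  ", loss_given_crash(P, 1))
print("mirrored SLOG:", loss_given_crash(P, 2))
```

The exact numbers matter less than the shape: each extra mirror member multiplies the remaining risk by the per-device failure probability, while PLP on the device does nothing to that term.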

It's the same as when people say "I can run async, I have a UPS" - loss of power at the input is far from the only fault you can experience.
 

alexr

Explorer
Joined
Apr 14, 2016
Messages
59
Again, not sure how this turned into "explain NVDIMMs to the guy who actually owns one," but I did say that it has a super cap as part of it. It's sufficiently redundant for my needs.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's sufficiently redundant for my needs.

Risk acceptance is always on the implementer, but there's a reason that message comes up in the UI and it's not "because the iX engineers were looking for a way to kill time."
 

alexr

Explorer
Joined
Apr 14, 2016
Messages
59
Risk acceptance is always on the implementer, but there's a reason that message comes up in the UI and it's not "because the iX engineers were looking for a way to kill time."
I don't disagree with the message, except that I find this solution sufficient.

Since y'all decided to kill the messenger, I was motivated to go look. As @Rand mentioned in another thread, all this stuff is obsolete now, so I was able to order a new Micron module and a PowerGEM for ~$125 total. That's a hefty savings over what I paid for my setup new.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I don't disagree with the message, except that I find this solution sufficient.

And that's likely to be perfectly fine, and statistically speaking you'll never experience a failure.

NVDIMMs don't "wear out" like NAND does, so you're unlikely to experience the edge-case failure "scenario 3" laid out by @morganL of a simultaneous system crash and SLOG failure.

That's the scenario that mirrored SLOG addresses, which is why that warning is still present for any non-mirrored SLOG. The individual SLOG device having power-loss protection has no bearing on that.
 

winstontj

Explorer
Joined
Apr 8, 2012
Messages
56
You can't mix and match NVDIMM-N and -P on one board so based on your use case pick either.
I spoke to tech support at a mobo manufacturer today (10 Feb. 2022), who stated that while running both battery-backed NVDIMM-N AND Optane PMem NVDIMM-P on the same board/socket is "unsupported", they made a point to tell me: "I'm not saying it doesn't work, I'm just saying it is not recommended and we do not support it."

I have the pieces in a saved shopping cart. I'd love to run a 32 GB NVDIMM-N for SLOG and 2x PMem NVDIMM-P sticks for metadata... and hopefully have enough memory left for ARC.

How can I learn more about what is available in SCALE?
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
A summary for NVDIMMs would be:
1) Power is lost or the system crashes - no data loss
2) The NVDIMM dies - the SLOG is no longer used, so system performance drops
3) Power is lost and the NVDIMM dies at the same time (very unlikely) - 5 seconds of data loss

Mirror the NVDIMMs...and then data loss is extremely unlikely except for fire, sprinklers etc.
The TrueNAS M-Series uses mirrored NVDIMMs. The second NVDIMM is on the other controller.

There is actually a fourth case (a typical edge case):
- Power loss for an extended time. In my unrepresentative case I had a power failure, the UPS ran dry after a while, the server went down, and then the PowerGEM ran dry too while trying to keep the NVDIMM powered.
That caused a pool issue, of course, but nothing a force wouldn't fix.

No issue at home, and unlikely to happen at a company site :)

How can I learn more about what is available in SCALE?

That's actually a very good question - will NVDIMMs work in TrueNAS SCALE? Technically there shouldn't be an issue, since NVDIMMs work fine in Linux...


Edit: To answer my own question - works fine out of the box.

[Attached screenshot: 1645363751209.png]
 