Risk of using non-ECC RAM

Status: Not open for further replies.

choong

Dabbler
Joined
Jul 2, 2014
Messages
10
I know this topic is a sensitive area to touch on, but let me explain the background. I already have a server set up as the forum suggests, i.e. the largest amount of ECC RAM I can fit, an Intel NIC, RAIDZ2, etc. (let's call this the Backup Server).

I'm starting a new company with a few guys, so I'm contemplating setting up another server from scrap parts. I would configure this new server (let's call it the Scrap Server) with non-ECC RAM and maybe RAIDZ1, depending on the number of spare hard drives I can get. The Scrap Server would run Syncthing and back everything up to the Backup Server every 10 minutes or less. All our data consists of work created by employees, so any loss is just wasted employee time, which we can handle; my startup can withstand a day of downtime. At full-scale operations, "critical" data would probably amount to about 100 GB, with another 400 GB of non-critical data.

What kind of risk am I looking at, other than the Scrap Server itself screwing up badly? Is there a possibility that this kind of setup will corrupt data at the file level and the corrupted files then get backed up? Or, if the zpool dies, does the whole thing just die?
 

Scareh

Contributor
Joined
Jul 31, 2012
Messages
182
Well, regardless, if you back up from the Scrap Server to the Backup Server and what you back up is garbage, you'll still be screwed...
By that I mean: if bad RAM is corrupting your data as it's read, you'll be backing up unreadable data, which you most likely won't notice until you need the data, see it's corrupt, go look at your backup, and see it's corrupt as well.
You say you can handle the loss of employee time, but you're willing to back up every 10 minutes, which tells me you're not actually willing to handle that loss, tbh.

Bite the bullet and do it the right way from the start: ECC or gtfo ;-)
 

choong

Dabbler
Joined
Jul 2, 2014
Messages
10
Thanks. Actually, I found the answer to this question in one of cyberjock's posts after I posted.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
What kind of risk am I looking at, other than the Scrap Server itself screwing up badly? Is there a possibility that this kind of setup will corrupt data at the file level and the corrupted files then get backed up? Or, if the zpool dies, does the whole thing just die?
For any future visitor, I'll respond. Using non-ECC RAM likely won't result in the server screwing itself up badly. In fact, you might never know that there was ever a problem. That is, until you attempt to open or access a file and find it (and all of the snapshots and backups) corrupted. So, if you care about your data not becoming corrupt, use ECC RAM. If you don't care about your data silently becoming corrupt, then go ahead with non-ECC RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And to emphasize, when we say "data", we mean "pool", not files. You're not just risking bytes within individual files. You're risking the integrity of the pool itself, since there are no automatic mechanisms to repair ZFS structural damage.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Non-ECC can be a valid choice in very specific personal use cases.

For any sort of business environment, ECC is the only financially sound option. Let's face it, if it's worth having a server for, it's worth protecting properly.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Why would you be diligent in the storage and backup of your data and not be diligent in its creation?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Actually I can maybe see a use case for storing transient scratch or cached data. But, still...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Actually I can maybe see a use case for storing transient scratch or cached data. But, still...
But again, silent corruption in such a scenario can have nasty results.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Of course. So that's why I didn't say "for all use cases involving ...." But the risks there are much closer to the sorts of risks you normally accept with other filesystems.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
But the risks there are much closer to the sorts of risks you normally accept with other filesystems.
QFT. We've been dealing with silent corruption for decades, even before ZFS (u wot m8?). It's why we invented GFS backup schemes, journaling filesystems, yadda yadda. The issue, IMHO, is that ZFS expects end-to-end protection and says, "Screw you, nothing can go wrong. No repair tools necessary."

A better question might be: if I am bootstrapping a startup and want to use spare gear, what might be an appropriate choice of OS and filesystem? ext3, UFS, and NTFS all have tools to address errors. The risk of silent corruption is mitigated elsewhere.

I'm as "ECC or GTFO" as anyone. However, I also accept that many filesystems are not CoW, don't snapshot, and don't protect against silent corruption. If you skip scrubs and accept that a silently corrupted file is simply corrupt... ZFS still has some advantages. Is a simple ZFS mirror on non-ECC RAM REALLY more dangerous than a mirror on UFS or ext3? The designers say no (lots of citations, plus Reddit spew).

@jgreco Us plebs store non-valuable data for convenience, posterity, and pleasure. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The issue, IMHO, is that ZFS expects end-to-end protection and says, "Screw you, nothing can go wrong. No repair tools necessary."

It's not really that at all, no. ZFS is sufficiently complicated that repair tools rapidly become nightmarish in nature. A repair tool implies that a particular type of failure can be predicted and corrected. Consider: FFS was designed with the assumption that it was running on raw disk, and therefore that failures were expected. The correction part of the problem is very difficult even in simpler systems such as FFS, which is why a lot of stuff ends up in lost+found when there's a problem, and sometimes the damage is severe enough that it can't be fixed. ZFS tackles the problem from the other end, by providing redundancy and error-resistance, so that failures are mitigated through redundant data. You aren't required to design your pool to provide that redundancy, but if you don't, failure becomes potentially catastrophic.

A better question might be: if I am bootstrapping a startup and want to use spare gear, what might be an appropriate choice of OS and filesystem? ext3, UFS, and NTFS all have tools to address errors.

Yes.

The risk of silent corruption is mitigated elsewhere.

An outright untruth in most cases. Most data out there is simply unprotected. Let's just be honest and call it as it is.

If you skip scrubs and accept that a silently corrupted file is simply corrupt... ZFS still has some advantages. Is a simple ZFS mirror on non-ECC RAM REALLY more dangerous than a mirror on UFS or ext3? The designers say no (lots of citations, plus Reddit spew).

It's a matter of perspective, but I don't agree with "no": the problem is still that the pool metadata has no way to be repaired, and in that way, yes, ZFS is a little more dangerous. Damage to a pool's metadata can conceivably live on nearly forever, which means that once it is damaged, there's ongoing opportunity for bad things to happen. With FFS or ext3 you can force a filesystem check of the critical operating bits, and if fsck decides the filesystem is clean, you've got a fairly decent guarantee that you're not going to run into some odd corruption.

@jgreco Us plebs store non-valuable data for convenience, posterity, and pleasure. ;)

Uh huh.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
It's a matter of perspective, but I don't agree with "no": the problem is still that the pool metadata has no way to be repaired, and in that way, yes, ZFS is a little more dangerous. Damage to a pool's metadata can conceivably live on nearly forever, which means that once it is damaged, there's ongoing opportunity for bad things to happen. With FFS or ext3 you can force a filesystem check of the critical operating bits, and if fsck decides the filesystem is clean, you've got a fairly decent guarantee that you're not going to run into some odd corruption.

I think this is a great way to put it. Given misbehaving RAM, ZFS should not corrupt data any more easily than any other filesystem, like ext or NTFS. In fact, because ZFS stores multiple copies of its metadata and keeps checksums, it can actually still recover gracefully, at least in some cases, better than the average filesystem, even without ECC memory.

Take this simple example. Let's say you read a block of data off a ZFS pool, and that block is read into a region of RAM that corrupts it (because the RAM is misbehaving). Before serving the data to the user, ZFS loads the block's checksum into a different section of RAM and checks it against the block. If the RAM corrupted the block, the checksum won't match; ZFS will note that something has gone wrong and attempt to reconstruct a good copy that matches the checksum before serving it to the user.

Even in the face of faulty RAM, it's very unlikely that ZFS will serve corrupt data to the user, since the block and the checksum have to match before it deems the block uncorrupted and good to serve up.

Compare this to something like ext or NTFS: if the same thing happened and the block being read was corrupted in RAM, that corrupted copy would be served to the user without hesitation.
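
To make the difference concrete, here's a toy Python sketch (illustrative only, not actual ZFS code; the SHA-256 checksum and the two-copy mirror layout are stand-ins I made up for the example):

Code:
import hashlib

def read_zfs_style(copies, stored_checksum):
    """Toy model of a checksummed read: try each redundant copy
    (e.g. the two sides of a mirror) until one matches the checksum
    that was recorded when the block was written."""
    for block in copies:
        if hashlib.sha256(block).digest() == stored_checksum:
            return block  # verified good copy
    raise IOError("every copy fails its checksum; report it, don't serve garbage")

def read_ext_style(block):
    """Toy model of a non-checksummed read: whatever came back,
    corrupted in RAM or not, is handed straight to the application."""
    return block

The point is only that the mismatch is detected and a redundant copy is tried, instead of the corrupted buffer being passed along silently.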

So if, for example, you already have a system that you are going to use to store data, its lack of ECC support is not a reason to avoid ZFS in favor of another software RAID, because the alternative won't be any "better".

But, as you put it, since ZFS doesn't have tools to repair itself should some corruption sneak in, it is in THAT respect more fragile than other filesystems, which do have repair tools that can be used to save them.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, no, not exactly. Data in memory isn't checksummed, so you're screwed if it corrupts in-core. You're assuming that the process is read-data, get-corrupted, verify-checksum, give-data-to-user. The process can easily see the data corrupted AFTER the checksum is verified. Worse, if you're doing a read-update-write cycle, where the data may have been read ten seconds (or minutes or hours) ago, there's a huge window for bits to rot. This seems particularly likely to be applicable to metadata such as inodes, directories, and free lists, where that stuff might be in-core due to high frequency of use. Since the checksum verification happens when the block is read off disk, that's bad.
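
As a toy illustration of that window (Python; the names and structure are made up, not how the real code works):

Code:
import hashlib

cache = {}  # stands in for the ARC: block id -> in-memory buffer

def read_block(block_id, disk, checksums):
    buf = bytearray(disk[block_id])
    # The checksum is verified exactly once: when the block
    # comes off the disk.
    assert hashlib.sha256(buf).digest() == checksums[block_id]
    cache[block_id] = buf
    return buf

def update_and_write(block_id, offset, value, disk, checksums):
    buf = cache[block_id]  # may have sat in RAM for hours; a bit
                           # flipped during that time is invisible here
    buf[offset] = value
    # The new checksum is computed over the possibly-corrupt buffer,
    # so the corruption gets written out "blessed" by a valid checksum.
    checksums[block_id] = hashlib.sha256(buf).digest()
    disk[block_id] = bytes(buf)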

It is worth thinking about that, then realizing that ZFS is designed around the idea of caching massive amounts of stuff in ARC, and then re-reading what I just wrote. It will (or should) horrify you more then.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well, no, not exactly. Data in memory isn't checksummed, so you're screwed if it corrupts in-core. You're assuming that the process is read-data, get-corrupted, verify-checksum, give-data-to-user.

Yes, it's certainly possible for the RAM to corrupt the block in memory after it has been checked against the checksum, but again, this isn't worse than another filesystem, which would just give you the corrupted block no matter when it was corrupted. ZFS, at least, can still sometimes successfully detect corruption caused by bad memory, even if it's not ECC memory.

This seems particularly likely to be applicable to metadata such as inodes, directories, and free lists, where that stuff might be in-core due to high frequency of use. Since the checksum verification happens when the block is read off disk, that's bad.

It is worth thinking about that, then realizing that ZFS is designed around the idea of caching massive amounts of stuff in ARC, and then re-reading what I just wrote. It will (or should) horrify you more then.

Very true as well. Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

I would consider enabling the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10) on a non-ECC system. This tells ZFS to checksum data at rest in memory and to verify it right before writing anything to disk, which at least narrows the window of opportunity for corruption.
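
For example, on Linux with OpenZFS the flag is exposed as the zfs_flags module parameter; this hypothetical helper just ORs the bit in (I believe FreeBSD-based FreeNAS exposes an equivalent vfs.zfs tunable via sysctl/loader.conf, but check the exact name for your version):

Code:
ZFS_DEBUG_MODIFY = 0x10  # check for illegally modified ARC buffers

def enable_debug_modify(param="/sys/module/zfs/parameters/zfs_flags"):
    """OR the ZFS_DEBUG_MODIFY bit into the current zfs_flags value.
    Linux/OpenZFS path shown; requires root. A hypothetical helper,
    not a FreeNAS-supported interface."""
    with open(param) as f:
        current = int(f.read().strip(), 0)
    with open(param, "w") as f:
        f.write(str(current | ZFS_DEBUG_MODIFY))

Note that this is a debug facility, so expect some performance overhead.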
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, it's certainly possible for the RAM to corrupt the block in memory after it has been checked against the checksum, but again, this isn't worse than another filesystem, which would just give you the corrupted block no matter when it was corrupted. ZFS, at least, can still sometimes successfully detect corruption caused by bad memory, even if it's not ECC memory.



Very true as well. Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

I would consider enabling the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10) on a non-ECC system. This tells ZFS to checksum data at rest in memory and to verify it right before writing anything to disk, which at least narrows the window of opportunity for corruption.
Are you saying ZFS actually somewhat supports the Rube Goldbergian scheme of checksumming data in RAM in software?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can one safely use a non-ECC FreeNAS server to serve data from a read-only pool?
"Read-only" doesn't apply to the lower levels of ZFS. It still corrects any corruption it finds.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

For more traditional filesystems, that caching usually happens outside the filesystem layer, so, no, it works somewhat differently. The virtual memory subsystem is typically responsible for it, and long-term cached data won't be used as part of a read-update-write cycle, because that cycle happens below the cache, in the filesystem layer. In comparison, ZFS does all of that internally, and it definitely will take a cached block, update it, and write it back out.

Regardless, for a typical UNIX or Windows box, you're usually sizing the box to the task, and not adding many gigs of extra RAM for ARC caching. The additional RAM added to a ZFS system represents a larger risk since there's more that can be corrupted. You could probably make a reasonable argument to reduce the amount of RAM.
 