Risk of using non-ECC RAM

Status: Not open for further replies.

choong

Dabbler
Joined
Jul 2, 2014
Messages
10
I know this topic is a sensitive area to touch on, but let me explain the background. I already have a server set up as the forum suggests, i.e. the largest amount of ECC RAM I can fit, an Intel NIC, RAIDZ2, etc. (let's call this the Backup Server).

I'm starting a new company with a few guys, so I'm contemplating setting up another server from scrap parts. I would configure this new server (let's call it the Scrap Server) with non-ECC RAM and maybe RAIDZ1, depending on the number of spare hard drives I can get. The Scrap Server would run Syncthing and back everything up to the Backup Server every 10 minutes or less. All our data consists of work created by employees, so any loss is just wasted employee time, which we can handle; my startup can withstand a day of downtime. At full-scale operations, "critical" data would probably amount to about 100 GB, with another 400 GB of non-critical data.

What kind of risk am I looking at, other than the Scrap Server itself screwing up badly? Is there a possibility that this kind of setup will corrupt data at the file level and the corrupted files then get backed up? Or, if the zpool dies, does the whole thing just die?
 

Scareh

Contributor
Joined
Jul 31, 2012
Messages
182
Well, regardless, if you back up from the Scrap Server to the Backup Server and what you back up is garbage, you'll still be screwed...
By that I mean: if bad RAM is corrupting your data as it's read, you'll be backing up unreadable data, which you most likely won't notice until you need the data, see it's corrupt, go look at your backup, and see it's corrupt as well.
You say you can handle the loss of employee time, but you're willing to back up every 10 minutes, which tells me you're not actually willing to handle that loss, tbh.

Bite the bullet and do it the right way from the start: ECC or gtfo ;-)
 

choong

Dabbler
Joined
Jul 2, 2014
Messages
10
Thanks. Actually, I found the answer to this question in one of cyberjock's posts after I posted.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
What kind of risk am I looking at, other than the Scrap Server itself screwing up badly? Is there a possibility that this kind of setup will corrupt data at the file level and the corrupted files then get backed up? Or, if the zpool dies, does the whole thing just die?
For any future visitor, I'll respond. Using non-ECC RAM likely won't result in the server screwing itself up badly. In fact, you might never know that there was ever a problem. That is, until you attempt to open or access a file and find it (and all of the snapshots and backups) corrupted. So, if you care about your data not becoming corrupt, use ECC RAM. If you don't care about your data silently becoming corrupt, then go ahead with non-ECC RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And to emphasize, when we say "data", we mean "pool", not files. You're not just risking bytes within individual files. You're risking the integrity of the pool itself, since there are no automatic mechanisms to repair ZFS structural damage.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Non-ECC can be a valid choice in very specific personal use cases.

For any sort of business environment, ECC is the only financially sound option. Let's face it, if it's worth having a server for, it's worth protecting properly.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Why would you be diligent in the storage and backup of your data and not be diligent in its creation?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Actually I can maybe see a use case for storing transient scratch or cached data. But, still...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Actually I can maybe see a use case for storing transient scratch or cached data. But, still...
But again, silent corruption in such a scenario can have nasty results.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Of course. So that's why I didn't say "for all use cases involving ...." But the risks there are much closer to the sorts of risks you normally accept with other filesystems.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
But the risks there are much closer to the sorts of risks you normally accept with other filesystems.
QFT. We've been dealing with silent corruption for decades, even before ZFS (u wot m8?). It's why we invented GFS backup schemes, journaling filesystems, yadda yadda. The issue, IMHO, is that ZFS expects end-to-end protection and says, "Screw you, nothing can go wrong. No repair tools necessary."

A better question might be: if I am bootstrapping a startup and want to use spare gear, what might be an appropriate choice of OS and filesystem? ext3, UFS, and NTFS all have tools to address errors. The risk of silent corruption is mitigated elsewhere.

I'm as "ECC or GTFO" as anyone. However, I also accept that many filesystems are not CoW, don't snapshot, and don't protect against silent corruption. If you skip scrubs and accept that a silently corrupted file is simply corrupt... ZFS still has some advantages. Is a simple ZFS mirror on non-ECC RAM REALLY more dangerous than a mirror on UFS or ext3? The designers say no (lots of citations, plus Reddit spew).

@jgreco Us plebs store non-valuable data for convenience, posterity, and pleasure. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The issue, IMHO, is that ZFS expects end-to-end protection and says, "Screw you, nothing can go wrong. No repair tools necessary."

It's not really that at all, no. ZFS is sufficiently complicated that repair tools rapidly become nightmarish in nature. A repair tool implies that a particular type of failure can be predicted and corrected. Consider: FFS was designed with the assumption that it was running on raw disk, and therefore that failures were expected. The correction part of the problem is very difficult even in simpler systems such as FFS, which is why a lot of stuff ends up in lost+found when there's a problem, and sometimes the damage is severe enough that it can't be fixed. ZFS tackles the problem from the other end, by providing redundancy and error-resistance, so that failures are mitigated through redundant data. You aren't required to design your pool to provide that redundancy, but if you don't, failure becomes potentially catastrophic.

A better question might be: if I am bootstrapping a startup and want to use spare gear, what might be an appropriate choice of OS and filesystem? ext3, UFS, and NTFS all have tools to address errors.

Yes.

The risk of silent corruption is mitigated elsewhere.

An outright untruth in most cases. Most data out there is simply unprotected. Let's just be honest and call it as it is.

If you skip scrubs and accept that a silently corrupted file is simply corrupt... ZFS still has some advantages. Is a simple ZFS mirror on non-ECC RAM REALLY more dangerous than a mirror on UFS or ext3? The designers say no (lots of citations, plus Reddit spew).

It's a matter of perspective, but I don't agree with "no": the problem is still that the pool metadata has no way to be repaired, and in that way, yes, ZFS is a little more dangerous. Damage to a pool's metadata can conceivably live on nearly forever, which means that once it is damaged, there's ongoing opportunity for bad things to happen. With FFS or ext3 you can force a filesystem check of the critical operating bits, and if fsck decides the filesystem is clean, you've got a fairly decent guarantee that you're not going to run into some odd corruption.

@jgreco Us plebs store non-valuable data for convenience, posterity, and pleasure. ;)

Uh huh.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
It's a matter of perspective, but I don't agree with "no": the problem is still that the pool metadata has no way to be repaired, and in that way, yes, ZFS is a little more dangerous. Damage to a pool's metadata can conceivably live on nearly forever, which means that once it is damaged, there's ongoing opportunity for bad things to happen. With FFS or ext3 you can force a filesystem check of the critical operating bits, and if fsck decides the filesystem is clean, you've got a fairly decent guarantee that you're not going to run into some odd corruption.

I think this is a great way to put it. Given misbehaving RAM, ZFS should not corrupt data any more easily than any other filesystem, like ext or NTFS. In fact, because ZFS stores multiple copies of its metadata and keeps checksums, it can actually still recover gracefully, at least in some cases, better than the average filesystem, even without ECC memory.

Take this simple example. Let's say you read a block of data off a ZFS pool, and that block is read into a region of RAM that corrupts it (because the RAM is misbehaving). Before serving the data to the user, ZFS loads the block's checksum into a different section of RAM and checks it against the block. If the RAM corrupted the block, the checksum won't match; ZFS will note that something has gone wrong and attempt to reconstruct a good copy that matches the checksum before serving it to the user.

Even in the face of faulty RAM, it's very unlikely that ZFS will serve corrupt data to the user, since the block and the checksum have to match before it deems the block uncorrupted and good to serve up.

Compare this to something like ext or NTFS: if the same thing happened and the block being read was corrupted in RAM, that corrupted copy would be served to the user without hesitation.
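
To make the difference concrete, here's a toy Python sketch (illustrative only, not actual ZFS code; the SHA-256 checksum and the two-copy mirror layout are stand-ins I made up for the example):

Code:
import hashlib

def read_zfs_style(copies, stored_checksum):
    """Toy model of a checksummed read: try each redundant copy
    (e.g. the two sides of a mirror) until one matches the checksum
    that was recorded when the block was written."""
    for block in copies:
        if hashlib.sha256(block).digest() == stored_checksum:
            return block  # verified good copy
    raise IOError("every copy fails its checksum; report it, don't serve garbage")

def read_ext_style(block):
    """Toy model of a non-checksummed read: whatever came back,
    corrupted in RAM or not, is handed straight to the application."""
    return block

The point is only that the mismatch is detected and a redundant copy is tried, instead of the corrupted buffer being passed along silently.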

So if, for example, you already have a system that you are going to use to store data, its lack of ECC support is not a reason to avoid ZFS in favor of another software RAID, because the alternative won't be any "better".

But, as you put it, since ZFS doesn't have tools to repair itself should some corruption sneak in, it is in THAT respect more fragile than other filesystems, which do have repair tools that can be used to save them.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, no, not exactly. Data in memory isn't checksummed, so you're screwed if it corrupts in-core. You're assuming that the process is read-data, get-corrupted, verify-checksum, give-data-to-user. The process can easily see the data corrupted AFTER the checksum is verified. Worse, if you're doing a read-update-write cycle, where the data may have been read ten seconds (or minutes or hours) ago, there's a huge window for bits to rot. This seems particularly likely to be applicable to metadata such as inodes, directories, and free lists, where that stuff might be in-core due to high frequency of use. Since the checksum verification happens when the block is read off disk, that's bad.
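
As a toy illustration of that window (Python; the names and structure are made up, not how the real code works):

Code:
import hashlib

cache = {}  # stands in for the ARC: block id -> in-memory buffer

def read_block(block_id, disk, checksums):
    buf = bytearray(disk[block_id])
    # The checksum is verified exactly once: when the block
    # comes off the disk.
    assert hashlib.sha256(buf).digest() == checksums[block_id]
    cache[block_id] = buf
    return buf

def update_and_write(block_id, offset, value, disk, checksums):
    buf = cache[block_id]  # may have sat in RAM for hours; a bit
                           # flipped during that time is invisible here
    buf[offset] = value
    # The new checksum is computed over the possibly-corrupt buffer,
    # so the corruption gets written out "blessed" by a valid checksum.
    checksums[block_id] = hashlib.sha256(buf).digest()
    disk[block_id] = bytes(buf)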

It is worth thinking about that, then realizing that ZFS is designed around the idea of caching massive amounts of stuff in ARC, and then re-reading what I just wrote. It will (or should) horrify you more then.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well, no, not exactly. Data in memory isn't checksummed, so you're screwed if it corrupts in-core. You're assuming that the process is read-data, get-corrupted, verify-checksum, give-data-to-user.

Yes, it's certainly possible for the RAM to corrupt the block in memory after it has been checked against the checksum, but again, this isn't worse than another filesystem, which would just give you the corrupted block no matter when it was corrupted. ZFS, at least, can still sometimes successfully detect corruption caused by bad memory, even if it's not ECC memory.

This seems particularly likely to be applicable to metadata such as inodes, directories, and free lists, where that stuff might be in-core due to high frequency of use. Since the checksum verification happens when the block is read off disk, that's bad.

It is worth thinking about that, then realizing that ZFS is designed around the idea of caching massive amounts of stuff in ARC, and then re-reading what I just wrote. It will (or should) horrify you more then.

Very true as well. Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

I would consider enabling the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10) on a non-ECC system. This tells ZFS to checksum data at rest in memory and to verify it right before writing anything to disk, which at least narrows the window of opportunity for corruption.
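
For example, on Linux with OpenZFS the flag is exposed as the zfs_flags module parameter; this hypothetical helper just ORs the bit in (I believe FreeBSD-based FreeNAS exposes an equivalent vfs.zfs tunable via sysctl/loader.conf, but check the exact name for your version):

Code:
ZFS_DEBUG_MODIFY = 0x10  # check for illegally modified ARC buffers

def enable_debug_modify(param="/sys/module/zfs/parameters/zfs_flags"):
    """OR the ZFS_DEBUG_MODIFY bit into the current zfs_flags value.
    Linux/OpenZFS path shown; requires root. A hypothetical helper,
    not a FreeNAS-supported interface."""
    with open(param) as f:
        current = int(f.read().strip(), 0)
    with open(param, "w") as f:
        f.write(str(current | ZFS_DEBUG_MODIFY))

Note that this is a debug facility, so expect some performance overhead.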
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, it's certainly possible for the RAM to corrupt the block in memory after it has been checked against the checksum, but again, this isn't worse than another filesystem, which would just give you the corrupted block no matter when it was corrupted. ZFS, at least, can still sometimes successfully detect corruption caused by bad memory, even if it's not ECC memory.



Very true as well. Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

I would consider enabling the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10) on a non-ECC system. This tells ZFS to checksum data at rest in memory and to verify it right before writing anything to disk, which at least narrows the window of opportunity for corruption.
Are you saying ZFS actually somewhat supports the Rube Goldbergian scheme of checksumming data in RAM in software?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can one safely use a non-ECC FreeNAS server to serve data from a read-only pool?
"Read-only" doesn't apply to the lower levels of ZFS. It still corrects any corruption it finds.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Although, as I mentioned before, any modern OS caches massive amounts of data in memory (several gigabytes' worth when there's lots of free memory) on "regular" filesystems like ext and NTFS too, so that cached data could just as easily be corrupted during a read-modify-write cycle.

For more traditional filesystems, that caching usually happens outside the filesystem layer, so, no, it works somewhat differently. The virtual memory subsystem is typically responsible for it, and long-term cached data won't be used as part of a read-update-write cycle, because that cycle happens below the cache, in the filesystem layer. In comparison, ZFS does all of that internally, and it definitely will take a cached block, update it, and write it back out.

Regardless, for a typical UNIX or Windows box, you're usually sizing the box to the task, and not adding many gigs of extra RAM for ARC caching. The additional RAM added to a ZFS system represents a larger risk since there's more that can be corrupted. You could probably make a reasonable argument to reduce the amount of RAM.
 