Data Compression and Deduplication Demystified

September 27, 2016

These two data optimization technologies not only squeeze more capacity out of your storage system but can bring a performance boost as well.
You have probably heard of “data compression” in various forms over the years but may not know how it plays an integral role in applications and on storage systems like FreeNAS and TrueNAS to make the most of your raw storage capacity. This article will explore how data compression and deduplication are used at different levels of the computing stack to multiply your usable storage capacity.
The most familiar forms of data optimization are the lossless compression you use when “zipping” up files and the “lossy” compression used when saving an image file such as an optimized JPEG. The first lets you collect and shrink files for emailing, while the second could produce a thumbnail of a large photo for quick reference. I will focus only on the “lossless” data optimization methods used in the OpenZFS file system, which do not modify your data in any way.
At a conceptual level, “compression” reduces the size of any given file or block of data, while “deduplication” eliminates duplicate data by replacing identical files or data blocks with references to one “master” copy. In practice, however, the line between the two is somewhat blurred because any lossless compression is in fact a form of file or block-level deduplication. Fortunately, you only need to know when you are likely to get the maximum benefit from any given method.

Block-level Compression and Deduplication

The OpenZFS file system used in every iXsystems storage product supports both inline compression and deduplication at the block level. This means that every block of data being stored can be individually compressed, deduplicated by being compared against every other block on the volume using a deduplication table that indexes all unique blocks, or both.
While compressing every block may sound inefficient, it actually isn’t, given the dramatic speed difference between the system’s CPU and the SSDs and hard disks in the storage subsystem: the CPU can compress data far faster than the data can be written or read, resulting in a storage system speed boost on top of the additional capacity that the compression produces. We’ve seen this additional capacity reach up to 2.5X the raw capacity of the system, and OpenZFS knows not to bother compressing data that will not yield at least a 12.5% space savings. This means that pre-compressed data, such as music or video, will be stored exactly as it is received without additional compression, and in some cases it may be wise to disable compression on such datasets altogether. In all other cases, OpenZFS compression gives you both a speed and capacity increase.
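To make this concrete, here is a minimal sketch using the standard zfs command-line tools; the pool and dataset names “tank/data” and “tank/media” are hypothetical stand-ins for your own:

# Enable LZ4 compression on a general-purpose dataset
# (names here are hypothetical examples):
zfs set compression=lz4 tank/data

# See how much space compression is actually saving:
zfs get compressratio tank/data

# Disable compression on a dataset that holds pre-compressed media:
zfs set compression=off tank/media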
Deduplication, by contrast, can yield even higher capacity savings, up to 10X, but is very use-case specific. Effective block-level deduplication requires that your data is suited for deduplication and that you can accommodate the performance impact of comparing every incoming block of data against the table of known unique blocks. OpenZFS deduplication can be used quite effectively on an all-flash array, and our sales engineers are happy to determine if deduplication will benefit your use case. OpenZFS compression, on the other hand, provides instant benefits in most cases at essentially zero cost. Finally, unlike post-processing types of compression and deduplication, the methods used by OpenZFS are inline and do not require space for a temporary duplicate of your data.
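If you are weighing deduplication for your own data, a sketch of the relevant commands follows; the dataset name “tank/vms” is again hypothetical, and the zdb estimate can take a long time on a large pool:

# Enable inline deduplication on a dataset (use with care):
zfs set dedup=on tank/vms

# Review the pool-wide deduplication ratio (DEDUP column):
zpool list tank

# Simulate dedup to estimate the savings before committing to it:
zdb -S tank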

Other Types of Compression and Deduplication

Just as lossless compression is in fact a form of deduplication, deduplication can also take place in other highly-effective ways. The first of these is OpenZFS clone-based deduplication. If you configure, for example, a master virtual machine image or project directory, either can be snapshotted in its pristine state and cloned to create additional, near-zero-overhead copies, as shown in the sketch below. As each clone is used, it will only grow by the changes made relative to the “master” snapshot. OpenZFS clones can even be “promoted” to fully-independent datasets, demonstrating OpenZFS’ remarkable flexibility.
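The entire clone workflow is only a few commands; the dataset names “tank/vm-master”, “tank/vm-alice” and “tank/vm-bob” below are hypothetical:

# Snapshot the master VM image in its pristine state:
zfs snapshot tank/vm-master@pristine

# Create near-zero-overhead copies for individual users:
zfs clone tank/vm-master@pristine tank/vm-alice
zfs clone tank/vm-master@pristine tank/vm-bob

# Later, promote a clone so it no longer depends on the master snapshot:
zfs promote tank/vm-alice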
Compression and deduplication can take place at another level: the application level. Most contemporary desktop file formats include some form of lossless compression that is tailored to the given data type. Applications can also be quite clever, such as when a mail server receives a message destined for, say, 3,000 students. The mail server will store one master copy of the message and 3,000 references to it, yielding a significant space savings, especially if the message includes an attachment such as the student handbook. Each of these examples is entirely application-specific, but if you understand how and when they operate, you can truly make the most of your OpenZFS-based storage system.
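A loose file-level analogue of the mail server trick, offered only as an illustration with hypothetical paths, is the humble hard link, where many directory entries reference one copy of the data on disk:

# Store the attachment once (paths are hypothetical examples)...
cp handbook.pdf /mail/attachments/handbook.pdf

# ...then give each mailbox a link to that single copy:
ln /mail/attachments/handbook.pdf /mail/users/student0001/handbook.pdf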
It is worth noting that the benefits of storage optimization are not strictly “more space”. Storage optimization is important for offsetting the raw capacity that is lost to redundancy and also for “behind the scenes” uses such as file system snapshots and other metadata.
I hope this gives you a better sense of how your existing storage system may be effectively or ineffectively using its raw capacity, and how OpenZFS offers significant features at every level to help you make the most of your storage investment.

Michael Dexter
Senior Analyst
