Dell server config with dedup and ZFS replication


harald

Cadet
Joined
Oct 2, 2014
Messages
3
Hello everyone,

we plan to use some Dell R530 hardware with the following specs to store backup copies of virtual machines via an NFS datastore. This was working well until we had an issue with housekeeping of older backups and the dedup tables grew too big. But more details first.

Server: Dell R530, Xeon E5-2603 v3 @ 1.60 GHz, 128 GB ECC RAM, 4 x Broadcom Gigabit LAN, no RAID controller (using the internal SATA ports), 7 x 4 TB SATA 7200 rpm drives in RAID-Z1 giving a 20 TB pool, and one 200 GB mixed-mode SSD as cache/L2ARC. Redundant power supplies and a UPS attached. The network is configured as one failover pair for the management IP (lagg0) and a second failover pair for data traffic (lagg1).

As I wrote, we had some issues with removing old backup copies due to an error in a script, so we kept more than 100 copies of the VMs instead of a maximum of 10. But deduplication was working very well, and we still had lots of free disk space :smile:
One site had a dedup ratio of 42 with 18.8M allocated blocks, another site had a ratio of 19 with 65.7M blocks. We are now starting from scratch with a fixed housekeeping script and are going to test the environment.

The future plan is to use ZFS replication to sync the day-to-day changes to another site. This is the main reason why we want to use deduplication - to identify the changed blocks in the backup data. Dedup is the key to minimizing the traffic for the replication.

Given this plan, what do you think about the configuration?
Should we switch to RAID-Z2, since thanks to compression and dedup we won't need all that space even at big sites? Or use a second SSD as ZIL instead of one hard disk (the server has 8 drive bays)?
How will this configuration work with ZFS snapshots and the replication?

We usually have lots of write traffic and only read access if we need to restore a VM, which happens very rarely.

What is your suggestion on this configuration?


Thanks in advance, Harald
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Welcome Harald!

You've got the requisite "absolute gobs of memory" for dedup, so let's just head that one off with a "you're probably good" on that front. I'd suggest putting some "zpool status -D" and parsing of the output into your housekeeping script, and having it fire you a "summary DDT size on disk/core" so you can track it and be mindful of when it might see pressure. Changing the arc_meta tunable family might also be useful to ensure that your entire DDT stays in RAM.
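
As a rough sketch of what that parsing might look like (the pool name "tank" and the exact wording of the summary line are assumptions here, so check it against your own zpool status -D output first):

Code:
# Hypothetical housekeeping helper: read the one-line DDT summary from
# "zpool status -D" and convert it to bytes for trending/alerting.
# Pool name "tank" and the summary-line format are assumptions.
import re
import subprocess

def ddt_summary(pool="tank"):
    out = subprocess.check_output(["zpool", "status", "-D", pool],
                                  universal_newlines=True)
    # Expected form: "dedup: DDT entries N, size X on disk, Y in core"
    m = re.search(r"DDT entries (\d+), size (\d+) on disk, (\d+) in core", out)
    if m is None:
        return None
    entries, on_disk, in_core = (int(g) for g in m.groups())
    return {"entries": entries,
            "ddt_bytes_on_disk": entries * on_disk,
            "ddt_bytes_in_core": entries * in_core}

if __name__ == "__main__":
    s = ddt_summary()
    if s is not None:
        print("DDT: %(entries)d entries, "
              "%(ddt_bytes_on_disk)d bytes on disk, "
              "%(ddt_bytes_in_core)d bytes in core" % s)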

On to the disk setup. A backup/restore workload won't benefit from L2ARC at all unless you plan to restore the same thing repeatedly, and SLOG is for latency-sensitive random writes, not backups where you need to ingest sequential data as fast as possible.

Personally, I'd toss the L2ARC/SLOG device entirely, use that bay for an eighth disk, go RAIDZ2, and write asynchronously. By definition the backup isn't the only copy of the data, so it's not like you're risking something other than some lost time. You also have redundant power and UPS, so even that risk is minimal.

Snapshots and replication. Remember what I said about your housekeeping/health-check being mindful of DDT size? Entries aren't dropped from DDT until all referenced blocks are gone, including blocks in snapshots. You found that one out the hard way when you kept 100 backups instead of 10. Make damned sure that your script is watching for this. Assuming that's in place - after your initial replication of snapshots, your snapshots of the deduplication changes are going to be tiny, which is perfect for shunting across what I assume is a pretty low-bandwidth WAN link to your other site.
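
The replication step itself can stay very simple. A sketch of the daily send, assuming placeholder dataset/host names and that the previous snapshot already exists on both sides:

Code:
# Sketch of a daily incremental replication: snapshot, then send only
# the delta since the previous snapshot to the remote pool over ssh.
# "tank/backups", "dr-site" and the snapshot names are placeholders.
import subprocess
from datetime import date

def replicate(dataset="tank/backups", remote="dr-site",
              prev_snap="daily-prev"):
    today_snap = "daily-%s" % date.today().isoformat()
    subprocess.check_call(["zfs", "snapshot",
                           "%s@%s" % (dataset, today_snap)])
    send = subprocess.Popen(["zfs", "send", "-i",
                             "%s@%s" % (dataset, prev_snap),
                             "%s@%s" % (dataset, today_snap)],
                            stdout=subprocess.PIPE)
    # Receive on the remote side; -F rolls the target back to the last
    # common snapshot before applying the new one.
    subprocess.check_call(["ssh", remote, "zfs", "receive", "-F", dataset],
                          stdin=send.stdout)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")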

Are you using incremental backups and compression up at a higher layer in the backup software, or are you just doing full backups and letting ZFS dedup handle it?
 

harald

Cadet
Joined
Oct 2, 2014
Messages
3
Thanks for your analysis.

We are using the Zabbix monitoring system and can add the zpool status output to our monitored items.

After I re-created the overfilled volumes yesterday, we now have the first backups stored again.
Current status is: dedup: DDT entries 36954204, size 745 on disk, 142 in core
This means the current dedup table is using 36954204 * 745 = 25.6 GB on disk and 36954204 * 142 = 4.9 GB in RAM?
"On disk" really means on the rotating disks, not on the L2ARC SSD, correct?

arc_summary.py is showing
vfs.zfs.arc_meta_limit 33034731520
vfs.zfs.arc_meta_used 6926726256
That means 6.9 GB is used by metadata, which includes the dedup table, correct?
Are you suggesting we increase the 25% arc_meta_limit to e.g. 50% or 75% to ensure there is enough space for the dedup table?

Are the other 2 GB of metadata filesystem information and the tables that map the L2ARC?
Would it lower the amount of metadata if I removed the L2ARC drive?
If the dedup table no longer fits into RAM, will it use the L2ARC for that, or not?

From a monitoring point of view we would also be able to check that arc_meta_limit > arc_meta_used.

We had very low NFS write speed at the beginning and followed a hint found on the web to patch the kernel to allow async NFS writes.
http://www.ateamsystems.com/tech-bl...ith-freebsd-zfs-backed-esxi-storage-over-nfs/
Now we have fast write speed without changing any ZFS filesystem options.
I know that using a patched kernel is not a great solution, but from my point of view this looked better than disabling ZFS sync.
What's your opinion?

I think we should keep a low number of snapshots to be sure that we don't overgrow the dedup tables. Lesson learned ;-)
Our backup script is a customized version of ghettoVCB; it simply creates full clones of our VMs.


Thank you, Harald
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This means the current dedup table is using 36954204 * 745 = 25.6 GB on disk and 36954204 * 142 = 4.9 GB in RAM?
"On disk" really means on the rotating disks, not on the L2ARC SSD, correct?

Dedup math is correct, and "core" does mean "size in ARC/L2ARC" vs "disk" being "size in the zpool."
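
If you want to double-check the arithmetic yourself:

Code:
# Quick sanity check of the DDT figures (bytes per entry * entry count).
entries = 36954204
print(entries * 745 / 1024.0 ** 3)  # ~25.6 GiB on disk
print(entries * 142 / 1024.0 ** 3)  # ~4.9 GiB in core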

That means 6,9 GB used by Metadata which includes the dedup table, correct?
Are you suggesting to increase the 25% arc_meta_limit to e.g. 50% or 75% to ensure there is enough space for the dedup table?

Also correct, but at this point you're not under any pressure, with only 6.9/33 GB of your "metadata RAM" used. I don't see a need to bump arc_meta_limit up yet, but if it does start to increase you can change that. While it's a "soft limit" (I think?) you don't want to have your DDT lose the coin-flip/die-roll against a chunk of frequently or recently accessed ARC data. In your situation I would happily sacrifice the ARC, as backup storage doesn't really benefit much from read caching unless you plan to restore the same data over and over. ;)
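
If you want a quick number for Zabbix, something along these lines should do; the sysctl names are what arc_summary.py reads on a FreeBSD-based box, so treat them as an assumption and confirm with "sysctl -a | grep arc_meta" first:

Code:
# Hypothetical check: report arc_meta_used as a percentage of
# arc_meta_limit. Sysctl names assumed from FreeBSD/FreeNAS; verify
# them on your box before wiring this into monitoring.
import subprocess

def sysctl_int(name):
    out = subprocess.check_output(["sysctl", "-n", name],
                                  universal_newlines=True)
    return int(out.strip())

used = sysctl_int("kstat.zfs.misc.arcstats.arc_meta_used")
limit = sysctl_int("vfs.zfs.arc_meta_limit")
print("%.1f" % (100.0 * used / limit))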

The other 2 GB of meta data are filesystem information and the tables to map the L2ARC?
Would it lower the amount of meta data if I remove the L2ARC drive?
If the dedup table does no more fit into RAM, will it use the L2ARC for that or not?

Yes to all three. L2ARC mapping tables consume ARC (specifically arc_meta) memory, you can lower the amount of metadata there by removing the device, and the DDT will overflow into L2ARC based on its "core" size. If you really want to keep the L2ARC around, I would suggest using it in metadata-only mode by setting the zfs property "secondarycache=metadata" at the pool level. This will mean you're only caching metadata in your L2ARC, rather than trying to hold actual data. Again, in backup storage, read caching isn't likely to contribute significant value, so don't bother with it. Think of the L2ARC in this case as a safety net that will warn you when your DDT starts to exceed arc_meta_limit; performance will take a hit as SSD is much slower than RAM, but it won't completely die from having to swap to spinning disk.
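
If you script your setup, that's a one-liner; "tank" below is just a placeholder for your pool:

Code:
# Sketch: switch the L2ARC to metadata-only and read the property back.
# "tank" is a placeholder pool/dataset name.
import subprocess

subprocess.check_call(["zfs", "set", "secondarycache=metadata", "tank"])
value = subprocess.check_output(["zfs", "get", "-H", "-o", "value",
                                 "secondarycache", "tank"],
                                universal_newlines=True).strip()
print("secondarycache = %s" % value)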

I know that using a patched kernel is not a great solution, but from my point of view this was looking better than disabling ZFS sync.
What's your opinion?

Patched kernels are definitely in "we don't support that" territory. Personally, I can't see a change to NFS like that causing problems beyond "your performance will suck if you do an update that undoes it," but you would definitely get the "you're on your own" response pretty fast from any kind of professional support.

Hopefully I haven't mixed too much Solaris in with my FreeNAS here.
 

harald

Cadet
Joined
Oct 2, 2014
Messages
3
Hello,

it's me again :smile:
Thanks for your great explanation; it was really helpful for me.

After watching for a few days with the fixed housekeeping, we saw that arc_meta_used is not growing excessively; the value at the biggest site was 60% of arc_meta_limit.

As a next step I tried to get rid of the patched kernel and use the standard kernel - but without completely disabling ZFS sync.
The standard kernel with VMware and NFS gave me 4-5 MB/sec write performance -> not usable.
Then I configured the SSD as a ZIL/log device instead of L2ARC and the write speed climbed to 100 MB/sec (the network link was saturated) -> goal reached.
I know that my MLC SSD is not the best choice for this use, but I'll try it for now.

Did I understand correctly that there will be no data loss if the SSD fails, only a performance drop, because the ZIL is then written from RAM to the rotating disks instead of the SSD?
And this configuration (non-mirrored SSD) does not create any risk for ZFS filesystem consistency?

Unfortunately I've noticed that arc_summary.py no longer lists the value vfs.zfs.arc_meta_used since the SSD configuration was changed.
Is this expected, or do I have a configuration error?
Is there any other (simple and easy) option to monitor the arc_meta usage?
If not, I need to grab the dedup info from zpool status and do some calculations in Zabbix.


Thank you, Harald
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
On to the disk setup. A backup/restore workload won't benefit from L2ARC at all unless you plan to restore the same thing repeatedly, and SLOG is for latency-sensitive random writes, not backups where you need to ingest sequential data as fast as possible.

Personally, I'd toss the L2ARC/SLOG device entirely,

This isn't necessarily true. Having the DDT available in ARC/L2ARC is incredibly useful, but can get very stressy if there's a large number of blocks with only modest duplication levels (like "1"). It would be a reasonable thing to have a decent sized L2ARC device available where ZFS could stuff DDT entries.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This isn't necessarily true. Having the DDT available in ARC/L2ARC is incredibly useful, but can get very stressy if there's a large number of blocks with only modest duplication levels (like "1"). It would be a reasonable thing to have a decent sized L2ARC device available where ZFS could stuff DDT entries.

You mean like

If you really want to keep the L2ARC around, I would suggest using it in metadata-only mode by setting the zfs property "secondarycache=metadata" at the pool level. This will mean you're only caching metadata in your L2ARC, rather than trying to hold actual data. Again, in backup storage, read caching isn't likely to contribute significant value, so don't bother with it. Think of the L2ARC in this case as a safety net that will warn you when your DDT starts to exceed arc_meta_limit; performance will take a hit as SSD is much slower than RAM, but it won't completely die from having to swap to spinning disk.

? ;)
 