This is an interesting approach. I actually did something similar under FreeBSD, though with an entirely different approach.
We had been using an enterprise tape system, and tapes were shipped off-site on a weekly basis. When I came on the job, I was asked to restore some data, so I tried to read the tape and got nothing. So I started an end-to-end evaluation of the backup process. The first thing I noticed was that the read/write stats on the tapes showed lots of writes, but not enough reads to have read the tapes even once. That meant no backup tape had ever been verified, and to me that means there is no backup. I went into the UI of the tape system and turned on the verification step, and we never had a "successful" backup again. Turns out the tapes were not readable, and hadn't been in years. Previous staff had blindly fed the tape robot, and must never have had any requests to restore data. So I started evaluating options, ranging from replacing the tape drives in the robot, to buying a new enterprise backup system, to doing something else entirely. Replacing the drives was going to cost around $10k, and there was the problem of stale software licenses and support contracts that were going to get sticky. The tapes themselves were expensive, and flushing through a fresh set of tapes was going to triple the overall cost. Management didn't like the sudden expense, but they realized they had a huge exposure.
After analyzing all of the options, I asked for about $5k to build a completely new open source system, based on a FreeBSD beta that had been released during the second week of my research, and which included ZFS. It wasn't today's version of ZFS, but it had some features I wanted, and the price was right. The $5k was split between drives and a drive chassis with extra drive trays. The software I wrote was nothing more than sh scripts that used rsync to do all of the main work, with ZFS snapshots adding day-to-day granularity in picking files to restore. Now this /was/ a network-based backup, but rsync added some magic because the day-to-day deltas were quite small, so once the system was seeded with the first backup, the hope was that regular backups would go quickly with little network load.
I had 300 machines to back up, and the first backup took nearly two weeks to complete. The surprise came when I did the second rsync: it completed so fast I was sure it had failed. As it turns out, it was taking less than 5 minutes to run through all of the servers and complete the backup. Systems with databases were handled differently: the databases, and most log files, were excluded from the rsync. But depending on the nature of each database, a dump was produced and placed in a folder that was not excluded. Some of the dumps were daily, and others were less often.
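The dump side of that can be as simple as a nightly cron job on each database server. This is a hypothetical MySQL example with assumed paths, not my actual scripts; RUN=echo makes it print the commands instead of running them, so set RUN="" on a real server.

```shell
#!/bin/sh
# Sketch of the per-database handling: the live database files and logs sit
# in rsync's exclude list, while this dump directory does not.
RUN="${RUN:-echo}"
DUMPDIR=/var/backups/dumps

nightly_dump() {
  $RUN mkdir -p "$DUMPDIR"
  # overwriting the same file each night is fine: the backup server's
  # daily ZFS snapshots preserve every prior day's dump
  $RUN sh -c "mysqldump --all-databases | gzip > $DUMPDIR/all-databases.sql.gz"
}

nightly_dump
```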
My script would start each morning by making a snapshot of all of the filesystems on the backup server. It would then initiate the rsync script on each server over ssh. Each server was backed up FILE BY FILE to a directory with the server's name, and permissions were managed with NIS, so the backup server could be accessed by users with the same permissions as on the original systems. The daily snapshots were read-only, and used a naming convention that made it easy to find the date you wanted; when you cd'd into a snapshot you saw the list of machine-named folders, inside of which was a file-by-file copy of the data from that day.
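A minimal sketch of that morning driver, with hypothetical names (a pool called "backup", a stand-in host list, an exclude file); again RUN=echo makes it print the commands rather than execute them:

```shell
#!/bin/sh
# Morning driver sketch: snapshot first (freezing the previous day's
# completed run), then rsync each server into its own directory.
RUN="${RUN:-echo}"
POOL=backup
TODAY=$(date +%Y-%m-%d)
HOSTS="web1 web2 db1"   # stand-ins for the ~300 real servers

# one recursive snapshot covers every filesystem on the backup server
snapshot_all() { $RUN zfs snapshot -r "${POOL}@${TODAY}"; }

# file-by-file copy of one server into a directory with the server's name
backup_host() {
  $RUN rsync -aHx --delete --numeric-ids \
       --exclude-from=/usr/local/etc/backup-excludes \
       "root@$1:/" "/${POOL}/$1/"
}

snapshot_all
for h in $HOSTS; do backup_host "$h"; done
```

The date-stamped snapshot name is what gives the "pick a day, cd into it" restore interface.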
I started off making daily snapshots, and had planned to trim to weekly after the first 30 days, and then to monthly at some later point, simply by removing the snapshots I wanted to trim. Disk space was being used more slowly than anticipated, so I never got around to writing the trim script.
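That trim script could have been little more than a text filter over the snapshot names. This sketch assumes the YYYY-MM-DD naming (so lexical sort order equals date order) and simplifies the policy to "dailies for 30 days, then first-of-month only", skipping the weekly stage:

```shell
#!/bin/sh
# Read snapshot names on stdin, print the ones that should be destroyed.
prune_list() {
  sort | awk '
    { snaps[NR] = $0 }
    END {
      first_daily = NR - 29              # the newest 30 stay as dailies
      for (i = 1; i < first_daily; i++)
        if (snaps[i] !~ /-01$/)          # keep the 1st of each month
          print snaps[i]
    }'
}

# usage on the backup server:
#   zfs list -H -o name -t snapshot | grep '^backup@' | prune_list \
#     | xargs -n 1 zfs destroy
```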
The off-site backup requirement was fulfilled by designating one stack of drives in the enclosure as the swap stack. I would export that pool, pack the drives in a Pelican case, and leave it with the office manager. The offsite records storage people would come in every Monday with a case of drives, swapping it for the current set. I would then put that set in, and run the script that imported the drives and ran an rsync to move the most current snapshot data onto them. So the offsite sets were limited to weekly granularity. The offsite sets were 1/3rd the size of the on-site sets, so the expectation was that the on-site system would have the most day-to-day granularity, while the offsite DR copy was less granular but provided a complete DR set.
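The weekly refresh could be sketched like this, assuming pool names "backup" (on-site) and "offsite" (the removable stack); as before, RUN=echo prints the commands instead of running them:

```shell
#!/bin/sh
# Weekly offsite refresh sketch; the pool names are assumptions.
RUN="${RUN:-echo}"

# newest snapshot from a listing on stdin; the YYYY-MM-DD naming
# convention means lexical sort order is date order
latest_snapshot() { sort | tail -n 1; }

offsite_sync() {
  latest=$1                 # e.g. backup@2008-06-02
  $RUN zpool import offsite
  # copy the newest read-only snapshot via the hidden .zfs directory
  $RUN rsync -aH --delete "/backup/.zfs/snapshot/${latest#backup@}/" /offsite/
  $RUN zpool export offsite
}

# usage on the backup server:
#   offsite_sync "$(zfs list -H -o name -t snapshot | grep '^backup@' | latest_snapshot)"
```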
Rsync and snapshots came together for that job in a way that, at the time, was surprising. The combination reduced the bandwidth needed for a network backup enough that direct off-site backup would have been possible. For a fraction of the cost of the off-site records people carrying the drives in and out, I could have located a backup server in a secure datacenter and added day-to-day granularity to the off-site copy. Or for even less, I could have located the off-site storage at a remote site, or in the home of one of the company principals.
Making the first backup was the "expensive" part, both network-wise and time-wise. Once the backup is seeded with one complete set of data, the daily deltas are quite small. If I had included log files, the daily deltas would have been much greater, as the snapshots would do nothing to collapse disk space inside the log files. True deduplication would do that in theory, but deduplication is not a sound theory in my opinion.
On my next job I worked with a network of 10 hospitals and 65 clinics. Each hospital had a data center full of VMware servers housing the clinical staff's virtual desktops, so each clinician could walk to any computer terminal, log in, and see their still-running Windows desktop. We had terminals in nursing stations, offices, and scattered all over the place on carts. Each of those systems' data was stored on an EMC SAN, and in two of the larger hospitals we had an Avamar system running across a mirrored pair of SANs. Integrating Avamar took so long that the SAN lifecycle was exhausted before Avamar was fully implemented. Avamar deduplication was supposed to collapse the dataset to minimal space, because there are huge levels of file-level redundancy when you are talking about a bunch of Windows boxes... But we hired Avamar themselves to do the work, and they were later acquired by EMC, and with all of the resources of Avamar and EMC working on this, they failed to implement a working Avamar system in the time it took the new SAN hardware to become obsolete and be replaced. In a few weeks, with simple shell scripts, I accomplished a reasonable level of machine-by-machine deduplication that outperformed what EMC was never able to implement even with a limited dataset. The hospital didn't run everything through Avamar; they just did one single SAN pair as a proof of concept before attempting a larger-scale Avamar system. It would have been a huge win if EMC had managed to get us to double the number of SANs we had, so there were many millions of dollars riding on the success of that proof of concept. Clearly something was proven in that effort, but it was limited to a proven failure. I moved on to another job before I heard of any large-scale successes in that system. I did suffer through my enterprise VMware datasets being migrated and completely lost in the process... Only the data I was keeping on local unmanaged storage survived EMC.
There was no rollback capability: when my data dropped from the remote SAN, that event was replicated to the production SAN, and it was gone. It was human error, and the criticality of the data was minor, but between that and the apparent impossibility of deduplication across a large dataset, I am wary of people talking about deduplication. I think deduplication works on small datasets, but the combination of rsync and snapshots is very "lightweight" and works quite well for many larger datasets, as long as deltas can be isolated as files rather than data inside of a file. But that requires application design upstream of the backup process. That, or an rsync system that is application aware. Enterprise databases can do this to some extent...
George