Rsync or Replication?

Status
Not open for further replies.

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Is there any particular reason that I should consider moving to replication instead of continuing with rsync?

My setup: I've got two FreeNAS 8 boxes, HP MicroServer N36Ls, connected by GigE and all UPSed.

Box1, primary NAS:
FreeNAS 8.2
5GB RAM
4x2TB RAIDZ1 (6TB)

Box2 just contains a backup of Box1, no data of its own. It's also used for version upgrade testing:
FreeNAS 8.3b3
1GB RAM
4x1TB striped (4TB)

So, first question: will a 6TB dataset replicate to a 4TB destination? If not, that's the decision made for me.
(Box2's disks are the previous primary NAS's RAID set, handed down into the backup box, which is why they differ.)

I'm currently backing up Box1 by an rsync pull from Box2, which works fine.
I've also set up weekly snapshots, keeping eight, on both.

What benefits would I see from replication over what I have now?
Data transfer time is not an issue - it's a domestic situation without urgency.
The file count isn't high enough for rsync's memory use to be a problem, even though Box2 only has 1GB.
Would I lose the ability to compress the data on Box2 but not on Box1? Compression looks like a filesystem option, so I'd guess I keep that choice either way.
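For concreteness, here's roughly how the two approaches compare at the shell. This is just a sketch with made-up pool/dataset names, not my actual config:

Code:
# What I do now: Box2 pulls from Box1 with rsync over SSH.
# rsync walks the whole file tree and compares every file.
rsync -a --delete -e ssh root@box1:/mnt/tank/data/ /mnt/backup/data/

# The replication alternative: snapshot, then send the snapshot.
zfs snapshot tank/data@weekly-1
zfs send tank/data@weekly-1 | ssh box2 zfs recv backup/data

# Later runs send only the blocks changed between two snapshots -
# no file-tree walk at all (add -F to recv if the destination has
# been touched since the last snapshot):
zfs send -i weekly-1 tank/data@weekly-2 | ssh box2 zfs recv backup/data

# Compression is a per-dataset property, so (I believe) Box2 can
# compress its copy regardless of what Box1 does:
zfs set compression=gzip backup/data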
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Given the thrilling back-and-forth discussion on this so far, I'm beginning to suspect there's no particular reason for me to change from rsync...! :cool:
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
I have the same questions, really. I want to see if I can rsync data to an entirely separate set of hard drives connected to the same box as the source storage system, but that does seem rather silly in some ways. The purpose is to have a backup of my ZFS filesystem. It might be best to push the data being backed up to another box.

If I use snapshots, it seems to me the destination (pull) box will have to have a ZFS disk configuration that exactly mirrors the one on the source (push) box. I haven't done any research or trial and error yet to see if my supposition is correct. On the plus side, snapshots give me a rollback capability: if I have snapshots taken for Monday through Friday, then on Saturday I could roll back to Wednesday's snapshot and recover some file from that day. This does sound like an expensive form of backup, though - all those extra hard drives to buy.

Rsync doesn't give me that - or I don't think it does. In my customer's case, rollback might not be a wanted feature anyway.
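From what I've read (untested on my part - the dataset names and paths here are invented), a full rollback isn't even needed to recover one file, since snapshots are browsable read-only:

Code:
# Snapshots appear under the dataset's hidden .zfs directory,
# so Wednesday's copy of a file is just a copy away:
zfs set snapdir=visible tank/docs
cp /mnt/tank/docs/.zfs/snapshot/wednesday/report.odt /mnt/tank/docs/

# A full rollback works too, but -r destroys the later
# (Thursday/Friday) snapshots along with everything written since:
zfs rollback -r tank/docs@wednesday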

Bob
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
I could set up two FreeNAS virtual machines on my existing box, hosted on Fedora 17 or the upcoming 18, or maybe CentOS 6.3 if it can host guest machines, then rsync all the data from the source machine to a separate set of hard drives on the destination machine. That would be cheap and make efficient use of the physical server itself. I could increase system memory to 64GB to help things along.

Bob
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
Before creating all these virtual machines with various OSes, I'd say the first step is to read the rsync man page. Some of the questions asked here are plainly answered there. For instance...

You can also use rsync in local-only mode, where both the source and destination don't have a ':' in the name. In this case it behaves like an improved copy command.

In other words, you can copy from one pool to another on the same machine.
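For example, something like this (paths invented) copies from one pool to another on the same box:

Code:
# No ':' in either path, so rsync stays entirely local.
# -a preserves permissions/ownership/times; -v lists what gets copied.
rsync -av /mnt/tank/data/ /mnt/backup/data/

# The trailing slash on the source means "the contents of data/",
# not the directory itself.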
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
That would be cheap and make efficient use of the physical server itself.

I'm not really clear on why you would consider this to be a backup.

You're moving data from one VM to another, sure, but the underlying files that make up the two VMs' disks could still be on the same physical disks. That does two things: it fails to improve your data reliability, and it makes the physical disks unnecessarily busy.

The host machine is a single point of failure for both VMs.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Well, my idea with virtual machines is this: I believe each virtual machine can have its own set of hard drives. For example, a FreeNAS virtual machine can have complete access to all the drives I add to the ZFS pools. A second FreeNAS virtual machine can run with its own, completely different, set of hard drives, and can be denied access to the hard drives of the first FreeNAS virtual machine. A CentOS 6.3 virtual machine could have its own hard drive. And so on. Of course, the host operating system has to pass device access through to the guest virtual machines.

Supposing this is correct, my idea is to modify my current FreeNAS configuration as follows:

a. Remove one of the IBM M1015 SAS HBAs.

b. Add an SAS expander.

c. Add a sufficient number of new hard drives to provide backup capacity for 11TB.

d. Create a FreeNAS virtual machine which is a second instance of FreeNAS, and allow it to access only the hard drives added in step c.

e. Rsync data from the first FreeNAS instance to the FreeNAS virtual machine created in step d.

The advantage of this is that I don't need to set up a second physical box with FreeNAS, find real estate to put it in, or pay the additional running costs, such as a higher electric bill. I think it is much more expensive to set up a separate box than to use a virtual machine, an SAS expander, and additional cabling.


Bob
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
The disadvantage is that your "backup" server is no such thing. It's merely more space on the same server, and therefore susceptible to anything that takes out the host.

You don't gain any of the usual advantages of running in a VM, since the way you're allocating host disks to the backup VM means that you lose all the portability of a VM - the disks are tied to the host hardware, and will be a pain in the arse to recover if the host pops.

Really, you should enumerate the likely failure modes and consider them carefully. There are very good reasons why backup servers - VM or not - are kept on physically separate hardware, on separate power supplies, separate network switches, separate racks - and preferably in physically separate datacentres.

Here are a couple of scenarios to get you started:
1) The host OS wedges due to a kernel panic, and the virtual disk writes in progress have not been passed through the VM to the physical disks.
2) The host rack suffers cooling failure, and disks/cpu bake.
3) PSU fails.
Etc, etc. Keep enumerating, and compare with how you would survive each one with your NASes on two physical machines.

My analysis? There's no point at all in having a "backup VM" if it's on the same hardware. You would be better off simply raising your RAIDZ one level (e.g. RAIDZ2 to RAIDZ3). But it's not my data, so it's not my decision.


Now, back to my questions! Does anyone know if ZFS replication needs the receiving filesystem to be the same size as the sending one? I would imagine not, since it's not a naive blockwise copy, but I can't find an answer.
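If that's right, the check would just be whether the destination has enough free space for the data the source dataset references - something along these lines (names invented):

Code:
# Space actually referenced by the dataset to be sent:
zfs list -o name,used,referenced tank/data

# Free space on the destination pool - assuming the stream only
# needs room for the data, not a pool of equal size:
zfs list -o name,available backup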
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
If you'll accept my answer without a reference, no, they don't have to be the same size.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Lovely, ta.

Which takes me all the way back to the top - is there a sound reason to change from rsync to replication?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The disadvantage is that your "backup" server is no such thing. It's merely more space on the same server, and therefore susceptible to anything that takes out the host. [...]

YOU WIN!

Your backup and original are on the same hardware. That's BAD juju. You can virtualize one or both, but never ever on the same hardware. All your "backups" are doing is protecting you from a few failures - you can almost accomplish the same level of protection with snapshots. Sure, you are protected from SOME failures, mostly dumb mistakes (deleted files, zpool deletion, etc). But I can think of a lot of scenarios where you lose everything.

If your power supply fails and takes all of your hardware with it, you have a total of zero copies of your data. The same goes if you pick up some kind of virus or trojan that writes garbage to all of the drives, or if you physically drop the hardware and destroy a bunch of drives. I can think up lots of these.

The bottom line: you don't have true backups. You simply have a second copy of your data, protected from a failure of your OS and a lot of other failure modes - but only VERY slightly more useful than mirroring your zpool. No business would EVER call two virtual machines on the same physical hardware a backup. Big businesses, where time is BIG money, won't even consider two copies of the data in the same physical building a backup!

There's a reason why I hate virtualizing file servers: too many smart people get lulled into false security. Don't be one of them!

No, those aren't backups. We go through this argument about virtualizing and backups every few months. There's just no protection from a long list of failures. In the end, you don't want backups for the failure modes you CAN think of - you want them for the failure modes you CAN'T think of.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
If your power supply fails and takes all of your hardware with it, you have a total of zero copies of your data.

Can you define what a "power supply failure" is? In 10 years of providing end-user computer support, I've never seen a single power supply failure kill anyone's data. Most end-user systems suffer from malware attacks far more than from hardware failure.

Bob
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you define what a "power supply failure" is? In 10 years of providing end-user computer support, I've never seen a single power supply failure kill anyone's data. Most end-user systems suffer from malware attacks far more than from hardware failure.

If your power supply's regulator fails and outputs a high enough voltage (and believe me, they can output more than 18V if adjusted, so they can definitely fry some stuff), you can kill just about everything in your computer. These days a lot of power supplies don't have overvoltage protection.

But really, are you REALLY going to argue with me over the power supply issue when I can think up so many other failure modes on top of that one? Virtualizing your backup server on the same physical hardware as your actual data is just flat out a big no-no.

I once saw a computer in which every board had burned components - not all of its components, but it only takes one to ruin a board. The power supply smelled of burned electronics; it hadn't blown a fuse, but it wouldn't power on. We didn't bother troubleshooting further after we took the cover off and examined it.

I also saw a RAID controller go berserk and start writing garbage data. The more the system "saved" to the hard drives, the more corrupt data was stored. When we looked closely, the data we had written wasn't there and the file system was going to crap, so we rebooted. The RAID controller had suffered a hardware failure but continued to function just well enough to write garbage every time real data was supposed to be written. All the attached drives (six individual drives, no arrays) had badly corrupted file systems, with corrupt files all over the place. Very little was salvageable from that system. Unfortunately, the files that were constantly being accessed were the important ones, and they suffered complete corruption because they were rewritten every few minutes. After that incident I've always wondered whether the same will ever happen to my server, despite the fact that both systems have/had ECC cache.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Let me analyze this a little more. Long ago I read W. Curtis Preston's book "Unix Backup and Recovery", and I still have it on the shelf. That book is old - late 1990s and out of print - and its setting is the data center. Preston discussed writing backup data to various forms of tape media. Isn't that an example of a host machine writing to its own physical devices? Those hosts could certainly suffer hardware failure too, and no doubt they did, but the threat of hardware failure wasn't enough to cancel out the benefit of doing regular backups. He also discussed using remote copy to send data to another host, but noted the protocol was insecure. I recall he was definitely more focused on tape backups, and on testing the actual backups to make sure data could be restored when an emergency struck.

Mainframe systems have been virtualized for a long time now and are normally connected to numerous devices, and these virtual hosts write backup data to those connected devices. Again, hardware failure, while always possible, isn't a sufficient reason not to do backups on a regular schedule using the resources available.

Backup solutions aimed at Tier 3 end users likewise focus on having the user's current system send data to some type of connected media, or increasingly to a remote host somewhere in the cloud - Dropbox, for example. External drives connected to the user's system are very popular choices. The user's system can certainly crash due to hardware failure; that doesn't preclude doing a backup, however.

What these systems seem to do is focus on writing data on a regularly scheduled basis to removable media, whether this is tape media or a MyBook. Or send the data to some entity in the "Cloud", a remote host with storage capacity.

So I reason that a virtual machine sending data to another virtual machine for the purpose of creating a backup is plausible, provided the storage media is portable and backups are done on a regular schedule. There is a risk of hardware failure, but that is true of all hardware, and the risk is acceptably low. Supposing my physical hardware self-destructs the night of the third backup, what do I do? For me, the downtime would be at most two weeks while waiting for replacement hardware to come in. Between the ZFS array I am trying to protect, the backup disks from the third night, and the backup disks from the successful second night, I should be able to recover with very little data loss.

Setting up a second physical host merely to receive backup data is an extremely expensive course to implement, while a virtual guest makes use of the excess capacity of the physical host at very low risk. Considering the time that would have to be invested in setting up a second physical host, the cost of its components, the monthly cost of housing it somewhere, and its other maintenance costs... this is an interesting exercise in weighing benefits, costs, and risks.

I think Preston operates a website devoted to backup and restore issues. Perhaps I ought to consult that as I consider the options.

Bob
 

Joshua Parker Ruehlig

Hall of Famer
Joined
Dec 5, 2011
Messages
5,949
I think backups on the same system are still useful if they're on different hard drives.
For example, I have a dataset where I store my documents / pictures / misc files. I'm thinking of adding another hard drive (likely external) to my system as a separate zpool, just to hold slave datasets. I'd consider the data on my master zpool the hot data, the snapshots on the slave dataset the local backup, and the rsync to an offsite location my remote backup.

Now, sure, both local zpools are susceptible to getting wiped out or stolen at once, but that's what the offsite copy is for.

As for my webserver files / database backups: I consider the data on the webserver the hot data, the dumps + file backups on my local FreeNAS the local backup, and the rsync to an offsite location the remote backup.
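Roughly, the tiers look like this at the shell (names invented - in practice I'd cron it):

Code:
# Tier 1 is the live data on the master zpool.

# Tier 2 (local backup): replicate snapshots to the slave zpool.
zfs snapshot master/docs@snap1
zfs send master/docs@snap1 | zfs recv slave/docs           # first full run
zfs snapshot master/docs@snap2
zfs send -i snap1 master/docs@snap2 | zfs recv slave/docs  # later, incremental

# Tier 3 (remote backup): rsync push to the offsite box over SSH.
rsync -az -e ssh /mnt/master/docs/ backup@offsite:/backup/docs/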
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Data can barely be considered to exist unless it's in at least three places - it's too ephemeral otherwise. This is a practical point, not a philosophical one.

Your two copies of the data will essentially be in the same place, therefore it's not a backup.

Why bother going to all the effort of using a clever, data-scrubbing filesystem like ZFS when you haven't arranged a real backup?

Your backup box could be as simple as a Microserver with 5x3TB drives elsewhere on the network, costing peanuts (it needs no UPS and no special network services; replicate encrypted and you're fine).

On the third hand, if you can handle two weeks' outage without the business going down, and don't mind having to "recover" your data rather than just having it there available to you, then your needs are very minimal indeed - less protection than I give my home data (see the top of this thread), let alone work data.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
[...] What these systems seem to do is focus on writing data on a regularly scheduled basis to removable media, whether this is tape media or a MyBook. Or send the data to some entity in the "Cloud", a remote host with storage capacity.
That last part is the secret: removable media. Tape backup is totally different from hard disks for multiple reasons - most notably, the drives have their own power supply, the tapes are usually physically removed after the backup is completed, and they are often proprietary enough that you can't destroy the data on them with a single command that normally serves useful functions (dd, for example). Not to mention the fact that, for the quantity of data you back up, a tape drive is virtually out of the question.

Tapes are not readily accessible to the OS at millisecond notice. Often with tape, if you "update" a file, the drive simply writes another copy of the file to wherever the next free space is. If you then do a recovery you'll see multiple copies on the same tape.

While tape stores data just like a hard drive, its inner workings and operation are very different. SATA drives often need zero drivers - the motherboard drivers are common and generic and will "just work" with most OSes in use. Tape is often totally different. A single tape drive may be plug and play, but backing up data will certainly require someone to swap tapes out VERY regularly, and if a tape is destroyed you just go to the previous day's tape. Multi-deck and library setups can't access all the tapes at once and wipe them all out: typically the machine is sent "load tape 54", then "mount tape 54", then "write this data", and each command must finish before the next is sent. That is not true of SATA (obviously, to anyone who understands NCQ).

So no, tape is not the same as hard disks.

Also, you said that your storage media is portable. I'm sorry, but I disagree. Just because it's hot-swappable and in a tray doesn't make it "portable". I'd consider tapes portable; hard drives much less so. Every USB hard drive I've bought in the last 6-8 years has come with a big red sheet of paper in the box saying "I am not portable - do not use me as such". Even the ones with laptop drives! The only exception was a Kingston HyperX (an SSD).

While your logic is sound, it's missing a lot of the bigger picture. You are welcome to sleep well at night with your backups; I know I wouldn't. I've seen virtual machines go down and take all of the data they had attached with them to data heaven. I virtualize FreeNAS only for experimenting and testing things. The only files in my virtual machine are a backup of my HTPC, laptop, and desktop, and the virtual machine is located on my file server - notice that my backups share no physical hardware with the computers they back up. I've never had to use the backups, and I've occasionally crashed FreeNAS pretty badly (woohoo for snapshots). But I'd never think of setting up a system where the original and the backups are always online and always available to the OS, like you're proposing.

Bigger picture: you are talking about virtualizing your backups to save the $500-800 (perhaps less) of building another physical system. That tells me your data is worth less than $800. If your data were worth $250k, you'd be paying a large company for offsite backups, or at least dedicating time to asking a friend to lend you a drawer in his bedroom to store AES-256 encrypted hard drives. It's always easy to see what someone's data is worth by looking at how much they're willing to spend: when they won't spend more than $x, that's what they value their data at. I've seen a computer virus wipe out data on internal and externally connected drives simultaneously. Risk is the name of the game. If you are okay with that risk, then a USB drive is for you. If it's happened to you before and you think it's likely to happen again, you'll DEFINITELY consider another form of backup.

If you sleep better at night because of what you call backups, then great. If I sleep better at night because of what I call backups, then great. When you lose your data I'll feel bad for you, but I won't miss your data anywhere near as much as you will - and the reverse is true if I lose mine. It's as simple as that.

You've chosen to accept certain risks for your configuration. Either they are acceptable risks for you, or you don't know enough to realize the risks (which is universally bad). If you accept the risks versus the cost, then great. I had a tornado go down my street in 2010. It scared the crap out of me and made me realize that if my house were leveled (my neighbor's was), I would REALLY miss some of my data - pictures of my time in the Navy, going all over the world, would be gone forever. So I bought a $150 2TB drive and backed up all of the data whose loss would devastate me, including scans of important documents and pictures of my entertainment center and what-not, and had a friend 25 miles away keep it in a shoebox in his basement. I will tell you that when you lose all of your data you are heartbroken beyond explanation. To me, spending that $150 made me feel much better after that brush with the tornado.
 