Backing up virtual machines

Status
Not open for further replies.

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
ZFS is awesome for backing up data like files and such, but we're also hosting virtual machines with VMware on a FreeNAS NFS share. From time to time I need to restore those machines.

I'm finding restoring from a snapshot doesn't work that well. Even though ZFS is copy on write, when I restore from a snapshot the VM acts as if it's had its power pulled. I guess that makes sense. Anyway, it usually complains the filesystem needs to be checked or repaired.

So I'm wondering what the vmware / kvm people here are doing for backup? Are you pausing the VMs and then running a snapshot? Or relying on something else for backups like Veeam?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The only way to have 100% working VMs is to first shut down the VM, take a VM-level snapshot, or otherwise get it into a consistent state, and then take the ZFS snapshot.

Veeam may or may not solve the problem, you'd have to look into Veeam. I don't use it so I can't really talk about how well it does or doesn't work.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Thanks for chiming in Jock. I can always count on you to reply faster than I can even post :)

Yeah, I figured as much about shutting down the VMs. I thought about trying to time it all: have vCenter shut down the VMs at a particular time, then take the ZFS snapshot, then power them back on. I worry that I'd be counting on too many things to work correctly every night. With my luck one of the servers would stop booting at a question and I'd wake up to a bunch of phone calls in the morning because some server was still down.

I've skimmed the Veeam site. I've seen lots of marketing speak, but not much to answer some questions I have. Gonna call a sales rep to get more info tomorrow and then spin it up in a lab.
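
A minimal sketch of that timed shut down, snapshot, power on sequence, mainly to show where it can fail. The callables `shut_down`, `take_snapshot`, and `power_on` are hypothetical placeholders for whatever actually talks to vCenter and ZFS; nothing here is a real vCenter or FreeNAS API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class NightlyReport:
    """What happened during one run, so someone can be alerted."""
    snapshot_taken: bool = False
    not_stopped: List[str] = field(default_factory=list)
    not_restarted: List[str] = field(default_factory=list)


def nightly_cold_snapshot(
    vms: Iterable[str],
    shut_down: Callable[[str], bool],   # hypothetical: True if the VM stopped
    take_snapshot: Callable[[], None],  # hypothetical: takes the ZFS snapshot
    power_on: Callable[[str], bool],    # hypothetical: True if the VM booted
) -> NightlyReport:
    report = NightlyReport()
    stopped: List[str] = []
    try:
        for vm in vms:
            if shut_down(vm):
                stopped.append(vm)
            else:
                report.not_stopped.append(vm)
        # Only snapshot if every VM is actually down; otherwise the
        # snapshot is no better than a crash-consistent one.
        if not report.not_stopped:
            take_snapshot()
            report.snapshot_taken = True
    finally:
        # Always try to bring back whatever was stopped, even on error.
        for vm in stopped:
            if not power_on(vm):
                report.not_restarted.append(vm)
    return report
```

Anything left in `not_restarted` is the "server still down in the morning" case, so a real job would want to page someone on a non-empty list rather than just log it.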
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well I am using KVM with QEMU and my virtual disks are QCOW2. I back them up while the machines are running which works via built-in snapshots (not ZFS snapshots) but I have not had a problem with it complaining about needing to check the disks.

http://wiki.qemu.org/Features/Livebackup

Not really familiar with backup process with ESXi.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I will say that this kind of integration with ESXi is potentially coming in 9.3.

If it goes as currently planned, then you'll set up snapshots of your iSCSI zvol (which is the only recommended way in 9.3+ for performance reasons). When a snapshot of the zvol is due, it sends a command to the ESXi machine to quiesce the file system, then a snapshot of the zvol is taken, then everything continues as normal.

I can't vouch that it is for 100% certainty going to be in 9.3 for a bunch of reasons, but that's the plan as I heard it last week. ;)
 

PacketLoss

Cadet
Joined
Oct 31, 2014
Messages
2
@jamiejunk aside from appearing not to have been properly shut down, do the VMs come back and function properly?

I ask because I just ordered parts to build a FreeNAS machine to do JUST what you're doing and "backup while the VM is online" (replication of snapshots) is the linchpin for the whole plan. If you end up with corrupted VMs, then I need to think about other options.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The problem with VM images is that you are taking a snapshot of the disk at some point when the OS may not necessarily have a coherent state on disk. In general, the OS is unlikely to have a coherent state unless maybe the filesystem is mounted read-only or something like that. You are indeed effectively getting a disk image that looks for all intents and purposes like someone powered off the machine while it was up and running. Most OS's will need a consistency check or fsck.

Products like Veeam have put in hooks into VMware and Windows to cause the system to be able to generate a quiesced snapshot, but this basically requires some hooks at the hypervisor and OS level in order to make it happen, and additionally it often doesn't work if your VM's are busy with lots of I/O.

In the service provider world, it isn't uncommon to simply have multiple snapshot backups and pray that one of them is retrievable.
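
The "keep several snapshots and hope one is retrievable" approach can be sketched as a newest-first search. This is only an illustration: `is_recoverable` is a hypothetical check (for example, clone the snapshot, boot the VM on an isolated network, and see whether fsck succeeds), and the snapshot names are made up.

```python
from typing import Callable, Iterable, Optional


def newest_usable_snapshot(
    snapshots_newest_first: Iterable[str],
    is_recoverable: Callable[[str], bool],  # hypothetical restore test
) -> Optional[str]:
    """Return the most recent snapshot that passes the restore test."""
    for snap in snapshots_newest_first:
        if is_recoverable(snap):
            return snap
    # Every candidate failed; there is nothing to restore from.
    return None


# Made-up snapshot names, newest first.
candidates = [
    "tank/vms@auto-2014-11-03",
    "tank/vms@auto-2014-11-02",
    "tank/vms@auto-2014-11-01",
]
```

The more frequent the snapshots, the less data is lost when the newest one turns out to be unusable, which is the main argument for keeping many cheap snapshots around.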
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
FreeNAS has been great for doing user files over NFS. Snapshots on user files? Amazing! Replication? When it works, it's awesome.
Hosting VMs, it could be better. I believe in ZFS. But it's very trying at times. You can throw a shit ton of money at it, and still not get decent performance. Or at least I can't.
As for restoring VMs from snapshots: it works, but like jgreco said, the filesystem is in a state where it's like you pulled the power plug. I've lost a few VMs where I couldn't get the filesystem to fsck and was basically screwed. Has that happened a lot? No. But it has happened. So I'm looking for something to keep that from happening.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I use Veeam for our VMware cluster to back up to a FreeNAS box, and that FreeNAS box is then replicated offsite for disaster recovery. Same goes for our three other locations. Veeam has been great for backups: you can spin up an archived VM with networking disabled, get single-file recovery, and Veeam also indexes the VM's file system so you can search for data without opening all your backups.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The main problem is that you lose the benefit of snapshots - virtually instantaneous, frequent, nearly free backup images. Veeam means you need to have a virtual machine running it (or multiple VM's), and it needs to access and read the backed-up VM's data, and write the data out somewhere, and all of that. This means significantly more disk space required, more CPU, etc. But it is a nice system especially if you're doing Windows or something like that.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That's why FreeNAS 9.3 is coming with the new ESXi support. ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I know the WebGUI portion is there. I can't vouch for how well it works or if it's been tested extensively since it's so new. I personally haven't tested it, but I know that the dev that was working on it had done some experiments with it and it worked well for him.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, that's great and all, though getting all the details right is actually rather complicated. The real problem is that all the virtual machines on a volume or datastore or whatever need to be quiesced at the moment the snapshot occurs, which is really rather more difficult to arrange than it might seem at first. Sadly, none of this stuff is actually magic.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And that's precisely what 9.3 takes care of. It will make sure that the file systems are quiescent at the moment of the snapshot. Josh Paetzel is taking care of that because we've seen users with this problem and the only "good" solution is to manage this entirely on the FreeNAS/TrueNAS server. Even things like Veeam work fine, but there's no way to be 100% sure that Veeam has quiesced the file system before the ZFS snapshot takes place since Veeam has no way of communicating with FreeNAS. So FreeNAS will talk to ESXi directly to make everything all p-rrty and stuff. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The latter half of your message makes no sense; I'm not sure what role you think Veeam is playing there. The first half is merely strange.

Okay, now, here's the problem. This paragraph is background for the audience; I think and hope you already understand it. So you have a virtual machine disk file being served up via NFS. We'll call it a vmdk. So right at >< this moment you take a snap while running, which gives you an inconsistent disk image because, y'know, some stuff might have been in flight or not fully committed and who knows what the OS does. If you look at the snapshot, it looks like the image of a disk from a machine someone shut down hard while it was running. It may be recoverable via fsck/chkdsk/whatever, or it may not. In an OS that carefully writes its metadata out in an orderly fashion, it ought to be recoverable, but, y'know, real world and all that. This is the conventional "snapshot" problem.

So VMware introduced a sync driver component to VMware Tools, which allows the hypervisor to ask the VM to quiesce the disk(s). The OS receives a quiesce request and is then supposed to flush dirty buffers, and/or anything else necessary to make the filesystem consistent and suitable for backup. This operation necessarily causes (in simplest form) I/O to be paused within the VM in order to allow the hypervisor time to execute a snapshot, and a corresponding resume operation will return things to normal operation. Note that there are actually a few different mechanisms for the quiescing operation, including the sync driver, the vmsync driver, or Microsoft VSS. But let's keep this simple.

During that quiesced state, it is safer (not necessarily safe, merely safer) to make a snapshot and you're more likely to get a consistent disk image. But there are caveats. The biggest ones:

1) The VM has to be running appropriate drivers and an OS that is agreeable to quiescing.

2) The VM's I/O has to be of a nature that it can be quiesced for a period of time; many busy server VM's do not qualify.

But now comes a more interesting issue. Let's take it for granted that you have a VM where you can actually do this. Lots of VM's can. But that's the problem. Your typical datastore and hypervisor do not service a single VM. You might have a hundred VM's on that datastore.

Doing the snapshot at the datastore level requires that *all* the VM's be quiesced, which basically introduces a new issue, which is that you get a shitstorm of writes to the datastore as all the VM's flush their buffers, and then they ALL have to wait, both for the quiescing to succeed, then for the snapshot to occur, then for the resume. Because the datastore-level snapshot can't pick and choose which VM to snapshot.

(By comparison, Veeam uses ESXi VM snapshots to manage its tasks, which means that it is doing it one VM at a time. But there's a lot more data being shoveled around. This is not necessarily better or any more desirable, just a different set of evils.)
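
To put rough numbers on the difference, here is a toy model, not a measurement. It assumes each VM's I/O is paused from the moment it is asked to quiesce until it is resumed, that the datastore-level approach quiesces all VMs at once and waits for the slowest flush, and it ignores the extra contention caused by everyone flushing at the same time (which only makes the datastore-level case worse). The flush times are invented.

```python
from typing import List


def datastore_level_stalls(flush_seconds: List[float],
                           snapshot_seconds: float) -> List[float]:
    """All VMs are quiesced together, so each one waits for the
    slowest flush plus the snapshot before it is resumed."""
    window = max(flush_seconds) + snapshot_seconds
    return [window for _ in flush_seconds]


def per_vm_stalls(flush_seconds: List[float],
                  snapshot_seconds: float) -> List[float]:
    """Each VM is quiesced and snapshotted on its own, so it only
    waits for its own flush plus its own snapshot."""
    return [f + snapshot_seconds for f in flush_seconds]


# Invented example: two quiet VMs and one busy one.
flush = [1.0, 2.0, 10.0]
print(datastore_level_stalls(flush, 0.5))  # [10.5, 10.5, 10.5]
print(per_vm_stalls(flush, 0.5))           # [1.5, 2.5, 10.5]
```

In this model the quiet VMs pay for the busy one under datastore-level snapshots, while the per-VM approach keeps each stall short at the cost of more snapshot operations and more data being moved.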

So then we go off into alternative realms, and it just gets worse.

A) If you're not using NFS and are instead using iSCSI, then you have a vmfs3 or vmfs5 formatted zvol; this is more-crazymaking to recover things from and still suffers approximately the same issues.

B) You can provision a separate ZFS dataset for each VM. This gives you the sort of granularity you need for snapshots, but turns into an NFS (or iSCSI) mount nightmare. Nobody really wants an individual datastore for each VM. It doesn't scale.
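
For what option B looks like on the ZFS side, here is a small sketch that only builds the `zfs snapshot` command lines and does not run them. The parent dataset `tank/vms`, the VM names, and the snapshot label are all hypothetical; the NFS or iSCSI mount sprawl on the ESXi side is exactly the part this does not solve.

```python
from typing import List


def per_vm_snapshot_commands(parent_dataset: str,
                             vm_names: List[str],
                             label: str) -> List[List[str]]:
    """One `zfs snapshot dataset@label` command per VM dataset."""
    return [
        ["zfs", "snapshot", f"{parent_dataset}/{vm}@{label}"]
        for vm in vm_names
    ]


def whole_tree_snapshot_command(parent_dataset: str,
                                label: str) -> List[str]:
    """A single recursive snapshot covering every child dataset."""
    return ["zfs", "snapshot", "-r", f"{parent_dataset}@{label}"]


# Hypothetical layout: one child dataset per VM under tank/vms.
for cmd in per_vm_snapshot_commands("tank/vms", ["web01", "db01"], "nightly"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` on the storage box. The per-VM form lets each VM be quiesced just before its own snapshot, while the recursive form brings back the "everything must be quiet at once" problem.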

Which brings me back to what I was saying earlier in this thread. It's great to have better ESXi support, but it is only incrementally better, because some of the underlying problems are inherently Hard (Big Giant Capital H Hard).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yep. All of that is already well understood by me. FreeNAS' iSCSI initiator will send the appropriate VAII commands to the ESXi server (you actually have to provide login credentials to the ESXi server) to handle all of it.

FreeNAS has the advantage of taking the whole snapshotting and snapshot deletion of the VMs (not for ZFS itself which is nearly instantaneous) and making it take seconds. In the lab we were able to "more safely" snapshot 50 VMs in like 30 seconds flat. ESXi handled it like a champ. On the flipside, Veeam might have taken an hour or more, and you had no guarantee if/when the ZFS snapshot takes place.

Read this link: http://kb.vmware.com/selfservice/mi...nguage=en_US&cmd=displayKC&externalId=1021976

It mentions this...

Native Snapshot Support – Allows creation of virtual machine snapshots to be offloaded to the array.

That's the feature that FreeNAS has integrated into iSCSI and the WebGUI in 9.3. ;)

I would test it for you, but I'm not at home. But as BETA is due in just a few days, I'd tend to think that it's expected to work already.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Just a side note, Veeam supports parallel VM processing so it can quiesce and process a VM per Veeam server core. Mine does 8 at a time currently ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, but that's not what this is about. The options here are basically to do per-VM snapshotting or to do per-datastore snapshotting. Both have their pros and cons. You can certainly parallelize operations for per-VM snapshotting.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm familiar with the feature, but basically this isn't all that useful if it has to quiesce all the virtual machines at once - because to issue the ZFS snapshot, the VM's need to be quiet. 30 seconds is probably about 25 too many for many production environments. The complexity gets dazzlingly great as you get up to a few dozen hypervisors. If the process is being driven from FreeNAS, how would you even know which hypervisor is currently hosting the VM?

p.s. it's VAAI.
 