My Experience with Virtualizing FreeNAS (Disaster example included)

ser_rhaegar · Patron · Joined Feb 2, 2014 · Messages: 358
I've been experimenting with FreeNAS for a couple of months now and thought I would share my experience with virtualizing it. If you're looking for a prime example of how virtualizing can turn into a disaster with a specific, identifiable cause, look no further.

Background - You can skip to the disaster or cause below for a shorter read
A little bit before the new year, I decided to add some server equipment to my home lab (at the time it was only Cisco and ProCurve networking equipment). Up until then I had been using a couple of Cisco 2911 routers with server blades in them, but those cap out at 16GB of RAM (if you can even find the VLP DDR2 modules!) with 2x500GB drives. They were also limited to ESXi 5.0 (from Cisco) with no upgrade options aside from RAM.

So I picked up an HP DL160 G6 with dual 4-core processors and 48GB of ECC RAM. I spent a lot of time looking around the internet for suggested ESXi configurations and came across a lot of examples on HardForums of AIO boxes using FreeNAS or another ZFS platform as the storage component. After trying out a few options, I ended up choosing FreeNAS for the simplicity (best GUI, simplest config, in my opinion at least).

I set up my first FreeNAS VM with a RAID card in passthrough that supported JBOD but not SMART passthrough. This worked well with 4x4TB WD Black drives, aside from the lack of SMART support. My only other problem with it was the ridiculous power draw.

At work I switched our server platform from IBM to HP in the last year, as I really like their support and IPMI platform (iLO). Compared to the IMM from IBM and the DRAC I saw on a few Dells at work, the iLO was far superior (subjective, I know). On top of the web interface and the .NET and Java IP KVM clients, it has an iOS app for remote KVM. I can't tell you how many times the iOS app has come in handy for the servers at work.

Anyways, because of this I wanted to find a Gen8 HP server for home as well, something without the 8k price tag obviously. On a few of the forums where I researched ESXi lab builds, a lot of users had HP Microservers (the old gen and the Gen8s). These are low power, compact, cheap, and they have iLO to boot. Seemed like a perfect fit. I picked up an open-box unit from NewEgg, 2x8GB of refurbished HP ECC RAM and an M1015 card off eBay, and an E3-1265L Xeon off Amazon. When the open-box unit arrived, I first loaded FreeNAS directly on the system to test it out and moved my disks from the DL160 G6 to the Microserver along with the new M1015. The move went without a hitch, so I loaded up ESXi, virtualized FreeNAS, and passed the M1015 through to the VM. It was beautiful, and SMART was working now as well.

From here I started replicating data from the Microserver VM to the DL160 VM: hourly snapshots and replication with a two-month retention. This is when I started rotating a cold backup off site. Even though I wasn't storing anything of great importance on the system yet, I wanted to make sure I understood the backup solution I was going to use. Things were looking good, but I still wasn't finished making changes.
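For anyone curious, under the hood that boils down to the usual ZFS snapshot/send routine. Here's a rough sketch from the shell (the pool, dataset, snapshot and host names are just placeholders; FreeNAS handles the scheduling and retention for you in the GUI):

Code:
# take the recursive hourly snapshot on the main box (names are examples only)
zfs snapshot -r tank/data@auto-20140301.1400

# send it incrementally to the backup box, based on the previous hour's snapshot
zfs send -R -i tank/data@auto-20140301.1300 tank/data@auto-20140301.1400 | \
    ssh backup-host zfs receive -F backup/data

# snapshots older than the retention window get destroyed on a schedule
zfs destroy -r tank/data@auto-20140101.1400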

The whole point of the Microserver was to cut power, but if I used the DL160 as a replication target I had to keep it on all the time, which wasn't good for conserving power either. Ok, I could pick up another Microserver, but now I wanted something with more expandability. In the end I picked up an HP ML310e Gen8 v2, another M1015 and 4x8GB of refurbished HP ECC RAM for it. I added a couple of cheap Icy Dock flex-fit trios to the tower to hold 4x2.5" and 2x3.5" drives in addition to the 4x3.5" drives in the system.

I moved my main pool over to this new tower, with the Microserver as the backup target now. This let me dedicate more than 8GB to the FreeNAS VM; I could do a 16/16 split between FreeNAS and my other VMs. The Microserver kept an 8GB FreeNAS VM for replication and now a redundant DNS/DHCP/NTP VM. I also added Plex as a VM on the ML310e along with a few other VMs for various services. I was almost done with changes, and so far FreeNAS had been solid through all my modifications and moves.

My only concern now was that my ESXi storage was a single SSD... I wanted a RAID1 setup for this. So one night last week I picked up a pair of Samsung EVO 250GB drives from Microcenter (price matched to Amazon/BB) to host the ESXi storage. I popped these into the ML310e v2 on the host's B120i and temporarily added the M1015 from the Microserver to it. I moved the original SSD from the B120i to the M1015 and set up the new pair on the B120i, then turned the RAID feature on in the BIOS and mirrored the drives. When I booted up ESXi it picked up the old storage and VMs on the M1015 and the new mirrored SSDs on the B120i. I set up the new datastore in ESXi and migrated my VMs from the old storage to the new. Then I took the system down and pulled the old SSD and the temporary M1015. I booted it back up and everything was golden.

Disaster
A few hours later I went to play a video using Plex from my phone to my Apple TV. It started up OK, but after 2 minutes or so it paused and sat waiting for more data (or so it appeared on the TV). This was a first; normally videos played flawlessly over the wireless to my phone and AirPlayed to the TV. Plex was pulling data off the FreeNAS VM via an NFS share. My first thought was that there was an issue with the wireless, so I messed around with the wireless configs for a bit to try to find where the throughput problem was. I couldn't find any problems, and everything else worked great over wireless.

Ok, so what else could it be? Oh, my Time Machine backup is failing now too. Well, time to check FreeNAS. I hopped on the web GUI and to my dismay the status light was yellow. Bummer. I logged in via SSH and checked the pool, only to find many read, write and checksum errors along with this message:

Code:
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.


Along with this was a list of five files that were not fixable, plus one piece of metadata: my Time Machine dataset.
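If you want to see that list yourself, zpool status with -v prints the individual files with permanent errors at the bottom of the output (the pool name here is just an example):

Code:
# -v lists the files affected by permanent errors at the end of the output
zpool status -v tank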

Not ready to throw in the towel and ask for help on the forums, I spent a while with Google searching the FreeNAS forums (Google usually works better for finding the stuff I want). As I read, I realized my snapshot and replication tasks were still running, and replicating a corrupted snapshot could blow away my hot backup. So I turned off replication and scrubs, and shut down both systems.
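In case it helps anyone in the same spot: I disabled the replication and snapshot tasks in the GUI, but you can also stop a scrub that's already running from the shell (again, the pool name is just an example):

Code:
# cancel an in-progress scrub
zpool scrub -s tank

# confirm the scan line shows the scrub was canceled
zpool status tank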

Temporarily, while troubleshooting further, I loaded FreeNAS onto a USB stick and booted the Microserver off it. There were no issues with this system, but I wanted a solid foundation for recovery once I fixed the ML310e.

From here I swapped out the M1015 in my ML310e and swapped the drives for some older ones. I built a fresh pool and started restoring the good data from my backup to see if the card was the problem. I did not want to reuse the original drives in case I needed them for something later (best to leave them as-is until the issue is fixed). After copying 50GB back over, the pool took a dump again due to a kernel panic. Ok, so it wasn't an issue with the drives or the controller.

Next I ran memtest for 6 passes. While it was running I spent more time reading, but still couldn't find a solution. Memtest finished without error, and multiple long SMART tests on all my drives came back clean as well. What was left? I moved the M1015 to a different slot and tried again. Same problem. But by now I had found a potential cause while researching online.
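For reference, the long tests were just smartctl from the FreeNAS shell, one drive at a time (the device name below is only an example; yours will vary depending on the controller):

Code:
# kick off a long offline self-test on one drive
smartctl -t long /dev/da0

# check the self-test log once the drive reports it has finished
smartctl -l selftest /dev/da0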

Cause
I came across an article on homeservershow (my go-to site for the Microserver) from December, where a few users running the Microserver as an AIO with ESXi/ZFS had their M1015 cards go crazy. Apparently, if you turn on RAID for the B120i controller on the Microserver, passthrough of the M1015 becomes extremely flaky. Great, an answer - my ML310e has the same B120i controller, and I had just enabled RAID on it.

The next morning I put my old SSD back in the system on another RAID controller card in JBOD mode and migrated my VMs off the RAID1. I then moved the single SSD to the B120i and the pair of SSDs to this other RAID controller, disabled RAID in the BIOS for the B120i, and mirrored the SSDs on the RAID card. I loaded up ESXi and migrated the VMs back over to the new RAID1. Then I set up FreeNAS again, reconfigured my pool and started restoring data. No issues.

After the restore, I loaded up my cold backup (picked it up over the weekend) and used that to restore the permanently corrupted files and all my old Time Machine backups.

TL;DR
* Minor changes to the host can have a disastrous impact on your VMs' behavior (kernel panics, HBA corruption, etc.)
* If a scrub runs while your ESXi host is corrupting the passthrough HBA... your pool can be completely gone. Same principle as bad RAM.
* Cold backups are essential when you start storing anything important

Honestly, this has been quite a fun experience for me. Breaking stuff sucks, but figuring out the problem and fixing it afterwards is always enjoyable. Thanks everyone for all the great posts, especially Cyberjock and jgreco for all the newb tutorials and references for FreeNAS and hardware.

Now that I'm done changing/buying stuff, I'm not sure how I will break it next :)
 