Degraded Volume After Replacing

Status
Not open for further replies.

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
Hello,

I am running FreeNAS 8.0.1 release version. I set up a raidZ2 volume with 6 drives. I had to replace one of the drives so I powered down, physically replaced the drive, and booted up. From the web gui I saw the volume in degraded status, clicked on view drives, saw the unavailable drive and clicked replace. The new drive showed up with a status of replacing and I removed the old drive from the pool.

After all that the volume's status was still degraded and the new drive's status was replacing. I figured that replacing could take a while so I let it go. In the meantime I could still access the share and all the files on it.

Several days later, it is still in a degraded/replacing status. I feel like it should have completed by now. I logged in through SSH and saw this:

Code:
zpool status -x
 pool: zfs
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
       corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
       entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

       NAME                       STATE     READ WRITE CKSUM
       zfs                        DEGRADED     0     0     0
         raidz2                   DEGRADED     0     0     0
           ada0p2                 ONLINE       0     0     0
           ada1p2                 ONLINE       0     0     0
           ada2p2                 ONLINE       0     0     5
           ada3p2                 ONLINE       0     0     3
           ada4p2                 ONLINE       0     0     0
           replacing              DEGRADED     0     0     2
             4841527635549522674  UNAVAIL      0     0     0  was /dev/ada5p2/old
             ada5p2               ONLINE       0     0     0

errors: 3346 data errors, use '-v' for a list
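
For anyone reading a status table like this one, a small filter can make the non-zero counters stand out. The following is only a sketch: the function name `nonzero_error_devices` is made up for illustration, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above. The sample text is copied from the output in this post.

```shell
#!/bin/sh
# Hypothetical helper: print each row of a zpool-status-style table whose
# READ, WRITE or CKSUM counter is non-zero. Reads the status text on stdin.
nonzero_error_devices() {
    awk '
        # The header row marks the start of the device table
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The errors: summary line marks the end of it
        /^errors:/ { in_config = 0 }
        # Inside the table, report rows with any non-zero counter
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Sample copied from the status output in this post
sample='config:

       NAME                       STATE     READ WRITE CKSUM
       zfs                        DEGRADED     0     0     0
         raidz2                   DEGRADED     0     0     0
           ada0p2                 ONLINE       0     0     0
           ada1p2                 ONLINE       0     0     0
           ada2p2                 ONLINE       0     0     5
           ada3p2                 ONLINE       0     0     3
           ada4p2                 ONLINE       0     0     0
           replacing              DEGRADED     0     0     2
             4841527635549522674  UNAVAIL      0     0     0  was /dev/ada5p2/old
             ada5p2               ONLINE       0     0     0

errors: 3346 data errors'

printf '%s\n' "$sample" | nonzero_error_devices
```

On this sample it reports ada2p2, ada3p2 and the replacing vdev, which are the three rows with checksum errors. On a live system the same function could presumably be fed from `zpool status zfs` instead of the sample text.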


I also did a zpool status -v and it showed tons of files and said they had permanent data errors. I tried opening one of the listed files from the share and it worked fine. I tried others, several dozen, and there doesn't seem to be anything wrong with any of the files it lists.

Why does it think there are data errors here? I don't notice anything wrong; is there really something wrong with my files? If nothing is wrong, how do I fix this, finish replacing the drive, and get rid of degraded status?
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
There's a ticket open for this, but if you follow the directions in the Unofficial FAQ in my signature for replacing a disk, there's a detach command to get rid of that "was /dev/ada5p2/old" entry, and then you'll be back in business.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
I've managed to make the situation worse. The detach command didn't work. It would say no valid replica. I should have stopped and asked for help again when that didn't work. Instead, I started a scrub, since I heard that the scrub process was meant to check for and fix data errors.

Once the scrub started, I did a zpool status -v. It said something like scrub in progress, 1% complete 102 hours remaining. I figured that was a bogus estimate since it had just begun. About 10 minutes later, I tried another zpool status -v. This time nothing returned. The console was stuck. Ctrl-C wouldn't work either. I opened the web gui and clicked on Storage. It displayed the loading bar and nothing more.

I opened another ssh terminal. Running ps showed me the two zpool status processes running. A kill -9 command didn't stop them either. At that point I noticed I couldn't hear the hard drives running, so I figured the scrub process was stuck too. I rebooted the system.

It did not boot up again. It reaches the point where it says Mounting local file systems and it stops there. I see in the boot menu there is an option for verbose mode, I assume that might spit out some info that would tell me what is failing. Unfortunately FreeNAS doesn't detect my keyboard until further in the boot process so I can't use that menu.

In an attempt to debug this myself, I unplugged all of the drives and booted again. It worked. I poked at fstab and some of the init scripts and didn't see anything that looked out of place.

I wondered if maybe my new drive was just bad. I plugged it in by itself and booted. It worked, but I obviously can't do anything with a single drive. I repeated this with every drive, trying to find one which might be stalling the boot process. One by one, I could boot up fine.

Eventually I discovered, I could plug in any three drives and the machine would boot up. As soon as I plug in a fourth drive, it gets stuck at Mounting local file systems.

If I understand raidZ2 correctly, I must have at least 4 out of my 6 drives in order to bring the pool back up. Any ideas on how I might be able to boot with all of the drives plugged in?
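
A quick check of that arithmetic, as a sketch only (the function name `min_disks_needed` is made up for illustration): a single RAID-Z vdev can lose as many disks as it has parity, so the minimum it needs is the total minus the parity level.

```shell
#!/bin/sh
# Hypothetical helper: minimum number of member disks a single RAID-Z vdev
# needs in order to stay readable.
#   $1 = total disks in the vdev
#   $2 = parity level (1 for raidz1, 2 for raidz2, 3 for raidz3)
min_disks_needed() {
    total=$1
    parity=$2
    echo $(( total - parity ))
}

min_disks_needed 6 2   # the 6-disk raidz2 from this thread, prints 4
```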
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
Hi Sotwiz,

I'm sorry to hear that. The detach shouldn't have been a problem. You're right, it's too bad you didn't post. I'm worried that the scrub might have made things worse in some way, but let's not panic. The problem is what you said here:

It reaches the point where it says Mounting local file systems and it stops there. I see in the boot menu there is an option for verbose mode, I assume that might spit out some info that would tell me what is failing. Unfortunately FreeNAS doesn't detect my keyboard until further in the boot process so I can't use that menu.

If you can get your keyboard to work so you can access that menu, then you can get to a place where you can run zpool commands. The 'Mounting local file systems' part is failing because it's trying to mount your pool, but it can't. Stuff in your fstab gets overwritten or updated with info from the FreeNAS database, so the mountpoint for your pool doesn't get created until the database gets read to find out where it is supposed to create the mountpoint.

See what you can do to get your keyboard to respond, other than that I'm not sure what else to recommend right now.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
Yeah, it's odd that I can access the BIOS settings with my keyboard, then it starts booting FreeNAS and I lose it. While it's booting I can see it bring up the USB interfaces and right after that it says the keyboard is connected. This is all after that boot menu.

I'll see if I can find someone with a PS/2 keyboard I can borrow. That'll probably work.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
Okay, I am able to boot into single user mode.

zpool status -v reports no pools available.

Here's the output of zpool import. I can't copy/paste this so I'll try to retype it.

pool: zfs
id: 2353578023703232593
state: DEGRADED
status: The pool was last accessed by another system.
action: The pool can be imported despite missing or damaged devices. The fault tolerance of the pool may be compromised if imported.
see: http://www.sun.com/msg/ZFS-8000-EY
config:
zfs DEGRADED
- raidz2 DEGRADED
-- ada0p2 ONLINE
-- ada1p2 ONLINE
-- ada2p2 ONLINE
-- ada3p2 ONLINE
-- ada4p2 ONLINE
-- replacing UNAVAIL insufficient replicas
--- gptid/d3m3hg45-er33-45s2-456a-4k4j346mhbj6 UNAVAIL cannot open
--- 7686786786784588645 UNAVAIL cannot open
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I would use the -f and import it at the command line, then I'm a little unsure, do you have all 6 drives reconnected?

I might try a zpool clear and then a zpool status -v, and depending on the output, a zpool scrub. Once you get your pool back online at the command line, you can go back to the GUI and Auto Import it if it isn't already online there.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
Okay, I ran zpool import -f zfs

It's been 45 min to an hour and it's still running. Does import usually take a while or is it stuck?

If it's stuck, what other options do I have?
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
It doesn't usually take more than a minute, but since it crashed in the middle of a scrub it could take a while. Can you hear the disks making noise?

If you can hear them 'chattering' I'd let it keep going all night if I had to. How much data did you have? Depending on how full they were, a scrub can take a while. I have a little more than 4TB on a 5 disk raidz2 and a scrub takes 12.5 hours.

If you don't get a prompt back after waiting, then I'd have to think about what to do next.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
I don't hear the disks actively making noise like when I know there are file operations going on. Yet there's a tiny chirp every couple of minutes or so, which makes me think something may still be happening. There was less than 900 GB of data stored.

I would really like to be able to keep the data stored there, but I'm not in a hurry. I'll let it sit for now. If the import doesn't finish after a couple days, I'll come back and ask more questions.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
If the import doesn't finish after a couple days, I'll come back and ask more questions.

I'd probably not wait any longer than 24 hours, but it's up to you. It's strange that the status said it could be imported even though it was degraded. It sort of makes me wonder if there's some hardware problem since it crashed on you when you did the scrub. See what happens.
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
Yeah, it's stuck. Not hearing any hard drive activity, I grew impatient. I tried ctrl-c at the terminal and it didn't stop the process. It was just like my zpool status commands were before.

I did more searching around. This looks like a similar problem:
http://forums.freenas.org/showthread.php?395-zpool-import-hangs-forever-help-please!!!

In this and other threads I found about zpool processes getting stuck, the most common solution was to use a newer version of zfs.

Should I try to boot Linux with the zfs fuse drivers to recover this zpool, or are there other options I can try first?
 

sotwizz

Cadet
Joined
Oct 6, 2011
Messages
9
I was able to recover. It wasn't an easy fix. There was a lot of trial and error, but I'll try to tell the story of the path which, as far as I can tell, led to success.

I downloaded the Fedora 15 64 bit live image.

I installed VirtualBox on my desktop computer, booted a virtual machine from the Fedora 15 live image, and installed the OS to a virtual hard drive.

Rebooted the VM from the virtual hard drive. Once booted, I quickly killed the PackageKit daemon so it would not try to update everything.

I searched the internet for the kernel-devel rpm package that matched the kernel version that came with Fedora 15, downloaded and installed the rpm. (yum kept trying to download a newer version)

From http://zfsonlinux.org/ I downloaded spl-0.6.0-rc5 and zfs-0.6.0-rc5.
I followed the rpm build instructions for each. The builds ran overnight.

I created a folder, copied the freshly built rpms into it, and copied the kernel-devel rpm into it.

I used Fedora's LiveUSB tool to copy the live image to a USB stick.

I booted my FreeNAS machine off of the USB stick. I had to select simple graphics mode during the boot, otherwise it couldn't log in with Gnome 3.

Opened a terminal, 'su -' to become root.
'yum -y install lsscsi'

While that ran, I opened another terminal and became root.
I copied the rpms off my vm with this.
'scp -r root@ip.of.my.vm:/path/to/rpms .'

Once both of those finished,
'cd rpms/'
'rpm -Uvh *'

That took about a minute to run. Once finished:
zpool status reported no pools available.
zpool import showed my pool in degraded state. It said the pool was in the process of resilvering, might be active on another system, etc.

From there I followed the advice from this forum:
http://opensolaris.org/jive/thread.jspa?threadID=120474

'zpool import -nfF zfs' reported that the pool could be imported losing 20 some seconds of transactions.
'zpool import -fF zfs' took about 3 minutes to run.
Then zpool status -x showed that a resilver was in progress 1% complete.
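
The dry-run-then-import order above can be wrapped in a small script so the real import never runs without the dry-run report being read first. This is only a sketch under assumptions: the function name `recover_import` and the `ZPOOL` override variable are made up for illustration, and only the two `zpool import` invocations come from the steps in this post.

```shell
#!/bin/sh
# ZPOOL can be overridden (for example ZPOOL=echo) to rehearse the flow
# without touching a real pool. This variable is an assumption of the sketch.
ZPOOL=${ZPOOL:-zpool}

# Hypothetical helper: show the recovery dry run, then ask before importing.
recover_import() {
    pool=$1

    # Dry run: -n together with -F only reports what recovery would do
    "$ZPOOL" import -nfF "$pool"

    # Require an explicit yes before discarding the last transactions
    printf 'Proceed with the real import of %s? [y/N] ' "$pool"
    read -r answer
    [ "$answer" = "y" ] || return 1

    # Real import, rewinding to an earlier transaction if necessary
    "$ZPOOL" import -fF "$pool"
}

# Usage (not run here): recover_import zfs
```

The point of the prompt is that `-F` may throw away the most recent writes, so it seems worth reading the dry-run report before committing to it.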

In the meantime I could cd into the mount point and see all of the files.
From there, I copied everything off to another drive I had connected to my desktop computer.

I ran into problems with this process, though. In the end I got it working, but I'm not sure exactly what fixed it. The first time I was able to get 'zpool import' to display my zpool, it took a very long time to run. Once it finished, I could hear the disks working and the terminal kept spitting out ATA errors. It was annoying: they kept appearing on top of anything else I was doing in the terminal. Seeing the ATA errors made me think one or two of my hard drives might be bad. These errors didn't say anything specific about which drive or drives were causing the problem.

Also, the original replace I was doing just outright failed. It said replacing DEGRADED and both drives below that were UNAVAIL. I'm pretty sure this happened because I didn't zero the drive before I replaced it. That's likely the cause of this whole mess to begin with.

Despite the ATA errors and DEGRADED status, I started copying files off the pool. I used SCP again to push the files across the network. I tried several zpool status commands so I could see the progress of resilvering. The first couple looked fine, then they started to hang like I experienced in FreeNAS, except I could still terminate the process with ctrl-c. It was copying, so I just let it go. At some point while it was running the whole system froze up. This might have had something to do with the ATA errors, but I think it's more likely that it ran out of memory, and since the live OS runs in memory, it just died.

I thought one of my drives was bad so I unplugged it. The live operating system forgets everything on reboot, so I had to repeat many of the steps above. When I got to zpool import, it didn't work. The zpool status was FAULTED. One of the remaining four drives had a status of FAULTED. It couldn't import.

At that point, I was missing one drive, I thought one or two more drives might be bad, and one of my good drives was faulted. I need at least four drives online to read data. I went back to the old drive I originally replaced. I didn't replace it because it went bad; I replaced it because it was small and it was drastically lowering the overall size of the pool. I put it back in and I unplugged/replugged all of the SATA cables on both ends. Now all of the drives were back to how they originally were when I created the pool.

I booted up and went through the install steps. 'zpool import -f zfs' worked perfectly. I didn't even need the -F. 'zpool status -x' showed all drives online, resilvering still in progress. The replace was still degraded, but the original drive was online. I started copying the files with SCP again, this time from a different folder so I wouldn't have to recopy the files that had already copied. I let it run overnight.

In the morning the system was frozen again, so I rebooted, reinstalled, and reimported. The status was the same: six drives online, the extra replace drive unavailable, resilvering in progress. I was able to detach the extra drive, and everything went to online status.

I didn't know where SCP had finished, so I started copying the files over with
'rsync -rz --progress --size-only /zfs/ root@my.other.ip.address:/path/to/mount'

With these options rsync will not try to copy over files that exist both locally and remotely and have the same file size. So I let rsync do the work.
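
To make the skip rule concrete, here is a sketch of the comparison that --size-only relies on, written as a plain shell function. The name `same_size` is made up for illustration, and this is not how rsync itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: succeed when dst exists and has the same byte size as
# src. This mirrors the only check rsync makes with --size-only.
same_size() {
    src=$1
    dst=$2
    [ -f "$dst" ] || return 1
    # Arithmetic expansion strips any padding that wc adds on some systems
    src_size=$(( $(wc -c < "$src") ))
    dst_size=$(( $(wc -c < "$dst") ))
    [ "$src_size" -eq "$dst_size" ]
}

# Demo: two files of equal size but different content
tmp=$(mktemp -d)
printf 'hello' > "$tmp/src"
printf 'HELLO' > "$tmp/dst"
if same_size "$tmp/src" "$tmp/dst"; then
    echo "skip: sizes match"
else
    echo "copy: missing or size differs"
fi
rm -rf "$tmp"
```

The demo prints "skip: sizes match" even though the contents differ. A size check picks up a file that was cut short by an interrupted copy, but it cannot tell whether a same-sized file arrived intact.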

Every 6-12 hours it freezes, and I have to restart the process, but the zpool status has been the same. The resilvering process gets a little more % complete each time. This has been going on for several days. My network is extremely slow at copying files and the resilvering probably slows things down too.

It actually just finished copying all of it as I was typing this. I have a complete data backup so I'm happy. I just tried zpool status -x, but it hung. After my most recent import, status said resilvering was at 88%. I'm going to let it run and try to finish resilvering before I export and go back to FreeNAS.

I bought three new drives, so hopefully I can be rid of all my problems soon.

Thank you very much for all the help you've provided. Hopefully some part of this story could help someone prevent or overcome a similar problem.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I would have probably let the resilvering finish before I tried to copy stuff because like you found out, it slowed things down, but also because I wouldn't trust the data to be reconstructed completely until it finished. The thing about any recovery situation like this is that files can appear to be recovered, but until you actually open them and look at stuff, you never know if you just got the header with a bunch of garbage or random bits and pieces of each file.

Thanks for sharing your story, I'm certain it will be helpful to someone, but at the same time I don't wish a data disaster on anyone.
 