My turn to ask for some help!!!


andyclimb

Contributor
Hi,

I've been using FreeNAS for a while now and generally haven't had any significant problems. Until a few days ago! I recommended that a friend get a FreeNAS box as it's cheap (compared to things like QNAP) and you get more bang for your buck. So he purchased an HP ProLiant N40L (I think, or P...), installed 8 GB of ECC RAM and put in 4 hard drives. At that point he was running 9.2.1.5, then upgraded to 9.2.1.6, and is now finally running 9.2.1.8. He has one RAIDZ1 pool, one jail running CrashPlan, and one user... all very simple.

Only the other day CrashPlan stopped working, so I suggested he reboot... but then the box refused to boot, which was all a bit weird. So I suggested he re-image a new USB stick with 9.2.1.8, which he did. The machine now boots just fine, but when I auto-import the volume it reboots... immediately...

If I run

zpool import -f -R /mnt tank

the machine crashes immediately.

If I run

zpool import -f -R /mnt -o rdonly=on tank

then the pool mounts and all the data is there. This is good... but my question is this:

1) What do I do now? I don't really have the option 'just yet' to copy the data off the NAS to something else and recreate the pool, which I guess is the straightforward, easy option. Apart from that, is there anything else I can try?

2) Why did this happen? I've had a lot of faith in ZFS, but there are other posts about this sort of thing happening, and it is rather disconcerting given that there are no rescue tools available. I'm running ECC RAM, the system is up to spec, FreeNAS is up to date, and there is loads of space on the pool. Technically I don't think I've done anything wrong... cyberjock may not agree with me on that one, but I do have ECC!
 

cyberjock

Inactive Account
1. You do exactly what you think you need to do: copy your data elsewhere and recreate the pool (rough sketch of the workflow below).

2. There are a bunch of reasons why RAIDZ1 is bad. My finger points there first, considering the problems are well known and well documented. But I will admit you have no smoking gun.
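Roughly, the recovery path looks like this. This is just an untested sketch: the pool, disk, host and path names are examples, and on FreeNAS you'd normally do the destroy/recreate step from the GUI so the disks get partitioned the way FreeNAS expects.

# import the damaged pool read-only so ZFS doesn't try to replay or commit anything
zpool import -f -R /mnt -o readonly=on tank

# copy everything off somewhere else (destination is just an example)
rsync -a /mnt/tank/ backupbox:/backup/tank/

# once the copy has been verified, let go of the old pool and rebuild it as RAIDZ2
zpool export tank
zpool create -f -m /mnt/tank tank raidz2 da0 da1 da2 da3   # example disk names

# copy the data back
rsync -a backupbox:/backup/tank/ /mnt/tank/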
 

diedrichg

Wizard
With 4 drives, definitely recreate in RAIDZ2!
 

andyclimb

Contributor
Hi,

OK, so: data copied off, pool recreated as RAIDZ2, data copied back. All ends well... however, I'm still a little disconcerted.

1) From what I can see there were no drive errors, no SMART errors, and no checksum problems on the failed pool. Scrubs were run once per week and detected nothing.
2) As cyberjock pointed out, there is no smoking gun here.

The only advice from people is that I 'should really be running RAIDZ2 and not RAIDZ1'.

But I'm a little confused. The data is all there, and the pool can be mounted read-only, which suggests there is nothing actually wrong with the underlying data. From my reading, what this implies is that something is wrong with the transaction write queue, or the pending transactions in ZFS, and that is what makes it all fall over on import. So my multimillion-dollar question is: would having another disk of redundancy really stop this error from happening? It seems to me that the error is at a higher level of the ZFS pool than an individual disk and its recorded data. I might be totally wrong about this, as I'm not a ZFS expert. I thought zpool import -F is meant to try and roll back the transaction history (something like the sketch below), but that failed, and an error CAUSING THE MACHINE TO IMMEDIATELY REBOOT seems a bit overkill...
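For reference, this is the sort of thing I had in mind; untested, and the pool name is just an example:

# dry run: -n together with -F reports whether discarding the last few
# transactions would make the pool importable, without actually doing it
zpool import -Fn tank

# if that looks sane, the real recovery import (data in the discarded
# transactions is lost)
zpool import -F -R /mnt tank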

Anyone care to throw in their 2 cents?

A
 

cyberjock

Inactive Account
From my reading, what this implies is that *something is wrong with the transaction write queue, or the pending transactions in ZFS, and that is what makes it all fall over on import*.

Bold emphasis is mine.

That is almost certainly the problem. Here's the most likely scenario IMO:

1. ZFS decides to write to the transaction write queue or whatever.
2. ZFS sends the writes.
3. Each disk receives its writes at slightly different times.
4. One or more disks performed the write, but one or more disks hadn't at the time of the shutdown; or the data was being committed by more than one disk, and because of the loss of power the data written to the pool was garbled.
5. On the next bootup ZFS goes to do its thing and it's like:

"wtf is this shit!? I can't use this. I'll go to the parity data"
"Aww shit.. parity data is fubared too!"
"Crap.. I have no out.. well, I'll process this crappy data that I have and hope it works out"
ZFS *thinks* it worked out
ZFS crashes because it didn't work out.
Game over!

When you load a pool read-only the drives are writable, but the pool is read-only. What that means is that if your pool has corrupted data and ZFS knows about it and can repair it, it will. But if it can't repair it, the data stays as it is. Also, if you have a huge section of the pool that is corrupt, then as long as you never have to access that section the pool could, in theory, stay mounted forever. But once you hit the corrupt part it might kernel panic the box.

In your case, the process of mounting the pool read-write makes ZFS access something corrupt that it knows needs to be committed, but it can't figure out how to do it. ZFS doesn't try to commit anything if you mount the pool read-only.

So how does Z2 help with this problem over Z1? You have a stronger chance of your parity data saving you, and more writes (since there are naturally more drives and more parity) are necessary for the transaction to be 'complete'. I'm a bit fuzzy on the 'why' at a deep level at this point, but that's what I got from a ZFS guy a year+ ago, so I just press the "I believe" button and go with it.

At the end of the day, the shitty reality is that Z1 can do this, and I haven't seen this on Z2 from a forum user. At least, not yet.
 

andyclimb

Contributor
Thank you for getting into that for me! It makes sense, and it's along the lines I was thinking. I still think this is a little unacceptable of ZFS, i.e. it should have the ability to roll back and revive a pool to some previous state regardless of what doesn't make sense: just forget all pending writes, or use snapshots where they exist. Causing a kernel panic with no information beyond a cold reboot is a bit poor. But the fact that you've seen it happen on lots of Z1 volumes and not on Z2 means my friend is now covered: he has RAIDZ2, and I've always had RAIDZ2.

Anyway, he is midway through a CrashPlan backup and a backup to my machine, so all is good.

As always, thank you for the help!

A
 

DaPlumber

Patron
@andyclimb: There *ARE* tools for manually performing the kinds of operations that would be required to repair your pool by hand: zdb, the ZFS debugger (and it is installed on FreeNAS), for diagnosing the issues, and things like "zpool reguid" for fixing them; a rough sketch of the diagnostic side is below. However, a quick perusal of the zdb etc. man pages, as well as some of the horror stories you'll find if you Google "zfs repair FreeBSD", should convince you that this is a *VERY* non-trivial process. In short, if it's a personal pool, what you did was entirely correct: it's going to be far easier to copy the data off a read-only pool (presuming it mounts read-only and you haven't tried to export it yet) than it is to try to diagnose and repair the damage. It's one of those engineering rules of thumb: as the complexity of a system increases, the probability that the cost of repair exceeds the cost of replacement increases, usually as a power of the complexity. ZFS is a lot of wonderful things, but simple it is not, and its data structures are for the most part not human-readable.
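To give a flavour of it (pool name is just an example; the zdb calls are read-only, and you would only ever run the reguid if you actually needed the pool to get a new GUID):

# print the cached pool configuration (read-only)
zdb -C tank

# same thing for a pool that is exported / not currently imported
zdb -e -C tank

# one of the 'fixing' tools: give the pool a new GUID
zpool reguid tank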

One of the reasons Z2 reduces the chances of error is that it allows for quorum checksumming: I have three checksums, the one I just generated and the two saved in Z2. If two out of the three agree, it's pretty much a certainty which one is corrupted. With just a single stored checksum it isn't always possible to tell whether it's the checksum or the data that is correct. There are some types of corruption (as opposed to outright failures) that will still allow a valid checksum to be generated, and even data to be rebuilt, or the checksum itself may be corrupted or simply be the wrong checksum for the data.

Extra credit for the masochistic: do a "zdb -C poolname" to print out the pool configuration and count the number of lines whose meaning and possible values you understand. If the answer is anything other than "all of them", you should not *EVER* attempt manual ZFS repair (unless it's out of interest, after copying off what you can salvage of the data). I've been messing around with ZFS since its infancy on Solaris, and I fail that test too. :confused:

I'm really glad your friend got his data back!:D

BTW, I know this is going to earn me a growl from Cyberjock :p, but creating a recovery pool to temporarily house the copy of the data is one of the few instances where I will cross my fingers and use USB (preferably USB3) connected drives if there's nothing else to use. A mirrored pair is strongly recommended; a rough example is below. If there's a problem, I can always reformat and start the copy again, or reboot and redo the restore.
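Something along these lines, assuming the two USB drives show up as da4 and da5 (device and path names are just examples):

# temporary mirrored recovery pool on the two USB drives
zpool create -f -m /mnt/recovery recovery mirror da4 da5

# copy the salvaged data onto it
rsync -a /mnt/tank/ /mnt/recovery/

# when it's no longer needed
zpool destroy recovery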
 

Ericloewe

Server Wrangler
Moderator
Extra credit for the masochistic: do a "zdb -C poolname" to print out the pool configuration and count the number of lines whose meaning and possible values you understand. If the answer is anything other than "all of them", you should not *EVER* attempt manual ZFS repair (unless it's out of interest, after copying off what you can salvage of the data).

I want the last 60 seconds of my life back (it took a while to open PuTTY)! I was expecting a cluster**** on the scale of Samba's source code, and all I got was a somewhat readable representation of my pool's properties.

On a more serious note, you wouldn't happen to know what a "space map refcount mismatch" is, would ya? I don't appreciate mismatches on a pool that's barely been worked with.
 

cyberjock

Inactive Account
That's apparently normal and expected. I don't know the exact reasoning because, like you, everything I've read says they should match. But they always seem to mismatch on FreeNAS. Maybe I should ask a dev. :P
 

DaPlumber

Patron
@Ericloewe: I'm speculating that it might have to do with FreeNAS insisting on GELI and/or the "stealing" of some blocks for the swap partition. The error message comes from the spacemap histogram (I think!), so it's where ZFS is trying to guess/predict the free-space metaslab "holes", and the real count comes up different from the expected one. I'm not sure whether the ZFS code checks the actual space in the partition or takes a shortcut and uses the device size. It's more of a count mismatch, and it lasts the life of the pool AFAIK. No one appears to care very much about it, as it's more of a "I guessed wrong, devs: time to tweak the algorithm" type of thing. Anyhow, here's the section of code from zdb.c that generates that message:

static int
verify_spacemap_refcounts(spa_t *spa)
{
        uint64_t expected_refcount = 0;
        uint64_t actual_refcount;

        /* refcount recorded on disk for the spacemap_histogram feature... */
        (void) feature_get_refcount(spa,
            &spa_feature_table[SPA_FEATURE_SPACEMAP_HISTOGRAM],
            &expected_refcount);
        /* ...versus what zdb counts by walking the DTLs and metaslabs */
        actual_refcount = get_dtl_refcount(spa->spa_root_vdev);
        actual_refcount += get_metaslab_refcount(spa->spa_root_vdev);

        if (expected_refcount != actual_refcount) {
                (void) printf("space map refcount mismatch: expected %lld != "
                    "actual %lld\n",
                    (longlong_t)expected_refcount,
                    (longlong_t)actual_refcount);
                return (2);
        }
        return (0);
}

(for your enjoyment.) :cool:

If you want a heavier dose of ZFS jargon :p, do a "leaked space check" (traversal) with "zdb -bb poolname". It might take a while to run. ;) Oh, and if it does find any issues on a mounted zpool, they might not be there the next time you run it, as it's more of an internal garbage-collection thing that gets fixed "automagically".

One of the better tutorials on what the <bleep> all this zdb stuff means and how to use zdb (at least for relatively simple stuff) is still Ben Rockwood's 2008 blog entry @ http://cuddletech.com/?p=407

Be warned: ZFS Internals are fascinating to anyone with a technical turn of mind, but that Rabbit Hole goes very, VERY deep.:D
 

cyberjock

Inactive Account
The official answer is that if the pool is mounted, the two numbers are gathered at different times, and on a pool that's in use a write has likely taken place in between, so they will naturally not match.

Now, if the pool isn't mounted and there's a mismatch, then there is cause for concern.

Got this info from a dev. ;)

Kind of makes sense after you think about it. :P
 

DaPlumber

Patron
The official answer is that if the pool is mounted, the two numbers are gathered at different times, and on a pool that's in use a write has likely taken place in between, so they will naturally not match.

No duh. :p I should have worded that better, and remembered to include the "between pool mounts" part. :oops: I'm still curious whether the histogram is thrown off by ZFS not having access to the "whole disk". Admittedly, I'm going on a knowledge of ZFS internals that's Solaris-oriented and a few years out of date, so I could be completely wrong. :rolleyes:

Now if the pool isn't mounted and there's a mismatch then there is cause for concern.

Eh, I'd be concerned if a mount and a scrub don't fix it (something like the commands below).
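Just as an illustration, with the pool name as an example:

# re-import the pool and scrub it, then re-run the zdb check once the scrub finishes
zpool import tank
zpool scrub tank
zpool status -v tank    # shows scrub progress and any errors it turned up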

Got this info from a dev. ;)

Kind of makes sense after you think about it. :p

It takes a certain kind of mind to wrap itself around ZFS internals. I have respect for those who do, because I can barely keep up with the discussion even when they go slow and take copious notes! ;)
 

Ericloewe

Server Wrangler
Moderator
Ok, that's reassuring.

Now, if the ZFS devs could learn to write less scary error messages for errors that aren't really errors...
 