Unable to offline drive for replacement

Status
Not open for further replies.

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
Hi,

I have FreeNAS-9.2.1.7-RELEASE-x64 running on an HP N40L MicroServer (no hot swap).
I have all four bays (0 through 3) populated with 2 TB drives, and the drive in bay 3 is showing SMART errors. I have already purchased a replacement drive and I'm looking to install it.
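(Sketch for readers following along: the disk's own SMART report is the quickest confirmation of what's failing. ada3 is inferred from the log below and may differ on other setups.)
Code:
# Full SMART report for the suspect disk (bay 3 = ada3 here, per the log below)
smartctl -a /dev/ada3

# Or just the health verdict and the logged errors
smartctl -H /dev/ada3
smartctl -l error /dev/ada3
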

Following the 9.2.1 User's Guide, I attempt to offline the erroring drive, but the GUI just returns to the previous screen without any change. I have a tail -f /var/log/messages running in an adjacent SSH session, and I see the following:
Code:
Nov 19 10:10:22 bitbucket notifier: swapoff: /dev/ada3p1.eli: No such file or directory
Nov 19 10:10:22 bitbucket notifier: geli: No such device: /dev/ada3p1.
Nov 19 10:10:22 bitbucket manage.py: [middleware.exceptions:38] [MiddlewareError: Disk offline failed: "cannot offline gptid/bc216b13-3b58-11e2-ab99-e8393520a421: no valid replicas, "]


The User's Guide mentions the "no valid replicas" error and says that running a scrub should correct it. I have a 14-day scrub schedule in place, and the last scheduled scrub completed a few days ago, but I ran another one anyway. It did not correct the problem.
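(Sketch: the same steps from an SSH session, using the pool name and gptid from the log above. Running the offline by hand at least prints the error instead of the GUI silently bouncing back.)
Code:
# Manual scrub, then check the result
zpool scrub Vol1
zpool status -v Vol1

# Retry the offline from the shell; this surfaces the
# "no valid replicas" error directly
zpool offline Vol1 gptid/bc216b13-3b58-11e2-ab99-e8393520a421
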

Can anybody advise how I can offline the drive so that I can replace it?
Would powering the box down and replacing the drive without offlining it seriously damage the pool?

Thanks for your help!
Andy.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Would powering the box down and replacing the drive without offlining it seriously damage the pool?

Normally, no. In your case you WILL lose data. Potentially the pool.

Can you post the output of "zpool status" in pastebin? I'm guessing you have a non-redundant pool and therefore can't do disk replacements.

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
Normally, no. In your case you WILL lose data. Potentially the pool.

Can you post the output of "zpool status" in pastebin? I'm guessing you have a non-redundant pool and therefore can't do disk replacements.

zpool status:
Code:
[root@bitbucket] ~# zpool status
  pool: Vol1
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 240K in 9h19m with 1 errors on Wed Nov 19 02:02:48 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     4
          raidz1-0                                      ONLINE       0     0     8
            gptid/ba7a3326-3b58-11e2-ab99-e8393520a421  ONLINE       0     0     0
            gptid/baeba256-3b58-11e2-ab99-e8393520a421  ONLINE       0     0     0
            gptid/7db116e4-67cd-11e4-a023-e8393520a421  ONLINE       0     0     0
            gptid/bc216b13-3b58-11e2-ab99-e8393520a421  ONLINE       0     0     1

errors: 1 data errors, use '-v' for a list


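(Note for readers: the last line of that output says to use '-v' for a list; zpool status -v names the damaged file, and the gptid labels can be mapped back to physical devices with glabel.)
Code:
# Name the file(s) hit by the permanent error
zpool status -v Vol1

# Map gptid labels to device nodes (e.g. gptid/bc21... -> ada3p2)
glabel status
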
Edit: Oops - you said pastebin - why?

Thanks for your help.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That will do.

Ah. You already have permanent corruption of your pool. See that "errors: 1 data errors..." message.

So no, you don't do disk replacements now. You do pool replacements.

See the warning in my sig about RAIDZ1 being dead. You have now demonstrated that the RAIDZ1 warnings are still valid. That's three in less than 24 hours! :(

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
Sorry - I'm not sure what you're suggesting. Can you be a little clearer? What's a "pool replacement"?

Thanks.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What I mean is: you back up all of the data on your pool, you destroy the pool, then you create a new pool and copy your data back.

See, since you have permanent errors, you have at *least* 2 bad disks in your pool. Since you have RAIDZ1, you are not (and were not) protected from the corruption. The corruption is evident from the line I quoted above.

So you really need to back up your data, destroy your pool, replace/RMA the bad disks, *then* create the new pool. Super easy, huh? OK, not really. But this is why we tell people to use RAIDZ2.
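(Rough sketch of that workflow from the shell. The backup pool and the da0..da3 device names are placeholders; on FreeNAS you would normally create the new pool through the GUI so that it lays out the swap and gptid partitions for you.)
Code:
# 1. Snapshot everything and copy it off (zfs send to another pool,
#    or rsync to any external disk)
zfs snapshot -r Vol1@migrate
zfs send -R Vol1@migrate | zfs receive -F backup/Vol1

# 2. Destroy the old pool -- irreversible!
zpool destroy Vol1

# 3. Recreate as RAIDZ2 (placeholder device names), then restore
zpool create Vol1 raidz2 da0 da1 da2 da3
zfs send -R backup/Vol1@migrate | zfs receive -F Vol1
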

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
Righto, gotcha. That's a bit of a bummer.

The drive in bay 3 has been reporting SMART errors for a while, but I thought that was predicting an imminent failure, so I was waiting for it to go completely bad (plus we ran a bit low on the old cashola, and a replacement drive was pretty low on the wishlist). Then a second drive (bay 2) failed completely and I replaced it. Now we have a bit of spare cash and time, so I'm replacing bay 3 too, except I can't. I'm guessing that the combination of both failures caused the data corruption which is blocking the offline process.

I'm starting to think FreeNAS & ZFS may be a bit overkill for my needs. This is just a home NAS for media and shared files; anything important is backed up onto cloud storage.
Given my hardware constraints below, what would be the largest "safe" amount of storage going forward with FreeNAS?

HP MicroServer N40L
8 GB ECC RAM
4 x 2 TB NAS-grade drives

If I were to backup the data and destroy the pool, how should I recreate it?

RAIDZ2 is not really an option, since I'd only get about 3-4 TB of storage and we're already bursting at the seams of 5 TB (rough arithmetic at the end of this post). I can't squeeze any more drives into the MicroServer, so I may look at some sort of lower-quality software RAID solution like UnRAID or NAS4Free.

It seems a bit overkill to have to destroy the whole pool due to a small amount of corruption, and I think this may be the second time FreeNAS has bitten me like this. I destroyed the pool before and restored from backup... Just re-read the old post: it seems that when I upgraded some drives, the resilver never finished...

Also, I had in mind that I would replace the NAS sometime next year. I was thinking of much the same hardware, but doubling the memory and disk:

HP MicroServer N54L (assuming I can still get one; the new Gen8s seem to use twice as much power)
16 GB ECC RAM
4 x 4 TB NAS-grade drives
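(Quick parity arithmetic for both configurations, ignoring ZFS metadata overhead of a few percent.)
Code:
# RAIDZ2 usable space = (disks - 2) x disk size
echo "4 x 2 TB raidz2: $(( (4 - 2) * 2 )) TB usable"   # too small for ~5 TB of data
echo "4 x 4 TB raidz2: $(( (4 - 2) * 4 )) TB usable"   # headroom over the current 5 TB
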

Thanks again for your help - I'm off to try and track down my old 3 TB external drive for backing up :(

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Honestly, if RAIDZ2 is not an option, you've got bigger problems to deal with. RAIDZ1 (and RAID5) were declared "dead" years ago. So you're fighting a battle that you aren't going to win (and didn't).

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
RAIDZ1 (and RAID5) were declared "dead" years ago.

From my current understanding, I disagree. I think I understand what you're saying re data loss with respect to the size, read-error frequency, and MTBF of modern HDDs: with current reliability values and volume sizes, data corruption should be expected within the life span of the equipment unless you have an n+2 redundant storage system. With two levels of redundancy (i.e. RAIDZ2), it should always be possible to determine what the true value of any given bit should be. But small amounts of data corruption aren't really a problem for me.
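(For concreteness, the usual back-of-envelope form of that argument. The 1-error-per-1e14-bits URE rate is the commonly quoted consumer-drive spec, not a measured figure for these particular disks.)
Code:
# Odds of hitting at least one unrecoverable read error (URE) while
# resilvering a 4 x 2 TB RAIDZ1 vdev: all three surviving disks must be
# read end to end, with no remaining redundancy to fall back on.
awk 'BEGIN {
    bits = 3 * 2e12 * 8;        # three surviving 2 TB disks, in bits
    rate = 1e-14;               # assumed UREs per bit read
    p = 1 - exp(-bits * rate);  # Poisson approximation
    printf "P(>=1 URE during rebuild) ~ %.0f%%\n", p * 100
}'
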

The impending loss of all my data is - since I now no longer have any resiliency against a drive failure, which is actually quite a likely occurrence, since my array is constructed from four drives, roughly quadrupling the failure rate.

I just need a large file storage and sharing location. I don't need a system to guarantee every single binary digit against corruption indefinitely. It is a nice value add that ZFS is/was able to provide that functionality, but it looks to me like ZFS prioritises absolute data integrity over actual data availability. For my purposes, I think this is the wrong way round.

Or am I misunderstanding?
Is FreeNAS less likely to suffer catastrophic data loss (read zpool loss) during the lifespan of the equipment than alternative products?

Again, thanks for your help - I've read a lot of the FAQs and links you have provided, but I'm still trying to figure out why I need the complexity overhead that ZFS brings.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
We're having two discussions here and you probably don't know it...

Conversation 1:

I just need a large file storage and sharing location. I don't need a system to guarantee every single binary digit against corruption indefinitely. It is a nice value add that ZFS is/was able to provide that functionality, but it looks to me like ZFS prioritises absolute data integrity over actual data availability. For my purposes, I think this is the wrong way round.

Your assessment is basically 100% correct. For your purpose this is probably back-asswards.

Conversation 2:

Is FreeNAS less likely to suffer catastrophic data loss (read zpool loss) during the lifespan of the equipment than alternative products?

It's not less likely than something else. The problem is that, statistically, if you run hardware RAID5 with NTFS, once you've fubared the file system (which is what happened to you with ZFS), you're virtually guaranteed to lose data. If it's the bits for an obscure directory with 4 files, you might not cry. But if it's the root directory information for the drive, you can kiss your data goodbye just as easily with ZFS as with NTFS.

So you might be arguing that you don't need 100% resiliency from any and all data corruption, but you sure as hell can't handle data corruption in your file system, no matter what file system you choose. The consequences of a corrupt file system are universally bad: data loss. You are witnessing this exact problem firsthand.

So we're back to what I said in the beginning when I said RAID5/RAIDZ1 is dead. Now, if you think that you somehow are going to be luckier with RAID5 and NTFS (or some other combination that means "not FreeNAS"), then go for it. You will notice that the arguments against RAID5 do NOT mention file systems at all. That's not a coincidence, and it's not some major flaw in the argument that was overlooked. You'll take it from the rear no matter what OS and file system you use. Of course, ZFS provides the end-to-end protection that hardware RAID can't, so recovery from a disk failure with ZFS is far less likely to result in corruption (which can and does happen), assuming you have redundancy to correct the errors.

So you need to make a choice. The choice is "how much risk do I think I can handle the next time a disk fails?" Frankly, and this is strictly by the numbers, RAIDZ2 is about the only "smart" choice. Anything else is nothing more than waiting for your RAID array (or zpool) to fail. Is that really the game you want to play? If it is, then create a new RAIDZ1 right now and keep going with what you are already doing. There's definitely nothing to be gained or lost by doing what you are doing if you are willing to accept that you are not in it for the long haul and only interested in keeping the pool available "until the next mishap".

But the people on this forum? We're into storage for the long haul. We're not here to discuss things that involve phrases like "until the next disk failure" and "unless I'm lucky I'll probably lose it all".

andyl

Explorer
Joined
Apr 20, 2012
Messages
76
I think I may have been able to resolve this issue.

Whilst I was backing up the pool to alternate storage before destroying and re-creating it, my copy software generated an error indicating that it was unable to access a single file. Assuming this was the file showing corruption during a scrub, I deleted it. Then I ran a scrub - this still showed errors. So I destroyed all snapshots and ran another scrub. After this scrub, no data errors were present.
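(For anyone hitting the same wall, the sequence above boils down to roughly this. The file path and snapshot name are placeholders; ZFS only drops an entry from the permanent-error list once nothing, snapshots included, still references the bad blocks and a later scrub confirms it.)
Code:
zpool status -v Vol1                  # lists the path of the damaged file
rm "/mnt/Vol1/path/to/damaged-file"   # placeholder path
zfs list -t snapshot                  # snapshots still pin the bad blocks
zfs destroy Vol1/dataset@snap         # repeat for each snapshot that holds it
zpool scrub Vol1                      # re-scrub until the error list clears
zpool status -v Vol1
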

I offlined the failing drive and replaced it as per the manual. Now my pool shows the following:

Code:
[root@bitbucket] ~# zpool status
  pool: Vol1
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: resilvered 1.70T in 19h18m with 0 errors on Mon Nov 24 17:26:59 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    Vol1                                            ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/ba7a3326-3b58-11e2-ab99-e8393520a421  ONLINE       0     0     0
        gptid/baeba256-3b58-11e2-ab99-e8393520a421  ONLINE       0     0     0
        gptid/7db116e4-67cd-11e4-a023-e8393520a421  ONLINE       0     0     0
        gptid/67b3e427-7309-11e4-be82-e8393520a421  ONLINE       0     0     0

errors: No known data errors


The "some supported features are not enabled" status message is a little concerning, but I'll raise that in another post.
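(For completeness, the upgrade the status message refers to is a one-liner, with the caveat the message itself gives: it's one-way, and older software may no longer import the pool afterwards.)
Code:
zpool upgrade Vol1   # enables all supported feature flags (irreversible)
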

Thanks for all your help!