Safest course when all disks possibly compromised...

Joined
Jul 13, 2013
Messages
286
We've got a box that's been running for two years (5+1 pool; yeah, I know that's risky and inefficient). Two weeks ago a disk reported itself bad, we replaced it, and it resilvered.

But this week, the replacement disk reported itself bad. No other disks showed any problems. Infant mortality is always a possibility of course, but...

When we opened the box, the disks all seemed kind of hot. On further investigation, it looks like the main fan that cools the disks probably wasn't hooked up. It seems very unlikely that the problem is two years old and didn't show up until now, so maybe we messed it up when changing the disk two weeks ago, or maybe the connector vibrated loose, or maybe gremlins. Whatever the cause, we can't prove anything.

The thing is -- now I'm sitting on a degraded pool (working, but no redundancy), whose remaining disks may well have been exposed to over-temp conditions for up to two weeks (enough, maybe, to kill the new disk). So now I'm terrified that the remaining disks have very limited future life. Panic! (The server is powered off for now, so that life isn't ticking away in the meantime.)

Options we've thought of include:

  1. Replace the bad disk again, and let it start resilvering. Be very sure the fans are hooked up and running right :smile:. Then get the backup server in place for regular automatic backups (most of the data actually exists in multiple places, but not in an adequately organized fashion and it's not being automatically kept current).
  2. Bring the pool up in degraded state (don't replace the failed disk), and immediately start replicating the data onto another server (we've got one, intended to be the backup server, nearly ready to go).
  3. Bit-copy the disks individually somehow to new drives, bring the copies up in a server, and then replace the missing disk and let it resilver.
#3 is more work and takes more time. However, it appears to require the fewest hours of continued life from the existing disks: only the disk actively being copied would be powered on at all, and the copying should be more efficient than other methods (this pool is hideously full, so copying unused space won't hurt us much). The point here is to minimize the chance of one of the drives that may have been exposed to over-temp conditions failing before we can copy the data off it.

Is #3 possible, using FreeBSD or Windows tools? Does the general class of Windows utilities that replicate drives without regard to partitions or filesystems copy everything that matters (at least the master boot record, partition table, and all partitions)? Has anybody actually done it?

We've ruled out #1 as far too risky.

I'm planning to examine the failed disks tonight and tomorrow and look at the SMART data and such. If they were exposed to overtemp, there should be a record of it, shouldn't there? I would also have expected the short SMART test to catch overtemp conditions and report them, and we got no such reports.
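Concretely, I'm planning to pull the SMART data with something like this (the device name is just an example; substitute whatever the drive shows up as):

# Full SMART dump, including the attribute table and (on drives that keep it)
# the SCT temperature history with lifetime min/max temps.
smartctl -x /dev/ada1

# Just the attribute table: check 194 Temperature_Celsius and its Worst value,
# plus 197/198 (pending / offline-uncorrectable sectors).
smartctl -A /dev/ada1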

Is there some other approach that has better odds of recovering the pool?

Pending the results of examining the failed disks, we're kind of leaning towards #2 currently.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I would go with option #2: back up the data ASAP, then rebuild the old server with a RAIDZ2 pool.
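For the back-it-up-ASAP part, a snapshot plus zfs send/receive to the other box is the usual route. A minimal sketch, where "tank", "backup/tank-copy" and "backuphost" are placeholders for your actual pool, destination dataset, and backup server:

# Take a recursive snapshot of the whole pool, then stream it to the other machine.
zfs snapshot -r tank@evac1
zfs send -R tank@evac1 | ssh root@backuphost zfs receive -F backup/tank-copy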
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I think #1 is less risky than #2, because with #2 one read error and the pool can be gone, but it depends on how bad the state of this drive is.

SMART tests don't warn you about overtemp; the SMART service in FreeNAS does, but only when the moon is full, on even days of the month, and if you jump 3 times... --> More seriously, there's a bug and I've never received an email about an overtemp (even in tests where I know it should have sent the mail), while for some people everything works; there's a thread about this bug somewhere. It's a shame, because it's a very important thing and it's still unresolved.
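Until that's fixed you can poll the temperatures yourself, e.g. from cron. A quick sketch, assuming ada disks and that your drives report attribute 194 as Temperature_Celsius (adjust the glob and attribute name to match your hardware):

# Print the current temperature (raw value of attribute 194) of every ada disk.
for d in /dev/ada?; do
  echo -n "$d: "
  smartctl -A $d | awk '/Temperature_Celsius/ {print $10 " C"}'
done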
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Scary situation all around. Especially if there is the chance that any other drive could be bad (due to the heating issue). Myself, I would think about option #3 since I would want to baby it as much as possible.

Disclaimer: I have not tried this with FreeNAS or even with a RAIDed drive, but I have done it for individual OS drives (like Windows and Ubuntu).

I would presume that you could take one drive at a time, put it in a "known good" system, and make a backup or direct copy with CloneZilla (via a bootable USB thumb drive). Personally, I would make a backup image to a network share or other safe storage first.

Then, once you have all the drives duplicated, you can bring up the newly duplicated drives and proceed. If you made backup images with CloneZilla, you could even restore the images to virtual hard drive(s) and possibly run the scenario in a VM instance of FreeNAS (again, I have not tested this either).

Last note: I'm unsure whether FreeNAS would freak out if all the disks were new and no longer had their original identifiers. Perhaps someone could chime in on this?

Best of luck.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I think #1 is less risky than #2, because with #2 one read error and the pool can be gone
One read error on resilver and the pool is gone too. At least with option #2 you get some data recovery. The first thing should be to save as much data as possible, then replace the drives.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
#3 is possible but don't use the Windows tools. Use FreeBSD to make a dd copy from the cooked drives onto new drives.
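Something along these lines, one disk at a time; the device names below are examples only, and be absolutely certain which disk is which before you hit enter, because swapping if= and of= overwrites the disk you're trying to save:

# Block-for-block copy of the suspect disk onto a new disk of equal or larger size.
# conv=noerror,sync keeps going past unreadable sectors (padding them with zeros)
# instead of aborting. Verify device names with "camcontrol devlist" first.
dd if=/dev/ada1 of=/dev/da0 bs=1m conv=noerror,sync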
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
and also #3 is probably the SAFEST as long as you don't screw up and overwrite a disk you are trying to recover.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
One read error on resilver and the pool is gone too.
Why would this be the case? If there's a block read error somewhere in the data, the result should be that the file containing that block would be reported as having a data error. If it's in the metadata, all the metadata is redundant (up to 6x redundant, IIRC) and checksummed, so ZFS would know there's an error, and should be able to grab a clean copy somewhere else. Sure, if another disk outright dies, the pool is gone. But a bad block should have much more limited consequences. Or am I missing something?
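In that case zpool status should list the affected files rather than the pool going away; for example (the pool name here is just a placeholder):

# -v lists any files with permanent (unrecoverable) errors found during scrub/resilver.
zpool status -v tank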
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
and also #3 is probably the SAFEST as long as you don't screw up and overwrite a disk you are trying to recover.

Agreed, that is why I mentioned making an image with CloneZilla. At least from there you can do multiple things with it: image a new drive, or drop the image onto a virtual drive.
 
Joined
Jul 13, 2013
Messages
286
Well, I'm surprised. We rebooted the system with the "failed" new drive in place (which showed no signs of problems in the SMART data, at least; yeah, I know how much that can sometimes miss) and it took that drive back into the pool. So we're back to the original (inadequate) level of redundancy, at least. The fan does seem to be keeping the temps down, so the fan issue was real and we did fix it.

Still transferring the data off as fast as we can, of course. That transfer is also having problems, which will be addressed in a new thread...
 

TwittyFlash

Dabbler
Joined
Mar 23, 2016
Messages
20
Hi folks,

I was experimenting with CloneZilla to back up my FreeNAS until I discovered an error when using it.
I created images of FreeNAS 9.3 and 9.3.1 with CloneZilla, but the images could not be restored.
After restoring the image, 9.3 only showed "Booting...", and 9.3.1 gave "NTLDR is missing / Press Ctrl+Alt+Del to restart".
Has anyone encountered this?
Any advice will be appreciated. Thanks.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
It's probably a good idea to start a new thread. And it looks like you are trying to boot a Windows disk.
 