SOLVED Can't replace drive in GUI

Status
Not open for further replies.

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Hi all.

Hoping someone can help with what seems should be a simple problem to sort out.

Firstly, running the latest FreeNAS 9.10 stable (as of this date). RAIDZ2.
8x 4TB WD NAS drives on an ASRock C2750D4I, about 50% utilized. 16GB ECC RAM.

Had one drive fail, which seemed to trigger a scrub. Waited for that to finish and did some homework in the user guide etc. on what I should do.
After that finished, there was no 'Offline' button, so according to the doco, I skipped that step.
I shut down the machine, replaced the correct drive, and when I powered up, believe it or not, a second drive showed as failed. Shut it down again, raced to the shop and got another drive.

With both new drives in place, powered up and system up, Volume Status shows both new drives "UNAVAIL".
Hitting the Replace button on either drive brings up the popup to replace some long drive name ("Replacing disk xxxxxxxxxxxxx"), but there is nothing in the Member Name dropdown to select, which is apparently required to proceed.

The failed drives were named ADA2 and ADA4; those positions now show other drives, running from ADA0 to ADA5.
Code:
[root@freenas ~]# zpool status -v                                                                                              
  pool: freenas-boot                                                                                                          
state: ONLINE                                                                                                                
  scan: scrub repaired 0 in 0h9m with 0 errors on Wed Aug 31 03:54:34 2016                                                    
config:                                                                                                                        
                                                                                                                              
        NAME                                          STATE     READ WRITE CKSUM                                              
        freenas-boot                                  ONLINE       0     0     0                                              
          gptid/f4ffff42-a067-11e5-8b5e-d05099c044b8  ONLINE       0     0     0                                              
                                                                                                                              
errors: No known data errors                                                                                                  
                                                                                                                              
  pool: volume01                                                                                                              
state: DEGRADED                                                                                                              
status: One or more devices could not be opened.  Sufficient replicas exist for                                                
        the pool to continue functioning in a degraded state.                                                                  
action: Attach the missing device and online it using 'zpool online'.                                                          
   see: http://illumos.org/msg/ZFS-8000-2Q                                                                                    
  scan: resilvered 4.50M in 0h0m with 0 errors on Sun Sep 11 17:06:39 2016                                                    
config:                                                                                                                        
                                                                                                                              
        NAME                                            STATE     READ WRITE CKSUM                                            
        volume01                                        DEGRADED     0     0     0                                            
          raidz2-0                                      DEGRADED     0     0     0                                            
            gptid/cf7186d5-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            3754489478423547485                         UNAVAIL      0     0     0  was /dev/gptid/cfef8fea-a07a-11e5-b543-d05099c044b8                                                                                                                            
            gptid/d079b93c-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            gptid/d0f87a49-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            gptid/d16bc8cf-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            gptid/d1e9f921-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            gptid/d26a48fb-a07a-11e5-b543-d05099c044b8  ONLINE       0     0     0                                            
            17850452976696034476                        UNAVAIL      0     0     0  was /dev/gptid/d2ef02b4-a07a-11e5-b543-d05099c044b8                                                                                                                            
                                                                                                                              
errors: No known data errors  


TBH, I'm in a bit of a panic as to how to get these 2 new drives working.
I've read another thread on here https://forums.freenas.org/index.php?threads/disk-offline-cant-replace.39254/
but could not really make sense of the solution, nor whether it was relevant to my situation.

If anyone could assist, I'd be extremely grateful.
Thanks in advance!


EDIT:
It's not as simple as what this guy did, is it?
https://forums.freenas.org/index.php?threads/problems-to-detach-a-unavailable-disk.45908/
Which for me would be...
zpool offline volume01 3754489478423547485
zpool detach volume01 3754489478423547485
...would it? (obviously performed for both drives)

EDIT 2:
Found an iX video where a comment mentioned that someone else had the same issue, and a basic reboot brought it back.
Tried that, and I can now see ADA2 (in the Member Name dropdown), which can be selected for either replacement drive. I assume this has to be done one drive at a time.
Hit replace and I got a nasty looking warning saying "Disk is not clear, partitions or ZFS labels were found.", with a force option.
Thoughts? My guess would be yes, but confirmation would be a nice warm fuzzy feeling. TIA again!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Are you sure you replaced the right drives? Did you cross-reference the drive serial numbers when replacing them?

If not, I'd suggest putting back the drives you pulled and restarting.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Are you sure you replaced the right drives?
100% positive. I mapped out all drives, cables, SATA ports and system allocation names when the system was built, and confirmed them again prior to pulling it apart.
The serials of the pulled drives were checked against the faulted drives too.
Additionally, I did put back the original drives after this drama started and was left in the same situation: only two "faulted" drives and a "Replace" button for each, no Offline button.

EDIT:
If not, I'd suggest putting back the drives you pulled and restarting.
After having a think about this, I will put those old drives back in again and do a few restarts, just to see if anything is different to the first time around.
I'll report back when that's done in a few hours.

Any suggestions other than what the documentation says, if all I have is a replace button?
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
Hit replace and I got a nasty looking warning saying "Disk is not clear, partitions or ZFS labels were found.", with a force option.
Thoughts? My guess would be yes, but confirmation would be a nice warm fuzzy feeling. TIA again!
Just replaced 10 disks, had this warning for each of them. They probably come pre-formatted now, which gives you the warning.

With two drives failing, I would be cautious about whether the GUI is really doing the right thing. You could always use the basic commands from the CLI or via SSH, and after you are done, export the pool and re-import it to get a clean slate.
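Something along these lines, just as a sketch: the GUID is the one from your zpool status above, ada2 is only an example device name, and keep in mind the GUI normally also partitions the new disk (swap + ZFS) before the replace, which a raw zpool replace skips.
Code:
# confirm the UNAVAIL member's GUID and the new disk's device name
zpool status -v volume01

# replace the missing member (by GUID) with the new disk - ada2 is just an example
zpool replace volume01 3754489478423547485 /dev/ada2

# after the resilver completes, export and re-import for a clean slate
zpool export volume01
zpool import volume01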
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Even if two drives are failed... I'd replace one, let that resilver... then replace the other. Is it possible to resilver two drives simultaneously? Would that theoretically put less load on the other remaining drives? I doubt it, but I don't know :)

If you're lucky, I guess one of the drives might come back... a dodgy drive gives you more redundancy than a drive which is not in the system ;)
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Just replaced 10 disks, had this warning for each of them. They probably come pre-formatted now, which gives you the warning.
Interesting. I'll keep that in mind if I end up going this way, to force the drives back via the GUI.

You could always use the basic commands from the CLI or via SSH, and after you are done, export the pool and re-import it to get a clean slate.
I would love to have the luxury of devices and the knowledge of how to do this, but all I have is a scattering of drives in different PCs and a stack of USB drives to back up onto. Hardly ideal, I know, but it is what it is.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Even if two drives are failed... I'd replace one, let that resilver... then replace the other. Is it possible to resilver two drives simultaneously? Would that theoretically put less load on the other remaining drives? I doubt it, but I don't know :)

If you're lucky, I guess one of the drives might come back... a dodgy drive gives you more redundancy than a drive which is not in the system ;)

Yes, this is probably the way I'm going to go. As soon as I can get out of this office, I'll chuck the old bad drives back in, do a few reboots, then try replacing just the one drive.
This seems like the safest course of action.
I'll report back on progress after that first drive replacement.

Thanks for all the suggestions thus far guys. Greatly appreciated. :)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Oh, the best advice, and I nearly forgot it, is that if you have a backup, you should refresh it before doing the resilver.

Resilvering an array is the most stress you can put on it, and if you have any other latent issues, then that will trigger them. And unfortunately, with 2 drives gone you have no room for further failures...

The best approach would be to ensure your backup is up to date, as updating the backup (assuming incremental/differences only) would put the least stress on the array...

And then... with an up-to-date backup... well... you have nothing to lose, so you can stop sweating bullets :)
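(Just as a sketch, assuming the backup target is a plain disk or another box: an incremental rsync pass only re-copies what has changed since last time, so refreshing it is cheap on the array. Paths below are placeholders.)
Code:
# preview what would be copied, then run it for real (placeholder paths)
rsync -avh --dry-run /mnt/volume01/important/ /mnt/backupdisk/important/
rsync -avh /mnt/volume01/important/ /mnt/backupdisk/important/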
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
And then... with an up-to-date backup... well... you have nothing to lose, so you can stop sweating bullets :)

Thanks mate. There has indeed been plenty of sweat and probably a few tears already shed over this.
I will heed the advice and back up the most important stuff before I do anything.
Again, appreciate all the advice guys.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
FYI, after I put the old faulted drives back in, the volume picked them up in their original positions, and then FreeNAS started a scrub.
If I select one of the faulted drives in Volume Status, I only have "Edit Disk" and "Replace" as options, same as at the start.
If I hit Replace, this time the Member Disk dropdown shows "In-Place [ada2 (4.0TB)]".
Should I be hitting Replace from here? The doco conflicts with that in 8.1.10: "If there is no "Offline" button but only a "Replace" button, then the disk is already offlined and you can safely skip this step."
Also, is it safe to reboot when a scrub is running? I would have thought not, but happy to be corrected. EDIT: Apparently it is, and a scrub can also be stopped by running zpool scrub -s <Poolname>.
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
I never rebooted during a scrub out of fear that the scrub would start over again. Good to know about the -s option. If a drive is faulted, just hitting Replace is fine.

Don't you have an eSATA port or some other free SATA port on the machine? It is so much easier on the nerves to rebuild the system when there is still redundancy. In this case, you should get the option to replace with the new disk in addition to the in-place option.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Don't you have an eSATA port or some other free SATA port on the machine? It is so much easier on the nerves to rebuild the system when there is still redundancy. In this case, you should get the option to replace with the new disk in addition to the in-place option.
I do have some free ports, but they are buried under the cage. Out of sheer fear of 'poking the bear' and touching something else that could cause a problem, I don't want to move anything until it feels safe to do so.
It was set up that way to avoid those Marvell ports, which caused problems for many people on these boards (apparently now fixed by firmware on the controller, which I applied during the build).

IF I manage to get a SATA cable in there without moving anything, and then add the new drive successfully, does FreeNAS handle moving SATA cables around elegantly, or is that another trap? I only ask because, if I get it back up and running, I'd like to keep the cables managed the way I had them.
In the middle of mass copies competing with a scrub on a degraded system. Slowwww with all those small files and will take days by the look. :(
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
IF I manage to get a SATA cable in there without moving anything, and then add the new drive successfully, does FreeNAS handle moving SATA cables around elegantly, or is that another trap?
Yes, this is handled gracefully. My eSATA port sits between the other positions, bumping my ada4 disk to ada5 and hogging the ada4 position when something is attached. ZFS handled this without a problem, reassigning as needed.
In the middle of mass copies competing with a scrub on a degraded system. Slowwww with all those small files and will take days by the look. :(
My one system with uncorrectable errors went from 2 bad sectors to over 80 in the span of a week, with only a very light load. Time is also a factor, so try to minimize it if possible. Copying small files takes ages via the file system. Do you use rsync, compression to a remote archive or zfs send/receive? This could help you a lot.
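For example, a snapshot streamed through gzip onto one of those USB disks avoids the per-file overhead entirely; the dataset name and paths below are just placeholders.
Code:
# snapshot the dataset, then stream it compressed into a single archive file
zfs snapshot volume01/photos@backup1
zfs send volume01/photos@backup1 | gzip > /mnt/usb-backup/photos-backup1.zfs.gz

# later, an incremental send ships only what changed between two snapshots
zfs snapshot volume01/photos@backup2
zfs send -i volume01/photos@backup1 volume01/photos@backup2 | gzip > /mnt/usb-backup/photos-inc.zfs.gz
Restoring is the reverse: gunzip -c the archive and pipe it into zfs receive.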

Check /var/log/daemon.log for your disks; something like cat /var/log/daemon.log | grep ada4 was sufficient to see the messages for the problematic disk. If you compare the issues on each disk, maybe you can make an educated guess about which one should be replaced first. smartctl -a /dev/ada4 also helps.

When I decided to replace all my 5-year-old disks, I checked them with this procedure after each resilver to make sure I spotted any further issues and could replace a troublemaker before the others.
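For example (ada4 is just the example device from above, and the exact SMART attribute names vary a little between drive models):
Code:
# recent driver/kernel messages mentioning the suspect disk
grep ada4 /var/log/daemon.log

# the SMART counters most worth watching between resilvers
smartctl -a /dev/ada4 | grep -E "Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable"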
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Do you use rsync, compression to a remote archive or zfs send/receive?
No. I must admit, for a person in IT, I'm a very naughty boy and have neglected replication and backups. :oops: Yes, I know. Stupid. :rolleyes:
I do have backups of some of the critical work, photos and home vids of the kids, but they are all in the process of being madly refreshed as they are way out of date.
The largest chunk of the NAS is consumed by a huge amount of DVD, Blu-ray and CD rips, which took me months and several burned-out optical drives to rip and convert before packing all the original boxes away in nice containers. It'll hurt if I have to do those again.

I've stopped the scrub and things are moving much quicker now, so I hope to test plugging in a 9th drive tomorrow to see how that goes.
TBH I never thought it would be a drama to replace a failed drive when the failed drive was no longer in its bay. From the user guide, I honestly thought I'd be able to do that.
Lesson learned the hard way.

Thanks for all the help so far. :)
 

emk2203

Guru
Joined
Nov 11, 2012
Messages
573
TBH I never thought it would be a drama to replace a failed drive when the failed drive was no longer in its bay. From the user guide, I honestly thought I'd be able to do that.
Well, you are. If you have two failed disks, you can replace them and rebuild the RAIDZ2 array. You just live without redundancy during the process, which makes a lot of people nervous, including me. Been there, done that.

The difference between "One more issue and the pool is gone" and "I have to do something, but at least the pool can take one more issue" is huge. Our situations are very, very similar, and I was in your shoes not two weeks ago. Now, eleven disks later, I am finishing my final scrub on the last pool.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
Well, you are. If you have two failed disks, you can replace them and rebuild the RAIDZ2 array. You just live without redundancy during the process, which makes a lot of people nervous, including me. Been there, done that.
Sorry, I should have rephrased that. What I meant was that I followed all the instructions but was still taken aback by the problems I encountered initially, as well as by the "Disk is not clear, partitions or ZFS labels were found." message; finding information about that was difficult.

Fortunately, I'm now in the position of having only the least critical backups remaining and about 12-18 hours' worth of copies left before I can attempt a fix. I might even have the luxury of being able to afford some experimentation so I can feel a bit more comfortable about this when it happens again. I know it will, as I'm well acquainted with Murphy. :eek:
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I think you're beginning to understand why the focus is on refreshing the backup, then restoring redundancy, then replacing the disks :)

Reducing the stress/fear of losing data is the most important thing; the rest is just time.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
OK, finally getting around to the "experiment". Found a way to connect and hang a 9th drive outside the case.
Had some fun working out why the 9th drive would not be seen; it turned out the Marvell controller was turned off for the SATA3 drives.
Rebooted, and after triple-checking the new ADAx allocations against serials (which annoyingly keep moving around), I managed to find the magic Offline button this time for the old ADA2 drive (the first one that failed).
Resilver in progress. Now for the nail-biting wait to see if the volume survives and I can get on to the second replacement...
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
as well as the "Disk is not clear, partitions or ZFS labels were found." message and trying to find information about that was difficult.
This, at least, is a bug, and I think it's scheduled to be fixed with the next update. Annoying, but easy enough to check the "force replace" box and proceed.
 

cods69

Explorer
Joined
Sep 11, 2016
Messages
50
This, at least, is a bug, and I think it's scheduled to be fixed with the next update. Annoying, but easy enough to check the "force replace" box and proceed.
That's very handy to know. Appreciate it.
So, an update. The resilver of the drive seemed to go OK, but over the duration (1+ day) I saw a few "CAM status: Command timeout" errors on the new drive.
It did complete, though, so I swapped the drive back into the rack after a power-down. It immediately showed a failure. A power reset still did not bring it back, so I took the SATA cable that was used for the resilver, switched that onto the back of the rack, and now it's up clean as a whistle.
Additionally, the second drive that failed has somehow flagged itself as good again, and the volume shows as healthy.
Weird.

Either way, I know there's some sort of underlying issue here, which is either SATA cable, PSU or DS380 rack backplane related.
Going to start swapping things to see if I can isolate the issue, but right now the volume shows as good, so I'm going to start a scrub, JIC.
 