Unable to ONLINE a drive back into the pool

Status
Not open for further replies.

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Hi. Yes, I've searched. Yes, I've read. The suggestions and solutions out there are not working. I'm obviously doing something wrong. Please help.

I'm using FreeNAS-8.3.0-RELEASE-x64.
The other day I took a drive in the raidz1 pool offline via the GUI to attempt to correct some errors:
Dec 18 13:31:16 freenas smartd[2462]: Device: /dev/ada3, 9 Currently unreadable (pending) sectors
Dec 18 13:31:16 freenas smartd[2462]: Device: /dev/ada3, 9 Offline uncorrectable sectors

I now want to bring the drive back online to restore the raidz1 to full functionality, but the usual commands are not working. Here is the output of zpool status -v:
Code:
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        guenther                                        DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/9ed006bc-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            gptid/9f5c3bef-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            13681489071282867862                        OFFLINE      0     0     0  was /dev/dsk/gptid/f9d6fcfd-4922-11e2-9f56-902b349b289a

errors: No known data errors

Here are all the attempts at bringing the drive back ONLINE:
Code:
zpool online 13681489071282867862
zpool online -e raidz1-0 13681489071282867862
zpool online dev/dsk/gptid/f9d6fcfd-4922-11e2-9f56-902b349b289a
zpool online 13681489071282867862 ada3p2
zpool online dev/dsk/gptid/f9d6fcfd-4922-11e2-9f56-902b349b289a ada3p2
zpool online f9d6fcfd-4922-11e2-9f56-902b349b289a ada3p2
zpool replace 13681489071282867862 ada3

What is the proper syntax? When I run just zpool online 13681489071282867862, it says I'm using it incorrectly:
Code:
[root@freenas ~]# zpool online 13681489071282867862                             
missing device name                                                             
usage:                                                                          
        online [-e] <pool> <device> ...

Am I missing something really obvious? Thanks.
 

uutzinger

Dabbler
Joined
Nov 27, 2011
Messages
43
I am curious also because I wonder whether the drive would need to be wiped before adding it back to the pool or whether the procedure to offline a drive and to put it back online is possible with a drive that has not participated for a while in the pool.
Wiping and adding it back into the pool can take a very long time.

Addendum:
http://docs.oracle.com/cd/E19253-01/819-5461/gazgm/index.html
This might explain the proper commands.
It appears to me the commands should have guenther and raidz1-0 or similar in them.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Now that's a useful post!

diedrichg said: "Am I missing something really obvious? Thanks."
Actually, yes. Your syntax is wrong.
Code:
zpool online poolName device

# e.g.
zpool online guenther gptid/f9d6fcfd-4922-11e2-9f56-902b349b289a


diedrichg said: "The other day I took a drive in the raidz1 pool offline via the GUI to attempt to correct some errors"
Come to think of it, how did you "correct" the errors?
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
uutzinger said: "I am curious also because I wonder whether the drive would need to be wiped before adding it back to the pool or whether the procedure to offline a drive and to put it back online is possible with a drive that has not participated for a while in the pool."
ZFS keeps track and will resilver only the changes needed. I would scrub afterwards in case whatever block(s) were bad also contained data.
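
For anyone following along, here is a rough sketch of that online-then-scrub sequence from the shell. The pool and device names are the ones from the first post. The `run` helper, the `DRY_RUN` switch, and the two function names are made up for illustration, so the commands can be previewed before anything touches the pool. This assumes the disk still carries its ZFS label (i.e. it has not been wiped).

```shell
#!/bin/sh
# Sketch only: bring an offlined pool member back, then scrub later.
# POOL and DEVICE come from the thread; substitute your own.
POOL="guenther"
DEVICE="gptid/f9d6fcfd-4922-11e2-9f56-902b349b289a"

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Online the device; ZFS resilvers only what changed while it was offline.
online_device() {
    run zpool online "$POOL" "$DEVICE"
    run zpool status -v "$POOL"
}

# Scrub the whole pool; meant to be run once the resilver has finished.
scrub_pool() {
    run zpool scrub "$POOL"
}

online_device
```

With the default dry run this only prints the two commands. The scrub is kept in its own function on purpose, so it can be started by hand after zpool status shows the resilver is done.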
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
The unavailable sectors were not resolved. I opted to keep the drive offline and zero it out with the intent of writing over the unavailable blocks. I don't know yet if it is resolved. I now need to bring it back online in the pool because it still shows as part of the pool but unavailable.

Running zpool online only left the drive unavailable, and the output said I needed to zpool rename. OK... how? How do I get this drive back online with the rest of the pool?
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
diedrichg said: "I opted to keep the drive offline and zero it out with the intent of writing over the unavailable blocks."
If you wiped the drive then it's a new drive as far as ZFS is concerned. You should be able to click Replace in the GUI and select the zeroed drive.
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
paleoN said: "If you wiped the drive then it's a new drive as far as ZFS is concerned. You should be able to click Replace in the GUI and select the zeroed drive."
I have attempted just that and I get:
Code:
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h21m with 0 errors on Tue Dec 18 16:39:15 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        guenther                                        DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/9ed006bc-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            gptid/9f5c3bef-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            1111973921623465147                         UNAVAIL      0     0     0  was /dev/dsk/ada2

errors: No known data errors

Then from the GUI I select Replace - I get:
Code:
Unable to GPT format the disk "ada2"

Exhaustive searching here and online has not helped.

Edit: After detaching the drive (it still showed in the pool) and zeroing it out, I ran a scrub. As the scan line at the top of the status output shows, it found no errors.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
diedrichg said: "Then from the GUI I select Replace - I get: Unable to GPT format the disk 'ada2'"
I thought it was ada3? Let's start from the beginning:
Code:
camcontrol devlist

gpart show

glabel status
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Bringing it back online results in:
Code:
[root@freenas ~]# zpool online -e guenther 1111973921623465147                  
warning: device '1111973921623465147' onlined, but remains in faulted state     
use 'zpool replace' to replace devices that are no longer present


ok, so here we go again...



camcontrol devlist
Code:
<WDC WD10EZEX-00RKKA0 80.00A80>    at scbus3 target 0 lun 0 (pass0,ada0)        
<WDC WD10EZEX-00RKKA0 80.00A80>    at scbus4 target 0 lun 0 (pass1,ada1)        
<WDC WD10EZEX-00RKKA0 80.00A80>    at scbus5 target 0 lun 0 (aprobe0,pass2,ada2)
<ASUS DRW-24B1ST   a 1.04>         at scbus6 target 0 lun 0 (pass3,cd0)         
<Generic USB Flash Disk 1.00>      at scbus8 target 0 lun 0 (pass4,da0)

gpart show
Code:
=>        34  1953525101  ada0  GPT  (931G)                                     
          34          94        - free -  (47k)                                 
         128     4194304     1  freebsd-swap  (2.0G)                            
     4194432  1949330696     2  freebsd-zfs  (929G)                             
  1953525128           7        - free -  (3.5k)                                
                                                                                
=>        34  1953525101  ada1  GPT  (931G)                                     
          34          94        - free -  (47k)                                 
         128     4194304     1  freebsd-swap  (2.0G)                            
     4194432  1949330696     2  freebsd-zfs  (929G)                             
  1953525128           7        - free -  (3.5k)                                
                                                                                
=>      63  31457217  da0  MBR  (15G)                                           
        63   1930257    1  freebsd  [active]  (942M)                            
   1930320        63       - free -  (31k)                                      
   1930383   1930257    2  freebsd  (942M)                                      
   3860640      3024    3  freebsd  (1.5M)                                      
   3863664     41328    4  freebsd  (20M)                                       
   3904992  27552288       - free -  (13G)                                      
                                                                                
=>      0  1930257  da0s1  BSD  (942M)                                          
        0       16         - free -  (8.0k)                                     
       16  1930241      1  !0  (942M)

glabel status
Code:
                                      Name  Status  Components                  
gptid/9ed006bc-3e2c-11e2-a2f0-902b349b289a     N/A  ada0p2                      
gptid/9f5c3bef-3e2c-11e2-a2f0-902b349b289a     N/A  ada1p2                      
                             ufs/FreeNASs3     N/A  da0s3                       
                             ufs/FreeNASs4     N/A  da0s4                       
                            ufs/FreeNASs1a     N/A  da0s1a
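
As a side note, a small filter like the one below can answer "which labels live on disk X?" from the glabel status output above. This is only a sketch: `labels_on_disk` is a made-up helper name, and it assumes the three-column layout shown here (label, status, component).

```shell
#!/bin/sh
# Sketch: read `glabel status` output on stdin and print the labels whose
# component is a partition or slice of the given disk.
# labels_on_disk is a hypothetical helper name.
labels_on_disk() {
    # $1 = disk name such as ada0 or da0
    awk -v disk="$1" '$3 ~ ("^" disk "[ps][0-9]") { print $1 " -> " $3 }'
}

# Usage:
#   glabel status | labels_on_disk ada0
```

For the output in this post, ada0 and ada1 each show one gptid label, while ada2 shows nothing at all, which matches gpart show not listing ada2 either.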
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Offline, online, and replace (if I had the syntax right) haven't worked. What about detach? P.S. Thank you for your patience; I really appreciate your effort in trying to get me back up!
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
diedrichg said: "What about detach?"
Detach doesn't work on raidz arrays.

Not sure why the GUI isn't working for you. How is ada2 connected anyway?

We can always do it from the CLI:
Code:
gpart create -s gpt ada2

gpart add -i 1 -b 128 -t freebsd-swap -s 2G ada2

gpart add -i 2 -t freebsd-zfs ada2
I'm doing that from memory. Double check the 2G swap partition is 4k aligned, i.e. gpart show should show the same thing for ada2 as for ada0.
Code:
zpool replace guenther 1111973921623465147 ada2p2
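
The 4k alignment check mentioned above can be done with simple arithmetic on the start sectors that gpart show prints. A minimal sketch, assuming 512-byte sectors unless told otherwise; `is_4k_aligned` is a made-up helper name, not part of gpart.

```shell
#!/bin/sh
# Sketch: succeed if a partition's start sector falls on a 4 KiB boundary.
# is_4k_aligned is a hypothetical helper; the sector size defaults to
# 512 bytes (an assumption) and can be passed as a second argument.
is_4k_aligned() {
    start_sector=$1
    sector_size=${2:-512}
    # Number of sectors that make up 4096 bytes (8 for 512-byte sectors).
    sectors_per_4k=$(( 4096 / sector_size ))
    [ $(( start_sector % sectors_per_4k )) -eq 0 ]
}

# Start sectors of the swap and zfs partitions on ada0, from gpart show.
for start in 128 4194432; do
    if is_4k_aligned "$start"; then
        echo "sector $start: aligned"
    else
        echo "sector $start: NOT aligned"
    fi
done
```

Both start sectors from the ada0 layout are multiples of 8, so they pass. After partitioning ada2, the same two numbers should show up in its gpart show output.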
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
I'm back up, I think. I took the drive offline:
Code:
zpool offline -t [poolname] [long drive name]
I then shut the system down and physically swapped sata cables with one of the other drives. When I rebooted the drives were in a different order with the affected drive now as ada0. I went directly to Volume Status and clicked Replace next to the long drive name and selected ada0. The affected drive immediately started to resilver. I was then left with the old long drive name and a button to detach. I did that, am now rebooting and ... yup, my volumes show as HEALTHY!

So sorry for taking up your time but I really appreciate your help! Thank you, thank you, thank you!
Code:
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Dec 18 17:25:11 2012
        154G scanned out of 435G at 289M/s, 0h16m to go
        51.3G resilvered, 35.44% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        guenther                                        ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/9ed006bc-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            gptid/9f5c3bef-3e2c-11e2-a2f0-902b349b289a  ONLINE       0     0     0
            gptid/c8f6c9b7-4961-11e2-9104-902b349b289a  ONLINE       0     0     0  (resilvering)

errors: No known data errors


Final thoughts - I believe that for some reason the drive was not being fully released from the pool and a shutdown and physical cable change shocked it back into a true offline state. BTW, no amount of reboots had helped the original troubleshooting.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
diedrichg said: "I went directly to Volume Status and clicked Replace next to the long drive name and selected ada0. The affected drive immediately started to resilver."
Excellent.

diedrichg said: "I then shut the system down and physically swapped sata cables with one of the other drives."
I would check the SMART info for all your drives. If a SATA cable was only loose & was reseated everything should be fine, but if you have a flaky cable you will want to replace it.
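
One way to do that from the shell is to filter the attribute table that smartctl -A prints. This is a sketch, not a definitive check. `smart_summary` is a made-up helper name, it assumes the usual ten-column attribute table, and it picks out the two counters smartd complained about plus the CRC error counter, which is commonly associated with cabling problems.

```shell
#!/bin/sh
# Sketch: pull the sector and interface-error counters out of
# `smartctl -A` output read from stdin.
# smart_summary is a hypothetical helper; it assumes the attribute name is
# in column 2 and the raw value in column 10.
smart_summary() {
    awk '$2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" ||
         $2 == "UDMA_CRC_Error_Count" { print $2 "=" $10 }'
}

# Usage (drive names from the thread; adjust to your system):
#   for d in ada0 ada1 ada2; do
#       echo "== $d"
#       smartctl -A "/dev/$d" | smart_summary
#   done
```

Non-zero pending or uncorrectable counts point at the disk itself. A CRC count that keeps climbing after the cable has been reseated suggests the cable is the part to replace.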

diedrichg said: "Final thoughts - I believe that for some reason the drive was not being fully released from the pool and a shutdown and physical cable change shocked it back into a true offline state. BTW, no amount of reboots had helped the original troubleshooting."
Something was "stuck" somewhere, but as you didn't have a partition table on the affected drive I don't know what it would be.
 