
Attempted to replace a failing drive with a brand-new, dead-on-arrival drive; need some help...

Joined: Jan 10, 2017 · Messages: 14 · Thanks: 0
#1
Hello all,

My NAS (config in my signature) started throwing concerning errors on ada1, so I ordered a replacement drive.

When the replacement drive arrived, I OFFLINE'd ada1 (using the GUI) and shut down the system so I could get the old drive out and the replacement drive in (no hot-swap hardware in this box).

After swapping in the replacement drive, I restarted the system. In the GUI, I looked at the pool, found the old drive, clicked its Options, and then Replace. I then selected the replacement drive and clicked the REPLACE DISK button, all just as written up in the documentation.
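(As an aside, if I understand the docs correctly, the GUI's REPLACE DISK button is roughly equivalent to a `zpool replace` on the command line; the new-disk gptid below is a placeholder, not something I actually ran:)

```shell
# Illustrative sketch of what the GUI does under the hood:
# replace the failing member of the raidz2 vdev with the new disk's
# ZFS partition. <new-partition-gptid> is a placeholder.
zpool replace tank gptid/65bacdb2-f144-11e6-9a9b-001e67e092ac gptid/<new-partition-gptid>
```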

The console immediately started spewing these kinds of errors:

Code:
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 f2 a0 40 ba 02 00 00 00 00
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): Retrying command
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 00 00 40 00 00 00 00 00 00
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): Error 5, Retries exhausted
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 02 00 40 00 00 00 00 00 00
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): Error 5, Retries exhausted
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 f0 a0 40 ba 02 00 00 00 00
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): Error 5, Retries exhausted
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 f2 a0 40 ba 02 00 00 00 00
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 15 09:18:27 server (ada1:ahcich1:0:0:0): Error 5, Retries exhausted


Being slow on the uptake, I ran zpool status a couple of times and saw that the resilvering wasn't starting. I decided to shut down the system, remove the replacement drive, put the old drive back in, and reboot. In some kind of panic, I got my system back into a state where the GUI describes my pool (tank) as follows:

Code:
tank    0    0    0    DEGRADED
 RAIDZ2    0    0    0    DEGRADED
  ada0p2    0    0    0    ONLINE
  ada3p2    0    0    0    ONLINE
  ada2p2    0    0    0    ONLINE
  REPLACING    0    0    0    DEGRADED
   ada1p2    0    0    0    ONLINE
   /dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac    0    0    0    OFFLINE
  ada4p2    0    0    0    ONLINE


Or, if you'd prefer the shell version:

Code:
# zpool status tank
  pool: tank
 state: DEGRADED
  scan: scrub repaired 0 in 0 days 04:47:13 with 0 errors on Sun Aug 25 05:17:14 2019
config:

    NAME                                              STATE     READ WRITE CKSUM
    tank                                              DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/938511ff-d954-11e8-b0a5-001e67e092ac    ONLINE       0     0     0
        gptid/e5e894dd-f1e5-11e6-a4e7-001e67e092ac    ONLINE       0     0     0
        gptid/66d967c3-4d16-11e7-8664-001e67e092ac    ONLINE       0     0     0
        replacing-3                                   DEGRADED     0     0     0
          gptid/65bacdb2-f144-11e6-9a9b-001e67e092ac  ONLINE       0     0     0
          10491244544366002744                        OFFLINE      0     0     0  was /dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac
        gptid/2ba1b21b-4d53-11e7-8664-001e67e092ac    ONLINE       0     0     0

errors: No known data errors
#


Here's the output from glabel; scanning through all this, I believe that /dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac is the dead-on-arrival replacement drive that is no longer attached, and gptid/65bacdb2-f144-11e6-9a9b-001e67e092ac is the original error-throwing drive (ada1) that I added back into the pool:

Code:
# glabel list
Geom name: ada0p2
Providers:
1. Name: gptid/938511ff-d954-11e8-b0a5-001e67e092ac
   Mediasize: 3998639456256 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842688
   length: 3998639456256
   index: 0
Consumers:
1. Name: ada0p2
   Mediasize: 3998639456256 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e2

Geom name: ada1p2
Providers:
1. Name: gptid/65bacdb2-f144-11e6-9a9b-001e67e092ac
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada1p2
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e2

Geom name: ada2p2
Providers:
1. Name: gptid/66d967c3-4d16-11e7-8664-001e67e092ac
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada2p2
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e2

Geom name: ada3p2
Providers:
1. Name: gptid/e5e894dd-f1e5-11e6-a4e7-001e67e092ac
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada3p2
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e2

Geom name: ada4p2
Providers:
1. Name: gptid/2ba1b21b-4d53-11e7-8664-001e67e092ac
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 7809842696
   length: 3998639460352
   index: 0
Consumers:
1. Name: ada4p2
   Mediasize: 3998639460352 (3.6T)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2147549184
   Mode: r1w1e2

Geom name: da0p1
Providers:
1. Name: gptid/fa0bc54d-efb5-11e6-8c15-001e67e092ac
   Mediasize: 524288 (512K)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0
   secoffset: 0
   offset: 0
   seclength: 1024
   length: 524288
   index: 0
Consumers:
1. Name: da0p1
   Mediasize: 524288 (512K)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0

Geom name: da1p1
Providers:
1. Name: gptid/fa32753a-efb5-11e6-8c15-001e67e092ac
   Mediasize: 524288 (512K)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0
   secoffset: 0
   offset: 0
   seclength: 1024
   length: 524288
   index: 0
Consumers:
1. Name: da1p1
   Mediasize: 524288 (512K)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 17408
   Mode: r0w0e0

Geom name: ada0p1
Providers:
1. Name: gptid/9377a876-d954-11e8-b0a5-001e67e092ac
   Mediasize: 2147483648 (2.0G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 65536
   Mode: r0w0e0
   secoffset: 0
   offset: 0
   seclength: 4194304
   length: 2147483648
   index: 0
Consumers:
1. Name: ada0p1
   Mediasize: 2147483648 (2.0G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 65536
   Mode: r0w0e0

#
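
(For anyone following along, a terser way to see the same gptid-to-device mapping is `glabel status`; the output shape below is from memory, so treat it as approximate:)

```shell
# Compact gptid -> device mapping; much less verbose than 'glabel list'.
glabel status
#                                       Name  Status  Components
# gptid/65bacdb2-f144-11e6-9a9b-001e67e092ac     N/A  ada1p2
# ...
```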


At this point, I know enough to know that I don't know enough to get myself out of this mess that was admittedly my own doing.

I *think* what I'm looking to do is to remove the offline /dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac, and get rid of the "replacing" status because the original, error-throwing drive (ada1) seems to be online. At that point, I'm hoping I'll be able to once again follow the documentation to replace ada1 with a new replacement drive, which I will test thoroughly before attempting to do so.
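
If I'm reading the zpool man page right, the CLI way to do this might be a `zpool detach` of the missing half of the replacing vdev, but I'd love confirmation before I try it:

```shell
# Sketch, not yet run: detach the DOA (OFFLINE) member of replacing-3.
# For a missing device, zpool detach should accept the numeric guid
# shown in 'zpool status' in place of a device path.
zpool detach tank 10491244544366002744
```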

Any words of wisdom would be greatly appreciated at this point--thanks for reading this far!
 

dlavigne · Storage Engineering · Administrator · Moderator · iXsystems
Joined: May 24, 2011 · Messages: 11,644 · Thanks: 1,017
#2
Were you able to resolve this? If so, how?
 
#3
Were you able to resolve this? If so, how?
Thanks for asking! I'm afraid not, as I'm really not sure how to begin.
In an ideal world, I'd like to get the system to delete that DOA replacement drive (the "/dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac 0 0 0 OFFLINE" entry in the GUI pool status), clear out the "Replacing" status, and then try the replacement again with a drive that's been running badblocks for the past few days.
 

dlavigne
#4
If you offline the DOA disk and put in a good disk, does it start to resilver?
 
#5
#5
If you offline the DOA disk and put in a good disk, does it start to resilver?
The DOA disk (the "/dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac 0 0 0 OFFLINE" line in the GUI pool status) is already offline. According to the GUI, I can replace it, but I'm not sure that would be advisable, as this DOA disk was supposed to be replacing the one named ada1 in the first place.

I now have a known-good drive in the box that I can use to replace...something. The question is, which one? If I replace the DOA drive, will that make the DOA drive go away but not resilver, so ada1 is never replaced (and the pool stays degraded)? Or will the DOA drive be replaced, and the originally-intended resilvering to replace ada1 then complete? Or, if I replace ada1 with the known-good drive as I originally intended, will the pool no longer be degraded and the entry for the DOA drive be removed? Or will I just end up making things more broken?

I'm not afraid to take a chance here, but I would feel better about that chance if I had a feeling which might be the more sensible one to try.

Any thoughts on which approach might be more sensible? Fear not, I won't hold you responsible for the outcome.

And above all, thanks for continuing the conversation!
 

dlavigne
#6
Replacing the DOA disk with the good disk should let the resilver complete. Let us know how it works out for you.
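
(If you prefer the shell, the equivalent should be something along these lines, with the new disk's partition gptid as a placeholder:)

```shell
# Replace the missing DOA member (identified by its guid from
# 'zpool status') with the known-good disk's ZFS partition.
# <new-partition-gptid> is a placeholder for the new disk.
zpool replace tank 10491244544366002744 gptid/<new-partition-gptid>
```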
 
#7
#7
I replaced the DOA disk (the "/dev/gptid/69edc2e6-bf5f-11e9-abca-001e67e092ac 0 0 0 OFFLINE" line in the GUI pool status) with the known-good drive and everything just...worked. The entry for the DOA drive disappeared, the entry for the originally-failing drive ("ada1p2 0 0 0 ONLINE" in the GUI pool status) disappeared, and the pool is no longer degraded. After bouncing the system to pull out the old drive and slot the new one into its place, everything is back to normal.

Thanks for being there, @dlavigne--I really appreciated your support!
 