Upgrading drives now ALL degraded

evlvd

Dabbler
Joined
May 30, 2015
Messages
18
I've been upgrading my 10x 4TB drives to 10x 10TB drives one at a time (raid-z2). For the last 4 drives, I replaced two at a time while leaving all drives connected (12 drives at once), then removed the old drives. Everything was fine and went very smoothly until the last drives, which were being resilvered when it started to look like I had one more drive to replace. I removed the last 2 old drives so that I had only the 10 new 10TB drives, and had a faulted drive, so I restarted and scrubbed. Tons of errors started popping up, and now all drives are faulted with errors. I'm not sure what happened or what the best way to fix it is.

I still have the 10x 4TB drives and don't know whether it would be better to start putting those back in or to start over with just the 4TB drives. Although it shows 10 errors, it's nothing critical and I have backups. I also can't remove the unavailable drives with detach, because it gives me the error no replicas available, even though they have been replaced and resilvered.

Anyone have some ideas of what to do next to fix this?

Code:
root@freenas:~ # zpool status -v
  pool: JailsSSD
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:22:26 with 0 errors on Sun Oct 20 00:22:27 2019
config:

        NAME                                          STATE     READ WRITE CKSUM
        JailsSSD                                      ONLINE       0     0     0
          gptid/225c5dcc-cdbf-11e8-8739-0cc47a40699d  ONLINE       0     0     0

errors: No known data errors

  pool: Media
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 18:37:26 2019
        1.40T scanned at 1.72G/s, 330G issued at 405M/s, 32.3T total
        88.6M resilvered, 1.00% done, 0 days 22:58:25 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             DEGRADED     0     0    86
          raidz2-0                                        DEGRADED     0     0   172
            gptid/31f8eaec-0771-11ea-8878-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/e7f2af10-06a4-11ea-a7e0-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/6b83f8a4-0771-11ea-8878-0cc47a40699d    FAULTED      3   177    24  too many errors
            replacing-3                                   DEGRADED     0     0     0
              8734188899632115243                         UNAVAIL      0     0     0  was /dev/gptid/af4f2962-85a8-11e5-975c-0cc47a40699d
              gptid/a655397b-082f-11ea-8137-0cc47a40699d  ONLINE       0     0     0
            gptid/a2c7f92a-05d9-11ea-9c1e-0cc47a40699d    DEGRADED     0     0     0  too many errors
            replacing-5                                   DEGRADED     0     0     0
              6482462145072973473                         UNAVAIL      0     0     0  was /dev/gptid/7bb4f9ec-c0e2-11e6-b93d-0cc47a40699d
              gptid/d433c504-08f6-11ea-9164-0cc47a40699d  ONLINE       0     0     0
            gptid/db5cf4e4-051b-11ea-a112-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/9087a6ef-037c-11ea-a888-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/ac95eee3-0284-11ea-b143-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/cf3f5c1e-043a-11ea-8129-0cc47a40699d    DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        Media/Media/Movies@auto-20191116.0900-1m:<0x0>
        /var/db/system/syslog-adb946163d914f088dc14617dbc0bec3/log/utx.log
        /var/db/system/syslog-adb946163d914f088dc14617dbc0bec3/log/samba4/log.winbindd
        /var/db/system/syslog-adb946163d914f088dc14617dbc0bec3/log/cron
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        Media/Media@auto-20191020.0900-1m:/Music/MP3/****.mp3
        <0x41f7>:<0x424>
        <0x41f7>:<0x428>
        <0x41f7>:<0x42b>
        <0x41f7>:<0x42c>
        <0x41f7>:<0x42e>
        <0x41f7>:<0x42f>
        <0x41f7>:<0x430>
        <0x41f7>:<0x431>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 01:25:35 with 0 errors on Fri Nov 15 05:10:35 2019
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/78ecb373-aa82-11e4-b2d8-0cc47a40699d  ONLINE       0     0     0

errors: No known data errors
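
Long `zpool status` listings like the one above are easier to compare between resilvers once they are boiled down to a few fields. The following is only a rough sketch of how the `config:` table could be summarized; the function names (`parse_config`, `unhealthy`, `with_errors`) are hypothetical, the sample rows are copied from the output above, and the list of states in the regular expression is an assumption that covers the states seen in this thread plus a few common ones.

```python
import re

# One row of the `config:` table: name, state, then the READ/WRITE/CKSUM counters.
# Counters stay as strings because zpool abbreviates large values (e.g. 1.49K).
# Assumption: only these states need to be recognized for this illustration.
ROW = re.compile(
    r"^(?P<indent>\s+)(?P<name>\S+)\s+"
    r"(?P<state>ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)\s+"
    r"(?P<read>\S+)\s+(?P<write>\S+)\s+(?P<cksum>\S+)"
)

# A few rows copied from the Media pool output in the first post.
SAMPLE = """
        NAME                                              STATE     READ WRITE CKSUM
        Media                                             DEGRADED     0     0    86
          raidz2-0                                        DEGRADED     0     0   172
            gptid/6b83f8a4-0771-11ea-8878-0cc47a40699d    FAULTED      3   177    24  too many errors
            replacing-3                                   DEGRADED     0     0     0
              8734188899632115243                         UNAVAIL      0     0     0  was /dev/gptid/af4f2962-85a8-11e5-975c-0cc47a40699d
              gptid/a655397b-082f-11ea-8137-0cc47a40699d  ONLINE       0     0     0
"""


def parse_config(status_text):
    """Return one dict per table row; header and prose lines are skipped."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            # Indentation depth reflects nesting (pool > vdev > device).
            row["depth"] = len(row.pop("indent"))
            rows.append(row)
    return rows


def unhealthy(rows):
    """Names of every entry whose state is anything other than ONLINE."""
    return [r["name"] for r in rows if r["state"] != "ONLINE"]


def with_errors(rows):
    """Names of every entry with a non-zero READ, WRITE, or CKSUM counter."""
    return [
        r["name"]
        for r in rows
        if any(r[key] != "0" for key in ("read", "write", "cksum"))
    ]
```

Running `with_errors(parse_config(SAMPLE))` on the sample separates the entries that actually accumulated counters (the pool, the raidz2 vdev, and the faulted drive) from the ones that are merely flagged as degraded or unavailable.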
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Could it be that you bumped a cable at the controller end somewhere, or even had the HBA move a bit, when you performed the last action?

With all drives affected like that, you need to go back to the common point, which would be the disk controller. (You don't mention it in your sig, but I can't find more than 6 SATA ports on that motherboard in the pictures and specs I can see from Google, so I assume you must have an HBA.)
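
The "common point" reasoning can be written down as a very rough heuristic: when most of the drives are flagged at the same moment, a shared component is a more plausible suspect than that many independent drive failures. The sketch below is only an illustration of that idea, not a diagnostic tool. The function name `suspect_shared_path` and the 0.5 threshold are assumptions, and the dictionary keys are the gptids from the first post shortened to their first eight characters.

```python
# Leaf-drive states from the first `zpool status` in this thread.
# Keys are the first eight characters of each gptid, shortened for readability.
first_post_states = {
    "31f8eaec": "DEGRADED",
    "e7f2af10": "DEGRADED",
    "6b83f8a4": "FAULTED",
    "a655397b": "ONLINE",
    "a2c7f92a": "DEGRADED",
    "d433c504": "ONLINE",
    "db5cf4e4": "DEGRADED",
    "9087a6ef": "DEGRADED",
    "ac95eee3": "DEGRADED",
    "cf3f5c1e": "DEGRADED",
}


def suspect_shared_path(leaf_states, threshold=0.5):
    """Return True when more than `threshold` of the drives are not ONLINE.

    Assumption: the threshold is arbitrary and only serves to separate
    "one or two bad drives" from "nearly everything flagged at once".
    """
    if not leaf_states:
        return False
    flagged = sum(1 for state in leaf_states.values() if state != "ONLINE")
    return flagged / len(leaf_states) > threshold
```

For the states listed here, eight of the ten drives are flagged, so `suspect_shared_path(first_post_states)` returns `True`, which is consistent with looking at the HBA and cabling first.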
 

evlvd

I think you're right. I do have an HBA; I'll update my sig. I thought it was weird that I could see the drives fine, but I got a SMART error this morning, which makes me think the cable could be bad. I'll switch to a different one and try a scrub again.

I got this error this morning via email:

New alerts:
* Device: /dev/da7 [SAT], Read SMART Self-Test Log Failed
* Device: /dev/da7 [SAT], not capable of SMART self-check
* Device: /dev/da7 [SAT], Read SMART Error Log Failed
* Device: /dev/da7 [SAT], failed to read SMART Attribute Data

This is currently my resilvering status, which is showing a ton of checksum errors, but not on the drives:
Code:
  pool: Media
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 18:37:26 2019
        30.1T scanned at 588M/s, 29.1T issued at 565M/s, 32.3T total
        9.70G resilvered, 90.09% done, 0 days 01:39:08 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             DEGRADED     0     0   765
          raidz2-0                                        DEGRADED     0     0 1.49K
            gptid/31f8eaec-0771-11ea-8878-0cc47a40699d    ONLINE       0     0     0
            gptid/e7f2af10-06a4-11ea-a7e0-0cc47a40699d    ONLINE       0     0     0
            9823415007424443445                           UNAVAIL      0     0     0  was /dev/gptid/6b83f8a4-0771-11ea-8878-0cc47a40699d
            replacing-3                                   DEGRADED     0     0     0
              8734188899632115243                         UNAVAIL      0     0     0  was /dev/gptid/af4f2962-85a8-11e5-975c-0cc47a40699d
              gptid/a655397b-082f-11ea-8137-0cc47a40699d  ONLINE       0     0     0
            gptid/a2c7f92a-05d9-11ea-9c1e-0cc47a40699d    DEGRADED     0     0     0  too many errors
            replacing-5                                   DEGRADED     0     0     0
              6482462145072973473                         UNAVAIL      0     0     0  was /dev/gptid/7bb4f9ec-c0e2-11e6-b93d-0cc47a40699d
              gptid/d433c504-08f6-11ea-9164-0cc47a40699d  ONLINE       0     0     0
            gptid/db5cf4e4-051b-11ea-a112-0cc47a40699d    ONLINE       0     0     0
            gptid/9087a6ef-037c-11ea-a888-0cc47a40699d    ONLINE       0     0     0
            gptid/ac95eee3-0284-11ea-b143-0cc47a40699d    ONLINE       0     0     0
            gptid/cf3f5c1e-043a-11ea-8129-0cc47a40699d    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
 

sretalla

All that sounds like it may be cabling or something else in the path between OS and disks... see what you can narrow it down to.

At the very least, you should be able to get SMART to read the last test data for all of your disks.
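
One way to act on this is to ask every disk for its self-test log and note which ones fail to answer. The sketch below is a minimal illustration that wraps `smartctl -l selftest`; the helper names are hypothetical, and the `/dev/da0` to `/dev/da9` device list is an assumption (only `/dev/da7` is named in this thread), so adjust it to the devices actually present.

```python
import subprocess


def candidate_devices(prefix="/dev/da", count=10):
    """Assumed device names: /dev/da0 .. /dev/da9. Adjust to the real system."""
    return [f"{prefix}{index}" for index in range(count)]


def selftest_command(device):
    """Build the smartctl invocation that prints a drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def read_selftest_log(device):
    """Run smartctl for one device and return (exit_status, output_text).

    A non-zero exit status is returned rather than raised, so a drive that
    cannot be read does not stop the loop over the remaining drives.
    """
    result = subprocess.run(
        selftest_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


def survey(devices):
    """Map each device to its smartctl exit status."""
    return {device: read_selftest_log(device)[0] for device in devices}
```

Calling `survey(candidate_devices())` as root would show at a glance whether the read failures are limited to one drive or spread across several, which helps narrow down whether a single cable or the whole path is at fault.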
 

evlvd

I connected the drive straight to the motherboard with a new SATA cable, and the drive isn't recognized at all. Really weird. I'll keep playing around with it.
 

evlvd

So it's been a few days and I've had multiple resilvers happen. I looked at the HBA card and replugged the SATA cables; everything looks good. I onlined the drive that had gone offline, but it didn't like that: it spat out a bunch of errors and faulted. Here's the current state:

Code:
  pool: Media
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 21 13:38:09 2019
        11.3T scanned at 675M/s, 10.2T issued at 611M/s, 32.3T total
        3.51G resilvered, 31.61% done, 0 days 10:32:35 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             DEGRADED     0     0   368
          raidz2-0                                        DEGRADED     0     0   736
            gptid/31f8eaec-0771-11ea-8878-0cc47a40699d    ONLINE       0     0   569
            gptid/e7f2af10-06a4-11ea-a7e0-0cc47a40699d    DEGRADED     0     0     0  too many errors
            9823415007424443445                           UNAVAIL      0     0     0  was /dev/da7p2
            replacing-3                                   DEGRADED     0     0     0
              8734188899632115243                         UNAVAIL      0     0     0  was /dev/gptid/af4f2962-85a8-11e5-975c-0cc47a40699d
              gptid/a655397b-082f-11ea-8137-0cc47a40699d  ONLINE       0     0     0
            gptid/a2c7f92a-05d9-11ea-9c1e-0cc47a40699d    DEGRADED     0     0     0  too many errors
            replacing-5                                   DEGRADED     0     0     0
              6482462145072973473                         UNAVAIL      0     0     0  was /dev/gptid/7bb4f9ec-c0e2-11e6-b93d-0cc47a40699d
              gptid/d433c504-08f6-11ea-9164-0cc47a40699d  ONLINE       0     0     0
            gptid/db5cf4e4-051b-11ea-a112-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/9087a6ef-037c-11ea-a888-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/ac95eee3-0284-11ea-b143-0cc47a40699d    DEGRADED     0     0     0  too many errors
            gptid/cf3f5c1e-043a-11ea-8129-0cc47a40699d    DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:


Would it do any good to reattach some of the old drives I pulled and add them back in while keeping the other drives offline? It's weird that everything is performing normally but spitting back lots of errors. I also don't know what it's currently resilvering. I have all 10 drives available and could add them back through extra SATA ports, but I don't know if that would do much good.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey evlvd,

I would not put back any 4TB drive, for sure! These drives belong to the past now. When you removed them, each was in its own state at that time. Say you removed disk 1 on day 1: that drive holds the proper data for whatever your pool was on day 1. The drive you removed on day 2 matches your pool as of that second day. The same goes for every drive after that. So even though all these drives were once members of the same pool, each now belongs to a different one. If you try to put them all together in a single pool, none will be in sync, and you will end up with anything from something not working at all to corruption at a higher level in FreeNAS.
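
The "each drive belongs to a different day" point can be pictured with a toy model. ZFS stamps its writes with a transaction group number (txg), so a drive pulled earlier stops at a lower txg than one pulled later. The sketch below is purely illustrative: the drive names and txg values are made up, and the helper names are hypothetical.

```python
# Hypothetical last-written transaction group (txg) for three pulled 4TB drives.
# The numbers are invented; on a real system they would come from each drive's
# label rather than being typed in by hand.
pulled_drives = {
    "old-disk-1": 1000,  # pulled on day 1
    "old-disk-2": 1450,  # pulled on day 2
    "old-disk-3": 1900,  # pulled on day 3
}


def in_sync(last_txg_by_drive):
    """True only if every drive stopped at the same transaction group."""
    return len(set(last_txg_by_drive.values())) <= 1


def stale_drives(last_txg_by_drive):
    """Drives whose last txg is older than the newest one in the set."""
    if not last_txg_by_drive:
        return []
    newest = max(last_txg_by_drive.values())
    return sorted(
        name for name, txg in last_txg_by_drive.items() if txg < newest
    )
```

With these made-up values, `in_sync(pulled_drives)` is `False` and every drive except the last one pulled is stale, which is the situation described above: the old drives no longer describe one consistent pool.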

Good luck fixing your pool,
 

evlvd

Ok thank you! I'll just keep chugging along with these repairs and see what happens.
 