What is going on? vdev showing second spare + tons of checksum errors during resilvering

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
Hey guys,

I had two drives failing their long SMART tests, so I decided to swap them out for replacement drives.
Since the failing drives were in separate vdevs (both RAIDZ2), I popped both replacements in and started both replacements at the same time.

I waited for the first resilver to hit 100%; in vdev1, da1 was replaced and seems to be fine.
But the second resilver is still running (46.4% as of writing), and I noticed it reported "Errors: 1". When I checked vdev2, it shows two spares instead of one (da10 is now listed as a spare), da17 under the spare section shows as unavailable, and every other drive in vdev2 shows checksum errors.

Additionally, I checked my notifications and there are no new errors listed.

What on earth happened; does anyone know what might be going on?
I saw similar checksum issues before I switched out my PERC H730P for a 330 mini, but it has been fine since.
I could only assume something went bad and it had to pull the spare in to help resilver?

Here's a picture of my pool, which oddly doesn't show as being degraded:
Screenshot 2023-03-04 155319.png


Any ideas as to what is going on, and what level of concern I should be having, would be greatly appreciated!
 
Joined
Jul 3, 2015
Messages
926
What chassis are you using? Is there anything different with the drives in the first vdev compared to the second, like where they are located or how they are connected?
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
What chassis are you using? Is there anything different with the drives in the first vdev compared to the second, like where they are located or how they are connected?
I directly swapped da1 and da10, same spots, and nothing else has changed.
Full chassis and build info is detailed in my signature.

Here's an output of zpool status -v if it helps:
Code:
root@truenas[~/.ssh]# zpool status -v
  pool: MasterPool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar  4 13:15:24 2023
        7.20T scanned at 749M/s, 3.42T issued at 355M/s, 7.23T total
        219G resilvered, 47.24% done, 03:07:35 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        MasterPool                                        ONLINE       0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/34261a2b-ba0b-11ed-a2db-000c292474de    ONLINE       0     0     0
            gptid/5ee980b9-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
            gptid/5f030a48-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
            gptid/5f058c17-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
            gptid/5ee05210-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
            gptid/9f557bb8-af67-11ed-8707-000c292474de    ONLINE       0     0     0
            gptid/5f043d9b-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
            gptid/5f3e30be-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     0
          raidz2-1                                        ONLINE       0     0     0
            gptid/5f01a11f-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    67  (resilvering)
            spare-1                                       ONLINE       0     0   145
              gptid/5dfe5a89-ba0b-11ed-a2db-000c292474de  ONLINE       0     0     0  (resilvering)
              gptid/802f96f9-a82d-11ed-9b8c-000c2969ae4f  ONLINE       0     0     0  (resilvering)
            gptid/5f03d758-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    15  (resilvering)
            gptid/5f051857-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    22  (resilvering)
            gptid/5e6e24d5-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    25  (resilvering)
            gptid/5f04ac2d-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0     7  (resilvering)
            gptid/5f03637e-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    43  (resilvering)
            gptid/5ee8da0c-9fa1-11ed-b862-000c2969ae4f    ONLINE       0     0    11  (resilvering)
        spares
          gptid/802f96f9-a82d-11ed-9b8c-000c2969ae4f      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        /mnt/[path redacted for forum post]/DSC_2877.NEF

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Tue Feb 28 03:45:09 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
root@truenas[~/.ssh]#


I'll say the plus side is that I did a server migration a month or two ago, so I can still pull any affected files back over; it's only a minor concern that one file was lost. The big question is why this happened.
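Once the resilver finishes, my rough plan for that file is just to copy it back from the old server and then clear and scrub so ZFS can re-verify everything. Something along these lines (the host and paths below are placeholders, not my real ones):

Code:
# copy the known-good copy back from the migration server (placeholder host/path)
rsync -av oldserver:/mnt/backup/photos/DSC_2877.NEF /mnt/MasterPool/photos/

# clear the logged error, then let a scrub confirm the pool is consistent
zpool clear MasterPool
zpool scrub MasterPool
zpool status -v MasterPool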
 
Joined
Jul 3, 2015
Messages
926
Ah, are you virtualising TrueNAS?
 
Joined
Jul 3, 2015
Messages
926
The drives that are having issues: do you know where they are located in the chassis? Are they all at the back, for example?
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
Ah, are you virtualising TrueNAS?
I am, but I am doing a direct pass-through of the HBA to the VM. So far that hasn't caused any issues.
The biggest "issue" is running scripts in TrueNAS to start my other ESXi VMs that run on its storage.

The drives that are having issues: do you know where they are located in the chassis? Are they all at the back, for example?
They are actually mixed between the front backplane, the three midplane bays, and the two rear bays.
So far I haven't had any issues with the drives, even during prior resilvers (though I believe the last resilver was in vdev1?), and I confirmed all the cables were good when I last opened it up.

Something similar did happen during the initial pool creation, but I believe that was the H730P, because after swapping to the 330 mini it didn't happen again. The affected drives also aren't a clear physical group, so if this were hardware related I'd expect the issues to be limited to either da13 and da14 (the rear bays) or da15, da16, and da17 (the midplane).
 
Joined
Jul 3, 2015
Messages
926
Any errors on the console, either now or when you replaced a disk?
 
Joined
Jul 3, 2015
Messages
926
I had a similar issue once with a 45-bay JBOD where everything looked fine until I replaced a drive at the rear, and then all hell broke loose. It turned out the internal cabling could be done in a couple of ways: one way FreeNAS (as it was then) liked, and another way it didn't. After changing that, all was great.
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
I had a similar issue once with a 45-bay JBOD where everything looked fine until I replaced a drive at the rear, and then all hell broke loose. It turned out the internal cabling could be done in a couple of ways: one way FreeNAS (as it was then) liked, and another way it didn't. After changing that, all was great.
I don't believe so? I didn't see any errors in the web notifications, if that's what you're referring to, and I shared the zpool status above. I haven't seen anything beyond that, unless I'm missing a command I should know about.
 
Joined
Jul 3, 2015
Messages
926
With my issue years ago, CAM errors would spam the console when I did a disk replacement at the back of the system.
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
With my issue years ago, CAM errors would spam the console when I did a disk replacement at the back of the system.
How would I go about checking that? Do you mean the UI notifications or is this a command to check the status via CLI?
 
Joined
Jul 3, 2015
Messages
926
This would happen on the console. One way to check would be to pull a disk and then put it back in while keeping an eye on the console. Or even try to replace a disk and see how it goes. I assume everything was working great until you tried to replace a disk?
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
This would happen on the console. One way to check would be to pull a disk and then put it back in while keeping an eye on the console. Or even try to replace a disk and see how it goes. I assume everything was working great until you tried to replace a disk?
It was fine before, but it was also fine during the first disk's resilvering. I did note that the checksum counts haven't gone up since that one time, so that's good.

Also, can you clarify what you mean by console? Are you referring to the shell/CLI, or the web UI?
I know the web UI has the notification area, and the shell is basically SSH, but I'm wondering if there's something in the shell I'm missing.
 
Joined
Jul 3, 2015
Messages
926
You'll need to connect a screen to your server, or you can display the console at the bottom of the web UI (in Advanced). Did the errors start after the second disk replacement? If so, where was that disk located in the chassis?
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
You'll need to connect a screen to your server, or you can display the console at the bottom of the web UI (in Advanced). Did the errors start after the second disk replacement? If so, where was that disk located in the chassis?
Ohh, right, the screen... I deployed it as a VM, so I typically interact with it via SSH, shell, or the web UI.

I don't know exactly when the error started, but I noticed it after the first resilver finished, which was on the other vdev.
It's located in the same backplane (front of chassis) as the previous disk, connected to the same controller and dual-SAS cable.
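For anyone else running headless: since I only have SSH access, I've been keeping an eye out for CAM/driver errors from the shell instead of a physical console. This is just how I'm approaching it on CORE (FreeBSD), not an official procedure:

Code:
# watch kernel/console messages live while pulling or reinserting a disk
tail -f /var/log/messages

# or check the kernel message buffer afterwards for CAM/SCSI complaints
dmesg | grep -iE 'cam|retry|error' | tail -n 50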
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
So the second resilver is done; everything looks good and the same as it was, except I think my hot spare is failing. So that may be why.
 

ITOperative

Dabbler
Joined
Feb 11, 2023
Messages
20
So, just as a status update: I decided I really don't care about the one corrupt file, but the pool still showed da10 (the replaced drive) and da17 (the hot spare) both under the spare category.

I removed the file via the shell, and the error remained even after putting da17 offline and back online.
I then tried putting da10 offline and back online, and it stopped hot-sparing and supposedly sees no errors; I'm thinking it was holding onto the issue for that one particular file. I did all of that, of course, after running zpool clear to ensure I'd catch any new errors, and there haven't been any.
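For the record, this is roughly the sequence I ran from the shell. The gptid values are whatever glabel status maps to da10 and da17 on my system, so treat them as placeholders rather than exact commands:

Code:
# map daX device names to the gptid labels that zpool status shows
glabel status | grep gptid

# clear the old error counters first so anything new stands out
zpool clear MasterPool

# bounce the hot spare (da17), then the replacement drive (da10)
zpool offline MasterPool gptid/<da17-gptid>
zpool online MasterPool gptid/<da17-gptid>
zpool offline MasterPool gptid/<da10-gptid>
zpool online MasterPool gptid/<da10-gptid>

zpool status -v MasterPool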

Next I'm going to run short SMART tests, and if those come back clean, immediately schedule long SMART tests afterwards to make sure no issues remain. This has been a bumpy ride, but if the SMART tests pass, I think ZFS was just up in arms about that one file, which is totally fine.
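In case it's useful to anyone, I'm kicking those off manually from the shell rather than waiting for the scheduled tasks; roughly like this for each drive in vdev2 (adjust the device names to your own):

Code:
# short test first, then check the result before committing to the long one
smartctl -t short /dev/da10
smartctl -a /dev/da10 | grep -A8 "Self-test log"

# if the short test passes, queue the long (extended) test
smartctl -t long /dev/da10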

If I bump into any other oddities, I'll be sure to post them here, just so my experience is available for anyone else who deals with file corruption mid-resilver.

UPDATE:
Just wanted to confirm everything tested fine for both short and long SMART tests.
Nothing seems to be wrong, just a single file corruption amidst the drive swaps. All things considered, I'd say it went well enough.
 