interpreting disk and pool errors, next steps

seanellis · Aug 29, 2019

Hello, I am looking for suggestions about how to approach errors that have appeared on my system over the last few days, starting when the machine became temporarily unreachable.

The machine was brought back up without incident, but it is showing errors. I've been looking at the logs and the machine since then, and it's confusing because the state keeps changing. When it came back up the machine reported that it was resilvering. This has continued, although when I check, it occasionally seems to lose ground, with the percent done dropping. There are errors in the logs for two of the drives. Running `zpool status -v` originally showed two of the disks as DEGRADED. After I shut the machine down to confirm that the cabling was well seated and rebooted, it now reports that the disks are all ONLINE. The GUI says the pool is HEALTHY. `zpool status` had also reported files with "Permanent errors", but those are cleared now.

I can post more explicit info, but thought I would get started before putting up a wall of text. What will be the most prudent course of action after this behaviour?

Thanks,

Sean

garm · Aug 29, 2019

Errors aren’t persistent across reboots. Run a scrub and check the SMART reports.
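
An overall SMART result of PASSED does not rule out nonzero error counters, so it is worth reading the raw attribute values too. A minimal Python sketch, assuming the usual ATA attribute table printed by `smartctl -A`; the watch list and the function name `suspicious_attributes` are illustrative choices, not something from this thread:

```python
import re

# Attributes commonly watched on ATA disks. This watch list is an assumption;
# adjust it for your drives.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One row of the attribute table printed by `smartctl -A`:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+(?:\s+\S+){6}\s+(\d+)")


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if m and m.group(2) in WATCHED and int(m.group(3)) > 0:
            found[m.group(2)] = int(m.group(3))
    return found
```

Feed it the text of `smartctl -A /dev/daX` for each disk (the device name is a placeholder). A rising `UDMA_CRC_Error_Count` often points at cabling or connectors rather than the disk itself.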

seanellis · Aug 29, 2019

garm said:
Errors aren’t persistent across reboots. Run a scrub and check the SMART reports.

Thanks for the reply. As I understand it, I'll need to wait out the resilvering before I'm able to run that scrub, is that so? I did run smartctl earlier. The same drives that show errors in the logs had some errors there too, although they reported as PASSED. It looks as if it will be resilvering for some time. This is the current status; the files with errors are completely cleared now.

Code:

> zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:52 with 0 errors on Fri Aug 23 03:45:52 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: main_volume
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Aug 29 11:47:08 2019
        758G scanned at 152M/s, 414G issued at 83.0M/s, 4.55T total
        96.1G resilvered, 8.88% done, 0 days 14:33:28 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        main_volume                                     ONLINE       0     0     1
          raidz1-0                                      ONLINE       0     0     2
            gptid/4074d0a8-cc10-11e8-9942-d05099aa523b  ONLINE       0     0     0
            gptid/41779cad-cc10-11e8-9942-d05099aa523b  ONLINE       0     0     0
            gptid/427882a1-cc10-11e8-9942-d05099aa523b  ONLINE       0     0     0
            gptid/4388574d-cc10-11e8-9942-d05099aa523b  ONLINE       1     0     0

errors: 1 data errors, use '-v' for a list

garm · Aug 29, 2019

Wait for the resilver to finish; if there is enough redundancy, the errors will be corrected. But you have pool metadata checksum errors, so prepare for a pool rebuild. You should probably start updating your backup if it’s lagging behind.
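
To keep an eye on the counters while the resilver runs, the `zpool status` text can be parsed rather than read by eye. A rough Python sketch, assuming the config table layout shown earlier in this thread (NAME STATE READ WRITE CKSUM); `parse_zpool_errors` is a hypothetical helper name:

```python
import re

# One row of the config table: NAME STATE READ WRITE CKSUM.
# Assumes plain integer counters; rows where ZFS abbreviates a large
# count (e.g. "1.2K") are skipped by this sketch.
DEVICE_ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
PROGRESS = re.compile(r"([\d.]+)% done")


def parse_zpool_errors(status_text):
    """Return (percent_done or None, {name: (read, write, cksum)}).

    Only rows with at least one nonzero counter are included.
    """
    percent = None
    errors = {}
    for line in status_text.splitlines():
        m = PROGRESS.search(line)
        if m:
            percent = float(m.group(1))
            continue
        m = DEVICE_ROW.match(line)
        if m:
            counts = tuple(int(m.group(i)) for i in (3, 4, 5))
            if any(counts):
                errors[m.group(1)] = counts
    return percent, errors
```

On the main_volume output posted above, this reports 8.88 for the progress and flags main_volume, raidz1-0 and the gptid/4388574d-... disk as the rows with nonzero counters.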

seanellis · Aug 29, 2019

garm said:
Wait for the resilver to finish

Gotcha. I'll see how that goes. Thanks!
