Pool unavailable: One or more devices are faulted in response to IO Failures

DrunkenPeleg (Cadet) · Joined Apr 25, 2019 · Messages: 2
Persistent little issue I've got here: a storage pool bugs out and becomes unavailable at random. This started happening about two weeks ago; the pool would go out every 6-10 hours. A reboot brought the pool back up, but it had to be a power cycle, since the restart failed due to a hanging process.

OK, so let's get started.

Here is an example of what happens with the pool:

Code:
root@freenas[~]# zpool status -v POOL4
  pool: POOL4
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Aug 10 12:11:15 2020
        5.24T scanned at 1.57G/s, 2.35T issued at 722M/s, 30.3T total
        0 repaired, 7.75% done, 0 days 11:17:04 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        POOL4                     UNAVAIL      0     0     0
          raidz2-0                UNAVAIL     44     0     0
            12611917014333540652  REMOVED      0     0     0  was /dev/gptid/9feee305-50eb-11ea-ad9a-002590343322
            1567844855632180418   REMOVED      0     0     0  was /dev/gptid/adf3da24-50eb-11ea-ad9a-002590343322
            10092631138471912601  REMOVED      0     0     0  was /dev/gptid/af868d3e-af7c-11ea-800a-002590343322
            12748854986319037125  REMOVED      0     0     0  was /dev/gptid/c9b0f525-50eb-11ea-ad9a-002590343322
            6683381636331138817   REMOVED      0     0     0  was /dev/gptid/d7735d16-50eb-11ea-ad9a-002590343322
            7872409557343776049   REMOVED      0     0     0  was /dev/gptid/43c07d3d-caf8-11ea-969f-002590343322
            8424755186125633080   REMOVED      0     0     0  was /dev/gptid/e689752c-50eb-11ea-ad9a-002590343322
            10321425597778855069  REMOVED      0     0     0  was /dev/gptid/f46ad081-50eb-11ea-ad9a-002590343322
            11260363593274240624  REMOVED      0     0     0  was /dev/gptid/022be270-50ec-11ea-ad9a-002590343322
            6997537682382531171   REMOVED      0     0     0  was /dev/gptid/10ab5145-50ec-11ea-ad9a-002590343322

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x48>
        POOL4:<0x13003>
        POOL4:<0x82d2>


At this point, I can't do anything with the pool, since every command just returns the 'I/O is currently suspended' error.
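For reference, the only recovery the status output offers is 'zpool clear', which (as I understand it) just resets the error counters and asks ZFS to resume the suspended I/O; it only works if the devices are actually reachable again. A rough sketch:

Code:
# Suggested by the 'action' line in the zpool status output above.
# Only helps if the REMOVED devices have actually come back; otherwise
# the pool just suspends again.
zpool clear POOL4
zpool status -v POOL4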

So, I initiate a reboot (I have tried a regular shutdown as well). After some time the process hangs; here's the tail end of the console output:

Code:
Stopping ntpd.
Waiting for PIDS: 1672, 1672.
Shutting down local daemons:.
Stopping lockd.
Waiting for PIDS:  1628.
Stopping statd.
Waiting for PIDS: 1625.
Stopping nfsd.
Waiting for PIDS:  1616 1617.
Stopping mountd.
Waiting for PIDS:  1610.
Stopping watchdogd.
Waiting for PIDS:  1550.
Stopping rpcbind.
Waiting for PIDS:  1395.
Writing entropy file:.
Writing early boot entropy file:.
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted
system call: going to single user mode
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted
system call: going to single user mode
Aug 10 14:24:06 init: some processes would not die: ps axl advised


My guess is that the jails are part of what's preventing shutdown, since their mount points live on the affected pool.
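Something like the following should show whether jails or other processes are still holding datasets on the pool (just a sketch; it assumes POOL4 is mounted under /mnt/POOL4, the FreeNAS default):

Code:
# Check what is still holding the affected pool before the next reboot.
jls                  # running jails and their root paths
mount | grep POOL4   # filesystems from POOL4 that are still mounted
ps axl               # as the shutdown log advises: look for processes stuck in disk wait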

When I was having this issue two weeks ago, it was "resolved" after checking all cable connections, reseating the controller card and running a scrub. I thought it was probably just a loose connection, as I had recently swapped out a drive and may have jiggled something loose in the process.

As you can see in the logs above, my attempts to run a scrub now are being foiled by the pool going out before the scrub can finish.

I think the restart/shutdown issue is secondary to whatever is going on with the pool. I should note that prior to this occurring again, there were no read/write errors, and SMART tests did not return any issues.

Should I chalk this up to bad cables, or perhaps the controller card? What's my best option to narrow this down?
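Since every disk in the vdev drops at the same time, I'm assuming the common path (HBA, cabling, backplane, power) is more suspect than any individual drive. A rough sketch of where I'd look next (the 'mps'/'mpr' driver names are for LSI SAS HBAs; adjust for the actual controller):

Code:
# Look for controller/bus resets and timeouts in the kernel log
# (driver names are assumptions; mps/mpr cover most LSI SAS HBAs).
dmesg | grep -iE 'mps|mpr|cam|timeout|error'
camcontrol devlist       # confirm all disks are still visible to the OS
zpool status -v POOL4    # re-check the per-device error counters afterwards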
 

DrunkenPeleg (Cadet) · Joined Apr 25, 2019 · Messages: 2
Started a new scrub today, and everything was looking good until the scrub was just over 75% complete:

Code:
root@freenas[~]# zpool status -v POOL4
  pool: POOL4
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Aug 10 12:11:15 2020
        25.6T scanned at 1.10G/s, 22.8T issued at 1004M/s, 30.3T total
        0 repaired, 75.20% done, 0 days 02:10:50 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        POOL4                                           ONLINE   83.5K   519     0
          raidz2-0                                      ONLINE    167K    36     0
            gptid/9feee305-50eb-11ea-ad9a-002590343322  ONLINE     326    19     0
            gptid/adf3da24-50eb-11ea-ad9a-002590343322  ONLINE     326    15     0
            gptid/af868d3e-af7c-11ea-800a-002590343322  ONLINE     326    13     0
            gptid/c9b0f525-50eb-11ea-ad9a-002590343322  ONLINE     326    10     0
            gptid/d7735d16-50eb-11ea-ad9a-002590343322  ONLINE     326    13     0
            gptid/43c07d3d-caf8-11ea-969f-002590343322  ONLINE     326    15     0
            gptid/e689752c-50eb-11ea-ad9a-002590343322  ONLINE     326    18     0
            gptid/f46ad081-50eb-11ea-ad9a-002590343322  ONLINE     326    19     0
            gptid/022be270-50ec-11ea-ad9a-002590343322  ONLINE     326    19     0
            gptid/10ab5145-50ec-11ea-ad9a-002590343322  ONLINE     326    20     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x48>
        POOL4:<0x13911>
        POOL4:<0x13a28>
        POOL4:<0x13a29>
        POOL4:<0x13a2a>
        POOL4:<0x13a32>
        POOL4:<0x13a33>
        POOL4:<0x13a37>
        POOL4:<0x13a3e>
        POOL4:<0x13a41>
        POOL4:<0x13968>
        POOL4:<0x13969>
        POOL4:<0x1396b>
        POOL4:<0x1396c>
        POOL4:<0x1396d>
        POOL4:<0x1396f>
        POOL4:<0x13970>
        POOL4:<0x13971>
        POOL4:<0x13972>
        POOL4:<0x13977>
        POOL4:<0x1397a>


I also took a look at the SMART results for each of the drives. The attribute values seem okay, but a few drives have "Read DMA" or "ABRT" errors recorded in their error logs.

Self-test results are all good, so the drives themselves seem okay. In any case, from what I'm reading, metadata errors like these mean this pool should be destroyed.
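For anyone following along, the per-drive checks look roughly like this ('da0' is just an example device; 'glabel status' maps the gptid labels from the zpool output back to device names):

Code:
# Map the gptid labels shown by 'zpool status' back to daX device names.
glabel status
# Then, per drive (da0 is only an example), check attributes, the error log
# and the self-test history.
smartctl -A /dev/da0            # SMART attribute values
smartctl -l error /dev/da0      # ATA error log (where Read DMA / ABRT entries appear)
smartctl -l selftest /dev/da0   # self-test history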
 