DrunkenPeleg
Cadet
- Joined
- Apr 25, 2019
- Messages
- 2
Persistent little issue I've got here. A storage pool bugs out and becomes unavailable at random. This started about two weeks ago; the pool would go out every 6-10 hours. A reboot brought the pool back up, but it had to be a power cycle, since the restart failed due to a hanging process.
OK, so let's get started.
Here is an example of what happens with the pool:
Code:
root@freenas[~]# zpool status -v POOL4
  pool: POOL4
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Aug 10 12:11:15 2020
        5.24T scanned at 1.57G/s, 2.35T issued at 722M/s, 30.3T total
        0 repaired, 7.75% done, 0 days 11:17:04 to go
config:
        NAME                      STATE     READ WRITE CKSUM
        POOL4                     UNAVAIL      0     0     0
          raidz2-0                UNAVAIL     44     0     0
            12611917014333540652  REMOVED      0     0     0  was /dev/gptid/9feee305-50eb-11ea-ad9a-002590343322
            1567844855632180418   REMOVED      0     0     0  was /dev/gptid/adf3da24-50eb-11ea-ad9a-002590343322
            10092631138471912601  REMOVED      0     0     0  was /dev/gptid/af868d3e-af7c-11ea-800a-002590343322
            12748854986319037125  REMOVED      0     0     0  was /dev/gptid/c9b0f525-50eb-11ea-ad9a-002590343322
            6683381636331138817   REMOVED      0     0     0  was /dev/gptid/d7735d16-50eb-11ea-ad9a-002590343322
            7872409557343776049   REMOVED      0     0     0  was /dev/gptid/43c07d3d-caf8-11ea-969f-002590343322
            8424755186125633080   REMOVED      0     0     0  was /dev/gptid/e689752c-50eb-11ea-ad9a-002590343322
            10321425597778855069  REMOVED      0     0     0  was /dev/gptid/f46ad081-50eb-11ea-ad9a-002590343322
            11260363593274240624  REMOVED      0     0     0  was /dev/gptid/022be270-50ec-11ea-ad9a-002590343322
            6997537682382531171   REMOVED      0     0     0  was /dev/gptid/10ab5145-50ec-11ea-ad9a-002590343322
errors: Permanent errors have been detected in the following files:
        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x48>
        POOL4:<0x13003>
        POOL4:<0x82d2>
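For anyone who wants to track this between outages, here is a rough Python sketch that tallies the device states in saved `zpool status` text. It is only an illustration: `summarize_pool` and the `sample` string are hypothetical names of my own (the sample is a shortened copy of the output above), and it assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout.

```python
import re
from collections import Counter

# A config row looks like: NAME STATE READ WRITE CKSUM [was /dev/...]
# Counters are matched as \S+ because zpool may abbreviate large values.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)

def summarize_pool(status_text):
    """Return (pool_state, Counter of states over all config rows).

    The counter includes the pool and vdev rows as well as the disks.
    """
    pool_state = None
    states = Counter()
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("state:"):
            pool_state = stripped.split(":", 1)[1].strip()
        match = ROW.match(line)
        if match:
            states[match.group(2)] += 1
    return pool_state, states

# Shortened copy of the output above, used only as sample input.
sample = """\
  pool: POOL4
 state: UNAVAIL
config:
        NAME                      STATE     READ WRITE CKSUM
        POOL4                     UNAVAIL      0     0     0
          raidz2-0                UNAVAIL     44     0     0
            12611917014333540652  REMOVED      0     0     0  was /dev/gptid/9feee305-50eb-11ea-ad9a-002590343322
            1567844855632180418   REMOVED      0     0     0  was /dev/gptid/adf3da24-50eb-11ea-ad9a-002590343322
"""

state, counts = summarize_pool(sample)
print(state, dict(counts))
```

Run against the full output, this makes it easy to see at a glance that every member of raidz2-0 is marked REMOVED at the same moment, rather than one or two disks.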
At this point, I can't do anything with the pool, since every command just returns the 'I/O is currently suspended' error.
So I initiate a reboot (I have tried a regular shutdown as well). After some time the process hangs. Here's the tail end of the output:
Code:
Stopping ntpd.
Waiting for PIDS: 1672, 1672.
Shutting down local daemons:.
Stopping lockd.
Waiting for PIDS: 1628.
Stopping statd.
Waiting for PIDS: 1625.
Stopping nfsd.
Waiting for PIDS: 1616 1617.
Stopping mountd.
Waiting for PIDS: 1610.
Stopping watchdogd.
Waiting for PIDS: 1550.
Stopping rpcbind.
Waiting for PIDS: 1395.
Writing entropy file:.
Writing early boot entropy file:.
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted system call: going to single user mode
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted system call: going to single user mode
Aug 10 14:24:06 init: some processes would not die: ps axl advised
My working guess is that the jails are preventing shutdown because they have mount points on the affected pool.
When I had this issue two weeks ago, it was "resolved" after I checked all cable connections, reseated the controller card, and ran a scrub. I assumed it was just a loose connection, since I had recently swapped out a drive and probably jiggled something loose in the process.
As you can see in the logs above, my attempts to run a scrub now are being foiled: the pool goes out before the scrub can finish.
I think the restart/shutdown issue is secondary to whatever is going on with the pool. I should note that prior to this occurring again, there were no read/write errors, and SMART tests did not return any issues.
Should I chalk this up to bad cables, or perhaps the controller card? What's my best option to narrow this down?