an unhealthy and a degraded pool, no longer able to access smb shares, rsync still running but seeing no write i/o on the disks.

SaiKo · Dabbler · Joined Mar 11, 2016 · Messages: 10
System info: TrueNAS-12.0-U1, Asus P10S-I, Intel i3-6100T, 32 GB DDR4 ECC RAM, Intel X550 CNA, 9211-8i SAS HBA, Intel SAS expander, Adaptec SAS expander, Crucial CT128MX100SSD1 as boot drive
Unhealthy pool: 1 vdev, raidz2, 10x HGST HDN721010AL; existing pool, unhealthy since last boot
Degraded pool: 1 vdev, raidz2, 10x WDC WD180EDGZ-11; new pool, created yesterday after last boot
Plus 2 healthy pools


Hello,

I bought new drives a while back and, after testing them with h2testw, was waiting for an opportune moment to put them in my NAS. Then one of my SAS expanders stopped working; a replacement for it arrived yesterday, and I installed both it and the new drives in my system.

Once I got all the new hardware working I reconnected all my old drives, but I forgot to plug one of the power cables back into my PSU.

After the system booted I immediately noticed one pool was missing and another was degraded (missing 2 drives). I quickly realized my mistake, shut down, re-plugged the cable and booted again.

This time the boot took a lot longer and apparently involved a resilver (I remember seeing something like "resilver completed with 0 errors" after booting; I can't check any more because a scrub is now in progress, great timing for that), and the pool was left showing as 'unhealthy'.
(status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.)

When I check SMART values for the drives, Raw_Read_Error_Rate, Reallocated_Sector_Ct, Seek_Error_Rate, Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable and UDMA_CRC_Error_Count are all 0 on every drive. So what is the unrecoverable error that zpool status mentions, and how can I find out which drive it is talking about if the SMART data shows nothing wrong?
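For reference, this is roughly how I'm pulling those values from the shell (a sketch: device names come from `kern.disks`, which lists the bare disk names on FreeBSD/TrueNAS CORE, and the attribute names are the standard smartmontools ones):

```shell
# Print the error-related SMART attributes for every disk in the system.
# sysctl -n kern.disks returns the bare device names (e.g. "da0 da1 ada0").
for d in $(sysctl -n kern.disks); do
  echo "== /dev/$d =="
  smartctl -A "/dev/$d" | egrep \
    'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Seek_Error_Rate|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
done
```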

Anyway, I had new drives available, so I made a new pool and started an rsync job to copy all data from the unhealthy pool to the shiny new empty one, thinking that once all the data is safe on the new pool, I can do whatever it takes to get the old pool working properly again.

That was yesterday. Today the rsync was about halfway done when I checked and noticed I now have a degraded pool on top of the unhealthy one: the new pool (status: One or more devices are faulted in response to IO failures.). Apparently it also resilvered this afternoon (scan: resilvered 988M in 00:01:14 with 0 errors on Tue Feb 1 15:36:25 2022).
2 of the drives in the new pool are shown as REMOVED. One of them doesn't show up under Disks at all; the other does, but with no serial or model number and a size of 0B. Yet I can still pull SMART data from both, and again none of the drives in the pool show anything wrong in their SMART values.

zpool status shows 0 errors everywhere except 64 write errors on the vdev of the new pool, and 1 read / 72 write errors on one of its drives. The drive with the non-zero counts is not one of the 2 REMOVED ones.
I can see the gptid in zpool status, but the gptid isn't shown on the Disks page, so how do I find that drive's serial number?
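In case it helps anyone answer: my plan for mapping that gptid to a serial is via glabel (which I believe is the standard FreeBSD way to map GEOM labels to device nodes), something like this — the gptid and the `da5` device name below are placeholders:

```shell
# Placeholder: paste the gptid reported by zpool status here.
GPTID="gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# glabel status maps labels (including gptid/...) to device nodes, e.g. da5p2.
glabel status | grep "${GPTID#gptid/}"

# Then read the serial from the underlying disk (drop the partition suffix):
smartctl -i /dev/da5 | grep -i 'serial'
```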


Can someone explain what it means when a drive suddenly has no details on the Disks page, or disappears from it entirely? In both cases I can still get SMART values from the shell, so they must still have power and SATA connectivity.
Does this mean there is a problem with the drives themselves, or could it be the SAS expander's fault (the new drives hang off the new SAS expander), or could there be another cause?


Alrighty then: apparently I can no longer access any shares on the system, not even those on the healthy pools, although I still could a few hours ago. So this just got a little more serious than me wondering whether I need to RMA a few of the new drives.

The rsync job is still running, but I see no disk I/O on the disks in the new pool (where rsync was writing), and there is a scrub in progress on the unhealthy pool (25 hrs to go) and another on a healthy pool (2 hrs to go).
May I shut down the system while a scrub is in progress, or do I need to wait a day before I can attempt to fix anything?
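From my reading of the zpool-scrub manpage (so treat this as my understanding, not gospel), a scrub can be stopped or paused by hand before shutting down; `oldpool` below is a placeholder pool name:

```shell
# Stop the running scrub entirely (it will not resume on its own):
zpool scrub -s oldpool

# Or pause it (OpenZFS 2.x, which TrueNAS 12 ships); running
# 'zpool scrub oldpool' again later resumes from where it paused:
zpool scrub -p oldpool
```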
What happens to rsync's temporary files when rsync is interrupted? This might not matter any more, since I've seen no I/O at all on the new pool's disks for the last 6 hours (the metrics don't go back further than that; there's a little writing on the first data point at 15:42 and 0 MiB of I/O after it). That is just after the resilver: the pool became degraded with 2 disks REMOVED at 15:41.

I've just started long SMART tests on all the drives; maybe that gives some results tomorrow.
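For completeness, this is how I kicked them off (repeated per disk; `da0` is a placeholder device name, and the result should land in the self-test log once the test finishes):

```shell
# Start an extended (long) offline self-test on one drive:
smartctl -t long /dev/da0

# Later, check the self-test log for the result:
smartctl -l selftest /dev/da0
```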



Thank you to anyone willing to spend the time giving me advice.
 