Uncorrectable I/O failure and disaster recovery

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
just got this...
Another failure identical to the previous one. Exact same progression: high CPU -> I/O failure -> data corruption -> pool unavailable.
M.2 expansion is out.
3TiB is out.
Everything is running on the previous config that worked... The only thing that's changed is the upgrade from 12 to 13.

Oct 9 15:07:51 truenas kernel: pid 5961 (smbd), jid 0, uid 0: exited on signal 6
Oct 9 15:08:37 truenas (ada3:ahcich3:0:0:0): Periph destroyed
Oct 9 15:08:37 truenas ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
Oct 9 15:08:37 truenas ada3: <WDC WD20EARX-00PASB0 51.0AB51> ATA8-ACS SATA 3.x device
Oct 9 15:08:37 truenas ada3: Serial Number WD-WCAZAJ631126
Oct 9 15:08:37 truenas ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Oct 9 15:08:37 truenas ada3: Command Queueing enabled
Oct 9 15:08:37 truenas ada3: 1907728MB (3907027055 512 byte sectors)
Oct 9 15:08:37 truenas ada3: quirks=0x1<4K>
Oct 9 15:08:41 truenas Solaris: WARNING: Pool 'NR.2TiB.2' has encountered an uncorrectable I/O failure and has been suspended.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
What is the output of camcontrol devlist, zpool status, zpool import, and zpool list -v? Please post the output of each command between [CODE][/CODE] tags.
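For reference, all of these can be run as root from the web UI's Shell or over SSH, with no extra arguments beyond those shown:

Code:
camcontrol devlist   # lists the disks the OS actually sees, and which bus each sits on
zpool status         # health, scan/scrub state and vdev layout of every imported pool
zpool import         # shows pools that exist on disk but are not currently imported
zpool list -v        # capacity, fragmentation and per-device breakdown of each pool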
You should really stop and put together a proper, accurate hardware list, without which we have to diagnose blind.

I also suggest you start reading the resources this forum makes available (there are links in my signature), since your posts show a lack of understanding of TN and ZFS.

Please also try to condense your posts and keep calm: our chances of success are much better if we take one careful step at a time.

Finally, I would like to add that things apparently going well for a period of time doesn't mean things are (or were) fine; also, your approach of blaming the software before checking whether it's you who committed errors (which, as others pointed out, you did) is not constructive.
 
Last edited:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should really stop and put together a proper hardware list.
Agreed, and include the exact TrueNAS software version you are running; "12 to 13" has me making assumptions.

Since you did the upgrade, did you apply any new ZFS features to your pool? If not, you could roll back to your previous version with little effort, and that should allow your system to continue supporting your needs.
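If you're not sure whether you enabled new feature flags, a quick sanity check from a shell looks something like this (replace POOL with your pool's name):

Code:
zpool upgrade                       # lists pools that still have supported features disabled
zpool get all POOL | grep feature@  # shows each feature flag as disabled, enabled or active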
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Would this do?

Yes, feature upgrades were made, so there's no going back.
 

Attachments

  • ixdiagnose.zip
    10 KB · Views: 44

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
While waiting for the required information (the command output and the accurate hardware list), here are a few things:
  1. As others pointed out, SMR drives are a BIG issue (WD itself declared that its SMR drives are not compatible with ZFS)
  2. As (again) others have pointed out, your use of port multipliers is another BIG issue (I'd say BIGGER than the SMR drives): not to undermine your research efforts, but the ASM1166 is a port multiplier (not compatible with ZFS, unless you use it to connect a single drive)
  3. Anything that deviates from standard settings (i.e. overclocking and underclocking, or overvolting and undervolting) does NOT contribute to system stability and resilience, especially if done in an improper way
At first glance, points 1 and 2 are the likely cause of the I/O errors you are getting, but there could be a whole lot of things happening; regarding the underclock, I would absolutely reset the BIOS to factory defaults.

Would this do?
Not for me, maybe for joe.

More questions:
What was the boot pool drive? How was it attached to the motherboard? Do you have a configuration backup?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Is there a workaround or something? Something that could potentially reduce speeds but give some degree of reliability, for instance?
That is part of the problem: the DRIVE handles SMR, not the OS driver or ZFS. It is not possible to work around the problem. If it takes the SMR drive time to do something, you just have to wait.

...
If that's the case, and it certainly sounds like it might be, how would I fix it? I thought that exact same thing and figured a disconnect of the pool and an import might fix it, but as you know it didn't.
No, if a bit flip occurred and was written to disk, that will not be fixed by a disconnect, import or scrub. Well, in rare cases, with a pool that has redundancy, a bit flip might be recoverable. But not in your case. In essence, whatever is corrupt stays corrupt, unless you delete it. The following command will show whether there are any permanent errors (replace POOL with the pool name):
zpool status -v POOL
In rare cases, an error might not show up until a scrub is run and finds it.
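As a minimal sketch (POOL is a placeholder for the pool name):

Code:
zpool status -v POOL   # lists any files with permanent (unrecoverable) errors
zpool scrub POOL       # force every block to be read and checksummed
zpool status -v POOL   # check again once the scrub has finished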

...
So it would make sense that after all the transfers it caused some corruption due to the SMR issue.
ZFS does not corrupt data because a disk is SMR, as far as I know. Recent Western Digital Reds that are SMR have a bug that causes a read error (well, "ID not found", a disk sector issue). The good news is you don't appear to have any WD Red SMR disks.

For you, SMR disks should generally just be dog slow at certain times, and potentially time out, which could show up as an I/O error, which then disconnects your pool!

Ideally there would be a way to change the disk I/O timeout so that an SMR disk would not cause any problems (except being slow at times). I just don't know of any way.


One extreme way to temporarily work around the problem, and one that is not likely to work, is to turn off everything else, including Samba & NFS sharing. Anything unrelated to the problem pool gets stopped and the other pools get exported. Free up as much memory and CPU as you can. Then see if your problematic pool can be used (while logged in via SSH).

It may be that you are on the edge of what that hardware can do with that SMR disk.
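Roughly, after stopping the SMB/NFS services from the web UI, the pool part of that experiment would look like this (pool names are placeholders):

Code:
zpool export SOMEOTHERPOOL    # repeat for every pool except the problem one
# ...test the problem pool over SSH...
zpool import SOMEOTHERPOOL    # bring the others back when you are done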
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
While waiting for the required information, I want to talk a bit about a few things.
I will be lenient about the tone/frustration of your first posts, especially since I trust you have understood that the situation was caused by your poor choices and not by a system bug or similar, but I have to address at the very least the following lines (and the thinking behind them).
If this was an unattended system in a remote location then I would have entirely killed the server because 1 drive failed. No way to reboot the server short of cutting the power physically.
Now this is unacceptable - this means I can (or rather won't) ever install this in a production environment like a business.

In a production environment you would be using a server-grade motherboard, which means having access to IPMI: the whole point of IPMI is addressing exactly this kind of issue, where the system is unresponsive and you need to power-cycle or simply reboot it without necessarily having access to its physical location. Your choice of gaming-grade hardware limited your ability to respond remotely to the crisis.

Do note, I am not pointing fingers or blaming you for choosing to save money: a TN system can work perfectly fine with non-server hardware (provided that the necessary precautions are taken), but you cannot blame the system for the inability to power it off remotely when you chose hardware without that functionality.
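Just to illustrate what IPMI buys you: with a BMC, a hung box can be queried and power-cycled from any other machine on the network with something like this (the address and credentials here are made up):

Code:
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret chassis power status
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret chassis power cycle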
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
2. As (again) others have pointed out, you using port multipliers is another BIG issue (I'd say BIGGER than the SMR drives): not to undermine your research efforts, but the ASM1166 is a port multiplier (not compatible with ZFS, unless you use them to connect a single drive)
...
Checking: from what the datasheet says, the ASMedia ASM1166 chip is NOT a SATA Port Multiplier.
It appears to be a six-port SATA controller that can support SATA Port Multipliers on its own ports.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Code:
root@truenas[~]# camcontrol devlist
<Samsung SSD 750 EVO 120GB MAT01B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<ST2000DM008-2FR102 0001>          at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD5000AAKX-001CA0 15.01H15>   at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus3 target 0 lun 0 (ada3,pass3)
<ST500LT012-1DG142 1003YAM1>       at scbus6 target 0 lun 0 (ada4,pass4)
<ST3500418AS CC38>                 at scbus7 target 0 lun 0 (ada5,pass5)

Code:
root@truenas[~]# zpool status
  pool: NR-2TiB
 state: ONLINE
  scan: scrub repaired 0B in 02:20:30 with 0 errors on Wed Sep 20 08:11:58 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        NR-2TiB                                       ONLINE       0     0     0
          gptid/fc334b46-2b59-11ec-800d-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: NR.2TiB.2
 state: ONLINE
  scan: scrub in progress since Mon Oct  9 16:07:36 2023
        540G scanned at 265M/s, 129G issued at 63.0M/s, 570G total
        0B repaired, 22.54% done, 01:59:36 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        NR.2TiB.2                                     ONLINE       0     0     0
          gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: RAIDZ.1TiB
 state: ONLINE
  scan: resilvered 2.09M in 00:00:01 with 0 errors on Mon Oct  9 11:05:51 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAIDZ.1TiB                                      ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc  ONLINE       0     0     0
            gptid/4f771390-578d-11ee-9f43-001517cb21bc  ONLINE       0     0     0
            gptid/456fa97e-5812-11ee-9f43-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Sun Oct  8 03:45:08 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: nvme0
 state: ONLINE
  scan: scrub repaired 0B in 00:00:33 with 0 errors on Sun Oct  8 00:00:33 2023
config:

Code:
root@truenas[~]# zpool import
no pools available to import

Code:
root@truenas[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
NR-2TiB                                         1.81T  1.15T   677G        -     -    10%    63%  1.00x    ONLINE  /mnt
  gptid/fc334b46-2b59-11ec-800d-001517cb21bc    1.82T  1.15T   677G        -     -    10%  63.5%      -    ONLINE
NR.2TiB.2                                       1.81T   433G  1.39T        -     -     0%    23%  1.00x    ONLINE  /mnt
  gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc    1.82T   433G  1.39T        -     -     0%  23.4%      -    ONLINE
RAIDZ.1TiB                                      1.35T   686G   698G        -     -     2%    49%  1.00x    ONLINE  /mnt
  raidz1-0                                      1.35T   686G   698G        -     -     2%  49.5%      -    ONLINE
    gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/4f771390-578d-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/456fa97e-5812-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
boot-pool                                        111G  1.57G   109G        -     -     0%     1%  1.00x    ONLINE  -
  ada0p2                                         112G  1.57G   109G        -     -     0%  1.41%      -    ONLINE
nvme0                                            220G  33.7G   186G        -     -     2%    15%  1.00x    ONLINE  /mnt
  gptid/060124c9-2a55-11ee-ad01-001517cb21bc     222G  33.7G   186G        -     -     2%  15.3%      -    ONLINE
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
I'm not running the ASM1166 right now anyway. It's been removed until I can get this sorted.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Checking, from what the following datasheet says, the ASMedia ASM1166 chip is NOT a SATA Port Multiplier.
It appears to be a 6 SATA port chip, that can support SATA Port Multipliers on it's own ports.
It is a low latency, low cost and low power AHCI controller with six SATA ports and cascaded port multipliers.
My understanding is that it has port multipliers, but I might be wrong.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
So you're saying it's normal for an interface going down to crash the server, and that I need hardware to power-cycle the system in case of such a failure. I'm not talking datacenters, I'm talking SOHO.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My understanding is that it has port multipliers, but I might be wrong.
Hmmm, I don't know... I just read that referenced datasheet and it seemed to imply that it supported Port Multipliers, not that it had Port Multiplier(s) on it. But, as the original poster said, it's not an issue now.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
So you're saying it's normal for an interface going down to crash the server, and that I need hardware to power cycle the system in case of such failure. I'm not talking datacenters I'm talking SOHO.
I never wrote any such thing; please read my post again. But I do think it's very likely that your system crashing was the result of your actions.

Code:
root@truenas[~]# camcontrol devlist
<Samsung SSD 750 EVO 120GB MAT01B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<ST2000DM008-2FR102 0001>          at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD5000AAKX-001CA0 15.01H15>   at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus3 target 0 lun 0 (ada3,pass3)
<ST500LT012-1DG142 1003YAM1>       at scbus6 target 0 lun 0 (ada4,pass4)
<ST3500418AS CC38>                 at scbus7 target 0 lun 0 (ada5,pass5)

We see only 6 drives here.

Code:
root@truenas[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
NR-2TiB                                         1.81T  1.15T   677G        -     -    10%    63%  1.00x    ONLINE  /mnt
  gptid/fc334b46-2b59-11ec-800d-001517cb21bc    1.82T  1.15T   677G        -     -    10%  63.5%      -    ONLINE

NR.2TiB.2                                       1.81T   433G  1.39T        -     -     0%    23%  1.00x    ONLINE  /mnt
  gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc    1.82T   433G  1.39T        -     -     0%  23.4%      -    ONLINE

RAIDZ.1TiB                                      1.35T   686G   698G        -     -     2%    49%  1.00x    ONLINE  /mnt
  raidz1-0                                      1.35T   686G   698G        -     -     2%  49.5%      -    ONLINE
    gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/4f771390-578d-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/456fa97e-5812-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE

boot-pool                                        111G  1.57G   109G        -     -     0%     1%  1.00x    ONLINE  -
  ada0p2                                         112G  1.57G   109G        -     -     0%  1.41%      -    ONLINE

nvme0                                            220G  33.7G   186G        -     -     2%    15%  1.00x    ONLINE  /mnt
  gptid/060124c9-2a55-11ee-ad01-001517cb21bc     222G  33.7G   186G        -     -     2%  15.3%      -    ONLINE

While here we have 7.

This is an issue. How are the drives connected to the motherboard?
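If you're unsure, the verbose listing should show which controller each disk sits behind (assuming the stock FreeBSD tools on CORE):

Code:
camcontrol devlist -v   # lists each controller channel (ahcichX) together with the disk attached to it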
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@ghost_za - You have a scrub running on the impacted pool;
scan: scrub in progress since Mon Oct 9 16:07:36 2023

540G scanned at 265M/s, 129G issued at 63.0M/s, 570G total

0B repaired, 22.54% done, 01:59:36 to go
Until that is done, paused or canceled, that pool will likely be slow.

But, the good news is that the problematic pool is working at present.
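If you need the pool responsive for your copy right now, you can pause or cancel the scrub; it is safe, and a paused scrub picks up where it left off:

Code:
zpool scrub -p NR.2TiB.2   # pause the scrub (run 'zpool scrub NR.2TiB.2' later to resume)
zpool scrub -s NR.2TiB.2   # or stop it entirely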
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Hmmm, I don't know... I just read that referenced datasheet and it seemed to imply that it supported Port Multipliers, not that it had Port Multiplier(s) on it.
A quick search in the forum brought this up:
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Forget about the ASM1166 for a moment, please... it's been removed and I haven't gotten around to properly testing it yet. I don't know if it's an issue, but once I test it and determine it's not worthwhile I will remove it.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Until that is done, paused or canceled, that pool will likely be slow.
But, the good news is that the problematic pool is working at present.

Yeah, this is after a reboot - I initiated it. I'm just copying the data over, then I'm removing it. Both of the faulty HDDs are WD SMR drives, as you guys pointed out. I can tell you one thing now... I'm sticking to Seagate from now on.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
I will reset the BIOS to safe defaults after the disk operations have finished.
 