Uncorrectable I/O failure and disaster recovery

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
just got this...
Another failure identical to the previous one. Exact same progression: high CPU -> I/O failure -> data corruption -> pool unavailable.
M.2 expansion is out.
3TiB is out.
Everything is running on the previous config that worked... The only thing that's changed is the upgrade from 12 to 13.

Oct 9 15:07:51 truenas kernel: pid 5961 (smbd), jid 0, uid 0: exited on signal 6
Oct 9 15:08:37 truenas (ada3:ahcich3:0:0:0): Periph destroyed
Oct 9 15:08:37 truenas ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
Oct 9 15:08:37 truenas ada3: <WDC WD20EARX-00PASB0 51.0AB51> ATA8-ACS SATA 3.x device
Oct 9 15:08:37 truenas ada3: Serial Number WD-WCAZAJ631126
Oct 9 15:08:37 truenas ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Oct 9 15:08:37 truenas ada3: Command Queueing enabled
Oct 9 15:08:37 truenas ada3: 1907728MB (3907027055 512 byte sectors)
Oct 9 15:08:37 truenas ada3: quirks=0x1<4K>
Oct 9 15:08:41 truenas Solaris: WARNING: Pool 'NR.2TiB.2' has encountered an uncorrectable I/O failure and has been suspended.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
What is the output of camcontrol devlist, zpool status, zpool import, and zpool list -v? Please post the output of each command between [CODE][/CODE] tags.
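For reference, all of these can be run as root from the web UI's Shell or over SSH, with no extra arguments beyond those shown:

Code:
camcontrol devlist   # lists the disks the OS actually sees, and which bus each sits on
zpool status         # health, scan/scrub state and vdev layout of every imported pool
zpool import         # shows pools that exist on disk but are not currently imported
zpool list -v        # capacity, fragmentation and per-device breakdown of each pool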
You should really stop and put together a proper, accurate hardware list, without which we have to diagnose blind.

I also suggest you start reading the resources this forum makes available (there are links in my signature), since your posts show a lack of understanding of TN and ZFS.

Please also try to condense your posts and keep calm: our chances of success are much better if we take one careful step at a time.

Finally, I would like to add that things apparently going well for a period of time doesn't mean things are (or were) fine; also, your approach of blaming the software before checking whether it's you who committed errors (which, as others pointed out, you did) is not constructive.
 
Last edited:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should really stop and put together a proper hardware list.
Agreed, and include the exact TrueNAS software version you are running; "12 to 13" has me making assumptions.

Since you did the upgrade, did you apply any new ZFS features to your pool? If not, you could roll back to your previous version with little effort, and that should allow your system to continue supporting your needs.
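If you're not sure whether you enabled new feature flags, a quick sanity check from a shell looks something like this (replace POOL with your pool's name):

Code:
zpool upgrade                       # lists pools that still have supported features disabled
zpool get all POOL | grep feature@  # shows each feature flag as disabled, enabled or active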
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Would this do?

Yes, feature upgrades were made, so there's no going back.
 

Attachments

  • ixdiagnose.zip
    10 KB · Views: 44

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
While waiting for the required information (the command output and the accurate hardware list), here are a few things:
  1. As others pointed out, SMR drives are a BIG issue (WD itself declared that its SMR drives are not compatible with ZFS)
  2. As (again) others have pointed out, your use of port multipliers is another BIG issue (I'd say BIGGER than the SMR drives): not to undermine your research efforts, but the ASM1166 is a port multiplier (not compatible with ZFS, unless you use it to connect a single drive)
  3. Anything that deviates from standard settings (i.e. overclocking and underclocking, or overvolting and undervolting) does NOT contribute to system stability and resilience, especially if done in an improper way
At first glance, points 1 and 2 are the likely cause of the I/O errors you are getting, but there could be a whole lot of things happening; regarding the underclock, I would absolutely reset the BIOS to factory defaults.

Would this do?
Not for me, maybe for joe.

More questions:
What was the boot pool drive? How was it attached to the motherboard? Do you have a configuration backup?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Is there a workaround or something? Something that could potentially reduce speeds but give some degree of reliability, for instance?
That is part of the problem: the DRIVE handles SMR, not the OS driver or ZFS. It is not possible to work around the problem. If it takes the SMR drive time to do something, you just have to wait.

...
If that's the case, and it certainly sounds like it might be, how would I fix it? I thought that exact same thing and figured a disconnect of the pool and an import might fix it, but as you know it didn't.
No, if a bit flip occurred and was written to disk, that will not be fixed by a disconnect, import or scrub. Well, in rare cases, with a pool that has redundancy, a bit flip might be recoverable. But not in your case. In essence, whatever is corrupt stays corrupt, unless you delete it. The following command will show whether there are any permanent errors (replace POOL with the pool name):
zpool status -v POOL
In rare cases, an error might not show up until a scrub is run and finds it.
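As a minimal sketch (POOL is a placeholder for the pool name):

Code:
zpool status -v POOL   # lists any files with permanent (unrecoverable) errors
zpool scrub POOL       # force every block to be read and checksummed
zpool status -v POOL   # check again once the scrub has finished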

...
So it would make sense that after all the transfers it caused some corruption due to the SMR issue.
ZFS does not corrupt data because a disk is SMR, as far as I know. Recent Western Digital Reds that are SMR have a bug that causes a read error (well, "ID not found", a disk sector issue). The good news is you don't appear to have any WD Red SMR disks.

For you, SMR disks should generally just be dog slow at certain times, and potentially time out, which could show up as an I/O error, which then disconnects your pool!

Ideally there would be a way to change the disk I/O timeout so that an SMR disk would not cause any problems (except being slow at times). I just don't know of any way.


One extreme way to temporarily work around the problem, and one that is not likely to work, is to turn off everything else, including Samba & NFS sharing. Anything unrelated to the problem pool gets stopped and the other pools get exported. Free up as much memory and CPU as you can. Then see if your problematic pool can be used (while logged in via SSH).

It may be that you are on the edge of what that hardware can do with that SMR disk.
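Roughly, after stopping the SMB/NFS services from the web UI, the pool part of that experiment would look like this (pool names are placeholders):

Code:
zpool export SOMEOTHERPOOL    # repeat for every pool except the problem one
# ...test the problem pool over SSH...
zpool import SOMEOTHERPOOL    # bring the others back when you are done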
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
While waiting for the required information, I want to talk a bit about a few things.
I will be lenient about the tone/frustration of your first posts, especially since I trust you have understood that the situation was caused by your poor choices and not by a system bug or similar, but I have to address at the very least the following lines (and the thinking behind them).
If this was an unattended system in a remote location then I would have entirely killed the server because 1 drive failed. No way to reboot the server short of cutting the power physically.
Now this is unacceptable - this means I can (or rather won't) ever install this in a production environment like a business.

In a production environment you would be using a server-grade motherboard, which means having access to IPMI: the whole point of IPMI is addressing exactly this kind of issue, where the system is unresponsive and you need to power-cycle or simply reboot it without necessarily having access to its physical location. Your choice of gaming-grade hardware limited your ability to respond remotely to the crisis.

Do note, I am not pointing fingers or blaming you for choosing to save money: a TN system can work perfectly fine with non-server hardware (provided that the necessary precautions are taken), but you cannot blame the system for the inability to power it off remotely when you chose hardware without that functionality.
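Just to illustrate what IPMI buys you: with a BMC, a hung box can be queried and power-cycled from any other machine on the network with something like this (the address and credentials here are made up):

Code:
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret chassis power status
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret chassis power cycle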
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
2. As (again) others have pointed out, you using port multipliers is another BIG issue (I'd say BIGGER than the SMR drives): not to undermine your research efforts, but the ASM1166 is a port multiplier (not compatible with ZFS, unless you use them to connect a single drive)
...
Checking: from what the datasheet says, the ASMedia ASM1166 chip is NOT a SATA Port Multiplier.
It appears to be a six-port SATA controller that can support SATA Port Multipliers on its own ports.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Code:
root@truenas[~]# camcontrol devlist
<Samsung SSD 750 EVO 120GB MAT01B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<ST2000DM008-2FR102 0001>          at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD5000AAKX-001CA0 15.01H15>   at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus3 target 0 lun 0 (ada3,pass3)
<ST500LT012-1DG142 1003YAM1>       at scbus6 target 0 lun 0 (ada4,pass4)
<ST3500418AS CC38>                 at scbus7 target 0 lun 0 (ada5,pass5)

Code:
root@truenas[~]# zpool status
  pool: NR-2TiB
 state: ONLINE
  scan: scrub repaired 0B in 02:20:30 with 0 errors on Wed Sep 20 08:11:58 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        NR-2TiB                                       ONLINE       0     0     0
          gptid/fc334b46-2b59-11ec-800d-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: NR.2TiB.2
 state: ONLINE
  scan: scrub in progress since Mon Oct  9 16:07:36 2023
        540G scanned at 265M/s, 129G issued at 63.0M/s, 570G total
        0B repaired, 22.54% done, 01:59:36 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        NR.2TiB.2                                     ONLINE       0     0     0
          gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: RAIDZ.1TiB
 state: ONLINE
  scan: resilvered 2.09M in 00:00:01 with 0 errors on Mon Oct  9 11:05:51 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAIDZ.1TiB                                      ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc  ONLINE       0     0     0
            gptid/4f771390-578d-11ee-9f43-001517cb21bc  ONLINE       0     0     0
            gptid/456fa97e-5812-11ee-9f43-001517cb21bc  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Sun Oct  8 03:45:08 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: nvme0
 state: ONLINE
  scan: scrub repaired 0B in 00:00:33 with 0 errors on Sun Oct  8 00:00:33 2023
config:

Code:
root@truenas[~]# zpool import
no pools available to import

Code:
root@truenas[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
NR-2TiB                                         1.81T  1.15T   677G        -     -    10%    63%  1.00x    ONLINE  /mnt
  gptid/fc334b46-2b59-11ec-800d-001517cb21bc    1.82T  1.15T   677G        -     -    10%  63.5%      -    ONLINE
NR.2TiB.2                                       1.81T   433G  1.39T        -     -     0%    23%  1.00x    ONLINE  /mnt
  gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc    1.82T   433G  1.39T        -     -     0%  23.4%      -    ONLINE
RAIDZ.1TiB                                      1.35T   686G   698G        -     -     2%    49%  1.00x    ONLINE  /mnt
  raidz1-0                                      1.35T   686G   698G        -     -     2%  49.5%      -    ONLINE
    gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/4f771390-578d-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/456fa97e-5812-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
boot-pool                                        111G  1.57G   109G        -     -     0%     1%  1.00x    ONLINE  -
  ada0p2                                         112G  1.57G   109G        -     -     0%  1.41%      -    ONLINE
nvme0                                            220G  33.7G   186G        -     -     2%    15%  1.00x    ONLINE  /mnt
  gptid/060124c9-2a55-11ee-ad01-001517cb21bc     222G  33.7G   186G        -     -     2%  15.3%      -    ONLINE
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
I'm not running the ASM1166 right now anyway. It's been removed until I can get this sorted.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Checking, from what the following datasheet says, the ASMedia ASM1166 chip is NOT a SATA Port Multiplier.
It appears to be a 6 SATA port chip, that can support SATA Port Multipliers on it's own ports.
It is a low latency, low cost and low power AHCI controller with six SATA ports and cascaded port multipliers.
My understanding is that it has port multipliers, but I might be wrong.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
So you're saying it's normal for an interface going down to crash the server, and that I need hardware to power-cycle the system in case of such a failure. I'm not talking datacenters, I'm talking SOHO.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My understanding is that it has port multipliers, but I might be wrong.
Hmmm, I don't know... I just read that referenced datasheet and it seemed to imply that it supported Port Multipliers, not that it had Port Multiplier(s) on it. But, as the original poster said, it's not an issue now.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
So you're saying it's normal for an interface going down to crash the server, and that I need hardware to power cycle the system in case of such failure. I'm not talking datacenters I'm talking SOHO.
I never wrote any such thing; please read my post again. But I do think it's very likely that your system crashing was the result of your actions.

Code:
root@truenas[~]# camcontrol devlist
<Samsung SSD 750 EVO 120GB MAT01B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<ST2000DM008-2FR102 0001>          at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD5000AAKX-001CA0 15.01H15>   at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus3 target 0 lun 0 (ada3,pass3)
<ST500LT012-1DG142 1003YAM1>       at scbus6 target 0 lun 0 (ada4,pass4)
<ST3500418AS CC38>                 at scbus7 target 0 lun 0 (ada5,pass5)

We see only 6 drives here.

Code:
root@truenas[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
NR-2TiB                                         1.81T  1.15T   677G        -     -    10%    63%  1.00x    ONLINE  /mnt
  gptid/fc334b46-2b59-11ec-800d-001517cb21bc    1.82T  1.15T   677G        -     -    10%  63.5%      -    ONLINE

NR.2TiB.2                                       1.81T   433G  1.39T        -     -     0%    23%  1.00x    ONLINE  /mnt
  gptid/b77789fb-64ba-11ee-b4e7-001517cb21bc    1.82T   433G  1.39T        -     -     0%  23.4%      -    ONLINE

RAIDZ.1TiB                                      1.35T   686G   698G        -     -     2%    49%  1.00x    ONLINE  /mnt
  raidz1-0                                      1.35T   686G   698G        -     -     2%  49.5%      -    ONLINE
    gptid/3c4b74fe-58d7-11ee-9748-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/4f771390-578d-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE
    gptid/456fa97e-5812-11ee-9f43-001517cb21bc   464G      -      -        -     -      -      -      -    ONLINE

boot-pool                                        111G  1.57G   109G        -     -     0%     1%  1.00x    ONLINE  -
  ada0p2                                         112G  1.57G   109G        -     -     0%  1.41%      -    ONLINE

nvme0                                            220G  33.7G   186G        -     -     2%    15%  1.00x    ONLINE  /mnt
  gptid/060124c9-2a55-11ee-ad01-001517cb21bc     222G  33.7G   186G        -     -     2%  15.3%      -    ONLINE

While here we have 7.

This is an issue. How are the drives connected to the motherboard?
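If you're unsure, the verbose listing should show which controller each disk sits behind (assuming the stock FreeBSD tools on CORE):

Code:
camcontrol devlist -v   # lists each controller channel (ahcichX) together with the disk attached to it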
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@ghost_za - You have a scrub running on the impacted pool;
scan: scrub in progress since Mon Oct 9 16:07:36 2023

540G scanned at 265M/s, 129G issued at 63.0M/s, 570G total

0B repaired, 22.54% done, 01:59:36 to go
Until that is done, paused or canceled, that pool will likely be slow.

But, the good news is that the problematic pool is working at present.
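If you need the pool responsive for your copy right now, you can pause or cancel the scrub; it is safe, and a paused scrub picks up where it left off:

Code:
zpool scrub -p NR.2TiB.2   # pause the scrub (run 'zpool scrub NR.2TiB.2' later to resume)
zpool scrub -s NR.2TiB.2   # or stop it entirely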
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Hmmm, I don't know... I just read that referenced datasheet and it seemed to imply that it supported Port Multipliers, not that it had Port Multiplier(s) on it.
A quick search in the forum brought this up:
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Forget about the ASM1166 for a moment, please... it's been removed and I haven't gotten around to properly testing it yet. I don't know if it's an issue, but once I test it and determine it's not worthwhile I will remove it.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
Until that is done, paused or canceled, that pool will likely be slow.
But, the good news is that the problematic pool is working at present.

Yeah, this is after a reboot - I initiated it. I'm just copying the data over, then I'm removing it. Both of the faulty HDDs are WD SMR drives, as you guys pointed out. I can tell you one thing now... I'm sticking to Seagate from now on.
 

ghost_za

Dabbler
Joined
Oct 13, 2021
Messages
42
I will reset the BIOS to safe defaults after the disk operations have finished.
 