Pool "ONLINE (Unhealthy)" - please provide guidance on how to troubleshoot

Siege801

Cadet
Joined
Jun 30, 2021
Messages
5
Hello, and thanks in advance for reading.

This is a new server configuration, and my first time with both VMware ESXi and TrueNAS. Today's task was meant to be configuring backups now that the system is up and running, but trouble struck before I had the chance to get them going.

Here is my rig:
  • CPU: 6 CPUs x Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
  • RAM: 32 GB
  • Motherboard: ASRock Z390 Pro4 ATX LGA1151

Hard Drives:
  • 1 x 500 GB (TrueNAS Misc. - used for plugin storage)
  • 4 x 2 TB Seagate ATA disks (6 TB TrueNAS pool)
  • -- POOL_BigData (Media)
  • -- POOL_NCDATA (NextCloud data)
  • 1 x 250 GB SATA SSD (VMware VMs)

TrueNAS VM:

Within ESXi, the TrueNAS virtual machine is as follows:
  • 4 x CPU Cores
  • 16 GB RAM
  • TrueNAS is version 12.0-U4


The issue:

Overnight the following alerts presented within TrueNAS:
Code:
CRITICAL
Pool POOL_BigData state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

WARNING
Device /dev/gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d is causing slow I/O on pool POOL_BigData.


I have searched around to try and find some guidance on troubleshooting. From what I can tell, everything looks to be in an OK state other than the 3 read errors. I'm not certain how big a deal those are.

Code:
root@LucindNAS[~]# zpool status -v
  pool: POOL_500Gb M.2
state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        POOL_500Gb M.2                                ONLINE       0     0     0
          gptid/00eafae0-cdc6-11eb-96b9-000c29ac803d  ONLINE       0     0     0

errors: No known data errors

  pool: POOL_BigData
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 01:56:06 with 0 errors on Sun Jun 20 01:56:06 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        POOL_BigData                                    ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d  ONLINE       3     0     0
            gptid/17cfc0cc-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/17d28b06-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/18212509-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0

errors: No known data errors
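
With more disks, a status table like this gets hard to eyeball. Here is a minimal sketch that filters `zpool status` output down to the rows whose READ, WRITE or CKSUM counter is non-zero. The function name `zpool_error_rows` is made up for illustration, and the parsing assumes the column layout shown above (a state word followed by three counters).

```shell
#!/bin/sh
# Read `zpool status` text on stdin and print every pool/vdev/device row
# that has a non-zero READ, WRITE or CKSUM counter.
# zpool_error_rows is a hypothetical helper name, not a ZFS command.
zpool_error_rows() {
    awk '
        BEGIN {
            # Device states that can precede the three counters.
            state = "^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$"
            # Counters are plain numbers, possibly abbreviated (e.g. 1.2K).
            count = "^[0-9.]+[KMGTP]?$"
        }
        {
            # Locate the STATE column; names may contain spaces.
            for (i = 2; i + 3 <= NF; i++) {
                if ($i ~ state && $(i+1) ~ count && $(i+2) ~ count && $(i+3) ~ count) {
                    if ($(i+1) != "0" || $(i+2) != "0" || $(i+3) != "0") {
                        name = $1
                        for (j = 2; j < i; j++) name = name " " $j
                        printf "%s read=%s write=%s cksum=%s\n", name, $(i+1), $(i+2), $(i+3)
                    }
                    break
                }
            }
        }
    '
}
```

Usage from the TrueNAS shell would be `zpool status | zpool_error_rows`. Against the output above it prints only the `gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d` row, with `read=3 write=0 cksum=0`.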


Because TrueNAS sees the drives as VMware virtual disks, little SMART information seems to be available from within TrueNAS. For example:
Code:
=== START OF INFORMATION SECTION ===
Vendor:               VMware
Product:              Virtual disk
Revision:             2.0
Compliance:           SPC-4
User Capacity:        1,979,120,929,792 bytes [1.97 TB]
Logical block size:   512 bytes
LU is fully provisioned
Device type:          disk
Local Time is:        Sun Jul  4 23:08:34 2021 PDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===


So, from within ESXi, I found the following. (Does the "Uncorrectable Error Count" on drives 2, 3 and 4 look like a potential concern?)
Code:

[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____WDC__WDS500G2B0B2D00YS70_________________21031X803085________
Parameter                  Value  Threshold  Worst  Raw
-------------------------  -----  ---------  -----  ---
Health Status              OK     N/A        N/A    N/A
Media Wearout Indicator    251    0          N/A    251
Power-on Hours             139    0          N/A    139
Power Cycle Count          19     0          N/A    19
Reallocated Sector Count   0      0          N/A    0
Drive Temperature          65     0          39     35
Write Sectors TOT Count    34     0          N/A    34
Read Sectors TOT Count     92     0          N/A    92
Program Fail Count         0      0          N/A    0
Erase Fail Count           0      0          N/A    0
Uncorrectable Error Count  0      0          N/A    0


[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DL0032D9VT166__________________________________5YD3A2RK
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   114    6          99     208
Power-on Hours                     96     0          96     229
Power Cycle Count                  95     20         95     224
Reallocated Sector Count           100    36         100    0
Drive Temperature                  29     0          42     29
Write Sectors TOT Count            100    0          253    71
Read Sectors TOT Count             100    0          253    233
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0


[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DM0082D2FR102__________________________________ZFL38P0N
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   100    6          64     118
Power-on Hours                     100    0          100    216
Power Cycle Count                  100    20         100    5
Reallocated Sector Count           100    10         100    0
Drive Temperature                  31     0          40     31
Write Sectors TOT Count            100    0          253    12
Read Sectors TOT Count             100    0          253    177
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0


[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DM0082D2FR102__________________________________ZFL38NDS
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   81     6          64     64
Power-on Hours                     100    0          100    216
Power Cycle Count                  100    20         100    5
Reallocated Sector Count           100    10         100    0
Drive Temperature                  31     0          40     31
Write Sectors TOT Count            100    0          253    28
Read Sectors TOT Count             100    0          253    151
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0
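
A note on reading these tables, with a minimal sketch: as far as I understand SMART reporting, the Value column is a normalised health score that is compared against Threshold, while Raw carries the underlying counter. On that reading, an "Uncorrectable Error Count" of Value 100 with Raw 0 is not a count of 100 errors. The helper below is hypothetical (the name `smart_flags` and the list of watched attributes are my own choices). It flags any attribute whose Value has dropped to or below a non-zero Threshold, and any of the watched counters with a non-zero Raw value.

```shell
#!/bin/sh
# Read `esxcli storage core device smart get` output on stdin and print
# attributes that look worrying. smart_flags is a hypothetical helper name.
smart_flags() {
    awk -F '  +' '
        # Drop trailing whitespace so the Raw column parses cleanly.
        { sub(/[ \t\r]+$/, "") }
        # Skip the header, the dashed rule and blank or short lines.
        NF < 5 || $1 == "Parameter" || $1 ~ /^-+$/ { next }
        {
            name = $1; value = $2; threshold = $3; raw = $5
            # Normalised value at or below a non-zero threshold.
            if (value ~ /^[0-9]+$/ && threshold ~ /^[0-9]+$/ &&
                threshold + 0 > 0 && value + 0 <= threshold + 0)
                printf "FAIL  %s: value %s <= threshold %s\n", name, value, threshold
            # Raw counters that are normally expected to stay at zero
            # (this watch list is an assumption, adjust to taste).
            if (name ~ /^(Reallocated Sector Count|Uncorrectable Error Count|Pending Sector Reallocation Count|Uncorrectable Sector Count)$/ &&
                raw ~ /^[0-9]+$/ && raw + 0 > 0)
                printf "WARN  %s: raw count %s\n", name, raw
        }
    '
}
```

Usage on the ESXi shell would be `esxcli storage core device smart get -d <device> | smart_flags`. None of the four tables above produce any output from it, which matches the "Health Status OK" lines; it says nothing about faults that SMART does not record.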



So, what am I looking at here? Was there some kind of issue and the system has recovered OK? Do I just need to run zpool clear as suggested? Or is there something more sinister / problematic? Currently, TrueNAS is running, but its services (NFS and NextCloud) are non-functional.

Your patience and assistance would be greatly appreciated.

Regards.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
For a start, you should get a good backup of your pool while you still can.

Next, you're doing the storage wrong if you want to keep your data:

After you've done the first thing and read the second, you may decide it's time to do some rebuilding.

If this is a test system and you're just playing around and don't care about the data, run zpool clear and wait for the next problem.
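
For that last route, a minimal sketch of what "clear and wait" could look like, assuming the pool name is passed in and that following the clear with a scrub is acceptable on your system. The function name `clear_and_scrub` is hypothetical.

```shell
#!/bin/sh
# Reset a pool's error counters, start a scrub so ZFS re-reads every block,
# then show the pool's health summary.
# clear_and_scrub is a hypothetical helper name, not a ZFS command.
clear_and_scrub() {
    pool=$1
    zpool clear "$pool" || return 1
    zpool scrub "$pool" || return 1
    # -x limits the report to pools that are not healthy.
    zpool status -x "$pool"
}
```

Usage would be `clear_and_scrub POOL_BigData`. The scrub runs in the background, so run `zpool status POOL_BigData` again once it has finished to see whether the read errors come back.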
 

Siege801

Cadet
Joined
Jun 30, 2021
Messages
5
Thanks very much for the reply.

Backing up the data is indeed my priority right now - I'm just not sure how. Currently NFS doesn't seem to be working. Any device that tries to open the NFS share just hangs. Do I need to zpool clear before I can access the pool?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Quote:
Do I need to zpool clear before I can access the pool?
If the pool is showing as online, you shouldn't need to.

You may find it necessary to restart the VM to get access back depending on what the root cause of the problem is.

You may or may not be able to use rsync or zfs send to get at the data from ssh.
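
A minimal sketch of the `zfs send` route, assuming a second machine that runs ZFS and is reachable over SSH, and a target dataset that does not exist yet. The host name, target dataset and snapshot name are placeholders, and `zfs_rescue_copy` is a hypothetical helper name.

```shell
#!/bin/sh
# Snapshot a pool recursively and stream it to another ZFS host over SSH.
# zfs_rescue_copy is a hypothetical helper; all arguments are placeholders.
zfs_rescue_copy() {
    src_pool=$1      # e.g. POOL_BigData
    dest_host=$2     # e.g. root@backuphost (hypothetical SSH target)
    dest_dataset=$3  # e.g. backup/BigData (hypothetical, must not exist yet)
    snap="rescue-$(date +%Y%m%d)"

    # -r snapshots the pool's root dataset and every child dataset.
    zfs snapshot -r "${src_pool}@${snap}" || return 1

    # -R sends the whole dataset tree; -u stops the receiver from mounting it.
    zfs send -R "${src_pool}@${snap}" | ssh "$dest_host" zfs receive -u "$dest_dataset"
}
```

Usage would be something like `zfs_rescue_copy POOL_BigData root@backuphost backup/BigData`. If the other machine does not run ZFS, a plain file copy such as `rsync -a /mnt/POOL_BigData/ user@backuphost:/some/path/` is the fallback, assuming the pool is mounted under `/mnt` as TrueNAS normally does; the destination path here is made up.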
 

Siege801

Cadet
Joined
Jun 30, 2021
Messages
5
OK, I've decided to split the server: TrueNAS will be installed on bare metal, and I'll spin up a new ESXi server. It turns out one of the 2 TB drives had dropped its guts; after a reboot, the ESXi server could no longer detect it.

NextCloud database exported, data all backed up. Time for a rebuild.

Thanks for your help!
 