Hello, and thanks in advance for reading.
This is a new server configuration, and my first time with VMware ESXi and with TrueNAS. Today's task was supposed to be configuring backups now that the system is up and running, but trouble struck before I had a chance to get them going.
Here is my rig:
- CPU: 6 CPUs x Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
- RAM: 32GB
- Motherboard: ASRock Z390 Pro4 ATX LGA1151
Hard Drives:
- 1 x 500GB (TrueNAS misc., used for plugin storage)
- 4 x 2TB Seagate ATA disks (6Gb TrueNAS POOL)
  - POOL_BigData (Media)
  - POOL_NCDATA (NextCloud data)
- 1 x 250GB SATA SSD (VMware VMs)
TrueNAS VM:
Within ESXi, the TrueNAS virtual machine is as follows:
- 4 x CPU Cores
- 16GB RAM
- TrueNAS is version 12.0-U4
The issue:
Overnight the following alerts presented within TrueNAS:
Code:
CRITICAL
Pool POOL_BigData state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

WARNING
Device /dev/gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d is causing slow I/O on pool POOL_BigData.
I have searched around to try and find some guidance on troubleshooting. From what I can tell, everything looks to be in an OK state other than the 3 read errors. I'm not certain how big a deal those are.
Code:
root@LucindNAS[~]# zpool status -v
  pool: POOL_500Gb M.2
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        POOL_500Gb M.2                                ONLINE       0     0     0
          gptid/00eafae0-cdc6-11eb-96b9-000c29ac803d  ONLINE       0     0     0

errors: No known data errors

  pool: POOL_BigData
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 01:56:06 with 0 errors on Sun Jun 20 01:56:06 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        POOL_BigData                                    ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d  ONLINE       3     0     0
            gptid/17cfc0cc-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/17d28b06-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/18212509-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0

errors: No known data errors
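To double-check that I am reading the status output correctly, here is a rough Python sketch that scans `zpool status` text for config rows with a non-zero READ, WRITE or CKSUM counter. The function name `devices_with_errors` and the regular expression are my own, not part of any ZFS tooling, and the sketch assumes plain integer counters (it would skip rows where ZFS abbreviates large counts, and rows whose name contains a space, such as my "POOL_500Gb M.2" pool).

```python
import re

# Sample copied from the POOL_BigData section of the zpool status output.
SAMPLE = """\
  pool: POOL_BigData
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        POOL_BigData                                    ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/17c1c0e0-cf4e-11eb-a969-000c29ac803d  ONLINE       3     0     0
            gptid/17cfc0cc-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/17d28b06-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0
            gptid/18212509-cf4e-11eb-a969-000c29ac803d  ONLINE       0     0     0

errors: No known data errors
"""

# Assumed row shape: <name> <STATE> <read> <write> <cksum>
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return config rows whose READ, WRITE or CKSUM counter is non-zero."""
    hits = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank line, or a non-table line
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            hits.append({
                "device": match.group(1),
                "state": match.group(2),
                "read": read,
                "write": write,
                "cksum": cksum,
            })
    return hits


hits = devices_with_errors(SAMPLE)
for hit in hits:
    print(hit)
```

On the sample above this reports only the first gptid in raidz1-0, which is the same device named in the slow I/O warning.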
Given that TrueNAS sees the drives as VMware virtual disks, little information seems to be available from within TrueNAS. For example:
Code:
=== START OF INFORMATION SECTION ===
Vendor:               VMware
Product:              Virtual disk
Revision:             2.0
Compliance:           SPC-4
User Capacity:        1,979,120,929,792 bytes [1.97 TB]
Logical block size:   512 bytes
LU is fully provisioned
Device type:          disk
Local Time is:        Sun Jul 4 23:08:34 2021 PDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
So, from within ESXi, I have found the following. (I note that the "Uncorrectable Error Count" on drives 2, 3 and 4 looks like a potential concern?)
Code:
[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____WDC__WDS500G2B0B2D00YS70_________________21031X803085________
Parameter                  Value  Threshold  Worst  Raw
-------------------------  -----  ---------  -----  ---
Health Status              OK     N/A        N/A    N/A
Media Wearout Indicator    251    0          N/A    251
Power-on Hours             139    0          N/A    139
Power Cycle Count          19     0          N/A    19
Reallocated Sector Count   0      0          N/A    0
Drive Temperature          65     0          39     35
Write Sectors TOT Count    34     0          N/A    34
Read Sectors TOT Count     92     0          N/A    92
Program Fail Count         0      0          N/A    0
Erase Fail Count           0      0          N/A    0
Uncorrectable Error Count  0      0          N/A    0

[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DL0032D9VT166__________________________________5YD3A2RK
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   114    6          99     208
Power-on Hours                     96     0          96     229
Power Cycle Count                  95     20         95     224
Reallocated Sector Count           100    36         100    0
Drive Temperature                  29     0          42     29
Write Sectors TOT Count            100    0          253    71
Read Sectors TOT Count             100    0          253    233
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0

[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DM0082D2FR102__________________________________ZFL38P0N
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   100    6          64     118
Power-on Hours                     100    0          100    216
Power Cycle Count                  100    20         100    5
Reallocated Sector Count           100    10         100    0
Drive Temperature                  31     0          40     31
Write Sectors TOT Count            100    0          253    12
Read Sectors TOT Count             100    0          253    177
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0

[root@ESXi_Server:~] esxcli storage core device smart get -d t10.ATA_____ST2000DM0082D2FR102__________________________________ZFL38NDS
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   81     6          64     64
Power-on Hours                     100    0          100    216
Power Cycle Count                  100    20         100    5
Reallocated Sector Count           100    10         100    0
Drive Temperature                  31     0          40     31
Write Sectors TOT Count            100    0          253    28
Read Sectors TOT Count             100    0          253    151
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0
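To separate the columns in my own head, I put together a small Python sketch that parses one of these tables. If I understand SMART correctly, the Value column is a normalized score that is compared against Threshold, while Raw holds the vendor-specific count, so the sketch flags a parameter only when Value is at or below a non-zero Threshold and prints the Raw figure for the error-related rows separately. The helper names (`parse_smart`, `at_or_below_threshold`) and the choice of "error-related" parameters are my own assumptions, not anything provided by esxcli.

```python
# Sample copied from the esxcli output for the drive ending in 5YD3A2RK.
SAMPLE = """\
Parameter                          Value  Threshold  Worst  Raw
---------------------------------  -----  ---------  -----  ---
Health Status                      OK     N/A        N/A    N/A
Read Error Count                   114    6          99     208
Power-on Hours                     96     0          96     229
Power Cycle Count                  95     20         95     224
Reallocated Sector Count           100    36         100    0
Drive Temperature                  29     0          42     29
Write Sectors TOT Count            100    0          253    71
Read Sectors TOT Count             100    0          253    233
Initial Bad Block Count            100    0          100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0
Uncorrectable Sector Count         100    0          100    0
"""

# My own pick of rows worth watching; this list is an assumption.
ERROR_PARAMETERS = (
    "Reallocated Sector Count",
    "Uncorrectable Error Count",
    "Pending Sector Reallocation Count",
    "Uncorrectable Sector Count",
)


def to_int(field):
    """Return the field as an int, or None for entries such as 'OK' or 'N/A'."""
    return int(field) if field.isdigit() else None


def parse_smart(text):
    """Map each parameter name to its value, threshold, worst and raw fields."""
    rows = {}
    for line in text.splitlines():
        # The parameter name may contain spaces, so split from the right.
        parts = line.rsplit(None, 4)
        if len(parts) != 5:
            continue
        name = parts[0].strip()
        if name == "Parameter" or set(name) <= {"-"}:
            continue  # header or separator row
        value, threshold, worst, raw = (to_int(p) for p in parts[1:])
        rows[name] = {"value": value, "threshold": threshold,
                      "worst": worst, "raw": raw}
    return rows


def at_or_below_threshold(rows):
    """List parameters whose normalized value has reached a non-zero threshold."""
    return [name for name, r in rows.items()
            if r["value"] is not None and r["threshold"]
            and r["value"] <= r["threshold"]]


rows = parse_smart(SAMPLE)
print("At or below threshold:", at_or_below_threshold(rows))
for name in ERROR_PARAMETERS:
    print(name, "raw =", rows[name]["raw"])
```

Read this way, the 100 I noted under "Uncorrectable Error Count" sits in the Value column and the Raw column for that row is 0, but I would appreciate confirmation that this interpretation is right.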
So, what am I looking at here? Was there some kind of issue and the system has recovered OK? Do I just need to run zpool clear as suggested? Or is there something more sinister / problematic? Currently, TrueNAS is running, but its services (NFS and NextCloud) are non-functional.
Your patience and assistance would be greatly appreciated.
Regards.