[SOLVED] HP hardware or hard drive issue?

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
Over the last few months my TrueNAS box has had two unscheduled shutdowns due to some hardware issue.

I couldn't really work out what happened, but I got the following alert:

* Pool NAS state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
After each event I ran a scrub of my pool. The first time the pool came back with a HEALTHY status; this time seems to be different.

The scrub process finished and my pool now shows an ONLINE (Unhealthy) status.

If I check the Pool Status page, I get:
SCRUB

Status: FINISHED

Errors: 3

Date: 2023-06-12 19:15:23

The pool status table for my MIRROR pool shows:
Name     Read  Write  Checksum  Status
Mirror   0     0      0         ONLINE
  ada0   0     0      2380      ONLINE
  ada1   0     0      2380      ONLINE
How should I read this table? Is the checksum column a running counter, or the actual number of unrecoverable data sectors?

My email report returned this summary:
########## ZPool status report for NAS ##########


pool: NAS
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 23:22:53 with 3 errors on Tue Jun 13 18:38:16 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	NAS                                             ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/e81e097b-53ba-11e8-a3a4-70106fca6c4a  ONLINE       0     0 2.32K
	    gptid/e91c8486-53ba-11e8-a3a4-70106fca6c4a  ONLINE       0     0 2.32K

errors: Permanent errors have been detected in the following files:

<0x99f>:<0x866c>
<0x99f>:<0x88a8>
<0x99f>:<0x88ef>
/mnt/NAS/iocage/jails/nextcloud/root/var/db/elasticsearch/nodes/0/indices/PiMxmfHoR0quyS28KggRtQ/0/index/_30l.fdt



########## ZPool status report for freenas-boot ##########


pool: freenas-boot
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:18 with 0 errors on Wed May 31 03:45:18 2023
config:

	NAME          STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  ada2p2      ONLINE       0     0     0

errors: No known data errors





########## SMART status report for ada0 drive (Western Digital Red: WD-WCC7K0HVHCJT) ##########
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 176 152 021 Pre-fail Always - 6158
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 132
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 040 040 000 Old_age Always - 44378
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 114
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 56
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 3453
194 Temperature_Celsius 0x0022 117 103 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

No Errors Logged

Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 43680 -
Short offline Completed without error 00% 44299 -





########## SMART status report for ada1 drive (Western Digital Red: WD-WCC7K5NYYJFV) ##########
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 174 158 021 Pre-fail Always - 6300
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 126
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 040 040 000 Old_age Always - 44377
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 114
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 49
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 3730
194 Temperature_Celsius 0x0022 118 103 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

No Errors Logged

Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 43680 -
Short offline Completed without error 00% 44297 -

What do you suggest I do? I am running TrueNAS on an HP Microserver Gen10 with two 4 TB WD Red hard drives.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
gptid/e81e097b-53ba-11e8-a3a4-70106fca6c4a ONLINE 0 0 2.32K
gptid/e91c8486-53ba-11e8-a3a4-70106fca6c4a ONLINE 0 0 2.32K

You have 2.3 thousand checksum errors - that's quite a lot.
You also have metadata errors, which you won't be able to fix - do you have a backup?

The WD Red drives - what model are they, specifically?
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
Hello @NugentS, I am using two WD40EFRX-68N32N0 hard drives.

Are the 2.3k checksum errors RECOVERED sectors or LOST data?

I guess the three metadata entries you are referring to are these:

<0x99f>:<0x866c>
<0x99f>:<0x88a8>
<0x99f>:<0x88ef>
/mnt/NAS/iocage/jails/nextcloud/root/var/db/elasticsearch/nodes/0/indices/PiMxmfHoR0quyS28KggRtQ/0/index/_30l.fdt

I can understand the last one, but what do those hex codes mean?

I can't work out whether I should focus on my drives or on the rest of my hardware. As far as I understand, both drives show the same number of errors, so this could depend on some external hardware failing; am I right?

I don't have a 1:1 backup of the pool's datasets, but I do have (several) backups of my data (not of the applications I run).
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Checksum errors are often, but not always, cabling issues.
What are the HDDs connected to? The motherboard, or some sort of controller?

The first 3 entries are metadata - as far as I am aware, the only solution is to back up, trash the pool and re-create it. However, you need to fix the checksum error issue first.
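Roughly, that backup / rebuild route looks like this from the shell (just a sketch - "backup" is a placeholder pool, the disk names are placeholders, and on TrueNAS you would normally re-create the pool from the web UI rather than with zpool create):
Code:
        # take a recursive snapshot of the whole pool and replicate it to a spare pool
        zfs snapshot -r NAS@migrate
        zfs send -R NAS@migrate | zfs recv -F backup/NAS

        # once the copy is verified: destroy the pool, re-create it, restore the data
        zpool destroy NAS
        zpool create NAS mirror <disk1> <disk2>   # or re-create it from the GUI
        zfs send -R backup/NAS@migrate | zfs recv -F NAS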
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
The disks are connected directly to the motherboard using an HPE proprietary cable (as you can see here).

I guess that metadata is related to the pool itself, not to my files. Am I right?

If I back up the pool, wouldn't I copy the metadata too? Do you think it's safer to back up only the data rather than using the external snapshot copy/sync function?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
How old is this server - can you get new cables?
Yes - copy the data.
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
My NAS is ~5 years old. It's an HPE Microserver Gen10 and I am using 2 bays out of 4. Do you think I could try to move the two disks to the 2 unused bays? What would happen to TrueNAS / my pool?

I will try to identify the cable, maybe it's something similar to this.

Moving the files will be easy; it will be more stressful to move the apps I am running (Nextcloud, Mosquitto, HomeAssistant, NodeRed, Plex, Transmission...).
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
TrueNAS won't have an issue with you moving the disks. And it's worth trying, to see whether the problem continues and the checksum errors keep growing.
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
OK, I will try to move the disks. Do you think there is any way to avoid rebuilding a new pool from scratch? If the cabling change solves the checksum errors, what's next for my pool and the metadata errors I had? I guess moving the disks will not fix the "unhealthy" status.

Another question: I just visited the storage/pools/status/1 page and the checksum errors are now 62 for both disks (I did a reboot, but did not change anything at the hardware level). What does that mean? I was expecting the checksum count to grow from the previous value (2380). What is that number? How/when is it reset?

Thank you!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
To the best of my knowledge it's reset when you run a zpool clear.
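If you want to try it, it is just (pool name taken from your report above):
Code:
        # reset the read/write/checksum counters and the logged error list
        zpool clear NAS
        # then keep checking whether the counters start climbing again
        zpool status -v NAS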
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
Does it make sense to reset it? What about the checksum counter changing value? What do you suggest I do with my pool?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Again (to my knowledge) there is no harm in running a zpool clear - it just resets the existing error conditions. Just keep a very close eye on the checksum numbers; if they increase, you haven't fixed the issue.
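A quick way to watch just those columns, if it helps (a one-liner sketch - adjust the grep pattern to your device names):
Code:
        # show only the header row and the two mirror members from the status output
        zpool status NAS | grep -E 'CKSUM|ada0|ada1'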
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
OK, since I moved the disks to the empty bays (I was using 2 out of 4), the checksum error counter has stayed at 0.

I guess something in the cabling (or the connector?) could be defective and caused the issue.

What should I do next? Do you think it is safe to clear the zpool errors and keep going? Or are those metadata errors likely to cause issues for the system? Will that metadata be rebuilt whenever needed, or what? I would like to avoid reinstalling and re-configuring all the services (in different jails) I am running. Any suggestions?

Thank you!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
1. Replace the cable
2. Back up your data, destroy the pool, restore your data.

You cannot fix those metadata issues without destroying the pool

Actually - I would suggest buying a couple of suitably sized SSDs and putting them in the spare slots. Migrate the jails / apps to the SSDs.
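One way to move them on CORE is iocage export/import, roughly (a sketch - "ssd" is a placeholder pool name, and the exported archive has to be copied into the images directory of whichever pool iocage is active on before importing):
Code:
        # on the current pool: export the jail to an archive under .../iocage/images
        iocage export nextcloud
        # activate the new SSD pool, copy the archive across, then import the jail there
        iocage activate ssd
        iocage import nextcloud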
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
So you want me to build a new pool on the SSDs, next to the existing one, in order to migrate the jails? Wouldn't this move the broken metadata too?

Moving the jails would solve half of the difficulty (restoring and reconfiguring the services takes a lot of time).
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
It depends on where the broken metadata is, and what it references. You are going to have to trash your pool anyway.

But first - replace the cable
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
I am bumping this topic again. During the summer I had no time to deal with the issue. As discussed before, I moved the two disks to the 2 spare slots, and since then things have gone... sideways.

I noticed the issue is still there: I got some checksum errors (as of now the system is reporting two), and for the last month my weekly report has shown these two permanent errors:
Code:
errors: Permanent errors have been detected in the following files:
        <0x1798d>:<0x88a8>
        <0x1798d>:<0x88ef>

The scrub routine on the pool runs on a monthly schedule and I never tried to clear the zpool errors. I wished to find the time to dig deep into ZFS metadata debugging (I am still fascinated and would like to learn how to do it) and wanted to monitor whether anything changed over time.

The last report (and the last scrub) showed something different: those two metadata errors disappeared, but these two were reported:
Code:
errors: Permanent errors have been detected in the following files:

        NAS/iocage/jails/nextcloud/root@auto-20231019.003000-8d:/tmp/elasticsearch-11816373281817258052/geoip-databases/nigLZd0OS0aokgTpQgG-0Q/GeoLite2-City.mmdb
        NAS/iocage/jails/nextcloud/root@auto-20231019.003000-8d:/tmp/elasticsearch-11816373281817258052/geoip-databases/nigLZd0OS0aokgTpQgG-0Q/GeoLite2-Country.mmdb

I guess those are files inside a snapshot, so I decided to delete that specific snapshot. Right now I am running a scrub once again and I expect to see no issues on the pool.
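In command-line terms, deleting that snapshot and re-checking boils down to something like this (snapshot name taken from the report above):
Code:
        # remove the snapshot that references the damaged files, then let a scrub re-check the pool
        zfs destroy NAS/iocage/jails/nextcloud/root@auto-20231019.003000-8d
        zpool scrub NAS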

How should I read this new scenario? Is the pool clean and safe again? What about the metadata errors that were not reported during the last scrub? Is the error information just lost from the reports, or does it mean that metadata has been deleted? (As far as I understood, these errors cannot be fixed.)

It depends on where the broken metadata is, and what it references. You are going to have to trash your pool anyway.

But first - replace the cable

One side question: if I proceed with migrating the jails, how can I be sure I am not copying the metadata errors too? Is there any way to find out? I would like to do it only if I am sure I won't need to... start from scratch all over again.
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
Brief update. As I wrote, I ran another pool scrub and some metadata errors were found once again:

Code:
errors: Permanent errors have been detected in the following files:

        <0x1c16>:<0x88a8>
        <0x1c16>:<0x88ef>

I guess I have no luck with this pool. I will have to destroy it. I am still unsure whether I can copy a dataset/jail to another pool without copying the metadata errors.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I have no idea I am afraid.
 

pixel8383

Explorer
Joined
Apr 19, 2018
Messages
80
A final update: I managed to destroy the pool, migrating the data first. I discovered (using zfs send) that one jail had some issues (the data stream failed several times during the transfer), so I ended up deleting that jail and its dataset and re-creating them before migrating all the data. I then imported everything back into the rebuilt pool. I will be monitoring over the next weeks to see whether any errors occur again.
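In case anyone hits the same stream failures, resumable send/receive is worth knowing about (a sketch - the dataset and pool names here are examples, not the exact ones I used):
Code:
        # receive with -s so an interrupted stream leaves a resume token on the target dataset
        zfs send NAS/iocage/jails/nextcloud/root@migrate | zfs recv -s newpool/nextcloud
        # after a failure, read the token and resume the transfer from where it stopped
        zfs get -H -o value receive_resume_token newpool/nextcloud
        zfs send -t <token> | zfs recv -s newpool/nextcloud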
 