Random Resilver @ Scrub + Reallocated_Sector_Ct of 18

bicycle_wreck · Aug 2, 2020

Hello,

Currently running FreeNAS-11.3-U1. I have a primary pool consisting of two 8 TB WD Red drives in a ZFS mirror. Weekly scrubs on the primary pool. Primary pool is replicated on a third (single) WD Red drive as a backup.

This morning I received an email alert:

New alerts:
* Pool primary state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state..

Current alerts:
* Scrub of pool 'backup' finished.
* Scrub of pool 'primary' finished.
* Scrub of pool 'primary' started.
* Scrub of pool 'backup' started.
* A system update is available. Go to System -> Update to download and apply the update.
* Replication "primary -> root@localhost:backup/" succeeded.
* Pool primary state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state..

Shortly after, within ten minutes, I received a follow-up email alert:

The following alert has been cleared:
* Pool primary state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state..

Current alerts:
* Scrub of pool 'backup' finished.
* Scrub of pool 'primary' finished.
* Scrub of pool 'primary' started.
* Scrub of pool 'backup' started.
* A system update is available. Go to System -> Update to download and apply the update.
* Replication "primary -> root@localhost:backup/" succeeded.

Checking my zpool status shows me that the other pools were scrubbed with zero errors, but the primary pool resilvered around 8 GB of data:

Code:

  pool: primary
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(7) for details.
  scan: resilvered 8.10G in 0 days 00:05:28 with 0 errors on Sun Aug  2 06:11:23 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        primary                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/6248a972-5454-11e7-9cea-002590f01e26  ONLINE       0     0     0
            gptid/62dac90d-5454-11e7-9cea-002590f01e26  ONLINE       0     0     0

errors: No known data errors
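For anyone who wants to check the scan line from a script rather than by eye, here is a minimal Python sketch. It assumes the exact "scan:" wording printed by this pool (other ZFS releases phrase the line differently), and the names `SCAN`, `UNIT` and `parse_scan` are made up for illustration. It only parses text; it does not call zpool itself.

```python
import re

# Matches the "scan:" line as printed by this FreeNAS 11.3 pool, e.g.
#   scan: resilvered 8.10G in 0 days 00:05:28 with 0 errors on Sun Aug  2 06:11:23 2020
# Assumption: newer ZFS releases word this line differently and will not match.
SCAN = re.compile(
    r"scan:\s+(?P<kind>resilvered|scrub repaired)\s+(?P<size>[\d.]+)(?P<unit>[BKMGT]?)\s+"
    r"in\s+(?P<days>\d+)\s+days\s+(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+"
    r"with\s+(?P<errors>\d+)\s+errors"
)

# zpool reports sizes in binary (power-of-1024) units.
UNIT = {"": 1, "B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_scan(line):
    """Return kind, bytes, seconds and error count from a scan line, or None."""
    m = SCAN.search(line)
    if m is None:
        return None
    seconds = (int(m["days"]) * 86400 + int(m["h"]) * 3600
               + int(m["m"]) * 60 + int(m["s"]))
    return {
        "kind": "resilver" if m["kind"] == "resilvered" else "scrub",
        "bytes": float(m["size"]) * UNIT[m["unit"]],
        "seconds": seconds,
        "errors": int(m["errors"]),
    }


if __name__ == "__main__":
    line = "scan: resilvered 8.10G in 0 days 00:05:28 with 0 errors on Sun Aug  2 06:11:23 2020"
    info = parse_scan(line)
    rate_mib_s = info["bytes"] / info["seconds"] / 1024**2
    print(info["kind"], info["seconds"], "s", f"{rate_mib_s:.1f} MiB/s")
```

On the line above this works out to 328 seconds and roughly 25 MiB/s, which fits a short, partial resilver rather than a full rebuild of an 8 TB disk.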

I've had this configuration (with some software updates here and there) for about four years, but this is the first time a scrub has ever resilvered anything, so I decided to run a SMART (short) test on each drive in the primary mirror.

ada0:

Code:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       112
  3 Spin_Up_Time            0x0007   146   146   024    Pre-fail  Always       -       451 (Average 453)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       18
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27450
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       55
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1191
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1191
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 20/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

ada1:

Code:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   131   131   054    Pre-fail  Offline      -       116
  3 Spin_Up_Time            0x0007   147   147   024    Pre-fail  Always       -       446 (Average 448)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       20
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27450
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1188
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1188
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 20/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       20
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

So, each of the primary drives has a Reallocated_Sector_Ct above zero (18 for ada0 and 20 for ada1). I originally tested both drives before install with badblocks and then smartctl long, and those counts were zero. The Reallocated_Sector_Ct for the backup WD Red is 0.
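If you want to track these counts over time instead of reading the table by hand, a minimal Python sketch along these lines could pull the raw values out of saved `smartctl -A` output. The names `ROW`, `WATCHED`, `parse_smart_attributes` and `nonzero_watched` are hypothetical, the choice of watched attributes is my own assumption, and the sample rows are copied from the ada0 table above.

```python
import re

# Matches one row of the smartctl attribute table, e.g.
#   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       18
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Assumption: these are the attributes worth flagging when their raw count is above zero.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: leading integer of RAW_VALUE} for each table row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group("name")] = int(m.group("raw"))
    return attrs


def nonzero_watched(attrs):
    """Return only the watched attributes whose raw count is above zero."""
    return {k: v for k, v in attrs.items() if k in WATCHED and v > 0}


SAMPLE_ADA0 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       18
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27450
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 20/39)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE_ADA0)))
```

Note that the normalized VALUE column is still 100 against a threshold of 5 on both drives, so the firmware does not consider them failing; the script looks at the raw count because that is the number that moved off zero.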

In the past four years, I suppose, the counts have increased. Given the recent (and first) resilver and these error counts, I'm thinking about ordering a couple more WD Reds and building a new pool. Is this an overreaction? Are there any other diagnostics I should run (e.g., smartctl long)?

Thank you.

sretalla · Aug 2, 2020

As reallocated sectors increase from 0, you should start to distrust the drive(s).

Your data is still safe and you have a reasonable number of copies of it, but I personally wouldn't feel comfortable leaving those disks in place knowing they are on their way to failure.

At just over 3 years, that's a little soon for both to be failing without some kind of environmental contributor... the temperatures look good... do you have power issues where you are? Is your server powered by a UPS? Does the server get "bumped" or is it in a place where physical movement is possible?
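The "just over 3 years" figure can be checked against the Power_On_Hours raw value of 27450 that both drives report. A quick Python sketch, with `power_on_years` being a made-up helper name and 365.25 days per year as the assumed average:

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap years


def power_on_years(hours):
    """Convert a Power_On_Hours raw value into years of powered-on time."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 27450 is the Power_On_Hours raw value shown for both ada0 and ada1.
    print(f"{power_on_years(27450):.2f} years")
```

That comes to about 3.13 years of powered-on time.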

bicycle_wreck · Aug 2, 2020

Thank you so much for the reply.

The system is on a UPS.

The server is way too close to a 500 W RMS subwoofer (within 12 inches). I'm moving it today; I only thought of this after the fact.

bicycle_wreck · Aug 2, 2020

Ordered two new 8 TB EasyStore drives from Best Buy for $139 each. Same thing I did in 2017, but I guess I'll probably end up with white-label drives this time. Oh well. Will run badblocks for burn-in and report back here.

Also ordered another sine wave UPS so I can put this thing far, far away from my stereo equipment.

bicycle_wreck · Aug 13, 2020

The shucked 8 TB EasyStores from Best Buy turned out to contain white-label drives (both were WD80EDAZ). Both disks completed a badblocks burn-in and passed the SMART long tests.

Wall time for badblocks (running in parallel on both 8 TB drives, one tmux session each) was ~7 days. I've seen this asked in several places, so there you go.
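To put the ~7 days in perspective, here is a rough Python calculation of the average throughput it implies. It assumes the run used the destructive write mode (`badblocks -w`), which writes four patterns and reads each one back, and it treats 8 TB as 8e12 bytes; `badblocks_avg_throughput` is a hypothetical helper name.

```python
def badblocks_avg_throughput(capacity_bytes, wall_seconds, patterns=4):
    """Average bytes/second implied by a destructive badblocks run.

    Each pattern is written across the whole disk and then read back,
    so the drive moves 2 * patterns * capacity bytes in total.
    """
    total_bytes = 2 * patterns * capacity_bytes
    return total_bytes / wall_seconds


if __name__ == "__main__":
    capacity = 8e12              # assumed: 8 TB drive, decimal terabytes
    wall = 7 * 24 * 3600         # ~7 days, as reported above
    rate = badblocks_avg_throughput(capacity, wall)
    print(f"{rate / 1e6:.0f} MB/s average per drive")
```

Under those assumptions the drives averaged a little over 100 MB/s each, so a week is about what an 8 TB disk should take, not a sign of a slow controller.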

I've since offlined the first WD Red (ada0), swapped the SATA cable to the new WD White drive, and added it back to the pool. It is resilvering now. Will repeat the process for the second drive once this one is done.
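For reference, the offline / swap / resilver sequence maps roughly onto the ZFS commands below. This is only a sketch that builds and prints the commands without running anything. On FreeNAS the web UI procedure in the linked documentation is the supported route, since it also partitions the new disk. `replacement_plan` is a hypothetical helper and the new device name is a placeholder, not a real gptid.

```python
def replacement_plan(pool, old_dev, new_dev):
    """Return the zpool commands for swapping one mirror member, in order.

    Nothing is executed; the caller decides what to do with the list.
    """
    return [
        # Take the old disk out of service; the mirror keeps running on the other member.
        ["zpool", "offline", pool, old_dev],
        # (Physical step here: move the SATA cable to the new disk.)
        # Attach the new disk in place of the old one; this starts the resilver.
        ["zpool", "replace", pool, old_dev, new_dev],
        # Watch resilver progress.
        ["zpool", "status", pool],
    ]


if __name__ == "__main__":
    plan = replacement_plan(
        "primary",
        "gptid/6248a972-5454-11e7-9cea-002590f01e26",
        "gptid/NEW-DISK-PLACEHOLDER",  # placeholder, not a real identifier
    )
    for cmd in plan:
        print(" ".join(cmd))
```

Doing one drive at a time, as described, means the pool always has at least one fully populated member while the other resilvers.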

I used the following resources during this process.

Shucking the EasyStore shells: https://www.youtube.com/watch?v=b6VCQ64DkfM
Testing new drives: https://www.ixsystems.com/community/resources/hard-drive-burn-in-testing.92/
Replacing old drives in pool: https://www.ixsystems.com/documentation/freenas/11.3-U1/storage.html#replacing-a-failed-disk

I guess I'll put the old WD Reds in my desktop machine as scratch disks or something. I'm open to other ideas, but I don't think they're trustworthy enough to serve as hot (or cold) spares.
