Understanding Drive Error Messages

Joined
Oct 18, 2018
Messages
969
Hi folks,

First, my build.
FreeNAS Release: FreeNAS-11.2-RELEASE-U2
Board: X11SSM-F, i3-7100, 32GB ECC RAM, LGA 1151 socket, IPMI, 2x GbE Intel i210-AT
HBA: LSI/Broadcom SAS9207-8i, Firmware 20.00.07.00
Storage Pool 1: 1 vdev of 6 x 3 TB drives in RAIDZ2
Storage Pool 2: 1 vdev of 6 x 2 TB drives in RAIDZ2

I got the following alert for one of my drives from my email report.

Code:
New alerts:
* Device: /dev/da3 [SAT], 1 Currently unreadable (pending) sectors
* Device: /dev/da3 [SAT], 1 Offline uncorrectable sectors

Alerts:
* Device: /dev/da3 [SAT], 1 Currently unreadable (pending) sectors
* Device: /dev/da3 [SAT], 1 Offline uncorrectable sectors


The full smartctl output for the drive is in the file attached to this post.
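(For anyone following along, I pulled that output with something like the command below; da3 is just the name my system gave this disk, and depending on the controller you may need an extra -d option.)

Code:
smartctl -a /dev/da3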

The pieces that I _think_ are most salient are:

Code:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1

Code:
Error 21 occurred at disk power-on lifetime: 848 hours (35 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 65 79 6f 01  Error: UNC at LBA = 0x016f7965 = 24082789

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 d0 30 85 6f 41 00   1d+13:53:18.405  READ FPDMA QUEUED
  60 00 00 30 84 6f 41 00   1d+13:53:18.405  READ FPDMA QUEUED
  60 00 00 30 83 6f 41 00   1d+13:53:18.405  READ FPDMA QUEUED
  60 00 00 30 82 6f 41 00   1d+13:53:18.405  READ FPDMA QUEUED
  60 00 00 30 81 6f 41 00   1d+13:53:18.405  READ FPDMA QUEUED


When I first saw the errors I checked the Checksum column under storage->pools->{pool}->status for that drive and saw that it was the only non-zero value listed. I ran another scrub, and checking the value again now shows 0s for the checksums across all disks in that pool.
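(For anyone who prefers the shell to the UI, the same per-disk read/write/checksum counters show up with something like the following; "tank" here is just a stand-in for the pool name.)

Code:
zpool status -v tank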

The latest scrub status reports as

Code:
SCRUB

Status: FINISHED

Errors: 0



When it first started happening I was receiving the email alert regularly. Now I have not received the alert via email or the UI in 5 days. I haven't pulled the drive yet as I was in the process of finishing the burn-in on a replacement drive. The pool lists as HEALTHY and the drive lists as online. The drive is in a 6-disk RAIDZ2 vdev.

So, what's the deal? I imagine I should replace the drive anyway rather than risk it. Seagate is willing to replace the drive with a used, refurbished drive. It has very few power-on hours, though, so it is a shame to replace it with a used drive. Why have the alerts gone away and the checksum returned to 0? I did upgrade to U2 within the last few days, in case that is relevant.
 

Attachments

  • failure.txt
    10.8 KB
Joined
May 10, 2017
Messages
838
Run an extended SMART test. If it doesn't fail, the disk is good for now, but pending sectors on a disk with so few hours is not a very good sign.
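Something along these lines should kick off the extended test and let you check the result once it completes (da3 being whatever name the system gives the disk):

Code:
smartctl -t long /dev/da3        # start the extended (long) self-test
smartctl -l selftest /dev/da3    # view the self-test log once it finishes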
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I got the following alert for one of my drives from my email report.
Code:
New alerts:
* Device: /dev/da3 [SAT], 1 Currently unreadable (pending) sectors
* Device: /dev/da3 [SAT], 1 Offline uncorrectable sectors
That defective sector was taken offline by the drive. The checksum error that you initially saw was probably because of the missing data caused by the drive taking the sector offline.
When I first saw the errors I checked the Checksum column under storage->pools->{pool}->status for that drive and saw that it was the only non-zero value listed. I ran another scrub, and checking the value again now shows 0s for the checksums across all disks in that pool.
That means that ZFS is working the way it should and corrected the error from parity data.
So, what's the deal?
The drive has a bad spot that might lead to more bad spots later. I would go ahead and get it replaced while it is under warranty, but it isn't a hot rush.
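If you want to keep an eye on it from the shell, something like this shows whether that pending sector ever gets reallocated and whether ZFS logs any new errors (the device and pool names are just examples; also keep in mind that, as far as I know, the per-device error counters are reset by a reboot or a zpool clear, which may be part of why your checksum column went back to 0):

Code:
smartctl -A /dev/da3 | egrep "Reallocated_Sector|Current_Pending|Offline_Uncorrectable"
zpool status -v tank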
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Replace the drive under warranty if you can. You get a new drive out of the deal.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Seagate is willing to replace the drive with a used, refurbished drive.
Note: it is always best to test the drives you get from them under warranty, because ANY drive can be bad right out of the gate, but the repaired drives they send out are often just as good as a brand new drive. You shouldn't think of it as being a 'used' drive. Yes, it previously failed and was sent back to Seagate, but they repaired it to 'new' condition.
All the drive vendors do that. Sometimes they will even send you an actual new drive if they don't have one like the one you sent in. I once sent a 2TB drive in for warranty and got a 3TB drive back. Equal or better is the policy they appear to follow. I have only dealt with Western Digital and Seagate for warranty, because all the HGST drives I have ever had to deal with were OEM drives and went back to the equipment vendor, so I never had to deal with HGST directly.
 
Joined
Oct 18, 2018
Messages
969
As I have a spare drive of the appropriate size, I opted to replace the disk with the bad sector.

I followed the instructions outlined in the manual. Specifically, I navigated to the pool in question to offline the disk: pool -> status -> failing disk -> offline. Once the disk showed as OFFLINE, I turned the system off, replaced the drive, and turned it back on.
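(For the record, I understand the shell equivalent of the offline step is roughly the line below, though I used the UI; the pool name and disk label here are only placeholders and would have to match whatever zpool status shows for that disk.)

Code:
# placeholders: "tank" = pool name, the gptid label = the failing member as shown by "zpool status"
zpool offline tank gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx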

The pool now lists as "DEGRADED", which I expected. I then tried to replace the disk by following section 9.5.1.3:

After the disk is replaced and is showing as OFFLINE, click (Options) on the disk again and then Replace. Select the replacement disk from the drop-down menu and click the REPLACE DISK button. After clicking the REPLACE DISK button, the pool begins resilvering.
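(As an aside, I gather the shell equivalent of that replace step is roughly the line below, though I stuck with the UI; all the names are placeholders.)

Code:
# placeholders: pool name, the offlined member as shown by "zpool status", then the new disk
zpool replace tank gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx da6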


The list of drives in the pool has changed rather dramatically. Where previously there were drives da[0-5]p2, with da3p2 being the failing drive (as identified by its SN matching the da3 drive in pools->disks), I now see da0p2, da1p2, da3p2, /dev/gptid/6baa8a5e-2f8d-11e9-89f4-ac1f6b855b2c, . . . I did not expect to still see da3p2, as that was the drive I had pulled. If I then click "Replace" on the drive with the long gptid name, I am unable to select any drive from the dropdown.
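(Side note in case it helps anyone else: the long gptid names can be mapped back to the daX devices from the shell, which made the list easier to read for me.)

Code:
glabel status        # lists gptid/... labels next to the daXpY partitions they refer to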

Am I selecting the wrong disk to replace? Or perhaps I am missing a step somewhere from the guide?

Edit: Updated for clarity
 
Last edited:
Joined
Oct 18, 2018
Messages
969
Update: I put the old drive back in, onlined it, etc., to see if I could start the process over, in the event that I had made a mistake the first pass through. I was able to get the old drive back in and online, and the pool back to HEALTHY.

I then followed the instructions in the manual once more, except this time I hot-swapped the drives. Everything was exactly as before: when I tried to replace the drive, the dropdown was empty. However, after a few minutes I tried the dropdown again and the new drive showed up. My system is now happily resilvering the drive.
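(Resilver progress can also be watched from the shell; "tank" is a stand-in for the pool name.)

Code:
zpool status tank        # shows "resilver in progress", percent done, and an estimated finish time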

I'm not sure what happened the first time around. It was likely my mistake, but perhaps it was a UI bug? I am using the latest 11.2-U2 version.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I then followed the instructions in the manual once more, except this time I hot-swapped the drives. Everything was exactly as before: when I tried to replace the drive, the dropdown was empty. However, after a few minutes I tried the dropdown again and the new drive showed up. My system is now happily resilvering the drive.
If the system is able to do hot-swap, that is probably the best choice. The reason that shutdown is part of the process in the manual is that many users of FreeNAS don't invest in hot-swap drive bays.
I'm not sure what happened the first time around. It was likely my mistake, but perhaps it was a UI bug? I am using the latest 11.2-U2 version.
I wouldn't be entirely shocked if there was a little flaw in the new UI. If you want to open a ticket, it might be worth having the developers look at it.

I am happy it is working for you now.
 
Joined
Oct 18, 2018
Messages
969
If the system is able to do hot-swap, that is probably the best choice.
Yeah, I thought the reboot was safer, but when that didn't work I figured I'd try the hot-swap. I wish I had tried it once more without the hot-swap just to better isolate the issue.
I wouldn't be entirely shocked if there was a little flaw in the new UI. If you want to open a ticket, it might be worth having the developers look at it.

I am happy it is working for you now.
I'm happy too! I have backups, so the whole thing was more curious than stressful. I'll file a ticket with as much detail as I can, and hopefully, if it is a bug, someone will find it. Otherwise, I'll assume that I somehow made a silly mistake.
 
Joined
Oct 18, 2018
Messages
969
Quick update in case folks experience the same issue: I've filed a bug report, and it looks like it has been slated for the 11.2-U3 release.
 