I'm getting a DEGRADED error message but all my drives are fine.


Joe Goldthwaite

Dabbler
Joined
Jan 12, 2016
Messages
38
This has happened twice now. I'm getting this error:

CRITICAL: July 3, 2016, 7:02 a.m. - The volume mediapool (ZFS) state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

When I look at the storage, all the drives are happily working away. I checked the SMART status and they all pass. How do I find out which drive or drives generated the error?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
"zpool status -v" will almost certainly show you which drive(s) had the errors. It is quite possible for drives to have errors and for it to not show up on SMART, which is one of the strenghts of ZFS. You can count on this: SOMETHING is wrong.

If you could pastebin the output of smartctl -x /dev/ada0 (replacing ada0 with each drive in your pool) and let us have a look at it, we can assess it more thoroughly. Note: use "-x" instead of Trevellyan's "-a". -x gives more information, including information about how you've been maintaining your drives and what environment they've been in, which will be helpful in the diagnosis, sir.
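
Something like this sh loop should grab all of them in one go (kern.disks is the FreeBSD sysctl that lists your disk devices; adjust to taste):

Code:
  # dump the full SMART report for every disk into one file;
  # the boot device may not support SMART, which is harmless here
  for d in $(sysctl -n kern.disks); do
      echo "===== /dev/$d ====="
      smartctl -x /dev/$d
  done > /tmp/smartctl.txt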
 

Joe Goldthwaite

Dabbler
Joined
Jan 12, 2016
Messages
38
I created a file with the output of the smartctl -x command. It's pretty big, so I uploaded it instead of putting it in code tags. Which of those is preferred? Here's the zpool status -v:
Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jun 27 03:45:40 2016
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
     da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: mediapool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.93M in 17h47m with 0 errors on Sun Jul  3 17:47:12 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    mediapool                                       DEGRADED     0     0     0
     raidz2-0                                      DEGRADED     0     0     0
       gptid/baa5b1e5-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/bb933277-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/bcfa581f-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/be5b0e4b-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/bfbbfd81-2477-11e6-bcd4-d05099c0e0e3  DEGRADED     0     0   467  too many errors
       gptid/c1157de1-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c279d194-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c3dc64d3-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c544a7d8-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c6acad6c-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c80e2f9c-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/c973f3e4-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0
       gptid/cadcd300-2477-11e6-bcd4-d05099c0e0e3  ONLINE       0     0     0

errors: No known data errors



While I'm at it, how do you link the gptid to the device? Other than the order they're in, I couldn't see anything in common between the gptids and anything in the smartctl report.
 

Attachments

  • smartctl.txt (119.1 KB)

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You should be able to tease out which drive is which with "gpart list", possibly with the help of "camcontrol devlist".
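
For example, something along these lines should map the degraded gptid from your zpool output back to a device node (the ada2p2 in the comment is only what I'd expect, not verified):

Code:
  # glabel prints each gptid label next to its device node
  glabel status | grep bfbbfd81
  #   gptid/bfbbfd81-2477-11e6-...   N/A   ada2p2   <- expected, not verified

  # or dump every partition's rawuuid alongside its name
  gpart list | grep -E 'Name:|rawuuid'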

Looking at your data, sir, you did not parse this out in a way that makes it easy to see which report goes with which drive, but some general commentary:
  • These drives are of various ages. I assume they were not all bought together.
  • Many of the drives (but not all? that's weird) have not been subjected to a proper SMART test or other prophylaxis. Why are you not scheduling SMART tests, as the documentation and the forum suggest you should?
One drive stands out as bad: /dev/ada2, I believe, even though your numbering is hard to make out. You will note the following very bad signs:
  • Attribute ID#10, "Spin Retry Count", shows 65536. Any number other than 0 is concerning, but a number like 65536 is very bad and indicates hardware problems.
  • Attribute ID#199, "CRC Error Count", shows 97, which isn't that bad; "0" is the correct number, though. Almost anything can cause this, including a slightly dodgy SATA cable, so I wouldn't worry about it if it were in isolation.
  • The drive is only 4 months old; it should still be a virgin on these numbers, so that's disconcerting.
  • Then you have ICRC ABRTs on at least 4 different LBAs, at slightly different times, about, what, two months ago. That's weird, and almost always bad. It indicates either a hardware or a controller problem. There should definitely be none of these.
  • You have 795 logged resets between command acceptance and completion. The correct number there is 0; anything other than 0 is bad.
  • You have 1107 hardware resets. It's hard to read much into this, since a certain number of hardware resets is not unusual, but 1107 sounds bad to me.
So I believe that when you decipher everything, you will find that the offlined drive is /dev/ada2. *IF* that is the case, proceed with RMA'ing this drive, in my opinion. If that is *NOT* the case, then do tell: which drive is the one ZFS offlined?
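
If you want confirmation (and ammunition for the RMA), kick off a manual long self-test; these are stock smartmontools commands, assuming the culprit really is ada2:

Code:
  smartctl -t long /dev/ada2      # start a long self-test; takes hours on a big drive
  smartctl -l selftest /dev/ada2  # read the self-test log once it finishes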
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Also, I wouldn't fuck around. This is a fairly wide RAIDZ2 VDEV, and you've already dropped one drive. That leaves uncomfortably little room for further things to screw up, so I'd get on this PRONTO. You drop another drive, and you're critical. You drop TWO more drives, and you're hosed.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Also, you've set the APM to 1. i.e., you've put this on mega-miser electricity savings. Congratulations, you are probably saving approximately 30 cents per month, and killing your performance. I'd turn that shit off, personally, put it on full blast. But opinions vary.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
And also, for the record, that looks like a 13-wide RAID-Z2 VDEV.

We would not have recommended that. We don't recommend 13-wide vdevs, *AT ALL*, and certainly, we would *NEVER* recommend a RAID-Z2 that wide. That's only about 15% parity in the vdev. Way too low. That number should be around 30%. Even a RAID-Z3 would bring you only to a barely marginal 23% parity, which is why we don't recommend 13-wide vdevs.
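
The arithmetic, if you want to check my numbers (plain bc, nothing FreeNAS-specific):

Code:
  # parity fraction = parity disks / total disks in the vdev
  echo "scale=2; 2/13" | bc   # .15 -> ~15% for a 13-wide RAID-Z2
  echo "scale=2; 3/13" | bc   # .23 -> ~23% for a 13-wide RAID-Z3
  echo "scale=2; 2/7"  | bc   # .28 -> ~29% for a 7-wide RAID-Z2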
 

Joe Goldthwaite

Dabbler
Joined
Jan 12, 2016
Messages
38
DrKK = King of the "Oh and one more thing" posts.... ;)

Well, I appreciate the advice. I don't really know what I'm doing, and since this is a side project I only get a chance to work on it in fits and starts. To be fair, my criteria are probably a little different from most people's. Performance is a non-issue. There's usually only one person accessing the NAS at any point in time, and even that is fairly rare; maybe two to four hours a week. I've also got an older NAS with a full backup of everything, so I was willing to sacrifice the redundancy for the extra space. This new one isn't in production yet. I'm still just trying to get it set up.

Looking at your data, sir, you did not parse this out in a way that makes it easy to see which goes with which drive

Sorry about that. It's a lot of text that runs together. I wrote a script and tried to put a string of equals signs and the drive name at the top of each block. It looks like it missed /dev/ada0 for some reason, and it messed up on ada8, 9, and 10. I must have mistyped something in the script.

I'll swap out the failing drive tonight. I'll add an additional one and create two 7-drive RAID-Z2 vdevs. Then I'll start copying the data over again.
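
If I end up doing it from the command line instead of the GUI, I'm picturing something like this (device names are placeholders; the FreeNAS volume manager does the same thing and handles the gptid labels and swap for you):

Code:
  # rough sketch: one pool, two 7-disk RAID-Z2 vdevs (14 drives total)
  zpool create mediapool \
      raidz2 ada0 ada1 ada2 ada3 ada4 ada5 ada6 \
      raidz2 ada7 ada8 ada9 ada10 ada11 ada12 ada13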

I really appreciate you taking all the time to look over the smartctl output and offer the suggestions. Thanks!

Edit: I fixed the quote tags. I was quoting myself. Doh!
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
DrKK = King of the "Oh and one more thing" posts.... ;)
It's easier than editing posts (especially if you want to quote something!), and the mods aren't anal about this practice, like on some forums where people look at you nastily if you dare to post twice in a row, even if you're replying to completely different topics that got jumbled up in a single thread.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
If you have activity lights on your drive bays, then running a

Code:
  dd if=/dev/gptid/<gptid> of=/dev/null bs=1m

will pretty quickly tell you which drive to pull. (The gptid labels show up under /dev/gptid/; a larger block size just makes the light blink sooner.)
 