Confusing zfs status report

shawndewet · Nov 3, 2014

I got the following email from my FreeNAS server last night. It's confusing because the status mentions an unrecoverable error, but the detail says everything is online with no known data errors.
Should I be concerned?

Checking status of zfs pools:
NAME     SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
Volume1  3.62T  355G   3.28T  9%   1.00x  ONLINE  /mnt

  pool: Volume1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h28m with 0 errors on Sun Nov 2 00:28:08 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Volume1                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     1
            gptid/958c3c6f-4592-11e4-9395-38eaa7abeb2a  ONLINE       0     0     7
            gptid/49d458de-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     4
          mirror-1                                      ONLINE       0     0     0
            gptid/4a6020b7-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     5
            gptid/4aec015a-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --

DrKK · Nov 3, 2014

Ugh. You have some checksumming errors. Don't like how many drives are having the problem. I'm suspecting the possibility of a thermal issue.
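
For anyone reading along: the non-zero numbers in the CKSUM column are what trigger that status message. Here is a rough sketch for pulling just those rows out of `zpool status` output. The function name `cksum_errors` is made up for illustration, and it assumes the counters are plain integers (large counts that zpool abbreviates, such as 1.2K, would be skipped).

```shell
# Hypothetical helper: list vdevs/devices whose CKSUM counter is non-zero.
# Reads `zpool status` output on stdin and prints "name count" per match.
cksum_errors() {
    # Config rows have exactly five fields: NAME STATE READ WRITE CKSUM.
    # Requiring a purely numeric fifth field skips the header row and the
    # "errors: No known data errors" line, which also has five fields.
    awk 'NF == 5 && $5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}

# Usage (assumed): zpool status Volume1 | cksum_errors
```

Run against the report above, this should list mirror-0 and the three gptid devices that show checksum errors.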

What is your hardware? Mobo, CPU, amount/type of RAM? And, can you pastebin for us the output of "smartctl -x /dev/ada0" where "ada0" is replaced with the identifier for each of your drives? I suspect we're going to see some unpleasantries.
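
A small loop saves typing that command once per drive. This is only a sketch: `collect_smart` is a made-up name, the device names in the usage line are assumptions (substitute your own), and the `SMARTCTL` variable exists purely so the command can be swapped out for a dry run.

```shell
# Hypothetical helper: run `smartctl -x` for each named drive and print
# the reports one after another, each under a header line.
collect_smart() {
    for drive in "$@"; do
        echo "=== /dev/$drive ==="
        # SMARTCTL is an illustrative override; it defaults to the real smartctl.
        "${SMARTCTL:-smartctl}" -x "/dev/$drive"
    done
}

# Usage (assumed device names): collect_smart ada0 ada1 ada2 ada3 > smart-all.txt
```

Splitting the output into one file per drive makes pastebinning easier, but a single file is enough for a quick look.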

shawndewet · Nov 3, 2014

I'm using an HP MicroServer with 16GB ECC RAM and 4 Seagate 2TB HDDs.

Here are the requested pastebin links for each of my 4 drives:
ada0: http://pastebin.com/TCa6tyTV
ada1: http://pastebin.com/hwjJ1iD2
ada2: http://pastebin.com/VKQCaRQg
ada3: http://pastebin.com/dPwJnP0J

DrKK · Nov 3, 2014

Interesting. Unless I'm missing something, all of that looks pretty reasonable. Drives haven't even been in service that long. Thermals look good.

Only things I'm noticing:

1) There's no record of SMART Long tests. I suggest you run one. Only the SMART long test will actually inspect every writable sector.
2) The Seek Error Rate is pretty high. Attributes #1 and #7 sometimes show high numbers for a variety of reasons, and every manufacturer is different, so I don't know what to make of your high readings for #1 and #7.

But you appear to be showing several drives with checksum errors. That's not a particularly encouraging sign. Run the SMART long tests, see if we get other things cropping up.

You can run the tests on demand with:

smartctl -t long /dev/ada0

etc.

Probably will take a couple hours to complete. See if that changes our SMART readings.
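
Once the test finishes, the result shows up in the drive's self-test log, which `smartctl -l selftest /dev/ada0` prints. Below is a hedged sketch for checking the newest extended-test entry. `last_long_test` is an invented name, and the pattern assumes the usual ATA log layout, where entries look like `# 1  Extended offline  Completed without error ...` with the newest listed first.

```shell
# Hypothetical helper: report on the most recent extended (long) self-test.
# Reads `smartctl -l selftest` output on stdin and prints one word:
#   PASS  - newest extended test completed without error
#   CHECK - newest extended test has any other status (in progress, failed, ...)
#   NONE  - no extended test found in the log
last_long_test() {
    awk '/^# *[0-9]+ +Extended offline/ {
             result = ($0 ~ /Completed without error/) ? "PASS" : "CHECK"
             print result
             found = 1
             exit          # entries are newest-first, so stop at the first hit
         }
         END { if (!found) print "NONE" }'
}

# Usage (assumed): smartctl -l selftest /dev/ada0 | last_long_test
```

Anything other than PASS is a cue to read the full log entry rather than a verdict on the drive.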

Also, we do recommend that "long" SMART tests be a once-per-2-to-4-week regimen, so I suggest you add that to your FreeNAS GUI.

DrKK · Nov 3, 2014

For comparison, here's the smartctl -x output from my WD reds in perfect condition:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAGS  VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate  POSR-K  200  200  051  -  0
  3 Spin_Up_Time  POS--K  172  171  021  -  4391
  4 Start_Stop_Count  -O--CK  100  100  000  -  82
  5 Reallocated_Sector_Ct  PO--CK  200  200  140  -  0
  7 Seek_Error_Rate  -OSR-K  200  200  000  -  0
  9 Power_On_Hours  -O--CK  088  088  000  -  9058
 10 Spin_Retry_Count  -O--CK  100  253  000  -  0
 11 Calibration_Retry_Count -O--CK  100  253  000  -  0
 12 Power_Cycle_Count  -O--CK  100  100  000  -  36
192 Power-Off_Retract_Count -O--CK  200  200  000  -  13
193 Load_Cycle_Count  -O--CK  200  200  000  -  68
194 Temperature_Celsius  -O---K  123  116  000  -  24
196 Reallocated_Event_Count -O--CK  200  200  000  -  0
197 Current_Pending_Sector  -O--CK  200  200  000  -  0
198 Offline_Uncorrectable  ----CK  100  253  000  -  0
199 UDMA_CRC_Error_Count  -O--CK  200  200  000  -  0
200 Multi_Zone_Error_Rate  ---R--  200  200  000  -  0
  ||||||_ K auto-keep
  |||||__ C event count
  ||||___ R error rate
  |||____ S speed/performance
  ||_____ O updated online
  |______ P prefailure warning

shawndewet · Nov 4, 2014

Thanks a mil. Long SMART tests in progress...

shawndewet · Nov 4, 2014

Ok here's updated Pastebins after long SMART run on all drives:
ada0: http://pastebin.com/ntXGnRBQ
ada1: http://pastebin.com/mt0qan5T
ada2: http://pastebin.com/CHQeUdq7
ada3: http://pastebin.com/4FEUAMCD

I'm not sure what to look for?

Ericloewe · Nov 4, 2014

DrKK said:
Interesting. Unless I'm missing something, all of that looks pretty reasonable. Drives haven't even been in service that long. Thermals look good.

Only things I'm noticing:

1) There's no record of SMART Long tests. I suggest you run one. Only the SMART long test will actually inspect every writable sector.
2) The Seek Error Rate is pretty high. Attributes #1 and #7 sometimes show high numbers for a variety of reasons, and every manufacturer is different, so I don't know what to make of your high readings for #1 and #7.

But you appear to be showing several drives with checksum errors. That's not a particularly encouraging sign. Run the SMART long tests, see if we get other things cropping up.

You can run the tests on demand with:

smartctl -t long /dev/ada0

etc.

Probably will take a couple hours to complete. See if that changes our SMART readings.

Also, we do recommend that "long" SMART tests be a once-per-2-to-4-week regimen, so I suggest you add that to your FreeNAS GUI.

Seagate uses some silly encoding that crams two values into the raw value. The lower half is an inverse log scale, to make things even more unreadable.

shawndewet said:
Ok here's updated Pastebins after long SMART run on all drives:
ada0: http://pastebin.com/ntXGnRBQ
ada1: http://pastebin.com/mt0qan5T
ada2: http://pastebin.com/CHQeUdq7
ada3: http://pastebin.com/4FEUAMCD

I'm not sure what to look for?

There's no smoking gun. Those command timeouts are to be expected after a few years in operation (no interface is perfect).

shawndewet · Nov 4, 2014

Well I dunno...
Running "zpool status -x" now I get back: all pools are healthy.
Running zpool status:
        NAME                                            STATE     READ WRITE CKSUM
        Volume1                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/958c3c6f-4592-11e4-9395-38eaa7abeb2a  ONLINE       0     0     0
            gptid/49d458de-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/4a6020b7-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     0
            gptid/4aec015a-a10c-11e3-8791-38eaa7abeb2a  ONLINE       0     0     0

I'll schedule the mentioned long SMARTS and keep monitoring...

titan_rw · Nov 5, 2014

Seek error rate looks fine:

ada0: 0 seek errors out of 3580810 total seeks
ada1: 0 seek errors out of 16507719 total seeks
ada2: 0 seek errors out of 16311730 total seeks
ada3: 0 seek errors out of 16354425 total seeks

All numbers are 8 digits or less when converted to hex, so the number is simply the total number of seeks with zero seek errors.
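
The arithmetic behind that reading, as a sketch. It assumes the commonly reported (not officially documented) layout for Seagate's raw Seek_Error_Rate: a 48-bit number whose top 16 bits count seek errors and whose low 32 bits count total seeks. A value that fits in 8 hex digits therefore has nothing in the error half. `decode_seek` is a made-up name.

```shell
# Hypothetical helper: split a Seagate Seek_Error_Rate raw value (decimal,
# as smartctl prints it) into its assumed two halves.
decode_seek() {
    raw=$1
    errors=$(( raw >> 32 ))           # assumed: upper 16 bits = seek errors
    seeks=$(( raw & 0xFFFFFFFF ))     # assumed: lower 32 bits = total seeks
    echo "$errors seek errors out of $seeks total seeks"
}

# Usage: decode_seek 3580810
```

For the ada0 value this gives 0 errors and 3580810 seeks, matching the list above.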

All drives have command timeouts though. I doubt they're all experiencing problems at the same time, so I'd suspect something intermittently wonky in the SATA controller. I'd mention cables, but I doubt all 4 cables are 'bad' to the same degree. The SATA controller is the only common part that I can see.

I'd simply keep an eye on it.

shawndewet · Nov 5, 2014

Thanks for the input guys...I'll keep monitoring.
