Confusing zfs status report

Status
Not open for further replies.

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37
I got the following email from my FreeNAS server last night. It's confusing because it mentions an unrecoverable error, but the detail says everything is online with no known data errors.
Should I be concerned?


Checking status of zfs pools:
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
Volume1  3.62T   355G  3.28T   9%  1.00x  ONLINE  /mnt

  pool: Volume1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h28m with 0 errors on Sun Nov 2 00:28:08 2014
config:

NAME                                            STATE   READ WRITE CKSUM
Volume1                                         ONLINE     0     0     0
  mirror-0                                      ONLINE     0     0     1
    gptid/958c3c6f-4592-11e4-9395-38eaa7abeb2a  ONLINE     0     0     7
    gptid/49d458de-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     4
  mirror-1                                      ONLINE     0     0     0
    gptid/4a6020b7-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     5
    gptid/4aec015a-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     0

errors: No known data errors

-- End of daily output --
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Ugh. You have some checksum errors. I don't like how many drives are having the problem. I suspect a thermal issue is possible.
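For anyone who wants to pick the affected devices out of a status email automatically, here is a minimal Python sketch. It assumes the config rows look like the ones posted above (five whitespace-separated fields: NAME STATE READ WRITE CKSUM); the function name `devices_with_errors` and the `SAMPLE` text are made up for illustration and are not part of FreeNAS or ZFS.

```python
# Sample config rows copied from the status report earlier in this thread.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
Volume1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 1
gptid/958c3c6f-4592-11e4-9395-38eaa7abeb2a ONLINE 0 0 7
gptid/49d458de-a10c-11e3-8791-38eaa7abeb2a ONLINE 0 0 4
mirror-1 ONLINE 0 0 0
gptid/4a6020b7-a10c-11e3-8791-38eaa7abeb2a ONLINE 0 0 5
gptid/4aec015a-a10c-11e3-8791-38eaa7abeb2a ONLINE 0 0 0
"""


def devices_with_errors(config_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in config_text.splitlines():
        fields = line.split()
        # A device row is assumed to have exactly five fields.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, *counters = fields
        # Skip anything that is not a plain integer count (other text lines,
        # or large counts that zpool may abbreviate).
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = map(int, counters)
        if read or write or cksum:
            flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the report above this flags mirror-0 and three of the four disks, which is why the spread across drives stands out.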

What is your hardware? Mobo, CPU, amount/type of RAM? And, can you pastebin for us the output of "smartctl -x /dev/ada0" where "ada0" is replaced with the identifier for each of your drives? I suspect we're going to see some unpleasantries.
 

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Interesting. Unless I'm missing something, all of that looks pretty reasonable. Drives haven't even been in service that long. Thermals look good.

Only things I'm noticing:

1) There's no record of SMART Long tests. I suggest you run one. Only the SMART long test will actually inspect every writable sector.
2) The Seek Error Rate is pretty high. Attributes #1 and #7 sometimes show high numbers for a variety of reasons, and every manufacturer is different, so I don't know what to make of your high readings for #1 and #7.

But you appear to be showing several drives with checksum errors. That's not a particularly encouraging sign. Run the SMART long tests, see if we get other things cropping up.

You can run the tests on demand with:

smartctl -t long /dev/ada0

etc.

Probably will take a couple hours to complete. See if that changes our SMART readings.
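If typing that once per drive gets tedious, a small Python sketch along these lines would build the same command for each disk. The device names ada0 through ada3 are an assumption (substitute your own), and `long_test_command` and `start_long_tests` are hypothetical helper names. By default it only prints the commands.

```python
import subprocess

# Assumed device names; adjust to match your own system.
DRIVES = ["ada0", "ada1", "ada2", "ada3"]


def long_test_command(drive):
    """Build the 'smartctl -t long' command shown above for one drive."""
    return ["smartctl", "-t", "long", f"/dev/{drive}"]


def start_long_tests(drives, dry_run=True):
    """Print each command; with dry_run=False, also run it (needs root)."""
    for drive in drives:
        cmd = long_test_command(drive)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    start_long_tests(DRIVES)  # dry run: only prints the commands
```

Once the tests finish, the self-test log for a drive can be read with "smartctl -l selftest /dev/ada0".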

Also, we do recommend that "long" SMART tests be a once-per-2-to-4-week regimen, so I suggest you add that to your FreeNAS GUI.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
For comparison, here's the smartctl -x output from one of my WD Reds in perfect condition:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAGS  VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate  POSR-K  200  200  051  -  0
  3 Spin_Up_Time  POS--K  172  171  021  -  4391
  4 Start_Stop_Count  -O--CK  100  100  000  -  82
  5 Reallocated_Sector_Ct  PO--CK  200  200  140  -  0
  7 Seek_Error_Rate  -OSR-K  200  200  000  -  0
  9 Power_On_Hours  -O--CK  088  088  000  -  9058
 10 Spin_Retry_Count  -O--CK  100  253  000  -  0
 11 Calibration_Retry_Count -O--CK  100  253  000  -  0
 12 Power_Cycle_Count  -O--CK  100  100  000  -  36
192 Power-Off_Retract_Count -O--CK  200  200  000  -  13
193 Load_Cycle_Count  -O--CK  200  200  000  -  68
194 Temperature_Celsius  -O---K  123  116  000  -  24
196 Reallocated_Event_Count -O--CK  200  200  000  -  0
197 Current_Pending_Sector  -O--CK  200  200  000  -  0
198 Offline_Uncorrectable  ----CK  100  253  000  -  0
199 UDMA_CRC_Error_Count  -O--CK  200  200  000  -  0
200 Multi_Zone_Error_Rate  ---R--  200  200  000  -  0
  ||||||_ K auto-keep
  |||||__ C event count
  ||||___ R error rate
  |||____ S speed/performance
  ||_____ O updated online
  |______ P prefailure warning
 

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37
Thanks a mil. Long SMART tests in progress...
 

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
DrKK said:
2) The Seek Error Rate is pretty high. Attributes #1 and #7 sometimes show high numbers for a variety of reasons, and every manufacturer is different, so I don't know what to make of your high readings for #1 and #7.

Seagate uses some silly encoding that crams two values into the raw value. The lower half is an inverse log scale, to make things even more unreadable.

shawndewet said:
OK, here are the updated Pastebins after the long SMART run on all drives:
ada0: http://pastebin.com/ntXGnRBQ
ada1: http://pastebin.com/mt0qan5T
ada2: http://pastebin.com/CHQeUdq7
ada3: http://pastebin.com/4FEUAMCD

I'm not sure what to look for.

There's no smoking gun. Those command timeouts are to be expected after a few years in operation (no interface is perfect).
 

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37
Well I dunno...
Running "zpool status -x" now, I get back: all pools are healthy.
And running zpool status shows all the error counters back at zero:
NAME                                            STATE   READ WRITE CKSUM
Volume1                                         ONLINE     0     0     0
  mirror-0                                      ONLINE     0     0     0
    gptid/958c3c6f-4592-11e4-9395-38eaa7abeb2a  ONLINE     0     0     0
    gptid/49d458de-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     0
  mirror-1                                      ONLINE     0     0     0
    gptid/4a6020b7-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     0
    gptid/4aec015a-a10c-11e3-8791-38eaa7abeb2a  ONLINE     0     0     0

I'll schedule the long SMART tests as suggested and keep monitoring...
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Seek error rate looks fine:

ada0: 0 seek errors out of 3580810 total seeks
ada1: 0 seek errors out of 16507719 total seeks
ada2: 0 seek errors out of 16311730 total seeks
ada3: 0 seek errors out of 16354425 total seeks

All the raw values are 8 digits or fewer when converted to hex, so the error-count part that sits above those 8 digits is zero, and each number is simply the total number of seeks with zero seek errors.
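To show the arithmetic, here is a minimal Python sketch of that decoding. It assumes the commonly described Seagate layout for this attribute (a 48-bit raw value whose upper 16 bits hold the seek error count and whose lower 32 bits hold the total seek count); that layout is not officially documented, so treat it as an assumption. The raw values are the ones listed above, and `decode_seek_error_rate` is just an illustrative name.

```python
def decode_seek_error_rate(raw):
    """Split a 48-bit raw value into (seek_errors, total_seeks).

    Assumed layout: upper 16 bits = error count, lower 32 bits = seek count.
    """
    seek_errors = (raw >> 32) & 0xFFFF
    total_seeks = raw & 0xFFFFFFFF
    return seek_errors, total_seeks


# Raw Seek_Error_Rate values for the four drives in this thread.
RAW_VALUES = {
    "ada0": 3580810,
    "ada1": 16507719,
    "ada2": 16311730,
    "ada3": 16354425,
}

if __name__ == "__main__":
    for drive, raw in RAW_VALUES.items():
        errors, seeks = decode_seek_error_rate(raw)
        print(f"{drive}: raw=0x{raw:012x} -> {errors} seek errors out of {seeks} total seeks")
```

Any raw value below 2^32 (8 hex digits) decodes to zero errors under this layout, which is the case for all four drives here.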

All drives have command timeouts though. I doubt they're all experiencing problems at the same time, so I'd suspect something intermittently wonky in the SATA controller. I'd mention cables, but I doubt all 4 cables are 'bad' to the same degree. The SATA controller is the only common part that I can see.

I'd simply keep an eye on it.
 

shawndewet

Dabbler
Joined
Feb 28, 2014
Messages
37
Thanks for the input guys...I'll keep monitoring.
 