zpool iostat showing odd numbers for cache device

Crispin · Sep 7, 2015

Hi folks,

I have a 6-disk RAIDZ1 setup with a single SSD cache, running on the latest 9.3.
The main disks are 6 x 4TB Reds, with an OCZ 60GB SSD for the cache.

I was moving a large amount of data around and decided to look at iostat for laughs. It shows something odd for the SSD though:

Code:

                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
tank                                    9.35T  12.4T    523    953  63.9M   106M
  raidz1                                9.35T  12.4T    523    953  63.9M   106M
    gptid/349c35f9-445d-11e5-8abf-2c27d7158144      -      -    148    240  12.5M  22.1M
    gptid/3519e6c4-445d-11e5-8abf-2c27d7158144      -      -    135    292  12.3M  27.6M
    gptid/eebe1ba0-4a9f-11e5-93f3-2c27d7158144      -      -    139    291  12.6M  27.5M
    gptid/367dc071-445d-11e5-8abf-2c27d7158144      -      -    132    251  12.4M  23.9M
    gptid/376dde13-445d-11e5-8abf-2c27d7158144      -      -    139    269  12.6M  25.3M
    gptid/e25a04e4-4b4d-11e5-a068-2c27d7158144      -      -    133    285  12.4M  27.8M
cache                                       -      -      -      -      -      -
  gptid/38b64f4a-445d-11e5-8abf-2c27d7158144   589G  16.0E      1    339  7.97K  41.8M

Why would it be reporting 589G on a 60GB disk? As the copy continues it goes up and up. It's currently at 603GB.

Is this a sign that the disk is not happy? A SMART report says all is well (or does it?).

Code:

########## SMART status report for ada1 drive (SandForce Driven SSDs: OCZ-B592KW38CR0R1K0D) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   092   092   050    Pre-fail  Always       -       102/178934374
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   099   099   000    Old_age   Always       -       960h+57m+01.888s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       358
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       214
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       2
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   128   129   000    Old_age   Always       -       128 (0 127 0 129 0)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       102/178934374
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       102/178934374
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       102/178934374
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       83992380440676
231 SSD_Life_Left           0x0013   094   094   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       17350
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       16303
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       16303
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       4643

SMART Error Log not supported
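
For anyone reading the table above: smartctl treats an attribute as failed when its normalized VALUE is less than or equal to THRESH. Here is a minimal Python sketch of that check. The function names are made up for illustration, and the three sample rows are copied from the report above.

```python
# Sample rows copied from the smartctl report above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   092   092   050    Pre-fail  Always       -       102/178934374
194 Temperature_Celsius     0x0022   128   129   000    Old_age   Always       -       128 (0 127 0 129 0)
231 SSD_Life_Left           0x0013   094   094   010    Pre-fail  Always       -       0
"""


def parse_smart_rows(text):
    """Parse smartctl attribute rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # At most 10 fields; the raw value may itself contain spaces.
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip the header and anything that is not an attribute row
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "raw": parts[9],
        })
    return rows


def failing(rows):
    """Return rows whose normalized value is at or below a non-zero threshold."""
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


rows = parse_smart_rows(SAMPLE)
for r in rows:
    print(r["id"], r["name"], "margin:", r["value"] - r["thresh"])
print("failing:", [r["name"] for r in failing(rows)])
```

On these rows nothing is flagged, which is consistent with the PASSED result. A threshold of 000 means the attribute can never fail by this rule, so those rows are ignored.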

Thanks
Crispin

dlavigne · Sep 13, 2015

Were you able to figure this out?

Crispin · Sep 13, 2015

Nope.... Was hoping someone would give some ideas but alas. :(
I've been busy sorting out some other problems, hence my lack of follow-up.

cyberjock · Sep 15, 2015

If you look closely, the system thinks the SSD is 16 exabytes (the 16.0E in the free column). It's a bug in ZFS that hasn't been tracked down. There don't seem to be any consequences from the error that I know of (I've seen this several times in production environments), and if you reboot it will likely go back to normal values.
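
A rough sketch of why 16.0E looks like a wrapped counter rather than a real size. This assumes, without confirmation from the ZFS source, that free is derived as size minus alloc in unsigned 64-bit arithmetic. The device size in bytes is also an assumption (60 GB decimal), and the helper names are made up for illustration.

```python
U64 = 2**64  # an unsigned 64-bit counter wraps around at this value


def wrapped_free(size_bytes, alloc_bytes):
    """Assumed model: free = size - alloc, computed modulo 2**64."""
    return (size_bytes - alloc_bytes) % U64


def human(n):
    """Format a byte count with binary (1024-based) unit suffixes."""
    units = ["", "K", "M", "G", "T", "P", "E"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= 1024


size = 60 * 10**9    # assumed capacity of the 60GB cache SSD, in bytes
alloc = 589 * 2**30  # the 589G alloc figure from the iostat output

print(human(size))                       # 55.9G
print(human(wrapped_free(size, alloc)))  # 16.0E
```

Under that assumption, any alloc value larger than the device size produces a free value just short of 2^64 bytes, which displays as 16.0E.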

Crispin · Sep 15, 2015

Cool, thanks for the clarification.

Crispin · Sep 30, 2015

Well, a couple of days ago I got the following email from FreeNAS:
"Device: /dev/ada1, Failed SMART usage Attribute: 1 Raw_Read_Error_Rate."

I was away and could not look into it until two days later. Just before I got the chance, I received a second email saying the disk had vanished and had been removed from the pool.
It was showing as "removed" in the GUI and CLI.

After a reboot the drive shows up in the BIOS but not in FreeNAS. It now shows as offline.
I removed it from the pool, ran a scrub, and all seems OK.

Odd that it's busted so soon after SMART started complaining...

I'll start another thread on the failure as this is one of many SSDs I've broken.
https://forums.freenas.org/index.php?threads/broke-yet-another-ssd-5-and-counting.38379/
