security and run output messages

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
hello,

i was hoping for some help with three messages i am getting. i am running 9.2.1.9 on prosumer hardware.

the first message is from the daily run output, where i get an email saying:

Checking status of zfs pools:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
volume  32.5T  20.3T  12.2T    62%  1.00x  ONLINE  /mnt

  pool: volume
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.48M in 14h35m with 0 errors on Wed Apr 1 14:35:25 2015
config:

NAME                                            STATE     READ WRITE CKSUM
volume                                          ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     1
    gptid/92e122fb-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     1
    gptid/8e751139-6874-11e4-a79e-c86000cb131c  ONLINE       0     0     0
    gptid/93c736f3-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
    gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
    gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
    gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
    gptid/95402b37-5a13-11e4-919c-c86000cb131c  ONLINE       0     0    79
    gptid/95c39268-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
    gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     1

errors: No known data errors

how do i know what drive gptid/95402b37-5a13-11e4-919c-c86000cb131c is and do i need to worry about it? the smart test comes back with all drives passing the test.

the second is from the security run output and it says:

kernel log messages:
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e8 00 aa c8 40 e4 00 00 00 00 00
> (ada3:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich10:0:0:0): Retrying command
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 e8 aa c8 40 e4 00 00 01 00 00
> (ada3:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich10:0:0:0): Retrying command
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 10 ac c8 40 e4 00 00 00 00 00
> (ada3:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich10:0:0:0): Retrying command

-- End of security output --

these were received on the same day, is the checksum error on drive gptid/95402b37-5a13-11e4-919c-c86000cb131c the ada3 drive with crc errors? and how would i know what file(s) are having a problem?

the third question is also from the security run output and i get:

> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 08 88 d5 40 51 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 48 86 d5 40 51 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 28 87 d5 40 51 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 a8 78 13 40 4f 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 88 79 13 40 4f 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 68 7a 13 40 4f 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 38 ba 40 82 01 00 01 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d0 39 ba 40 82 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 b0 3a ba 40 82 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 40 fe bc 40 50 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 20 ff bc 40 50 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 00 00 bd 40 50 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d0 00 ff 40 ea 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f0 b0 01 ff 40 ea 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 20 a0 02 ff 40 ea 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 28 7e 74 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 08 7f 74 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 e8 7f 74 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 f8 e3 a2 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d8 e4 a2 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 b8 e5 a2 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 88 34 ba 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 68 35 ba 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 48 36 ba 40 eb 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f8 58 13 1d 40 ec 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 78 14 1d 40 ec 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 78 12 1d 40 ec 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 40 60 23 40 ed 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 20 61 23 40 ed 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 00 62 23 40 ed 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 10 49 1f 40 ee 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 f0 49 1f 40 ee 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d0 4a 1f 40 ee 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 b8 fb 69 40 cf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 98 fc 69 40 cf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 78 fd 69 40 cf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 98 e8 74 02 40 05 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 a8 76 02 40 05 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d8 78 02 40 05 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 41 95 40 4f 01 00 01 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 20 43 95 40 4f 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 00 44 95 40 4f 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 f8 e8 87 c6 40 d1 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 20 60 c0 c6 40 d1 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c8 c0 c6 40 d1 00 00 01 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 f0 e6 cf 40 06 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 d0 e7 cf 40 06 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 b0 e8 cf 40 06 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 b8 5a 53 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 98 5b 53 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 78 5c 53 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 68 43 45 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 48 44 45 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 28 45 45 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 f0 3a 91 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 d0 3b 91 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 b0 3c 91 40 bd 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 e0 fd ac 40 bf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 fc ac 40 bf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 00 fd ac 40 bf 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 80 de 05 40 c0 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 60 df 05 40 c0 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e8 40 e0 05 40 c0 00 00 00 00 00
> (ada3:ahcich10:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 08 19 ec 40 27 01 00 01 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 78 85 5b 40 54 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 18 98 86 1a 40 04 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 c0 7b 83 40 c8 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 a0 7c 83 40 c8 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 80 7d 83 40 c8 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 f8 4e 03 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 d8 4f 03 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 b8 50 03 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 68 86 e6 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 48 87 e6 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 28 88 e6 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 08 89 e6 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 c8 8a e6 40 c9 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 a8 14 16 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 88 15 16 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 68 16 16 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 00 64 2a 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 e0 64 2a 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 c0 65 2a 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 90 01 bf 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 70 02 bf 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 50 03 bf 40 ca 00 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 98 41 17 40 65 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 78 42 17 40 65 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 58 43 17 40 65 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 c0 c5 f4 40 66 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 a0 c6 f4 40 66 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b8 80 c7 f4 40 66 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 48 23 78 40 27 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 70 28 24 78 40 27 01 00 00 00 00
> (ada3:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 70 e8 24 78 40 27 01 00 00 00 00

what does all that mean? i get that email every morning and i have no idea what is causing it.

thanks for your help!
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I would say it's bad cables and/or failing drives and/or, if the hardware isn't appropriate, the hardware itself (RAID controller, crappy PSU, ...).

What's your hardware?
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
  1. Back up your data.
  2. Back up your current configuration.
  3. Find the serial number of drive ada3.
  4. Report your system hardware and FreeNAS version in this thread.
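
For step 3, one way to tie a pool member's gptid label to a device node and then to a serial number is sketched below. It assumes the FreeBSD tools glabel and smartctl are available and that "glabel status" prints three columns (name, status, component). The function name gptid_to_dev is made up for illustration and is not part of FreeNAS. The syntax is POSIX sh, so if your login shell is csh, start sh first.

```shell
#!/bin/sh
# Sketch only: resolve a gptid label to its disk, then read the disk's serial.
# gptid_to_dev is a hypothetical helper name.

# Reads `glabel status` output on stdin and prints the disk behind label $1,
# with the partition suffix (p1, p2, ...) stripped.
gptid_to_dev() {
    awk -v label="$1" '$1 == label { sub(/p[0-9]+$/, "", $3); print $3 }'
}

# Usage on the NAS itself (commented out so nothing runs when sourced):
#   dev=$(glabel status | gptid_to_dev gptid/95402b37-5a13-11e4-919c-c86000cb131c)
#   smartctl -i "/dev/$dev" | grep 'Serial Number'
```

The serial printed by smartctl -i can then be matched against the label on the physical drive before anything is unplugged.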
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
i have been running this setup for over 6 months with no issues, and it does a scrub twice a month. it is an i7 on a sabertooth x79 board with 9 hitachi/hgst 4tb drives using the onboard data controllers, 64 GB ram and an 850 watt psu. so if ada3 is having crc issues, but smart isn't coming back with anything, would i just be throwing the drive away? it would be under warranty but will hitachi replace it for crc errors? is ada3 gptid/95402b37-5a13-11e4-919c-c86000cb131c?
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
"is ada3 gptid/95402b37-5a13-11e4-919c-c86000cb131c?"
It would seem so. You are also showing checksum errors on two other drives.
Please post the output of # smartctl -a /dev/ada3 in code tags.
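
The checksum counts referred to here are the last column of the device table in "zpool status". A minimal sketch for listing every pool member with a non-zero CKSUM count follows. It assumes the five-column layout from the first post (NAME STATE READ WRITE CKSUM) and gptid-labelled members; cksum_errors is an illustrative name. Separately, "zpool status -v" lists affected files when ZFS knows of permanent errors; the report in this thread ended with "No known data errors".

```shell
#!/bin/sh
# Sketch only: print pool members whose CKSUM column is non-zero.
# Reads `zpool status` output on stdin; cksum_errors is a hypothetical helper.
cksum_errors() {
    # Compare as a string so abbreviated counts would also be reported.
    awk '$1 ~ /^gptid\// && NF == 5 && $5 != "0" { print $1, $5 }'
}

# Usage on the NAS (commented out so nothing runs when sourced):
#   zpool status volume | cksum_errors
```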

 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
Code:
[root@freenas] ~# smartctl -a /dev/ada3
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p15 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HDS724040ALE640
Serial Number:    PK2334PBJMG1MT
LU WWN Device Id: 5 000cca 23de506cf
Firmware Version: MJAOA580
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 12 15:49:31 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 561) minutes.
SCT capabilities:          (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   131   131   024    Pre-fail  Always       -       589 (Average 590)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3515
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       149
194 Temperature_Celsius     0x0002   122   122   000    Old_age   Always       -       49 (Min/Max 22/59)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2127

SMART Error Log Version: 1
ATA Error Count: 2127 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2127 occurred at disk power-on lifetime: 3515 hours (146 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 11 d7 30 0a 07  Error: ICRC, ABRT at LBA = 0x070a30d7 = 118108375

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 10 d8 30 0a 47 ff   3d+15:34:25.442  READ FPDMA QUEUED
  60 20 50 c8 30 0a 40 00   3d+15:34:25.441  READ FPDMA QUEUED
  60 28 48 e8 30 0a 40 00   3d+15:34:25.441  READ FPDMA QUEUED
  60 28 40 10 31 0a 40 00   3d+15:34:25.441  READ FPDMA QUEUED
  60 28 38 a0 30 0a 40 00   3d+15:34:25.441  READ FPDMA QUEUED

Error 2126 occurred at disk power-on lifetime: 3515 hours (146 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 09 8f 21 f9 00  Error: ICRC, ABRT at LBA = 0x00f9218f = 16327055

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 28 08 90 21 f9 40 ff   3d+15:20:16.409  READ FPDMA QUEUED
  60 28 78 70 21 f9 40 00   3d+15:20:16.408  READ FPDMA QUEUED
  60 20 78 50 21 f9 40 00   3d+15:20:16.407  READ FPDMA QUEUED
  60 28 78 28 21 f9 40 00   3d+15:20:16.407  READ FPDMA QUEUED
  60 28 78 00 21 f9 40 00   3d+15:20:16.407  READ FPDMA QUEUED

Error 2125 occurred at disk power-on lifetime: 3514 hours (146 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 11 1f d9 a6 09  Error: ICRC, ABRT at LBA = 0x09a6d91f = 161929503

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 10 20 d9 a6 49 ff   3d+14:57:18.214  READ FPDMA QUEUED
  60 20 70 10 d9 a6 40 00   3d+14:57:18.213  READ FPDMA QUEUED
  60 20 68 70 d9 a6 40 00   3d+14:57:18.212  READ FPDMA QUEUED
  60 20 60 30 d9 a6 40 00   3d+14:57:18.212  READ FPDMA QUEUED
  60 20 60 50 d9 a6 40 00   3d+14:57:18.212  READ FPDMA QUEUED

Error 2124 occurred at disk power-on lifetime: 3514 hours (146 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 19 b7 ca 2d 0f  Error: ICRC, ABRT at LBA = 0x0f2dcab7 = 254659255

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 28 18 b8 ca 2d 4f ff   3d+14:51:51.074  READ FPDMA QUEUED
  60 28 18 a8 ca 2d 40 00   3d+14:51:51.072  READ FPDMA QUEUED
  60 20 18 88 ca 2d 40 00   3d+14:51:51.072  READ FPDMA QUEUED
  60 28 18 60 ca 2d 40 00   3d+14:51:51.072  READ FPDMA QUEUED
  60 48 10 18 ca 2d 40 00   3d+14:51:51.072  READ FPDMA QUEUED

Error 2123 occurred at disk power-on lifetime: 3484 hours (145 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 41 e7 23 78 07  Error: ICRC, ABRT at LBA = 0x077823e7 = 125314023

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 70 40 e8 23 78 47 ff   2d+08:59:51.658  READ FPDMA QUEUED
  60 70 c8 28 24 78 40 00   2d+08:59:51.656  READ FPDMA QUEUED
  60 e0 c0 48 23 78 40 00   2d+08:59:51.656  READ FPDMA QUEUED
  60 20 b8 b8 22 78 40 00   2d+08:59:51.656  READ FPDMA QUEUED
  60 50 b0 88 21 78 40 00   2d+08:59:51.656  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       507         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Oh god, this drive runs very hot; you should add cooling to your server immediately.

So, we can see that it has 3 reallocated sectors (this may mean the drive has started to fail, or it may just be the very high temp) and more than 2k UDMA CRC errors (which means that the cable is a crappy one, or the connector(s) aren't well seated, or the PSU outputs crappy power).

Also, this drive has seen only one SMART test, early in its life, which means you haven't scheduled proper automated SMART tests. Please launch an extended test (smartctl -t long /dev/ada3) now to see if there's more errors.
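
The attributes discussed in this post can be pulled out of the full report with a short filter. This is only a sketch. It assumes the attribute table layout printed by "smartctl -A" (raw value in the tenth column, as in the output above). smart_summary is an illustrative name, and the IDs 5, 194, 197, 198 and 199 are simply the ones relevant to this thread.

```shell
#!/bin/sh
# Sketch only: reduce the SMART attribute table to a few name=raw_value pairs.
# Reads `smartctl -A` output on stdin; smart_summary is a hypothetical helper.
smart_summary() {
    # NF >= 10 skips short lines that merely start with one of these numbers.
    awk 'NF >= 10 && $1 ~ /^(5|194|197|198|199)$/ { print $2 "=" $10 }'
}

# Usage on the NAS (commented out so nothing runs when sourced):
#   smartctl -A /dev/ada3 | smart_summary
```

Running it before and after reseating or swapping a cable shows whether UDMA_CRC_Error_Count keeps climbing.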
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
"Please launch an extended test (smartctl -t long /dev/ada3) now to see if there's more errors."
^^^^^^^^^^^^^^^^^^^^ what he said, yep (gonna take > 9 hrs).
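
The time estimate comes from the drive's own report: the "Extended self-test routine recommended polling time" in the output above is 561 minutes. Here is a small sketch of reading that figure and converting it. polling_minutes and to_hm are illustrative names, and the parsing assumes the two-line layout this smartctl version prints.

```shell
#!/bin/sh
# Sketch only: find the extended self-test polling time and convert it.
# polling_minutes and to_hm are hypothetical helper names.

# Reads `smartctl -c` (or `smartctl -a`) output on stdin; prints the minutes.
polling_minutes() {
    awk '/^Extended self-test routine/ { want = 1; next }
         want && /recommended polling time/ { gsub(/[^0-9]/, ""); print; exit }'
}

# Converts whole minutes to an XhYm string.
to_hm() {
    echo "$(( $1 / 60 ))h$(( $1 % 60 ))m"
}

# Usage on the NAS (commented out so nothing runs when sourced):
#   to_hm "$(smartctl -c /dev/ada3 | polling_minutes)"
```

For 561 minutes this gives 9h21m, which matches the "> 9 hrs" figure.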
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you're kind of in a bad way... got VERY hot drives and at least 2 with problems. I'd find the worst offender and replace that one first, and hope to God you don't have a 3rd disk start failing before resilvering finishes. I also wouldn't even try to resilver until you fix your heat problems. Cooking drives while trying to restore redundancy is... silly.
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
i don't think the drives are really that hot. could they just have bad temperature sensors? or be reporting incorrectly?
Code:
[root@freenas] ~# echo "drive 0" `smartctl -a /dev/ada0 | grep Temperature_Celsius`; echo "drive 1" `smartctl -a /dev/ada1 | grep Temperature_Celsius`; echo "drive 2" `smartctl -a /dev/ada2 | grep Temperature_Celsius`;echo "drive 3" `smartctl -a /dev/ada3 | grep Temperature_Celsius`;echo "drive 4" `smartctl -a /dev/ada4 | grep Temperature_Celsius`;echo "drive 5" `smartctl -a /dev/ada5 | grep Temperature_Celsius`;echo "drive 6" `smartctl -a /dev/ada6 | grep Temperature_Celsius`;echo "drive 7" `smartctl -a /dev/ada7 | grep Temperature_Celsius`;echo "drive 8" `smartctl -a /dev/ada8 | grep Temperature_Celsius`

drive 0 194 Temperature_Celsius 0x0002 146 146 000 Old_age Always - 41 (Min/Max 22/66)
drive 1 194 Temperature_Celsius 0x0002 142 142 000 Old_age Always - 42 (Min/Max 22/66)
drive 2 194 Temperature_Celsius 0x0022 033 060 000 Old_age Always - 33 (0 19 0 0 0)
drive 3 194 Temperature_Celsius 0x0002 162 162 000 Old_age Always - 37 (Min/Max 22/59)
drive 4 194 Temperature_Celsius 0x0022 030 052 000 Old_age Always - 30 (0 20 0 0 0)
drive 5 194 Temperature_Celsius 0x0022 030 055 000 Old_age Always - 30 (0 19 0 0 0)
drive 6 194 Temperature_Celsius 0x0002 166 166 000 Old_age Always - 36 (Min/Max 21/57)
drive 7 194 Temperature_Celsius 0x0002 146 146 000 Old_age Always - 41 (Min/Max 20/72)
drive 8 194 Temperature_Celsius 0x0002 146 146 000 Old_age Always - 41 (Min/Max 20/70)
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
"i don't think the drives are really that hot. could they just have bad temperature sensors? or be reporting incorrectly?"
They could indeed have bad sensors, or be reporting lower than they actually are. They could actually be hotter than that!
Your last post shows a 12 degree temperature spread over all your drive temps.
You've had more than one person bring this up to you, are you going to ignore that?
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
i'm not ignoring anything. just trying to understand what i am seeing. case fans are on their way, the nas is getting moved to another room in the house tomorrow, and for tonight i have the a/c cranked down and blowing right on it.

that being said, i take it from your comment that the number i should be looking at is the 37 in the line:
drive 3 194 Temperature_Celsius 0x0002 162 162 000 Old_age Always - 37 (Min/Max 22/59)

originally i was looking at the 162 as the temperature, thinking that it would not be functional if it was actually that hot. hence the comment.

if all that is correct, what is the normal operating temperature supposed to be? from what i can tell >50-60 you should begin to worry. is that correct? which makes that max of 72 really scare me...
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
That's what we've been trying to gently urge you to see. ;)
Drive temps should remain, for the most part, at or just below 40 °C.
There have been studies done on reliability that strongly suggest
exceeding 40 °C long term cuts drive life by a great margin.
Drives 7 & 8 are the ones scaring me right now, but it sounds as if
you are taking the right steps. Hopefully you can run long SMART tests
on all your drives and, like cyberjock recommended, tackle the worst first.
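
A sketch of a per-drive temperature check against the 40 degree guideline mentioned above. It assumes the drives are ada0 through ada8, as in the earlier post, and that the current temperature is the first number of RAW_VALUE (the tenth field) on the Temperature_Celsius line. raw_temp and check_temps are illustrative names, and the limit is a parameter rather than a hard rule.

```shell
#!/bin/sh
# Sketch only: report each drive's current temperature and flag hot ones.
# raw_temp and check_temps are hypothetical helper names.

# Reads `smartctl -A` output on stdin; prints the current temperature in C.
raw_temp() {
    awk '$2 == "Temperature_Celsius" { print $10; exit }'
}

# $1 = limit in degrees C (defaults to 40). Assumes drives ada0..ada8.
check_temps() {
    limit=${1:-40}
    for n in 0 1 2 3 4 5 6 7 8; do
        t=$(smartctl -A "/dev/ada$n" | raw_temp)
        if [ "${t:-0}" -gt "$limit" ]; then
            echo "ada$n: $t C (above $limit C)"
        else
            echo "ada$n: ${t:-unknown} C"
        fi
    done
}

# Usage on the NAS (commented out so nothing runs when sourced):
#   check_temps 40
```

The lifetime Min/Max pair printed by some drives is left out here, because its format differs between the drive models in this pool.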
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
No need for tmux; the SMART tests run on the drive itself and don't require the console/shell at all. Just do "smartctl -t long /dev/ada0"; when it returns you to the shell prompt in a second or two, do "smartctl -t long /dev/ada1", lather, rinse, repeat.
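
The "lather, rinse, repeat" part can be written as a loop. This is a sketch, assuming POSIX sh (start sh first if your login shell is csh). start_long_tests and the SMARTCTL variable are illustrative names, and the variable exists only so the loop can be dry-run with echo.

```shell
#!/bin/sh
# Sketch only: start an extended SMART self-test on each named drive.
# start_long_tests and SMARTCTL are hypothetical names; set SMARTCTL=echo
# to print the commands instead of running them.
SMARTCTL=${SMARTCTL:-smartctl}

start_long_tests() {
    for dev in "$@"; do
        # Returns almost immediately; the test itself runs inside the drive.
        "$SMARTCTL" -t long "/dev/$dev"
    done
}

# Usage on the NAS (commented out so nothing runs when sourced):
#   start_long_tests ada0 ada1 ada2
# Results can be read later with:
#   smartctl -l selftest /dev/ada0
```

Passing only a few device names per call is a simple way to stagger the tests instead of starting all nine at once.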
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
You're right, Dan. For some reason (it was past my bedtime) I was thinking about badblocks
when I wrote that. Thanks for clearing that up.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Two drives with bad temperature sensors? Doesn't sound likely. Nasty as it sounds, a crazy high temperature is more likely.
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
well the test finally finished. i had some drives drop out of the array when i tried to run multiple smart tests at the same time. probably user error though.

so even though the drive says it passed its smart test, should i be worried about it?

Code:
[root@freenas] ~# smartctl -a /dev/ada3
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p15 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  HGST HDS724040ALE640
Serial Number:  PK2334PBJMG1MT
LU WWN Device Id: 5 000cca 23de506cf
Firmware Version: MJAOA580
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Tue Apr 14 16:53:01 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
  was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (  24) seconds.
Offline data collection
capabilities:  (0x5b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  No Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  1) minutes.
Extended self-test routine
recommended polling time:  ( 561) minutes.
SCT capabilities:  (0x003d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000b  081  081  016  Pre-fail  Always  -  10682828
  2 Throughput_Performance  0x0005  137  137  054  Pre-fail  Offline  -  78
  3 Spin_Up_Time  0x0007  131  131  024  Pre-fail  Always  -  589 (Average 590)
  4 Start_Stop_Count  0x0012  100  100  000  Old_age  Always  -  25
  5 Reallocated_Sector_Ct  0x0033  100  100  005  Pre-fail  Always  -  3
  7 Seek_Error_Rate  0x000b  100  100  067  Pre-fail  Always  -  0
  8 Seek_Time_Performance  0x0005  121  121  020  Pre-fail  Offline  -  34
  9 Power_On_Hours  0x0012  100  100  000  Old_age  Always  -  3564
 10 Spin_Retry_Count  0x0013  100  100  060  Pre-fail  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  25
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  150
193 Load_Cycle_Count  0x0012  100  100  000  Old_age  Always  -  150
194 Temperature_Celsius  0x0002  146  146  000  Old_age  Always  -  41 (Min/Max 22/59)
196 Reallocated_Event_Count 0x0032  100  100  000  Old_age  Always  -  3
197 Current_Pending_Sector  0x0022  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0008  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x000a  200  200  000  Old_age  Always  -  2194

SMART Error Log Version: 1
ATA Error Count: 2194 (device log contains only the most recent five errors)
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2194 occurred at disk power-on lifetime: 3531 hours (147 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 61 ff e0 72 08  Error: ICRC, ABRT at LBA = 0x0872e0ff = 141746431

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 e0 60 00 e1 72 48 ff  4d+07:57:52.154  READ FPDMA QUEUED
  60 e0 c0 60 e1 72 40 00  4d+07:57:52.153  READ FPDMA QUEUED
  60 e0 b8 80 e0 72 40 00  4d+07:57:52.153  READ FPDMA QUEUED
  60 e8 b0 98 df 72 40 00  4d+07:57:52.152  READ FPDMA QUEUED
  60 e0 a8 b8 de 72 40 00  4d+07:57:52.152  READ FPDMA QUEUED

Error 2193 occurred at disk power-on lifetime: 3527 hours (146 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 09 df 87 e9 08  Error: ICRC, ABRT at LBA = 0x08e987df = 149522399

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 28 08 e0 87 e9 48 ff  4d+03:15:04.141  READ FPDMA QUEUED
  60 28 b0 c0 87 e9 40 00  4d+03:15:04.140  READ FPDMA QUEUED
  60 20 b0 a0 87 e9 40 00  4d+03:15:04.140  READ FPDMA QUEUED
  60 28 b0 78 87 e9 40 00  4d+03:15:04.140  READ FPDMA QUEUED
  60 28 b0 50 87 e9 40 00  4d+03:15:04.139  READ FPDMA QUEUED

Error 2192 occurred at disk power-on lifetime: 3526 hours (146 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 4f dc 84 0f  Error: ICRC, ABRT at LBA = 0x0f84dc4f = 260365391

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 00 4f dc 84 4f ff  4d+02:53:04.973  READ FPDMA QUEUED
  60 20 78 30 dc 84 40 00  4d+02:53:04.972  READ FPDMA QUEUED
  60 28 78 08 dc 84 40 00  4d+02:53:04.972  READ FPDMA QUEUED
  60 28 70 e0 db 84 40 00  4d+02:53:04.972  READ FPDMA QUEUED
  60 20 70 c0 db 84 40 00  4d+02:53:04.972  READ FPDMA QUEUED

Error 2191 occurred at disk power-on lifetime: 3526 hours (146 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 09 e7 01 ef 0e  Error: ICRC, ABRT at LBA = 0x0eef01e7 = 250544615

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 28 08 e8 01 ef 4e ff  4d+02:46:34.811  READ FPDMA QUEUED
  60 28 18 c8 01 ef 40 00  4d+02:46:34.810  READ FPDMA QUEUED
  60 28 18 58 01 ef 40 00  4d+02:46:34.810  READ FPDMA QUEUED
  60 20 18 38 01 ef 40 00  4d+02:46:34.810  READ FPDMA QUEUED
  60 28 18 e8 00 ef 40 00  4d+02:46:34.810  READ FPDMA QUEUED

Error 2190 occurred at disk power-on lifetime: 3526 hours (146 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 09 27 1f e5 03  Error: ICRC, ABRT at LBA = 0x03e51f27 = 65347367

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 28 08 28 1f e5 43 ff  4d+02:37:52.155  READ FPDMA QUEUED
  60 28 78 08 1f e5 40 00  4d+02:37:52.154  READ FPDMA QUEUED
  60 20 78 e8 1e e5 40 00  4d+02:37:52.154  READ FPDMA QUEUED
  60 20 70 78 1e e5 40 00  4d+02:37:52.154  READ FPDMA QUEUED
  60 28 68 98 1e e5 40 00  4d+02:37:52.154  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  3554  -
# 2  Extended offline  Interrupted (host reset)  90%  3522  -
# 3  Extended offline  Aborted by host  90%  3522  -
# 4  Extended offline  Interrupted (host reset)  90%  3520  -
# 5  Extended offline  Completed without error  00%  507  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
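
For anyone reading the error log above: the LBA that smartctl prints on each "Error: ICRC, ABRT at LBA = ..." line can be reproduced from the register dump on the same line. The sketch below is only an illustration and assumes the 28-bit layout used by this log (low nibble of DH, then CH, CL, SN). The function name decode_error_registers is made up for this example, and only the two error bits that appear in this log (ICRC = 0x80, ABRT = 0x04) are decoded.

```python
def decode_error_registers(line):
    """Decode the 'ER ST SC SN CL CH DH' hex bytes of one smartctl error entry.

    Assumes a 28-bit LBA: DH low nibble (bits 27-24), CH, CL, SN (bits 7-0).
    """
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in line.split()[:7])
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    # Only the error-register bits seen in this log are named here.
    flags = [name for bit, name in ((0x80, "ICRC"), (0x04, "ABRT")) if er & bit]
    return lba, flags


# Register bytes copied from Error 2194 above.
lba, flags = decode_error_registers("84 51 61 ff e0 72 08")
print(hex(lba), lba, flags)  # 0x872e0ff 141746431 ['ICRC', 'ABRT']
```

ICRC is the interface CRC bit, and the ATA Error Count (2194) matches the raw value of UDMA_CRC_Error_Count (2194), so these entries look like transfer errors on the link rather than bad sectors, though that is an inference from the log and not something smartctl states.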
 