FreeNAS reports drive read/write errors, long SMART test shows no errors

Status
Not open for further replies.

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
There is an issue with that. You are correct that the WebGUI didn't let you do that: you cannot add the same drive back to the pool from the WebGUI. The most "correct" way is to reboot the system. The problem is that when you try to add the hard drive back to the pool, the WebGUI sees that a partition and data are potentially on the disk, so it refuses to overwrite that data. You short-circuited that "safety" logic when you went to the CLI and replaced it manually.

So now what you have to do is remove ada11 again, then you have to zero out the disk on another machine (NOT the FreeNAS machine), then do a replacement again.

Understood. It makes sense that FreeNAS would not want to whack a drive with data on it. Before I saw this, I just offlined the drive in the GUI and then re-added it. It came back online and resilvered with no problem this time, using the gptid.
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Hey @cyberjock - Same exact problem, different drive (/dev/da3 as opposed to /dev/da11). I took your advice and did nothing but reboot the box. According to FreeNAS the problem went away on reboot, but it gave me a notice in the GUI:


Code:
The volume vol1 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


So I jumped on the CLI and saw that it brought the drive back online and resilvered 460K of data, but that /dev/da3 (faeb3d6d-ed62-11e4-a956-0cc47a31abcc) was showing 2 cksum errors:


Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 460K in 0h0m with 0 errors on Mon Jun  1 08:47:31 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     2
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/a82dda8c-ef5f-11e4-bb0a-0cc47a31abcc  ONLINE       0     0     0

errors: No known data errors
[root@plexnas] ~#
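
A minimal sketch, assuming the column layout shown in the output above, of how the per-device READ/WRITE/CKSUM counters could be pulled out of `zpool status` text so that a nonzero counter stands out. The function name and the sample string are hypothetical, and the pattern only handles plain integer counters (larger counts that zpool abbreviates with a suffix would be skipped).

```python
import re

# One device row of `zpool status`: name, state, then READ, WRITE, CKSUM counters.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts = tuple(int(n) for n in m.group(3, 4, 5))
            if any(counts):
                flagged[m.group(1)] = counts
    return flagged

# A few rows copied from the status output in this post.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     2
"""

print(devices_with_errors(SAMPLE))
```

Run against the rows above, only the gptid that maps to /dev/da3 is reported, with its 2 checksum errors.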



SMART (via smartctl) was not showing any errors at all (same as before):

Code:
[root@plexnas] ~# smartctl -a /dev/da3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PCJZNLWX
LU WWN Device Id: 5 000cca 24cea1f81
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jun  1 09:08:14 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       80
  3 Spin_Up_Time            0x0007   133   133   024    Pre-fail  Always       -       577 (Average 588)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1198
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       100
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       100
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 21/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1190         -
# 2  Short offline       Completed without error       00%      1166         -
# 3  Short offline       Completed without error       00%      1118         -
# 4  Short offline       Completed without error       00%      1070         -
# 5  Short offline       Completed without error       00%      1022         -
# 6  Short offline       Completed without error       00%       974         -
# 7  Extended offline    Completed without error       00%       961         -
# 8  Short offline       Completed without error       00%       926         -
# 9  Short offline       Completed without error       00%       878         -
#10  Short offline       Completed without error       00%       830         -
#11  Short offline       Completed without error       00%       783         -
#12  Short offline       Completed without error       00%       734         -
#13  Short offline       Completed without error       00%       686         -
#14  Short offline       Completed without error       00%       638         -
#15  Extended offline    Completed without error       00%       625         -
#16  Short offline       Completed without error       00%       590         -
#17  Short offline       Completed without error       00%       542         -
#18  Extended offline    Completed without error       00%       117         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@plexnas] ~#
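
A rough sketch of how the attribute table above could be scanned for the raw values that usually matter when a drive drops out. The choice of watched attributes (5, 197, 198, 199) is an assumption of this sketch, not something FreeNAS or smartctl prescribes, and the last row of the sample is a made-up value added only to show a hit.

```python
# Attribute IDs watched by this sketch (an assumption, not an official list).
WATCHED = {5, 197, 198, 199}

def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id, name, raw = int(fields[0]), fields[1], fields[9]
            if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
                found[name] = int(raw)
    return found

# Rows copied from the smartctl output in this post.
FROM_POST = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

# Hypothetical row (NOT from this drive) to show what a hit looks like.
HYPOTHETICAL = "199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       7\n"

print(nonzero_watched(FROM_POST))
print(nonzero_watched(HYPOTHETICAL))
```

For the values posted here nothing is flagged, which matches the clean SMART picture.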



So I cleared the error:

Code:
[root@plexnas] ~# zpool clear vol1 gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc


And now everything looks good again:

Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: ONLINE
  scan: resilvered 460K in 0h0m with 0 errors on Mon Jun  1 08:47:31 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/a82dda8c-ef5f-11e4-bb0a-0cc47a31abcc  ONLINE       0     0     0

errors: No known data errors
[root@plexnas] ~#


I did receive this in my email. I am assuming it has something to do with the issue, since it is specifically about the drive in question:

Code:
plexnas.rstechnical.com kernel log messages:
>       (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000f82de8
>       (da3:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 17 7f 1c 40 00 00 00 08 00 00 length 4096 SMID 722 terminated ioc 804b scsi 0 state c xfer 0
>       (da3:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 17 7d 86 90 00 00 01 00 00 00 length 131072 SMID 416 terminated ioc 804b scsi 0 state c xfer 0
>       (da3:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 17 7d 85 90 00 00 01 00 00 00 length 131072 SMID 436 terminated ioc 804b scsi 0 state c xfer 0
>       (da3:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 17 7d 84 90 00 00 01 00 00 00 length 131072 SMID 144 terminated ioc 804b scsi 0 state c xfer 0
> (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da3:mpr0:0:11:0): CAM status: Command timeout
> (da3:mpr0:0:11:0): Retrying command
> (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da3:mpr0:0:11:0): CAM status: SCSI Status Error
> (da3:mpr0:0:11:0): SCSI status: Check Condition
> (da3:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da3:mpr0:0:11:0): Error 6, Retries exhausted
> (da3:mpr0:0:11:0): Invalidating pack

-- End of security output --
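
For reading the terminated READ(16) lines above, here is a small sketch that decodes the CDB bytes using the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13). The function name is hypothetical, and the 512-byte block size is taken from the "512 bytes logical" line in the smartctl output.

```python
def decode_read16(cdb_hex, block_size=512):
    """Decode a READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks, bytes_requested).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # number of blocks to read
    return lba, blocks, blocks * block_size

# First two terminated commands from the kernel log in this post.
for cdb in (
    "88 00 00 00 00 01 17 7f 1c 40 00 00 00 08 00 00",
    "88 00 00 00 00 01 17 7d 86 90 00 00 01 00 00 00",
):
    lba, blocks, nbytes = decode_read16(cdb)
    print(hex(lba), blocks, nbytes)
```

The decoded sizes (8 blocks and 256 blocks) agree with the "length 4096" and "length 131072" fields in the log, and the terminated reads all fall in the same LBA neighbourhood.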






I also downloaded the Advanced => Debug info as you suggested and attached it to this post. I would be very interested in your take on what the problem might be.
 

Attachments

  • debug-plexnas-20150601090130.tgz
    300.5 KB

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you are using an LSI 3008 controller. The driver is classified as "pre-alpha", so having issues is totally expected.

Nothing else looks terribly wrong though. But that controller really should be taken out and replaced with a recommended controller.
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Thanks!
 