Is my hard drive failing?

Status
Not open for further replies.

mskenderian

Contributor
Joined
May 24, 2013
Messages
100
I have set up a raidz2...

I SSHed into my box this morning and found some I/O errors. I thought it was my USB drive failing (which I still think it is). Then all of a sudden my SSH session froze and I got disconnected; I couldn't do anything. I went to the box, turned on the monitor, and saw a bunch of I/O errors. I couldn't read them because they were scrolling past too fast on the screen.

I couldn't do anything except a hard reset (pull the plug). It booted up fine. Then, when I logged into the web GUI, I saw this alert:
  • WARNING: The volume Data (ZFS) status is UNKNOWN: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

I checked zpool status Data and saw:
        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3f27b294-cd80-11e2-bb99-00e0531360af  ONLINE       0     0     0
            gptid/3fa3f08b-cd80-11e2-bb99-00e0531360af  ONLINE       0     0     0
            gptid/405d6fe5-cd80-11e2-bb99-00e0531360af  ONLINE       0     0     0
            gptid/41514f72-cd80-11e2-bb99-00e0531360af  ONLINE       0     0    70
So I ran a scrub. After a few hours my SSH session lost its connection. I went over to the box, turned on the monitor, and it was just frozen. I restarted it, and it looks like the scrub resumed.
Here is the SMART info.

[root@freenas] ~# smartctl -a /dev/ada3
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green
Device Model: WDC WD20EADS-11R6B1
Serial Number: WD-WMAVY0000000
LU WWN Device Id: 5 0014ee 0473d6abc
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Jun 11 01:06:35 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (40800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 464) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   139   021    Pre-fail  Always       -       8233
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       115
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       366
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1498
194 Temperature_Celsius     0x0022   120   099   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               10%       366         -
# 2  Short offline       Completed without error       00%       349         -
# 3  Short offline       Completed without error       00%       337         -
# 4  Short offline       Completed without error       00%       325         -
# 5  Short offline       Completed without error       00%       313         -
# 6  Short offline       Completed without error       00%       303         -
# 7  Short offline       Completed without error       00%       294         -
# 8  Short offline       Completed without error       00%       277         -
# 9  Short offline       Completed without error       00%       266         -
#10  Short offline       Completed without error       00%       254         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I don't know anything about these SMART stats. Can someone tell me whether my HD is dying, or whether it's something fixable?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Current Pending Sector Count is my personal indicator that a hard drive is failing. Give up those scheduled 12 hour short offline tests (they are worthless) and run a single long test (smartctl -t long /dev/ada3). The command will tell you how long the test will take to run (it should take 464 minutes to complete). Minimize heavy loading on the server during the test and don't do a scrub. Then look at your log afterward (smartctl -a /dev/ada3) and I bet you'll see the test failed with a read error. If it passes then your drive is probably in good shape, but I'd monitor it since your very new hard drive already has a non-zero Current Pending Sector Count.

If it fails the long test I'd RMA the drive. I had 2 of those drives in my FreeNAS server, both failed with current pending sector count being non-zero and failed the long test in the last 2 weeks. So I feel your pain.
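
The check described above (look for a non-zero Current Pending Sector count) can be scripted against the attribute table that smartctl prints. The following is a minimal Python sketch, not an official tool: the function name suspicious_attributes and the WATCHED list are made up for illustration, and including attributes 5 and 198 alongside 197 is an assumption, since the post only names Current Pending Sector. The sample rows are copied from the smartctl output earlier in the thread.

```python
# Sketch: flag watched SMART attributes whose raw value is non-zero.
# WATCHED is an assumption for illustration; the post above only
# singles out Current_Pending_Sector (ID 197).
WATCHED = {5, 197, 198}

# Three rows copied from the smartctl output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if int(fields[0]) in WATCHED and raw.isdigit() and int(raw) > 0:
            found[fields[1]] = int(raw)
    return found


print(suspicious_attributes(SAMPLE))  # {'Current_Pending_Sector': 2}
```

To use it on a live system, feed it the text printed by smartctl -A /dev/ada3 instead of SAMPLE.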
 

mskenderian

OK, I will do that.

So how about the other part: the fact that I can't log in over SSH and my local shell freezes up? Is this issue unrelated to the HD? Even right now, I can't log in via the web GUI or SSH.

@cyberjock - btw, your signature link isn't working.
 

cyberjock

A failing hard drive can cause the system to become unstable and even crash. Although this shouldn't normally happen, it's not unheard of. It's a little hard to isolate the exact cause of the local shell freezing up while you also have hard drive issues. I'd do a disk replacement and monitor your system. If it continues to freeze up and you don't have a second disk failing, you may have another issue to deal with. I will warn you that if your system is freezing up regularly it can make the zpool unmountable (loss of all data in your zpool), so I wouldn't ignore the problem if the system keeps freezing and being unreliable.

I am aware that my signature link is broken. The link should point to http://forums.freenas.org/threads/slideshow-explaining-vdev-zpool-zil-and-l2arc-for-noobs.7775/ . Due to the upgraded forums as of last Friday I don't have rights to even edit my own signature (go figure), while regular users can still edit theirs. Due to that and various other things I'm leaving the forums until further notice (for good?). You can read about the frustration of the new forum software at http://forums.freenas.org/threads/new-forum-design.13117/ .
 

mskenderian

Umm, don't leave, you have helped me on many posts. Do you use Stack Exchange? What's your handle?

How can I run a test on my USB flash disk, which is housing the FreeNAS install?
 

cyberjock

Sorry, this place is just too frustrating for me. It's like a constant fight that's always uphill, against the wind, and in 2 feet of snow.

I've never heard of stackexchange so I assume I have no handle for it. :P

When I want to test USB sticks I use "Flash Drive Tester 1.14" (http://www.vconsole.com/client/?page=page&id=13). It has multiple modes: read, write, and "write, read and compare". Read and write are self-explanatory. Write, read and compare mode writes a test pattern to the flash drive, then reads the test pattern back looking for errors. Note that many memory sticks have reserve space that you don't have direct access to, so it's possible to have a USB stick that intermittently fails the tests. I've never encountered a stick that intermittently failed though. Each one either passed the test and worked fine, or failed at the same location every time (and also had corrupted data on it, which is what led me to test it).

Note that the modes that involve writing to your USB stick will erase all data on the stick.

I'm sure there's a way to test your USB stick from the command line. You could easily use dd to read and write, but I don't know how you'd do the "write, read and compare" test in FreeBSD.
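
One way a "write, read and compare" pass might be done without a separate tool is a short Python script. This is only a sketch under assumptions: write_pattern and verify_pattern are hypothetical helper names, the pattern is pseudo-random data regenerated from a seed, and the demo runs against a scratch file rather than a real stick. Pointing it at a raw device node would erase whatever is on the device, and the operating system's cache may satisfy the read pass without touching the flash, so a clean result on a device is weaker evidence than one from a dedicated tester.

```python
import os
import random
import tempfile


def _pattern(seed, index, block_size):
    """Deterministic pseudo-random block, so the read pass can regenerate it."""
    rng = random.Random("%d:%d" % (seed, index))
    return rng.getrandbits(block_size * 8).to_bytes(block_size, "little")


def write_pattern(path, total_bytes, block_size=4096, seed=1):
    """DESTRUCTIVE: overwrite the first total_bytes of path with the pattern."""
    with open(path, "r+b") as f:
        for i in range(total_bytes // block_size):
            f.write(_pattern(seed, i, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out


def verify_pattern(path, total_bytes, block_size=4096, seed=1):
    """Read the pattern back and return byte offsets of blocks that differ."""
    bad = []
    with open(path, "rb") as f:
        for i in range(total_bytes // block_size):
            if f.read(block_size) != _pattern(seed, i, block_size):
                bad.append(i * block_size)
    return bad


# Demo on a scratch file (an assumption; a real test would target the stick).
fd, scratch = tempfile.mkstemp()
os.close(fd)
try:
    write_pattern(scratch, 64 * 1024)
    print(verify_pattern(scratch, 64 * 1024))  # empty list: every block matched
finally:
    os.remove(scratch)
```

Splitting the write and verify passes means the stick can be unplugged and reinserted between them, which is one way to reduce the chance that the read is served from cache.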
 

mskenderian

Stack Exchange is awesome; they have categories for pretty much everything. Check out stackexchange.com/sites. Basically it's a Q&A site: people ask questions, other people answer, you can vote answers up or down, and the creator of the question chooses the best answer. I like this method better than a forum.

Anyway, going back to this: again today I was unable to SSH into the box. I went to the machine and there were a bunch of errors. I didn't know how to get back to the console screen (how do you do that?), but I pressed keys, and when I pressed Print Screen it took me to a local shell. I couldn't type anything in there, so I did a forced restart.

I am still thinking my USB flash drive is not good, but I'm not sure.

Now my system shows my scrub is still not complete (53% done), so I just stopped it. Now I am going to run smartctl -t long /dev/ada3 and I will report the results.

As you said, it showed "Please wait 464 minutes for test to complete." But how can I check whether it has completed, or what the progress of the test is?
 

mskenderian

Never mind, I see the status when I run smartctl -a /dev/ada3...
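
The "test remaining" figure can also be pulled out of the smartctl -a text by a script. A minimal Python sketch, assuming the wording "% of test remaining" shown in the output earlier in this thread; selftest_remaining is a hypothetical helper name, and other smartctl versions or drives may word the status differently.

```python
import re

# Status lines copied from the smartctl output posted in this thread.
SAMPLE = """\
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
"""


def selftest_remaining(smartctl_output):
    """Return the percent of the running self-test still to go, or None."""
    match = re.search(r"(\d+)% of test remaining", smartctl_output)
    return int(match.group(1)) if match else None


print(selftest_remaining(SAMPLE))  # 90
```

A return value of None only means the phrase was not found, which is what one would expect once no test is in progress; the self-test log at the bottom of the smartctl -a output then shows the result.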

How do we get back to the console menu on the FreeNAS box after I hit the Print Screen button, which takes me to a shell but asks me to log in?
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Why would you hit the Print Screen button, instead of just 9, for the shell?

If you're in the shell, type exit <return> to get back to the console menu.


 

mskenderian

If you read the rest of my posts, I was saying it freezes up and I don't see anything. Normally I hit Esc, then Return, and the console shows up, but because of the problems I am having it won't let me do that. So I started pressing keys and noticed that the Print Screen key took me to a shell. The only difference is that I need to log in when I do that, and I have no way of getting back to the console.
 

gpsguy

Since I read all the messages on the forum (with the exception of the plugins), I have read your posts. Sorry, I missed the part about pressing random keys.

Since it appears that you can log in to the shell after pressing Print Screen, can you type "exit" and have the console menu reappear? "exit"ing displays the menu for me.
 

mskenderian

If I go to the shell (#9) and then exit, it does that for me too. But if I press Print Screen and then log in (yes, I have to log in with username and password), exit doesn't do it for me. Not sure why.
 

mskenderian

I ran the long test and it said it passed. I took the HD offline and ran it again, and it still said it passed, but I was still getting errors from the HD. So I WIPED it (FULL, with zeros). Now I am trying to replace the drive in the pool and I get this message.

manage.py: [middleware.exceptions:38] [MiddlewareError: Unable to GPT format the disk "ada3"]

Why can't I replace this disk?
 

mskenderian

I rebooted the machine, went into my SATA/RAID controller and did a low-level format. Now I am able to attach my disk again, and it rebuilt the array. I ran zpool status and I see this.

Code:
  pool: Data
 state: DEGRADED
  scan: resilvered 1.03T in 18h48m with 0 errors on Thu Jun 13 10:20:57 2013
config:

        NAME                                              STATE     READ WRITE CKSUM
        Data                                              DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/3f27b294-cd80-11e2-bb99-00e0531360af    ONLINE       0     0     0
            gptid/3fa3f08b-cd80-11e2-bb99-00e0531360af    ONLINE       0     0     0
            gptid/405d6fe5-cd80-11e2-bb99-00e0531360af    ONLINE       0     0     0
            replacing-3                                   DEGRADED     0     0     0
              4393343433214444512                         OFFLINE      0     0     0  was /dev/gptid/41514f72-cd80-11e2-bb99-00e0531360af
              gptid/00d6511c-d3b0-11e2-9a48-00e0531360af  ONLINE       0     0     0

errors: No known data errors


It shows that the resilver is complete, but under that it still shows replacing-3 as DEGRADED.

I don't understand: if it is complete, why is it not replacing it?
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
You need to click "Detach" for it in the volume status screen

or

zpool detach Data 4393343433214444512
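
The leftover entry can also be spotted programmatically. The sketch below, in Python, scans zpool status text for children of a replacing-N vdev that are not ONLINE and builds the matching zpool detach command without running it. The function names stale_replacing_members and detach_commands and the indentation-based parsing are assumptions for illustration; the sample text is the status output posted above.

```python
# Config section copied from the zpool status output posted in this thread.
SAMPLE = """\
        NAME                                              STATE    READ WRITE CKSUM
        Data                                              DEGRADED    0    0    0
          raidz2-0                                        DEGRADED    0    0    0
            gptid/3f27b294-cd80-11e2-bb99-00e0531360af    ONLINE      0    0    0
            replacing-3                                  DEGRADED    0    0    0
              4393343433214444512                        OFFLINE      0    0    0  was /dev/gptid/41514f72-cd80-11e2-bb99-00e0531360af
              gptid/00d6511c-d3b0-11e2-9a48-00e0531360af  ONLINE      0    0    0

errors: No known data errors
"""


def stale_replacing_members(zpool_status_output):
    """Return names of non-ONLINE devices listed under a replacing-N vdev."""
    stale = []
    parent_indent = None  # indent of the replacing-N line we are inside, if any
    for line in zpool_status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            parent_indent = None
            continue
        indent = len(line) - len(line.lstrip())
        if fields[0].startswith("replacing-"):
            parent_indent = indent
        elif parent_indent is not None and indent > parent_indent:
            if fields[1] != "ONLINE":
                stale.append(fields[0])
        else:
            parent_indent = None
    return stale


def detach_commands(pool, zpool_status_output):
    """Build (but do not run) one zpool detach command per stale member."""
    return ["zpool detach %s %s" % (pool, name)
            for name in stale_replacing_members(zpool_status_output)]


print(detach_commands("Data", SAMPLE))
# ['zpool detach Data 4393343433214444512']
```

Printing the command rather than executing it leaves the final decision to whoever is at the keyboard, which seems the safer choice for anything that modifies a pool.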
 

mskenderian

Awesome, it worked. I read the ZFS manual and it didn't say anything about that. Can you explain why that is? What is that disk?
 

cyberjock

It does say that in the manual. Check out 6.3.11 step 4.

If the replaced disk continues to be listed after resilvering is complete, use the Detach button to remove the disk from the list.
 