So, how much time do I have !!??

Status
Not open for further replies.

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Hello Forum,

I recently replaced all my 1TB hard drives with 2TB drives using the replace-one-drive, resilver, replace-another, resilver method until all three drives were replaced (3 drives, RAIDZ). This was over 6 weeks ago now. I made sure smartd was off, as the drives I bought (Samsung HD204UI) do not work well with smartd and can suffer data corruption if smartd is enabled and the updated firmware is not installed.

Well, over the last week or so I have noticed that one (or maybe two) of the drives are making some very bad "clicking" noises every now and again. I have been backing up all my data (movies, music, VMs, etc.) to other places using various methods (I don't really care where, as long as my data is safe!). Whilst I have been frantically getting the data backed up I have been watching the activity lights on the FreeNAS box (there is a separate light for each drive). Two of the drives get activity, then the third drive, then all goes quiet for about 5 seconds and my copies to external destinations pause (I'm using robocopy for most of my copies, so the copy recovers by itself). Then all three get activity, almost as if all is OK.

So, the burning question... How much time do I realistically have before my hard drives fail ? Does anyone know of any tests I can perform to say for definite that one (or maybe two) of my drives are trashed ?

I have run a scrub; it identified some bad data but fixed it, and zpool status does not show any errors ??

Thanks in advance
 

Joshua Parker Ruehlig

Hall of Famer
Joined
Dec 5, 2011
Messages
5,949
Did you flash the new firmware? It's not too hard: just make a FreeDOS USB stick, add the firmware flasher to it, boot off it and run the flasher.

EDIT
What sucks is that the drive doesn't report a new firmware version afterwards (a bug), but I just do it any time I get an HD204UI so I know I'm safe.
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Hi,

Thanks for the prompt response....

When you flashed the drive, did you lose any data ?
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
leenux_tux said: Hi, Thanks for the prompt response.... When you flashed the drive, did you lose any data ?

I have a couple of the HD204UIs. I *wouldn't* flash them. Flashing was only necessary for drives manufactured BEFORE December two years ago (I think). I seriously doubt this is your problem, and adding another variable to an existing problem is a bad idea. Continue backing them up; there is no way at all to tell how long you have before they fail. Doing scrubs and diagnostics only stresses the disks more and lessens your chance of pulling all of your data off if a drive fails during a diagnostic. GET YOUR DATA OFF FIRST, then play!
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
Like protosd, I have the HD204UI, 11 of them spread across 2 systems. No flashing necessary. SMARTD works fine with my drives. I second the advice to back up, then run a SMART long test on your drives from the FreeBSD console to see what's going on. Unfortunately, there's no way to predict how much time you have, so keep backing up, starting with the most critical data first, then go from there.
 

Joshua Parker Ruehlig

Hall of Famer
Joined
Dec 5, 2011
Messages
5,949
Yeah, like protosd said, flashing is only needed for the really old versions of these, but I always do it anyway before throwing one of these drives into my system. Definitely agree to run a long SMART test after backing up.
If you have errors and you're still under warranty then RMA it with Seagate and get a refurb from them.
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Well, managed to get all my data saved, have flashed one of the drives and now it's time to start testing..

The reason I say one of the drives is because only one drive is a "true" Samsung HD204UI, the other two are Seagate HD204UI's but labeled as Samsungs..... go figure......

Initially I thought I would be OK but this happened....

I have run "smartctl -H /dev/DISKID" for my three drives and all have come back with a value of "passed". However, one of my drives has something called a "pre-fail" failure ? On disk spin-up time ?

--------------------------------------------------------------------------------

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0023   067   001   025    Pre-fail  Always   In_the_past 10160

--------------------------------------------------------------------------------
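
For anyone puzzling over the Spin_Up_Time line: as far as I understand smartctl's attribute table, WHEN_FAILED reads In_the_past when the WORST value (001 here) has at some point dropped to or below THRESH (025) even though the current VALUE (067) is back above it, and FAILING_NOW when the current VALUE itself is at or below THRESH. A minimal sketch of that comparison, assuming the ten-column attribute row layout shown above (classify_attr is a made-up helper name, not part of smartmontools):

```shell
# classify_attr: read `smartctl -A` style attribute rows on stdin and report,
# for each row, whether the normalized VALUE or WORST has reached THRESH.
# Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
classify_attr() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
    value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
    if (value <= thresh)      state = "FAILING_NOW"   # failing right now
    else if (worst <= thresh) state = "In_the_past"   # dipped below earlier
    else                      state = "ok"
    printf "%s %s value=%d worst=%d thresh=%d %s\n", $1, $2, value, worst, thresh, state
  }'
}

# Example (device name is only an example):
#   smartctl -A /dev/ada0 | classify_attr
```

Fed the Spin_Up_Time row above, this prints "3 Spin_Up_Time value=67 worst=1 thresh=25 In_the_past", i.e. the drive is currently above the threshold but was not at some earlier point.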


Then, after enabling SMART for all of my drives via the FreeNAS GUI and setting email notifications..... I got a message stating that another drive had 5 pending sectors...

The following warning/error was logged by the smartd daemon:

Device: /dev/ada2, 5 Currently unreadable (pending) sectors

I'm starting to think that my drives could be "toast".....

Any recommendations for further testing would be greatly appreciated, especially if there are any "smartctl" gurus out there....
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
leenux_tux said: I'm starting to think that my drives could be "toast"..... Any recommendations for further testing would be greatly appreciated, especially if there are any "smartctl" guru's out there....

Stephens said: I second the advice to back up, then run a SMART long test on your drives from the FreeBSD console to see what's going on.

I can't add much more than that. You want the "-t long" option.
http://smartmontools.sourceforge.net/man/smartctl.8.html
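
In case it helps, a rough sketch of kicking off the long test on each drive in turn. The device names are assumptions (ada0 and ada2 are mentioned in this thread, ada1 is a guess at the third drive), and start_long_tests plus the SMARTCTL override are made-up conveniences for dry runs, not smartmontools features:

```shell
# start_long_tests DEV...: ask each drive to begin an extended (long) self-test.
# The test runs inside the drive itself; smartctl returns right away after
# starting it, so the loop does not wait for any test to finish.
# SMARTCTL is a hypothetical override so the loop can be dry-run with echo.
start_long_tests() {
  for dev in "$@"; do
    "${SMARTCTL:-smartctl}" -t long "$dev"
  done
}

# Example (run as root; device names are assumptions):
#   start_long_tests /dev/ada0 /dev/ada1 /dev/ada2
# Dry run that only prints the arguments that would be passed:
#   SMARTCTL=echo start_long_tests /dev/ada0 /dev/ada2
```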
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Thanks for the prompt reply. Running with the "-t long" option right now. Do you know where the output gets placed ? Is it a file somewhere ? Or do the results from the test get output to the console when complete ?
 

tingo

Contributor
Joined
Nov 5, 2011
Messages
137
The output gets logged in the drive's SMART self-test log, and any errors in the SMART error log.
BTW, as long as the number of errors on a drive stays the same and doesn't increase, the drive may live a long time.
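
To make that concrete: "smartctl -l selftest /dev/ada0" prints the self-test log and "smartctl -l error /dev/ada0" prints the error log (both are standard smartctl options; the device name is only an example). A small sketch for pulling out just the most recent self-test entry, assuming the usual log layout in which the newest entry is listed first as "# 1" (latest_selftest is a made-up helper name):

```shell
# latest_selftest: read `smartctl -l selftest` output on stdin and print only
# the most recent entry, which is assumed to be the line numbered "# 1".
latest_selftest() {
  awk '/^# *1 / { print; exit }'
}

# Example (device name is only an example):
#   smartctl -l selftest /dev/ada0 | latest_selftest
```

If the long test is still running or has never been run, there may be no matching entry yet, in which case the helper prints nothing.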
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
The saga continues...

I managed to get all my data off my zpool and decided to do some testing of one of the drives on another system.

I have been getting the following errors on two of my drives ...

/dev/ada2, 3 Currently unreadable (pending) sectors

/dev/ada0, 1 Offline uncorrectable sectors (this number has increased to 2)

Using a live Knoppix CD I booted up a laptop with "ada0" connected via some fancy SATA-to-USB connectors. The drive was recognized by Knoppix and I ran a program on it (TestDisk) which, among many other things, can be used to test for and locate bad sectors. I ran the program twice over the disk and no bad sectors were found ?

I then decided to zero the disk using dd if=/dev/zero of=/dev/sdb and, once complete, reinstate the drive and resilver. I started this process on Sunday. 48 hours later it's still running ?

The drive is 2TB, has anyone else "zero'd" a 2TB drive ? I'm interested in seeing how long it took ?
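
Some back-of-the-envelope arithmetic may help here. With no bs= argument, dd writes in 512-byte blocks, which tends to be slow, and a USB 2.0 adapter usually tops out somewhere around 30 MB/s in practice (that figure, and the adapter being USB 2.0 at all, are assumptions rather than measurements of this setup). A rough sketch of the expected duration at a given sustained write rate (est_hours is a made-up helper name):

```shell
# est_hours BYTES MB_PER_SEC: whole hours, rounded up, needed to write BYTES
# at a sustained rate of MB_PER_SEC (1 MB taken as 1,000,000 bytes).
est_hours() {
  bytes=$1
  rate=$2
  secs=$(( bytes / (rate * 1000000) ))   # total seconds at that rate
  echo $(( (secs + 3599) / 3600 ))       # round up to whole hours
}

# Examples for a 2TB (2,000,000,000,000 byte) drive:
#   est_hours 2000000000000 30   # assumed USB 2.0 rate
#   est_hours 2000000000000 10   # much slower, e.g. tiny block size
```

At the assumed 30 MB/s this works out to 19 hours, and at 10 MB/s to 56 hours, so a run that is still going after 48 hours suggests the drive is being written at less than about 12 MB/s. On Linux, something like "dd if=/dev/zero of=/dev/sdb bs=1M" should move things along faster, and sending a running GNU dd the USR1 signal makes it print how much it has written so far.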

Thanks in advance
 

tingo

Contributor
Joined
Nov 5, 2011
Messages
137
It is "normal" that a test program (testdisk in this case) won't find any bad sectors. When SMART reports unreadable sectors, the drive firmware remaps (moves the data) to another (good) sector from a pool of spare sectors. The problem is when the drive runs out of spare sectors; then you lose data.
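
A possible refinement, as far as I understand it: a "pending" sector has not actually been remapped yet. The drive typically only decides whether to keep or reallocate it when the sector is next written, which may be why a full overwrite can clear the pending count. Three counters are worth watching over time: Reallocated_Sector_Ct (ID 5), Current_Pending_Sector (197) and Offline_Uncorrectable (198). A minimal sketch that pulls their raw values out of "smartctl -A" output, assuming the standard ten-column attribute table (sector_counters is a made-up helper name):

```shell
# sector_counters: read `smartctl -A` output on stdin and print the ID, name
# and raw value of the three bad-sector counters (attributes 5, 197, 198).
# Assumes the raw value is the tenth whitespace-separated column.
sector_counters() {
  awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}

# Example (device name taken from the smartd warning earlier in the thread):
#   smartctl -A /dev/ada2 | sector_counters
```

Running it every day or so and comparing the numbers gives a feel for the trend: counts that stay put are less worrying than counts that keep climbing.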
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
I have been doing some more tests with my "failing" hard drives. Luckily, as I have mentioned previously I have all my data backed up so I can do whatever I want with the drives...


Test1...
  • Connect one of the drives to my laptop, boot up the Knoppix live CD, fdisk the whole disk and create an ext2 file system across the entire disk
  • Run fsck across the whole disk. No errors found.

Test2...
  • Download and burn "SeaTools for DOS" (Seagate disk tools utility) to a CDROM
  • Put the hard drive used in Test1 back in the FreeNAS system, boot up the system using the SeaTools for DOS CDROM and run a "Short Test". The test fails. Oh nuts, the disk is toast, or is it ?
  • Completely re-format the drive using SeaTools for DOS. This takes around 7 hours to complete. No errors reported on the format
  • Re-run the short test. Et voilà!! The test passes !!

Test3...
  • Download SeaTools for Windows, install it to my laptop which is running XP.
  • Connect up another of my "bad" hard drives (not the one in Test1) and run the short test. The test fails. Same as above: is the drive "really" toast?
  • Run a format (long). This takes way longer than the DOS version of SeaTools, around 12 hours in total. The re-format completes without any errors.
  • Run a short test. The test works fine, no errors. What !!??

OK, so what gives ? SMART is telling me I have errors on 2 of my drives, yet after running a low-level format using the hard drive manufacturer's tools and then running tests, I don't get errors ? And this is for both drives ?

Is ZFS really tough on hard drives by any chance ? One thing I have noticed whilst running all these tests is that none of the drives make that awful clicking noise any more, which they did when I was using RAIDZ. Which "tool" should I trust, SMART or SeaTools ??



I'm currently running a SeaTools "long test" against one of the drives; it's around 45% complete. If this comes back with no errors I'm gonna be even more confused than I am now !
 