Possible failure in near future?

Don1919 · Apr 4, 2014

state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: resilvered 5.64M in 0h0m with 0 errors on Fri Apr 4 21:10:21 2014
config:

NAME                                            STATE    READ WRITE CKSUM
RaidZ1                                          ONLINE      0     0     0
  raidz1-0                                      ONLINE      0     0     0
    gptid/299dce8e-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0     0  block size: 512B configured, 4096B native
    gptid/2a43be61-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0     0  block size: 512B configured, 4096B native
    gptid/2ae81dc9-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0    39  block size: 512B configured, 4096B native

errors: No known data errors

(All 3 drives are Seagate 2TB drives, model ST2000DL003.)

As of last night I actually had a degraded-pool error, and the drive that is now online with 39 checksum errors wasn't even running or being detected.

Today (4/4/14) I ran SeaTools on the drive that had stopped showing up in FreeNAS. On boot it beeped 3 or 4 times, turned off, then rebooted and ran fine for the 8 hours it took to run both general scans (short and long), both of which came back with no errors whatsoever.

Upon placing it back into my NAS I get the status output I posted at the top. I've already purchased a new drive, but now that this one seems to be up and running with no issues, I'm wondering if it's worth replacing?

Is anyone able to shed some light on this for me? This is the first time I've ever had a failed drive in my NAS.

Edit: Started a scrub, and it looks like it's up to 377 checksum errors now. Assuming it's time to replace. However, the scrub says 6 hours remaining at the moment.

joeschmuck · Apr 5, 2014

Replace your SATA cable first. I'm not saying the drive is good, just that this is the first thing to try. If you could post the output of 'smartctl -a /dev/adaX' (with adaX being that drive), that could help.

Don1919 · Apr 5, 2014

state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 68.9M in 4h50m with 0 errors on Sat Apr 5 02:35:42 2014

config:

NAME                                            STATE    READ WRITE CKSUM
RaidZ1                                          ONLINE      0     0     0
  raidz1-0                                      ONLINE      0     0     0
    gptid/299dce8e-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0     0  block size: 512B configured, 4096B native
    gptid/2a43be61-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0     0  block size: 512B configured, 4096B native
    gptid/2ae81dc9-02e3-11e2-9dbf-50e549deb030  ONLINE      0     0 18.4K  block size: 512B configured, 4096B native

This is after the scrub: 18.4K checksum errors now.

Here's the output from the command you listed.

Edit: had to upload a photo; the output looked fine when pasted in but was completely messed up after posting.

joelmusicman · Apr 5, 2014

You *ARE* running ECC RAM, correct?

Don1919 · Apr 5, 2014

joelmusicman said:
You *ARE* running ECC RAM, correct?

TBH, it's some old memory I used in my old desktop. I tried finding more info in my purchase history but couldn't find it.

However, this is the board/proc it's running - http://www.newegg.com/Product/Product.aspx?Item=N82E16813128452

It has two 4GB sticks of DDR3 1333MHz.

I'll keep digging, but I can't say whether it is or isn't. Since it was just standard desktop memory, I'd assume it wasn't ECC.

joelmusicman · Apr 5, 2014

Don1919 said:
I'll keep digging, but I can't say whether it is or isn't. Since it was just standard desktop memory, I'd assume it wasn't ECC.

That's usually a pretty safe assumption.

You may be too late since you just did a scrub, but you should definitely think about backups for your important media if you haven't already.

joeschmuck · Apr 5, 2014

ECC RAM has nothing to do with this type of problem; otherwise I would have mentioned it. That said, having ECC RAM is a very good thing when dealing with ZFS. Lack of ECC RAM could result in data corruption during a scrub, not in drives going offline or showing checksum errors. Also, running a scrub does no harm provided your RAM is fine; that is the risk you take by not using ECC RAM. Again, RAM is not the issue in this thread.

Don't let all those high values in that SMART report concern you too much; they can be that high under normal conditions for many drives. If you have nonzero raw values in IDs 5, 196, or 197, then you are looking at a hard drive failure. I would replace the SATA cable first (I prefer locking cables). If you have another open SATA port, you could plug the drive in there as well, in case it's your motherboard connector.
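For anyone who would rather check those three IDs from saved text than by eye, here is a minimal Python sketch. It assumes the usual ten-column attribute table that 'smartctl -A' prints; the SAMPLE rows and the function name flag_failing_attributes are made up for illustration and are not from the drive in this thread.

```python
import re

# SMART attribute IDs named in the post above.
WATCHED_IDS = {5, 196, 197}

# Hypothetical attribute rows, NOT taken from the drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       151251544
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def flag_failing_attributes(smartctl_text):
    """Return {id: (name, raw)} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        # RAW_VALUE is the 10th column; take its leading integer.
        match = re.match(r"\d+", fields[9])
        raw = int(match.group()) if match else 0
        if raw != 0:
            flagged[attr_id] = (fields[1], raw)
    return flagged


for attr_id, (name, raw) in flag_failing_attributes(SAMPLE).items():
    print(f"ID {attr_id} ({name}) has raw value {raw}")
```

An empty result only means those three counters are zero; it says nothing about cabling, which is why the cable swap still comes first.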

joelmusicman · Apr 5, 2014

Swap the cable and then do short and long SMART tests before deciding that the HDD is bad...

Don1919 · Apr 5, 2014

Cable has been swapped; I'll run another short and long test to be sure. However, when I ran them before, it was with a different cable, and they came back with zero errors.

Outside of this, any other suggestions? Or is it pretty much a waiting game to see if it actually fails?

Edit: after rebooting with the new cable, the checksum error count says 0. Is that normal?

joeschmuck · Apr 5, 2014

Just wait to see what happens. The problem you had is indicative of a SATA cable connection issue.
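While waiting, the CKSUM column is the thing to watch. A rough Python sketch of pulling it out of saved 'zpool status' text follows. It assumes each device row sits on one line as NAME STATE READ WRITE CKSUM, and it treats suffixes like K as multiples of 1024, which is an assumption about how the counts are abbreviated. The names parse_count and devices_with_cksum_errors are hypothetical; the sample rows are the ones posted earlier in this thread.

```python
# Device states that can appear in the STATE column of 'zpool status'.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

# Assumption: abbreviated counts such as 18.4K use 1024-based suffixes.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# Device rows from the post-scrub output earlier in this thread.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
RaidZ1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/299dce8e-02e3-11e2-9dbf-50e549deb030 ONLINE 0 0 0 block size: 512B configured, 4096B native
gptid/2a43be61-02e3-11e2-9dbf-50e549deb030 ONLINE 0 0 0 block size: 512B configured, 4096B native
gptid/2ae81dc9-02e3-11e2-9dbf-50e549deb030 ONLINE 0 0 18.4K block size: 512B configured, 4096B native
"""


def parse_count(token):
    """Convert a counter such as '39' or '18.4K' to an approximate float."""
    token = token.upper()
    if token and token[-1] in SUFFIXES:
        return float(token[:-1]) * SUFFIXES[token[-1]]
    return float(token)


def devices_with_cksum_errors(status_text):
    """Return {device_name: approximate CKSUM count} for rows with a nonzero count."""
    result = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Skip anything that is not a NAME STATE READ WRITE CKSUM row.
        if len(fields) < 5 or fields[1] not in STATES:
            continue
        try:
            cksum = parse_count(fields[4])
        except ValueError:
            continue
        if cksum > 0:
            result[fields[0]] = cksum
    return result


for name, count in devices_with_cksum_errors(SAMPLE).items():
    print(f"{name}: about {count:.0f} checksum errors")
```

If the count stays at 0 across a few scrubs with the new cable, that points toward the cable or connector rather than the drive.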
