security and run output messages

Status
Not open for further replies.

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Those drives appear to have had several errors in the past and should not be trusted. Those errors came from heat issues, which could have caused permanent damage.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
If the drive is still under warranty, contact the manufacturer and give them
what you have. With luck, you may still qualify for a replacement.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Two things: Check for a bad cable (CRC errors) and the drive is probably dying (3 reallocated sectors).
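As a rough illustration of where those two numbers come from, here is a minimal Python sketch that pulls the raw values of SMART attributes 5 (Reallocated_Sector_Ct) and 199 (UDMA_CRC_Error_Count) out of `smartctl -A` text. The sample text is made up for illustration (only the 3 reallocated sectors come from this thread), and the function name `raw_values` is hypothetical.

```python
import re

# Illustrative sample in the layout printed by `smartctl -A`; the values are
# assumptions, except the 3 reallocated sectors mentioned in the thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       3
194 Temperature_Celsius     0x0022   112   095   000    Old_age   Always       -       31
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

# Attribute IDs worth watching here: 5 points at the drive, 199 at the link.
WATCHED = {5: "Reallocated_Sector_Ct", 199: "UDMA_CRC_Error_Count"}


def raw_values(smartctl_output):
    """Return {attribute_id: raw_value} for the watched attributes."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED:
                # Take the leading integer of the RAW_VALUE column.
                match = re.match(r"\d+", fields[9])
                if match:
                    found[attr_id] = int(match.group())
    return found


for attr_id, value in sorted(raw_values(SAMPLE).items()):
    print(WATCHED[attr_id], value)
```

Comparing the CRC count before and after reseating or swapping the cable is one way to tell whether the cable was the cause: the raw count does not reset, so what matters is whether it keeps climbing.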
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
is there a more intelligent way to test the cable? and would those sata cables go bad after a time, i've not had reported crc errors for months. and is the serial number reported in the smartctl output the same as the manufacturer's serial number?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
thank you for those answers. last question: when i see crc errors, does that explicitly mean that there is data corruption or that it was noticed and repaired?
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
If ZFS does not tell you explicitly that you have data corruption in files or metadata, you probably don't. It may be worth doing a scrub to correct any correctable errors and make sure there are no errors in data that hasn't been accessed recently.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
thank you for those answers. last question: when i see crc errors, does that explicitly mean that there is data corruption or that it was noticed and repaired?

It means that the drive found errors in the transmitted data and requested that it be retransmitted.
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
so if i am running raidz2 and all the other drives come back fine after a long smart test, would it be that bad to run it until it actually does fail? it is not under warranty and i don't have the money to replace it until the end of the month.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Since your drives have been running hot (discovered just a day or two ago) you would be risking
your entire pool of data. If you continue to use the server and this one disk fails, AND before the
end of the month another drive craps out, then you now have no redundancy left and the next drive
to fail takes your data with it when it dies.
If the drive cannot be replaced until funds become available, shut down your server and wait it out.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Since the drive is still more or less running along, you can keep it running until a replacement arrives.

But please get those temperatures down!
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
good news! i got the temps down (ranging from 26-31 now). but after all that, i had a drive fail while setting it up. interestingly enough, it wasnt the one that we have been talking about this whole time. so i broke down and got another drive, but have no idea how to replace it. when i run zpool status -v, i get:

Code:
[root@freenas] ~# zpool status -v
  pool: volume
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 3.43M in 0h0m with 0 errors on Tue Apr 14 18:48:23 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/92e122fb-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/8e751139-6874-11e4-a79e-c86000cb131c  ONLINE       0     0     0
            gptid/93c736f3-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            15112422042946592908                        UNAVAIL      0     0     0  was /dev/gptid/95402b37-5a13-11e4-919c-c86000cb131c
            4162859240354443090                         UNAVAIL      0     0     0  was /dev/gptid/95c39268-5a13-11e4-919c-c86000cb131c
            gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0


but when i view disks, i see 8 of the 9 online.

upload_2015-4-18_18-2-39.png


which one do i replace? and why does the one screen show 7 drives healthy, and another show 8?
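For reference, the pool members that ZFS cannot open are the rows carrying a `was /dev/gptid/...` note in the `zpool status -v` output above. A minimal Python sketch that picks them out, assuming the column layout shown in this post (the function name `missing_members` is hypothetical):

```python
# Two healthy rows and the two UNAVAIL rows, copied from the output above.
SAMPLE = """\
  gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE  0  0  0
  15112422042946592908  UNAVAIL  0  0  0  was /dev/gptid/95402b37-5a13-11e4-919c-c86000cb131c
  4162859240354443090  UNAVAIL  0  0  0  was /dev/gptid/95c39268-5a13-11e4-919c-c86000cb131c
  gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE  0  0  0
"""


def missing_members(status_text):
    """Return (guid, state, former_path) for rows that carry a 'was <path>' note."""
    missing = []
    for line in status_text.splitlines():
        fields = line.split()
        # Expected columns: NAME STATE READ WRITE CKSUM was <former path>
        if len(fields) >= 7 and fields[5] == "was":
            missing.append((fields[0], fields[1], fields[6]))
    return missing


for guid, state, former_path in missing_members(SAMPLE):
    print(guid, state, former_path)
```

The gptid values of the members that are still ONLINE can then be matched to physical disks (on FreeBSD-based systems, `glabel status` lists gptid labels next to their partitions), which leaves the disks that are not accounted for.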
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
thanks! thats a nice script! turned out that i didnt plug in one of the drives. it is resilvering now.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
You're welcome ;)

And be careful because one more drive out and the data is lost, don't play with fire... :)
 

liquidice

Dabbler
Joined
Nov 15, 2014
Messages
20
drive replaced and all is healthy. new last question: when i get emails like this:
Code:
  pool: volume
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 40K in 20h1m with 0 errors on Mon Apr 20 11:22:46 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/92e122fb-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/8e751139-6874-11e4-a79e-c86000cb131c  ONLINE       0     0     0
            gptid/93c736f3-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/95402b37-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/92f858cb-e622-11e4-9d41-90e2ba66cab8  ONLINE       0     0     0
            gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0

errors: No known data errors


according to the link, parity has fixed the corruption and my data is all well and good. but, it also says that if i continue getting these, it could be a drive failing. if i don't clear it, i get the email every day. is it the same message? is it saying that i should clear them and if i continue getting them after clearing them then i may have a problem, or should this be going away all by itself eventually (when the next scrub takes place)?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You have 3 disks with 2 checksum errors each. Usually that's a sign of a failing disk, but SMART can tell you more.

In the meantime I'd log those 3 gptids, figure out which disks those are, and figure out whether they have a problem, share a common controller or cable, etc.

Then, do a "zpool clear" and do another scrub and see if any errors develop. If they come back, you've got something to be concerned about.
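To make the "log those gptids" step concrete, here is a minimal Python sketch that lists the devices with non-zero READ, WRITE or CKSUM counters in `zpool status` text. The sample is a shortened copy of the config section from the email above, and the function name `devices_with_errors` is hypothetical. Counter values that are not plain integers are simply skipped in this sketch.

```python
# Shortened config section from the status email quoted in this thread.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/95402b37-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {device_name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        counts = fields[2:5]
        # Skip anything that is not a device row (e.g. the "errors:" line).
        if not all(c.isdigit() for c in counts):
            continue
        read, write, cksum = (int(c) for c in counts)
        if read or write or cksum:
            flagged[fields[0]] = (read, write, cksum)
    return flagged


for name, counters in devices_with_errors(SAMPLE).items():
    print(name, counters)
```

Saving this result before running `zpool clear volume` and comparing it with the result after the next scrub shows whether the same devices pick up errors again.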
 