security and run output messages

SweetAndLow · Apr 14, 2015

Those drives appear to have had several errors in the past and should not be trusted. Those errors came from heat issues, which could have caused permanent damage.

BigDave · Apr 14, 2015

If the drive is still under warranty, contact the manufacturer and give them
what you have. With luck, you may still qualify for a replacement.

Ericloewe · Apr 14, 2015

Two things: Check for a bad cable (CRC errors) and the drive is probably dying (3 reallocated sectors).
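To make the two indicators concrete, here is a minimal Python sketch that pulls the raw values of SMART attribute 5 (Reallocated_Sector_Ct) and attribute 199 (UDMA_CRC_Error_Count) out of `smartctl -A` style text. The sample text and its values are invented for illustration and are not taken from the drives in this thread.

```python
# Hypothetical sample in the layout of `smartctl -A`; all values are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       3
194 Temperature_Celsius     0x0022   110   098   000    Old_age   Always       -       40
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

# Attribute IDs worth watching: 5 points at the disk surface, 199 at the link/cable.
WATCHED = {5: "Reallocated_Sector_Ct", 199: "UDMA_CRC_Error_Count"}


def raw_values(text):
    """Return {attribute_id: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() check.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED:
                found[attr_id] = int(fields[9])  # RAW_VALUE is the 10th column
    return found


for attr_id, value in raw_values(SAMPLE).items():
    print(WATCHED[attr_id], value)
```

A rising raw value on 199 suggests the cable or connector; a rising raw value on 5 suggests the disk itself.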

liquidice · Apr 14, 2015

is there a more intelligent way to test the cable? and would those sata cables go bad after a time, i've not had reported crc errors for months. and is the serial number reported in the smartctl output the same as the manufacturer's serial number?

Ericloewe · Apr 14, 2015

liquidice said:
is there a more intelligent way to test the cable?

Replacing it and seeing if that solves the problem.

liquidice said:
and would those sata cables go bad after a time, i've not had reported crc errors for months.

Possible, but unlikely. It's a quick and easy fix, so no harm done.

liquidice said:
and is the serial number reported in the smartctl output the same as the manufacturer's serial number?

Yes it is.
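If you want to compare the serial against the label on the drive without reading the whole report, a small sketch like the following can pull the "Serial Number:" line out of `smartctl -i` style text. The sample text, including the model and serial, is invented for illustration.

```python
import re

# Hypothetical excerpt in the layout of `smartctl -i`; model and serial are invented.
SAMPLE_INFO = """\
Device Model:     EXAMPLE DISK 3000
Serial Number:    EX-1234ABCD
Firmware Version: 1.0
"""


def serial_number(text):
    """Return the value of the 'Serial Number:' line, or None if it is absent."""
    match = re.search(r"^Serial Number:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None


print(serial_number(SAMPLE_INFO))
```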

liquidice · Apr 14, 2015

thank you for those answers. last question: when i see crc errors, does that explicitly mean that there is data corruption or that it was noticed and repaired?

rogerh · Apr 14, 2015

If zfs does not tell you explicitly that you have data corruption in files or metadata, you probably don't. It may be worth doing a scrub to correct any correctable errors and make sure there are no errors in data that hasn't been accessed recently.

Ericloewe · Apr 14, 2015

liquidice said:
thank you for those answers. last question: when i see crc errors, does that explicitly mean that there is data corruption or that it was noticed and repaired?

It means that the drive found errors in the transmitted data and requested that it be retransmitted.

liquidice · Apr 14, 2015

so if i am running raidz2 and all the other drives come back fine after a long smart test, would it be that bad to run it until it actually does fail? it is not under warranty and i don't have the money to replace it until the end of the month.

BigDave · Apr 14, 2015

Since your drives have been running hot (discovered just a day or two ago) you would be risking
your entire pool of data. If you continue to use the server and this one disk fails, AND before the
end of the month another drive craps out, then you now have no redundancy left and the next drive
to fail takes your data with it when it dies.
If the drive cannot be replaced until funds become available, shut down your server and wait it out.
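The arithmetic behind that warning can be written out as a tiny sketch. It assumes a single raidz vdev and only counts whole-disk failures; the function name is made up for this example.

```python
def vdev_status(parity_disks, failed_disks):
    """Describe how much redundancy a raidz vdev has left.

    parity_disks is 1 for raidz1, 2 for raidz2, 3 for raidz3.
    """
    spare = parity_disks - failed_disks
    if spare < 0:
        return "pool lost"
    if spare == 0:
        return "no redundancy left"
    return f"can survive {spare} more failure(s)"


# raidz2 has two disks' worth of parity.
for failed in range(4):
    print(failed, "failed:", vdev_status(2, failed))
```

With raidz2, the first failure still leaves one disk of margin, the second leaves none, and the third loses the pool.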

Ericloewe · Apr 15, 2015

Since the drive is still more or less running along, you can keep it running until a replacement arrives.

But please get those temperatures down!

liquidice · Apr 18, 2015

good news! i got the temps down (ranging from 26-31 now). but after all that, i had a drive fail while setting it up. interestingly enough, it wasn't the one that we have been talking about this whole time. so i broke down and got another drive, but have no idea how to replace it. when i run zpool status -v, i get:

Code:

[root@freenas] ~# zpool status -v
  pool: volume
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 3.43M in 0h0m with 0 errors on Tue Apr 14 18:48:23 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/92e122fb-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/8e751139-6874-11e4-a79e-c86000cb131c  ONLINE       0     0     0
            gptid/93c736f3-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            15112422042946592908                        UNAVAIL      0     0     0  was /dev/gptid/95402b37-5a13-11e4-919c-c86000cb131c
            4162859240354443090                         UNAVAIL      0     0     0  was /dev/gptid/95c39268-5a13-11e4-919c-c86000cb131c
            gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0

but when i view disks, i see 8 of the 9 online.

which one do i replace? and why does the one screen show 7 drives healthy, and another show 8?

Bidule0hm · Apr 18, 2015

Look at this post: https://forums.freenas.org/index.php?threads/drive-unavailable-pool-degraded.30269/#post-194723 ;)
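Independently of the linked script (which is not reproduced here), a rough Python sketch for listing the members that `zpool status` reports as unavailable could look like this. The sample rows are copied from the output posted above; the function name and the set of states treated as "missing" are choices made for this example.

```python
# Rows copied from the zpool status output posted earlier in the thread.
STATUS = """\
        NAME                                            STATE     READ WRITE CKSUM
        volume                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            15112422042946592908                        UNAVAIL      0     0     0  was /dev/gptid/95402b37-5a13-11e4-919c-c86000cb131c
            4162859240354443090                         UNAVAIL      0     0     0  was /dev/gptid/95c39268-5a13-11e4-919c-c86000cb131c
"""

# States treated as "missing" in this sketch. DEGRADED is left out on purpose
# because here it only marks the pool and vdev rows, not a specific disk.
MISSING_STATES = {"UNAVAIL", "FAULTED", "REMOVED", "OFFLINE"}


def missing_members(text):
    """Return (name, state, former_path) tuples for members in a missing state."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip the header and anything that is not a device row
        name, state = fields[0], fields[1]
        if state in MISSING_STATES:
            # "was /dev/gptid/..." records the label the member used to have.
            former = fields[fields.index("was") + 1] if "was" in fields else None
            result.append((name, state, former))
    return result


for name, state, former in missing_members(STATUS):
    print(state, name, "formerly", former)
```

On FreeNAS, `glabel status` lists which device each gptid label belongs to, which helps match a label to a physical disk.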

liquidice · Apr 18, 2015

thanks! that's a nice script! turned out that i didn't plug in one of the drives. it is resilvering now.

Bidule0hm · Apr 18, 2015

You're welcome ;)

And be careful because one more drive out and the data is lost, don't play with fire... :)

liquidice · Apr 23, 2015

drive replaced and all is healthy. new last question: when i get emails like this:

Code:

  pool: volume
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 40K in 20h1m with 0 errors on Mon Apr 20 11:22:46 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/92e122fb-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/8e751139-6874-11e4-a79e-c86000cb131c  ONLINE       0     0     0
            gptid/93c736f3-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/95402b37-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/92f858cb-e622-11e4-9d41-90e2ba66cab8  ONLINE       0     0     0
            gptid/962ade58-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0

errors: No known data errors

according to the link, parity has fixed the corruption and my data is all well and good. but, it also says that if i continue getting these, it could be a drive failing. if i don't clear it, i get the email every day. is it the same message? is it saying that i should clear them and if i continue getting them after clearing them then i may have a problem, or should this be going away all by itself eventually (when the next scrub takes place)?

cyberjock · Apr 23, 2015

You have 3 disks with 2 checksum errors each. Usually that's a sign of a failing disk, but SMART can tell you more.

In the meantime I'd log those 3 gptids, figure out what disks those are, and figure out if they have a problem, share a common controller or cable, etc.

Then, do a "zpool clear" and do another scrub and see if any errors develop. If they come back, you've got something to be concerned about.
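For logging the affected gptids before clearing, a minimal Python sketch like this one can pick out the rows with a non-zero CKSUM column. The sample rows are copied from the email quoted above; the function name is made up for this example.

```python
# Rows copied from the zpool status email quoted earlier in the thread.
STATUS = """\
            gptid/94203f52-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     0
            gptid/947317c4-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/94d99f4a-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/95402b37-5a13-11e4-919c-c86000cb131c  ONLINE       0     0     2
            gptid/92f858cb-e622-11e4-9d41-90e2ba66cab8  ONLINE       0     0     0
"""


def checksum_errors(text):
    """Return {gptid: cksum_count} for disks whose CKSUM column is not "0".

    Counts are kept as strings, since large counts may be shown abbreviated
    rather than as plain integers.
    """
    errors = {}
    for line in text.splitlines():
        fields = line.split()
        # Device rows: NAME STATE READ WRITE CKSUM
        if len(fields) >= 5 and fields[0].startswith("gptid/"):
            if fields[4] != "0":
                errors[fields[0]] = fields[4]
    return errors


for gptid, count in checksum_errors(STATUS).items():
    print(gptid, "CKSUM", count)
```

Saving this list before running `zpool clear volume` and `zpool scrub volume` makes it easy to see whether the same disks show up again after the next scrub.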
