FreeNAS reports drive read/write errors, long SMART test shows no errors

Status
Not open for further replies.

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Hardware Platform:

Supermicro Superserver 5028R-E1CR12L
Supermicro X10SRH-CLN4F Motherboard
1 x Intel Xeon E5-2640 V3 8 Core 2.6GHz
4 x 16GB PC4-17000 DDR4 2133MHz Registered ECC
12 x 4TB HGST HDN724040AL 7200RPM NAS SATA Hard Drives
LSI3008 SAS Controller - Flashed to IT Mode
LSI SAS3x28 SAS Expander
Dual 920 Watt Platinum Power Supplies
16GB USB3 Thumb Drive for booting
FreeNAS-9.3-STABLE-201504152200
12 Drives, 2 x 6 Drive VDEVS RAIDZ2




Extensive pre-deployment testing was completed on all hard drives and the platform, including over 10TB of dummy files written to the zpool. I started another 11TB copy of data to the pool via NFS today, and partway through I saw an alert on the FreeNAS dashboard letting me know that a drive had failed due to read and write errors.

I have several questions. First, email is set up correctly for the root user: I can send (and I receive) the test email, yet I received no email about the drive failing. I was poking around on the forums and read something (I think from @cyberjock) saying that the system only sends out alerts at 3am. Am I reading this correctly? I just want to make sure that I do not have something misconfigured.

Second, with the drive offline I started a long SMART test (I had run long, short, and conveyance tests when I got the drives) and the test returned:

Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 4  Extended offline    Completed without error       00%       453         -


So I decided to offline the drive and replace it with itself to see if the problem cropped up again. I think the system is smarter than I am, as it would not let me do this. I have not yet tried physically pulling the drive and reinserting it, or rebooting, to see if it will allow me to re-add that same drive (no production data on the box).

So my question is this: Why does FreeNAS report read and write errors when the long SMART test reports the drive as fine with 0 errors? Is there some other test I can run on the drive to determine whether I actually have a failing drive, or should I just pull the drive, replace it with a new one I have on the shelf, RMA the old drive, and not worry about it?

Thanks for any input!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can you post the full output of smartctl -a /dev/adawhatever ?
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
You bet!

Code:
[root@plexnas] ~# smartctl -a /dev/da11
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK2338P4H9LW8C
LU WWN Device Id: 5 000cca 249d275b7
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 06:12:32 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 599) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       81
  3 Spin_Up_Time            0x0007   130   130   024    Pre-fail  Always       -       596 (Average 595)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       466
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       69
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       69
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 20/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       453         -
# 2  Short offline       Completed without error       00%       453         -
# 3  Extended offline    Completed without error       00%       117         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@plexnas] ~#
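
For anyone who wants to check the same attributes across a lot of drives, here is a minimal Python sketch that pulls the raw values out of the attribute table printed by smartctl -a. The function names and the watch list are just my own picks for illustration (the IDs are ones that appear in the output above), and the parsing assumes the column layout shown here.

```python
import re

# Attribute IDs worth a quick look; this selection is an assumption for
# illustration, not an official list.
WATCHED_IDS = {5, 196, 197, 198, 199}

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)

def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} for every attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs

def nonzero_watched(attrs):
    """Names of watched attributes whose raw value is not zero."""
    return sorted(name for i, (name, raw) in attrs.items()
                  if i in WATCHED_IDS and raw != 0)

# A few rows copied from the smartctl output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       466
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 20/46)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(attrs[9])                # the Power_On_Hours row
    print(nonzero_watched(attrs))  # nothing flagged for this drive
```

Only the leading number of the raw column is kept, so a value such as "33 (Min/Max 20/46)" is read as 33.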



As a side note, the data on this box (about 11TB now) is a replica of another box, so I decided to play around to try to figure this out. I shut down the box, pulled the drive, reseated it, and rebooted, and then I saw this:

Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            475905736839031462                          UNAVAIL      0     0     0  was /dev/gptid/0bd7d8b8-ed63-11e4-a956-0cc47a31abcc

errors: No known data errors


So, wanting to use the same drive (since the long SMART test showed no errors again), I did this:

Code:
[root@plexnas] ~# zpool replace vol1 /dev/gptid/0bd7d8b8-ed63-11e4-a956-0cc47a31abcc /dev/da11


and saw this:

Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 30 06:07:30 2015
        28.0G scanned out of 15.6T at 736M/s, 6h10m to go
        2.34G resilvered, 0.18% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            replacing-5                                 UNAVAIL      0     0     0
              475905736839031462                        UNAVAIL      0     0     0  was /dev/gptid/0bd7d8b8-ed63-11e4-a956-0cc47a31abcc
              da11                                      ONLINE       0     0     0  (resilvering)

errors: No known data errors


Then this:

Code:
[root@plexnas] ~# zpool detach vol1 475905736839031462


Which resulted in this:

Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 30 06:08:59 2015
        14.6G scanned out of 15.6T at 649M/s, 7h0m to go
        1.21G resilvered, 0.09% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            da11                                        ONLINE       0     0     0  (resilvering)

errors: No known data errors


I waited a few minutes and checked again and it appears the resilver is moving right along:

Code:
[root@plexnas] ~# zpool status vol1
  pool: vol1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 30 06:08:59 2015
        1006G scanned out of 15.6T at 1.24G/s, 3h22m to go
        83.8G resilvered, 6.28% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            da11                                        ONLINE       0     0     0  (resilvering)

errors: No known data errors
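
As a sanity check on the "to go" figures, they look consistent with simply dividing the remaining data by the current scan rate. Here is a rough Python sketch of that arithmetic. It assumes the sizes are binary units (1T = 1024G = 1024*1024M), and the function names are purely illustrative. Since zpool status prints rounded numbers, the result can be off by a minute or so.

```python
# Binary unit multipliers expressed in MiB; assumed to match how
# zpool status abbreviates sizes.
UNITS_IN_M = {"M": 1, "G": 1024, "T": 1024 * 1024}

def to_mib(value):
    """Convert a string like '15.6T' or '736M' to MiB as a float."""
    return float(value[:-1]) * UNITS_IN_M[value[-1]]

def resilver_eta_seconds(scanned, total, rate):
    """Remaining seconds if the scan continues at the current rate."""
    return (to_mib(total) - to_mib(scanned)) / to_mib(rate)

def format_eta(seconds):
    """Render seconds as 'XhYm', rounding to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h{minutes % 60}m"

if __name__ == "__main__":
    # Figures from the first status output after the replace.
    print(format_eta(resilver_eta_seconds("28.0G", "15.6T", "736M")))
```

With the figures from the first status output after the replace (28.0G of 15.6T at 736M/s) this works out to 6h10m, which matches what zpool reported.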




So it would appear that my attempt to replace the failed drive with itself has been successful, but I am still wondering why, after all of the pre-deployment testing, FreeNAS thought the drive was bad. I always stock several brand-new replacement drives, and I am wondering if I should just replace it.

My vendor says they run a SMART test on any returned drives and, if it shows no errors, they do not take them back. So I would be stuck with a drive that FreeNAS does not like (for whatever reason) but that, when you ask it, says "I'm just fine"!
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
Great question - I'm going through the same thing now with a couple hundred drives..... Strange that some have Current Pending Sector counts > 0 and FreeNAS has no problems with them, and some have 0 issues with a full SMART long test, but FN degrades the pool if there is 1 read or write error....

Even though the drives are there, I wonder what criteria are best used to judge whether or not to replace a drive.

Does clearing an error using zpool clear just mask an issue, even though SMART says there isn't one?
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Wow...a couple of hundred drives...? We have about 150TB of space deployed with FN and this is the first time I have ever seen this where the drives pass all the pre-deployment tests and then FN says it is no good.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You screwed up by doing the zpool replace from the CLI. Now you are using da11 instead of a gpt-id...

Need to fix that first.

Then, once it's fixed and the issue has come up again, post a debug file from System -> Advanced.
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Hey @cyberjock

I tried doing a zpool replace (multiple times) in the GUI and it failed every time; the CLI was the only way it worked. Every time I tried, I got an error telling me to do it on the command line. As for using da11 as opposed to the gpt-id, is there any issue with that?
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
cyberjock said:
You screwed up by doing the zpool replace from the CLI. Now you are using da11 instead of a gpt-id...

Need to fix that first.

Then, once it's fixed and the issue has come up again, post a debug file from System -> Advanced.

cyberjock - so glad you replied to this post. Maybe I should start another thread, but I've been searching YOUR posts trying to find your opinion on the definition of a failed drive. I've been working on a spreadsheet because my shitty Seagates are "failing", and I've also been adding columns for SMART output.

Some drives FreeNAS sees as fine while SMART shows errors that bother me; some have SMART errors that don't bother FreeNAS, yadda yadda.... So in your opinion, is Current Pending Sector the telling issue from SMART data, or something else? What about the drives FN sees as FAULTED, but SMART sees nothing wrong with?

Thanks,

R1chR
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
HeloJunkie said:
Hey @cyberjock

Tried doing a zpool replace (multiple times) in the GUI, it failed every time, CLI was the only way it worked. Every time I tried I got and error telling me to do it on the command line. As far as using the da11 as opposed to the gpt-id, is there any issue with that?

There is an issue with that. You are correct that the WebGUI didn't let you do that. You cannot add the same drive back to the pool from the WebGUI. The most "correct" way is to reboot the system. The problem is that when you try to add the hard drive back to the pool the WebGUI sees that a partition and data is potentially on the disk, so it refuses to overwrite that data. You short circuited that "safety" logic when you went to the CLI and replaced it manually.

So now what you have to do is remove da11 again, then zero out the disk on another machine (NOT the FreeNAS machine), then do the replacement again.
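
If you want to check whether any pool member is referenced by a raw device name rather than a gptid, something like the following Python sketch could scan the text of zpool status. The function name and the list of grouping names it skips (raidz, mirror, replacing, and so on) are my assumptions for illustration, and the list is not exhaustive.

```python
import re

# Names that are groupings rather than disks; an assumed, non-exhaustive list.
GROUP_PREFIXES = ("raidz", "mirror", "replacing", "spare", "logs", "cache")

def non_gptid_members(status_text, pool):
    """Return leaf devices in the config section that are not gptid/ paths."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # Header row of the config table; devices start on the next line.
            in_config = True
            continue
        if fields[0] == "errors:":
            in_config = False
        if not in_config:
            continue
        name = fields[0]
        if name == pool or name.startswith(GROUP_PREFIXES):
            continue
        if name.startswith("gptid/") or re.fullmatch(r"\d+", name):
            # gptid members are fine; bare numbers are GUIDs of missing disks.
            continue
        found.append(name)
    return found

# Trimmed from the status output earlier in the thread.
SAMPLE = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            da11                                        ONLINE       0     0     0  (resilvering)

errors: No known data errors
"""

if __name__ == "__main__":
    print(non_gptid_members(SAMPLE, "vol1"))
```

For the trimmed sample, the only member it reports is da11.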
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
At the end of the day, the original question from HeloJunkie has not been answered... and the same thing has happened to me numerous times...

From above : "So my questions is this: Why does FreeNAS report read and write errors but the smart long test report the drive as fine with 0 errors? Is there some other test I can run on the drive to determine if I actually have a failing drive or should I just pull the drive, replace it with a new one I have on the shelf, RMA the old drive and not worry about it?"
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The most likely reason is a very specifically bad cable that only causes errors on the RX side (from the host's perspective).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
RichR said:
From above: "So my questions is this: Why does FreeNAS report read and write errors but the smart long test report the drive as fine with 0 errors? Is there some other test I can run on the drive to determine if I actually have a failing drive or should I just pull the drive, replace it with a new one I have on the shelf, RMA the old drive and not worry about it?"

FreeNAS, the OS itself, does not report errors. Errors are reported by drivers, by software (like SMART), or by kernel code like ZFS. So the question is a non sequitur.

To answer the question of why X doesn't report errors while Y reports errors, you must look at what X and Y do and how they define "an error". Then, and only then, can you determine, via logic similar to a Venn diagram, what is going on with your server.
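
To make the Venn diagram idea a bit more concrete, here is a minimal Python sketch that treats each reporting layer as a set of disks it has flagged and looks at the overlaps. The two layers chosen, the disk names, and the readings suggested in the comments are illustrative assumptions, not a diagnostic rule.

```python
def compare_layers(zfs_flagged, smart_flagged):
    """Split disks by which layer reported a problem."""
    zfs_flagged, smart_flagged = set(zfs_flagged), set(smart_flagged)
    return {
        # Both layers agree: the disk itself is the first suspect.
        "both": sorted(zfs_flagged & smart_flagged),
        # ZFS saw I/O errors but the disk's own counters are clean:
        # the path to the disk (cable, backplane, controller, power)
        # is worth checking.
        "zfs_only": sorted(zfs_flagged - smart_flagged),
        # The disk reports trouble that ZFS has not tripped over yet.
        "smart_only": sorted(smart_flagged - zfs_flagged),
    }

if __name__ == "__main__":
    # da11 is the disk from this thread; da3 is a made-up second disk.
    print(compare_layers(zfs_flagged={"da11"}, smart_flagged={"da3"}))
```

Each bucket is only a starting point for asking what that layer actually measures and how it defines an error.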
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
HeloJunkie said:
Why does FreeNAS report read and write errors but the smart long test report the drive as fine with 0 errors?

SMART tests test certain things and tell you what happened. If a SMART test completes without errors, that doesn't prove that the drive is 100% OK, just that the test found no issues. I have a 500GB drive that consistently completes short and long SMART tests without error, and shows no errors with badblocks either. But one day I was moving a bunch of data from one dataset to another, and I could hear the disks seeking constantly. Every so often the data rate would tank and an error would appear in the FreeNAS console. The data move completed without data loss, but I replaced the drive anyway, because it clearly can't cope under certain high stress conditions.
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
Great answers, thanks to all of you. It's much clearer to me now.

For argument's sake, let's say I go with Backblaze's criteria for troubled disks as far as SMART attributes are concerned. Those are:
  • SMART 5 – Reallocated_Sector_Count.
  • SMART 187 – Reported_Uncorrectable_Errors.
  • SMART 188 – Command_Timeout.
  • SMART 197 – Current_Pending_Sector_Count.
  • SMART 198 – Offline_Uncorrectable.
Assuming those numbers are fine, what would you do when FN reports read, write, or cksum errors?

My thought is to note the drive, clear the error, let it resilver, and see what happens. If the same drive continues to throw errors, check the cable, then clear once again. If that continues, replace the drive.

Your thoughts?
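
Written down roughly, the plan I have in mind looks like the Python sketch below. The function name and the idea of counting how many times a drive has already been flagged are my own shorthand, and the attribute IDs are the five Backblaze ones listed above.

```python
# The five SMART attribute IDs from the Backblaze list above.
BACKBLAZE_IDS = (5, 187, 188, 197, 198)

def next_action(smart_raw, prior_zfs_incidents):
    """Suggest a step for a drive that ZFS has just flagged.

    smart_raw: {attribute_id: raw_value}; missing IDs count as zero.
    prior_zfs_incidents: how many times this drive was flagged before.
    """
    if any(smart_raw.get(i, 0) > 0 for i in BACKBLAZE_IDS):
        # Outside the scenario in this post, where SMART looks clean.
        return "review SMART attributes"
    if prior_zfs_incidents == 0:
        return "note drive, zpool clear, watch"
    if prior_zfs_incidents == 1:
        return "check cable, zpool clear, watch"
    return "replace drive"

if __name__ == "__main__":
    print(next_action({5: 0, 197: 0}, prior_zfs_incidents=0))
```

The number of chances a drive gets before it is replaced is arbitrary here; it just mirrors the three steps in my post.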
 

TXAG26

Patron
Joined
Sep 20, 2013
Messages
310
The steps you outlined are a very methodical way of ruling out what you can. You may have posted this earlier, but what kind of PSU are you using? If you rule out all of the issues above and still have problems, maybe take a look at your PSU and see if there is anything going on with it.
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
It's an Athena Power AP-RRP4ATX6808 20+4Pin 2 x 800W Redundant 80 PLUS Bronze Certified ATLAS 800 PLUS Server Power Supply with PM Bus. I have not had any issues with it that I know of. I logged into the motherboard IPMI and there were no alarms regarding power that had been set off, but that may not mean anything..... If, however, there were serious issues with the PSU, I'd imagine it would impact more than a couple of drives out of 45... BUT, there may be an issue with a power connector to some drives, so your point could be spot on! The test would be in replacing those drives at the right time (these are production machines), and seeing if the drives using the same power connectors exhibit similar problems.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
RichR said:
Great answers, thanks to all of you. It's much clearer to me now.

For arguments sake, lets say I go with the Backblaze's criteria for troubled disks as far as smart tests are concerned. Those are:
  • SMART 5 – Reallocated_Sector_Count.
  • SMART 187 – Reported_Uncorrectable_Errors.
  • SMART 188 – Command_Timeout.
  • SMART 197 – Current_Pending_Sector_Count.
  • SMART 198 – Offline_Uncorrectable.
Assuming those numbers are fine, when FN reports read, write, or cksum errors. What would you do?

My thought is to note the drive, clear the error, let it resilver, and see what happens. If the same drive continues to throw errors, check the cable, then clear once again. If that continues, replace the drive.

Your thoughts?

It's not that simplistic (is anything that easy in IT?)

The real answer is you have to understand what is going on, and then decide what the appropriate action is. If CUPS goes from zero to one, then something like a SMART long test is a good idea (a scrub is a good idea too, but should be done after the long test). If CUPS goes from zero to 5 million, the disk is shot and should be replaced immediately.

Likewise, if Command_Timeout goes from zero to one, no action is required aside from monitoring that drive closely for a few days.

One you didn't mention is UDMA_CRC, which I consider important. If that goes from zero to one, no big deal. If it goes from zero to 1000, then you have a problem.

So what the parameter is, what it's changing from and to, the timeframe it is happening within, along with other parameter changes (or what isn't changing) all factor into deciding what the corrective action is.

I will say that, generally, if the disk starts showing signs of "I'm having problems" it rapidly deteriorates and it becomes pretty obvious that a disk replacement is necessary.
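
As a rough illustration of "what it's changing from and to", here is a Python sketch that compares two snapshots of raw SMART values. The only data points taken from above are the extremes (a change of 1 is minor, a change in the thousands or millions is not). The cut-off of 100 between "watch" and "act" is an arbitrary placeholder, and the function name is made up.

```python
WATCHED = ("Current_Pending_Sector", "Command_Timeout", "UDMA_CRC_Error_Count")

# Arbitrary placeholder; the post above only gives the extremes (1 vs. 1000+).
ACT_THRESHOLD = 100

def classify_changes(before, after):
    """Map each watched attribute to 'unchanged', 'watch', or 'act'."""
    verdicts = {}
    for name in WATCHED:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta <= 0:
            verdicts[name] = "unchanged"
        elif delta < ACT_THRESHOLD:
            verdicts[name] = "watch"
        else:
            verdicts[name] = "act"
    return verdicts

if __name__ == "__main__":
    before = {"Current_Pending_Sector": 0, "UDMA_CRC_Error_Count": 0}
    after = {"Current_Pending_Sector": 1, "UDMA_CRC_Error_Count": 1000}
    print(classify_changes(before, after))
```

This ignores the timeframe and how the parameters move together, which matter just as much, so treat it as a way to organize the numbers rather than a verdict.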
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
I have to say, my favorite is "It's not that simplistic (is anything that easy in IT?)"

The issue for me is when reporting to others, they want a plan (perfectly within their right). My task is to do the best I can, and do what I think is best for the company. This is certainly one of the IT issues where having a crystal ball to see into the future would be nice (when will a drive die), but it just isn't happening. So as long as I have some plan that makes sense, follow it the best I can, the rest should fall into place.

When you say "CUPS", are you talking about the print server??? Or is it an acronym I'm not picking up off the top of my head?

I really need a source to understand the read, write, and cksum errors and how they are produced. If SMART shows next to nothing, and a scrub repairs 0, then what? I'm certainly not complaining, but I wear numerous hats and it's hard to grasp everything with all the other stuff going on....

cyberjock - I seem to have seen your whereabouts in an earlier post. At this point (and it's very far in advance) I'm supposed to be there next spring for some days.... I'll buy dinner if you're interested in getting together...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
CUPS = Current Pending Sector Count

A crystal ball would be nice, but instead, to do file servers justice you have to do one of two things:

1. Start replacing crap at the hint of any problem.
2. Understand all the depths, layers, and what causes everything so you can actually identify what the problem is, how serious (or not serious) it is, and what kind of corrective action to take. There isn't a "file server 101" course that covers this stuff.

If you are wearing many hats (doesn't everyone these days?) either you have to commit to becoming a pro with enough knowledge to figure this out, or be ready to pay someone (like iXsystems) to resolve your problems. There is no "easy button" for this knowledge. It's years of trial, error, and experience.

If you want to do dinner we definitely could do that. I'm actually in the process of moving (mwahahaha) but I'd love to meet forum users that just want to say hi and have common interests.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
RichR said:
The issue for me is when reporting to others, they want a plan (perfectly within their right). My task is to do the best I can, and do what I think is best for the company. This is certainly one of the IT issues where having a crystal ball to see into the future would be nice (when will a drive die), but it just isn't happening. So as long as I have some plan that makes sense, follow it the best I can, the rest should fall into place.

It seems to me that planning to replace any hard drive that appears to be having problems and then observing whether the problems cease is entirely reasonable. The cost of a hard drive is small compared to the cost of employee time, regardless of the potential cost of data loss. If replacing a drive doesn't fix the problem, the one you removed can be set aside as a spare.

I'm not saying you shouldn't try to understand underlying issues, but I don't see why replacing a drive shouldn't be step 1 in your plan in a corporate environment.
 