So, I just got bit by this "issue" so I figured I'd share.
As many of you may know, I have a custom email script that emails me nightly. I thought I was super cool because I had the script report what I consider the two most important parameters for determining drive health: Temperature (for long-term health) and Current Pending Sector Count (for short-term health). Temperature should be kept below 45C at all times, and below 40C at least 50% of the time (but preferably all of the time). Any non-zero Current Pending Sector Count should be monitored very closely, as you likely have a drive that is getting ready to fail.
So my script provides me with something like the following every morning:
Code:
The following lists the Current Pending Sector Count for all hard drives on the system in order:
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
197 Current Pending Sector Count 0x32 200 0 OK
The following lists the current temperatures for all hard drives on the system in order:
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
194 Temperature 0x22 100 0 OK
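For context, a report like the one above can be stitched together from areca-cli output. This is a minimal sketch, assuming the `disk smart` output format shown later in this post; the drive list, and the report going to stdout rather than into an email, are placeholders for my actual setup:

```shell
#!/bin/sh
# Hypothetical sketch of the nightly report. Attribute 197 is
# Current Pending Sector Count and 194 is Temperature. The drive
# numbers below are placeholders.
report_attr() {
    # Print the matching attribute line for each drive, in order.
    for drv in 1 2 3 4 5; do
        areca-cli disk smart drv="$drv" | awk -v id="$1" '$1 == id'
    done
}

echo "The following lists the Current Pending Sector Count for all hard drives on the system in order:"
report_attr 197
echo "The following lists the current temperatures for all hard drives on the system in order:"
report_attr 194
```

Piping the whole thing into `mail -s "Nightly SMART report" root` (or your MTA of choice) would turn it into the email version.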
I have more than 5 drives, but this proves the point. At a glance I can see that hard drive temps are okay and the current pending sector counts are okay.
But wait. Don't you dare think that current pending sector counts are okay. Check this out... I'll use my newly failed drive to prove my point:
Code:
# areca-cli disk smart drv=12
S.M.A.R.T Information For Drive[#12]
  # Attribute Items                          Flag  Value  Thres  State
===============================================================================
  1 Raw Read Error Rate                      0x2f  200    51     OK
  3 Spin Up Time                             0x27  144    21     OK
  4 Start/Stop Count                         0x32  100    0      OK
  5 Reallocated Sector Count                 0x33  200    140    OK
  7 Seek Error Rate                          0x2e  200    0      OK
  9 Power-on Hours Count                     0x32  86     0      OK
 10 Spin Retry Count                         0x32  100    0      OK
 11 Calibration Retry Count                  0x32  100    0      OK
 12 Device Power Cycle Count                 0x32  100    0      OK
192 Power-off Retract Count                  0x32  200    0      OK
193 Load Cycle Count                         0x32  199    0      OK
194 Temperature                              0x22  112    0      OK
196 Reallocation Event Count                 0x32  200    0      OK
197 Current Pending Sector Count             0x32  200    0      OK
198 Off-line Scan Uncorrectable Sector Count 0x30  200    0      OK
199 Ultra DMA CRC Error Count                0x32  200    0      OK
===============================================================================
Hmm. Even the "long" printout from the Areca CLI says that Current Pending Sector Count is good. How odd. I do get this information in my email, but as you can see there's no reason to even question how healthy the disk is. Let's use smartctl...
Code:
# smartctl -q noserial -a --device=areca,12 /dev/arcmsr0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 10 00:09:41 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51180) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 492) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   144   144   021    Pre-fail  Always       -       9791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       778
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10533
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       178
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       174
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5967
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      10533        2375802360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Notice attribute 197: the Current Pending Sector Count is actually 19! I did try to perform an extended offline test, but you cannot run any SMART tests on drives connected to an Areca controller. The only option I had was to swap the drive with one connected to my Intel onboard SATA and run the long test. Naturally it failed. I knew it would. I'd been moving data to the zpool this drive is part of all day at an amazing 6MB/sec average, and when I looked at the drive LEDs, the other drives in the zpool would blink for a fraction of a second every 10-15 seconds, but this drive's LED was on solid. Ironically, this was a hardware RAID that had been abandoned for about 120 days, then put back online to retrieve the data (which went flawlessly), then converted to a zpool. Naturally the zpool has had weird performance issues since the moment it was created, but I assumed the issue was not a failing disk because my emails were saying everything was fine. What makes matters worse is if you do this:
Code:
# areca-cli event info
Date-Time            Device            Event Type       Elapsed Time  Errors
===============================================================================
2013-05-10 04:57:59  H/W MONITOR       Raid Powered On
2013-05-10 04:36:30  SW API Interface  API Log In
2013-05-10 04:21:14  H/W MONITOR       Raid Powered On
2013-05-10 03:39:25  H/W MONITOR       Raid Powered On
2013-05-09 08:04:46  SW API Interface  API Log In
2013-05-09 06:09:03  H/W MONITOR       Raid Powered On
2013-05-09 05:45:37  H/W MONITOR       Raid Powered On
2013-05-09 05:31:36  IDE Channel #17   Reading Error
2013-05-09 05:29:42  H/W MONITOR       Raid Powered On
2013-05-08 08:02:46  SW API Interface  API Log In
2013-05-08 02:00:49  IDE Channel #11   Reading Error
2013-05-08 01:44:35  IDE Channel #11   Reading Error
2013-05-08 01:44:27  IDE Channel #11   Reading Error
2013-05-08 01:44:20  IDE Channel #11   Reading Error
2013-05-08 01:44:10  IDE Channel #11   Reading Error
2013-05-08 01:44:03  IDE Channel #11   Reading Error
2013-05-08 01:43:56  IDE Channel #11   Reading Error
2013-05-08 01:43:49  IDE Channel #11   Reading Error
2013-05-08 01:43:41  IDE Channel #11   Reading Error
2013-05-08 01:42:57  IDE Channel #11   Reading Error
2013-05-08 01:42:50  IDE Channel #11   Reading Error
2013-05-08 01:42:43  IDE Channel #11   Reading Error
2013-05-08 01:42:35  IDE Channel #11   Reading Error
2013-05-08 01:42:28  IDE Channel #11   Reading Error
2013-05-08 01:42:20  IDE Channel #11   Reading Error
2013-05-08 01:42:07  IDE Channel #11   Reading Error
2013-05-08 01:42:00  IDE Channel #11   Reading Error
2013-05-08 01:41:53  IDE Channel #11   Reading Error
2013-05-08 01:41:46  IDE Channel #11   Reading Error
2013-05-07 17:16:13  SW API Interface  API Log In
2013-05-07 15:18:15  H/W MONITOR       Raid Powered On
2013-05-07 15:01:52  H/W MONITOR       Raid Powered On
2013-05-07 13:02:42  IDE Channel #11   Reading Error
2013-05-07 12:59:50  IDE Channel #11   Reading Error
2013-05-07 12:59:37  IDE Channel #11   Reading Error
2013-05-07 12:59:12  IDE Channel #11   Reading Error
2013-05-07 12:58:57  IDE Channel #11   Reading Error
2013-05-07 12:58:50  IDE Channel #11   Reading Error
2013-05-07 12:58:43  IDE Channel #11   Reading Error
2013-05-07 12:57:42  IDE Channel #11   Reading Error
2013-05-07 12:57:28  IDE Channel #11   Reading Error
2013-05-07 12:53:55  IDE Channel #11   Reading Error
2013-05-07 12:53:48  IDE Channel #11   Reading Error
2013-05-07 12:53:21  IDE Channel #11   Reading Error
2013-05-07 12:53:05  IDE Channel #11   Reading Error
2013-05-07 12:52:55  IDE Channel #11   Reading Error
2013-05-07 12:52:29  IDE Channel #11   Reading Error
2013-05-07 12:51:59  IDE Channel #11   Reading Error
2013-05-07 12:51:52  IDE Channel #11   Reading Error
2013-05-07 12:51:23  IDE Channel #11   Reading Error
2013-05-07 12:51:16  IDE Channel #11   Reading Error
2013-05-07 12:51:07  IDE Channel #11   Reading Error
2013-05-07 12:50:59  IDE Channel #11   Reading Error
2013-05-07 12:50:38  IDE Channel #11   Reading Error
2013-05-07 12:50:18  IDE Channel #11   Reading Error
2013-05-07 12:49:56  IDE Channel #11   Reading Error
2013-05-07 12:49:44  IDE Channel #11   Reading Error
2013-05-07 12:49:30  IDE Channel #11   Reading Error
2013-05-07 12:49:19  IDE Channel #11   Reading Error
2013-05-07 12:49:11  IDE Channel #11   Reading Error
2013-05-07 12:49:04  IDE Channel #11   Reading Error
2013-05-07 12:48:37  IDE Channel #11   Reading Error
2013-05-07 12:48:30  IDE Channel #11   Reading Error
2013-05-07 12:48:01  IDE Channel #11   Reading Error
===============================================================================
GuiErrMsg<0x00>: Success.
I had moved the drive around as part of troubleshooting, which is why reading errors show up on more than one channel. I never suspected that the areca-cli would tell me the drive is good when it isn't.
So what's the moral of the story? Areca controllers, even in JBOD mode, are not the best choice. Not only do they not let you run SMART tests, but their CLI lies about hard drive status. I knew it wouldn't let me run SMART tests, so in my 24-port Norco case I have 4 drives connected to the Intel SATA controller. I figured that if I ever needed to run a SMART test, I need only swap the questionable disk with one on the Intel SATA controller. Add to that the fact that for this type of error a # zpool status gives you zero errors on all disks. The only way I would have known about the exact issue was by noticing the LED on almost constantly and actually querying the drive via smartctl rather than the areca-cli. Long live smartctl. Now to resilver with my spare disk and fix my script so I don't get screwed over by this issue again.
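The fix I have in mind for the script is to stop trusting the controller's normalized "OK" state and check smartctl's RAW_VALUE column instead: as the smartctl output above shows, the normalized value for attribute 197 was still 200 (well above its threshold of 0) even with 19 pending sectors, which is why everything reported "OK". A rough sketch of that check, with the drive/device arguments being placeholders for my setup and the alert action left as a simple echo:

```shell
#!/bin/sh
# Sketch of the fixed check: read the RAW_VALUE (last field) of
# attribute 197 from smartctl's attribute table instead of the
# controller's normalized state. drv 12 / arcmsr0 are placeholders.
raw=$(smartctl -q noserial -A --device=areca,12 /dev/arcmsr0 \
      | awk '$1 == 197 {print $NF}')

if [ -n "$raw" ] && [ "$raw" -ne 0 ]; then
    echo "ALERT: Current Pending Sector Count is $raw"
fi
```

The same pattern works for attribute 194, where `$NF` is the actual temperature in Celsius rather than the normalized value.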
Edit: Something else I'd like to add. If I did a # zpool iostat 1 I'd get non-zero values for writes, but then zero for 3-6 seconds, which is how you'd expect the zpool to behave when everything is working. So don't expect zpool iostat to help you out.
Also, I did a # gstat and watched the latency of the devices. All 6 drives in the zpool (the bad disk + the 5 good ones) were attached to the Areca controller, and all 6 would occasionally show very high latency values (2000ms or higher), so I thought something was up. My other zpool is also attached to the Areca controller, and whenever a read or write hit that pool at the same time, the latency of every disk attached to the Areca controller would be extremely high. So gstat was a way to identify that there was a disk issue, but it was not useful in identifying the exact disk in my situation. It might have provided more value (perhaps pinpointing the exact device having issues) on the onboard Intel SATA or an HBA.