Uncorrectable Parity/CRC Error ??

Status
Not open for further replies.

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Tested in a manner of speaking:

1. On the old mobo/RAM/power supply I got the same errors even after swapping in new cables.
2. Got a new mobo/RAM/power supply and keep seeing the error recur on the console as an ada2 error.
3. Smartctl reports no issues with ada2 (or ada0, ada1, ada3, ada5).
4. Smartctl reports an issue with ada4 (posted above).

So on the console I get the retrying and keep seeing that error every 10 min or so, but overall FreeNAS says the pool is healthy and the drive view shows the same.

Errors continue to show in console and in daily emails.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
I mean, did you do RAM testing, hard disk drive testing, and visual inspections of all gear (capacitors etc.)? As cyberjock noted earlier, temps are high but not extreme. I certainly wouldn't trust those drives/systems until you stop seeing those errors. Maybe reseat both ends of ada4's cable again, or give it a new cable again. Other than that, I'd be back to testing.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Did Prime test for 24 hrs, and RAM passed and was installed.

I typically don't test drives that go into RAID as I rely on RAID software to inform me.

In this case we have a difference of opinion between FreeNAS pool/disk reporting and SMART showing an issue.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
So is the motherboard set to RAID instead of AHCI?
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Mobo is set to AHCI. RAID is handled by FreeNAS using Sun's ZFS technology for software RAID. However, the issue is that nothing in the FreeNAS pool status or disk status shows any problem, while the console (which I believe is FreeBSD's) does show the error on ADA2 and smartctl says it's on ADA4.

Ugh, the console keeps repeating:
Code:
Jan 21 17:15:08 freenas kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c8 a0 15 8f 40 2e 00 00 00 00 00
Jan 21 17:15:08 freenas kernel: (ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
Jan 21 17:15:08 freenas kernel: (ada2:ahcich2:0:0:0): Retrying command
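
To see how often each device logs this error, the console messages can be tallied per device. The following is a minimal sketch, not anything FreeNAS ships: the function name `crc_errors_by_device` is made up for illustration, and it assumes the kernel messages are saved in a file such as /var/log/messages in the same format as the three lines above.

```shell
# Hypothetical helper: count "Uncorrectable parity/CRC error" lines per device.
# Reads log text from the named files, or from stdin when none are given.
crc_errors_by_device() {
  awk '
    /CAM status: Uncorrectable parity\/CRC error/ {
      # The device name sits between "(" and the first ":", e.g. "(ada2:ahcich2:...".
      if (match($0, /\(ada[0-9]+:/)) {
        dev = substr($0, RSTART + 1, RLENGTH - 2)
        count[dev]++
      }
    }
    END { for (dev in count) print dev, count[dev] }
  ' "$@" | sort
}

# Example usage (the log path is an assumption):
#   crc_errors_by_device /var/log/messages
```

Each output line is a device name followed by the number of matching errors, which makes it easier to tell whether only ada2 is affected or other devices show up as well.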
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Please re-run: smartctl -a /dev/ada2 and post all the output in code tags, like you did for ada4.

It appears that you've never run any SMART tests (short or long) on ada2.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
I ran long and short tests on all disks, but only ADA4 had issues. Here's smartctl -a /dev/ada2:

Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)                                                             
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                                
Model Family:     Western Digital Red (AF)                                                                                          
Device Model:     WDC WD30EFRX-68EUZN0                                                                                              
Serial Number:    WD-WMC4N0329356                                                                                                   
LU WWN Device Id: 5 0014ee 003a8ce01                                                                                                
Firmware Version: 80.00A80                                                                                                          
User Capacity:    3,000,592,982,016 bytes [3.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                            
Rotation Rate:    5400 rpm                                                                                                          
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-2 (minor revision not indicated)                                                                              
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)                                                                            
Local Time is:    Wed Jan 22 09:41:23 2014 EST                                                                                      
SMART support is: Available - device has SMART capability.                                                                          
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART overall-health self-assessment test result: PASSED                                                                            
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                            
                                        was never started.                                                                          
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                    
                                        without error or no self-test has ever                                                      
                                        been run.                                                                                   
Total time to complete Offline                                                                                                      
data collection:                (40680) seconds.                                                                                    
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                            
                                        Auto Offline data collection on/off support.                                                
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                    
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                        
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                              
SMART capabilities:            (0x0003) Saves SMART data before entering                                                            
                                        power-saving mode. 
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                    
                                        General Purpose Logging supported.                                                          
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                    
Extended self-test routine                                                                                                          
recommended polling time:        ( 408) minutes.                                                                                    
Conveyance self-test routine                                                                                                        
recommended polling time:        (   5) minutes.                                                                                    
SCT capabilities:              (0x703d) SCT Status supported.                                                                       
                                        SCT Error Recovery Control supported.                                                       
                                        SCT Feature Control supported.                                                              
                                        SCT Data Table supported.                                                                   
                                                                                                                                    
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   201   201   021    Pre-fail  Always       -       4908                                         
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       11                                           
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2611                                         
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11                                           
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8                                            
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2                                            
194 Temperature_Celsius     0x0022   108   104   000    Old_age   Always       -       42                                           
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                            
                                                                                                                                    
SMART Error Log Version: 1                                                                                                          
No Errors Logged                                                                                                                    
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                         
 
                                                                                                                                    
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                      
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                        
    1        0        0  Not_testing                                                                                                
    2        0        0  Not_testing                                                                                                
    3        0        0  Not_testing                                                                                                
    4        0        0  Not_testing                                                                                                
    5        0        0  Not_testing                                                                                                
Selective self-test flags (0x0):                                                                                                    
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                
If Selective self-test is pending on power-up, resume after 0 minute delay.    
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
According to line 83 (for ada2), you've never run any tests on this drive.
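
The telling part is the self-test log near the end of the output, which reads "No self-tests have been logged." A saved `smartctl -a` report can be checked for that message mechanically. This is a rough sketch: the function name `selftests_logged` is hypothetical, and it only looks for the exact wording that smartctl 6.2 printed above, so empty or truncated input would wrongly pass.

```shell
# Hypothetical helper: succeed only if the smartctl report on stdin shows
# at least one logged self-test, i.e. the "No self-tests" message is absent.
selftests_logged() {
  ! grep -q 'No self-tests have been logged'
}

# Example usage:
#   smartctl -a /dev/ada2 | selftests_logged || echo "ada2: no self-tests on record"
```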
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Never said I prestressed drives... Only RAM with Prime. I have done the same smartctl on that and the other drives.

More to the point, is any of the above related to the console messages about ADA2?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
"According to line 83 (for ada2), you've never run any tests on this drive."

Exactly. Back on page 2 you'll see DrKK's comment about how I asked for the output and got a watered-down, filtered version, and the response was something like "nothing notable to mention".
This is precisely why I don't want people interpreting what I ask for. Give me what I want or I leave. Because 90% of the time, if you are asking for help you probably don't have the necessary knowledge to come to the conclusion that "nothing notable was seen". Now we've been told tests were run and passed, yet it's completely obvious no test was run, ever. Even if a test had been interrupted, that would also have been logged.

And now we go back to what you said..

"I typically don't test drives that go into RAID as I rely on RAID software to inform me.

In this case we have a difference of opinion between FreeNAS pool/disk reporting and SMART showing an issue."

So have you figured out that there IS a reason why the developers include the SMART testing, SMART monitoring, and SMART reporting via email? That stuff you are choosing to ignore is actually VERY valuable and VERY useful. It tells you things ZFS and scrubs can't tell you.

Sorry if I sound snarky. I get really pissed when people water down what I ask for with their own interpretations and/or try to argue why SMART is a waste of time. It's super important! It may be the single most important thing you can use to monitor, test, and report disk health.

Of course, I do also enjoy people like you because when you need data recovery you have me or other services like Ontrack. And since the other people are more than an order of magnitude higher in cost, I do enjoy the extra money from time to time. So guess who gets to earn a little money on the side when you lose your pool? :D
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
No worries on sounding snarky. I get it. And I and surely dozens more appreciate your input on this board.

What I meant to imply was that when I keep seeing console errors for ADA2, I start fresh by running smartctl against ada0, ada1, ada2, etc. Since ADA0, ADA1, ADA2, ADA3, and ADA5 all showed no errors, I meant to say there was nothing startling there, while ADA4 gave an explicit error, which is odd (console says ADA2, smartctl says ADA4). This just points to a disparity between console reporting and UNIX command reporting.

And at this point FreeNAS reports no issue with pool health, either in the GUI or in zpool status on the command line. The GUI shows all drives are fine as well. This just points to a gap in FreeNAS's ability to inform on disk issues.

At this point I'm contemplating getting 2 new drives for ADA4 and ADA2; raidz2 should support that and resilver slowly. I'll probably be pulling one drive at a time, starting with ADA4.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I think I should clarify a few things...

If the console says ada2 is having problems, you can be assured that ada2 is having problems. Now, some people have gone off and thought other disks were bad due to crosstalk on cables. They'll disconnect ada3 and the ada2 problems will go away. So they'll assume that ada3 is the problem and the error reporting is wrong. So you need to take ALL of your data together to get the right answer. Since you are getting errors on ada2 and ada4 they are DEFINITELY having problems. The difference is whether ada2 is having problems or is the problem. There's a difference. And this is something you'll have to figure out with process of elimination.

ZFS simply reports on errors it finds via checksums, disk read errors, or disk write errors. That's all it can do. You want more, use SMART. That's what it's there to do!

You should just stop and run SMART tests on all of your disks and see what they report.
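
In practice that means starting a self-test on every disk and then reading each drive's self-test log once the tests finish. A minimal sketch, assuming the six disks named in this thread (ada0 through ada5): the `start_selftests` wrapper and the `SMARTCTL` override variable are made up here for illustration, while `-t short`, `-t long` and `-l selftest` are standard smartctl options.

```shell
# Hypothetical wrapper: start a SMART self-test of the given kind on each disk.
# SMARTCTL can be overridden (e.g. SMARTCTL=echo) to preview the commands.
start_selftests() {
  kind="$1"   # "short" or "long"
  shift
  for disk in "$@"; do
    ${SMARTCTL:-smartctl} -t "$kind" "/dev/$disk"
  done
}

# Example usage (run as root):
#   start_selftests short ada0 ada1 ada2 ada3 ada4 ada5
#   start_selftests long  ada0 ada1 ada2 ada3 ada4 ada5   # about 408 minutes per the ada2 output
#   smartctl -l selftest /dev/ada2                        # read the results afterwards
```

The tests run inside the drives themselves, so the commands return right away; the results only appear in the self-test log after the polling time has passed.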
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
I'll RMA ADA2 to WD and get the replacement.

If that goes swimmingly I'll do the same on ADA4.

I'd rather replace one at a time in a RAIDz2 even though it can support 2 drives failing.
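
Replacing one disk at a time means waiting for each resilver to finish before touching the next drive. One way to check is sketched here with a hypothetical `resilver_in_progress` function that scans `zpool status` output for the "resilver in progress" text; the pool name `tank` in the usage comment is a placeholder, not the name of this pool.

```shell
# Hypothetical helper: succeed if the zpool status text on stdin reports
# a resilver that is still running.
resilver_in_progress() {
  grep -q 'resilver in progress'
}

# Example usage ("tank" is a placeholder pool name):
#   if zpool status tank | resilver_in_progress; then
#     echo "still resilvering; leave the other disks alone"
#   fi
```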
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You do realize WD doesn't do RMAs without some kind of assurance there is a problem? And SMART tests will tell you that...

That is exactly why I said to stop and run SMART tests. You'll probably be upset if you RMA 2 good disks and the problem remains....
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
I've worked with WD over the past several years, and their online RMA process allows for some freeform explanation after they validate the S/N of the drive(s).

That said, ADA4 reports a tangible error from SMART that will pass the RMA process; then, once the replacement drive has been cold-swapped in, I'll just see what I see from console/SMART regarding ADA2.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
So I got the RMA in for ADA4, which had the smartctl error, and also bought a new WD Red 3TB drive to replace it.

It's now resilvering, and smartctl -a /dev/ada4 gives no errors. Monday I send back the RMA'd drive, and when I receive the replacement from WD it goes in place of ADA2 (the one with the continuous console errors), which I'll be RMAing as well next week.
 