Uncorrectable Parity/CRC Error ??

Status
Not open for further replies.

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Not entirely stupid, and more born out of need.

Administering a box that cannot be rebooted during prime working hours lets you hot-seat a card, then remotely reboot later that night. Come in the next day and plug something in. That way you aren't rebooting when you shouldn't, and moreover you don't have to hang around the office late at night.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Wow.. you got more balls than I! LOL I'd NEVER hotseat a card. PCIe slots aren't physically designed for that as far as I know.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
PCI-e does technically incorporate hot-plug into its design, but I think it's one of those features that is rarely used and therefore not tested too much.

I wouldn't bother myself. Even if it meant coming in to work late, or staying late, I'd just shut down the server during the maintenance window and make the hardware change.

I found this:

http://expansionsystem.com/profiles/blogs/pci-express-hot-plug
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Right, but I thought it was supposed to be a different slot physically for the hot-plug stuff...
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Well, got new power, mobo, CPU, and RAM. Using same cables as before, but these 6 cables were replacements put in about 1 week ago.

I keep getting WRITE_FPDMA_QUEUED error on console for ada2:ahcich2:0:0:0 so I ran a serial smartctl check on each of the 6 drives ada0-ada5:

ada0: no errors
ada1: no errors
ada2: no errors (surprising)
ada3: no errors
ada4: errors (below)
ada5: no errors

ada4 (oddly since console has ada2):
Error 1 occurred at disk power-on lifetime: 82 hours (3 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 20 08 4e 44  Error: IDNF at LBA = 0x044e0820 = 72222752

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 00 20 08 4e 44 08   3d+10:11:28.692  WRITE DMA
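
For anyone wondering how the LBA in that entry relates to the register dump, here is a minimal Python sketch. It assumes 28-bit LBA addressing (WRITE DMA is a 28-bit command), where the low nibble of DH carries the top four LBA bits. The function name is made up for illustration and is not part of smartctl.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumption: 28-bit LBA mode, so only the low nibble of the
    Device/Head register (DH) contributes address bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the ada4 error entry above (SN CL CH DH).
lba = lba28_from_registers(sn=0x20, cl=0x08, ch=0x4E, dh=0x44)
print(f"LBA = 0x{lba:08x} = {lba}")
```

The result agrees with the LBA that smartctl prints next to "Error: IDNF" (72222752), so the entry is internally consistent.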
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That error is from power-on hour 82. It will ALWAYS be there unless newer errors push it out of the log.
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
I am going to jump on this bandwagon. I have read and searched the forum, and understand most of what I have read. I am posting here instead of the Noob forum, since the topic is already on point.

I am getting the information below in the local security email output. I have swapped SATA cables, with no change, and the errors have been observed on most of the drives. No errors from the long SMART tests. I have changed power supplies with no change. The drives are WD RED, and are all plugged into the MOBO SATA controller slots. I have no other controller, but am willing to buy a PCIe controller and move some drives over if it is recommended as the next step. Should I even be concerned?

Thanks in advance for the help.

homenas.local kernel log messages:
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 80 7f e6 40 83 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 40 7f e6 40 83 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 c0 7f e6 40 83 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 e0 a2 9d 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
re0: link state changed to DOWN
re0: link state changed to UP
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 78 52 f7 40 35 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 10 e4 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 e4 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 e5 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 40 e5 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 80 e5 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 38 e6 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 d0 df e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 78 e6 e2 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 b0 e4 58 40 36 00 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 88 f0 15 e4 40 83 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command

-- End of security output --
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
It's easier to help if you list your specs/hardware. Check to see where the errors are occurring. What is the output of "zpool status -v"? Please put the output in code tags. It looks like ada2 and ada4 are having trouble communicating.

Thanks,
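
One way to see where the errors are occurring is to tally the CRC lines per device from the kernel log. This is only a rough Python sketch: the function name and the sample text are illustrative, and the regular expression assumes the log line format shown in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   (ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
CRC_LINE = re.compile(
    r"\((\w+):(\w+):[\d:]+\): CAM status: Uncorrectable parity/CRC error"
)


def count_crc_errors(log_text: str) -> Counter:
    """Return a Counter keyed by (device, channel), one count per CRC error line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CRC_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A short excerpt in the same format as the security email above.
sample = """\
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 80 7f e6 40 83 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada4:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 e0 a2 9d 40 35 00 00 00 00 00
(ada4:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich4:0:0:0): Retrying command
"""

for (device, channel), n in sorted(count_crc_errors(sample).items()):
    print(f"{device} on {channel}: {n} CRC error(s)")
```

If the counts cluster on one or two channels, that points at those ports or cables. If they are spread across most drives, something shared (controller, power, backplane) is a more likely suspect.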
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
Here you go...

Code:
  pool: homenas
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 12h45m with 0 errors on Sun Dec 29 12:45:06 2013
config:

    NAME                                            STATE     READ WRITE CKSUM
    homenas                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/31c180f9-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0
        gptid/32378856-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0
        gptid/32ae0af9-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0
        gptid/332983d4-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0
        gptid/33934044-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0
        gptid/3401a1b2-f019-11e2-999f-60a44c2ffe84  ONLINE       0     0     0

errors: No known data errors
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Would it be advisable to just turn off SMART in BIOS?

Reasoning is that a lot of SMART errors are not giving FreeNAS a degraded pool and drives all show healthy.

Meaning, either FreeNAS is not seeing real errors or the errors are not real ones.

My 5x3TB RAIDz2 looks like above with no errors. The ada4 output from smartctl shows one error for me, but cyberjock says it dates from power-on hour 82 and will just keep filling my console. If there's no error and my zpool is cool as a cucumber, then why not just turn off SMART?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, I'd never disable SMART. That's one of your primary indicators of a failing disk!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
marcevan: You *shouldn't* be getting any SMART errors in the console or via emails unless new errors are developing. Can you provide an email or screenshot of what you are seeing?
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
I get this on the console and in the daily email. Seems it's one error but keeps getting reprocessed with the same result. Shortened it as it would be redundant to show every instance but it's about 40 lines in email of the same:

Code:
freenas.basement.local kernel log messages:
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 97 ec 40 a2 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 38 10 87 0d 40 a3 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
-- End of security output --
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you post the output of smartctl -a /dev/ada2?
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Nothing startling on smartctl for ada2.. ada4 is posted above from Wednesday.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
For the record, smartctl -a /dev/ada2

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   201   201   021    Pre-fail  Always       -       4908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2541
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   108   104   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, 42C is outside the recommended temperature band, but it shouldn't cause the problems you are seeing.

I'd try using a different SATA cable. You chopped off a lot of info I wanted to see from the -a output, so I can't really provide any more info.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
marcevan said: "Nothing startling on smartctl for ada2.. ada4 is posted above from Wednesday."

Sir, just for the record:

You have Cyberjock, who is one of our most active helper-guys on the forum attempting to help you out. Which he does for free.

When he asks you for the output of a smartctl, or anything else, please do not tell him "nothing startling", or cut off parts of the output that you deem unnecessary. Please provide the information these guys ask for. Let them decide what is, and is not, important.
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Wednesday 10:32 am post from me shows nothing for ada2 but weirdly for ada4.

By nothing startling I mean a perfect smartctl output that shows no errors at all for ada2.

For ada4, that post shows the error already commented on, which dates from early in the drive's life. Meanwhile the console keeps retrying, but it names ada2, not ada4.

Here is the complete ada4 smartctl output:

Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)                                                             
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                                
Model Family:     Western Digital Red (AF)                                                                                          
Device Model:     WDC WD30EFRX-68EUZN0                                                                                              
Serial Number:    WD-WMC4N0524063                                                                                                   
LU WWN Device Id: 5 0014ee 6ae755ef4                                                                                                
Firmware Version: 80.00A80                                                                                                          
User Capacity:    3,000,592,982,016 bytes [3.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                            
Rotation Rate:    5400 rpm                                                                                                          
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-2 (minor revision not indicated)                                                                              
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)                                                                            
Local Time is:    Tue Jan 21 11:49:14 2014 EST                                                                                      
SMART support is: Available - device has SMART capability.                                                                          
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART overall-health self-assessment test result: PASSED                                                                            
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                            
                                        was never started.                                                                          
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                    
                                        without error or no self-test has ever                                                      
                                        been run.                                                                                   
Total time to complete Offline                                                                                                      
data collection:                (40860) seconds.                                                                                    
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                            
                                        Auto Offline data collection on/off support.                                                
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                    
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                        
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                              
SMART capabilities:            (0x0003) Saves SMART data before entering                                                            
                                        power-saving mode.                                                                          
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                    
                                        General Purpose Logging supported.                                                          
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                    
Extended self-test routine                                                                                                          
recommended polling time:        ( 410) minutes.                                                                                
Conveyance self-test routine                                                                                                        
recommended polling time:        (   5) minutes.                                                                                    
SCT capabilities:              (0x703d) SCT Status supported.                                                                       
                                        SCT Error Recovery Control supported.                                                       
                                        SCT Feature Control supported.                                                              
                                        SCT Data Table supported.                                                                   
                                                                                                                                    
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   198   184   021    Pre-fail  Always       -       5058                                         
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10                                           
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1127                                         
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10                                           
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7                                            
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2                                            
194 Temperature_Celsius     0x0022   109   108   000    Old_age   Always       -       41                                           
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0                                            
                                                                                                                                    
SMART Error Log Version: 1                                                                                                          
ATA Error Count: 1                                                                                                                  
        CR = Command Register [HEX]                                                                                                 
        FR = Features Register [HEX]                                                                                                
        SC = Sector Count Register [HEX]                                                                                            
        SN = Sector Number Register [HEX]                                                                                           
        CL = Cylinder Low Register [HEX]                                                                                            
        CH = Cylinder High Register [HEX]                                                                                           
        DH = Device/Head Register [HEX]                                                                                             
        DC = Device Command Register [HEX]                                                                                          
        ER = Error register [HEX]                                                                                                   
        ST = Status register [HEX]                                                                                                  
Powered_Up_Time is measured from power on, and printed as                                                                           
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,                                                                               
SS=sec, and sss=millisec. It "wraps" after 49.710 days.                                                                             
                                                                                                                                    
Error 1 occurred at disk power-on lifetime: 82 hours (3 days + 10 hours)                                                            
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                            
  ER ST SC SN CL CH DH                                                                                                              
  -- -- -- -- -- -- --                                                                                                              
  10 51 00 20 08 4e 44  Error: IDNF at LBA = 0x044e0820 = 72222752                                                                  
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  ca 00 00 20 08 4e 44 08   3d+10:11:28.692  WRITE DMA                                                                              
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed without error       00%       159         -                                                      
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                      
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                        
    1        0        0  Not_testing                                                                                                
    2        0        0  Not_testing                                                                                                
    3        0        0  Not_testing                                                                                                
    4        0        0  Not_testing                                                                                                
    5        0        0  Not_testing                                                                                                
Selective self-test flags (0x0):                                                                                                    
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                
If Selective self-test is pending on power-up, resume after 0 minute delay.            
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
I'd back up any data (if you haven't already) if you're still getting errors. I'm confused by your last post. Does this mean you are not getting any more errors? Did you put 2 brand new SATA cables on ada2 and ada4? The error at hour 82 on WD-WMC4N0524063 will always be present in the SMART logs. I'd put the whole box back into testing if you are still having issues. Your motherboard supports non-ECC RAM and I'm assuming you're not overclocking. Have you tested the RAM, etc.?
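
To tell an old logged error from a new one, compare the error's power-on hour with the drive's current Power_On_Hours. Below is a small Python sketch of that check. The helper names are hypothetical, and it assumes the `smartctl -a` text layout shown in this thread, with a plain integer as the Power_On_Hours raw value.

```python
import re

ERROR_AT = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")


def power_on_hours(smartctl_output: str):
    """Return the raw Power_On_Hours value, or None if the row is missing."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; RAW_VALUE is the last one.
        if len(fields) >= 10 and fields[1] == "Power_On_Hours":
            return int(fields[9])
    return None


def logged_error_ages(smartctl_output: str):
    """Return (error number, hour logged, hours since then) for each logged error."""
    now = power_on_hours(smartctl_output)
    if now is None:
        return []
    return [
        (int(num), int(hour), now - int(hour))
        for num, hour in ERROR_AT.findall(smartctl_output)
    ]


# Two lines excerpted from the ada4 output posted above.
sample = """\
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1127
Error 1 occurred at disk power-on lifetime: 82 hours (3 days + 10 hours)
"""

for num, hour, age in logged_error_ages(sample):
    print(f"Error {num}: logged at hour {hour}, {age} power-on hours ago")
```

For ada4 that single entry is over a thousand power-on hours old, so it is not evidence of the CRC retries currently showing up on the console.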
 