SOLVED New build causing CRC errors and AHCICH timeouts

Status
Not open for further replies.

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Everybody,

I recently built a new FreeNAS system that throws random errors during sustained copying or scrubbing of the pool.

The specs of the Freenas build are:

12 x 4TB hard drives: two RAIDZ2 pools with six drives per pool, as shown below.

Code:
ada0 --\
ada1 ---|
ada2 ---|
ada3 ---|
ada4 ---|--RAIDZ2--\
ada5 --/            \
                 [freenas]
ada6 --\            /
ada7 ---|--RAIDZ2--/
ada8 ---|
ada9 ---|
ada10 --|
ada11 --/


C2750D4I Motherboard
Transcend 32 GB DDR3-1866 ECC unbuffered DIMM (running at 1600 MHz)
550W Cooler Master power supply

Before I put the server into production, I performed a burn-in of the RAM using memtest, which it passed.

The backup of my data wasn't stored at my location, so I had to relocate the FreeNAS box to copy my backup. After powering up the newly built FreeNAS in my backup location, I noticed that the zpool status was showing as degraded. At the time, I thought that this was due to one of the SATA cables coming loose. I powered off the server, checked all the cabling, powered on the server, cleared the zpool error and started copying all the files from my backup to FreeNAS.

After a couple of days of copying, I returned the server to its original location and powered it on again. I was greeted by a zpool degraded message again and an error message in my console stating that ada8 was showing checksum errors in the zpool. So again I checked all the cabling, cleared the zpool error and then ran a scrub of the pool. This caused the drive in question (ada8) to disconnect from the RAID and no longer show up in the BIOS.

Following the FreeNAS guide, I shut down FreeNAS, removed the faulty drive and installed the spare. Using the GUI, I replaced the faulty drive, and after a few hours the replacement drive was successfully resilvered into the pool.

After the drive failed, I ran a long smartctl test on all drives attached to the system. The results can be seen at the link below:

http://pastebin.com/vBc2GyDy

Since then, I have had more strange errors when copying or scrubbing my volume, such as:

Code:
 ahcich2: Timeout on slot 14 port 0
ahcich2: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 50 serr 00000000 cmd 10008e17
ahcich4: Timeout on slot 24 port 0
ahcich4: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 50 serr 00000000 cmd 10009817
ahcich3: Timeout on slot 21 port 0
ahcich3: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 50 serr 00000000 cmd 10009517
(ada10:ahcich14:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 06 69 40 a1 00 00 01 00 00
(ada10:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada10:ahcich14:0:0:0): Retrying command
ahcich2: Timeout on slot 13 port 0
ahcich2: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 50 serr 00000000 cmd 10008d17
(ada10:ahcich14:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 d8 55 f2 40 a1 00 00 01 00 00
(ada10:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada10:ahcich14:0:0:0): Retrying command
ahcich3: Timeout on slot 2 port 0
ahcich3: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 50 serr 00000000 cmd 10008217
ahcich3: Timeout on slot 18 port 0
ahcich3: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 40 serr 00000000 cmd 10009217
ahcich5: Timeout on slot 19 port 0
ahcich5: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 10009317
ahcich2: Timeout on slot 11 port 0
ahcich2: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd 40 serr 00000000 cmd 10008b17
ahcich3: Timeout on slot 27 port 0
ahcich3: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 50 serr 00000000 cmd 10009b17 


ada10 seems to be the drive that I receive CRC errors from. According to other FreeNAS forum posts, this could indicate a faulty SATA cable.
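
For anyone reading the register dumps in the log above, here is a minimal Python sketch that decodes one of those lines. It assumes that `cs` and `rs` are bitmasks with one bit per command slot (this agrees with every "Timeout on slot N" line in the log, but it is an interpretation rather than something stated in this thread) and that the low byte of `tfd` is the ATA status register. The function and variable names are made up for illustration.

```python
import re

# Matches the register dump printed after each "Timeout on slot" line.
REG_LINE = re.compile(
    r"^(ahcich\d+): is ([0-9a-f]{8}) cs ([0-9a-f]{8}) ss ([0-9a-f]{8}) "
    r"rs ([0-9a-f]{8}) tfd ([0-9a-f]+) serr ([0-9a-f]{8}) cmd ([0-9a-f]{8})"
)

# ATA status register bits. Bit 0x10 is left out on purpose because its
# meaning has changed between ATA revisions.
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def slots_from_mask(mask):
    """Return the slot numbers whose bits are set in a 32-bit mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


def decode_register_line(line):
    """Decode one register-dump line; return None if it does not match."""
    m = REG_LINE.match(line.strip())
    if m is None:
        return None
    channel, _is, cs, _ss, rs, tfd, serr, _cmd = m.groups()
    status = int(tfd, 16) & 0xFF
    return {
        "channel": channel,
        "cs_slots": slots_from_mask(int(cs, 16)),  # ASSUMED: one bit per slot
        "rs_slots": slots_from_mask(int(rs, 16)),  # ASSUMED: one bit per slot
        "status_flags": [n for b, n in ATA_STATUS_BITS.items() if status & b],
        "serr": int(serr, 16),
    }


# First register dump from the log above.
line = ("ahcich2: is 00000000 cs 00004000 ss 00000000 rs 00004000 "
        "tfd 50 serr 00000000 cmd 10008e17")
decoded = decode_register_line(line)
print(decoded)
```

Under those assumptions, the first dump decodes to slot 14 on ahcich2 with the drive reporting ready and neither BSY nor ERR set, and with `serr` at zero. In other words, the drive does not appear to be flagging an error of its own on the timed-out commands.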

What about these ahcich timeout error messages? Are they related to smartd? I ask because the error message below appears on my console but not in dmesg:

Code:
Oct 4 08:24:08 freenas smartd[12317]: Device: /dev/ada3, failed to read SMART Attribute Data. 


Any help would be most appreciated!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I really can't provide the silver bullet you are looking for. But I'll give you these tidbits...

1. Your hard drives are a bit high in temp. Some drives have recorded temps as high as 52C in the past. That's far above the maximum 40C that is recommended. ada10 was 44C when you did your SMART query. So I'd definitely look at better cooling for starters. ada5 recorded itself at 68C (which has automatically voided your warranty as you exceeded the temperature limit for the drive).
2. I'm jaded against Seagates. They were shitty products back in 2008/2009 (when I bought like 16 of them) and I bought a 6TB last month, it came out of the box broken. A friend of mine also bought a Seagate 6TB for himself, and it came out of the box fine, but 2 weeks later was dead. I personally don't recommend Seagates because my past with them has been horrible.
3. UDMA_CRC_Error_Count is an indicator that is usually (but not always) associated with a bad SATA cable, so I'd replace any that are >5 or so.
4. Some drives have problems with the Marvell chipset. This may or may not be your problem and the only good way I know of to rule it out is to add an M1015 or something and see if the problems go away.

I'm not sure what your long-term prospects are for this server, but the drives that were >50C in their past are probably going to live relatively short lives. The one that got over 60C is probably not going to be working in 3 months. :(
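
The temperature and CRC checks from points 1 and 3 can be sketched in Python as below. This is a rough sketch, assuming the usual ten-column attribute table printed by `smartctl -A` with the raw value in the last column. The 40C and 5-error thresholds are the ones suggested above, and the sample rows are illustrative only: the 44C and 50 raw values are the ada10 figures mentioned in this thread, while the other columns are placeholders and not copied from the pastebin.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value (int) from smartctl -A style rows."""
    attrs = {}
    for row in text.splitlines():
        fields = row.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value in a format this sketch does not handle
    return attrs


def flag_drive(attrs, max_temp_c=40, max_crc=5):
    """Return warnings based on the thresholds suggested in this post."""
    warnings = []
    temp = attrs.get("Temperature_Celsius")
    if temp is not None and temp > max_temp_c:
        warnings.append(f"temperature {temp}C exceeds {max_temp_c}C")
    crc = attrs.get("UDMA_CRC_Error_Count")
    if crc is not None and crc > max_crc:
        warnings.append(f"UDMA_CRC_Error_Count {crc} exceeds {max_crc}")
    return warnings


# ILLUSTRATIVE rows: only the raw values (44 and 50) come from the thread.
SAMPLE = """\
194 Temperature_Celsius   0x0022 100 100 000 Old_age Always - 44
199 UDMA_CRC_Error_Count  0x003e 200 200 000 Old_age Always - 50
"""

warnings = flag_drive(parse_smart_attributes(SAMPLE))
print(warnings)
```

In practice the text would come from running `smartctl -A` against each device in turn. Attribute names vary by vendor, so a drive that reports temperature under a different name would simply not be flagged by this sketch.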
 

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Cyberjock,

Thanks for your quick reply. To answer your questions in a bit further detail:

1. I've cleaned the filters of that particular 4-drive caddy to lower the average temp. Regarding the high temp for ada5 - Somebody who shall remain nameless turned off the air-con in the computer room!

2. I understand your pain but recently my problem seems to be the Hitachi 4TB drives dying - Two out of four so far.

3. I've changed the SATA cable for ada10 (UDMA_CRC_Error_Count = 50). This was the only drive reporting this error. In addition, I've also cut the cable ties that were bunching the SATA cables together, on the off chance it was an interference issue.

4. The M1015 is quite an expensive option. I will consider it though if the problems with the suspect Marvell chipset continue.

Other forums: http://enira.net/?p=709 recommends disabling both Intel Speedstep and C-Bit. Have you experienced problems with either of these BIOS options?

Thanks!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Disabling Intel Speedstep is foolhardy. That's Intel's primary power saving tool for CPUs. If that were a problem it would be *very* well known in these parts. Do a search of the forums and you'll probably find fewer than 10 references to speedstep. No, that's not a problem, at least not *your* problem.

Never heard of C-bit as a BIOS setting at all with any manufacturer, ever. It's not in my BIOS either, so he's either talking out of his butt or he meant to write something else. C-State *may* be what he meant, but mine is enabled and I have no problems. I only had a 4 day up-time until I rebooted to check my BIOS settings, but that's only because I'm constantly doing reboots and testing things for people. :P

I'd argue that his stuff (assuming it even does what he thinks it does) isn't going to solve your problem. You can't disable "C-bit" but you are welcome to turn off speedstep. Just don't get upset when your CPU is running hotter and your power usage is higher. :P
 

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Cyberjock,

I think I'll leave SpeedStep and C-State where they are for the time-being.

Since changing the SATA cable for ada10 I've had no further CRC errors, but I'm still being plagued by the ahcich time-outs.

Other FreeNAS forum posts (https://forums.freenas.org/index.php?threads/ahci-timeouts.1910/page-4) talk about switching from AHCI to IDE as a possible fix, albeit for older versions of FreeNAS.

Would this be worth a try in my case, and would there be any possibility of data loss?

Thanks Again!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The C-State is just the lowest sleep state your CPU will achieve and is not your problem. I wouldn't switch to IDE mode.

My advice is to look in the MB user manual and identify which SATA ports are from the Intel controllers and which ones are from the Marvell controllers. Can you link the issues to a specific controller? If you want, power down the unit and move the SATA cables from one port to another to rule out the cables.
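
To help link the issues to a controller, a small Python sketch like the one below can count the errors per ahcich channel. The event lines are taken from the log in the first post (register dumps omitted). Which ahcich channel belongs to the Intel or the Marvell controller is board-specific and is not filled in here; that mapping would have to come from the manual or the boot messages.

```python
import re
from collections import Counter

TIMEOUT = re.compile(r"^(ahcich\d+): Timeout on slot \d+ port \d+")
CRC = re.compile(
    r"^\((ada\d+):(ahcich\d+):[\d:]+\): CAM status: Uncorrectable parity/CRC error"
)


def tally(log_text):
    """Count timeouts per channel and CRC errors per (device, channel)."""
    timeouts, crc_errors = Counter(), Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        m = TIMEOUT.match(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = CRC.match(line)
        if m:
            crc_errors[(m.group(1), m.group(2))] += 1
    return timeouts, crc_errors


# Event lines from the first post's log, in order.
LOG = """\
ahcich2: Timeout on slot 14 port 0
ahcich4: Timeout on slot 24 port 0
ahcich3: Timeout on slot 21 port 0
(ada10:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
ahcich2: Timeout on slot 13 port 0
(ada10:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
ahcich3: Timeout on slot 2 port 0
ahcich3: Timeout on slot 18 port 0
ahcich5: Timeout on slot 19 port 0
ahcich2: Timeout on slot 11 port 0
ahcich3: Timeout on slot 27 port 0
"""

timeouts, crc_errors = tally(LOG)
for channel, count in sorted(timeouts.items()):
    print(channel, count)
for (device, channel), count in sorted(crc_errors.items()):
    print(device, channel, "CRC errors:", count)
```

On this excerpt the timeouts are spread over four channels (ahcich2 through ahcich5), while the CRC errors all come from ada10 on ahcich14.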

Also, do you have any power savings options turned on for FreeNAS? (powerd, sleeping the drives, etc...) If so, turn them off for testing purposes to see if the problems go away.

Some of those drives have fairly high load cycle counts and power off retract counts like ADA4 and ADA5, and it looks like it's only your Hitachi drives doing this. You must have that drive sleeping and waking up all the time and it's going to live a very short life if you keep spinning it up and down that much.
 

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Joeschmuck,

On your advice, I turned off the powerd option, but it made no difference: when I perform sustained copying to the FreeNAS (>100GB), I'm still seeing ahcich time-outs. I'll try turning off sleeping of the drives next.

More worrying than that, I had two drives (ada0 & ada1) that weren't detected at power-on this afternoon. Eventually Freenas started, albeit about 5 minutes later with a degraded pool. I rebooted the system again and the drives were detected this time but I still have to scan the drives (smartctl) to make sure that they're fine then clear the zpool error.

I think I'm going to try changing the power supply, as the failing ports don't seem to point to one particular SATA controller.

Your thoughts?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should look into which SATA ports are Intel and move all the drives in one pool to them, then test each pool individually for the ahcich errors.

What does your CPU utilization look like?

When you ran Memtest86, how long did you let it run? How many passes or how many days? 3 Days should be good to rule out RAM issues.

I must have been thinking about someone else, but I thought you had replaced the power supply already.
 

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Joeschmuck & Cyberjock,

I can gladly report that the errors were the result of a flaky power supply that was unable to supply enough current during power-up or sustained R/W to the drives - I suspect faulty capacitors.
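
As a rough illustration of why power-up is the demanding moment, the worst-case 12 V load when all drives spin up together can be estimated as below. This is a back-of-envelope Python sketch only: the 2.0 A spin-up current per drive is an assumed placeholder, not a measured or datasheet figure for the drives in this build, and the function name is made up.

```python
def spinup_load_watts(drive_count, spinup_amps_12v, rail_volts=12.0):
    """Peak 12 V load in watts if every drive spins up at the same moment."""
    return drive_count * spinup_amps_12v * rail_volts


# 12 drives as in this build. 2.0 A per drive is an ASSUMED placeholder;
# the real figure should come from the drive datasheet.
peak = spinup_load_watts(drive_count=12, spinup_amps_12v=2.0)
print(f"{peak:.0f} W")
```

With those assumed numbers the drives alone would briefly draw 288 W from the 12 V rail, before the motherboard and fans are counted. The figure to compare against is the 12 V rating on the power supply label rather than the total wattage, and a unit with degraded components may deliver less than its label suggests.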

The clincher was when I had two drives with no previous SMART errors become undetected in BIOS but return after a subsequent reboot.

The power-supply has been replaced and the system is now stable.

Thanks for all your help!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
No problem. I too believe I'm battling a PS issue on a different PC. When I install two sticks of RAM, Memtest passes; install all four sticks, failure. Doesn't matter how I mix the sticks up or which connections I use, four is not stable. The thing is, it was stable when using the power supply I had purchased for my FreeNAS (which this MB/RAM was originally used for but was not ECC). Before purchasing a new power supply though, I'll have to pull my FreeNAS PS and use it to validate my suspicions and then order a new PS. I don't buy cheap power supplies. Anyway, that's a different topic.

Cheers
 

shan81

Dabbler
Joined
Oct 11, 2011
Messages
21
Hi Joeschmuck,

I feel your pain. I tried pretty much everything before changing the power-supply as it's a pain in the ass!

Anyway, hopefully it fixes your problem.

Cheers!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Ordering a new PS. Looks like there is an issue with the PS because the other PS works fine. Tested overnight and passed 3 times. Before, I couldn't get it to run longer than about 25 minutes before it either produced a failure list or just locked up. At least I had another one to test with, even if that meant my FreeNAS server was down overnight.
 