Psycho ZFS Checksum errors - Help


nerfage

Cadet
Joined
Jan 4, 2013
Messages
1
Ok, here's the skinny:

Supermicro 2U Chassis
8x Hitachi 4TB Hard Drives
Highpoint RR3520 Controller (Configured as passthrough)
12x Intel Xeon Cores
64GB Memory

I had originally wanted to use the onboard controller, but it was some built-in Adaptec with no FreeBSD support, so I used the HPT controller. The controller works fine, and I even have smartd working on the drives behind it. The server has been up and online since about the beginning of December. It's not 'live' yet, as it's been going through the process of being filled with data, provisioning and such. It was scheduled to go live this weekend, so I spent most of this week doing the final load of data to it (all using rsync, btw).

Two days ago, rsync got an error attempting to update a file, and I got the warning that the ZFS volume was in an "unknown" state. I ran 'zpool status' and saw one file listed as permanently corrupted (interestingly, not the one rsync complained about). I thought it was no big deal, removed the file, and went on my way. However, after finishing my current batch of rsync tasks, I went back to look, and 'zpool status' not only still reported errors, it now showed 5 new files with corruption. That seemed odd. I tried a 'zpool clear', but all it did was clear the counters. So, just because I wanted the rabbit hole to keep going, I kicked off a scrub. Once it completed, the scrub reported over 90K files with 'permanent corruption'. Super.

Well, here's where it gets weird. I went down the list, chose a few random PDF files, FTP'd them to my local machine, and lo and behold, they're fine. In addition, 'zpool status' is logging tens and hundreds of thousands of checksum errors... on ALL the drives. Check out the shot below (taken during a running scrub).

Code:
  pool: ******xxRaidZ
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Fri Jan  4 08:35:50 2013
        121G scanned out of 991G at 47.3M/s, 5h13m to go
        785M repaired, 12.20% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	******xxRaidZ                                   ONLINE       0     0 1.28K
	  raidz2-0                                      ONLINE       0     0 2.55K
	    gptid/52261dcc-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.32K  (repairing)
	    gptid/5252e4d0-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.80K  (repairing)
	    gptid/527f2801-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.11K  (repairing)
	    gptid/52aaee44-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.04K  (repairing)
	    gptid/52d87580-4896-11e2-93ce-003048631da6  ONLINE       0     0 5.39K  (repairing)
	    gptid/5307b22b-4896-11e2-93ce-003048631da6  ONLINE       0     0 4.40K  (repairing)
	    gptid/53360849-4896-11e2-93ce-003048631da6  ONLINE       0     0 3.75K  (repairing)
	    gptid/536308c7-4896-11e2-93ce-003048631da6  ONLINE       0     0   326  (repairing)

errors: 61995 data errors, use '-v' for a list



As you can see, I'm now getting crazy checksum counts on all the drives. I would go with the 'hardware failure' explanation that every forum thread I've found points to, but I'm seeing the same behavior on every single drive. Each scrub pass leaves the counters at around 120K checksum errors per drive. All 8 drives were burned in with both an MHDD loop and a full pass of SpinRite, and the server (CPU + memory) was burned in for a week before use as well.

Also, the drives show ZERO errors in their SMART status, and all 8 passed both a short and a long self-test through SMART.
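For reference, this is roughly how I ran and checked those tests (the device name here is just a placeholder; yours will differ):

Code:
# start the self-tests on one drive (repeated for each disk)
smartctl -t short /dev/da0
smartctl -t long /dev/da0

# afterwards, check overall health, attributes, and the self-test log
smartctl -H -A /dev/da0
smartctl -l selftest /dev/da0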

As an additional detail, I cleared all the counters, deleted one of the synced folders, and re-copied it to the server. When I did, all 8 drives each racked up about 2.5K checksum errors.
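Roughly the sequence I used for that test (pool name as above; the source host and folder names are just placeholders):

Code:
zpool clear ******xxRaidZ                  # reset all error counters
rm -rf /mnt/******xxRaidZ/somefolder       # drop one previously-synced folder
rsync -av sourcehost:/data/somefolder/ /mnt/******xxRaidZ/somefolder/
zpool status -v ******xxRaidZ              # ~2.5K CKSUM per drive again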

So I guess what I'm asking is: does anyone have any ideas? Usually a failed drive means a few errors on one drive, not 100K on every drive. The data all seems to be there, but I can't just leave it in an error state. This is destined to be a production-level server, so if I can't get these shenanigans to end, I'm going to have to use a different platform altogether.
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
I'll be looking to see what people come up with because the first things that come to mind for me are...
- cabling is loose
- controller is flaky
- power supply is flaky
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
- CPU heat issues (improper heatsink install, inadequate airflow, etc.)
- bad memory
- improperly seated cards/controllers/memory
- etc.

Step 1: Stop the machine, boot up Memtest86, and let it run. This is your very first, must-do-it-now step. If you cannot trust the core system, eventually it WILL write something nasty out to your zpool, and then... zpoof.

Step 2: Assuming you have a Supermicro board, check all the BIOS health status stuff.

If that doesn't locate a serious problem of some sort, then post back.
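If the board has IPMI (most Supermicro server boards do), the same health data can also be read from the running OS; a rough sketch, assuming ipmitool is installed and the FreeBSD ipmi driver is available:

Code:
kldload ipmi        # load the IPMI driver if it isn't already loaded
ipmitool sensor     # temperatures, voltages, fan speeds
ipmitool sel list   # hardware event log (ECC errors, overheat events, etc.)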
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
As the resident HPT controller expert for FreeNAS, I can say that if you tried to run short or long SMART tests through your 3520 controller, you did NOT actually run the tests at all. I'd have to go back and verify this, but I'm 99% sure that SMART self-tests, as well as the SMART warnings for disks that are expected to fail soon, do NOT operate properly with HPT controllers. The only way is through the HPT CLI, which is not immediately compatible with FreeNAS. There is currently a ticket open to add the HPT CLI to FreeNAS, but it is expected with the 9.1 milestone (3+ months away is my guess).

Edit: I'm not sure what your actual problem with the checksums is; I'm just posting because you mentioned your controller.
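For what it's worth, smartmontools does document a HighPoint device type (-d hpt,L/M/N, FreeBSD and Linux only) intended to reach disks behind a RocketRAID. Whether it actually works with the RR3520 on FreeNAS is unverified here, and the device node and channel numbers below are only guesses:

Code:
# L/M/N = controller/channel/PMPort; check the smartctl man page for your driver
smartctl -a -d hpt,1/1 /dev/da0
smartctl -l selftest -d hpt,1/1 /dev/da0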
 