Ok, here's the skinny:
Supermicro 2U Chassis
8x Hitachi 4TB Hard Drives
Highpoint RR3520 Controller (Configured as passthrough)
12x Intel Xeon Cores
64GB Memory
I had originally wanted to use the onboard controller, but it was some built-in Adaptec with no FreeBSD support, so I used the HPT controller instead. The controller works fine, and I even have smartd working on the drives behind it. The server has been up and online since about the beginning of December. It's not 'live' yet, as it's been going through the process of being filled with data, provisioning, and such. It was scheduled to go live this weekend, so I spent most of this week doing the final load of data to it (all via rsync, btw).

Two days ago, rsync hit an error attempting to update a file, and I got a warning that the ZFS vol was in an "unknown" state. I used 'zpool status' and saw one file listed as permanently corrupted (interestingly, not the one rsync complained about). Thought it was no big deal, removed the file, and went on my way. However, after finishing my current batch of rsync tasks, I went back to look, and 'zpool status' not only still reported errors, it now showed 5 new files with corruption. Thought this was odd. Tried a 'zpool clear', but all it did was reset the counters. So then, just because I wanted the rabbit hole to keep going, I kicked off a scrub. Once it completed, the scrub reported over 90K files with 'permanent corruption'. Super.

Well, here's where it gets weird. I went down the list, chose a few random PDF files off it, FTP'd them to my local machine, and lo and behold, they're fine. In addition, 'zpool status' is logging tens and hundreds of thousands of checksum errors... on ALL the drives. Check out the shot below (taken during a running scrub).
Code:
  pool: ******xxRaidZ
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Fri Jan  4 08:35:50 2013
        121G scanned out of 991G at 47.3M/s, 5h13m to go
        785M repaired, 12.20% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        ******xxRaidZ                                   ONLINE       0     0 1.28K
          raidz2-0                                      ONLINE       0     0 2.55K
            gptid/52261dcc-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.32K  (repairing)
            gptid/5252e4d0-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.80K  (repairing)
            gptid/527f2801-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.11K  (repairing)
            gptid/52aaee44-4896-11e2-93ce-003048631da6  ONLINE       0     0 6.04K  (repairing)
            gptid/52d87580-4896-11e2-93ce-003048631da6  ONLINE       0     0 5.39K  (repairing)
            gptid/5307b22b-4896-11e2-93ce-003048631da6  ONLINE       0     0 4.40K  (repairing)
            gptid/53360849-4896-11e2-93ce-003048631da6  ONLINE       0     0 3.75K  (repairing)
            gptid/536308c7-4896-11e2-93ce-003048631da6  ONLINE       0     0 326    (repairing)

errors: 61995 data errors, use '-v' for a list
As you can see, I'm getting crazy checksum counts on all the drives now. I'd go with the 'hardware failure' verdict that every forum thread out there points to, except that I'm seeing the same behavior on all eight drives at once. Each scrub pass leaves the counters at roughly 120K checksum errors per drive. All 8 drives were burned in with both an MHDD loop and a full pass of SpinRite, and the server (CPU + memory) was burned in for a week before use as well.
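For anyone who wants to retrace the steps, this is roughly the sequence I've been running. (The pool name is masked the same way as above; substitute your own.)

Code:
# List pool health plus the files ZFS considers permanently corrupted
zpool status -v ******xxRaidZ

# Reset the error counters -- this zeroes the numbers, but the errors come right back
zpool clear ******xxRaidZ

# Full scrub -- this is what drove the counters to ~120K per drive
zpool scrub ******xxRaidZ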
Also, the drives show ZERO errors in their SMART status, and all 8 passed both a short and a long SMART self-test.
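In case it matters, this is roughly how I checked them. (The da0..da7 device names are approximations; adjust for however the HPT presents the disks on your box.)

Code:
# Per drive: dump SMART attributes/error log, then kick off a self-test.
# da0..da7 are placeholders -- substitute whatever devices your controller exposes.
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    smartctl -a /dev/$d          # attributes + error log: all clean
    smartctl -t short /dev/$d    # short self-test
done
# After the short tests passed, same loop with 'smartctl -t long' -- also all passed.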
As an additional data point, I cleared all the counters, deleted one of the already-synced folders, and re-copied it to the server. When I did, all 8 drives racked up about 2.5K checksum errors each.
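Roughly what that test looked like (source host and paths here are placeholders, not the real ones):

Code:
zpool clear ******xxRaidZ                                   # zero all the counters first
rm -rf /mnt/******xxRaidZ/some-folder                       # drop one already-synced folder
rsync -av sourcebox:/data/some-folder /mnt/******xxRaidZ/   # copy it back over
zpool status ******xxRaidZ                                  # ~2.5K CKSUM on every drive again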
So, I guess what I'm asking is: does anyone have any ideas? Usually a failing drive means a handful of errors on one drive, not 100K on every drive. The data all seems to be there, but I can't just leave the pool in an error state. This is destined to be a production server, so if I can't get these shenanigans to stop, I'm going to have to move to a different platform altogether.