Scrub finds (and fixes) errors every time it is run, checksum errors for all disks in pool

Status
Not open for further replies.

Mithril

Cadet
Joined
Jan 9, 2012
Messages
4
FreeNAS version: 9.2.1.7
6 x 3TB drives in the pool, a mix of Seagate, Toshiba and WD
10GB system RAM
LGA 775-era motherboard (I can get the exact model if needed; I just need to crack open the case)

All drives are in a single RAID-Z3 pool. All data is stored in one of the datasets; no data is stored directly in the pool's root dataset.
2 of the drives are new, 2 are less than a year old, and the other 2 are about 2 years old.

The problem:
Running zpool scrub pool finds some errors, which are repaired, and every drive ends up with checksum errors listed under zpool status pool. Running the scrub again gives a similar result. No matter how many times the scrub is run, errors are found and fixed. The amount of data repaired is usually between ~20MB and ~500MB.
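For anyone tracking the per-drive counts between scrubs, here is a minimal Python sketch that pulls the non-zero CKSUM values out of saved zpool status text. The sample output, device names and counts are invented for illustration (they are not from this system), and the parser assumes the plain five-column NAME/STATE/READ/WRITE/CKSUM layout with whole-number counts.

```python
import re

# Hypothetical sample of `zpool status pool` output. Device names and counts
# are invented for illustration, not taken from the system in this thread.
SAMPLE_STATUS = """\
  pool: pool
 state: ONLINE
  scan: scrub repaired 120M in 5h12m with 0 errors on Sun Jan  4 05:12:00 2015
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            ada0    ONLINE       0     0    12
            ada1    ONLINE       0     0     7
            ada2    ONLINE       0     0    31

errors: No known data errors
"""

# A config row is: indentation, name, state, then three integer counters.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def checksum_errors(status_text):
    """Return {device_name: CKSUM count} for rows with a non-zero CKSUM count."""
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, _state, _read, _write, cksum = match.groups()
            if int(cksum) > 0:
                errors[name] = int(cksum)
    return errors


if __name__ == "__main__":
    print(checksum_errors(SAMPLE_STATUS))
```

Saving the result after each scrub makes it easy to see whether the errors stick to particular drives or move around. Errors spread across every drive, as described here, point away from any single disk.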

What has already been done:
memtest86+ overnight: no errors
checked each drive's SMART logs: only the Seagate drive has any troubling errors
all non-system datasets were destroyed
each drive was taken out of the pool and given a full badblocks read/write test: no errors
once it passed, each drive was formatted and added back to the pool with zpool replace
after each drive was added and had finished resilvering, a scrub was run with only the system datasets: no errors
datasets were (re)created, and the data was copied over the network with rsync
and now I'm back to square one.

Possibilities I have not tested:
The CPU itself is going bad and the corruption is happening in the L1/L2 cache: unlikely, but possible. I don't have anything to swap in.
The motherboard itself is causing corruption, possibly through a fault in the southbridge or SATA controller: I may have enough PCIe SATA cards to check that. Of note, the motherboard doesn't support 3TB drives correctly; if you run Windows on the board it will not see the drives correctly. FreeNAS has no issue seeing the full capacity.
I recently changed the secondary SATA controller from RAID to AHCI; none of the pool drives are connected to it, however.
The power supply should be stable. I do have another unit I could try but have not yet done so.
Completely delete the pool and create a new one: I'd just have to save the system data etc.

Not sure what to do at this point. The motherboard and CPU have been stable for years; it was previously my main server and never gave me any grief, except for Windows not seeing 3TB drives correctly unless they were connected to a PCIe card (and no, the "latest" BIOS doesn't fix it; there is a non-OEM BIOS that does, but it causes instability).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
memtest really isn't as great at finding memory problems as people might think. There are lots of examples of memtest saying all was well when things were in fact not so great. My guess is your hardware just isn't 100% compatible. Can't say I'm even remotely surprised, considering 775 is desktop hardware: it lacks support for things like ECC while adding things like audio cards and other vendor-specific hardware that can create all-new problems for FreeNAS.

A friend had this behavior with a test system about 2 years ago. It was desktop stuff and it would always find problems. As soon as we tore the box down and rebuilt it with proper hardware (the hard drives and PSU were all that were reused) the problem went away completely.
 

Mithril

Cadet
Joined
Jan 9, 2012
Messages
4
The pool errors are new; I've been using the same motherboard and CPU (edit: with FreeNAS) for about a year. The previous setup was a bit... unusual and might get me some hate :) However, weekly scrubs always came back clean.

re: memtest86+, I've only ever heard good things about it, and I'd think running it overnight would be fairly comprehensive. Is there a better alternative? I'd be more than happy to test again if so.

Come to think of it, I may have changed the main SATA mode as well, hmm.

edit: Since it is possible I forgot that I changed the native SATA ports' mode as well, is it likely/possible that would cause this kind of issue?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
AFAIK the better alternative is to use ECC hardware. That is, literally, the only "good" fix. The rest are bandaids after you've already bled all over the carpet and walls. :P

Your SATA ports should be set to AHCI... that's covered in the manual. ;)
 

Mithril

Cadet
Joined
Jan 9, 2012
Messages
4
I'm no stranger to the forum, so I'm quite familiar with the (nearly religious) preference for ECC. :) The likelihood of this specific problem being down to (non-faulty) non-ECC memory is very low, and as far as I can tell the memory tests fine.
That is somewhat beside the point, though: previously working hardware and software is now producing errors. I'd like to narrow the issue down to the specific failure, either software or hardware.
This isn't my primary backup (it is at the moment one of three full backups); my goal is primarily to identify the issue.

Any links to what is wrong with memtest86+ or a better tool would be great. It is entirely possible some of the RAM is faulty, but I don't know of any other well regarded memory test.

The ports are currently set to AHCI; I believe they were not previously.

I'm currently running rsync to compare checksums between this FreeNAS box and the primary server (where the data was copied from). At current speed I've got about a day remaining (I wish 10G was cheaper :) ) I'm curious as to what the end result is.
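A checksum comparison of this kind can also be sketched in a few lines of Python. This is only an illustration of the idea, not what rsync does internally. It assumes both copies are reachable as local paths (for example over a network mount), and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files are not loaded into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(source_root, copy_root):
    """Report files missing from the copy, extra in the copy, or differing in content."""
    source = tree_digests(source_root)
    copy = tree_digests(copy_root)
    return {
        "missing": sorted(set(source) - set(copy)),
        "extra": sorted(set(copy) - set(source)),
        "mismatched": sorted(
            name for name in source.keys() & copy.keys() if source[name] != copy[name]
        ),
    }
```

Calling compare_trees with the two top-level directories returns three lists of relative paths. An empty "mismatched" list means every file present on both sides hashed identically at the time it was read.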

I'd pop the entire pool into my other FreeNAS box but there are not enough SATA ports and I can't bring it down anytime soon to put in PCIe cards :(
 

sef

Guest
Memory isn't the only place bit errors can happen -- they can happen on the CPU, on the bus, and on the controller card/chips. (This is the reason why ECC is so important, since you are trying to minimize the places where bit flips can happen.)

In your case, I wonder if the hardware is having problems. I'd want to hook the drives up to another system, and see what happens then.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Mithril said: "I'm currently running rsync to compare checksums between this FreeNAS box and the primary server (where the data was copied from). At current speed I've got about a day remaining (I wish 10G was cheaper :) ) I'm curious as to what the end result is."

I can tell you what the end result will be. Everything will be fine unless:

1. Your hardware is corrupting data after ZFS has checksummed it. In that case ZFS is repairing it (those are the repairs you are seeing in zpool status).
2. "zpool status" actually tells you data was lost (it doesn't at present, so I wouldn't expect this to be a problem).
 

Mithril

Cadet
Joined
Jan 9, 2012
Messages
4
So I did find the root of the problem (after letting the hardware sit cold for a few weeks due to other projects being more pressing). It turns out the power supply was at fault. When I was running memtest the drives were idling, but with the drives running (or an equivalent simulated load) there was random memory corruption every so often. High CPU load made the issue a little bit worse. The 12V line from the PSU was dropping to about the limit of the ATX spec and had quite a bit of noise (even at low draw). The PSU is a quality unit, but it is an older one; I bet if I opened it I would find one or more bulging/failed caps.
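For reference, the ATX spec allows +/-5% on the +12V rail, which works out to roughly 11.4V to 12.6V. A small Python sketch of that arithmetic follows. The 0.1V "marginal" band and the example readings are assumptions chosen for illustration, not measurements from this PSU.

```python
ATX_12V_NOMINAL = 12.0     # volts
ATX_12V_TOLERANCE = 0.05   # +/-5% allowed on the +12V rail


def rail_limits(nominal=ATX_12V_NOMINAL, tolerance=ATX_12V_TOLERANCE):
    """Return the (low, high) voltage limits for a rail with a symmetric tolerance."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def classify_reading(volts, margin=0.1):
    """Classify a +12V reading as 'ok', 'marginal' or 'out of spec'.

    margin is a hypothetical safety band (in volts) just inside the limits;
    it is an assumption for this sketch, not part of the ATX spec.
    """
    low, high = rail_limits()
    if volts < low or volts > high:
        return "out of spec"
    if volts - low < margin or high - volts < margin:
        return "marginal"
    return "ok"


if __name__ == "__main__":
    # Hypothetical multimeter readings, for illustration only.
    for reading in (12.05, 11.45, 11.2):
        print(reading, classify_reading(reading))
```

A rail sitting right at the lower limit under load, as described above, would land in the "marginal" band. This check covers only the voltage level, not the noise on the line.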

The PSU has been... "retired", and I am now trying to figure out if I want to go big, or small, with the new secondary FreeNAS box, but that is a whole new thread :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
GO BIG!
 