Yikes. Scrub repaired 172K in 26h with 0 errors

Status
Not open for further replies.

TonyToews

Dabbler
Joined
Mar 20, 2012
Messages
33
According to my zpool status, the scrub repaired 172K in 26h... with 0 errors.
No read, write, or CKSUM errors.
At the bottom "No known data errors"

There have been no abrupt power downs since, well, initial startup, and the system is on a UPS.

Now, this is 172K of what? Bytes, sectors, index entries?
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
I would guess 172 kilobytes. Try posting the output of "zpool status". Remember the code tags.
 

TonyToews

Dabbler
Joined
Mar 20, 2012
Messages
33
I do not know what you mean by code tags.

  pool: Vol8Tb
 state: ONLINE
  scan: scrub repaired 172K in 25h56m with 0 errors on Mon Feb 3 01:56:53 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol8Tb                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/cb05738d-73f7-11e1-b8e8-14dae9686191  ONLINE       0     0     0
            gptid/cbd7758b-73f7-11e1-b8e8-14dae9686191  ONLINE       0     0     0
            gptid/cc9a7ff5-73f7-11e1-b8e8-14dae9686191  ONLINE       0     0     0
            gptid/ad374f75-f047-11e2-b98c-14dae9686191  ONLINE       0     0     0
            gptid/ed18361f-e873-11e2-b666-14dae9686191  ONLINE       0     0     0

errors: No known data errors
[root@TTNAS1 ~]#
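
On the units question, here is a minimal Python sketch that turns the scan line above into plain numbers. It assumes the K/M/G/T suffixes printed by zpool are powers of 1024 and that the line has exactly the shape shown in this post; the function name `parse_scrub_line` is made up for illustration and other ZFS versions may word the line differently.

```python
import re

# Binary multipliers; assumes zpool's K/M/G/T suffixes are powers of 1024.
_UNITS = {"": 1, "B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Matches only the wording seen in this thread's output.
_SCAN_RE = re.compile(
    r"scrub repaired (?P<amount>[\d.]+)(?P<unit>[BKMGT]?) "
    r"in (?P<hours>\d+)h(?P<minutes>\d+)m "
    r"with (?P<errors>\d+) errors"
)


def parse_scrub_line(line):
    """Return (repaired_bytes, duration_seconds, error_count), or None if no match."""
    match = _SCAN_RE.search(line)
    if match is None:
        return None
    repaired = int(float(match["amount"]) * _UNITS[match["unit"]])
    seconds = int(match["hours"]) * 3600 + int(match["minutes"]) * 60
    return repaired, seconds, int(match["errors"])


line = "scan: scrub repaired 172K in 25h56m with 0 errors on Mon Feb 3 01:56:53 2014"
print(parse_scrub_line(line))  # (176128, 93360, 0)
```

Under those assumptions, 172K works out to 176,128 bytes repaired over a scrub lasting 93,360 seconds.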
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Looks all right: you haven't got any read/write errors. Is the system running on ECC memory?

Maybe schedule a long SMART test and see if anything crops up.
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Then what you saw is probably bit-flipping, which means you could be in trouble. First of all, power down the NAS and then run Memtest86+ on it for 5-6 passes. Verify that the memory is good before proceeding any further.

Bad RAM can kill your pool.

Also verify your backups at this point.
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
It means that ZFS found 172 kB on disk that did not match the checksum/parity information that ZFS has stored for it. There are two likely causes for this problem:

1. Bad drive. But since you have no read/write errors, this is unlikely.
2. Bad memory, which is very likely in your case since you have consumer-grade hardware.

If you are unlucky and scenario 2 applies, you can effectively kill your pool, because checksums can be corrupted in flight between the CPU and the disk.
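
The detect-and-repair behaviour described above can be illustrated with a toy Python sketch. This is not how ZFS is implemented: it assumes a simple two-way mirror rather than raidz2, uses SHA-256 as a stand-in checksum, and all names (`scrub`, `disk_a`, `disk_b`) are invented for the example. It only shows why a scrub can report bytes repaired while still reporting 0 errors.

```python
import hashlib


def checksum(data):
    # SHA-256 stands in for whatever checksum the pool is configured with.
    return hashlib.sha256(data).digest()


def scrub(copies, expected):
    """Verify every copy of every block and rewrite bad copies from a good one.

    copies:   list of devices, each a list of bytes blocks
    expected: checksums recorded when each block was written
    Returns (repaired_bytes, unrecoverable_blocks).
    """
    repaired = 0
    unrecoverable = 0
    for index, want in enumerate(expected):
        good = next(
            (dev[index] for dev in copies if checksum(dev[index]) == want), None
        )
        if good is None:
            unrecoverable += 1  # no intact copy left: a real data error
            continue
        for dev in copies:
            if checksum(dev[index]) != want:
                repaired += len(good)
                dev[index] = good  # heal the bad copy from the intact one
    return repaired, unrecoverable


blocks = [b"alpha" * 100, b"bravo" * 100, b"delta" * 100]
expected = [checksum(b) for b in blocks]
disk_a = list(blocks)
disk_b = list(blocks)
disk_b[1] = b"brXvo" + blocks[1][5:]  # simulate silent corruption on one device

result = scrub([disk_a, disk_b], expected)
print(result)  # (500, 0)
```

In this toy model the mismatching 500-byte block is rewritten from the intact copy, so the scrub counts 500 bytes repaired and no unrecoverable blocks.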

I would suggest reading the Wikipedia article on ZFS to see how it is designed.
 

Starpulkka

Contributor
Joined
Apr 9, 2013
Messages
179
It looks like the scrub found a checksum mismatch somewhere, but not on the hard drives. If I were you, I would run a two-day memory test and change the SATA cables. I am interested: do you use jails? Edit: A more detailed answer might say that the error did end up on your hard drive, but let's not go there; just check your hardware.
 

TonyToews

Dabbler
Joined
Mar 20, 2012
Messages
33
I have been running Memtest86 now for 11 hours with no errors so far. I'll let it run for another 36 hours. I am not using jails of any sort. The box has been in service for almost two years, judging from the date when I joined this forum.
 

TonyToews

Dabbler
Joined
Mar 20, 2012
Messages
33
I wish that all motherboards supported ECC memory for those who want to spend the extra money, which I would gladly do. <sigh>
 

TonyToews

Dabbler
Joined
Mar 20, 2012
Messages
33
I have been running Memtest86 now for 62 hours with no errors.

I do not understand how SATA cables could be the problem, as I would assume there are all kinds of data packet CRC and similar checks done between the main CPU and the CPU on the SATA hard drives.
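
The kind of link-level check assumed above can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` is used as a stand-in 32-bit CRC and is not claimed to match the SATA link layer bit for bit, and the 512-byte payload and the `flip_bit` helper are invented for the example.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload = bytes(range(256)) * 2      # 512-byte stand-in for one sector
sent_crc = zlib.crc32(payload)       # the sender appends this to the frame

received = flip_bit(payload, 1234)   # one bit damaged on the wire
crc_ok = zlib.crc32(received) == sent_crc
print("retransmit needed:", not crc_ok)  # retransmit needed: True
```

A CRC of this kind catches any single flipped bit, so a marginal cable would be expected to show up as retries or link errors rather than as silently corrupted data.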
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Then I'm out of ideas. It may have been a stray bit flip that corrupted a file checksum. Either way, you have done all you can and should go back to just using your system.

I would be careful though, maybe make a backup snapshot here so you can roll back.
 

tio

Contributor
Joined
Oct 30, 2013
Messages
119
I had this when I didn't reconnect one of my RAIDZ2 drives after performing a RAM upgrade. I reconnected it and it resilvered the exact same amount of data.

Check your cables and connections on your main board. Also get cables with spring clips on them to prevent accidental disconnection.
 

Starpulkka

Contributor
Joined
Apr 9, 2013
Messages
179
I searched the forums for 256k and found quite a few topics about scrubs that are similar to this case: the error counts were zero on all HDDs, but the scrub reported 256k corrected, even on ECC memory machines. When a scrub of mine did find something, it clearly showed which drive it was found on.

The next step in those other topics was a full SMART test of the HDDs, and some HDDs didn't pass the long test. Maybe saving the logs before and after the SMART test would be a smart thing to do? We now know that the memory should be good.
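
One way to compare logs saved before and after a long test is sketched below in Python. It assumes the logs are text saved from `smartctl -A` for the drive, with the attribute table in its usual ten-column layout and the raw value in the last column. The sample rows are made up for illustration and do not come from any drive in this thread, and the function names are hypothetical.

```python
def parse_attributes(report):
    """Map attribute name -> raw value string from `smartctl -A` style text."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attributes[fields[1]] = fields[9]
    return attributes


def changed_attributes(before, after):
    """Return {name: (old, new)} for every raw value that differs."""
    old, new = parse_attributes(before), parse_attributes(after)
    return {
        name: (old.get(name), value)
        for name, value in new.items()
        if old.get(name) != value
    }


# Made-up sample rows, for illustration only.
BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""
print(changed_attributes(BEFORE, AFTER))  # {'Current_Pending_Sector': ('0', '8')}
```

Counters such as reallocated or pending sectors that grow between the two logs would point at a specific drive even when zpool status shows zeros for every device.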
 

Ken Almond

Dabbler
Joined
May 11, 2014
Messages
19
I got this in this morning's report: "scan: scrub repaired 64K in 42h48m with 0 errors on Mon Nov 13 18:48:05 2017"
The root event was a motherboard that 'froze/failed' due to an overclock timing issue. I reset the motherboard to BIOS defaults (to get rid of overclock) and rebooted - and got this message a while later.

So based on above - I assume this is related to 'recovery' of the system because of bad motherboard/freeze.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I have also seen this type of scrub result on ECC/server pools. The pool in question has also had drive disconnections/removals, so I think there is something to that explanation, and it may be why no errors register against any device.
 

chris crude

Patron
Joined
Oct 13, 2016
Messages
210
I'm not your dad and I can't tell you what to do, but the two things you want from a server are CPU/memory stability and heat management. The two things an overclock will cause are CPU/memory instability and more heat.
Do what you want, but you won't get much help from the regulars around here if you come asking why things don't work as they should and then admit to overclocking a server.
 

Ken Almond

Dabbler
Joined
May 11, 2014
Messages
19
Agreed! I had a motherboard failure, bought a used one on eBay, and slapped it in. I didn't realize (or pay attention) that it was set to be overclocked. That's why I reset the BIOS to get rid of the overclock settings.
 

chris crude

Patron
Joined
Oct 13, 2016
Messages
210
Did the eBay seller advertise that it had been used for overclocking? If not, that might be considered false advertising or a breach of their eBay store contract if you have any problems with the board.
I'm glad you found the problem but I worry about the lifespan of that board.
 