Got Corruption on System

denaba · May 17, 2023

OK, I re-seated the memory sticks and still got the same results. Here is what I did to isolate the bad stick:

Re-seated - same results; failed
Swapped memory with each other - same results; failed
Removed one stick leaving one - that one passed the whole test
Removed passed stick, inserted other memory - this one failed the test

So I am guessing I have two options

A - return the memory stick and get a replacement (this just gets me going again)
or
B - Replace everything inside; mobo, CPU, memory for server grade. (takes me longer to get up and running)

My other question, then: is my tank pool hosed? With option A above, am I going to have to wipe it out and copy all the files back over?

Off of work today so I can spend more time here. Or do I need to test something else?

Thank you

artlessknave · May 17, 2023

usually, the only data that should be corrupted is the data zfs reports as corrupted. there is a condition that can cause the checksums to be corrupted and thus everything they checksum but that should be really rare to actually occur, and should only occur on a few small things.

you might as well get working RAM. you can use the system for something else if you decide to replace it. these systems would work out tolerably as a backup destination, but having ECC on the main system is ideal since that's where all the work is done.

denaba · May 18, 2023

OK, got the memory in, going to run memtest on those new ones.

Once that is done, do I destroy the pool and then copy everything over from the main box? I have never had data corruption before, so I am not sure how to proceed after the memtest of the new sticks.

artlessknave · May 18, 2023

i don't remember the entire pool being corrupted. was it?
zfs will tell you which files it found as being corrupted. while there is a condition that can theoretically cause the checksums to get created with corrupted data, that should be spectacularly rare. any data zfs doesn't report as corrupted should be fine. you need to get RAM that isn't mangling more data, and replace anything that was corrupted.

you can certainly overwrite it if you wish to. though, you should be using replication, ideally.

denaba · May 18, 2023

After running the scrub, the indications were that just some files were corrupted. After each scrub I ran zpool status -v and got 1 or 2 more files, or on some occasions 5 or 10. I did what was said here: deleted the corrupted file, rebooted, and then ran another scrub. Then I replaced the file and ran -v again, only to find another 1 or 2 or more, so I repeated the process. Each time, the file that I had replaced was not listed again, so it looks like replacing them helped. But after every scrub, new files were identified. Again not many, but new ones.

Is this the right process for replacing the corrupted files?

Is there a command to see all the corrupted files? Or am I stuck going through the above?

Again, thank you artlessknave for all your help, wisdom and time helping me.

Apollo · May 18, 2023

denaba said:
After running the scrub, the indications were that just some files were corrupted. After each scrub I ran zpool status -v and got 1 or 2 more files, or on some occasions 5 or 10. I did what was said here: deleted the corrupted file, rebooted, and then ran another scrub. Then I replaced the file and ran -v again, only to find another 1 or 2 or more, so I repeated the process. Each time, the file that I had replaced was not listed again, so it looks like replacing them helped. But after every scrub, new files were identified. Again not many, but new ones.

Is this the right process for replacing the corrupted files?

Is there a command to see all the corrupted files? Or am I stuck going through the above?

Again, thank you artlessknave for all your help, wisdom and time helping me.

There is no reason to reboot your system after each scrub. ZFS is not designed or intended to work that way, and it is not necessary.
If you are getting random errors popping up during scrubs, then your system is not doing its job reliably and should be put aside.
Either you switch to more suitable, recommended hardware, or you might lose more than just a few files.

denaba · May 18, 2023

@Apollo - as mentioned earlier my two options are;
- get things up and running (which I am doing)
- start looking at server grade parts to buy.

The system is not doing well because of the faulty memory. New sticks are in to get by until I acquire server parts. I need to see a listing of boards, CPUs, and ECC memory for my situation. This system holds backup files from the main system (which is unaffected and running great). The main system is mainly read, with occasional writes. For simplicity: box A contains the files and copies them to box B. Box B holds the copies and is where the issues arose.

Noted on the reboot. So based on what you said, if I run -v I will see the corrupted file(s) and their location. If I delete that file and run -v afterwards (no reboot and no scrub), then I should get the A2DX57 (something like that), meaning that the file has been deleted, correct? Scrubbing will remove that entry and then find any other corrupted file(s), and then I repeat the process, correct?

Again, just trying to get the system back up

@artlessknave - new memory sticks in and memtest comes back all PASS. So to temporarily get me back up and running I should just continue my path of replacing files for the meantime.

Given how I use these boxes, any recommendations for server grade parts to replace them with? I use only one system to access the main box. Writes are rare; I mostly read files.

Currently;
32GB of memory
motherboard has 6 SATA ports
Using 4 x 8TB HGST's (using only half the available though. Got the 8's at a steal so extra space)
M.2 where TrueNAS is loaded on a stick. 2nd M.2 mirroring the main

I would like a motherboard with room for expansion, meaning 4 memory slots in case I want to add more later. A 1Gb ethernet port works for me in my setup around the house. The cabling can do 2.5Gb, but the switches and my router are all 1Gb, and for what I use these boxes for 1Gb is fine.

Like you mentioned ECC is key.

Thank you

Apollo · May 19, 2023

denaba said:
@Apollo - as mentioned earlier my two options are;

Noted on the reboot. So based on what you said then if I run -v I will see the corrupted file(s) and its location. If I delete that file and I run -v afterwards (no reboot and no scrub) then I should get the A2DX57 (something like that) meaning that the file has been deleted, correct? Scrubbing will remove that entry and then find any other file(s) corrupted, then repeat process correct?

When you run a scrub on a pool, ZFS reads every block that exists on the pool, calculates the checksum of the data contained in the block, and compares it with the checksum value stored for that block (ZFS keeps it in the parent block pointer rather than in the block itself).
When the two checksums fail to match, the scrub still proceeds to the end. Depending on the scenario, it either leaves the data unchanged or repairs the corrupted block, provided a good copy of the data is available on another drive or the data can be reconstructed from parity.

Upon resilvering that particular block, a new block is in fact created, and the pool's ZFS metadata is modified to point to the new block, discarding the old one.
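
To make the compare-and-repair idea above more concrete, here is a minimal sketch in Python. It is a toy model, not how ZFS is implemented: the names `checksum` and `scrub_block` are made up for illustration, SHA-256 stands in for whatever checksum the pool actually uses, and the "mirror" is just a list of in-memory copies. Overwriting the bad copy in place is a simplification and says nothing about the real on-disk layout.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum (SHA-256); an assumption, not the pool's real algorithm."""
    return hashlib.sha256(data).hexdigest()


def scrub_block(copies: list, expected: str) -> str:
    """Verify every copy of one block against the separately stored checksum.

    copies   -- list of bytearray objects, one per mirror side (hypothetical model)
    expected -- checksum recorded when the block was written
    Returns "ok", "repaired", or "unrecoverable".
    """
    # Look for any copy whose contents still match the recorded checksum.
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        # No trustworthy copy left: this is the case that ends up listed as a
        # permanent error against a file.
        return "unrecoverable"

    repaired = False
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # heal the bad copy from the good one
            repaired = True
    return "repaired" if repaired else "ok"


# Usage: one side of a two-way mirror has a flipped byte.
original = b"block contents"
stored = checksum(original)
mirror = [bytearray(original), bytearray(b"block c0ntents")]
print(scrub_block(mirror, stored))
```

Note that the sketch only helps when the recorded checksum itself is sound. With bad non-ECC RAM, the data or the checksum can be mangled before it ever reaches the drives, which is the rare condition artlessknave mentioned earlier.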

When errors have been detected during scrub, ZFS will inform you about the state of the file that has caused checksum discrepancy.
If we take your second screenshot, of your mirrored boot-pool, errors were detected because the calculated checksum and the stored checksum were different. ZFS is designed to figure out which copy of the data is to be trusted. As such, when it finds a checksum discrepancy, it is able to validate whether the data is still correct. When that happens and no corruption of the data is found, you will see the message "errors: No known data errors", which is confusing, to say the least.

When data is believed to be corrupted because of the lack of proper redundancy, ZFS will provide details of the damage with the -v argument.
At this stage, if you decide to replace the file that was corrupted, I believe ZFS is still going to report the corruption. ZFS doesn't know the new file is supposed to replace the old one, as it treats them as two different entities. If the file is stored in the same location and has the same name, then ZFS should understand the new file replaces the old one, though I don't know exactly what happens when a snapshot is referencing the corrupted file.
'zpool clear' will have to be used at some point, which will cause ZFS to clear its error status so the pool appears clean.
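
On the earlier question of seeing all the flagged files at once: `zpool status -v` already prints the full list it knows about after the line "errors: Permanent errors have been detected in the following files:". If you want that list on its own, something like the sketch below could pull it out of the text. The function name `corrupted_entries` and the sample paths are hypothetical, and the parsing assumes the usual layout of the command's output, so treat it as a starting point rather than a supported tool.

```python
def corrupted_entries(status_text: str) -> list:
    """Return the entries listed under 'errors: Permanent errors ...'.

    status_text is the captured output of `zpool status -v`. Entries can be
    file paths or dataset:<object id> references when the file no longer exists.
    """
    entries = []
    in_list = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors: Permanent errors"):
            in_list = True           # the list starts after this line
            continue
        if in_list:
            if stripped.startswith("pool:"):
                in_list = False      # next pool's section begins
            elif stripped:
                entries.append(stripped)
    return entries


# Sample text with made-up paths, shaped like typical `zpool status -v` output.
SAMPLE = """\
  pool: tank
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        /mnt/tank/movies/example.mkv
        tank/media:<0x1a2b>
"""

for entry in corrupted_entries(SAMPLE):
    print(entry)
```

To run it against a live system, the text could be captured with `subprocess.run(["zpool", "status", "-v"], capture_output=True, text=True).stdout` and passed in. The list only reflects what the last scrub and normal reads have found, so files damaged since then will not show up until the next scrub.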

The proper step at that point is to run through another scrub and see what is being reported.

If the scrub does report errors on the data pool, then apply the same reasoning: understand how much corruption there is and deal with it accordingly.

In the meantime, you could try running with only the one working stick.

denaba · May 19, 2023

Interesting reading, Apollo. I decided to check the corrupted files; in this case two movies, Platoon and Hot Tub Time Machine. Both played (the Hot Tub one was funny to watch). I thought that if a file is corrupt you see something wrong with it or you cannot access it. I have seen this on Windows, where a picture from New York first displayed clearly; a year later there was a huge gray bar and you could only see about 85% of the picture.

To my surprise, Hot Tub Time Machine played the whole way through; Platoon I only watched a bit of. These two movies were flagged as corrupt after the last scrub.

I will run the pool scrub today (going to work) and when I get back I will run the -v to see as you mentioned what is being reported. Wonder if the two movies will appear again.

Thank you

artlessknave · May 19, 2023

it depends what is corrupted. if the corruption is just to some metadata, or hard to notice media, you would "see" nothing.

I don't have a lot of insight on unsnarling such a mess of corrupted data, as I have never been in that position. i do wonder if you are just copying data manually, because that's what it sounds like? or are you using snapshots and replication?

denaba · May 19, 2023

Yes, I was first deleting the file via my Windows system. After scrubbing, I replaced the file with the one from my other box. When the next scrub ended and I used -v, that particular file never showed up again as a corrupted file.

After today's scrub it shows no corrupted files, but now it shows this. What now? From two Windows computers I can still see the files, and I tried opening them and they are fine.

Would it be easier to just delete the pool tank, create a new one, and then spend the time transferring everything over again?

Also, this time when the scrub ran the unit did not reboot like before. The scrub ran fine without interruption.

Apollo · May 19, 2023

Not sure why the pool is in a Degraded state. It would seem, maybe, that 3 of the drives went offline and came back.
There are no errors the pool is reporting otherwise.
Execute "zpool clear tank" to clear the faulty state.
Then, when the pool shows as "ONLINE", run another scrub.
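
The order of those two steps can be sketched as a small helper that reads the "state:" line and suggests what to run next. This is only an illustration of the sequence described here: `pool_state` and `next_commands` are made-up names, the pool name "tank" comes from this thread, and clearing a DEGRADED pool is only reasonable in a case like this one where no errors are otherwise reported.

```python
import re
from typing import List, Optional


def pool_state(status_text: str) -> Optional[str]:
    """Extract the value of the first 'state:' line from `zpool status` output."""
    match = re.search(r"^\s*state:\s*(\S+)", status_text, re.MULTILINE)
    return match.group(1) if match else None


def next_commands(status_text: str, pool: str = "tank") -> List[List[str]]:
    """Suggest the next commands, following the clear-then-scrub order above."""
    state = pool_state(status_text)
    if state == "DEGRADED":
        # Clear the fault counters first, then check the state again.
        return [["zpool", "clear", pool], ["zpool", "status", pool]]
    if state == "ONLINE":
        # Pool is healthy on paper, so verify it with a fresh scrub.
        return [["zpool", "scrub", pool]]
    return []  # unknown or missing state: do nothing automatically


# Usage with a trimmed, made-up status excerpt.
excerpt = "  pool: tank\n state: DEGRADED\n"
for cmd in next_commands(excerpt):
    print(" ".join(cmd))
```

The helper only builds the command lists and does not execute anything, so it is safe to try on any machine; actually running the commands is left to you.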

denaba · May 20, 2023

Looks like the crisis is over. I am assuming that all the issues were with the memory (the bad stick), and that the actual data was intact. I ran -v and that's what it says: no known data errors. It looks like this box is happy again. Now on to getting parts: I will open a thread under the hardware section to see about them. Thank you everyone for helping me out here on this.

artlessknave · May 21, 2023

one of the best examples I've ever seen of why ECC is highly recommended.
good to see you got it sorted out.
