Got Corruption on System

denaba · May 9, 2023

So I upgraded from 12.0-U8.1 to 13.0-U4 and the message center says I have corrupt files. I checked with zpool status and confirmed it. This is not on my main pool, but on the pool where the system itself is installed. The system is installed on an SSD that is mirrored to another SSD of the same make/model.

So I am wondering if jumping ahead 4-5 updates at once caused this.

Should I download all the version 13.0-Ux's, revert back to 12.0-U8.1 and then slowly install the new versions one at a time? Of course doing this without losing anything.

I also have a couple of corrupted data files in my data pool, but TrueNAS says to delete the files and restore them. I did that for one and it no longer shows up as a corrupt file, so I am going to work through the rest; seems easy enough. The boot pool is the one where I thought of going back to U8.1 to see if installing the updates one at a time going forward will fix the boot pool corruption.

Thank you

Apollo · May 9, 2023

In TrueNAS Core 13, you can check your "Boot Environments". Your 12.0-U8.1 should be listed, and you can activate it and reboot without having to reinstall anything.
I doubt it will help fix your issues unless it is a driver incompatibility of some kind.

denaba · May 10, 2023

Well, I did revert back to 12.0-U8.1 and now the boot-pool has no corruption. I went to the shell and ran zpool status -v and nothing; it reports no corruption. I think I will download each of the newer versions and install them manually, starting with "Release", see how that goes, and then go to -U1 and so on until I see a problem. FYI
Thank you

Apollo · May 10, 2023

Before updating or doing new installs, I would run a scrub first. Errors might pop up.
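
If it helps to check the scrub result from a script rather than by eye, here is a minimal Python sketch that pulls the `errors:` section out of `zpool status -v` output. The sample text, the file path, and the function name `parse_zpool_errors` are illustrative assumptions, not output from this system; on a real box the text would come from running `zpool status -v` in the shell.

```python
# Hypothetical sample text in the general shape of `zpool status -v` output.
# The pool name, timestamps and file path are made up for illustration.
SAMPLE_WITH_ERRORS = """\
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 05:12:44 with 1 errors on Sun May 14 10:00:00 2023
errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/example-show.mkv
"""

SAMPLE_CLEAN = """\
  pool: boot-pool
 state: ONLINE
errors: No known data errors
"""


def parse_zpool_errors(status_text):
    """Return (summary, files) taken from the `errors:` section.

    summary is the text after `errors:`; files is the list of paths
    listed underneath it (empty when no files are listed).
    """
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("errors:"):
            summary = line[len("errors:"):].strip()
            files = []
            for follow in lines[i + 1:]:
                stripped = follow.strip()
                if stripped.startswith("pool:"):
                    break  # the next pool's section starts here
                if stripped:
                    files.append(stripped)
            return summary, files
    return None, []


if __name__ == "__main__":
    for text in (SAMPLE_WITH_ERRORS, SAMPLE_CLEAN):
        summary, files = parse_zpool_errors(text)
        print(summary, files)
```

An empty file list alongside a non-zero error count in the `scan:` line is worth a second look, since the error may be in metadata rather than in a named file.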

artlessknave · May 10, 2023

an upgrade should never do this. it sounds like something is likely corrupting the write while it upgrades.

if you were trying to go from, like, 9.10 to 13, i could see upgrade problems, but 12-13? no.

I would run a memtest, since you do not appear to have ECC RAM. bad ram WILL corrupt data.

denaba · May 11, 2023

@Apollo - I ran the boot-pool scrub and one error did show up. I ran zpool status -v and that is all it shows me; no files with errors. The boot-pool is two M.2 SSDs, where one is the main and the other just mirrors it. Which brings me to a question about running TRIM since I saw you can set it to auto, but some here say that is not a good idea and that running it manually is better.

@artlessknave - True, this weekend I finally have no demands on me, so I am going to run the memtest. I will keep working on the tank pool; I just have had no time. Kiddo classes and shows, honey-dos, work. But this weekend I can do these things.

I'll report back on the boot-pool and see if anything fixes that. If not, are we talking re-installing? Or are we not there yet, with files still to replace on the tank pool and the memtest still to run for good measure?

artlessknave · May 11, 2023

denaba said:
running TRIM s

TRIM just tells the SSD which blocks are no longer in use. it won't do anything if your bits are being corrupted. TRIM just lets the SSD write faster, which will have no impact for a boot SSD afaik. not enough writes or write speed to matter

denaba said:
If not, are we talking re-installing?

no. again, the boot drive should NOT get corrupted. you need to find what's doing that, and it's likely a hardware problem.
my money, currently, is on your RAM. maybe you should check it's seated properly?

my desktop was getting random RAM-like problems for a while... finally opened it and found one side of a RAM stick never got pushed quite all the way in... problem solved.

this is one primary reason that ECC is highly recommended

Apollo · May 11, 2023

@denaba It may be just fine switching back to TrueNAS Core 13, using the "Boot Environment".
Not sure if RAM is really the root cause of the errors. It could be NVMe temperature related or other sources.

I don't know how boot pool redundancy is handled during boot time. Maybe there is something happening that could affect mirror state from the BIOS point of view.

artlessknave · May 11, 2023

Apollo said:
Not sure if RAM is really the root cause of the errors.

I'm not sure either; the RAM test is one way to rule it out. and that's a full RAM test, like memtest86, not just an OS one.
ECC would make it dramatically less likely to be RAM and not really need a RAM test. one of the costs of not using ECC is having to bring the system down for tests.

Apollo said:
I don't know how boot pool redundancy is handled during boot time. Maybe there is something happening that could affect mirror state from the BIOS point of view.

I have never had a corrupted boot pool. it should never happen. once would be one thing, but if you get more then something is really wrong and needs to be identified asap.

Apollo · May 11, 2023

artlessknave said:
I'm not sure either; the RAM test is one way to rule it out. and that's a full RAM test, like memtest86, not just an OS one.
ECC would make it dramatically less likely to be RAM and not really need a RAM test. one of the costs of not using ECC is having to bring the system down for tests.

I have never had a corrupted boot pool. it should never happen. once would be one thing, but if you get more then something is really wrong and needs to be identified asap.

Stress testing such as Memtest86 might give you a clean bill of health, but isn't really telling you the RAM and IO interface is 100% reliable. There are corner cases which could cause the RAM/CPU/memory controller to fail (e.g. overclocking/underclocking, temperature changes...).
All it does is give a level of confidence, which should still be taken with the proverbial grain of salt.

Tracking/tracing/locating the true root of the failure is the real challenge.
For now, I would say keeping an eye on the system's progress and state of health is as important as brute-force testing.

artlessknave · May 12, 2023

Apollo said:
but isn't really telling you the RAM and IO interface is 100% reliable.

nothing is ever 100% reliable. the point is to find out if it's 100% NOT reliable. because if it is, there is no point wasting time troubleshooting something ELSE; you can instead spend that time working on a fix or replacement.
ensuring that the RAM is seated correctly and testing it to ensure it's not giving errors memtest can find is a minimum level of confidence in that component.

additionally, checking the PSU is probably a good idea. a PSU that's dying or degrading, because it was under-provisioned, can cause weird things like this.

denaba · May 13, 2023

Well, I checked into this some more and found an issue. I decided to watch the box via the monitor (you know, the console menu: 10 reboot, 11 shutdown, etc.). Nothing happens while the box is just sitting there running. But when I go to the web GUI, go to the tank pool and select "Scrub pool", the scrub starts, and in less than 5 minutes the system reboots all on its own. This happens only when I run the scrub; if I leave the box alone it is just fine. It restarts just once, and when the box is back up the scrub continues until done with no other reboot.

Also weird: on the web GUI main dashboard, the pool display shows all three green checks (Online, Health and Used Space), but if I go to the tank pool it says it is unhealthy with a red mark. The alert says that some files have data corruption, which I see when I run zpool status -v. I am just deleting those, rebooting and running the scrub (and getting the reboot). Then I copy the file back from the other TrueNAS box.

So a good day to really find out what is wrong.

Maybe for the next build of replacement parts I will use more traditional server-grade parts. S.M.A.R.T. is on a schedule, which I make sure does not overlap the scrubs I am doing. A scrub is running now (had the reboot) and it is continuing, but when it is done I will shut the system down, look at the memory sticks and hard drive cables, and clean out any dust. Then I will do the memtest just to eliminate (or find) variables and check the PSU if it is not supplying proper power through all the connections.

Thank you everyone for helping out and I will get back here to tell the results I find.

artlessknave · May 13, 2023

denaba said:
PSU if it is not supplying proper

what PSU do you have? power rating? PSUs lose their max output as they age. if you were running near the capacity on first build, it might be running over the line now.

denaba · May 14, 2023

Well, after the scrub today the tank pool went from red to yellow, "degraded". Not sure what happened, but it went fast. I did check inside and re-seated the SATA cables (I removed them and plugged them back into their ports).

@artlessknave - power supply readings look fine except the 12V, where it is right on 12V. Checking all my systems around the house, they are just a little above their ratings, except this 12V where it is right on.
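
For reference, the ATX specification allows ±5% on the +12V, +5V and +3.3V rails, so a reading of exactly 12V is inside the allowed window. A minimal Python sketch of that check follows; the function name `rail_in_spec` and the example readings are assumptions for illustration, not measurements from this box.

```python
# ATX tolerance for the positive rails: +/-5% of nominal voltage.
ATX_TOLERANCE = {12.0: 0.05, 5.0: 0.05, 3.3: 0.05}


def rail_in_spec(nominal, measured):
    """Return True when a measured voltage is within ATX tolerance."""
    tolerance = ATX_TOLERANCE[nominal]
    return abs(measured - nominal) <= nominal * tolerance


if __name__ == "__main__":
    # Hypothetical multimeter readings, not taken from this thread.
    for nominal, measured in [(12.0, 12.0), (12.0, 11.3), (5.0, 5.1)]:
        print(nominal, measured, rail_in_spec(nominal, measured))
```

A multimeter reading on an idle system only shows the steady-state voltage. It will not catch a short dip when all the drives are busy at once, such as at the start of a scrub.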

Hard drives are HGST He8's and have been in there since 2019, just like in my other box, which is not showing anything wrong and has no errors. I bought and installed all the drives at the same time. Based on when I got them and the 5 year warranty, I can return them if it is actually a drive issue.

I went to STH (ServeTheHome) and started looking at parts for TrueNAS, and something came up about DDR5 memory: no ECC on DDR5 from either Intel or AMD? Looks like I need to look at DDR4, which still has ECC?

Just thinking. S.M.A.R.T. tests start tonight so maybe I can see what is causing the corruption (in other words, are the drives still good and is something like non-ECC memory causing the corruption?)

Looks like a project coming up on the horizon. Hell, right now both boxes are in separate rooms, may move them together. New build with server parts? Return drives for newer ones? What a shame of the drives dying after just two years. HGST drives have served me well over the 5 year warranty.

Anyways, work to do

artlessknave · May 14, 2023

denaba said:
@artlessknave - power supply readings look fine except the 12V, where it is right on 12V. Checking all my systems around the house, they are just a little above their ratings, except this 12V where it is right on.

where are you getting these readings from? what is the PSU? what's its wattage? brand?

denaba said:
I went to STH (ServeTheHome) and started looking at parts for TrueNAS, and something came up about DDR5 memory: no ECC on DDR5 from either Intel or AMD? Looks like I need to look at DDR4, which still has ECC?

server space does not jump on the newest anything as fast as possible. ECC for the newest RAM will take time, as companies test things, and ddr4 is still perfectly fine for servers. there is no need to upgrade for all but the most bleeding edge cases, and you do NOT want bleeding edge for data stability. if you want new parts for truenas, supermicro x11 or x12 is far more than adequate. h11/h12 (AMD) is less tested and less common here in the forums, but should technically work.

denaba said:
Just thinking. S.M.A.R.T. tests start tonight so maybe I can see what is causing the corruption (in other words, are the drives still good and is something like non-ECC memory causing the corruption?)

SMART generally tells you only basic physical characteristics. many people mistakenly think that zfs errors and SMART errors are directly connected. usually they are not. ZFS is far more sensitive to anything that can corrupt data than nearly any other software, as it was designed to prevent, correct, and repair data corruption

denaba said:
What a shame of the drives dying after just two years.

you've only given us the bare minimum of your build. have you considered air flow? hot drives last less time, and drives running on unstable power (one of the reasons for my questions about the PSU, which you kind of seem to dodge) last less time.

denaba · May 15, 2023

I measured the power supply at the cables using my volt meter to get the numbers. PSU is a Corsair 500W, did not dodge, just missed it. My other box has the same parts and no issues. Same box, fans, PSU, mobo, everything. FYI

These systems don't work half as hard as those of many people here, I am sure. I would say that 70% of the time it just runs; it is used mainly for storage and gets accessed for shows one to four times a week, so not a lot of activity. But like you mentioned, parts matter from the data side of things; I do not need data corruption. I may be on borrowed time, but moving forward, if this is hardware related then it makes sense to me to go down the path of server grade rather than comm grade as I have done in the past since FreeNAS 7.

SMART I get; it is just the hardware side of things and not to be confused with ZFS. Also, before I installed the drives I followed the burn-in test here in the forums for each drive.

For the box I did check all things; no dust inside. Free flow on CPU, I arranged the hard drives in the case so that air can flow over each. Vertical case, each drive is spread over the two 120mm fans. PSU intake is clear. Usually I clean the internals and filters about once every three months. I have a room air filter running 24/7 (I build models) so I do not have a lot of dust. But I always clean every quarter just because.

Checked the SMART results on all drives; each shows PASSED and the next section says no errors. Should I trust the TrueNAS SMART results? I mean, physically the drives appear to be fine based on the results, but from the data side, corruption. I did not get to run the memtest yet.

Thanks for helping


artlessknave · May 15, 2023

denaba said:
I did not get to run the memtest yet.

this is the one that matters the most. SMART pass is good but that's not where I would expect the corruption to be sourced anyway, based on the info you have provided.
if you pass a memtest then that's a different matter.

denaba · May 16, 2023

artlessknave - so I finally got the memtest going. It has been running for just 15 minutes and there are errors. It is only 1/4 of the way through testing and I am going to leave it running as I have to go to work, but at least it is showing the errors you said to look for. When I get back from work I will update, or if just knowing there are errors means there is no need for more details, let me know.

UPDATE - the test said it could not complete due to too many errors. Spot on artlessknave, you called it.

Thank you again for the help

jgreco · May 16, 2023

Sizing of PSU is best done through careful calculation, you could be experiencing system brownout. See

Proper Power Supply Sizing Guidance

I've seen about 1,000 threads like this one where people decide that they can power a dozen hard drives off a 360 watt supply. DO NOT DO THIS. I've seen another 1,000 threads where people decide to buy the cheapest power supply that they can find. DO NOT DO THIS. Your NAS lives or dies by...

www.truenas.com
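
The kind of calculation meant here can be sketched in a few lines of Python. Every number below (drive count, per-drive spin-up draw, CPU and board draw, the 80% load ceiling) is a placeholder assumption for illustration only, and the function names are hypothetical; the linked guide and the component datasheets are the place to get real figures.

```python
def estimate_peak_watts(drive_count, watts_per_drive_spinup,
                        cpu_watts, board_and_misc_watts):
    """Rough worst-case draw: all drives spinning up plus CPU and board."""
    return (drive_count * watts_per_drive_spinup
            + cpu_watts + board_and_misc_watts)


def psu_has_headroom(psu_watts, peak_watts, max_load_fraction=0.8):
    """True when the estimated peak stays under a chosen fraction of the rating."""
    return peak_watts <= psu_watts * max_load_fraction


if __name__ == "__main__":
    # Placeholder build: 6 drives at 30 W spin-up, 95 W CPU, 50 W for the rest.
    peak = estimate_peak_watts(6, 30, 95, 50)
    print(peak, psu_has_headroom(500, peak), psu_has_headroom(360, peak))
```

The load ceiling is there because, as mentioned earlier in the thread, a PSU loses some of its maximum output as it ages.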

artlessknave · May 16, 2023

denaba said:
the test said it could not complete due to too many errors. Spot on artlessknave, you called it.

welp. as much as I like to be right having memory errors sucks.
did you try reseating the sticks? as I'd said before, I once caused my own intermittent memory errors that I could never trace by having a mem stick just barely not quite snapped in. slight changes and it would crash windows but then it would pass memory tests.

jgreco said:
Sizing of PSU is best done through careful calculation, you could be experiencing system brownout. See

I believe they said it was at least a 500W when I asked. definitely one route that needs to be checked, but I don't think that's the case here.
