Got Corruption on System

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
So I upgraded from 12.0-U8.1 to 13.0-U4 and the message center says I have corrupt files. I checked with zpool status and confirmed it. To be clear, this is not my main pool but the boot pool, where the system itself is installed. The system is installed on an SSD that is mirrored to another SSD of the same make and model.

So I am wondering if jumping ahead 4-5 updates at once caused this.

Should I download all the 13.0-Ux versions, revert to 12.0-U8.1, and then install the new versions one at a time? Without losing anything, of course.

I also have a couple of corrupted data files in my data pool. TrueNAS says to delete the files and restore them; I did that for one and it no longer shows up as corrupt, so I am going to work through the rest. Seems easy enough. The boot pool is the one where I thought of going back to U8.1 to see if installing the updates again (going forward) will fix the boot pool corruption.

Thank you
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
In TrueNAS Core 13, you can check your "Boot Environments". Your 12.0-U8.1 should be listed, and you can activate it and reboot without having to reinstall anything.
I doubt it will fix your issues unless it is a driver incompatibility of some kind.
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
Well, I did revert to 12.0-U8.1 and now the boot-pool has no corruption. I went to the shell and ran zpool status -v and nothing; it says no errors. I think I will download each of the newer versions and install them manually, starting with "Release", see how that goes, and then go to -U1 and so on until I see a problem. FYI
Thank you

1683758645473.png
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Before updating or doing new installs, I would run a scrub first. Errors might pop up.
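For reference, a scrub followed by a status check comes down to `zpool scrub <pool>` and then `zpool status -v <pool>`. The snippet below is only a sketch of how the `errors:` line of that output could be checked from a script. The sample text is an illustrative stand-in rather than output from the system in this thread, and the helper names `pool_has_errors` and `zpool_status` are made up for the example.

```python
import subprocess


def pool_has_errors(status_text: str) -> bool:
    """Return True if the 'errors:' line of `zpool status` output reports problems."""
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("errors:"):
            return line != "errors: No known data errors"
    # No 'errors:' line found: treat as unknown and flag it for a manual look.
    return True


def zpool_status(pool: str) -> str:
    """Run `zpool status -v <pool>` and return its text (needs ZFS on the host)."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative samples, not taken from this thread.
SAMPLE_CLEAN = """\
  pool: boot-pool
 state: ONLINE
errors: No known data errors
"""

SAMPLE_BAD = """\
  pool: tank
 state: ONLINE
errors: Permanent errors have been detected in the following files:
"""

print(pool_has_errors(SAMPLE_CLEAN))
print(pool_has_errors(SAMPLE_BAD))
```

On a live system, `pool_has_errors(zpool_status("boot-pool"))` would do the same check against the real pool, assuming the pool is named as in this thread.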
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
an upgrade should never do this. it sounds like something is likely corrupting the write while it upgrades.

if you were trying to go from, like, 9.10 to 13, i could see upgrade problems, but 12 to 13? no.

I would run a memtest, since you do not appear to have ECC RAM. bad ram WILL corrupt data.
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
@Apollo - I ran the boot-pool scrub and one error did show up. I ran zpool status -v and this is all it shows me; no files with errors. The boot-pool is two M.2 SSDs, where one is the main and the other just mirrors it. Which brings me to a question about TRIM: I saw you can set it to auto, but some here say that is not a good idea and that running it manually is better.
1683853293417.png


@artlessknave - True, this weekend I finally have no demands on me so I am going to run the memtest. I will keep working on the tank pool; I just have had no time. Kiddo classes and shows, honey-dos, work. But this weekend I can do these things.

I'll report back on the boot-pool and whether anything fixes it. If not, are we talking re-installing? Or are we not there yet, with files still to replace on the tank pool and the memtest still to run for good measure?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
running TRIM
TRIM just tells the SSD which blocks are no longer in use so it can clear them ahead of time. it won't do anything if your bits are being corrupted. TRIM just lets the SSD write faster, and this will have no impact for a boot SSD afaik. not enough writes or write speed to matter

If not, are we talking re-installing?
no. again, the boot drive should NOT get corrupted. you need to find what's doing that, and it's likely a hardware problem.
my money, currently, is on your RAM. maybe you should check it's in right?

my desktop was getting random RAM-like problems for a while.....finally opened it and found one side of a RAM stick never got pushed quite all the way in....problem solved.

this is one primary reason that ECC is highly recommended
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@denaba It may be just fine switching back to TrueNAS Core 13, using the "Boot Environment".
Not sure if RAM is really the root cause of the errors. It could be NVMe temperature related or other sources.

I don't know how boot pool redundancy is handled during boot time. Maybe there is something happening that could affect mirror state from the BIOS point of view.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Not sure if RAM is really the root cause of the errors.
i'm not sure either; the RAM test is one way to rule it out. and that's a full RAM test, like memtest86, not just an OS one.
ECC would make it dramatically less likely to be RAM and not really need a RAM test. one of the things you lose by not using ECC: having to bring the system down for tests.
I don't know how boot pool redundancy is handled during boot time. Maybe there is something happening that could affect mirror state from the BIOS point of view.
I have never had a corrupted boot pool. it should never happen. once would be one thing, but if you get more, then something is really wrong and needs to be identified asap.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Stress testing such as Memtest86 might give you a clean bill of health, but isn't really telling you the RAM and IO interface is 100% reliable. There are corner cases which could cause the RAM/CPU/memory controller to fail (e.g. overclocking/underclocking, temperature changes...).
All it does is give a level of confidence, which should still be taken with the proverbial grain of salt.

Tracking/tracing/locating the true root of the failure is the real challenge.
For now, I would say keeping an eye on the system's progress and state of health is as important as brute-force testing.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
but isn't really telling you the RAM and IO interface is 100% reliable.
nothing is ever 100% reliable. the point is to find out if it's 100% NOT reliable. because if it is, there is no point wasting time troubleshooting something ELSE, you instead can spend that time working on a fix or replacement.
ensuring that the RAM is seated correctly and testing it to make sure it's not giving errors memtest can find is a minimum level of confidence in that component.

additionally, checking the PSU is probably a good idea. a PSU that's dying or degrading, or that was under-provisioned, can cause weird things like this.
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
Well, I looked into this some more and found an issue. I decided to watch the box on the monitor (you know, the console menu with 10 reboot, 11 shutdown, etc.). Nothing happens while the box is just sitting there running. But when I go to the web GUI, go to the tank pool and choose "Scrub pool", the scrub starts, and after less than 5 minutes the system reboots all on its own. This only happens when I run the scrub; if I leave the box alone it is just fine. It also only happens once: the system restarts, and when the box is back up the scrub continues until it is done with no further reboot.

Also weird: on the main dashboard of the web GUI, the pool widget shows all three green checks (Online, Health and Used Space), but if I go to the tank pool itself it says it is unhealthy, with a red mark. The alert says that some files have data corruption, which I also see when I run zpool status -v. I am just deleting those files, rebooting and running the scrub (and getting the reboot). Then I copy each file back from the other TrueNAS box.

So a good day to really find out what is wrong.

Maybe for the next build or replacement parts I will use more traditional server-grade parts. S.M.A.R.T. is on a schedule, which I make sure does not overlap the scrubs I am doing. The scrub is running now (had the reboot) and is continuing, but when it is done I will shut the system down, look at the memory sticks and hard drive cables, and clean out any dust. Then I will do the memtest just to eliminate (or find) variables and check the PSU if it is not supplying proper power through all the connections.

Thank you everyone for helping out and I will get back here to tell the results I find.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
PSU if it is not supplying proper
what PSU do you have? power rating? PSUs lose their max output as they age. if you were running near the capacity on first build, it might be running over the line now.
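To put rough numbers on that, the arithmetic can be sketched as below. Every figure here is a placeholder assumption (the per-component peak draws, the drive count, the 10% aging derate, the 20% headroom), not a measurement from this system or a value from any datasheet; real numbers should come from the component spec sheets.

```python
def psu_margin(rated_watts, peak_draws, aging_derate=0.10, headroom=0.20):
    """Return spare watts after derating the PSU and padding the peak load."""
    # Capacity assumed to be left after aging (derate is a guess, not a spec).
    usable = rated_watts * (1.0 - aging_derate)
    # Worst-case load: all peaks at once, padded by the chosen headroom.
    required = sum(peak_draws.values()) * (1.0 + headroom)
    return usable - required


# Hypothetical peak draws in watts, for illustration only.
example_build = {
    "cpu": 95,
    "motherboard_ram": 50,
    "hdd_spinup_x6": 6 * 30,
    "ssd_x2": 2 * 5,
    "fans": 10,
}

margin = psu_margin(500, example_build)
print(f"margin: {margin:.0f} W")
```

A negative margin would mean the padded peak load exceeds what the derated supply is assumed to deliver; a small positive one means the build was already close to the line.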
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
Well, after the scrub today the tank pool went from red to yellow, "degraded". Not sure what happened, but it went fast. I did check inside and moved the SATA cables (I removed them and re-seated them in their ports).

@artlessknave - power supply readings look fine except the 12V where it is right on the 12V. Checking all my systems around the house they are just a little above their ratings except this 12V where it is right on.

The hard drives are HGST He8s and have been in there since 2019, just like in my other box, which is not showing anything wrong and has no errors. I bought and installed all the drives at the same time. Based on when I got them and the 5 year warranty, I can return them if it is actually a drive issue.

I went to STH (ServeTheHome) and started looking at parts for TrueNAS, and something came up about DDR5 memory: no ECC on DDR5 from either Intel or AMD? Looks like I need to look at DDR4, which still has ECC?

Just thinking. S.M.A.R.T. tests start tonight so maybe I can see what is causing the corruptions (in other words, are the drives still good and is something like non-ECC memory causing the corruption?)

Looks like a project coming up on the horizon. Hell, right now both boxes are in separate rooms, may move them together. New build with server parts? Return drives for newer ones? What a shame of the drives dying after just two years. HGST drives have served me well over the 5 year warranty.

Anyways, work to do
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
@artlessknave - power supply readings look fine except the 12V where it is right on the 12V. Checking all my systems around the house they are just a little above their ratings except this 12V where it is right on.
where are you getting these readings from? what is the PSU? what's its wattage? brand?
I went to STH (ServeTheHome) and started looking at parts for TrueNAS, and something came up about DDR5 memory: no ECC on DDR5 from either Intel or AMD? Looks like I need to look at DDR4, which still has ECC?
server space does not jump on the newest anything as fast as possible. ECC for the newest RAM will take time, as companies test things, and ddr4 is still perfectly fine for servers. there is no need to upgrade for all but the most bleeding edge cases, and you DON'T WANT bleeding edge for data stability. if you want new parts for truenas, supermicro x11 or x12 is far more than adequate. h11/h12 (AMD) is less tested and less common here in the forums, but should technically work.
Just thinking. S.M.A.R.T. tests start tonight so maybe I can see what is causing the corruptions (in other words, are the drives still good and is something like non-ECC memory causing the corruption?)
SMART generally tells you only basic physical characteristics. many people mistakenly think that zfs errors and SMART errors are directly connected. usually they are not. ZFS is far more sensitive to anything that can corrupt data than nearly any other software, as it was designed to prevent, correct, and repair data corruption.
What a shame of the drives dying after just two years.
you've only given us the bare minimum of your build. have you considered air flow? hot drives last less time, and drives running on unstable power (one of the reasons for my questions about the PSU, which you kind of seem to dodge) last less time.
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
I measured the power supply at the cables using my voltmeter to get the numbers. The PSU is a Corsair 500W; I did not dodge, just missed it. My other box has the same parts and no issues. Same box, fans, PSU, mobo, everything. FYI
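As a sanity check on multimeter readings like these: the ATX specification is commonly cited as allowing ±5% on the +12 V rail, so a reading can be compared against that window as sketched below. The tolerance is passed in as a parameter, the readings are made-up examples rather than the values measured here, and a steady multimeter reading says nothing about brief dips under load.

```python
def rail_in_tolerance(measured_volts, nominal_volts=12.0, tolerance=0.05):
    """Return True if a measured rail voltage lies within +/- tolerance of nominal."""
    low = nominal_volts * (1.0 - tolerance)
    high = nominal_volts * (1.0 + tolerance)
    return low <= measured_volts <= high


# Made-up example readings in volts.
for reading in (12.00, 12.15, 11.30):
    print(reading, rail_in_tolerance(reading))
```

By this check a reading of exactly 12 V sits in the middle of the window, so on its own it would not point to a failing supply.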

These systems don't work half as hard as those of many people here, I am sure. I would say that 70% of the time it just idles; it is used mainly for storage and gets accessed for shows one to four times a week, so not a lot of activity. But like you mentioned, parts matter from the data side of things; I do not need data corruption. I may be on borrowed time, but moving forward, if this is hardware related then it makes sense to me to go down the path of server grade rather than comm grade, as I have done since FreeNAS 7.

SMART I get; it covers just the hardware side of things and is not to be confused with ZFS. Also, before I installed the drives I followed the burn-in test here in the forums for each drive.

For the box I did check all things; no dust inside. Free flow on CPU, I arranged the hard drives in the case so that air can flow over each. Vertical case, each drive is spread over the two 120mm fans. PSU intake is clear. Usually I clean the internals and filters about once every three months. I have a room air filter running 24/7 (I build models) so I do not have a lot of dust. But I always clean every quarter just because.

Checked the smart results on all drives; shows PASSED and in the next section it says no errors. Should I trust TrueNAS smart results? I mean, physically the drives appear to be fine based on the results, but from the data side, corruption. I did not get to run the memtest yet.

Thanks for helping

 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I did not get to run the memtest yet.
this is the one that matters the most. SMART pass is good but that's not where I would expect the corruption to be sourced anyway, based on the info you have provided.
if you pass a memtest then that's a different matter.
 

denaba

Explorer
Joined
Jan 12, 2014
Messages
59
@artlessknave - so I finally got the memtest going. It has been running for just 15 minutes and there are already errors. It is only 1/4 of the way through and I am going to leave it running as I have to go to work, but at least I am seeing the errors, which is what you said to look for. When I get back from work I will update, or if just knowing there are errors means there is no need for more details, let me know.

UPDATE - the test said it could not complete due to too many errors. Spot on artlessknave, you called it.

Thank you again for the help
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sizing of PSU is best done through careful calculation, you could be experiencing system brownout. See

 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
the test said it could not complete due to too many errors. Spot on artlessknave, you called it.
welp. as much as I like to be right, having memory errors sucks.
did you try reseating the sticks? as I'd said before, I once caused my own intermittent memory errors that I could never trace by having a mem stick just barely not quite snapped in. slight changes and it would crash windows but then it would pass memory tests.
Sizing of PSU is best done through careful calculation, you could be experiencing system brownout. See
I believe they said it was at least a 500W when I asked. definitely one route that needs to be checked, but I don't think that's the case here.
 