[SOLVED] Crash after a couple of minutes of copying: Fatal Trap 9: General Protection Fault in kernel mode

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
A physical inspection of the modules (especially the pins) and the slots might help. Dust can cause issues too.

Reseating the CPU is also an option.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
A physical inspection of the modules (especially the pins) and the slots might help. Dust can cause issues too.

Reseating the CPU is also an option.
- Took out 2 sticks of RAM, then rearranged and reseated the other 2
- Reseated the CPU

Testing a large copy now.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I did run MemTest last night. It ran for 6 hours and came back as passed.
You need to run it for at least 24 hours or maybe even 48 hours. 6 hours and one pass is not enough.

The other stress test you should run is a CPU Stress Test for at least 8 hours.
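If you don't have a dedicated stress tool handy, even a crude script that pins every core with busy work for a set number of hours will do as a rough soak. A minimal Python sketch, as an illustration only (mprime, stress-ng, and the like exercise the CPU and memory far harder):

# Crude CPU soak: peg every core with pointless math for HOURS hours.
# Illustration only -- a real stress tool hits caches, FPU and memory
# much harder than this busy loop does.
import multiprocessing
import time

HOURS = 8  # adjust to taste

def burn(stop_at):
    x = 0.0
    while time.time() < stop_at:
        x = (x * 1.000001 + 1.2345) % 1048573.0  # keep the core busy

if __name__ == "__main__":
    stop_at = time.time() + HOURS * 3600
    workers = [multiprocessing.Process(target=burn, args=(stop_at,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("CPU soak finished without the machine falling over.")

If the box survives that but still panics under a real workload, that in itself is a clue.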

Also, just in case you might be doing this: test the system with all your hard drives connected so that the power draw on the power supply is at its maximum. If you disconnect some of the hardware, the power supply isn't pulling the same current.

And let's say those things all pass. I would then grab a copy of Debian or Ubuntu, boot it, let it run for a day or two, and see if the system crashes.

Lastly, see if you can identify what operation you might be doing when the system crashes. Is it when you are using Plex? Copying data? Nothing at all? Little things can provide big clues.

Best of luck to you.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
You need to run it for at least 24 hours or maybe even 48 hours. 6 hours and one pass is not enough.

The other stress test you should run is a CPU Stress Test for at least 8 hours.

Also, just in case you might be doing this: test the system with all your hard drives connected so that the power draw on the power supply is at its maximum. If you disconnect some of the hardware, the power supply isn't pulling the same current.

And let's say those things all pass. I would then grab a copy of Debian or Ubuntu, boot it, let it run for a day or two, and see if the system crashes.

Lastly, see if you can identify what operation you might be doing when the system crashes. Is it when you are using Plex? Copying data? Nothing at all? Little things can provide big clues.

Best of luck to you.

Thanks for the recommendations; they sound very thorough, and I understand why, in enterprise scenarios, people may want to be this thorough. However, in my scenario as a home/small-business user, I'm OK with putting up with some downtime in the event of a crash while I wait for replacement hardware.

I'm a software developer but have always been curious to get more into hardware, and this build was a way to do that; I'm quite curious and eager to learn more about the hardware side. To that end (of learning), I have a few questions about your recommendations (and I've seen others that go even farther, like: "It may disappoint you to discover that proper burn-in and testing actually requires more than a thousand hours of time - a month and a half").

What would be the benefit of purposely forcing this type of proactive downtime as opposed to reactively addressing these types of issues as they arise? I don't think there is a risk of data loss in the event of these types of crashes, is there? You are recommending memory and CPU tests; if either failed, would that cause data loss?

You don't mention hard disk tests, which I would think would be part of the regimen as well if you were going to be super proactive and cautious, but maybe due to the redundancy in RAID this is not as important?

I'm mirroring so there is already some redundancy there, plus I have another on site and offsite backup of my data, so even in the unlikely event of potential data loss on this system, I have backups.

The operation triggering the crash was data copy over the network and also using the Disk Import function. Both would trigger a crash after about 3 minutes of running.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
UPDATE:

I performed a large over-the-network copy that lasted about 2 hours and all was stable. I then put the other 2 RAM sticks back in and initiated another very large copy last night; it is still running 12 hours later (much better stability than 3 minutes).

So there must have been some issue with how the RAM or CPU was seated, I guess. I'm not ruling out the potential for bad RAM, etc. that may yet rear its head, but this is definitely a large step in the right direction.

If this stability continues, I'll probably then test putting the Realtek 2GB NIC back in to see what happens.
 
Joined
Jun 15, 2022
Messages
674
Thanks for the recommendations; they sound very thorough, and I understand why, in enterprise scenarios, people may want to be this thorough. ... What would be the benefit of purposely forcing this type of proactive downtime as opposed to reactively addressing these types of issues as they arise? I don't think there is a risk of data loss in the event of these types of crashes, is there? You are recommending memory and CPU tests; if either failed, would that cause data loss?
Either tends to cause something worse: Data Corruption.

It's like ransomware: your data is corrupt, your backups are corrupt, your stuff has been corrupted and then corrupted differently over time (even though you didn't touch it), and you weren't aware of any of it. Unlike ransomware, there's no coming back from the dead; the zombies are all you're left with. Most of us would rather just not store the data; it's less trouble than trying to figure out what is bad, how bad, etc.

TrueNAS is production-level software. Because of how it works, it especially requires production-level hardware. If you want a budget server, we're fine with that: set up Ubuntu file sharing on ext4. If you want to dabble with ZFS and/or TrueNAS, that's fine too, though do it on a test system that will house no important data. The reason I say this is that many home users approach TrueNAS in a way somewhat analogous to building a 4-story building using skyscraper construction methods and equipment while sourcing materials from the scrapyard and workers from an insane asylum.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
Either tends to cause something worse: Data Corruption.

It's like ransomware: your data is corrupt, your backups are corrupt, your stuff has been corrupted and then corrupted differently over time (even though you didn't touch it), and you weren't aware of any of it. Unlike ransomware, there's no coming back from the dead; the zombies are all you're left with. Most of us would rather just not store the data; it's less trouble than trying to figure out what is bad, how bad, etc.

Ah, yes corruption would be very bad indeed.

I was thinking that ECC would prevent that... but reading more about corruption sources, the RAM is just one potential source, right?

Could it corrupt all existing data, or just data during the copy process specifically?

Is the process of just copying a large amount of data a form of stress test? Or is the issue more that this process may not identify an issue unless it was severe enough to cause a crash, whereas using MemTest and a CPU test are meant to capture and report the less severe issues that would otherwise go unnoticed?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Others with proper experience in the field will correct me if I'm wrong, but from my understanding, in enterprise environments burn-in tests are performed to make sure the system you ship is stable and meets the required performance.
You don't want a machine that needs to be online 24/7 to suddenly go offline because something is wrong.
HDD burn-in is done to spot infant mortality (the bathtub curve) and unexpected bottlenecks.
If you want to test your HDDs, look for the solnet array test in the resource section (there is a direct link in my signature).

A data transfer isn't a valid HDD burn-in because it's usually too short and is bottlenecked by a lot of other factors.
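If you just want to see what a read soak looks like before running the real thing, a crude sketch like this reads one disk end to end and reports throughput. The device name below is only an example, and the solnet array test does this properly (all drives in parallel, with comparisons between them):

# Crude sequential read soak of a single disk -- illustration only.
# Run as root; the device name is an example, point it at your own disk.
import time

DEVICE = "/dev/da0"        # example device name, adjust for your system
BLOCK = 1024 * 1024        # read 1 MiB at a time

total = 0
start = time.time()
with open(DEVICE, "rb", buffering=0) as disk:
    while True:
        chunk = disk.read(BLOCK)
        if not chunk:
            break
        total += len(chunk)
        if total % (10 * 1024**3) < BLOCK:   # progress roughly every 10 GiB
            mb_s = total / (time.time() - start) / 1024**2
            print(f"{total / 1024**3:.0f} GiB read, {mb_s:.0f} MB/s average")
print(f"Done: {total / 1024**3:.1f} GiB in {time.time() - start:.0f} s")

A drive whose average throughput sags badly over a full pass, or that throws errors partway through, is exactly what a burn-in is meant to catch.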
 
Joined
Jun 15, 2022
Messages
674
I was thinking that ECC would prevent that... but reading more about corruption sources, the RAM is just one potential source, right?

Could it corrupt all existing data, or just data during the copy process specifically?

Is the process of just copying a large amount of data a form of stress test? Or is the issue more that this process may not identify an issue unless it was severe enough to cause a crash, whereas using MemTest and a CPU test are meant to capture and report the less severe issues that would otherwise go unnoticed?
Mind you, I've only learned much of this recently due to the awesome contributions of TrueNAS members, so while I'll do my best to be accurate, the last few months have been a whirlwind eye-opener into how many things can and do go wrong inside computer systems and I may misstate some things.

I think a lot of it has to do with pushing hardware to the bleeding edge of what it can do. Fast is important, and 1 error in 10^14 attempts may not seem like much, but computers are pushing way more data than they used to, so 10^14 happens "a lot more often." I remember back in the '90s there were conversations about RAM error rates being "non-trivial," and apparently the technology has gotten better, but nowhere near flawless.
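To put rough numbers on that, treating it as one error per 10^14 bits (pure arithmetic, nothing measured here):

# Back-of-the-envelope: how much data is 10^14 bits, and how long does it
# take to push that much over a fast link? (Assumes a 10 GbE link at line
# rate purely for illustration.)
bits = 10**14
print(bits / 8 / 1e12, "TB")                    # 12.5 TB
link_bits_per_s = 10 * 1e9                      # assumed 10 Gbit/s
print(bits / link_bits_per_s / 3600, "hours")   # ~2.8 hours

So 10^14 bits is only about 12.5 TB: roughly one full pass over a single large drive, or a few hours of saturating a fast network link.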

From what I understand, RAM has far more errors than one might imagine. (The computer stick, not the truck made by Dodge.) @Ericloewe posted a link to a great lecture (<--read his post) entitled "Zebras All the Way Down - Bryan Cantrill, Uptime 2017" --if you can watch it from the beginning, it's amazing. In short, Eric is talking about a setting in BIOS that basically causes the system to not report memory errors because...there are so many memory errors they felt it best to hide them. The lecture (really, it's awesome, stick with it) also explained how ECC RAM goes from "working fine" to OMG!!!! "suddenly," and much, much more. (at minimum read Eric's post)

---
Advanced Error Correction can actually correct multi-bit errors on certain systems. I understand it's only on Intel Xeon processors, but I could be wrong.

Here are a couple of interesting details on combating memory issues:

There can be controller errors, cable issues, connection corrosion, mainboard crosstalk, less-than-stellar firmware (that awesome video detailed this too), lots of causes.

Could it corrupt all existing data? Rarely, if the OS corrupts the File Allocation Tables.

Is the process of just copying a large amount of data a form of stress test? Y-y-y-y-y-yes-s-s-s-s, but badblocks will tell you more with less due to optimized patterns. smartctl will also point you in the correct direction.
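If you want something watching for the obvious failures between proper tests, even a tiny script that polls SMART health will do. A rough sketch (the device names are examples; reading the full smartctl -a output yourself tells you far more):

# Quick-and-dirty SMART health poll -- a sketch, not a monitoring system.
# It only runs "smartctl -H" per drive and looks for a passing verdict;
# device names are examples, adjust for your system, and run as root.
import subprocess

DRIVES = ["/dev/ada0", "/dev/ada1"]   # example device names

for dev in DRIVES:
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    healthy = "PASSED" in result.stdout or ": OK" in result.stdout
    print(f"{dev}: {'looks healthy' if healthy else 'CHECK THIS DRIVE'}")

badblocks and the long SMART self-test remain the real workout; this only catches drives that already admit they're failing.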
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Advanced Error Correction can actually correct multi-bit errors on certain systems. I understand it's only on Intel Xeon processors, but I could be wrong.
That's one of those things that's a bit of a niche feature in high-end servers and comes with relevant catches. Effectively, it mirrors memory controllers, halving memory performance and capacity in order to eke out a little more reliability.
It's an extreme measure that's really only useful to meet insane reliability criteria.
 
Joined
Jun 15, 2022
Messages
674
That's one of those things that's a bit of a niche feature in high-end servers and comes with relevant catches. Effectively, it mirrors memory controllers, halving memory performance and capacity in order to eke out a little more reliability.
It's an extreme measure that's really only useful to meet insane reliability criteria.
That's what I had previously understood as well; however:

per DELL (linked previously):
Intel has redesigned and optimized their Advanced Error Correcting Code in 3rd Gen Xeon Scalable Processors to handle the most common failure patterns known among the major DRAM suppliers. In doing this, many of the multi-bit error patterns that were uncorrectable by previous generations of Intel Xeon Scalable Processors are now correctable by 3rd Gen Xeon SPs. This uplift will result in a significant decrease in uncorrectable memory errors. This enhancement is available on all 3rd Gen Xeon Scalable Processors. There are no memory or system configuration requirements necessary to take advantage of the improved Advanced ECC (or SDDC).
(emphasis added)
per HP (HP ProLiant ML350 G6 Server User Guide, p46):
Advanced ECC—provides the greatest memory capacity for a given DIMM size, while providing up to 4-bit error correction. This mode is the default option for this server.

Standard ECC can correct single-bit memory errors and detect multi-bit memory errors. When multi-bit errors are detected using Standard ECC, the error is signaled to the server and causes the server to halt.

Advanced ECC protects the server against some multi-bit memory errors. Advanced ECC can correct both single-bit memory errors and 4-bit memory errors if all failed bits are on the same DRAM device on the DIMM. Advanced ECC provides additional protection over Standard ECC because it is possible to correct certain memory errors that would otherwise be uncorrected and result in a server failure. The server provides notification that correctable error events have exceeded a pre-defined threshold rate. p47

Note: Mirrored, Lockstep, and Online Spare memory configurations are on p48.

p49, under Advanced ECC, states that DIMMs may be installed individually, so that would indicate it is not using a RAID-style configuration for RAM.
(emphasis added)

If I were to guess, I'd say Intel's Advanced ECC is probably using a block hash based on the work of Howard Fukada, Michael Nahas, Paul Nettle, Ryan Gallagher, Peter Clements, Paul Houle, and Yutaka Sawada. It would only be a W-A-G because -as far as I'm aware- Intel highly guards the implementation details. AMD is just getting around to implementing single-bit ECC*, while Intel has 4-bit Advanced ECC**; that says a lot right there.
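As a toy illustration of the general idea only (certainly not Intel's guarded implementation): if each word is spread across several DRAM devices and one extra device holds XOR parity, then everything a single failed device held can be rebuilt from the rest, which is why "all failed bits on the same DRAM device" is the correctable case:

# Toy sketch of the "survive one whole failed device" idea behind
# chipkill-style Advanced ECC. NOT Intel's algorithm: real controllers use
# symbol-based codes and must locate the bad device themselves; here we
# cheat and assume we already know which device died.
DATA_DEVICES = 4  # one extra device stores the XOR parity

def write_word(nibbles):
    """Store 4 data nibbles plus an XOR-parity nibble (one per 'device')."""
    parity = 0
    for n in nibbles:
        parity ^= n
    return nibbles + [parity]

def read_word(stored, dead_device):
    """Rebuild whatever a known-dead device held from the other devices."""
    rebuilt = 0
    for i, n in enumerate(stored):
        if i != dead_device:
            rebuilt ^= n
    data = list(stored[:DATA_DEVICES])
    if dead_device < DATA_DEVICES:
        data[dead_device] = rebuilt
    return data

word = write_word([0xA, 0x3, 0xF, 0x6])
word[2] = 0x0                                 # device 2 returns garbage
print(read_word(word, dead_device=2))         # [10, 3, 15, 6] -- recovered

A multi-bit error scattered across different devices breaks that assumption, which is the kind of pattern that generally stays uncorrectable.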

----
*I think AMD has single-bit ECC:

**for the most common failure patterns known among the major DRAM suppliers. -DELL
 