[SOLVED] Crash after a couple of minutes of copying: Fatal Trap 9: General Protection Fault in kernel mode

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
A physical inspection of the modules (especially the pins) and the slots might help. Dust can cause issues too.

Reseating the CPU is also an option.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
A physical inspection of the modules (especially the pins) and the slots might help. Dust can cause issues too.

Reseating the CPU is also an option.
- Took out 2 sticks of RAM, then rearranged and reseated the other 2
- Reseated the CPU

Testing a large copy now.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I did run MemTest last night. It ran for 6 hours and came back as passed.
You need to run it for at least 24 hours or maybe even 48 hours. 6 hours and one pass is not enough.

The other stress test you should run is a CPU Stress Test for at least 8 hours.
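If you don't have a dedicated stress tool handy, even a crude script that pins every core with busy work for a set number of hours will do as a rough soak. A minimal Python sketch, as an illustration only (mprime, stress-ng, and the like exercise the CPU and memory far harder):

# Crude CPU soak: peg every core with pointless math for HOURS hours.
# Illustration only -- a real stress tool hits caches, FPU and memory
# much harder than this busy loop does.
import multiprocessing
import time

HOURS = 8  # adjust to taste

def burn(stop_at):
    x = 0.0
    while time.time() < stop_at:
        x = (x * 1.000001 + 1.2345) % 1048573.0  # keep the core busy

if __name__ == "__main__":
    stop_at = time.time() + HOURS * 3600
    workers = [multiprocessing.Process(target=burn, args=(stop_at,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("CPU soak finished without the machine falling over.")

If the box survives that but still panics under a real workload, that in itself is a clue.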

Also, just in case you might be doing this: test the system with all your hard drives connected so that the power draw on the power supply is at its maximum. If you disconnect some of the hardware, the power supply isn't pulling the same current.

And let's say those things all pass. I would then grab a copy of Debian or Ubuntu, boot it, let it run for a day or two, and see if the system crashes.

Lastly, see if you can identify what operation you might be doing when the system crashes. Is it when you are using Plex? Copying data? Nothing at all? Little things can provide big clues.

Best of luck to you.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
You need to run it for at least 24 hours or maybe even 48 hours. 6 hours and one pass is not enough.

The other stress test you should run is a CPU Stress Test for at least 8 hours.

Also, just in case you might be doing this: test the system with all your hard drives connected so that the power draw on the power supply is at its maximum. If you disconnect some of the hardware, the power supply isn't pulling the same current.

And let's say those things all pass. I would then grab a copy of Debian or Ubuntu, boot it, let it run for a day or two, and see if the system crashes.

Lastly, see if you can identify what operation you might be doing when the system crashes. Is it when you are using Plex? Copying data? Nothing at all? Little things can provide big clues.

Best of luck to you.

Thanks for the recommendations; they sound very thorough, and I understand why, in enterprise scenarios, people may want to be this thorough. However, in my scenario as a home/small-business user, I'm OK with putting up with some downtime in the event of a crash while I wait for replacement hardware.

I'm a software developer but have always been curious to get more into hardware, and this build was a way to do that; I'm quite curious and eager to learn more about the hardware side. To that end (of learning), I have a few questions about your recommendations (and I've seen others that go even farther, like: "It may disappoint you to discover that proper burn-in and testing actually requires more than a thousand hours of time - a month and a half").

What would be the benefit of purposely forcing this type of proactive downtime as opposed to reactively addressing these types of issues as they arise? I don't think there is a risk of data loss in the event of these types of crashes, is there? You are recommending memory and CPU tests; if either failed, would that cause data loss?

You don't mention hard disk tests, which I would think would be part of the regimen as well if you were going to be super proactive and cautious, but maybe due to the redundancy in RAID this is not as important?

I'm mirroring so there is already some redundancy there, plus I have another on site and offsite backup of my data, so even in the unlikely event of potential data loss on this system, I have backups.

The operation triggering the crash was data copy over the network and also using the Disk Import function. Both would trigger a crash after about 3 minutes of running.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
UPDATE:

I performed a large over-the-network copy that lasted about 2 hours and all was stable. I then put the other 2 RAM sticks back in and initiated another very large copy last night; it is still running 12 hours later (much better stability than 3 minutes).

So there must have been some issue with how the RAM or CPU was seated, I guess. I'm not ruling out the potential for bad RAM, etc. that may yet rear its head, but this is definitely a large step in the right direction.

If this stability continues, I'll probably then test putting the Realtek 2GB NIC back in to see what happens.
 
Joined
Jun 15, 2022
Messages
674
Thanks for the recommendations; they sound very thorough, and I understand why, in enterprise scenarios, people may want to be this thorough. ... What would be the benefit of purposely forcing this type of proactive downtime as opposed to reactively addressing these types of issues as they arise? I don't think there is a risk of data loss in the event of these types of crashes, is there? You are recommending memory and CPU tests; if either failed, would that cause data loss?
Either tends to cause something worse: Data Corruption.

It's like ransomware: your data is corrupt, your backups are corrupt, your stuff has been corrupted and then corrupted differently over time (even though you didn't touch it), and you weren't aware of any of it. Unlike ransomware, there's no coming back from the dead; the zombies are all you're left with. Most of us would rather just not store the data; it's less trouble than trying to figure out what is bad, how bad, etc.

TrueNAS is production-level software. Because of how it works, it especially requires production-level hardware. If you want a budget server, we're fine with that: set up Ubuntu file sharing on ext4. If you want to dabble with ZFS and/or TrueNAS, that's fine too, though do it on a test system that will house no important data. The reason I say this is that many home users approach TrueNAS in a way somewhat analogous to building a 4-story building using skyscraper construction methods and equipment while sourcing materials from the scrapyard and workers from an insane asylum.
 

chadilac

Dabbler
Joined
Dec 27, 2022
Messages
24
Either tends to cause something worse: Data Corruption.

It's like ransomware: your data is corrupt, your backups are corrupt, your stuff has been corrupted and then corrupted differently over time (even though you didn't touch it), and you weren't aware of any of it. Unlike ransomware, there's no coming back from the dead; the zombies are all you're left with. Most of us would rather just not store the data; it's less trouble than trying to figure out what is bad, how bad, etc.

Ah, yes corruption would be very bad indeed.

I was thinking that ECC would prevent that... but reading more about corruption sources, the RAM is just one potential source, right?

Could it corrupt all existing data, or just data during the copy process specifically?

Is the process of just copying a large amount of data a form of stress test? Or is the issue more that this process may not identify an issue unless it was severe enough to cause a crash, whereas using MemTest and a CPU test are meant to capture and report the less severe issues that would otherwise go unnoticed?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Others with proper experience in the field will correct me if I'm wrong, but from my understanding, in enterprise environments burn-in tests are performed to make sure the system you ship is stable and meets the required performance.
You don't want a machine that needs to be online 24/7 to suddenly go offline because something is wrong.
HDD burn-in is done to spot infant mortality (the bathtub curve) and unexpected bottlenecks.
If you want to test your HDDs, look for the solnet array test in the resource section (there is a direct link in my signature).

A data transfer isn't a valid HDD burn-in because it's usually too short and is bottlenecked by a lot of other factors.
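If you just want to see what a read soak looks like before running the real thing, a crude sketch like this reads one disk end to end and reports throughput. The device name below is only an example, and the solnet array test does this properly (all drives in parallel, with comparisons between them):

# Crude sequential read soak of a single disk -- illustration only.
# Run as root; the device name is an example, point it at your own disk.
import time

DEVICE = "/dev/da0"        # example device name, adjust for your system
BLOCK = 1024 * 1024        # read 1 MiB at a time

total = 0
start = time.time()
with open(DEVICE, "rb", buffering=0) as disk:
    while True:
        chunk = disk.read(BLOCK)
        if not chunk:
            break
        total += len(chunk)
        if total % (10 * 1024**3) < BLOCK:   # progress roughly every 10 GiB
            mb_s = total / (time.time() - start) / 1024**2
            print(f"{total / 1024**3:.0f} GiB read, {mb_s:.0f} MB/s average")
print(f"Done: {total / 1024**3:.1f} GiB in {time.time() - start:.0f} s")

A drive whose average throughput sags badly over a full pass, or that throws errors partway through, is exactly what a burn-in is meant to catch.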
 
Joined
Jun 15, 2022
Messages
674
I was thinking that ECC would prevent that... but reading more about corruption sources, the RAM is just one potential source, right?

Could it corrupt all existing data, or just data during the copy process specifically?

Is the process of just copying a large amount of data a form of stress test? Or is the issue more that this process may not identify an issue unless it was severe enough to cause a crash, whereas using MemTest and a CPU test are meant to capture and report the less severe issues that would otherwise go unnoticed?
Mind you, I've only learned much of this recently due to the awesome contributions of TrueNAS members, so while I'll do my best to be accurate, the last few months have been a whirlwind eye-opener into how many things can and do go wrong inside computer systems and I may misstate some things.

I think a lot of it has to do with pushing hardware to the bleeding edge of what it can do. Fast is important, and 1 error in 10^14 attempts may not seem like much, but computers are pushing way more data than they used to, so 10^14 happens "a lot more often." I remember back in the '90s there were conversations about RAM error rates being "non-trivial," and apparently the technology has gotten better, but nowhere near flawless.
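To put rough numbers on that, treating it as one error per 10^14 bits (pure arithmetic, nothing measured here):

# Back-of-the-envelope: how much data is 10^14 bits, and how long does it
# take to push that much over a fast link? (Assumes a 10 GbE link at line
# rate purely for illustration.)
bits = 10**14
print(bits / 8 / 1e12, "TB")                    # 12.5 TB
link_bits_per_s = 10 * 1e9                      # assumed 10 Gbit/s
print(bits / link_bits_per_s / 3600, "hours")   # ~2.8 hours

So 10^14 bits is only about 12.5 TB: roughly one full pass over a single large drive, or a few hours of saturating a fast network link.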

From what I understand, RAM has far more errors than one might imagine. (The computer stick, not the truck made by Dodge.) @Ericloewe posted a link to a great lecture (<--read his post) entitled "Zebras All the Way Down - Bryan Cantrill, Uptime 2017" --if you can watch it from the beginning, it's amazing. In short, Eric is talking about a setting in BIOS that basically causes the system to not report memory errors because...there are so many memory errors they felt it best to hide them. The lecture (really, it's awesome, stick with it) also explained how ECC RAM goes from "working fine" to OMG!!!! "suddenly," and much, much more. (at minimum read Eric's post)

---
Advanced Error Correction can actually correct multi-bit errors on certain systems. I understand it's only on Intel Xeon processors, but I could be wrong.

Here are a couple of interesting details on combating memory issues:

There can be controller errors, cable issues, connection corrosion, mainboard crosstalk, less-than-stellar firmware (that awesome video detailed this too), lots of causes.

Could it corrupt all existing data? Rarely, if the OS corrupts the File Allocation Tables.

Is the process of just copying a large amount of data a form of stress test? Y-y-y-y-y-yes-s-s-s-s, but badblocks will tell you more with less due to optimized patterns. smartctl will also point you in the correct direction.
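If you want something watching for the obvious failures between proper tests, even a tiny script that polls SMART health will do. A rough sketch (the device names are examples; reading the full smartctl -a output yourself tells you far more):

# Quick-and-dirty SMART health poll -- a sketch, not a monitoring system.
# It only runs "smartctl -H" per drive and looks for a passing verdict;
# device names are examples, adjust for your system, and run as root.
import subprocess

DRIVES = ["/dev/ada0", "/dev/ada1"]   # example device names

for dev in DRIVES:
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    healthy = "PASSED" in result.stdout or ": OK" in result.stdout
    print(f"{dev}: {'looks healthy' if healthy else 'CHECK THIS DRIVE'}")

badblocks and the long SMART self-test remain the real workout; this only catches drives that already admit they're failing.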
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Advanced Error Correction can actually correct multi-bit errors on certain systems. I understand it's only on Intel Xeon processors, but I could be wrong.
That's one of those things that's a bit of a niche feature in high-end servers and comes with relevant catches. Effectively, it mirrors memory controllers, halving memory performance and capacity in order to eke out a little more reliability.
It's an extreme measure that's really only useful to meet insane reliability criteria.
 
Joined
Jun 15, 2022
Messages
674
That's one of those things that's a bit of a niche feature in high-end servers and comes with relevant catches. Effectively, it mirrors memory controllers, halving memory performance and capacity in order to eke out a little more reliability.
It's an extreme measure that's really only useful to meet insane reliability criteria.
That's what I had previously understood as well; however:

per DELL (linked previously):
Intel has redesigned and optimized their Advanced Error Correcting Code in 3rd Gen Xeon Scalable Processors to handle the most common failure patterns known among the major DRAM suppliers. In doing this, many of the multi-bit error patterns that were uncorrectable by previous generations of Intel Xeon Scalable Processors are now correctable by 3rd Gen Xeon SPs. This uplift will result in a significant decrease in uncorrectable memory errors. This enhancement is available on all 3rd Gen Xeon Scalable Processors. There are no memory or system configuration requirements necessary to take advantage of the improved Advanced ECC (or SDDC).
(emphasis added)
per HP (HP ProLiant ML350 G6 Server User Guide, p46):
Advanced ECC—provides the greatest memory capacity for a given DIMM size, while providing up to 4-bit error correction. This mode is the default option for this server.

Standard ECC can correct single-bit memory errors and detect multi-bit memory errors. When multi-bit errors are detected using Standard ECC, the error is signaled to the server and causes the server to halt.

Advanced ECC protects the server against some multi-bit memory errors. Advanced ECC can correct both single-bit memory errors and 4-bit memory errors if all failed bits are on the same DRAM device on the DIMM. Advanced ECC provides additional protection over Standard ECC because it is possible to correct certain memory errors that would otherwise be uncorrected and result in a server failure. The server provides notification that correctable error events have exceeded a pre-defined threshold rate. p47

Note: Mirrored, Lockstep, and Online Spare memory configurations are on p48.

p49, under Advanced ECC, states that DIMMs may be installed individually, so that would indicate it is not using a RAID-style configuration for RAM.
(emphasis added)

If I were to guess, I'd say Intel's Advanced ECC is probably using a block hash based on the work of Howard Fukada, Michael Nahas, Paul Nettle, Ryan Gallagher, Peter Clements, Paul Houle, and Yutaka Sawada. It would only be a W-A-G because -as far as I'm aware- Intel highly guards the implementation details. AMD is just getting around to implementing single-bit ECC*, while Intel has 4-bit Advanced ECC**; that says a lot right there.
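As a toy illustration of the general idea only (certainly not Intel's guarded implementation): if each word is spread across several DRAM devices and one extra device holds XOR parity, then everything a single failed device held can be rebuilt from the rest, which is why "all failed bits on the same DRAM device" is the correctable case:

# Toy sketch of the "survive one whole failed device" idea behind
# chipkill-style Advanced ECC. NOT Intel's algorithm: real controllers use
# symbol-based codes and must locate the bad device themselves; here we
# cheat and assume we already know which device died.
DATA_DEVICES = 4  # one extra device stores the XOR parity

def write_word(nibbles):
    """Store 4 data nibbles plus an XOR-parity nibble (one per 'device')."""
    parity = 0
    for n in nibbles:
        parity ^= n
    return nibbles + [parity]

def read_word(stored, dead_device):
    """Rebuild whatever a known-dead device held from the other devices."""
    rebuilt = 0
    for i, n in enumerate(stored):
        if i != dead_device:
            rebuilt ^= n
    data = list(stored[:DATA_DEVICES])
    if dead_device < DATA_DEVICES:
        data[dead_device] = rebuilt
    return data

word = write_word([0xA, 0x3, 0xF, 0x6])
word[2] = 0x0                                 # device 2 returns garbage
print(read_word(word, dead_device=2))         # [10, 3, 15, 6] -- recovered

A multi-bit error scattered across different devices breaks that assumption, which is the kind of pattern that generally stays uncorrectable.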

----
*I think AMD has single-bit ECC:

**for the most common failure patterns known among the major DRAM suppliers. -DELL
 