Server freezes/crashes after roughly a month of constant use (Core 13.0)

nanodisk

Dabbler
Joined
Dec 8, 2021
Messages
15
Hi all, we have a very simple file server with no jails, VMs, etc. It was built and put into production in early November of last year (2022), but we have had multiple occasions where the server almost freezes/crashes. It doesn't respond to SSH, can't be pinged from other machines on the same network, the web UI can't be accessed, and we don't even get monitor output when connecting a display to it. However, the machine appears to be fully active, with all the drives and fans spinning.

The first time this happened we didn't have SSH set up, so after restarting the server we set it up in case it happened again, to see if we could access it that way. The second time was just before NYE, while we were all on holiday. The only reason we know it happened is that our backup software told us it could not access the network drive from Windows. We then tried to SSH into the server but had no luck at all.

The only way we can get the server running again is to restart it, which is not ideal at all. I've tried doing some research but only found a few people saying to check the BIOS power management settings. From what I can see in the BIOS, there are no power management settings that could cause this.

I would appreciate any suggestions on this issue. Specs of the server:

  • Intel Core i3-12100
  • ASRock Z690 PG Riptide LGA1700
  • Kingston FURY Beast 16 GB (2x8 GB) 3200 MHz
  • Teamgroup MS30 256 GB M.2 SSD
  • SeaSonic Focus 650 W 80+ Platinum
  • TP-Link Gigabit PCI Ethernet card
  • 6x 6 TB HDDs set up in RAIDZ1
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Take a look at the logs; my bet is on the RAM. Start running memtest.
6x6 TB in RAIDZ1 is a disaster waiting to happen, I hope you are aware.
 

nanodisk

Take a look at the logs; my bet is on the RAM. Start running memtest.
6x6 TB in RAIDZ1 is a disaster waiting to happen, I hope you are aware.
No logs are present around the time of crashing. Only logs we have are from when the server was restarted :(

What would make you say that the RAM is the cause of it? I can double check if the RAM is running at XMP speed or not

I didn't think RAIDZ1 was a bad choice, this server is only used for backups and we have another server which we work off of. Please could you shed some light onto what you'd suggest to use instead.
 

Davvo

No logs are present around the time of crashing. Only logs we have are from when the server was restarted :(
Not even in the /var/log/messages file? Run view /var/log/messages to read it, and type :q to exit.
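If the file is long, a short script can pull out whatever was logged just before each boot. This is only a rough sketch, not a TrueNAS tool: the function names are made up for illustration, and it assumes the kernel writes the marker ---<<BOOT>>--- at the start of every boot, so check your own log and change the string if yours differs.

```python
from collections import deque
from typing import Iterable, List

# Assumption: the kernel logs this marker at the start of every boot.
# Verify against your own /var/log/messages and adjust if needed.
BOOT_MARKER = "---<<BOOT>>---"


def lines_before_boots(lines: Iterable[str], context: int = 5,
                       marker: str = BOOT_MARKER) -> List[List[str]]:
    """For each line containing `marker`, return the `context` lines before it."""
    recent = deque(maxlen=context)   # rolling window of the latest lines
    tails: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            tails.append(list(recent))  # snapshot of what preceded this boot
        recent.append(line)
    return tails


def print_tails(path: str, context: int = 5) -> None:
    """Print the last entries logged before each boot found in the file at `path`."""
    with open(path, errors="replace") as fh:
        for number, tail in enumerate(lines_before_boots(fh, context), start=1):
            print(f"--- last entries before boot #{number} ---")
            print("\n".join(tail))
```

Calling print_tails("/var/log/messages") from a root shell prints one group per boot. If the last entries before a boot are routine and hours older than the boot itself, that would suggest the machine hung without getting a chance to log anything.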
What would make you say that the RAM is the cause of it? I can double check if the RAM is running at XMP speed or not
Turning off XMP would be advisable, but it's unlikely to be the issue here. I'm more concerned about memory starvation or defective sticks, since that's the usual cause of similar behaviours.
I didn't think RAIDZ1 was a bad choice, this server is only used for backups and we have another server which we work off of. Please could you shed some light onto what you'd suggest to use instead.
If you understand the risks of using RAIDZ1 and are fine with them, by all means continue to use it; however, since RAID5/RAIDZ1 is dead, RAIDZ2 would be advisable. Please read the following resource:
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Is that a Realtek-chipset LAN card?
They have been known to be unstable.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
we have had multiple occasions where the server almost freezes/crashes. It doesn't respond to SSH, can't be pinged from other machines on the same network, the web UI can't be accessed, and we don't even get monitor output when connecting a display to it. However, the machine appears to be fully active, with all the drives and fans spinning.
I'd have to say that your server did freeze/crash; it was not responsive. Spinning fans and hard drives mean nothing more than that power is applied to those devices. A freeze/crash does not mean the system will turn off. I just wanted you to be clear that your computer did crash.

Now I'm going to sound like a broken record. Sorry for those who know what I'm going to say. I'm not trying to pick on anyone at all in this thread so please do not think that is what I'm doing, well except for using the wrong hardware for an office/production machine but you may not have known better.

So it sounds like this machine is being used for a work/office environment. With that said, why are you using a gaming motherboard and not using ECC RAM? As previously mentioned, the add-on NIC is likely a Realtek product, I know many of the TP-Link devices are, if not all of them.

Some recommendations:
1) Replace the system with a proper production system if you value your data at all. If your data is not very important (and some people do run servers just to host video content) then read on.
2) Replace the NIC with an Intel NIC. They are not expensive. If you haven't disabled the onboard NIC, do that too.
3) Do some burn-in testing on your system. Since this is a production machine, 24 hours or longer for the CPU Stress Testing and 1 full week (some would say a month) of RAM testing (Memtest86+).
4) If you are going to stick with this system, I'd set up a job to reboot it every night or at least twice a week to keep the system nice and clean.

I do and I don't mean to be confrontational. It's unfortunate that many people will get sucked into thinking that they are building a good solid system when they watch a YouTube video that some jerk threw together just to get some views. Then you have a person thinking they did it right but end up having a bad product for what they wanted to do, TrueNAS. Then you have people out there who think that a computer is a computer, RAM is RAM, if my gaming system works fine, why can't I use it instead and it costs less money.

So for those who got caught up in this and didn't know better, I'm sorry that you have to experience this pain. It's very unfortunate that you will likely be spending a lot more money to buy the parts you need to make a solid reliable machine.

For those who cut corners intentionally, you are likely going to pay in the long run with corrupt data and/or system failures. These are no fun because they always happen at the worst time.

Many of our forum members helped generate a well written resource on proper hardware to build a TrueNAS system. All these parts can be purchased anywhere from $500 USD and up depending on what is on sale, an occasional good deal, or whatever. Prebuilt systems are often the least expensive as well and there are some very good deals at times.

Sorry I got on my soapbox (for those of you who know what that actually is). I truly do not like to see people have problems, but I also want anyone else reading this posting to understand that the system listed above is not recommended.
 

nanodisk

Thank you for the words; a comment like this is definitely needed. So yes, you're correct: this system is used in an office environment, as one of our backups (we have 3 including this one) for our main server.

I opted for 'gaming'/consumer-level hardware for cost reasons. This machine was a replacement for a 6-bay QNAP which died on us. We were going to get another QNAP, but it cost roughly double what the machine we built did. As well as the cost, building our own allowed us to select a case with enough room to add more drives if required.

Originally we selected that ASRock motherboard for its sufficient number of SATA ports and the onboard 2.5 Gb Ethernet. However, we learnt the hard way when setting it up that the onboard Ethernet is not supported by TrueNAS. We had this TP-Link NIC lying around and we knew it was supported, so we just chucked that in there.

In hindsight I should've spec'd the machine differently for more stability, but that was me being completely blind and wanting a cheaper alternative to off-the-shelf products.

So with all this and your advice in mind, we will continue to use this machine. So I'll be looking to order a different NIC and also at some better quality RAM. We will then leave the machine alone and see if any issues occur again. If the same thing happens, we will look at setting up a job to reboot it twice a week.

Greatly appreciate the comment again.
 

nanodisk

Is that a Realtek-chipset LAN card?
They have been known to be unstable.
I will have to double-check this. But from what others have said, and from that thread you linked, an Intel based NIC seems to be the better option.
 

joeschmuck

An Intel based NIC seems to be the better option
Yes, an Intel based NIC is well supported. One of the primary reasons an Intel based NIC is better is that the processing is handled by the NIC itself, whereas a Realtek pushes the processing off to the CPU. Realtek drivers are also not as well supported on BSD/Linux.
 

nanodisk

Not even in the /var/log/messages file? Run view /var/log/messages to read it, and type :q to exit.

Turning off XMP would be advisable, but it's unlikely to be the issue here. I'm more concerned about memory starvation or defective sticks, since that's the usual cause of similar behaviours.

If you understand the risks of using RAIDZ1 and are fine with them, by all means continue to use it; however, since RAID5/RAIDZ1 is dead, RAIDZ2 would be advisable. Please read the following resource:
Yep no logs at all, I've seen similar threads saying the same thing where their system crashed and there were no logs before the system rebooted etc.

I'll read through that white paper and make a decision from there. But I am curious as to how RAIDZ1 is dead but not RAIDZ2 as well? It may be an obvious answer, but I'm just trying to find out some more information.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I didn't think RAIDZ1 was a bad choice, this server is only used for backups and we have another server which we work off of. Please could you shed some light onto what you'd suggest to use instead.
If it's just a backup, of which you already have 3 copies total, RAIDZ1 is totally fine. If instead, it's your only copy, then yes, I'd be more on edge.
Do realize, though, that as your drives get larger and larger (double-digit TB), I'd instead opt for striped mirrors, as resilver times and performance in general are just way better. A resilver operation in a 6-drive RAIDZ1, for instance, has FIVE times more I/O load than a simple set of striped mirrors of 6 drives, because it has to read from all five surviving drives instead of just the failed drive's mirror partner.
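To put rough numbers on that, here is a back-of-the-envelope sketch for 6 x 6 TB drives. It is a simplified model, not how ZFS actually accounts for space or I/O: it ignores metadata and padding overhead, assumes two-way mirrors, and assumes a resilver reads from every surviving drive in the affected vdev. The Layout class and function names are just for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    usable_tb: int                 # raw usable space, overhead ignored
    guaranteed_failures: int       # drive losses survived in the worst case
    drives_read_on_resilver: int   # simplified: surviving drives in the vdev


def raidz(drives: int, size_tb: int, parity: int) -> Layout:
    """Single RAIDZ vdev: `parity` drives' worth of space goes to redundancy."""
    return Layout(f"RAIDZ{parity}", (drives - parity) * size_tb,
                  parity, drives - 1)


def striped_mirrors(drives: int, size_tb: int) -> Layout:
    """Stripe of two-way mirrors: half the space, rebuild reads one partner."""
    return Layout("striped mirrors", (drives // 2) * size_tb, 1, 1)


if __name__ == "__main__":
    for layout in (raidz(6, 6, 1), raidz(6, 6, 2), striped_mirrors(6, 6)):
        print(f"{layout.name:16} {layout.usable_tb:3d} TB usable, "
              f"survives {layout.guaranteed_failures} failure(s), "
              f"resilver reads {layout.drives_read_on_resilver} drive(s)")
```

Under these assumptions RAIDZ1 gives 30 TB, RAIDZ2 24 TB and striped mirrors 18 TB, and the 5-to-1 ratio in drives read during a resilver is where the FIVE times figure comes from.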
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
My position on backups is a bit different. The thinking goes along the lines of when the backup is needed. Usually, when the backup is used something has gone badly wrong. For me the backup is the last line of defense in such a situation. Therefore, the reliability, resilience, etc. of a backup server should in my view be higher than that of the regular server.

As to RAIDZ1 vs RAIDZ2: In today's world you need to expect that an HDD will break sooner or later. If that happens your data, if on RAIDZ1, do not have protection against another failure. Until the HDD has been replaced successfully there is an increased risk.

If my data are valuable to me, why accept that risk, if another 100-300 Euros/Dollars can eliminate it? If that amount of money is a problem in a business context, there is a bigger issue. I know this may sound arrogant. But in today's world the availability of correct data equals the survival of the business more than ever. At the end of the day we are talking about business continuity, risk management, and therefore statistics. The latter, unfortunately, evades common sense quite a bit. Yes, the probability of an event happening may be small. But if the consequence is bad enough, that is the factor determining what should be done to mitigate the consequences.
 

Davvo

So I'll be looking to order a different NIC and also at some better quality RAM.
If by "better quality RAM" you mean ECC RAM, you will have to change both your motherboard and CPU in order to make use of that error correction.

Yep no logs at all, I've seen similar threads saying the same thing where their system crashed and there were no logs before the system rebooted etc
That's unfortunate.
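With nothing in the server's own logs, one way to at least pin down when it goes silent is to poll it from another machine. A minimal sketch, assuming SSH is listening on port 22; the function names and the host name in the note below are hypothetical.

```python
import socket
from typing import Iterable, List, Tuple


def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def outage_starts(samples: Iterable[Tuple[str, bool]]) -> List[str]:
    """Given (timestamp, reachable) samples in time order, return the
    timestamp of every sample where the server went from up to down.
    The server is assumed to be up before the first sample."""
    starts: List[str] = []
    previous_up = True
    for stamp, up in samples:
        if previous_up and not up:
            starts.append(stamp)
        previous_up = up
    return starts
```

Run from a scheduled task on another PC, is_reachable("truenas.local") gives one sample per run. Feeding the collected samples to outage_starts shows when each outage began, which can then be compared against scrubs, backups or other scheduled jobs.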
 