petr
Contributor
- Joined
- Jun 13, 2013
- Messages
- 142
I had a flawless run until recently, when I started having problems. It all started with a watchdog-triggered reboot, after which I began seeing the "ATA error count" occasionally increase across all of the drives and controllers.
I initially thought that my M1015 card was on its last legs, so I replaced it with a spare I had ready. However, it did not help at all; the situation was exactly the same.
The machine lives in a room with an ambient temperature of around 32 °C (airflow is good, and all temps report LOW in the IPMI).
I am now experiencing a watchdog-triggered reboot approx. once a month, and the "ATA error count" also occasionally jumps by 1. The ATA errors seem to have subsided after I moved the M1015 one PCI slot up, though this could be a coincidence. Another anecdotal piece of evidence concerns when the problem started: the first reboot occurred while a thunderstorm was passing through my area (the PC is behind a UPS/surge protector, but I suppose you never really know).
My question is: does this point to the motherboard, the PSU, or some other part? What tests can I run to determine what is wrong? I suppose the cheapest option would be to start with a PSU replacement, but I have no idea whether a failing PSU would match the symptoms above.
My setup:
- X9SCM-F-O paired with Xeon 1230 v2
- 32GB ECC RAM
- 10x 3TB WD RED for storage, 1x 120GB SSD for VirtualBox VMs
- Large Noctua cooler for the CPU, drives in cages pushing air through the whole front of the case. Additional fan for the M1015. I am quite confident that the temps are OK.
As I cannot really afford any downtime, it would be great to know whether it's likely the motherboard before spending 200 EUR on a new one. My course of action, done and planned:
- (done) check all cabling, replace/swap HDD cables
- replace PSU (could it be that?)
- replace motherboard
- (less likely IMO) replace CPU
- (less likely IMO) check memory
- throw the whole thing out of a window and build a new one
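To tell whether a change such as the slot move really reduces the ATA errors, the per-drive counts could be recorded between checks instead of relying on occasional alerts. Below is a minimal sketch, assuming smartmontools is installed and that `smartctl -l error` prints a line of the form "ATA Error Count: N" when errors are logged. The function names and the `/dev/da0` style device names in the comments are hypothetical and would need adjusting to the actual system.

```python
import re
import subprocess

# Matches the summary line from `smartctl -l error`, e.g. "ATA Error Count: 3"
ATA_ERROR_RE = re.compile(r"ATA Error Count:\s*(\d+)")


def parse_ata_error_count(smartctl_output: str) -> int:
    """Return the ATA error count found in smartctl text, or 0 if none is reported."""
    match = ATA_ERROR_RE.search(smartctl_output)
    return int(match.group(1)) if match else 0


def changed_counts(previous: dict, current: dict) -> dict:
    """Map each device whose count grew to a (before, after) tuple."""
    return {
        dev: (previous.get(dev, 0), count)
        for dev, count in current.items()
        if count > previous.get(dev, 0)
    }


def read_counts(devices):
    """Query each device with smartctl (needs smartmontools and root privileges)."""
    counts = {}
    for dev in devices:
        result = subprocess.run(
            ["smartctl", "-l", "error", dev],
            capture_output=True, text=True, check=False,
        )
        counts[dev] = parse_ata_error_count(result.stdout)
    return counts


# Example use (hypothetical device names):
#   before = read_counts(["/dev/da0", "/dev/da1"])
#   ... some days later ...
#   after = read_counts(["/dev/da0", "/dev/da1"])
#   print(changed_counts(before, after))
```

Saving each snapshot with a timestamp would make it possible to see whether the jumps cluster on particular drives or ports, or coincide with the watchdog reboots.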