Hard to find hardware issue: SATA lost connection

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
Over the past 3 months my FreeNAS server started to show CAM lost connection errors on multiple disks. The server consists of a 5x3TB disk RAIDZ array. The connection would drop and then reestablish. Unfortunately FreeNAS flags the datastore INOPERATIVE and does not recover, even after the disk is online again. A reboot helped and - deeply appreciating the work of the FreeNAS development team - the server recovered. Initially this happened only once or twice a week but recently the server stopped every day. To track down the root cause I swapped, one at a time:
  1. cabling,
  2. power adapters,
  3. SATA cards (disks are connected through a SATA card),
  4. hard disk,
  5. mainboard,
  6. power supply.
None of the above helped. There was only ONE thing left not yet replaced: a YF-VR882 3 fan power adapter cable (comes with a Fractal Design Define Mini mATX case) to three fans. The power adapter has some electronics.

Once the fan adapter was disconnected all SATA connections went back to normal, i.e. they were stable without ANY CAM errors.

I am now testing the fans with the mainboard fan connectors. Looks good so far.

Looks to me like an electronic part on the adapter died over the course of the past months. It doesn't have a lot: two resistors, one diode, one transistor and a part with a turning knob marked as "VR" (my guess is Voltage Regulator).

Has anyone else experienced issues with fan power adapters?
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
The issue is back. I've replaced the fan power supply with a cable from the mainboard using a Sharkoon regulator and a 1 to 3 adapter.

Maybe one of the fans is bad?

Unfortunately I cannot post the error messages as they don't appear in /var/log/messages, only on the console.
 

Stevie_1der

Explorer
Joined
Feb 5, 2019
Messages
80
What are your hardware specs?
Please supply the exact models of mainboard, CPU, RAM, drives, chassis, SATA cards, fans and so on, and describe how all of that is wired together.

If you can't copy the error messages, try a good photo of the screen instead.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
Without additional chassis fans there is no issue.
With three chassis fans, two in front, one in the back, I'm getting disk errors as per below. This setup worked from June 2018 to February 2019. Since February the situation gradually degraded.

I am now checking whether I can isolate one of the fans.

The errors I observed today happened some time this night between midnight and 10:00 in the morning. As access to the root partition was lost there is nothing in /var/log.

Asrock B150M-HDV/D3, 32GB Kingston RAM
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
bequiet PurePower 10, 350W
5 x WDC-30EZRZ 3TB harddisk
1 x 32GB SSD

root@nas1:~ # lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
00:1c.7 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #8 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)


root@nas1:~ # camcontrol devlist
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus3 target 0 lun 0 (pass3,ada3)
<TS32GSSD370S P1225CE> at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus5 target 0 lun 0 (pass5,ada5)
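Since the detaches move between ports, it can help to keep a record of which adaN name sits on which bus. The following is only a minimal sketch, assuming the `camcontrol devlist` text has been saved or pasted into a string; the function name `parse_devlist` and the regular expression are illustrative and not part of any FreeNAS tool.

```python
import re

# Matches lines such as:
# <WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus0 target 0 lun 0 (pass0,ada0)
DEVLIST_LINE = re.compile(
    r"<(?P<model>[^>]+)>\s+at\s+scbus(?P<bus>\d+)\s+target\s+\d+\s+lun\s+\d+\s+"
    r"\((?P<names>[^)]+)\)"
)


def parse_devlist(text):
    """Map each adaN device name to a (scbus number, model string) tuple."""
    devices = {}
    for line in text.splitlines():
        match = DEVLIST_LINE.search(line)
        if not match:
            continue  # skip blank lines and anything that is not a device line
        for name in match.group("names").split(","):
            if name.startswith("ada"):
                devices[name] = (int(match.group("bus")), match.group("model"))
    return devices


# Two lines copied from the listing above.
sample = """\
<WDC WD30EZRZ-00Z5HB0 80.00A80> at scbus0 target 0 lun 0 (pass0,ada0)
<TS32GSSD370S P1225CE> at scbus4 target 0 lun 0 (pass4,ada4)
"""

for name, (bus, model) in sorted(parse_devlist(sample).items()):
    print(f"{name}: scbus{bus}, {model}")
```

Noting the result next to each detach makes it easier to see whether one port, one disk, or all of them are affected.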




Disk failure.png


Server.jpg
 

Stevie_1der

Explorer
Joined
Feb 5, 2019
Messages
80
You didn't list what fans you are using, but the fan controller you are using now is only powered by the fan header, if I see it correctly in the picture.
This might be too much for the onboard header. Unfortunately your mainboard only has 2 fan connectors, one for the CPU and one for a chassis fan.

However, this does not explain the problems with your first fan adapter, as this one was powered directly by the PSU and not the mainboard fan header.
So you may well be right with your guess that one fan is defective.
Although I've never experienced a fan going bad in a way it would disturb other components, it sounds at least not impossible.

So I would do the following:
  • Connect all chassis fans directly to the PSU at 12V, no fan controllers or whatsoever involved.
    Just a cable like that: Molex to 3x fan, 12V.
    Does any fan sound or behave strangely, or make unusual noises?
    Mark every fan, like "Front Fan 1", "Front Fan 2", "Rear Fan" or "1", "2", "3" or "A", "B", "C" or whatever.
    Then let the system run like this for some days, or until the error reappears.

  • After that, a series of cross-tests follows, with only two fans running in the front to cool the HDDs.
    First, just disconnect the rear fan and check again.
    After some successful days, or if the error reappears, note the results.

  • Then unmount the rear fan and the first front fan, and mount the rear fan in the front. Don't mount the first front fan; just leave it out for now.
    Let it run again for some days or until the error appears, and note down the results.

  • Then unmount the second front fan, and mount the first front fan to the front.
    Again, let it run and note the results.

After that, please post the results.
Or even better, post updates after every test.

I hope the cross-testing brings some conclusive results.
Good luck, and keep us up to date.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Without additional chassis fans there is no issue.
With three chassis fans, two in front, one in the back, I'm getting disk errors as per below. This setup worked from June 2018 to February 2019. Since February the situation gradually degraded.

I am now checking whether I can isolate one of the fans.

The errors I observed today happened some time this night between midnight and 10:00 in the morning. As access to the root partition was lost there is nothing in /var/log.

Asrock B150M-HDV/D3, 32GB Kingston RAM
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
bequiet PurePower 10, 350W
5 x WDC-30EZRZ 3TB harddisk
1 x 32GB SSD

So. Your PSU is undersized. It is sufficient to sustain your system under optimal conditions, but that is not how one properly sizes a PSU. This is such a common problem that we have a guide.

https://www.ixsystems.com/community/threads/proper-power-supply-sizing-guidance.38811/

Sizing power supplies too small can cause all sorts of problems. If you have a marginal fan that is taking a variable amount of current, that could be bad. Things are relatively cool, fan slows down, fan stalls, suddenly fan is eating crazy-amps for a few moments, and spins back up, but in the meantime has caused a brownout elsewhere in the system, causing drives to spin down.

If the system is getting worse, this could simply be the fan getting worse, or it could be the PSU having been pushed too far and going into premature failure.

My suggestions:

Upsize your PSU to the recommended size, which will leave some overhead to drive a stalling fan without as much likelihood of browning out the system.

Replace the fans. Fans are cheap.

Doing both of these has better than an 80% chance of fixing your issues.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,450
A 350W power supply might actually be good enough, but it also depends on how the 350W delivery is broken down.
For the sake of argument, I would remove all the fans altogether except for the CPU and see how the systems will hold.
A weak power supply might cause CPU to crash, but having all those drives failing while everything else seems to be working is I think more than a coincidence and doesn't strike me as a power supply issue. It might but I think some other part of the system would also be affected as well.
If running 11.2, I would step back to the previous version you came from and see how it stands?
A system like yours shouldn't really consume more than 150W at peak and on idle might be around 80W or less. (Though I don't have experience with this particular system, and this is just an educated guess on my part.)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
A 350W power supply might actually be good enough, but it also depends on how the 350W delivery is broken down.
For the sake of argument, I would remove all the fans altogether except for the CPU and see how the systems will hold.
A weak power supply might cause CPU to crash, but having all those drives failing while everything else seems to be working is I think more than a coincidence and doesn't strike me as a power supply issue. It might but I think some other part of the system would also be affected as well.
If running 11.2, I would step back to the previous version you came from and see how it stands?
A system like yours shouldn't really consume more than 150W at peak and on idle might be around 80W or less.

A system like that will easily consume more than 150W at peak. The power supply sizing guide explains how to properly account for peak consumption in rather gory detail, and how to put sufficient headroom on top of that to deal with situations such as fan stalls.

this is just an educated guess on my part.)

I have to strongly disagree. The PSU sizing guide shows its homework along the way to making some educated guesses. It was written by someone who builds systems professionally and has a background in electronics.

The fact that the link drops and then re-establishes suggests the possibility of a system brownout, but could also be a sign that the SATA card might have a driver issue, or several less likely options. The initial post described a fan power adapter having failed, which read to me as though there was a strong possibility that something with the fans is dragging down the voltages, which is known to cause detachments.

The use of a fan controller of unknown design could mean that at one point, the fans had been set to a speed that was too slow, causing a stall, which would have caused an ungodly large amount of current to flow through the fan controller, burning it out, a fact suggested in the first post. Depending on the specifics, this might also have damaged -

1) The PSU
2) One or more of the fans
3) Possibly the mainboard
4) Possibly the HDD's

This is the point in the game where there's risk of damage having happened to the more expensive components in the system and I'm crossing my fingers hoping that maybe it is just a marginal fan. However, even if it is just a marginal fan that is actually causing brownouts, this is taxing the PSU and may be doing undetected damage that will come back at a later date to cause more trouble. This is one of the reasons I'm so big on sizing PSU's properly. You can undersize them and basically goad them into dying an early death.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
With respect to the power supply:
  1. I've used the power supply calculator at https://outervision.com/b/nO4kb9 to size the PSU. The calculator recommends 300W, my supply has 350W.
  2. I have already swapped the PSU with an older bequiet model with the same wattage. No change in behaviour. While not impossible having two PSUs with the same issue seems unlikely.
  3. One month ago I initially suspected the power supply to be at fault. I called the technical hotline at bequiet. The guy there told me that their supplies - if they fail - typically fail hard, i.e. shut down completely and cause the board and CPU to lose power. He told me he had never seen a PSU partially fail and cause SATA errors.
  4. I could see SATA errors also while sitting next to the server. Fans never stalled at the fan controller setting I use. While not impossible as stated by the bequiet technician I'd expect the server to completely shut down if the fan draws too much current. The 12V rail is good for 18A. Anything short of a complete short circuit in the fan (with fumes and odors) should not be able to bring down the PSU.
  5. My suspicion is rather that the fan in question has an issue with coil insulation or a failed capacitor and produces electrical sparks. These sparks produce loud (!) noise over a very wide frequency range (like a bolt of lightning). Certain frequencies pass through the power supply or the mainboard, reach the SATA links, pollute the frequency band in use and break the connection there.
@Stevie_1der: I agree with your assessment and action plan.
Current status:
Back fan fail
lower front ok
upper front testing
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
With respect to the power supply:
  1. I've used the power supply calculator at https://outervision.com/b/nO4kb9 to size the PSU. The calculator recommends 300W, my supply has 350W.

To the best of my knowledge, no one here has ever found a "power supply calculator" that calculated a proper size for the PSU for a system with more than maybe two or three drives, or that provides derating and the ability to cope with faults. My guide shows its homework. We have seen lots of problems caused by undersized PSUs, which was a large part of my motivation for writing this. Many other "power supply calculators" seem to be written by hobbyists for hobbyists, or have other glaring flaws.

I could see SATA errors also while sitting next to the server. Fans never stalled at the fan controller setting I use. While not impossible as stated by the bequiet technician I'd expect the server to completely shut down if the fan draws too much current. The 12V rail is good for 18A. Anything short of a complete short circuit in the fan (with fumes and odors) should not be able to bring down the PSU.

Sadly, while the last sentence is true in principle, it is false in practice.

My suspicion is rather that the fan in question has an issue with coil insulation or a failed capacitor and produces electrical sparks. These sparks produce loud (!) noise over a very wide frequency range (like a bolt of lightning). Certain frequencies pass through the power supply or the mainboard, reach the SATA links, pollute the frequency band in use and break the connection there.

Stranger things have happened. Typically these things are mitigated somewhat by a larger PSU. SATA/SAS do have the ability to cope with a little noise, as the protocol was designed with this in mind, but it's not out of the question.

It would be vaguely interesting to see if the SMART data revealed any issues, as your theory ought to result in an increasing count showing up in the attributes maybe ~183-188, but WD doesn't collect those on many drives. On the other hand, you could also look at 4 and 12 which could hint as to whether there are brownouts.
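For anyone who wants to compare those counters across several drives, here is a rough sketch. It assumes the text comes from `smartctl -A /dev/adaN` in its usual ten-column attribute table; the function name `watched_raw_values` is made up for illustration, and the sample values are invented, not taken from the attached dumps.

```python
# Attribute IDs named in the post: 4 and 12, plus the 183-188 range.
WATCHED_IDS = {4, 12} | set(range(183, 189))


def watched_raw_values(smartctl_output):
    """Return {attribute id: (attribute name, raw value)} for the watched IDs."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED_IDS and fields[9].isdigit():
            values[attr_id] = (fields[1], int(fields[9]))
    return values


# Hypothetical sample rows (values invented for the example).
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0022   118   107   000    Old_age   Always       -       32
"""

for attr_id, (name, raw) in sorted(watched_raw_values(sample).items()):
    print(f"{attr_id:>3} {name}: {raw}")
```

Running it before and after a detach and comparing the raw values shows whether the start/stop or power-cycle counts went up without a deliberate reboot.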

In any case, this brings us around to my first bit of advice posted above:

My suggestions:

Upsize your PSU to the recommended size, which will leave some overhead to drive a stalling fan without as much likelihood of browning out the system.

Replace the fans. Fans are cheap.

Doing both of these has better than an 80% chance of fixing your issues.

If you have a failing fan (your apparent prevailing theory), my suggestion fixes it.

If you have a PSU that's too small (demonstrable if you walk through the PSU sizing guide), my suggestion *might* fix it if the problem is a failing fan, or *should* fix it if you have a fan that is causing brownouts.

Trying to debug a system with data on it that involves possibly compromised hardware carries a bit of risk to it, so I'm not a big fan of the "keep trying different things" strategy.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
Current status (inconclusive): after some more running the server with only a single fan I found

Back fan fail
lower front ok
upper front ok

When I ran both the upper and lower front fans together, it failed, though.
Then I removed all chassis fans. After that I saw SATA disconnects again one minute after the FreeNAS startup sequence had completed.

That scraps my bad fan theory (I will replace the fans nonetheless).
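The reasoning can be written down as a small check. This is just a sketch: the fan labels and the function name `single_fan_suspects` are invented for the example, and the results are the ones listed in this post. A fan only counts as a suspect if it was connected in every failing run and in no passing run.

```python
# Each entry: (set of chassis fans connected, whether SATA disconnects were seen).
results = [
    ({"back"}, True),
    ({"lower_front"}, False),
    ({"upper_front"}, False),
    ({"upper_front", "lower_front"}, True),
    (set(), True),  # no chassis fans connected at all
]


def single_fan_suspects(results):
    """Fans present in every failing run and absent from every passing run."""
    suspects = set().union(*(fans for fans, _ in results))
    for fans, failed in results:
        if failed:
            suspects &= fans  # a culprit must have been connected here
        else:
            suspects -= fans  # a culprit cannot have been connected here
    return suspects


print(single_fan_suspects(results))
```

The run without any chassis fans still failed, so the suspect set comes out empty: no single fan explains all the observations.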

I guess I will go ahead and upgrade the PSU to a 450W model. According to the PSU sizing article:

2) For an E3-1230v3 (32-98W board+CPU, 12W memory):
5-6 Drives: 360W peak, 132W idle -> SeaSonic G-450

that should be sufficient.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Bear in mind that's only an estimate. You should run through your actual inventory.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
First thing after starting up with the new Seasonic 450W was a detach of ada0 - notably a different port and disk compared to previous detaches. I have restarted to check whether it was a one-off event. Currently no chassis fans attached. Enclosed are the smartd dumps for all drives.

Btw, load wattage for my setup totals to 412W. IMHO the 450W PSU should be sufficient.
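To put a number on the margin, here is a small sketch using only the two figures from this post (412W load, 450W rating). The function name `psu_headroom` is made up, and whether the resulting margin is enough is exactly what is being debated in this thread.

```python
def psu_headroom(load_watts, psu_watts):
    """Return (spare watts, load as a fraction of the PSU rating)."""
    return psu_watts - load_watts, load_watts / psu_watts


spare, fraction = psu_headroom(412, 450)
print(f"spare: {spare} W, load: {fraction:.1%}")  # spare: 38 W, load: 91.6%
```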

Seasonic 450W ada1 deatach.png
 

Attachments

  • ada0.txt
    6.1 KB · Views: 232
  • ada1.txt
    6.1 KB · Views: 216
  • ada2.txt
    5.5 KB · Views: 234
  • ada3.txt
    6.1 KB · Views: 225
  • ada4.txt
    4.6 KB · Views: 227
  • ada5.txt
    6.1 KB · Views: 234

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
That's really bad, because now the options start to get bad, and it starts to look like a failing board.

I'll ponder this a bit and see if I come up with any other ideas. In the meantime, it might be prudent to check temps on important bits of the board such as the CPU and PCH, as temperature can cause failures.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
Two failing boards with the same issue? Remember I have exchanged the board already. I have the original board and CPU now in a gaming PC where it runs without issues, i.e. no issues bothering Windows 8.1 playing games...

There are two more things not yet swapped: memory and the case. I will try the memory, too - both sets of memory are on the compatibility list, btw.

The problem I will have: not knowing what is bad how would I rebuild a NAS? Reuse the disks, the case, ... that is an interesting challenge.
 

Stevie_1der

Explorer
Joined
Feb 5, 2019
Messages
80
Oh, this is bad, what a bummer.
Have you checked the memory?
It would be best to run Memtest86(+) for some days, or until errors occur.

The chassis itself should be no problem, if all necessary mainboard standoffs are mounted and fastened, and no excess standoffs produce a short circuit.
Some weeks or so ago, someone here had a problem that was caused by a standoff without a corresponding mounting hole on the board...
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,450
I still don't buy the hardware failure.
I still strongly believe this is a driver or programming issue on the FreeBSD or FreeNAS side. Maybe something you recently changed in the BIOS.
You haven't said which version of FreeNAS you are having this issue with.
If you could, I would recommend you install on a new media a version that used to work on your system pre 11.1
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Oh, this is bad, what a bummer.
Have you checked the memory?
It would be best to run Memtest86(+) for some days, or until errors occur.

I've never heard of a memory problem causing such a specific reproducible error in an I/O subsystem. On the other hand, it's good to validate everything again. Or possibly for the first time.

The chassis itself should be no problem, if all necessary mainboard standoffs are mounted and fastened, and no excess standoffs produce a short circuit.
Some weeks or so ago, someone here had a problem that was caused by a standoff without a corresponding mounting hole on the board...

Well, that's a good point. There's something unusual going on here.

Check for loose screws wedged behind the board. While you've got the board pulled, make sure your standoffs are all correct. Inspect the back of the board for any signs of damage.

On the front, make sure that the PCH isn't getting too warm. There should be airflow over it. Check to make sure that the heatsink is secure and doesn't move at all - a heatsink failure could definitely cause this sort of issue, but is also quite out-of-the-ordinary. You can usually remove, clean, repaste, and reinstall a PCH heatsink. Use something like ArctiClean and Arctic Silver 5.

Clear the BIOS and reset to manufacturer defaults. Make sure that you only make necessary changes such as changing the SATA ports to AHCI mode (if that's needed). If it supports "hot swap" or "hot plug" mode also disable that. Because you're using a consumer grade board, these settings often come up "wrong" for FreeBSD/Linux, and sometimes some of these features don't really work correctly.

If there are power savings settings in the BIOS, try disabling them all. Also walk through all other BIOS settings with a mind towards "is this how it would be set for server use."

Make sure that all your drives are evenly spread across the power leads. If you've got five drives and three modular power leads that can handle drives, make sure you're going two drives on the first lead, two on the second, one and your fans on the third. Do not use SATA power Y's. Try to avoid Molex power Y's.
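As a trivial illustration of that "two, two, one" split, here is a round-robin sketch. The function name `spread_across_leads` is invented for the example; the device names are the five WD drives from the camcontrol listing earlier in the thread.

```python
def spread_across_leads(devices, lead_count):
    """Assign devices to power leads round-robin so no lead gets overloaded."""
    leads = [[] for _ in range(lead_count)]
    for index, device in enumerate(devices):
        leads[index % lead_count].append(device)
    return leads


drives = ["ada0", "ada1", "ada2", "ada3", "ada5"]
for number, lead in enumerate(spread_across_leads(drives, 3), start=1):
    print(f"lead {number}: {', '.join(lead)}")
```

The lead that ends up with a single drive is the one that would also take the fans.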

I disagree somewhat with @Apollo above - if basic AHCI support was broken in FreeBSD, it is likely lots of people would be up in arms. However, because this is PC hardware, it is perfectly possible that the mainboard's implementation is somehow defective and needs some workaround for a quirk. That answer isn't pleasing in the modern era, however, as it is usually the PCH that is running the SATA, and that'll be an Intel chip, the same that works fine on many other boards. Not *impossible* but unlikely. FreeBSD and Linux used to have terrible issues supporting random discrete SATA chips in the pre-AHCI era. The advent of AHCI and boards using Intel-supplied silicon for it basically brought that horrible era to an end years ago.

Unfortunately, most home users aren't really outfitted to do deeper dives into these issues. Here in the shop we'd put the power on a scope and actually just *see* if there were brownouts. And/or we'd swap in a heavy PSU to *see*. We'd swap boards to a completely different model (and probably a Supermicro model) to see if that made a difference. We'd blow down the contacts with contact cleaner. We'd try new SATA cables and different drives. Etc. But all of this basically involves additional resources.

The other angle is that as someone who does server work professionally, there are things that you might have done that are just not obvious to me, or might even flabbergast me, but would have seemed just fine to you. We had someone who had neatly outfitted their rig with a bunch of SATA power Y's once, and wasn't really aware of what ampacity is. Likewise, as @Stevie_1der mentions, things like making sure standoffs are correct.

It's possible we aren't going to find the issue. One of the reasons we discourage consumer mainboards is because qualifying them for use with FreeBSD and FreeNAS would have to happen on a case-by-case basis. Manufacturers design these things to "work with Windows" and so the BIOS options tend to be all screwed up for use as a UNIX server. We find the Supermicro boards are generally well-suited for use as servers, because they are designed as server boards, and Supermicro knows what Linux, FreeBSD, ESXi, etc., are, and aims for these things to run on their platform. Supermicro actually made custom variants of their boards based on ESXi ethernet support years ago. But since you had a working system and now it isn't working, that's more of a head-scratcher.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
Recently, there were some posts about a stand-off in the wrong position causing intermittent problems. Since the case itself was not changed, you should probably check that all stand-offs are accounted for and in the right places. This was one of @jgreco's suggestions.

I personally once had issues with a bad molex to sata power connector that caused intermittent system problems. I went through a lot of things, including replacement of the power supply, before I found that one.
 

bal0an

Explorer
Joined
Mar 2, 2012
Messages
72
Memory was not the issue. It was rather mundane, actually matching with the suspect list originally provided.

I replaced the Molex to SATA power connectors with brand new ones and it looks like that resolved the issue. The system has been running stable for 24 hours without any errors, under load and idling. I've made sure all of the used SATA power connectors went to the bin.
I guess the SATA power connectors had a dodgy cable, sometimes working fine and sometimes not.

I've never seen a cable become faulty in a server setup where the server case is not opened. In my case the disks sit in the case sideways, and the SATA power cables touch the case wall. With the hard disks vibrating and the cables pushing against the case, my guess now is that the cables got stress fractures and dodgy contacts.

The original setup involved some ASMedia SATA controllers, and one of those seems to have a bad link, too.
Having those issues overlap, with the SATA power connector sometimes working and sometimes not, got me epically confused.

Thanks for looking into this and providing help.
 