FreeNAS 9.1.1 build - hardware failure

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Hi all, I built a FreeNAS server in 2013 which has served me very well for the past several years. Build details are here:


I'm running FreeNAS 9.1.1 from a USB drive, so I have no OS hard drive. All 4 hard drives are for my data pool.

Recently, the computer started powering itself off. After powering it back on, sometimes it would go into a cycle where it would stay powered on for about 5 seconds, then restart. Sometimes it would stay on for a few minutes -- I would actually be able to boot up, but then it would shut itself off again. The last time I got it to boot up all the way, there were no errors reported by FreeNAS and zpool status showed all of my drives healthy.

I tried lots of combinations of moving my RAM modules around, hoping it would be that easy, but unfortunately the problem doesn't go away. I'm probably going to have to take it somewhere to figure out if it's the mobo, cpu, or psu and then replace it.

Are there any considerations I need to make when swapping out hardware? I'm also assuming this will be a good time to upgrade to a more recent version of FreeNAS. If I do replace hardware and start with a fresh version of FreeNAS, will it recognize my existing zpool and be happy, or am I at risk for data loss? I do have nightly backups of my important stuff so it wouldn't be the end of the world if I lost the data, but I would really prefer not to. Any tips or advice would be greatly appreciated!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Your pool will be fine on any hardware, so that is not an issue.

Some troubleshooting tips...
1) Inspect all the fans, including the one inside the power supply. Are they turning as expected? Sometimes the power supply fan fails and then the power supply itself fails. If this is the case, replace the entire power supply rather than just the fan.
2) You could run the burn-in testing again and possibly identify a failing component. Disconnect the hard drives and any add-on cards you have and boot up MemTest86+ or similar, then if that passes try out a CPU stress test. If that works then I'd be surprised, but report back.

Good luck.
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Thanks for the tips! I actually just dropped it off at a local repair shop and they said they should get back to me by tomorrow.

So the pool will be fine with any hardware. I'm struggling trying to determine if this is a good time to upgrade hardware and/or a good time to try to mount the pool on a fresh version of FreeNAS 11, or if I should follow the whole "if it ain't broke, don't fix it" mantra -- replace any faulty hardware with the exact same hardware (as it's likely to be way cheaper now?) and leave the OS alone.

I could purchase another USB drive and load FreeNAS 11 and give it a shot, so long as the pool created with 9.1 will work on 11.2 without issues and won't jack anything up that would not allow me to go back to my existing 9.1 configuration if I needed/wanted to. Decisions, decisions...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It really depends on what broke and whether your previous system was up to snuff. If you went all out with a nice Supermicro and a Xeon CPU and lots of memory, and it's just that the PSU blew, then that old system is probably still quite serviceable after a PSU replacement. On the other hand, if it's an old recycled gaming system that's been overclocked in the past, and the CPU has failed, then replacing the board and CPU would be quite reasonable.

Bear in mind that you do not necessarily need to replace a board and CPU with a *new* board and CPU. There are still lots of Sandy Bridge and newer generation boards available cheaply that are very competent options.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Did you pull your data drives before dropping it off? Hopefully they don't wipe out your drives if you did leave them installed.
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Didn't even think of it... god I hope they don't wipe them. But they shouldn't... they actually specialize in data recovery, but for now they're just trying to identify which component is faulty. I have all of my important stuff backed up... I hope I don't lose any data, but if I did, I would still be OK.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
A good rule to follow is "Trust No One". You never know how many entry-level employees they have hired who think they are awesome but in reality are either dumb as rocks or, more likely, just do not have good troubleshooting skills or experience. For example, someone from my company's IT department logged in remotely to my work laptop to fix a minor annoyance I was having. I questioned what he was doing because, while I'm not an IT expert, I do tend to pick up on things quickly. Anyway, in a few seconds my computer was scrap, all lost. I had to ship it back to the IT office and was told they would just fix it and all would be good again. Nope, they reinstalled the OS image and all my data was gone. I have kept backups of my computer data since 1977, having been down the road of pain many times in my decades of computer experience, most of it self-induced of course. That was about 3 years ago and that IT person is now a reliable IT support tech, but I do give him a hard time whenever I talk to him to remind him of the time he killed my computer. Good thing you have the important stuff backed up.
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
So I'm being told with 99% certainty that it's the motherboard. It's a little scary because there is a flashing light that indicates it's the PSU, but the tech said he measured all the voltages coming out of the PSU and they were fine. They also didn't have a spare socket 1155 CPU to swap out with to be 100% certain that it wasn't the CPU.

So... I'm in the market for a new motherboard and consequently, a new CPU. Looking through the recommended hardware guide, I'm a bit caught off guard by the prices. An X11 + Xeon processor looks like it's going to run me over $500. I'm guessing if I go this route I should just stick with a cheaper i3 or something? Even then, I'm still looking at around $350?

I'm not running any jails or anything, and my server has been running perfectly fine (up until now) with my old Supermicro X9 and Pentium 2 processor. Back in 2013 I paid a total of like $210 for the motherboard and CPU. Am I dumb for thinking about buying another old X9 from eBay or something? They have a bunch listed between $33 - $70. Or would I be smart to just buy new hardware and get myself a little more current?
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
So I've decided to replace my motherboard with the exact same model. I just realized that even if I splurged on an X11 and a new CPU, I would also need to replace my DDR3 memory with DDR4, so the price just keeps climbing.

eBay has a used model for $50, a new model for $70, and Amazon has a new one for $100. I decided to go with Amazon just to ensure I can return it easily if need be.

 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The repair shop should have tested the PSU using a special load tester; it's quick and accurate. If they just took voltage measurements then hopefully they used an o'scope so they could view the ripple voltage accurately. I'm surprised they didn't have a suitable test CPU or even a motherboard that they could plug your CPU into for testing, just to rule out the CPU or motherboard. Getting back to the PSU, they should also have plugged in a different PSU to see if that would have fixed it. Hope they didn't charge you too much.

Best of luck to you.
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Ya, it cost me $59, which now seems to be a waste because he can't actually tell me anything definitively. Your troubleshooting techniques are spot on and what I would have expected the shop to do. I am blown away that they don't have any spare hardware to swap out and isolate the issue or any proper testing equipment.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I've been designing electronic circuits and computer-related parts, and troubleshooting them, for 4.5 decades (man, that sounds like a long time) and have my own way of doing things. If I were ever to take an item in for troubleshooting (like you had to) then I'd get it in writing that they have the proper tools to diagnose the problem down to a specific part failure. I wish for your sake that they had done a better job and I hope that you can get your system back online with minimal effort and cost.

So getting back to your problem, one of the things that could cause your issue is a bad capacitor, and it does happen. These components do wear out and fail, most likely on one of the voltage regulators. One thing you could do is disconnect everything from your motherboard except one stick of RAM (see your manual for RAM slot configuration), your CPU w/heatsink, power supply, keyboard and monitor. No hard drives, nothing at all. Power on the system. If it boots to the prompt saying it needs an OS or is looking for something to bootstrap from, let it sit for several hours and make sure it doesn't fail. If that works then try to boot from a USB flash drive or CD-R/DVD-R, maybe Ubuntu. If you have 1GB of RAM then it may take a while, but the goal here is to isolate the problem.

So let's say it restarts when you power it on. Well, swap out the RAM with a different piece and try again.

Also, go into your BIOS and select Factory Default settings and see if that works. You might take note of any voltage settings, but do not change them if you don't know what you are doing. You could easily destroy your CPU by increasing the voltage too much, and too much could be half a volt.

If you are able to keep it running, make a bootable MemTest86 flash drive (or CD) or use the UBCD and run MemTest86 for a few days. If it passes then great: power down, add the other RAM and rerun the test.

If you have another power supply then you could try that out, I doubt you have another motherboard which uses that type of CPU but if you did you could still diagnose the problem yourself.

Cheers
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Thank you so much for all of your advice @joeschmuck! You're dead on accurate and the shop did a very poor job of troubleshooting. Turns out it was the motherboard -- the computer seems to be perfectly stable with the new board in place. I'm a bit troubled, however... I usually don't keep a monitor hooked up to my NAS, but while I'm ensuring it's still up and everything is good, I'm noticing a whole lot of these messages scrolling by:

Code:
> (ada2:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 20 20 84 74 40 9b 00 00 00 00 00
> (ada2:ahcich0:0:0:0): CAM status: ATA Status Error
> (ada2:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada2:ahcich0:0:0:0): RES: 41 40 20 84 74 40 9b 00 00 00 00
> (ada2:ahcich0:0:0:0): Retrying command

Does this mean I have a failing drive? I have no other alerts and my zpool status shows all drives are healthy.

I'm also seeing this message:

Code:
nas smartd[2569]: Device /dev/ada2, 17 Currently unreadable (pending) sectors

I just ran the command `smartctl -t long /dev/ada2` and can report back the results when it completes.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You may have two things going on...

The first could be a data cable issue. Unfortunately data cables fail for seemingly no reason, but since you had to move everything, that may have caused an issue. Look at your drive status and if you see anything wrong, then try to move your cables around to see if you can fix/move the problem.

The second issue is with drive ada2: it has "pending" sector issues. This alone may mean nothing. You are already running a SMART long test on this drive, so see what shakes out of the trees. There is a link in my signature on hard drive troubleshooting that should help. If you have any concerns, post the output of your SMART test.

Glad the new motherboard fixed it.
 

turick

Dabbler
Joined
Nov 17, 2013
Messages
43
Here are the results after running `smartctl -a /dev/ada2`. Not 100% certain what to make of these results, but it does seem the drive is having issues?

Code:
[root@nas] ~# smartctl -a /dev/ada2
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC300210663
LU WWN Device Id: 5 0014ee 6ad7626f0
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jan 22 06:39:51 2020 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:         (26940) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 272) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3122
  3 Spin_Up_Time            0x0027   176   176   021    Pre-fail  Always       -       4200
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       76
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   023   023   000    Old_age   Always       -       56356
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   104   096   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       22
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       82

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     56351         36649888
# 2  Extended offline    Completed: read failure       40%     56350         2066217192
# 3  Short offline       Completed without error       00%     56319         -
# 4  Short offline       Completed without error       00%     56271         -
# 5  Short offline       Completed without error       00%     56192         -
# 6  Short offline       Completed without error       00%     56144         -
# 7  Short offline       Completed without error       00%     56096         -
# 8  Extended offline    Completed: read failure       10%     56078         3708162216
# 9  Short offline       Completed without error       00%     56048         -
#10  Short offline       Completed without error       00%     56000         -
#11  Short offline       Completed without error       00%     55952         -
#12  Short offline       Completed without error       00%     55904         -
#13  Short offline       Completed without error       00%     55856         -
#14  Short offline       Completed without error       00%     55808         -
#15  Short offline       Completed without error       00%     55760         -
#16  Short offline       Completed without error       00%     55712         -
#17  Extended offline    Completed without error       00%     55694         -
#18  Short offline       Completed without error       00%     55664         -
#19  Short offline       Completed without error       00%     55616         -
#20  Short offline       Completed without error       00%     55568         -
#21  Short offline       Completed without error       00%     55521         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Guessing it's time to order a new hard drive. Right now I have 4x2TB WD Red hard drives and I'm near capacity. 6TB drives are now cheaper than the cost of my 2TB drives at the time of purchase :) I'm wondering if there is any benefit to replacing the drive with a higher capacity drive, or if I would need to replace all the drives with higher capacity drives? Or in that case, can I go from 4x2TB to like 3x6TB or 2x8TB or something?
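A rough way to reason about the capacity question: in a RAIDZ vdev, every member contributes only as much space as the smallest disk, so a single larger drive adds nothing until all members have been replaced. The sketch below is only an illustration under assumptions. The pool layout is not stated in this thread, so a single 4-disk RAIDZ1 vdev is assumed, `raidz_usable_tb` is a hypothetical helper, and ZFS metadata and padding overhead are ignored.

```python
def raidz_usable_tb(disk_sizes_tb, parity=1):
    """Approximate usable space of one RAIDZ vdev, in TB.

    Every member is treated as being as large as the smallest disk,
    and `parity` disks' worth of space goes to redundancy.
    Metadata and padding overhead are ignored (a simplification).
    """
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than the parity level")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


# Assumed layout: one 4-disk RAIDZ1 vdev (the thread does not state it).
current = raidz_usable_tb([2, 2, 2, 2])      # today's 4x2TB
one_swapped = raidz_usable_tb([6, 2, 2, 2])  # only the bad drive replaced
all_swapped = raidz_usable_tb([6, 6, 6, 6])  # every drive replaced
three_by_six = raidz_usable_tb([6, 6, 6])    # a differently shaped vdev

print(current, one_swapped, all_swapped, three_by_six)
```

Under these assumptions a lone 6TB replacement gives the same usable space as today, and the extra room only appears once every member has been swapped. A layout with a different number of disks, such as 3x6TB, is a different vdev shape, so it would mean creating a new pool and copying the data over rather than replacing drives in place.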
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 22
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 199 000 Old_age Offline - 82


That drive needs to be replaced.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
ID 5 = 1 Not good but not terrible, counts of 5 or more are likely problematic.
ID 197 = 22 Not good but not terrible, possible signs of failure pending.
ID 200 = 82 Not good; on its own not terrible, BUT it's not on its own.
Extended Test Failure = 100% Bad Drive if you cannot pass this surface test.

Agreed, the drive needs to be replaced.
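For anyone who wants to pull these values out of `smartctl -a` output automatically, here is a minimal sketch. It is not an official smartmontools tool. The watched set (IDs 5, 197, 198 and 200) and the "raw value above zero" rule are assumptions made for illustration, taken from the attributes discussed in this thread, and `nonzero_watched` is a hypothetical helper name.

```python
import re

# Assumption: the attribute IDs discussed in this thread, plus 198.
WATCHED_IDS = {5, 197, 198, 200}

# Matches one row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smartctl_text):
    """Return {id: (name, raw_value)} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and other sections are skipped
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in WATCHED_IDS and raw > 0:
            found[attr_id] = (name, raw)
    return found


# Rows copied from the smartctl output posted earlier in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       22
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       82
"""

flagged = nonzero_watched(SAMPLE)
for attr_id, (name, raw) in sorted(flagged.items()):
    print(attr_id, name, raw)
```

On the sample rows this flags IDs 5, 197 and 200, the same three called out above. A nonzero raw value is only a prompt to look closer; the failed extended self-test is still the stronger signal.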
 