FreeNAS + ESXI Lab Build Log


svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
I had a drive fail catastrophically in one of the servers I manage for work. According to the log, the drive hit 185°C before it died, and the heat was so intense it caused the two neighboring drives to have errors too. It is a good thing the server was running RAIDz2 with a hot spare. I ended up having to replace three drives at one time, which is rare... That server didn't burst into flames though. It is very unlikely that would happen for anything short of an electrical fault. I had a power supply fail that smelled like it was on fire, but no flames ever appeared. These things don't happen often. I have only seen them because I have been dealing with many servers for many years.
  • 365°F (I have to think in F past 100°C) ... that is crazy talk!!! (I believe you).
  • I'm surprised it made it that far before dying ... do you know what caused the failure?
  • I don't proclaim to know anything about hard drive engineering, but I'm surprised there isn't some sort of internal fail-safe that spins them down before they hit such a high temp. Programmatically, I suppose that would require them to issue their own smartmontools command, interpret the result, and conditionally determine whether or not to continue to operate (so that wouldn't make any sense). But I'm surprised there isn't some sort of integrated temperature probe to achieve the same.
PS. We had one of the two cooling units for my section of the datacenter go down and the room temp went up to the low 80s (Fahrenheit), causing the drives to go up to the high 50s Celsius. It was a stressful couple of weeks waiting for the parts to fix the cooling unit. With those commercial coolers, I would have thought the parts would be more readily available. The way the data-center is partitioned, the other coolers didn't help my section much...
  • Ugh that sucks.
  • Stupid question #1: When you hear A & B Power or Cooling, that refers to 100% redundancy, right? So your two cooling units would be Cooling A and there wasn't a Cooling B?
  • Stupid question #2: I've only been in a bona fide colo once, but working in F&A I know that our cost was fixed for renting the space for the ~12 racks we had there. Point being, isn't there some sort of SLA that guarantees power / cooling by the colo operator, i.e. the failed cooling unit was their problem (obviously impacting you, unfortunately)? My inference from your wording is that your company had to maintain the cooling infrastructure.
  • Somewhat surprised low 80s°F ambient = high 50s°F HDD temp. I suppose I can't extrapolate my own server to a datacenter (drives could be 15k / multiple servers in constrained space could require a cool aisle / heat level could have produced a compounding effect), but I keep the air on 78°F during the summer and my warmest drive only hits 36°C with an average of 34.4°C with the full speed fan profile (and I imagine your servers would have wound up their fans to cope with the temp). Thankfully, for me, full speed isn't needed to stay sub 40°C @ 78°F ambient and usually Standard with a throttle here and there to Heavy I/O gets the job done (using X9 fan profile speak). Reference the image below, top right (produced some time ago, when I was trying to get temps under control); a rough sketch of pulling those numbers from the shell follows this list.
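If anyone wants the same warmest / average numbers without building a worksheet, a rough sketch of how I'd pull them from a FreeNAS shell is below (assuming drives that report SMART attribute 194 Temperature_Celsius; adjust if yours report 190 Airflow_Temperature_Cel instead):
Code:
#!/bin/sh
# Rough sketch: print the warmest and (integer) average drive temperature.
# Assumes smartmontools is present (it ships with FreeNAS) and drives report
# SMART attribute 194; some drives use 190 instead, so adjust as needed.
max=0; sum=0; count=0
for disk in $(sysctl -n kern.disks); do
  t=$(smartctl -A /dev/$disk | awk '$1 == 194 {print $10; exit}')
  [ -z "$t" ] && continue
  sum=$((sum + t)); count=$((count + 1))
  [ "$t" -gt "$max" ] && max=$t
done
[ "$count" -gt 0 ] && echo "Warmest: ${max}C / average: $((sum / count))C across ${count} drives"
# For those of us who think in Fahrenheit past 100C: F = C * 9 / 5 + 32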
I have to be my usual smart a$$ self (with a horrible sense of humor) and propose a single solution to both problems (servers catching on fire, which has been a bigger issue for the FreeNAS forum than users running non-ECC RAM, and failing cooling units): integrated water cooling!
  • Server catches on fire = no problem. PVC cooling loop melts and douses the fire with coolant.
  • Cooling unit fails = no problem. You don't need cool air, you have cool liquid.
Seriously though, I understand why liquid cooling hasn't penetrated the enterprise space to date; however, I was recently reading about an upcoming racked offering from IBM that doesn't use a single fan, and everything down to the memory is cooled by the loop (which it would have to be without airflow). Cool stuff and my apologies but a quick google search doesn't yield the article I'm looking for.

Off to bed for the night (hopefully I'm not awoken by a fire in my server closet) - Good Night. ;) [Yes, my sense of humor is that bad]

Full.png

Edit: Bottom right should read "Full Speed Comparison" not "Standard Speed Comparison" (used the same worksheet to compare all fan modes for a change I made and must have forgotten to change that text manually).
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
FWIW, my X10-SRi manual states 2.5A for the fan headers.

Screen Shot 2018-03-16 at 7.43.37 pm.png


And this FAQ specifically states that the X10-SRL-f is rated for 3A max.

https://www.supermicro.com/support/faqs/faq.cfm?faq=24693

And ditto for the H8
https://www.supermicro.com/support/faqs/faq.cfm?faq=20667

And 1.5A for the X9DR7-TF+
http://www.supermicro.com/support/faqs/faq.cfm?faq=16165

BTW, while we're talking about fans and IPMI: the X9 series does not support PWM control, but you do have this

http://www.supermicro.com/support/faqs/faq.cfm?faq=18009
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Somewhat surprised low 80s°F ambient = high 50s°F HDD temp

50C not F

And if you have a system in the top of the rack that is water cooled... and it fails... it can take out the whole rack.

Ouch.
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
BTW, while we're talking about fans and IPMI: the X9 series does not support PWM control, but you do have this

http://www.supermicro.com/support/faqs/faq.cfm?faq=18009
  • Don't take one of our fan modes away from us ... as we X9 users only have four speeds to begin with!!! OK maybe I'm just jealous of X10+ users.
  • Seriously though, that FAQ is misleading / incorrect in that it omits the Heavy I/O fan mode.
  • I personally find Heavy I/O to be quite useful as my script uses it when Standard doesn't quite keep the warmest drive under 40°C, but Full Speed ahead isn't quite needed (a rough sketch of the logic follows this list).
  • So if you are controlling the peripheral zone via fan header A, you get all of 4 speeds:
  • Optimal @ ~30% (fixed)
  • Standard @ 50% (I believe this is target, not fixed)
  • Heavy I/O @ 75% (fixed)
  • Full @ 100% (fixed)
  • My point is that 75% (Heavy I/O) comes in quite handy as there is a large variance in the sound pressure level between Standard and Full.
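For anyone curious, the logic of that script is roughly the below. This is a from-scratch sketch rather than a paste of the real thing, and the 0x30 0x45 raw command and mode values are the ones commonly reported for Supermicro boards on this forum, so please verify them against your own board/BMC before trusting it:
Code:
#!/bin/sh
# Sketch: pick an X9 fan mode based on the warmest drive temperature.
# Mode values as commonly reported for Supermicro BMCs (verify on yours):
#   0x00 = Standard, 0x01 = Full, 0x02 = Optimal, 0x04 = Heavy I/O
warm=$(for d in $(sysctl -n kern.disks); do
         smartctl -A /dev/$d | awk '$1 == 194 {print $10}'
       done | sort -rn | head -1)
[ -z "$warm" ] && exit 1

if [ "$warm" -ge 42 ]; then
  ipmitool raw 0x30 0x45 0x01 0x01   # Full speed ahead
elif [ "$warm" -ge 40 ]; then
  ipmitool raw 0x30 0x45 0x01 0x04   # Heavy I/O
else
  ipmitool raw 0x30 0x45 0x01 0x00   # Standard
fi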
Directionally your point is of course well noted and I tried to prepare him for that previously (lack of granular fan control) as I was a bit bummed when I found out ...
... and correct me if I'm wrong, but the X9 does have Pulse Width Modulated fan connectors and can control duty cycle as mentioned above, differing from the X10+ where you can issue IPMI RAW commands for any % duty cycle.
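And for the X10+ owners reading along, the per-zone duty-cycle raw command that gets passed around on this forum looks like the below. Again, only a sketch; zone and register values can vary by generation, so check against your own board before using it:
Code:
# Reported for X10/X11 Supermicro BMCs (verify on your board first):
# zone 0x00 = CPU/system fans (FAN1-4), zone 0x01 = peripheral (FANA...)
ipmitool raw 0x30 0x70 0x66 0x01 0x00 0x32   # set zone 0 to 50% duty cycle (0x32 = 50)
ipmitool raw 0x30 0x70 0x66 0x00 0x00        # read back zone 0 duty cycle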
A few references for your perusal when you start to play with fan speeds:
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
50C not F

  • I'll take that correction in stride as (a) I always appreciate someone who is detail oriented, and (b) you bothered to read through my babble. ;) Much love as always, as I've learned a lot from you.
And if you have a system in the top of the rack that is water cooled... and it fails... it can take out the whole rack.
Ouch.
  • The water cooling idea was a joke (as noted), save for the bit about IBM deploying a system using it (and they actually have previously, but I don't believe it was well received due to the risk you are driving at).
  • Anyway, back on topic, if you had the water cooling system at the bottom of the rack or in another rack, I think your risk would be mitigated, right? When I delidded and OCed my i7-4770k, I put a Noctua NH-D14 on it, so I don't have personal experience with water (next build coming shortly!), but if the pumps, radiator, etc. were segregated, the only concern would be failure of a water block or a fitting, right?
  • Not my field, but doesn't density per RU eventually hit a point where air doesn't get the job done any longer, and water has to be adopted?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
No, rack density just makes the input air need to be colder and higher volume.
We have some racks that are full, top to bottom, with 1U servers that are used for virtualization. The heat in the hot aisles is hard to tolerate and you need to wear a coat and gloves in the cold aisle.

PS. About ten years ago, we had some gear in the supercomputer center that had big radiators on top of each rack with fluid lines running through the compute nodes.
That was installed by Cray, but the new system they put in more recently doesn't have that.

 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Seriously though, that FAQ is misleading / incorrect in that it omits the Heavy I/O fan mode.

Another FAQ discusses that a 4th mode was added in a BIOS/IPMI update ;)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Think the limiting factor these days is power density in the data centre. You only get so many watts per rack.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
and correct me if I'm wrong, but the X9 does have Pulse Width Modulated fan connectors and can control duty cycle as mentioned above, differing from the X10+ where you can issue IPMI RAW commands for any % duty cycle.

Yes, that’s right.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
  • 365°F (I have to think in F past 100°C) ... that is crazy talk!!! (I believe you).
  • I'm surprised it made it that far before dying ... do you know what caused the failure?
I only have the log data to go by. I was pretty surprised to see that, and I have a feeling it might not be perfectly accurate because there was no melted plastic. I don't know what caused it. The drive completely stopped responding, and the drives on either side of the failed drive started throwing errors and needed to be replaced also. It happened when I was not present, and I only have the information that was logged to tell me what happened. The neighboring drives only started having errors after the temp of the failed drive got high. Those neighboring drives were still working, but they were having errors serious enough that I felt they needed to be replaced.
Stupid question #1: When you hear A & B Power or Cooling, that refers to 100% redundancy, right? So your two cooling units would be Cooling A and there wasn't a Cooling B?
That is the ideal situation, that all the cooling you need would be handled by either cooling system. Unfortunately, the funding is not always what you might like and things are not always ideal. One of our two cooling systems (the one that failed) is enough to handle all the cooling on its own. The other system is actually larger, but we only get a portion of the capacity it has, and that can only do enough to keep the temp down to uncomfortable levels. I actually did shut down four storage servers (and their attached drive shelves) to reduce heat-load so that more important servers could be kept online.
This is one of the things we will need to address during this year because we are working on adding another petabyte of storage which adds to the heat-load.
Stupid question #2: I've only been in a bona fide colo once, but working in F&A I know that our cost was fixed for renting the space for the ~12 racks we had there. Point being, isn't there some sort of SLA that guarantees power / cooling by the colo operator, i.e. the failed cooling unit was their problem (obviously impacting you, unfortunately)? My inference from your wording is that your company had to maintain the cooling infrastructure.
It is our 'in house' data-center. When the cooling unit went down, a monitoring system automatically contacted facility maintenance; they came out to figure out why the cooler was not working, ordered the part, and when it arrived we had to schedule the down-time to allow them to install it. Something I don't get is this: whatever it was that broke, we had to shut down power to the entire building to replace it. Every computer in the building had to be shut down as a precaution against them not getting the work done before the battery bank ran out. You may ask, what about the generator system? We had to disable that so the portion of the power infrastructure they needed access to did not get re-energized by the generator plant.
Somewhat surprised low 80s°F ambient = high 50s°F HDD temp. I suppose I can't extrapolate my own server to a datacenter (drives could be 15k / multiple servers in constrained space could require a cool aisle / heat level could have produced a compounding effect), but I keep the air on 78°F during the summer and my warmest drive only hits 36°C with an average of 34.4°C with the full speed fan profile (and I imagine your servers would have wound up their fans to cope with the temp). Thankfully, for me, full speed isn't needed to stay sub 40°C @ 78°F ambient and usually Standard with a throttle here and there to Heavy I/O gets the job done (using X9 fan profile speak).
The servers in our data-center run their fans at full speed all the time. Yes, it is loud in there, so loud that you can hear it outside the room. Under normal circumstances, the hottest drive might be around 32°C. When the cooling system went down, I was seeing high temps around 60°C and some of the drives we have are rated to handle that, but others are not. Either way, I don't like to have that situation. It wasn't happy days.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Think the limiting factor these days is power density in the data centre. You only get so many watts per rack.
Since we own the data-center, we were able to have more power run to specific racks. For example, we put a batch of IBM blade centers in that wanted 3-phase power and it took some time to get the electricians in but we got more power installed. It has happened a few times over the years that we needed more power for new servers, which is a pain because of the delay it incurs on getting a new system installed. I do my best to plan ahead for these things but not all the people that order the gear are as attentive as I am.
It also happens that vendors ship equipment that is not what we actually wanted and we have to go back through the procurement process to get some corrective hardware which extends the installation by months.
Managing these processes is a pain.
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Since we own the data-center, we were able to have more power run to specific racks. For example, we put a batch of IBM blade centers in that wanted 3-phase power and it took some time to get the electricians in but we got more power installed. It has happened a few times over the years that we needed more power for new servers, which is a pain because of the delay it incurs on getting a new system installed. I do my best to plan ahead for these things but not all the people that order the gear are as attentive as I am.
It also happens that vendors ship equipment that is not what we actually wanted and we have to go back through the procurement process to get some corrective hardware which extends the installation by months.
Managing these processes is a pain.
  • All interesting stuff Chris, thanks for taking the time to share (sincerely). You highlight some of the downsides, but clearly you are passionate about what you do. I wish I had gone into IT, but at this point, if I wanted to, I would have to essentially start my career over at the bottom as a deskside support technician (and I'm not looking down on that title; my point is that it is a little too late for me).
  • I love learning about this stuff as it is so foreign and interesting to me. During that visit to the Equinix facility @ 56 Marietta, I remember feeling like a kid in a candy store. It was cool to see all that stuff, and then there was Google: Their cage wasn't up and running yet and there were pallets of hardware waiting to be installed, but funnily enough, they had installed black plywood along the periphery of their cage, presumably to obstruct view of what hardware they would be running ... so dark and mysterious.
  • As to not further derail the thread, just curious on one point that actually converges @Stux and your replies.
  • So is "power density" the true limit, or a bit of an artificial one imposed by datacenter owners? I'm assuming that power companies would be happy to sell datacenter operators power all day long at commercial rates and usage thresholds, right? Are racks limited to 1.21 Gigawatts because more Gigawatts would require upgrading the transmission / feeder line, internal power delivery infrastructure, or something of that nature?
 

Maelos

Explorer
Joined
Feb 21, 2018
Messages
99
Lots to respond to, though I won't respond to all of it, as much of it is simply a good read.
Keep in mind that many IPMI products have known, open security vulnerabilities... especially if you aren't running brand-new systems with 100% current patching. If you put something on the public interwebz on 443, you will start getting bots hitting it in minutes. One vulnerability and you're owned. IPMI systems were *never* intended to be publicly exposed.
I have it port forwarded to 443, but I do not have any well-known ports open to the outside (anymore...). I will keep that in mind though. Knowing there is a trap is half the battle in avoiding it (a Dune quote, I believe).
  • Heat is a valid concern, but fire? Yes, I'm aware that if you throw enough heat at something you can get it to combust, but I think we are a long ways from that threshold here.
  • I'm very curious what your HDD temps look like, with the closet door closed, considering there isn't a proper exhaust?
  • You could have your wife power it back on by pressing the power button on the front panel.
  1. Fire may have been a bit of an extreme example, but I want to be overly cautious until I have some systems in place to notify me. Today I hope to set up alerts from IPMI for temps, look into what versions and upgrades are needed for IPMI, the BIOS, and the HBA card (see the quick checks after this list), and better secure ESXI.
  2. I only have an SSD and a SATA-DOM online at the moment. I am waiting to get things better tuned before I put an HDD or 7 in there and start tests on them. I am considering exhaust options and may post a more thorough video of the closet and potential exhaust options.
  3. Yes, I suppose. Hopefully I will not have need of this as everything gets better tested and configured.
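For item 1, these are the version checks I have collected so far (untested on my exact setup, so treat it as a sketch; the sas2flash line assumes an LSI SAS2-generation HBA, and everything is run as root from a shell on the host):
Code:
ipmitool mc info                 # BMC / IPMI firmware revision
dmidecode -s bios-version        # motherboard BIOS version
dmidecode -s bios-release-date   # and its release date
sas2flash -list                  # LSI SAS2-family HBA firmware/BIOS versions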
According to the log, the drive hit 185°C before it died, and the heat was so intense it caused the two neighboring drives to have errors too. It is a good thing the server was running RAIDz2 with a hot spare. I ended up having to replace three drives at one time, which is rare... That server didn't burst into flames though.
I am currently *blind* to my home server without active monitoring of IPMI or the ESXI. The unknown is what makes me nervous, albeit a little extreme in the example. As I set in place monitoring and alerts, I will hopefully move past that. This is all part of the experience - monitoring is a big part of my job in an operations center so learning how to set it all up is a huge +.
Seriously though, I understand why liquid cooling hasn't penetrated the enterprise space to date; however, I was recently reading about an upcoming racked offering from IBM that doesn't use a single fan, and everything down to the memory is cooled by the loop (which it would have to be without airflow). Cool stuff and my apologies but a quick google search doesn't yield the article I'm looking for.

Off to bed for the night (hopefully I'm not awoken by a fire in my server closet) - Good Night. ;) [Yes, my sense of humor is that bad]

View attachment 23432
Edit: Bottom right should read "Full Speed Comparison" not "Standard Speed Comparison" (used the same worksheet to compare all fan modes for a change I made and must have forgotten to change that text manually).
What are those images from? That's exactly what I would like to have, ideally in a mobile app with alerts/notifications set where I like them.
FWIW, my X10-SRi manual states 2.5A for the the fan headers.
View attachment 23433
And 1.5 for the X9DR7-TF+
http://www.supermicro.com/support/faqs/faq.cfm?faq=16165
BTW, while we're talking about fans and IPMI: the X9 series does not support PWM control, but you do have this
http://www.supermicro.com/support/faqs/faq.cfm?faq=18009
Good to know, thanks. I wonder why they have the data in some, but not all Mobo manuals :-/. I think I will use one of the scripts available here for the fans, and I may even have a chance to do that today or tonight. Exciting!
50C not F
And if you have a system in the top of the rack that is water cooled... and it fails... it can take out the whole rack.
Ouch.
This is why you have a separate system to evac the water cooling safely...and a redundant water cooling supply. On a somewhat related note, did you all hear that Microsoft was submerging its "data centers" in the ocean? How is that for water cooling! https://news.microsoft.com/features...oject-puts-cloud-in-ocean-for-the-first-time/
  • Optimal @ ~30% (fixed)
  • Standard @ 50% (I believe this is target, not fixed)
  • Heavy I/O @ 75% (fixed)
  • Full @ 100% (fixed)
  • My point is that 75% (Heavy I/O) comes in quite handy as there is a large variance in the sound pressure level between Standard and Full.
Directionally your point is of course well noted and I tried to prepare him for that previously (lack of granular fan control) as I was a bit bummed when I found out ...
... and correct me if I'm wrong, but the X9 does have Pulse Width Modulated fan connectors and can control duty cycle as mentioned above, differing from the X10+ where you can issue IPMI RAW commands for any % duty cycle.
Interesting. I will hopefully be looking into this more. I never got to look through all the jumpers on the board last night, but I will add this to the to do list also. Speaking of, maybe I need to have an internal wiki/Confluence hosted...
  • The water cooling idea was a joke (as noted), save for the bit about IBM deploying a system using it (and they actually have previously, but I don't believe it was well received due to the risk you are driving at).
I loved my water cooling rig and may consider something in the future, but I want to work up from hardware and play in the software realm for a good long while before returning. Had I not just sold all of my water cooling equipment, I would have kept everything but the 560mm rad (though I could mount that on the rack somewhere) and might have eventually rigged something up.

Think the limiting factor these days is power density in the data centre. You only get so many watts per rack.
Hence cryptomining is not welcome in many data centers - power and heat ++
It wasn't happy days.
I can't imagine that happening here, that would be nuts. Luckily there are quite a few DCs up here so shifting to a non-broken one, while painful, would be possible.
Managing these processes is a pain.
At least you own the datacenter. Having third party/outside techs have to do the work, especially when at a great distance, has been less than stellar.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I am currently *blind* to my home server without active monitoring of IPMI or the ESXI. The unknown is what makes me nervous, albeit a little extreme in the example. As I can set in place monitoring and alerts, I will hopefully move past that. All part of the experience - monitoring is a big part of my job in an operations center so learning how to set it all up is a huge +.
There are some monitoring scripts here, and they will help you monitor the situation once you get FreeNAS running.

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
At least you own the datacenter. Having third party/outside techs have to do the work, especially when at a great distance, has been less than stellar.
Well, not me personally.... ;-) I just work here.
The department I work for now does not have any equipment co-located in centers under other control. I used to work for another department that maintained 56 small facilities around the US and I was one of their "fly-away" technicians. In those days, if something went down and it couldn't be resolved quickly by the people on the ground, they would put me in the air and I would figure out the problem when I got there. I have had to replace satellite antennas or run new fiber through 4 floors of a building. The stories I could tell... I don't travel any more.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So is "power density" the true limit, or a bit of an artificial one imposed by datacenter owners? I'm assuming that power companies would be happy to sell datacenter operators power all day long at commercial rates and usage thresholds, right? Are racks limited to 1.21 Gigawatts because more Gigawatts would require upgrading the transmission / feeder line, internal power delivery infrastructure, or something of that nature?
There is a point at which you need significant structural changes or you have to get facility power upgraded. The big commercial facilities are outside the scope of my experience, but I can tell you this. The data center where I work was originally on the second floor of the building, and there came a time, I can't remember exactly when but it was around 2009, when they wanted to upgrade the SAN and it was going to be a lot more equipment weight. The director decided it would be a good idea to have the structural engineer evaluate the flooring system before they did the upgrade. They were thinking that the old raised flooring system might not be strong enough because they had noticed times and places where it would shift. It turned out that the second floor was not built to hold that much weight, and they had to move the entire data-center to the first floor or risk structural collapse.

When the building was originally built, in the 1960s, nobody planned for the amount of equipment that had been moved in over time, and the structure of the building just wasn't up to the task. They moved the people out of the offices under the data-center, gutted that part of the space, and built out a whole new data-center that all the existing equipment, and the new SAN, could be moved into. The people in charge at the time, in my opinion, did a lot right, but they also failed to take into account some new concepts that would have made the facility more efficient, and they didn't do enough to plan for future growth.

Capacity planning is one of the most difficult tasks. It is easy to build a facility to do what you are doing now, but it is difficult to build a facility that will still be relevant in 50 years. For example, the original data-center that was built in the 60s had to be scrapped and an entirely new one configured in its place. They really have no way to expand beyond this without demolishing the building and starting over. That means they have to focus more on density, and this is what has required them to bring in more power than the facility (just opened in 2010) was originally laid out to have. They ran 2 outlets that take this kind of plug for each rack:
Dell-9R905-Power-Cable-Cord-For-Server-Rack-Cabinets-PDU-s-[1]-20704-p.jpg

It was enough back in 2010, but those new blade centers want even more... There is a limit, but I suppose it depends on how much money you throw at the problem to go from where you are to where the limit is, because we are not really at the limit yet.
At my house, it is a different story. I am probably at the limit (or close to it) of what my wife can tolerate. I will try again next year.
 

Maelos

Explorer
Joined
Feb 21, 2018
Messages
99
I have a few more questions as I feel like I am struggling with the basic setup. I don't want to keep coming back here for everything, but I also don't want to get burnt out. While I do a bit more research I wanted to leave this here: http://jro.io/nas/. It may already be a well known resource, but wowza. This is something like what I had hoped to put together after having everything up, tested, and running smoothly.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have a few more questions as I feel like I am struggling with the basic setup.
Please ask them. Someone here might be able to point you in a direction even if they can't answer the question.
 

Maelos

Explorer
Joined
Feb 21, 2018
Messages
99
FoxLabESXI.png

Above is a simplified (no Cisco setup) lab view.

I have been attempting to work from the "ground up" on this build, learning and adjusting along the way. Building a home server started months ago when I was considering a mining+server build. I have thought about everything from adding a dedicated circuit (and still may) all the way through what types of environments I would like to be able to deploy for lab/learning purposes. Here are the goals I have set and met thus far.
  1. Requirements gathering - Storage/Fileserver and Backups are first priority, then lab
  2. Hardware requirements (initial build is mostly sold off, new and improved build is working thanks to you all)
  3. Management and monitoring - I can remotely work on the server and have minimal access management for my friends/coworkers. I would like to improve the granularity of user management, but first I would like to implement monitoring. Whether through IPMI, FreeNAS, a VM, ESXI...whatever, I would like to get notifications of temperature and hardware errors. If it can all be done through FreeNAS, great.
  4. Security - I want to be able to securely offer access to family, friends, and coworkers. I wish I could implement privileged access management on top of all this, but that could take weeks to properly set up. I am currently looking at pfSense for a combination of user account/password, possibly PGP keys, and/or certs.
  5. FreeNAS Setup - when I will finally move non-sensitive files over and set up scrubs, SMART testing, etc. to play with for a while.
  6. Nagios XI - doing further work here
  7. JIRA - testing it and how it works with Nagios XI
  8. Lab... - where everything else fits
My biggest issue right now is that I am still unable to get a basic email alert from IPMI or any other tool. For peace of mind I would like to have alerts set up and a daily temperature reading. I have created a Gmail account to use their free SMTP service, but I am not having much luck getting it to work. As stated above, it does not matter to me exactly where this is run - be it from IPMI, a VM using IPMIView, FreeNAS, ESXI, or otherwise. I have been re-reading Stux's thread as it has become even more relevant to my struggles. Below are a few screens of my IPMI settings (and a quick SMTP reachability check after them) - I know the version is out of date, which may be the cause, but the update failed remotely so I will attempt it again from home soon.
IPMI_SMTP_3_21_18.png
IPMI_alerts_3_21_18.png

IPMI_version_3_21_18.png
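One sanity check I plan to try before blaming the IPMI settings themselves: confirm that Gmail's SMTP endpoints are even reachable from my network. A sketch is below; run it from any box on the same segment as the BMC:
Code:
# Quick TLS handshake test against Gmail's SMTP endpoints:
openssl s_client -connect smtp.gmail.com:465 -quiet          # implicit TLS (SMTPS)
openssl s_client -starttls smtp -connect smtp.gmail.com:587  # STARTTLS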


My next big hurdle is figuring out a system to securely let friends, family, and coworkers access the environment while being able to manage what they can see and touch. I have been looking for an access management/PAM or similar solution that is free, without much luck. I know many will suggest a VPN, but if I go that route, it may make using it at work harder, as appearing to come from a separate network will limit my in-work access. Any ideas? I will be looking at pfSense to get something better than just a port redirect and user/password. My coworker additionally suggested port knocking and Fail2Ban. I look forward to feedback as I look into setting up pfSense in ESXI properly - hard to tell which NICs are which remotely...

P.S. - I also looked into Dynamic DNS so people won't have to keep track of the lab's changing IP. I registered for NS1 but may have done so in error. I'm not sure if they offer free DynDNS like NoIP and others do. Usage of DynDNS may have to coincide with the rehosting of my website off of this server, as I will then have real domains to use...maybe?

P.S. 2 - I also looked into using the virtual appliance for the management of the APC UPS that I purchased. I ran into a road block there too when the config asked for the address of the management NIC with user and password... I may have to default to using the service on ESXI or FreeNAS as described and scripted by the community (Stux has at least one).

P.S. 3 - Where do you even measure the HDD temps? I assume this is somewhere in FreeNAS. I added two HDDs to the array just to see if the HBA worked (it does, but it still may need to be flashed to a more recent version if I find it warranted). The closet itself has actually stayed nice and cool, with only a few degrees difference from outside the door. I also snagged a ton of egg crate foam from the data center trash to use within the cage/room for sound dampening. I will have to upload that picture later.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
My biggest issue right now is that I am still unable to get a basic email alert from IPMI or other tool. For peace of mind I would like to have alerts setup and a daily temperature reading.
Once you have drives in FreeNAS, it can send you a daily report. There is a script; I gave you a link to the GitHub page where it lives. Set it up to run daily (or more often) through cron (see the example entry at the end of this post) and it sends a report like this:
Code:
########## SMART status report summary for all drives on server EMILY-NAS ##########

+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|Device|Serial			|Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total	 |High  |Command|Last|
|	  |Number			|	|On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks	 |Fly   |Timeout|Test|
|	  |				  |	|Hours|Count|Count|	   |Sectors|Sectors |	  |		  |Writes|Count  |Age |
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|da0   |x4x3xxxx		  | 30 |11675|   32|	0|	  0|	  0|	   0|	 1|  34790523|	 1|	  0|   0|
|da1   |x305xxxx		  | 28 | 4890|   16|	0|	  0|	  0|	   0|	 0|  22226378|	 0|	  0|   0|
|da2   |x305xx2x		  | 29 | 4905|   18|	0|	  0|	  0|	   0|	 0|  22750814|	 0|	  0|   0|
|da3   |x4x3xxxx		  | 29 |11469|   32|	0|	  0|	  0|	   0|	 1|  33442494|	 0|	  0|   0|
|da4   |x4x3xxxx		  | 29 |11467|   30|	0|	  0|	  0|	   0|	 0|  33545435|	 2|	  0|   0|
|da5   |x4x29xxx		  | 30 |11470|   34|	0|	  0|	  0|	   0|	 0|  32531155|	 3|	  0|   0|
|da6   |x4x19x0x		  | 30 | 8434|   64|	0|	  0|	  0|	   0|	 0| 964322391|	 0|	  6|   0|
|da7   |x4x1xxx1		  | 30 | 8434|   60|	0|	  0|	  0|	   0|	 0| 972819247|	 0|	  6|   0|
|da8   |x307xx18		  | 29 | 5017|   47|	0|	  0|	  0|	   0|	 0|  17628102|	 0|	  0|   0|
|da9   |x307xx4x		  | 28 | 7595|   90|	0|	  0|	  0|	   0|	 0|  19654954|	 0|	  0|   0|
|da10  |x307xx56		  | 28 | 7236|   49|	0|	  0|	  0|	   0|	 0|  19126716|	 0|	  0|   0|
|da11  |x307xx35		  | 29 | 8517|   82|	0|	  0|	  0|	   0|	 0|  21678809|	 0|	  0|   0|
|da12  |x4x22xxx		  | 29 |13651|   79|	0|	  0|	  0|	   0|	 0|  63215653|	 2|	  0|   0|
|da13  |x4x2xx0x		  | 29 | 8019|  106|	0|	  0|	  0|	   0|	 0|  57801039|	 4|	  0|   0|
|da14  |x4x1x916		  | 30 | 8434|   59|	0|	  0|	  0|	   0|	 0| 957156034|	 0|	  0|   0|
|da15  |x4x1x36x		  | 29 | 8409|   60|	0|	  0|	  0|	   0|	 0| 960180080|	 0|	  0|   0|
|da16  |x4xxx58x		  | 35 | 4139|   58|	0|	  0|	  0|	   0|	 0|  38626923|	 0|	  1|   0|
|da17  |x4xx6xxx		  | 34 | 4170|  102|	0|	  0|	  0|	   0|	 0|  38856188|	 0|	  0|   0|
|da18  |x2xxx7xx		  | 35 |38925|  115|	0|	  0|	  0|	   0|	 0| 512919050|	 0|	  0|   0|
|da19  |x3x4xxxx		  | 35 |23411|  723|	0|	  0|	  0|	   0|	 0| 140766959|	 0|	  0|   0|
|da20  |x6x3xxx7		  | 34 |20760|   74|	0|	  0|	  0|	   0|	 0|  77242797|	 0|	  0|   0|
|da21  |x3x2xxxx		  | 35 | 1527|   22|	0|	  0|	  0|	   0|	 0|  36779652|	 0|	  0|   0|
|da22  |x3xxx0xx		  | 35 |14749|  867|	0|	  0|	  0|	   0|	 0| 210905594|	 0|	  0|   0|
|da23  |x3xx262x		  | 36 |15285|  813|	0|	  0|	  0|	   0|	 1| 238499218|	 0|	  0|   0|
|da24  |x2xxxx9x		  | 35 |14480|   37|	0|	  0|	  0|	   0|	 0|  43665318|	 0|	  0|   0|
|da25  |x2xxxxx2		  | 36 |22363| 2213|	0|	  0|	  0|	   0|	 0| 289352536|	 0|	  0|   0|
|da26  |x6xx2xxx		  | 36 |11918|  381|	0|	  0|	  0|	   0|	 0|  51053042|	 0|	  0|   0|
|da27  |x6xxx29x		  | 35 |12612|  196|	0|	  0|	  0|	   0|	 1| 111577501|	 0|	  0|   0|
|da28  |xx444x0x		  | 34 |  335|   18|	0|	  0|	  0|	   0|	 0|   5394931|	 0|	  0|   0|
|da29  |x3xxx7xx		  | 35 |28985|  192|	0|	  0|	  0|	   0|	 2| 363422298|	 0|	  0|   0|
|da30  |x2xxxxxx		  | 35 |37161|   92|	0|	  0|	  0|	   0|	 0| 856527568|	 0|	  0|   0|
|da31  |x2xxxxxx		  | 35 |33683| 1275|	0|	  0|	  0|	   0|	 0| 252059899|	 0|	  0|   0|
|ada0  |xxxx32150051120xxx|	| 7420|	0|	 |	  0|	   |		|   N/A|	   N/A|   N/A|	N/A|   0|
|ada1  |xx05x7725xxx	  | 39 |15613|  124|	0|	  0|	  0|	   0|	 0|	   558|   N/A|	N/A|   0|
|ada2  |xx05x7725xxx	  | 39 |15613|  129|	0|	  0|	  0|	   0|	 0|	  1082|   N/A|	N/A|   0|
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
and then there is a bit like this for each of the drives in the pool:
Code:
########## SMART status report for da0 drive (Seagate Barracuda 7200.14: SN:xxxxxxxx) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   120   099   006	Pre-fail  Always	   -	   242122200
  3 Spin_Up_Time			0x0003   096   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   32
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   075   060   030	Pre-fail  Always	   -	   4329757824
  9 Power_On_Hours		  0x0032   087   087   000	Old_age   Always	   -	   11675
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   32
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   099   099   000	Old_age   Always	   -	   1
190 Airflow_Temperature_Cel 0x0022   070   056   045	Old_age   Always	   -	   30 (Min/Max 27/33)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   27
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   599
194 Temperature_Celsius	 0x0022   030   044   000	Old_age   Always	   -	   30 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   11630h+23m+54.280s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   7199859848
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   1018687155215

No Errors Logged

Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline	   Completed without error	   00%	 11668		 -
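If it helps, a minimal example of scheduling it is below. The path and name are just placeholders; point it at wherever you save the script, or set up the same thing through the FreeNAS GUI under Tasks -> Cron Jobs:
Code:
# Example crontab entry: run the report script daily at 06:00.
# /root/scripts/smart_report.sh is a placeholder path; use your own copy's location.
0 6 * * * /root/scripts/smart_report.sh > /dev/null 2>&1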
 