Replace MSI Motherboard With SuperMicro

Status
Not open for further replies.

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
I have FreeNAS 8.0.3 running on an MSI 890GXM-G65 motherboard (which uses an AMD Phenom processor) and the system locks up every single time I transfer files to the ZFS disks. They are set up as CIFS shares. I am about ready to replace this with a SuperMicro H8SGL-F-O and 32 GB of memory. Also, I will either upgrade my current FreeNAS to 8.0.4 or start fresh, dump the existing config and set everything up from scratch. It bothers me that I can't narrow down the cause of the system lockups. Perhaps I killed the motherboard with an unintentional static electrical discharge at some point. The cautions about FreeNAS' sensitivity to motherboard hardware also contribute to my thinking I had better replace the MSI board. I still have nagging doubts and I dislike the considerable expense.

I have another person running FreeNAS 8.0.3 with UFS disks, not ZFS, and much less memory, and it has worked flawlessly. Indeed, it has made me look very good to that person. I often wonder if the real problem here is that ZFS is causing the lockups. However, that FreeNAS system is on a Dell Vostro and I think Dell uses Intel motherboards. FreeNAS is supposed to be much more stable on Intel parts, right?

The other thing I am doing to address my problem is replacing my network switches with managed switches. I just installed a Cisco SG300-10, which I have been told is a very low-end, rebranded Linksys switch. I am considering adding a Juniper EX2200-C switch.

Thanks

Bob
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Couple of things:
1) The brand/model switch should not be relevant to FreeNAS stability.
2) Can you look at the console of the machine which locks up to see if there are any errors printed on screen? This would help debug the issue.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
I've observed the console and/or tailed the output of /var/log/messages during these lockup events. The sure sign of this problem, which appears anywhere from 17 minutes to just under 3 hours into a transfer, is a popup message on the machine accessing the ZFS volume stating that the network connection has been dropped.

/var/log/messages output looks normal to that point -- most of the time, the most recent messages are from ntpd. If I don't tail the messages, I usually get no error message text printed to the console.

I'm unable to access the web GUI when these events happen.

These symptoms are being experienced by others. I thought perhaps in my case ZFS was causing kernel memory starvation, so I followed Protosd's suggestion for adding "loaders" to increase kernel memory. Perhaps I've done those wrong. I can post the loaders I am using.

Perhaps there is a different log file I need to tail?

The release notes for 8.0.3 state:

"- FreeBSD can be really touchy with hardware. Please be sure to update
your BIOS/BMC firmware when upgrading..."

My gut feeling is that I may have damaged the hardware in some way (maybe through ESD...or a dropped screwdriver once?), but I am certainly open to suggestions to look at other log files.

Bob

 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
If you're using 64bit, you shouldn't need to play with memory. One of the advantages of 64bit is that memory can flow (as needed) between kernel and user.

In my experience, ZFS errors spit out a note on the console informing you of a lack of memory (kmem) when they panic.
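
One way to look for such messages after the fact is to scan a saved copy of the log for lines that mention a panic or kmem. The following is a minimal sketch, not a FreeNAS tool: the function name `find_suspect_lines` and the two search keywords are assumptions chosen for illustration, and the sample lines are invented placeholders rather than real FreeNAS output.

```python
import re

# Keywords are an assumption for illustration, not an exhaustive list
# of what a kernel memory problem might print.
PATTERN = re.compile(r"panic|kmem", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines mentioning panic or kmem."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Invented placeholder lines, only to show the call; not real log output.
sample = [
    "placeholder: routine ntpd message\n",
    "placeholder: panic reported, kmem exhausted\n",
]
for number, text in find_suspect_lines(sample):
    print(f"{number}: {text}")
```

To run it against a real log, pass an open file object instead of `sample`, since iterating over a file yields its lines.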

I'd probably try both the upgrade to 8.0.4, and also the start from scratch. I've been prodding hardware for quite a while and I've never managed to zzt a component yet.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Thanks, Louis, for your help. I really appreciate it.

I am running 64 bit, yes, and I had the same thought as you suggest. Now that you remind me -- I did get some kernel panic messages related to kmem. That is what prompted me to add the loaders.

One other thought I have is that perhaps the Phenom processor is overheating. It seems doubtful however since I have 4 fans installed in the case in addition to the fan on the processor heat sink.

I will try your suggestions out; perhaps this unit can still be saved.

I did however very reluctantly order a new motherboard, memory, and a power supply. I also ordered a used IBM M1015 LSI board to act as an SAS controller. I'll flash it to IT mode when it arrives.

Bob
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
I'm guessing you can't just move the usb stick and spindles to another machine for testing. I don't know if the BIOS keeps a log of overheat or not. Perhaps worth checking.

You might try using the board as a RAID and create RAID0 arrays each with 1 spindle and then see if zfs behaves differently.

I did this with the PERC I have because FreeNAS wouldn't see it as pass-through.
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Thanks for the suggestion to check the BIOS logs, Louis. I had not thought of that and will do so Wednesday evening my time. I will report back soon. I may also try the suggestion about trying RAID with just one hard drive and seeing how ZFS behaves.

Bob
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
The MSI motherboard BIOS does not keep logs of anything as far as I can see. I have 8.0.3-p1 running on the MSI system right now, and it will likely work just fine...with no traffic to and from the shares, e.g. no data load. If I try to transfer 200 GB of data, I'm in trouble; that is when network connections start to drop. It happens anywhere from 17 minutes to 2.75 hours into a transfer, but usually after an hour or so. FreeNAS freezes up and the system has to be rebooted.
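
To pin down exactly when each drop happens, a second machine could probe the NAS at a fixed interval and record the time of every transition from reachable to unreachable. The sketch below assumes the CIFS service listens on TCP port 445; the function names, the 30-second interval, and the number of rounds are all hypothetical choices, not anything FreeNAS provides.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_starts(samples):
    """Given (timestamp, reachable) samples in time order, return the
    timestamps where the host went from reachable to unreachable.
    The host is assumed reachable before the first sample."""
    starts = []
    previous = True
    for stamp, reachable in samples:
        if previous and not reachable:
            starts.append(stamp)
        previous = reachable
    return starts


def watch(host, port=445, interval=30.0, rounds=10):
    """Probe the host `rounds` times and print one timestamped line per probe."""
    samples = []
    for _ in range(rounds):
        ok = is_reachable(host, port)
        now = datetime.now()
        samples.append((now, ok))
        print(f"{now.isoformat(timespec='seconds')} {'up' if ok else 'DOWN'}")
        time.sleep(interval)
    return samples
```

Calling `outage_starts(watch("nas-address"))`, with the placeholder replaced by the NAS's address, would list when each drop began, which can then be compared against the timestamps in /var/log/messages.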

Bob
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Can you confirm that w/o network traffic, the system will run for at least 24hrs w/o issue?

What network card do you have?
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Hi Louis,

Yes I believe I can confirm that the system runs for days if it is not under load. I know this because I get system status emails from the system daily at 3:00 a.m. I got last night's emails and fully expect to get tonight's as well.

I just checked the system to verify the network card, so I know it is running fine right now.

The network card is an Intel 82574L Gigabit Ethernet Controller and is set up as em0. The card is a PCI Express device.

I think the driver is listed in dmesg:

em0 Intel Pro/1000 Network Connection 7.1.9.

Thanks

Bob
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Interesting.

I would probably start by pulling 12G of RAM (drop it down to 4G) and see what happens. Try rotating memory and see if that makes a difference. If nothing pans out there, I would try swapping CPUs. If that doesn't have any effect, I would suspect it's the motherboard.
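
The memory rotation can be kept organized by writing down every stick/slot pairing and recording whether a large transfer survived each one. This is only a sketch under assumptions: the DIMM and slot labels are hypothetical, one stick is installed per test run, and the motherboard manual decides which slots are actually valid for a single stick.

```python
from itertools import product


def rotation_plan(dimms, slots):
    """Every (dimm, slot) pairing, one stick installed per test run."""
    return list(product(dimms, slots))


def suspects(results):
    """results maps (dimm, slot) -> True if the transfer test passed.
    A DIMM or slot is flagged when every run that used it failed."""
    dimms = {dimm for dimm, _ in results}
    slots = {slot for _, slot in results}
    bad_dimms = sorted(
        d for d in dimms
        if not any(ok for (dimm, _), ok in results.items() if dimm == d)
    )
    bad_slots = sorted(
        s for s in slots
        if not any(ok for (_, slot), ok in results.items() if slot == s)
    )
    return bad_dimms, bad_slots


# Hypothetical labels and outcomes, only to show the calls.
plan = rotation_plan(["A", "B"], ["slot1", "slot2"])
outcome = {
    ("A", "slot1"): False,
    ("A", "slot2"): False,
    ("B", "slot1"): True,
    ("B", "slot2"): True,
}
print(plan)
print(suspects(outcome))
```

If every run fails regardless of stick or slot, everything gets flagged, which points away from the memory and toward the CPU or motherboard, as suggested above.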
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Thanks again, Louis. I had not thought of checking the memory at all. I tend to think of memory as reliable and yet also realize it might not be. I'm going to move the entire system to a different case and try your suggestions. At the same time I'm going to build an Opteron-based system with a Supermicro motherboard because my friend really needs storage. This will take a few days to do. I will report back. I want to pursue this and nail down the real problem with the MSI-based system.

Bob
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
If you have ECC memory, you shouldn't be having issues, but most consumer hardware doesn't run ECC, as it's more expensive and most customers wouldn't know what ECC warnings are anyway.

GL
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
You are correct, the MSI system is not running ECC memory. I remember now that I put some memory in the motherboard a long time ago and then after a while the system would not boot up. When I removed one stick of the memory the system would boot. Then I noticed I had made a mistake: I had not installed the memory in the correct memory slots for 2 DIMMs to begin with. I tried plugging in the second DIMM in the correct memory slot, and the system would not boot again. So I explained the issue to crucial.com and they replaced the memory.

I have since added 8 GB of non-ECC memory to that motherboard in order to boost it from 8 to 16 GB.

Could I have damaged my motherboard by not installing the memory in the correct slots per the motherboard manual's instructions?

The new Opteron will have ECC memory.

Bob
 