Disk problems possibly?

Status
Not open for further replies.

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
This is my first FreeNAS build, on the 9.1 release. I have nothing production on it yet; I'm just testing a few things to make sure it's going to run smoothly, so all of the data is backed up.

The hardware is an older Supermicro server with an X6DH8-G2 board (E7520 chipset) and a 3 GHz processor. It currently has 4 GB of RAM, with another 12 GB on order; for testing this should be fine, since I'm not putting a load on it yet. It has an Areca 1120 on the latest 1.49 BIOS. The six 1 TB drives are in JBOD mode and set up RAID 10 style: three sets of mirrors, all extending the same volume. I plan to use this for lab ESXi iSCSI storage as well as file storage for the house.

The first day, when I connected the drives and installed FreeNAS, I had no problems: I set up the volume and the shares (AFP, NFS, CIFS and iSCSI) and got everything tested. Overnight it went offline; the console was locked up and had messages about interrupts on the screen (sorry, I should have taken a photo, but didn't think of it at the time).

I brought it back online and everything seemed fine. This server was previously an Ubuntu Linux server, using this same hardware. I added two known good drives (from another system) and a thumb drive for booting. The thumb drive was an older one. I didn't really think it was the issue, but I replaced it with a new SanDisk by saving the config off, swapping out the thumb drives, booting and reloading the config. The volume was still there and the shares were still good. I ran out of time working on it that evening, and the next day it was the same problem: it had crashed overnight and was hard locked.

I then set up email notifications and syslog to my Mac Pro, hoping that if it crashed again overnight I'd see something. This morning it wasn't locked up, but the email notification from the daily run output said "One or more devices has experienced an unrecoverable error." Syslog on my Mac stopped working for some unknown reason and I can't get it to start back up, so I've now pointed syslog at a BSD box and will monitor that.

From smartctl, all 6 disks look fine, with no uncorrected errors on any of them, and areca-cli shows all disks in an OK state. As for the individual attribute numbers and raw data, I'm not sure what I'm looking at or what I should be looking for. I did a zpool clear and the alert on the FreeNAS page has gone away.
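
For anyone else unsure which SMART numbers matter, here is a minimal sketch that flags the raw values usually worth checking first. It assumes the standard attribute table that `smartctl -A` prints. The watched attribute names are the common ones, but vendors differ, and the sample rows are made up for illustration rather than taken from these disks.

```python
# Attributes whose nonzero raw value most often points at failing media
# (reallocated/pending/uncorrectable sectors) or a bad cable or link (CRC errors).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def nonzero_watched_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


# Hypothetical sample rows in the `smartctl -A` layout, NOT output from this server.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""

if __name__ == "__main__":
    print(nonzero_watched_attributes(SAMPLE))
```

An empty result does not prove a disk is healthy; it only means none of these particular counters have moved.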

Is there a way to get more info about the message from the daily run output? Re-running 404.status-zfs from the daily periodic scripts reports everything is fine. The syslog server hasn't received any messages from FreeNAS since I set up the email notifications, and I had an error in my SMTP settings, which I have since fixed of course.
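
One way to catch more detail than the daily email gives might be to poll the pool more often and save the full report whenever something is wrong. This is a rough sketch, assuming `zpool status -x` prints `all pools are healthy` when nothing is wrong. The function names and the injectable `run` and `now` parameters are illustrative choices, not anything FreeNAS ships.

```python
import subprocess
import time


def run_command(args):
    """Run a command and return its standard output as text."""
    return subprocess.check_output(args).decode()


def capture_if_unhealthy(run=run_command, now=time.strftime):
    """Return a timestamped `zpool status -v` report, or None when all pools are healthy."""
    # `zpool status -x` only reports pools that have errors or are otherwise unavailable.
    summary = run(["zpool", "status", "-x"]).strip()
    if summary == "all pools are healthy":
        return None
    # Something is wrong: grab the verbose status, including per-device counters.
    detail = run(["zpool", "status", "-v"])
    return "%s\n%s" % (now("%Y-%m-%d %H:%M:%S"), detail)
```

Called from cron every few minutes, with any non-None result appended to a file, this would record the per-device counters before a `zpool clear` or a lockup wipes them out.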

Sorry for the long-winded post, but I wanted to give all of the relevant info and back story. Any thoughts would be appreciated.

@Superjock, I'm probably going to try out your email script at some point on the Areca cards. Your posts about the Areca helped me set up some of this, and it's appreciated.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
What's the output of "zpool status" ?

Aha, nice command, I hadn't run across that yet. It looks good.

Code:
zpool status
  pool: vol1
 state: ONLINE
  scan: none requested
config:
 
    NAME                                            STATE    READ WRITE CKSUM
    vol1                                            ONLINE      0    0    0
      mirror-0                                      ONLINE      0    0    0
        gptid/cb322b73-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    0
        gptid/cbb72147-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    0
      mirror-1                                      ONLINE      0    0    0
        gptid/e7c8e5a9-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    0
        gptid/e87d5005-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    0
      mirror-2                                      ONLINE      0    0    0
        gptid/18d0eaea-03a4-11e3-b77b-0030482d7f02  ONLINE      0    0    0
        gptid/198b9972-03a4-11e3-b77b-0030482d7f02  ONLINE      0    0    0
 
errors: No known data errors
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
No problems with it today, but before heading off to bed I checked the status and found this.

Code:
[root@freenas /]# zpool status
  pool: vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:
 
    NAME                                            STATE    READ WRITE CKSUM
    vol1                                            ONLINE      0    0    0
      mirror-0                                      ONLINE      0    0    2
        gptid/cb322b73-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    2
        gptid/cbb72147-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    3
      mirror-1                                      ONLINE      0    0    1
        gptid/e7c8e5a9-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    1
        gptid/e87d5005-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    1
      mirror-2                                      ONLINE      0    0    1
        gptid/18d0eaea-03a4-11e3-b77b-0030482d7f02  ONLINE      0    0    1
        gptid/198b9972-03a4-11e3-b77b-0030482d7f02  ONLINE      0    0    1
 
errors: No known data errors
[root@freenas /]# 


So I ran smartctl and areca-cli, but everything looks fine. Here are the smartctl and areca-cli outputs: https://gist.github.com/anonymous/6238759
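
To keep an eye on which rows are picking up errors, here is a minimal sketch that pulls the nonzero rows out of `zpool status` text. It assumes the config table layout shown above (NAME, STATE, READ, WRITE, CKSUM); the function name is made up, and the sample is a shortened copy of that output.

```python
def devices_with_errors(status_text):
    """Return [(name, read, write, cksum)] for config rows with any nonzero counter."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # the device rows follow this header
            continue
        if not in_config:
            continue
        if line.startswith("errors:"):
            break  # end of the config table
        if len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        # Counters are kept as strings so abbreviated values such as "1.2K" still count.
        if (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, read, write, cksum))
    return rows


# Shortened copy of the status output posted above.
SAMPLE = """\
    NAME                                            STATE    READ WRITE CKSUM
    vol1                                            ONLINE      0    0    0
      mirror-0                                      ONLINE      0    0    2
        gptid/cb322b73-03a3-11e3-b77b-0030482d7f02  ONLINE      0    0    2

errors: No known data errors
"""

if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(row)
```

Against the full output above, this would list every mirror and every disk, since each has a nonzero CKSUM count.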
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd do a RAM check....
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Well, I think I found the issue: it seems to be the motherboard. It was locked up this morning, with no error on screen. I rebooted to load up an Ubuntu install for the mem test and got an IRQ conflict at POST, but in slot 8. This mobo doesn't have 8 slots, so it must be referring to something onboard. I pulled out my Areca (the only card I have in it) and rebooted a few times, and the error popped up at least once more, so it's not the RAID card. The BIOS is no help. It's the latest this board offers, but it won't let me turn off things like the onboard NICs, and it has no mention of IRQ settings. Resetting the BIOS settings did not help.

So it looks like I'll be buying a new mobo/CPU/RAM setup. Ugh, and this server was just fine before FreeNAS.

Thanks for the suggestions all.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd try to do a RAM test by putting your RAM in another system. Bad RAM can appear to do random things because much of the hardware setup is stored in RAM, and if that data is being corrupted it'll make you think something else is broken. This is why RAM problems are so elusive. You often go in a different direction because you see the consequences of the data in RAM being corrupted, not the actual cause.
 

survive

Behold the Wumpus
Moderator
Joined
May 28, 2011
Messages
875
No reason not to run memtest overnight.....

Another thing you could try....grab one of these: http://mfsbsd.vx.sk/

Boot it & see if you get the same messages. No reason not to do a quick install & see if it runs overnight.

-Will
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
It won't even boot consistently now. Some of the RAM I had ordered for this system came in, with the same part number as the RAM I already had, and the board won't POST with it, either alongside my existing RAM or by itself.

I broke down and ordered a new Supermicro motherboard, plus a CPU and RAM to match.

Supermicro X9SRI-F-O, Xeon E5-2603 and 16 GB of Crucial ECC RAM. The board expands to 256 GB.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Woot! Congrats on your new purchase!
 