disk/da write errors from unknown source

Status
Not open for further replies.

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
I'm no expert on FreeNAS by any standard, and I'm stuck trying to figure this out:

I have six 2 TB drives running in RAIDZ from an M1015, set up as per the many recommendations.
One drive/da is constantly throwing up issues.
The volume reads fine, but when I try to write to it, it throws up a lost connection and the volume becomes degraded. If I replace the drive with a new RED Pro drive, the resilver will complete.
I've replaced the drive twice now - it's not the HDD.

I've rebuilt the volume from scratch - the same issue comes back, though the da number may change.

I've tried using different SAS connectors on M1015

Now - this volume just has some test info on it so I'm not concerned about the data.
I'm testing it because several months ago I had the same issue using 8 drives in Z2 configuration - having no time then I gave up on that for the time being.

To note, I have two other volumes on this server, both RAIDZ: one with 5 drives, one with 6. These are working flawlessly.

What I'm really looking for is some info on WHERE to look for the error - would this be a problem with FreeNAS (least likely but I can't discount it with my limited knowledge) or with my motherboard or do I have a bad M1015?

Have others seen this issue?

Some tips on where I could find info in FreeNAS on where the actual issue may be?

If I can provide added info - let me know where I can retrieve it and I will.

Much obliged!
 

Fuganater

Patron
Joined
Sep 28, 2015
Messages
477
Please post your full system specs and chassis.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
would this be a problem with FreeNAS (least likely but I can't discount it with my limited knowledge) or with my motherboard or do I have a bad M1015
Could be any of the above, or a cable or power problem.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Ah sorry - thought I had it in my signature, but it seems to refuse to load it.

Case is a Thermaltake Eureka
Power is supplied by a Seasonic M12 II EVO 850
Supermicro X9SCM-F
Pentium G2020 @ 2.9
16 GB of RAM
2 ZFS volumes:
a) 5 WD30EFRX RAIDZ direct on MB
b) 6 WD30EFRX RAIDZ on USB 3.0 card
FreeNAS 9.3 Stable train (latest update) on Samsung SSD
I am not running any add-ons/jails

Now a few things to point out about the above before I get shot down ;)
-USB is ONLY there because of the issues I had already using the M1015 add-on card. These issues were identical prior to having the USB card - the card was added only to add storage until I had time to figure things out.
-Yes, I'm aware RAIDZ(1) is 'dead'. Once again - the only reason for the above setup is to HAVE something available until I figure this out. I split the ZFS setup as I didn't want to mix the usb into the direct units.
-Yes, I know I'm 'low' on RAM and am intending to go with more - however I'm holding off until I can figure out if it's a motherboard issue or not as I'll likely replace to a different board if that's the case.
- In general and in addition to the above points - understand that this is NOT my first nor main build on FreeNAS - this server is intended as a media server/NAS - all files on here are a duplicate from my actual storage server (a 24 disk build on Z3) . This unit exists solely to alleviate the daily use access (in other words reads) from that main server. So no - I do not 'care' about data loss on this unit (bit of a pain to redo is all)

Now for the units I'm trying to use:
IBM ServeRAID M1015 crossflashed to IT mode (p20 now, 16 initially in mentioned attempts earlier in the year)
6 WD20EFRX RAIDZ

A few more notes
-in the original setup some months ago I had 12 WD30EFRX RAIDZ - 4 on motherboard and 8 on the card in Z2 - no USB drives (so a similar setup as the 'temp' I have now) with the same issue popping up
-I have tried more than 1 PCIe slot (including the one running the USB3 card without issues now)
-I have replaced the SAS cables
-I have swapped the internal power cables.
-I have tried a separate FreeNAS build (on USB) with JUST the M1015 and a different set of HDDs (HDDs only connected to the M1015) with the same issues.
-I have reflashed the M1015 more than once
-Made sure motherboard BIOS is up to date.

Like I said - I'm at a bit of a loss on WHERE the issue may be coming from.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Could be any of the above, or a cable or power problem.

Cable and power, while not impossible, are unlikely - see the above post for details, but the cables have been replaced (internal power cables included), and even with only the 6 drives connected via the M1015 (far fewer than it normally runs) it's giving me the same problems.
 

Fuganater

Patron
Joined
Sep 28, 2015
Messages
477
You swapped the cable, did you change ports on the M1015? It is way down in the weeds but the port has 4 channels and one of them could be going wonky.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
You swapped the cable, did you change ports on the M1015? It is way down in the weeds but the port has 4 channels and one of them could be going wonky.

As I'm using 6 drives I can't NOT use one set, but I might be able to try switching the cables at the ports - that should move the issue over, I suppose - though only one drive on the one port doing it seems odd. Will give that a shot.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
^ Nope - the problem persists at da5. I destroy the pool, create it again, and a different disk now has the issue (though one on the other 'new' port on the M1015).
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Hm, it seems to have gotten worse actually, with the affected da (again - another disk now, though that grabbed da5 again) throwing up errors upon reboot.
I'm starting to lean towards a bad M1015.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It really sounds like one of the channels on the M1015 is possibly bad. Unfortunately you're going to have to diagnose it down by tracing the cables and such.

It's been almost 2 weeks since you opened the thread, did you ever figure this out?
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Sorry, been busy - both trying to trace it and otherwise. I think I got it narrowed down at least somewhat, though another issue popped up. I'll give a more detailed report later on - probably tonight if I find a moment.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
I ended up taking the time and trouble of moving the M1015 to another motherboard and testing it there. While initially it seemed to work fine there, I found out that it starts giving me trouble as soon as the files get larger than approx. 3 GB.
Returning the card to the server, I was able to replicate that issue, and found that there, anything over approx. 300 MB started giving trouble. In both scenarios, one of the drives would fail with write errors and fall offline.

Several attempts and tests along these lines determined that I could reproduce this issue on either port on the M1015, though when using both ports, still only one drive would end up failing (though who knows what longer-term testing, i.e. weeks, would come up with). Interestingly - other than the drive going offline and the volume going into a degraded state - it keeps working fine.

The drive, by the way, can be resilvered after the transfers are complete, but subsequent reads from the volume would then cause the same end result. Short reads are fine; trying to copy larger files off of it results in a similar course of events (though obviously with read errors, not write errors).
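
For anyone wanting to probe a size threshold like the one described above, here is a rough sketch of the idea: write a file of a given size, force it to disk, read it back, and compare checksums. The function name, sizes, and target directory are hypothetical (the demo writes to a temporary directory, not a pool), and on ZFS the read-back may be served from cache, so treat this mainly as a way to exercise writes of increasing size.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # write and read in 1 MiB blocks


def write_and_verify(path, size_bytes):
    """Write size_bytes of random data to path, then read it back.

    Returns True if the SHA-256 of what was read matches what was written.
    An I/O failure surfaces as an OSError from write(), fsync(), or read().
    """
    written = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    # Hypothetical demo sizes; a real test would go well past the
    # ~300 MB and ~3 GB marks mentioned above and target the affected volume.
    with tempfile.TemporaryDirectory() as target_dir:
        for mib in (1, 4, 16):
            path = os.path.join(target_dir, f"test_{mib}.bin")
            ok = write_and_verify(path, mib * CHUNK)
            print(f"{mib} MiB: {'ok' if ok else 'MISMATCH'}")
```

Stepping the size up gradually while watching the console for errors should show roughly where a drive starts dropping out, assuming the failure is as repeatable as it was here.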

It all seems to lead to the conclusion that it's 'just' the M1015 - although it's odd that there's a difference in the 'size threshold' on two different setups - especially because the one I set up to test (and which failed at larger file sizes) actually has a lower-end CPU and less RAM.

Regardless, me having poor luck and having a bad M1015 remains the logical conclusion.

While I still don't doubt that - I tried to set up ANOTHER card (a basic non-RAID expansion card) with just 4 SATA ports to add a few drives for temporary storage (as I'm clearing and rebuilding a desktop computer) and I came across another issue.
Those drives will fail almost instantly once data is written to them with CAM errors - on all drives - to the point that FreeNAS completely locks up.
Now - pretty generic card, so I figure fine - I grab another card, different chip - same thing. Next I get another USB card - IDENTICAL to the one that is running fine still on the FreeNAS build and use identical cases - and same thing.

What THAT part is supposed to mean I'm not sure - but I suspect that I'm reaching the limit of either the motherboard's or the RAM's ability - though I find it odd that it keeps being the new volume that fails while the 'older' volumes continue to work flawlessly.

Long story short, I guess I'm going to look for a new (or additional) hardware setup. There are just too many unknowns for me to try and 'upgrade' what I have now (i.e. RAM) and risk the whole setup failing after all.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I wouldn't give up on that hardware. I'm using the X9SCM-F with an E3-1230v2 and 32GB of RAM and I've got more than a dozen jails, 50+TB of disk space, etc. If it weren't for the fact that I am looking to upgrade to 96GB of RAM for various reasons that aren't related to this conversation, I'd probably still be using this system in a year or two.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
I wouldn't give up on that hardware. I'm using the X9SCM-F with an E3-1230v2 and 32GB of RAM and I've got more than a dozen jails, 50+TB of disk space, etc. If it weren't for the fact that I am looking to upgrade to 96GB of RAM for various reasons that aren't related to this conversation, I'd probably still be using this system in a year or two.

My worry is simply that there is more going on than 'just' a bad M1015 - unless you suspect that the CAM errors are likely due to lack of RAM
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The CAM errors probably aren't because of a lack of RAM; CAM errors typically point to a storage subsystem problem. Since you are already having problems with the M1015, it seems very plausible that as soon as you replace the M1015 everything will work great.
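
To see whether the CAM errors cluster on one device or one controller, a rough sketch like the one below could tally them from a saved copy of the console log. The sample lines are illustrative only (they are not from this thread), and the regular expression assumes the usual `(daN:controllerN:bus:target:lun):` prefix on CAM messages; adjust it if your log looks different.

```python
import re
from collections import Counter

# Assumption: CAM messages start with a prefix such as "(da5:mps0:0:5:0):".
CAM_LINE = re.compile(
    r"\((?P<dev>a?da\d+):(?P<ctrl>[a-z]+\d+):\d+:\d+:\d+\): "
    r"CAM status: (?P<status>.+)"
)


def tally_cam_errors(log_text):
    """Count 'CAM status' lines per (device, controller) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_LINE.search(line)
        if match:
            counts[(match.group("dev"), match.group("ctrl"))] += 1
    return counts


# Illustrative sample only, not real output from the system discussed here.
SAMPLE_LOG = """\
(da5:mps0:0:5:0): WRITE(10). CDB: 2a 00 00 40 00 80 00 01 00 00
(da5:mps0:0:5:0): CAM status: SCSI Status Error
(da5:mps0:0:5:0): Retrying command
(da2:mps0:0:2:0): CAM status: Command timeout
(da5:mps0:0:5:0): CAM status: SCSI Status Error
"""

if __name__ == "__main__":
    for (dev, ctrl), n in tally_cam_errors(SAMPLE_LOG).most_common():
        print(f"{dev} on {ctrl}: {n} CAM status line(s)")
```

If the errors follow one controller regardless of which disk is attached, that points at the card or its slot; if they follow devices on several controllers, something shared (power, board, boot device) looks more likely.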
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Mhh. Now that the holiday bills are done - I'll consider it. In the end I'll likely end up adding one or two in a new build anyway so I suppose there's no real risk to trying.
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
So, revisiting after quite some time. I built a quick alternative for now to bypass the issue, but now it's expanded. The M1015 isn't in, yet all of a sudden I'm getting CAM errors on one of the already existing RAIDs - escalating to a degree that the unit will not even boot up. So it ends up NOT being isolated to the M1015 (good thing I've just been too durn busy I guess).
I was just WONDERING though - having searched these forums some more - could it possibly have something to do with the SSD I'm running FreeNAS on? Yes I've tested this with the M1015, but still....
While it's hard to 'recommend' a specific USB flash drive due to them changing so often - is there perhaps a (current) drive someone has used recently that seems to work very well, so I can test this possibility?
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Hmm. At a bit of a loss, though it's not really an 'issue' anymore at this point for me. I built another system using the M1015 - on an old TYAN tempest i5400pw s5397 - and it's been working for over a month with no problems.

Meanwhile the Supermicro board got a brand new install (also a new USB stick), as if it were a new setup. I imported the volumes, and it's been working flawlessly for the same amount of time (well OK, 1 day shorter).
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
When you were having trouble, were your system dataset and reporting database on your boot device?
 

Cyknight

Dabbler
Joined
May 2, 2015
Messages
19
Yes, I believe so - are you thinking it ran out of room? I checked the original USB (a 32 GB unit) and it had plenty of space left.
I should still have the drive - untouched - around here somewhere
 