File corruption issue with FreeNAS lagg (lacp)

Status
Not open for further replies.

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi all,
Before describing the problem, here is my environment.

FreeNAS box:

16GB RAM
3 x 4TB WD Red HDDs in RAIDZ (8TB usable)
4 Intel NICs, 1 Gbit/s each
Intel motherboard

Switch: Cisco SG300-52

The configuration and the problem are as follows.

I have configured LAG (LACP) on the FreeNAS box as well as on the switch. All four ports on the switch and on the FreeNAS box are in the LAG to get better performance.

I have configured a Samba share on the NAS box, where users pick up image files and work on them live. They open a file in Photoshop and edit it directly on the share. Opening files works perfectly, but when they save after editing, the saved image is corrupted, and it happens to random users.

E.g., if 10 people are working simultaneously on 10 different files, 8 of them are able to save properly and 2 of them end up with a corrupted file, and this happens randomly.

If I destroy the lagg and let them work over a single NIC on the NAS box, they don't face any issue.

Please help me with this and tell me where I went wrong in the config.

Regards,
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So we need full specs: what motherboard, what CPU, what RAM, etc.

You can look at the output of "netstat -i" and see if there are lots of network packet issues.

There are 4 possibilities I can think of for the issues:

1. Network traffic is being corrupted (netstat -i should hint at that)
2. Your storage subsystem is having issues (I'd expect problems seen with zpool status if that was the case)
3. You have some kind of issue like bad RAM and no ECC support and so you are dealing with silent corruption of your zpool.
4. Your desktop(s) have bad RAM. Believe it or not we've actually had people in this forum complain about corrupt files on their FreeNAS that was due to bad RAM in their desktop and the desktop was saving corrupted data to the FreeNAS, which dutifully stored the corrupted data with checksums, etc.
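
For anyone who would rather not read the "netstat -i" table by eye, here is a minimal Python sketch of the check in point 1. It assumes the FreeBSD-style column layout with "Ierrs" and "Oerrs" in the header row; the sample text, interface names, and counter values below are made up for illustration and are not output from this system.

```python
def interfaces_with_errors(netstat_output):
    """Return {interface: (ierrs, oerrs)} for rows with non-zero error counters."""
    lines = [ln for ln in netstat_output.splitlines() if ln.strip()]
    header = lines[0].split()
    # Count columns from the right, because the Address column can be empty
    # on some rows, which shifts everything when counting from the left.
    ierrs_idx = header.index("Ierrs") - len(header)
    oerrs_idx = header.index("Oerrs") - len(header)
    flagged = {}
    for line in lines[1:]:
        fields = line.split()
        try:
            ierrs = int(fields[ierrs_idx])
            oerrs = int(fields[oerrs_idx])
        except (ValueError, IndexError):
            continue  # rows that show "-" or are too short carry no counters
        if ierrs or oerrs:
            flagged[fields[0]] = (ierrs, oerrs)
    return flagged


# Hypothetical sample, NOT taken from the system discussed in this thread.
SAMPLE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   1500 <Link#1>      00:11:22:33:44:55   120000     0     0    90000     0     0
igb1   1500 <Link#2>      00:11:22:33:44:55   118000   342     0    91000     7     0
lo0   16384 <Link#3>                             500     0     0      500     0     0
"""

print(interfaces_with_errors(SAMPLE))
```

Non-zero counters only hint at trouble on a link; a clean table does not rule out corruption that happens in RAM before or after the NIC.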
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi cyberjock,


Thanks for your reply.

I have gone through your FreeNAS PPT and it was great; it cleared up many of my doubts.

And I am glad that you are assisting me with the issue.

I am using an Intel DH87MC motherboard and 16 GB of Kingston HyperX RAM; the CPU is a 4th-gen Core i3.

All NICs are Intel based.

Below are my answers to your points:

1. Network traffic is being corrupted (netstat -i should hint at that)
>> I will monitor that, but since we are using TCP, there is less chance of that.
2. Your storage subsystem is having issues (I'd expect problems seen with zpool status if that was the case)
>> Here is the output of my zpool status.
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 496K in 0h0m with 0 errors on Sun Jul 24 19:35:44 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        ie                                              ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/12dc990b-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0
            gptid/13a85fde-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0
            gptid/14687c8e-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0


And when I run zpool status -xv, the output is below:

        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 496K in 0h0m with 0 errors on Sun Jul 24 19:35:44 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        ie                                              ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/12dc990b-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0
            gptid/13a85fde-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0
            gptid/14687c8e-510c-11e6-a986-2025640cf16a  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        ie/File Share:<0xeec0>
        ie/File Share:<0xeec1>
        ie/File Share:<0xeec4>



3. You have some kind of issue like bad RAM and no ECC support and so you are dealing with silent corruption of your zpool.
>> Yes, you are right. The RAM doesn't support ECC; I will install ECC RAM if that turns out to be the cause.

4. Your desktop(s) have bad RAM. Believe it or not we've actually had people in this forum complain about corrupt files on their FreeNAS that was due to bad RAM in their desktop and the desktop was saving corrupted data to the FreeNAS, which dutifully stored the corrupted data with checksums, etc.

>> All my desktops have HyperX RAM and they are not facing any issues. I believe this cannot be the case. I won't deny that bad RAM can do this, but believe me, this is not the case.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Please post the output of smartctl -a for each disk. Use CODE tags to preserve formatting.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi, below are the outputs.

Code:
smartctl -a /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E2103811
LU WWN Device Id: 5 0014ee 20a5ea908
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 27 13:26:24 2016 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (50160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 502) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   202   179   021    Pre-fail  Always       -       6883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       180
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       15050
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       180
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       117
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       616
194 Temperature_Celsius     0x0022   119   094   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
smartctl -a /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E2086456
LU WWN Device Id: 5 0014ee 25fb3ce0b
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 27 13:27:47 2016 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (55440) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 554) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   176   176   021    Pre-fail  Always       -       8175
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       181
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16488
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       181
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       115
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       167
194 Temperature_Celsius     0x0022   120   099   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Code:
smartctl -a /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E1FSUPDE
LU WWN Device Id: 5 0014ee 20d39f535
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 27 13:28:17 2016 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52680) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 527) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   178   178   021    Pre-fail  Always       -       8066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1261
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1167
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
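
With three long reports like these, the handful of attributes that most often point at failing media or cabling can be picked out automatically. The following is a rough Python sketch, assuming the ten-column attribute table layout shown above. The choice of watched attributes is a common rule of thumb rather than an official list, the first sample reuses rows from the ada0 report, and the second sample is a hypothetical row made up to show a hit.

```python
# Attributes whose raw value is normally expected to stay at 0 (rule of thumb).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def nonzero_watched_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value (ten columns or more).
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            raw = fields[9]
            if raw.isdigit() and int(raw) != 0:
                found[fields[1]] = int(raw)
    return found


# Rows copied from the ada0 report above.
THREAD_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Hypothetical row, invented for illustration only.
HYPOTHETICAL_BAD = """\
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

print(nonzero_watched_attributes(THREAD_SAMPLE))
print(nonzero_watched_attributes(HYPOTHETICAL_BAD))
```

An empty result for all three disks would match what the reports above show: no reallocated, pending, or uncorrectable sectors and no SATA CRC errors.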
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Deepak Singh said: "Yes, you are right. The RAM doesn't support ECC; I will install ECC RAM if that turns out to be the cause."

I would say this is definitely a contributing factor. Strange that it only crops up when you are using LACP though. Thankfully the 4th gen i3s do have ECC support when used in a compatible motherboard like the Supermicro X10 series.

Deepak Singh said: "All my desktops have HyperX RAM and they are not facing any issues. I believe this cannot be the case. I won't deny that bad RAM can do this, but believe me, this is not the case."

Never underestimate the ability of Kingston to give you bad RAM. I would run memtest on the affected machines.

Do you have rolling snapshots or backups that can recover the corrupted images? You should probably drop to a single link to avoid damaging additional files.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
HoneyBadger said: "I would say this is definitely a contributing factor. Strange that it only crops up when you are using LACP though. Thankfully the 4th gen i3s do have ECC support when used in a compatible motherboard like the Supermicro X10 series."

Yes, I have already ordered it and am waiting for it to arrive. And yes, I removed the lagg and have been monitoring since yesterday; so far I have not seen any file corruption, which is also strange to me.

HoneyBadger said: "Never underestimate the ability of Kingston to give you bad RAM. I would run memtest on the affected machines.

Do you have rolling snapshots or backups that can recover the corrupted images? You should probably drop to a single link to avoid damaging additional files."

Well, I am not saying Kingston cannot deliver bad RAM. I was just sure because until now I was using Linux and sharing the files with Samba without any issues, but I will still run the RAM test on each machine. Luckily I had a backup of the images from which I was able to restore them, but I cannot guarantee that next time ;).

I raised this question in the forum because the behavior with lagg was strange and I want to know what the reason could be.

Let me know if anyone requires more input from me.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Deepak Singh said: "below are the outputs"
I don't see any obvious problems with your disks, however ...
Deepak Singh said: "No self-tests have been logged."
I suggest setting up a regular SMART short and extended testing schedule, if you haven't already, and making sure email notifications are working.
Deepak Singh said:
"One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 496K in 0h0m with 0 errors on Sun Jul 24 19:35:44 2016"
This might indicate an underlying hardware (or software) issue causing corruption, and problems when using LAGG might simply be a symptom.
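
To see whether the list of damaged objects keeps growing between checks, the "errors:" section of the zpool status -v output can be pulled out and compared over time. Here is a minimal Python sketch; it assumes the section layout seen in the output posted earlier in this thread, and the function name is only illustrative.

```python
def permanent_error_entries(zpool_status_output):
    """Return the entries listed under 'errors: Permanent errors ...' as a list of strings."""
    entries = []
    in_errors = False
    for line in zpool_status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            # Only collect when the pool reports permanent errors; a line such as
            # "errors: No known data errors" leaves the list empty.
            in_errors = "Permanent errors" in stripped
            continue
        if stripped.startswith("pool:"):
            in_errors = False  # a new pool section begins
            continue
        if in_errors and stripped:
            entries.append(stripped)
    return entries


# Errors section as posted earlier in this thread.
THREAD_SAMPLE = """\
errors: Permanent errors have been detected in the following files:

ie/File Share:<0xeec0>
ie/File Share:<0xeec1>
ie/File Share:<0xeec4>
"""

print(permanent_error_entries(THREAD_SAMPLE))
```

Saving the returned list after each scrub and diffing it against the previous one shows whether new damage is still being written. Entries shown as a hex number in angle brackets are ones where no file path was printed.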
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I'm betting you have bad RAM in your FreeNAS server. The reason I say that is that you may have bad RAM that is corrupting ZFS as well as the network interface buffers (which are in RAM).

So I'd start with a RAM test on your server and see what happens. It's also possible that your power supply sucks and it has some ripple in the DC power to the components that is causing problems.

Your desktops could have bad RAM. In the case of the guy that had corrupt files on his server he had no other outwardly indications from his desktop that anything was wrong and was totally off-guard when he tested his desktop and it failed. In fact, I only mentioned that he should test his desktop as a passing idea because it was possible, but I thought he'd have other reasons to suspect the desktop is a problem. Clearly I was wrong. ;) In your case I think it's unlikely to be your desktops, I'm just mentioning this because it is not a "zero probability" cause.

In any case, you have bad network traffic and a bad zpool. They are likely related (bad RAM) but they could be 2 separate issues caused by 2 separate problems. In any case, you aren't using a motherboard that has a chipset that supports ECC, so switching to ECC is out of the question unless you plan to buy a new motherboard too.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hmmm......

I have already asked the purchasing team to get a motherboard with ECC RAM support. I have two options: 1) go for a Supermicro board, or 2) go for the Intel S1200SPS with a Xeon E3-220v5.

I need your opinion on choosing between them. As far as the power supply is concerned, I will check and replace it. But I need your help to track down the bad network traffic, because I have no clue why it happens only with lagg.

I will start RAM testing on each desktop. Can you suggest any preferred tools?

I will keep you updated on the status and the output of the tests.

Thank you guys for your great support and time :) You are doing a really great job.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd go Supermicro any day.

As for the network traffic, I have no idea. That would require doing a remote session, etc. and isn't something you can just do via a few posts on the forums. If you have a network engineer you call for contract work or otherwise, you should consider getting them involved.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Deepak Singh said:
"Hi all,
Before describing the problem, here is my environment.

I have configured a Samba share on the NAS box, where users pick up image files and work on them live. They open a file in Photoshop and edit it directly on the share. Opening files works perfectly, but when they save after editing, the saved image is corrupted, and it happens to random users.

E.g., if 10 people are working simultaneously on 10 different files, 8 of them are able to save properly and 2 of them end up with a corrupted file, and this happens randomly."

Just to clarify, multiple users are not working on the same file at the same time, right? E.g., User A accesses file A, User B accesses file B, etc.?

If multiple users are using the same files at the same time, then there may be locking issues, allowing 2 users to write to the same file at the same time, which would cause corruption.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
c32767a said: "Just to clarify, multiple users are not working on the same file at the same time, right? E.g., User A accesses file A, User B accesses file B, etc.?

If multiple users are using the same files at the same time, then there may be locking issues, allowing 2 users to write to the same file at the same time, which would cause corruption."

No, no users are working on the same file at the same time.

@cyberjock

I have already consulted the network engineers. They said there is no problem with the network. I don't know how right they are.

However, I have ordered the Supermicro board. Since my last post I have been using the FreeNAS box without the lagg interface and it is working fine. There is no file corruption now.

Anyway, I will keep posting the results with the Supermicro board with lagg.

:)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't want to sound like I'm bashing your network engineers (and I have zero evidence they are/aren't capable of doing their job) but I've worked with network engineers that swore up and down all was well until I showed them iperf tests or other things that totally proved they were wrong. This isn't limited to network engineers either, virtually anyone can say all is fine and base it on 'all available information' and until information is presented that is contrary, all seems okay.

This is where you, hopefully, have the knowledge and experience to be able to identify something to a particular subsystem. For example, be able to prove that something is "not quite right" with the network and be able to prove it's not the FreeNAS. Once done, you can let your network engineers have a field day trying to figure out what network peculiarity you managed to uncover by running a few tests. :P
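
One way to gather that kind of evidence for file corruption specifically: iperf reports throughput and loss, but it does not tell you whether a file's contents survived the round trip. The sketch below is a minimal Python example, not a procedure anyone in this thread actually ran. It copies a file to a destination (for example a path on the mounted share) and compares SHA-256 digests. The function names and the example paths in the comment are hypothetical.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination and report whether both files hash identically."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


# Example call with hypothetical paths (a local image and a mounted share):
# ok = copy_and_verify("C:/work/test.psd", "Z:/scratch/test.psd")
```

Running this repeatedly with the lagg enabled and again over a single NIC would show whether mismatches follow the lagg. Keep in mind that a read-back right after the copy may be served from the client's cache, so re-hashing the copy later, or from a different machine, is the more telling check.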
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
The iperf test sounds good. Where did you run the test, on the FreeNAS box or on another machine? And what did you show your network engineers? I would like to perform the same test. Please let me know. To be very honest, I feel that there is something wrong with the network, but I don't have any clue.

Please let me know which network tests should be performed.


UPDATE ON DESKTOP RAM TEST
I have tested all desktops with the memtest86 tool and the results show no problems with the RAM.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi,

Just for info:

I have replaced the hardware with an Intel S1200SPS board, a Xeon E3 processor, and 16GB of Kingston ECC RAM.

It is working perfectly now. I will try LACP on Saturday and post an update. Until then, thanks for the support and patience :)

By the way, I have tried iperf, and it seems there is no loss or jitter on the network. I will test that again on Saturday and let you guys know.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi Guys,

This Saturday I tried to create a lagg with LACP, but the lagg interface does not respond to ping after it is created. In the GUI the lagg media status shows down, even though all the cables are connected to the NICs. I don't know what is wrong; the same setup was working with 9.3, and now I am using 9.10 and it is not working. Has anyone faced this issue in FreeNAS 9.10? In the terminal I can see that the IPs are assigned to the lagg interface, but it still does not ping and the media status shows down in the GUI. Any comments on this?
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Okay, I have now successfully configured the lagg in FreeNAS 9.10. The issue was with the switch: LACP was disabled in the switch's LAG config, and the ports were not showing as active in the switch (Cisco SG300) config. It showed 1 port active and 1 in standby even though both ports were up. When I enabled LACP and set both ports as active in the LAG, it started working and I finally got a ping reply.
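
The same condition can also be spotted from the FreeNAS side: when the switch is not negotiating LACP, the member ports of the lagg normally do not show the ACTIVE flag in the ifconfig output for the lagg interface. Here is a small Python sketch of that check. It assumes the FreeBSD "laggport: <name> flags=...<...>" line format; the sample text, interface names, and MAC address are hypothetical and were not captured from this system.

```python
import re

# Matches lines such as: laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
PORT_RE = re.compile(r"laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>")


def inactive_lagg_ports(ifconfig_output):
    """Return the names of lagg member ports that do not carry the ACTIVE flag."""
    inactive = []
    for match in PORT_RE.finditer(ifconfig_output):
        name = match.group(1)
        flags = match.group(2).split(",")
        if "ACTIVE" not in flags:
            inactive.append(name)
    return inactive


# Hypothetical sample, made up for illustration.
SAMPLE = """\
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tether 00:11:22:33:44:55
\tmedia: Ethernet autoselect
\tstatus: active
\tlaggproto lacp lagghash l2,l3,l4
\tlaggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
\tlaggport: igb1 flags=0<>
"""

print(inactive_lagg_ports(SAMPLE))
```

An empty list means every member port reports ACTIVE. Any name in the list is a port worth checking against the switch's LAG settings.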

I will post the final file corruption result tomorrow.
 

Deepak Singh

Dabbler
Joined
Jun 6, 2016
Messages
16
Hi everyone,

There is no more file corruption in FreeNAS now.

The main culprit seems to be the FreeNAS hardware, especially the RAM.

A short description of the earlier and the current hardware follows.

Old hardware

4th-gen Core i3 CPU
16GB Kingston HyperX RAM
Intel DH87MC motherboard
WD Red NAS drives


New hardware

Xeon E3 processor
16GB Kingston ECC RAM
Intel S1200SPS motherboard (if you are going with this board, note that various series are available; pick the one that suits your needs)
WD Red NAS drives

Conclusion: The hardware config must be validated before installing FreeNAS, and ECC RAM must be used. In my case the main problem was the RAM (although I ran the memtest86 test on it and the results said the RAM was good). Overall, ECC RAM is a must if you want to avoid future issues.

Thanks for your support and time.

Enjoy :)

Regards,

Deepak Singh
 