Corruption NFS

Rodrigo

Cadet
Joined
Mar 31, 2015
Messages
9
I have a big problem here.

I use a shared NFS folder from my FreeNAS, mounted on around 100 virtual machines. I run the Oracle database backups with RMAN, writing to this share, and I limit them to 20 concurrent backups to avoid I/O problems.

All backups finish successfully without errors, but afterwards, when I try to read these files for validation, I receive the error:

ORA-19599: block number 256190 is corrupt in backup piece /nfsshare/FILENAME

I moved to a new server, thinking it was a hardware issue, but I have the same problem.

When I use the same process with an NFS share on NetApp storage, I don't have this issue.

- I tried changing the server: no success
- I decreased the concurrency from 20 to 5: no success

I don't receive any errors on FreeNAS:

FreeNAS Version: FreeNAS-11.2-U5
Processor: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (16 cores)
Memory: 36GB DDR3-1333 REGISTERED ECC MEMORY MODULE 9965447-030.A00LF
Disk Controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03) - IT MODE/JBOD
Boot disks: 2x KINGSTON SA400S37120G
Data disks: 8x Seagate Archive HDD ST6000AS0002-1N917X

Zpool: Sync disabled / Compression off / Deduplication off

NFS mount options on the clients:
rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768

The corruption is random: the same server can fail one day and work another day.

# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:03:52 with 0 errors on Fri Nov 1 03:48:52 2019
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors

  pool: vol
 state: ONLINE
  scan: scrub repaired 0 in 1 days 00:18:21 with 0 errors on Mon Oct 21 00:18:24 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol                                             ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/31138b9b-b245-11e9-86e0-0025902d086c  ONLINE       0     0     0
            gptid/38e6d6be-b245-11e9-86e0-0025902d086c  ONLINE       0     0     0
            gptid/40c09b56-b245-11e9-86e0-0025902d086c  ONLINE       0     0     0
            gptid/4898448d-b245-11e9-86e0-0025902d086c  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/b7b8e9cc-ce71-11e9-acde-0025902d086c  ONLINE       0     0     0
            gptid/c53be1d6-ce71-11e9-acde-0025902d086c  ONLINE       0     0     0
            gptid/c97ae6c4-ce71-11e9-acde-0025902d086c  ONLINE       0     0     0
            gptid/cb85b363-ce71-11e9-acde-0025902d086c  ONLINE       0     0     0

errors: No known data errors

Can someone help me?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Ah, I see, you disabled sync. That is not good for a share hosting a db workload.

From the looks of the original post it appears the NFS export is only being used as a backup target, not to back the live/transactional DB:

I use a shared NFS folder from my freenas mounted on around 100 virtual machines and I execute the Oracle database backup using rman and writing in this share limiting 20 backups at the same time for avoid I/O problem.

sync=disabled should be OK for a purely backup target workload, and probably also the only way this system will see anything approaching acceptable NFS write performance with no SLOG and SMR disks.

@Rodrigo - Have you confirmed that your DB tables and archivelogs are clean? You also are using SMR disks which are known to have very bad performance under a rewrite workload. What does gstat -dp look like?

When was your last scrub, and do your SMART stats look good on the drives?
 

Rodrigo

Cadet
Joined
Mar 31, 2015
Messages
9
@HoneyBadger thank you for your help.

The SMR disks and their poor rewrite performance could be the root cause.

When I replaced the complete hardware and used clean disks, I ran the full backup for the first time and everything worked without errors, maybe because there were no rewrites?

I will collect gstat -dp output during the backup processes.
The scrubs and SMART don't show any errors.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Rewrite on SMR should only be hurting performance, not causing errors. Unless they took so long as to cause a timeout of the backup job itself or the NFS client - I noted you have timeout manually specified (although still at the default 60s or 600 tenths) in the mount options.
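To make the timeout remark concrete, here is a minimal Python sketch that parses the mount option string from the original post and converts timeo, which NFS expresses in tenths of a second, into seconds. The helper names parse_mount_options and timeo_seconds are made up for this illustration and are not part of any NFS tooling.

```python
def parse_mount_options(options: str) -> dict:
    """Split a comma-separated NFS option string into a dict.

    Flags without a value (for example 'hard') map to True;
    key=value options keep their value as a string.
    """
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.strip().partition("=")
        parsed[key] = value if sep else True
    return parsed


def timeo_seconds(options: str) -> float:
    """Return the timeo value in seconds (timeo is in tenths of a second)."""
    return int(parse_mount_options(options)["timeo"]) / 10


# Option string copied from the original post.
opts = "rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768"
print(timeo_seconds(opts))  # 60.0
```

So timeo=600 works out to 60 seconds, matching the "60s or 600 tenths" figure above.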

Before you disabled sync, were you noticing failures, or was this set in an attempt to improve performance? I would also wonder if you might be better off letting ZFS handle the compression (via LZ4) versus Oracle rman, since I imagine you have more spare cycles available on the FreeNAS machine vs. the DB.
 

Rodrigo

Cadet
Joined
Mar 31, 2015
Messages
9
I don't have any errors during the backup and I don't receive any timeouts; the backup completes successfully without errors or warnings.

The failure is detected during the Oracle RMAN validate backup process, which tries to read all backup files and simulates the restore.
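One way to narrow down where a piece goes bad, independent of RMAN, is to compare the file on the FreeNAS share block by block against a byte-identical copy of the same backup piece kept somewhere else. The sketch below is only an illustration under assumptions: the function names are hypothetical, the 8192-byte default block size is a guess that should be adjusted, and the indices it reports are plain file offsets divided by the block size, which may not line up with the block number Oracle prints in ORA-19599.

```python
from itertools import zip_longest


def read_blocks(path, block_size):
    """Yield successive fixed-size blocks from a file (last one may be short)."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield block


def differing_blocks(path_a, path_b, block_size=8192):
    """Return zero-based indices of blocks that differ between two files.

    If one file is shorter, its missing blocks compare as different.
    block_size=8192 is an assumption, not a value taken from the thread.
    """
    pairs = zip_longest(read_blocks(path_a, block_size),
                        read_blocks(path_b, block_size))
    return [index for index, (a, b) in enumerate(pairs) if a != b]
```

An empty list means the two copies are identical. A short list of isolated indices would point at localized damage, while a long run of differing blocks would suggest a truncated or misordered write. Keep in mind that a read issued right after a write may be served from the client cache rather than from the server.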

I disabled sync because I don't have a separate ZIL device, and this impacts write performance.

After the comment from @dlavigne I tried enabling the ZIL and I have the same problem; sync is enabled at this moment.

I can disable compression in Oracle RMAN and activate LZ4 on FreeNAS. I will do this and report the results.

Thanks!
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
What are your network adapters, and have you tried changing network switch or ports?
 

Rodrigo

Cadet
Joined
Mar 31, 2015
Messages
9
@rs225 I replaced the complete hardware, including the NIC and SFP+ adapters.
Now I'm using a Broadcom Limited NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10).
 