performance issues on RAID10 & dd testing

Status
Not open for further replies.

plissje

Dabbler
Joined
Mar 6, 2017
Messages
22
Hi guys,
Would appreciate your input on an issue I'm having.
We have a FreeNAS server that serves as a datastore host for our testing environment, which has two ESXi hosts and around 30 active VMs that are not that taxing.

Server spec:
FreeNAS 11 u6
ASRock 2750D4I
16 GB ECC RAM
4x 1TB assorted WD (7200 RPM) in RAID 10, 4x 3TB WD Red (5400 RPM) in RAID 10

We don't need production-level performance, just for it to work fine :) and most days it does. From time to time we get serious performance drops, so this time I've decided to investigate.
After powering off all the VMs and running a dd test, I get around 200MB/s for both read and write.

Now, to me this seems like a pretty low value for RAID 10 WD Reds, doesn't it?
Shouldn't I be getting reads of at least around 400MB/s?

Any insight on this would be greatly appreciated :)
 

plissje

Dabbler
Joined
Mar 6, 2017
Messages
22
Hey HoneyBadger,
Here's the zpool status:
Code:
root@GEN-FreeNAS:~ # zpool status
  pool: BxLab_RAID10
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:14:58 with 0 errors on Sat Sep 15 06:14:59 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		BxLab_RAID10									ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/044f0a75-64c3-11e7-9224-d05099c14d99  ONLINE	   0	 0	 0
			gptid/a02796ad-252e-11e7-978b-d05099c14d99  ONLINE	   0	 0	 0
		  mirror-1									  ONLINE	   0	 0	 0
			gptid/a0d1184f-252e-11e7-978b-d05099c14d99  ONLINE	   0	 0	 0
			gptid/a16d482c-252e-11e7-978b-d05099c14d99  ONLINE	   0	 0	 0

errors: No known data errors

  pool: BxLab_RAID10_2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 05:03:13 with 0 errors on Sat Sep  1 09:03:15 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		BxLab_RAID10_2								  ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/f253ff50-22e8-11e8-8bec-d05099c14d99  ONLINE	   0	 0	 0
			gptid/f3ae52d2-22e8-11e8-8bec-d05099c14d99  ONLINE	   0	 0	 0
		  mirror-1									  ONLINE	   0	 0	 0
			gptid/f52b06d7-22e8-11e8-8bec-d05099c14d99  ONLINE	   0	 0	 0
			gptid/f68ed5bc-22e8-11e8-8bec-d05099c14d99  ONLINE	   0	 0	 0

errors: No known data errors

  pool: General_Storage
 state: ONLINE
  scan: scrub repaired 0 in 0 days 01:09:19 with 0 errors on Sat Sep 15 05:09:20 2018
config:

		NAME										  STATE	 READ WRITE CKSUM
		General_Storage							   ONLINE	   0	 0	 0
		  gptid/bb99db03-252e-11e7-978b-d05099c14d99  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:48 with 0 errors on Sat Sep 22 03:45:48 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  mirror-0  ONLINE	   0	 0	 0
			da0p2   ONLINE	   0	 0	 0
			da1p2   ONLINE	   0	 0	 0

errors: No known data errors
root@GEN-FreeNAS:~ #



Currently using NFS, as the tests I did comparing iSCSI to NFS gave pretty much the same results. My problem at the moment is the 200/200 from the dd test I ran locally on the FreeNAS box.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I assume both the BxLab_RAID10_n pools are the ones being presented to VMware. For remote access, iSCSI and NFS shouldn't be giving you the same performance from ESXi, since the former defaults to asynchronous writes and the latter to synchronous. Since you have no SLOG device, NFS would be much slower unless you've forced sync=disabled on the dataset/zvol.
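A quick way to confirm what those datasets are actually doing (pool names taken from your zpool status; adjust if the VMs live in a child dataset):
Code:
zfs get sync,compression,recordsize BxLab_RAID10 BxLab_RAID10_2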

Whether 200/200 is good or not is entirely dependent on the benchmark you're running.
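For what it's worth, with lz4 compression enabled a dd stream of zeroes mostly measures the compressor rather than the disks, so a more telling sequential test is usually run against a scratch dataset with compression off. A rough sketch, with placeholder names and sizes:
Code:
# scratch dataset with compression disabled so zeroes aren't compressed away
zfs create -o compression=off BxLab_RAID10_2/ddtest
# sequential write, ~16GiB so the test isn't absorbed entirely by RAM
dd if=/dev/zero of=/mnt/BxLab_RAID10_2/ddtest/testfile bs=1m count=16384
# sequential read of the same file
dd if=/mnt/BxLab_RAID10_2/ddtest/testfile of=/dev/null bs=1m
# clean up afterwards
zfs destroy BxLab_RAID10_2/ddtest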

What is the dd command you are using to test? And please also show the result of the following 3 commands, in separate code blocks:

zpool list -v

zfs get all BxLab_RAID10

zfs get all BxLab_RAID10_2
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
200MB/s is within an order of magnitude of what you should be seeing...

Are you using NFS or iSCSI to mount the stores?

You probably want to consider a SLOG. With NFS to ESXi, I would expect you to be seeing 10-20MB/s without one.
 

plissje

Dabbler
Joined
Mar 6, 2017
Messages
22
I'm using NFS with sync disabled (yes, I'm aware of the risks).
Regarding your request, Honey, here are the extra details:
Code:
root@GEN-FreeNAS:~ # zpool list -v
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
BxLab_RAID10							1.81T   663G  1.17T		 -	47%	35%  1.00x  ONLINE  /mnt
  mirror								 928G   312G   616G		 -	47%	33%
	gptid/044f0a75-64c3-11e7-9224-d05099c14d99	  -	  -	  -		 -	  -	  -
	gptid/a02796ad-252e-11e7-978b-d05099c14d99	  -	  -	  -		 -	  -	  -
  mirror								 928G   351G   577G		 -	47%	37%
	gptid/a0d1184f-252e-11e7-978b-d05099c14d99	  -	  -	  -		 -	  -	  -
	gptid/a16d482c-252e-11e7-978b-d05099c14d99	  -	  -	  -		 -	  -	  -
BxLab_RAID10_2						  5.44T  1.65T  3.79T		 -	41%	30%  1.00x  ONLINE  /mnt
  mirror								2.72T   840G  1.90T		 -	41%	30%
	gptid/f253ff50-22e8-11e8-8bec-d05099c14d99	  -	  -	  -		 -	  -	  -
	gptid/f3ae52d2-22e8-11e8-8bec-d05099c14d99	  -	  -	  -		 -	  -	  -
  mirror								2.72T   845G  1.89T		 -	42%	30%
	gptid/f52b06d7-22e8-11e8-8bec-d05099c14d99	  -	  -	  -		 -	  -	  -
	gptid/f68ed5bc-22e8-11e8-8bec-d05099c14d99	  -	  -	  -		 -	  -	  -
General_Storage						 1.81T   521G  1.30T		 -	15%	28%  1.00x  ONLINE  /mnt
  gptid/bb99db03-252e-11e7-978b-d05099c14d99  1.81T   521G  1.30T		 -	15%	28%
freenas-boot							14.5G   854M  13.7G		 -	  -	 5%  1.00x  ONLINE  -
  mirror								14.5G   854M  13.7G		 -	  -	 5%
	da0p2								   -	  -	  -		 -	  -	  -
	da1p2								   -	  -	  -		 -	  -	  -
root@GEN-FreeNAS:~ #



Code:
root@GEN-FreeNAS:~ # zfs get all BxLab_RAID10
NAME		  PROPERTY			  VALUE				  SOURCE
BxLab_RAID10  type				  filesystem			 -
BxLab_RAID10  creation			  Wed Apr 19 21:33 2017  -
BxLab_RAID10  used				  663G				   -
BxLab_RAID10  available			 1.11T				  -
BxLab_RAID10  referenced			88K					-
BxLab_RAID10  compressratio		 1.63x				  -
BxLab_RAID10  mounted			   yes					-
BxLab_RAID10  quota				 none				   default
BxLab_RAID10  reservation		   none				   default
BxLab_RAID10  recordsize			128K				   default
BxLab_RAID10  mountpoint			/mnt/BxLab_RAID10	  default
BxLab_RAID10  sharenfs			  off					default
BxLab_RAID10  checksum			  on					 default
BxLab_RAID10  compression		   lz4					local
BxLab_RAID10  atime				 on					 default
BxLab_RAID10  devices			   on					 default
BxLab_RAID10  exec				  on					 default
BxLab_RAID10  setuid				on					 default
BxLab_RAID10  readonly			  off					default
BxLab_RAID10  jailed				off					default
BxLab_RAID10  snapdir			   hidden				 default
BxLab_RAID10  aclmode			   passthrough			local
BxLab_RAID10  aclinherit			passthrough			local
BxLab_RAID10  canmount			  on					 default
BxLab_RAID10  xattr				 off					temporary
BxLab_RAID10  copies				1					  default
BxLab_RAID10  version			   5					  -
BxLab_RAID10  utf8only			  off					-
BxLab_RAID10  normalization		 none				   -
BxLab_RAID10  casesensitivity	   sensitive			  -
BxLab_RAID10  vscan				 off					default
BxLab_RAID10  nbmand				off					default
BxLab_RAID10  sharesmb			  off					default
BxLab_RAID10  refquota			  none				   default
BxLab_RAID10  refreservation		none				   default
BxLab_RAID10  primarycache		  all					default
BxLab_RAID10  secondarycache		all					default
BxLab_RAID10  usedbysnapshots	   0					  -
BxLab_RAID10  usedbydataset		 88K					-
BxLab_RAID10  usedbychildren		663G				   -
BxLab_RAID10  usedbyrefreservation  0					  -
BxLab_RAID10  logbias			   latency				default
BxLab_RAID10  dedup				 off					default
BxLab_RAID10  mlslabel									 -
BxLab_RAID10  sync				  disabled			   local
BxLab_RAID10  refcompressratio	  1.00x				  -
BxLab_RAID10  written			   88K					-
BxLab_RAID10  logicalused		   1.06T				  -
BxLab_RAID10  logicalreferenced	 31K					-
BxLab_RAID10  volmode			   default				default
BxLab_RAID10  filesystem_limit	  none				   default
BxLab_RAID10  snapshot_limit		none				   default
BxLab_RAID10  filesystem_count	  none				   default
BxLab_RAID10  snapshot_count		none				   default
BxLab_RAID10  redundant_metadata	all					default
root@GEN-FreeNAS:~ #


Code:
root@GEN-FreeNAS:~ # zfs get all BxLab_RAID10_2
NAME			PROPERTY			  VALUE				  SOURCE
BxLab_RAID10_2  type				  filesystem			 -
BxLab_RAID10_2  creation			  Thu Mar  8 17:54 2018  -
BxLab_RAID10_2  used				  1.65T				  -
BxLab_RAID10_2  available			 3.61T				  -
BxLab_RAID10_2  referenced			88K					-
BxLab_RAID10_2  compressratio		 1.41x				  -
BxLab_RAID10_2  mounted			   yes					-
BxLab_RAID10_2  quota				 none				   default
BxLab_RAID10_2  reservation		   none				   default
BxLab_RAID10_2  recordsize			128K				   default
BxLab_RAID10_2  mountpoint			/mnt/BxLab_RAID10_2	default
BxLab_RAID10_2  sharenfs			  off					default
BxLab_RAID10_2  checksum			  on					 default
BxLab_RAID10_2  compression		   lz4					local
BxLab_RAID10_2  atime				 on					 default
BxLab_RAID10_2  devices			   on					 default
BxLab_RAID10_2  exec				  on					 default
BxLab_RAID10_2  setuid				on					 default
BxLab_RAID10_2  readonly			  off					default
BxLab_RAID10_2  jailed				off					default
BxLab_RAID10_2  snapdir			   hidden				 default
BxLab_RAID10_2  aclmode			   passthrough			local
BxLab_RAID10_2  aclinherit			passthrough			local
BxLab_RAID10_2  canmount			  on					 default
BxLab_RAID10_2  xattr				 off					temporary
BxLab_RAID10_2  copies				1					  default
BxLab_RAID10_2  version			   5					  -
BxLab_RAID10_2  utf8only			  off					-
BxLab_RAID10_2  normalization		 none				   -
BxLab_RAID10_2  casesensitivity	   sensitive			  -
BxLab_RAID10_2  vscan				 off					default
BxLab_RAID10_2  nbmand				off					default
BxLab_RAID10_2  sharesmb			  off					default
BxLab_RAID10_2  refquota			  none				   default
BxLab_RAID10_2  refreservation		none				   default
BxLab_RAID10_2  primarycache		  all					default
BxLab_RAID10_2  secondarycache		all					default
BxLab_RAID10_2  usedbysnapshots	   0					  -
BxLab_RAID10_2  usedbydataset		 88K					-
BxLab_RAID10_2  usedbychildren		1.65T				  -
BxLab_RAID10_2  usedbyrefreservation  0					  -
BxLab_RAID10_2  logbias			   latency				default
BxLab_RAID10_2  dedup				 off					default
BxLab_RAID10_2  mlslabel									 -
BxLab_RAID10_2  sync				  disabled			   local
BxLab_RAID10_2  refcompressratio	  1.00x				  -
BxLab_RAID10_2  written			   88K					-
BxLab_RAID10_2  logicalused		   2.34T				  -
BxLab_RAID10_2  logicalreferenced	 31K					-
BxLab_RAID10_2  volmode			   default				default
BxLab_RAID10_2  filesystem_limit	  none				   default
BxLab_RAID10_2  snapshot_limit		none				   default
BxLab_RAID10_2  filesystem_count	  none				   default
BxLab_RAID10_2  snapshot_count		none				   default
BxLab_RAID10_2  redundant_metadata	all					default
root@GEN-FreeNAS:~ #


I always thought 200 is what I'd see for writes, with much higher reads on RAID 10.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
It's not clear that bandwidth is the issue. With that workload I'd expect a significant portion of the I/O from ESXi to be random, so you might have an IOPS problem with just two vdevs per pool. If you don't need the capacity of the 3TB drives, you could go with a single pool of four mirrors and double the IOPS.

Factors like free space in the pool, ARC size (you don't have much for serving VMs), L2ARC (you don't have any), and an SLOG can play a large part in overall performance as well.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm using NFS with sync disabled (yes, I'm aware of the risks).

Make sure you are taking frequent (VMware-integrated) snapshots of the dataset so that you have crash-consistent backups.

sync=disabled could also be contributing to some of the inconsistent performance you're seeing.

Regarding your request, Honey, here are the extra details:

Thanks, Dear. ;)

Quick overall suggestion - you only have 16GB of RAM, and that's not enough to support 30 VMs. Hopefully the RAM is in a 2x8GB configuration and you can expand to 4x8GB (or 2x8GB + 2x16GB, if you've got the money), which should help.

Let's dig in to the performance a little.

Code:
root@GEN-FreeNAS:~ # zpool list -v
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
BxLab_RAID10							1.81T   663G  1.17T		 -	47%	35%  1.00x  ONLINE  /mnt
BxLab_RAID10_2						  5.44T  1.65T  3.79T		 -	41%	30%  1.00x  ONLINE  /mnt


Your vdevs are balanced (I wanted the verbosity for that) but the overall disk count is low, and fragmentation could be an issue. Let's get into where I suspect the problem is.

Code:
root@GEN-FreeNAS:~ # zfs get all BxLab_RAID10
NAME		  PROPERTY			  VALUE				  SOURCE
BxLab_RAID10  recordsize			128K				   default
BxLab_RAID10  atime				 on					 default


Code:
root@GEN-FreeNAS:~ # zfs get all BxLab_RAID10_2
NAME			PROPERTY			  VALUE				  SOURCE
BxLab_RAID10_2  recordsize			128K				   default
BxLab_RAID10_2  atime				 on					 default


Now we're getting somewhere. The recordsize=128K is way too large for a VMware workload. Thick-provisioned disks (or disks that are svMotioned into this datastore) will all be written as big sequential 128K chunks; then when you want to write a smaller chunk (say, an NTFS-formatted 4K block) you'll have to read the entire 128K, modify the 4K, and then fire the 128K back to disk again (in smaller pieces this time, hopefully) - this hurts throughput.

I'd advise creating a new dataset on the 4x3TB pool with recordsize=16K and trying to svMotion a VM into it. See if performance there becomes more consistent.

atime should also be OFF since you're storing VMDKs and not individual files being directly edited by end-users, but I don't think this will be a huge difference.
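Something along these lines would do it; the dataset name is just a placeholder, and on FreeNAS you'd normally create it from the GUI and then point a new NFS share/datastore at it:
Code:
# new dataset tuned for VM storage: smaller records, no atime updates
zfs create -o recordsize=16K -o atime=off BxLab_RAID10_2/vmware16k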

I always thought 200 is what I'd see for writes, with much higher reads on RAID 10.

For sequential writes into empty slabs that's possible, but free-space fragmentation is an issue - your disks will need to seek a bit more to find free space.

I assume you're connected using the two onboard 1Gbps interfaces, and not by anything faster? Consider that link aggregation won't give you 1+1=2Gbps; iSCSI MPIO will get you closer to that, assuming proper round-robin. But in your current scenario with NFS, you're able to take in about 100MB/s of writes from the network. There are obviously points where your disks can't keep up, and the ZFS write throttle has to kick in to slow things down. Second post to follow.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So I'm going to bring in a little dtrace script from another thread that I've found useful.

Create a file dirty.d using vi or nano and dump this code in there:
Code:
/* grab the pool's dsl_pool_t each time a transaction group starts syncing */
txg-syncing
{
		this->dp = (dsl_pool_t *)arg0;
}

/* only report on the pool whose name is passed as the script's first argument */
txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
		printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
			`zfs_dirty_data_max / 1024 / 1024);
}


Then from a shell run dtrace -s dirty.d YourPool and wait. You'll see a bunch of lines that look like this:

Code:
dtrace: script 'dirty.d' matched 2 probes
CPU	 ID					FUNCTION:NAME
  4  56342				 none:txg-syncing   62MB of 4096MB used
  4  56342				 none:txg-syncing   64MB of 4096MB used
  5  56342				 none:txg-syncing   64MB of 4096MB used


Your second number won't be 4096MB - it will probably be 1638MB, since zfs_dirty_data_max defaults to 10% of system RAM (capped at 4GB), and 10% of your 16GB is about 1638MB. Hammer your pool with some write load and watch the first number. That's how much outstanding data is waiting to be written. Your pool needs to be able to write quickly enough to drain that bucket - in your case, you're using async writes over 1Gbps, which means the bucket can fill at about 100MB/s. If your vdevs aren't able to sustain that write speed (based on slab fragmentation, spindle count/type, raw disk speed) the bucket will grow. At 60% full (~983MB) ZFS starts inserting a delay before acknowledging writes, and it ramps up higher and higher as you approach 100%. When you hit 100% full, ZFS stops accepting writes until it can free up space. (You never, ever want to hit that point.)
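If you want to confirm the size of that bucket on your box, it's exposed as a sysctl (assuming the stock FreeBSD name on FreeNAS 11):
Code:
# maximum dirty data in bytes - defaults to 10% of RAM, capped at 4GB
sysctl vfs.zfs.dirty_data_max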

Increasing the size of the bucket isn't the answer as that just kicks the problem down the road - ultimately you need faster devices. The smaller recordsize we discussed above is one way to assist this, since you'll have less of a read-modify-write penalty. More vdevs or faster vdevs help as well. More RAM also helps, because the more requests that are served from ARC, the fewer requests hit the disks - which means less contention.
 

plissje

Dabbler
Joined
Mar 6, 2017
Messages
22
Wow HoneyBadger, I didn't expect someone to put so much analysis into this. First and foremost, thanks! That's a lot of great info.
I'll try creating the new pool and migrating the VMs there. I was actually pretty certain I had atime off; interesting that it's not.

Regarding the vdevs, theoretically speaking, if I keep my current two pools and split the VMs evenly between them, that would be the same as having 4 vdevs in the same pool, right?
So technically, until I get a bigger budget to buy more disks, I can make do with this.
For networking, I'm already working on getting 10Gb cards. It seems like Chelsio/Intel would be too pricey, so I'll have to make do with Mellanox ones.
RAM is 2x8GB. I've been looking into upgrading that for a while now; the 10Gb and RAM were my top priorities for upgrades even before this post.

Last thing here: what about some Intel DC SSDs for ZIL/SLOG? Would I see any benefit?
What about moving from NFS to iSCSI? Would that do any good?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
atime shouldn't have a huge impact; a lower recordsize, though, should improve the performance of small random I/O (which tends to be the cornerstone of VM performance) at the cost of some streaming throughput (svMotion, copies, etc.).

Regarding the vdevs, theoretically speaking, if I keep my current two pools and split the VMs evenly between them, that would be the same as having 4 vdevs in the same pool, right?
So technically, until I get a bigger budget to buy more disks, I can make do with this.

Splitting the load isn't exactly the same, but might have a similar enough impact to let that work. You'll also eliminate the potential of a "noisy neighbor" situation where one rogue VM/process can go ballistic and chew up all the I/O capacity of a single larger pool.

Last thing here: what about some Intel DC SSDs for ZIL/SLOG? Would I see any benefit?

Only if you stop using sync=disabled - otherwise, they'll be bypassed entirely. Check my signature for the SLOG benchmark thread, but the short answer is that the king of NVMe SLOG right now is the Intel Optane series, followed closely by the Intel P-series NVMe drives; then if you have to go SATA, Intel DC series. (Sensing a pattern?)
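If you do go that route, the rough shape of it would be as follows; the device name is a placeholder, and on FreeNAS you'd normally attach the log device through the GUI rather than the shell:
Code:
# put sync writes back to their default behaviour on the dataset
zfs set sync=standard BxLab_RAID10_2
# attach the SSD as a dedicated log (SLOG) device; da6 is a placeholder
zpool add BxLab_RAID10_2 log da6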

What about moving from NFS to iSCSI? Would that do any good?

iSCSI with proper MPIO (multi-path I/O) would let you utilize both 1Gbps links simultaneously, for a 2Gbps (200MB/s) potential peak. Once you switch to a single 10Gbps link though, the difference between NFS and iSCSI should be largely academic.
 

plissje

Dabbler
Joined
Mar 6, 2017
Messages
22
Thanks for all the help, HoneyBadger. I took the notes I needed and will work on the suggestions.
 