Massive CPU usage on concurrent access (iSCSI)

Jason Keller · May 9, 2017

Sad as this is to say, compressed ARC is indeed my culprit. Upgrading to 11-RC on the box originally noted above, and ensuring that vfs.zfs.compressed_arc_enabled: 0 shows thus...

Code:

last pid:  4735;  load averages:  8.54,  3.62,  1.80									   up 0+00:11:32  14:48:05
1523 processes:40 running, 1422 sleeping, 61 waiting
CPU:  0.0% user,  0.0% nice, 37.5% system,  3.3% interrupt, 59.2% idle
Mem: 381M Active, 188M Inact, 21G Wired, 72G Free
ARC: 19G Total, 935M MFU, 18G MRU, 64K Anon, 261M Header, 8452K Other
Swap:

  PID USERNAME   PRI NICE   SIZE	RES STATE   C   TIME	WCPU COMMAND
	5 root	   -16	-	 0K   128K CPU22  22   1:01  64.59% ctl{work0}
	0 root		 8	-	 0K 20080K CPU22  22   0:40  44.33% kernel{cblk0 taskq_0}
	0 root		 8	-	 0K 20080K CPU9	9   0:40  44.31% kernel{cblk0 taskq_1}
	0 root		 8	-	 0K 20080K -	   4   0:40  44.08% kernel{cblk0 taskq_12}
	0 root		 8	-	 0K 20080K CPU5	5   0:40  44.08% kernel{cblk0 taskq_3}
	0 root		 8	-	 0K 20080K -	   8   0:40  44.06% kernel{cblk0 taskq_13}
	0 root		 8	-	 0K 20080K CPU19  19   0:40  44.06% kernel{cblk0 taskq_2}
	0 root		 8	-	 0K 20080K CPU14  14   0:40  44.01% kernel{cblk0 taskq_10}
	0 root		 8	-	 0K 20080K -	  21   0:40  43.98% kernel{cblk0 taskq_4}
	0 root		 8	-	 0K 20080K CPU16  16   0:40  43.96% kernel{cblk0 taskq_8}
	0 root		 8	-	 0K 20080K -	   4   0:40  43.94% kernel{cblk0 taskq_11}
	0 root		 8	-	 0K 20080K CPU20  20   0:40  43.93% kernel{cblk0 taskq_7}
	0 root		 8	-	 0K 20080K -	  16   0:40  43.86% kernel{cblk0 taskq_9}
	0 root		 8	-	 0K 20080K -	  13   0:40  43.84% kernel{cblk0 taskq_6}
	0 root		 8	-	 0K 20080K CPU12  12   0:40  43.60% kernel{cblk0 taskq_5}
   12 root	   -92	-	 0K   992K WAIT	8   0:38  33.30% intr{irq263: t5nex0:0a4}
	0 root	   -16	-	 0K 20080K CPU7	7   0:24  31.09% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20080K CPU6	6   0:29  29.44% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20080K CPU18  18   0:22  29.27% kernel{cfiscsitx}
	0 root	   -16	-	 0K 20080K icl_tx  6   0:27  26.63% kernel{cfiscsitx}
   12 root	   -92	-	 0K   992K WAIT   20   0:27  24.02% intr{irq273: t5nex0:1a4}
	0 root	   -16	-	 0K 20080K icl_rx  6   0:19  23.68% kernel{cfiscsirx}
	6 root		-8	-	 0K   192K dbuf_e 15   0:19  23.31% zfskern{dbuf_evict_thread}
   12 root	   -92	-	 0K   992K CPU0	0   0:20  22.23% intr{irq274: t5nex0:1a5}
	0 root	   -16	-	 0K 20080K icl_rx 17   0:21  21.93% kernel{cfiscsirx}
	0 root	   -16	-	 0K 20080K CPU13  13   0:17  20.71% kernel{cfiscsirx}
	0 root	   -16	-	 0K 20080K icl_rx 11   0:19  17.37% kernel{cfiscsirx}
	5 root	   -16	-	 0K   128K -	   8   0:12  15.75% ctl{work3}
	5 root	   -16	-	 0K   128K -	   2   0:11  14.82% ctl{work4}
	5 root	   -16	-	 0K   128K CPU4	4   0:13  14.54% ctl{work5}
4682 root		20	0 29212K 11512K kqread 21   0:01   1.01% pmcstat
4417 root		20	0 32464K 10356K CPU10  10   0:01   0.26% top
	7 root	   -16	-	 0K	16K -	   1   0:01   0.22% rand_harvestq
	0 root	   -16	-	 0K 20080K -	   3   0:09   0.07% kernel{zio_write_intr_3}
	6 root		-8	-	 0K   192K tx->tx  1   0:11   0.07% zfskern{txg_thread_enter}
	0 root	   -16	-	 0K 20080K -	  21   0:09   0.07% kernel{zio_write_intr_1}
	0 root	   -16	-	 0K 20080K -	  22   0:09   0.07% kernel{zio_write_intr_5}
	0 root	   -16	-	 0K 20080K -	  16   0:09   0.06% kernel{zio_write_intr_2}
4683 root	   -16	-	 0K	16K pmcloo 17   0:00   0.06% hwpmc: proc(4682)
	0 root	   -16	-	 0K 20080K -	  12   0:09   0.06% kernel{zio_write_intr_4}
	0 root	   -16	-	 0K 20080K -	  22   0:09   0.06% kernel{zio_write_intr_6}
	0 root	   -16	-	 0K 20080K -	  12   0:09   0.06% kernel{zio_write_intr_7}
	0 root	   -16	-	 0K 20080K -	   5   0:01   0.05% kernel{zio_null_issue}
	0 root	   -16	-	 0K 20080K -	  21   0:09   0.05% kernel{zio_write_intr_0}
	4 root	   -16	-	 0K	96K -	   8   0:01   0.04% cam{doneq3}
	4 root	   -16	-	 0K	96K -	  22   0:01   0.04% cam{doneq2}
   12 root	   -60	-	 0K   992K WAIT	0   0:02   0.04% intr{swi4: clock (0)}
3338 ladvd	   20	0 19624K  2820K kqread 17   0:00   0.04% ladvd
   12 root	   -88	-	 0K   992K WAIT   10   0:01   0.03% intr{irq279: mps1}
	4 root	   -16	-	 0K	96K -	  10   0:01   0.03% cam{doneq1}
	0 root	   -12	-	 0K 20080K -	  17   0:08   0.03% kernel{zio_write_issue_15}
	4 root	   -16	-	 0K	96K -	   8   0:01   0.02% cam{doneq0}
	0 root	   -12	-	 0K 20080K -	  13   0:08   0.02% kernel{zio_write_issue_10}
	0 root	   -12	-	 0K 20080K -	   4   0:08   0.02% kernel{zio_write_issue_2}
	0 root	   -12	-	 0K 20080K -	  19   0:08   0.02% kernel{zio_write_issue_16}
	0 root	   -12	-	 0K 20080K -	   1   0:08   0.02% kernel{zio_write_issue_12}
	0 root	   -12	-	 0K 20080K -	  20   0:08   0.02% kernel{zio_write_issue_6}
	6 root		-8	-	 0K   192K spa->s  9   0:00   0.02% zfskern{trim flash}
   12 root	   -88	-	 0K   992K WAIT   12   0:01   0.02% intr{irq256: mps0}
	0 root	   -12	-	 0K 20080K -	  19   0:08   0.02% kernel{zio_write_issue_13}
3337 root		20	0 21672K  2868K kqread  8   0:00   0.02% ladvd




PMC: [CPU_CLK_UNHALTED_CORE] Samples: 14806 (100.0%) , 0 unresolved

%SAMP IMAGE	  FUNCTION			 CALLERS
13.9 kernel	 lock_delay		   _sx_xlock_hard:4.9 __mtx_lock_sleep:4.3 __rw_wlock_hard:4.1 ...
  8.0 kernel	 lz4_decompress	   zio_decompress
  4.3 kernel	 cpu_search_highest   cpu_search_highest:3.7 sched_idletd:0.6
  3.2 kernel	 memcpy			   icl_soft_conn_pdu_append_data
  2.4 kernel	 uma_zalloc_arg
  2.1 kernel	 cpu_idle			 sched_idletd
  1.7 kernel	 bcopy				uiomove_faultflag
  1.7 kernel	 ctl_work_thread	  fork_exit
  1.6 kernel	 bzero
  1.4 kernel	 ctl_check_ooa		ctl_work_thread
  1.3 kernel	 uma_zfree_arg
  1.2 kernel	 tcp_output		   tcp_usr_send
  1.1 kernel	 sched_switch		 mi_switch
  1.0 kernel	 cpu_search_lowest	cpu_search_lowest
  1.0 kernel	 thread_lock_flags_
  0.9 kernel	 dbuf_destroy		 dmu_read_uio_dnode
  0.9 kernel	 dbuf_find			dbuf_hold_impl
  0.9 kernel	 dbuf_create		  dbuf_hold_impl
  0.9 kernel	 _sx_xlock_hard
  0.8 kernel	 cfiscsi_datamove	 ctl_datamove
  0.8 kernel	 avl_remove		   dbuf_destroy
  0.8 kernel	 cpu_switch		   mi_switch
  0.7 kernel	 sched_idletd		 fork_exit
  0.7 kernel	 ctl_be_block_worker  taskqueue_run_locked
  0.7 kernel	 avl_add			  dbuf_create
  0.7 kernel	 dbuf_compare		 avl_add
  0.7 kernel	 zrl_remove
  0.7 kernel	 critical_exit
  0.7 kernel	 __mtx_lock_sleep
  0.7 kernel	 zrl_add_impl
  0.7 kernel	 sleepq_lock
  0.7 kernel	 sched_pickcpu		sched_add
  0.6 kernel	 atomic_add_64_nv
  0.6 kernel	 dmu_zfetch		   dmu_buf_hold_array_by_dnode
  0.6 kernel	 sched_add
  0.6 kernel	 eth_tx			   drain_ring
  0.5 kernel	 zio_wait_for_childre
  0.5 kernel	 m_freem

I'm now up well over 200k 8k RIOPS (hovered around 212k) with only about 40% CPU usage via software iSCSI initiators in VMware. Latency looks much better as well (a full 200us lower), as can be seen in the screenshot. I'm getting much better throughput loading on the links as well.

Jason Keller · May 10, 2017

I will preface this with the understanding that this synthetic test doesn't mirror a very real world workload, but @mav@ should I file a bug report for this (increased latency and cpu usage on concurrent access from compressed ARC)? Or is it more of an enhancement request (I'd lean bug as it's a performance regression in some workloads, but in other workloads the increased volume of data in ARC could increase the ARC hit rate, leading to potentially better performance in that instance in spite of the higher CPU usage). I don't know if there's anything that can be done to limit the impact of compressed ARC on very hot-block workloads aside from disabling it completely right now.

For now hopefully this thread will serve to highlight the compressed ARC feature (which there is scant documentation on, and I'm sure most of the community is only dimly aware of if at all), along with some pertinent data to determine whether or not it should be enabled for your particular workload. Right now, even with powerful dual-deca core Ivy-Bridge CPUs it looks like the storage head caps out at 130k 8K read IOPS with compressed ARC enabled.

In sum...

If your data compresses decently and your workload does not fetch the same block repetitively, leave compressed ARC enabled and enable LZ4 on your dataset/ZVOL.
If your data doesn't compress hardly at all, disable LZ4 on your dataset/ZVOL.
If your data compresses decently and your workload involves very hot blocks in ARC, it may be worthwhile to disable the compressed ARC feature to reduce CPU usage, decrease latency and increase your read IOPS out of ARC. Some loads in Elasticsearch would possibly fit this profile if you have compression disabled in the indexes to save CPU usage on your ES nodes, but transparently compress them in ZFS backend storage to save space.

As @mav@ has elucidated earlier, compressed ARC does not actually compress your data as it moves up into ARC. It simply moves the stage at which the block is decompressed (so on read instead of inflating the block and then storing it in ARC, it reads the block off the pool drives and loads the compressed block directly into ARC, and only decompresses it just-in-time upon the client read. If the block isn't compressed on disk, it isn't compressed in ARC. But with LZ4 enabled by default on newly created pools, it's worthwhile to revisit if you need it or not).

If anyone has additional notes I will gladly edit this post and/or add to it.

mav@ · May 10, 2017

I don't think it is a bug, more like a tradeoff case indeed that may need some optimization. You may file me a ticket to look on it one day.

ZFS actually stores up to 100MB of uncompressed data for very hot blocks to not pessimize it too much, but unfortunately that value is not tunable on FreeBSD, and increasing it too high would increase memory waste by data duplication (storing both compressed and uncompressed).

Here is a talk with more technical information about compressed ARC, if somebody want: https://www.youtube.com/watch?v=zWcgccpVRfA

Jason Keller · May 10, 2017

Ticket filed with an assignment to you @mav@ (I filed it under umbrella in the bug tracker; please move as necessary).

Important Announcement for the TrueNAS Community.

Massive CPU usage on concurrent access (iSCSI)

Jason Keller

Explorer

Jason Keller

Explorer

mav@

iXsystems

Jason Keller

Explorer

Similar threads