Notes on Performance, Benchmarks and Cache.

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
Understood, my thought here was to try and understand the performance of the underlying disk hardware.
Well, to point to one specific thing: the choice of "sync=always" will completely invalidate such a test. Do you realise that, among other problems, it will cause all data to be written twice?
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
@wrayste ZFS was not designed to be the fastest filesystem per se, but using sync=always on a dataset in your dd test will have given a low value. In any case, the fio program is a far better way to benchmark your pool, and you need to appreciate the difference in the way ZFS handles sync and async writes (see for example: https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/ and https://www.ixsystems.com/blog/zfs-zil-and-slog-demystified/).

With just 4 drives I'd suggest you have only two realistic choices of pool layout: raidz2 or a stripe of mirrors. The latter maximises IOPS, as explained here: https://www.ixsystems.com/blog/zfs-pool-performance-1/ & https://www.ixsystems.com/blog/zfs-pool-performance-2/ So Chris Moore's question re: proposed use, such as iSCSI, SMB, or NFS shares, is very pertinent here.

Alongside the fio benchmarking program, the CLI tools gstat, zilstat and zpool iostat can be used to monitor disk/pool performance/activity.
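
If it helps, something along these lines is a reasonable starting point with fio (the /mnt/tank/test path is just a placeholder for a scratch dataset on your pool; adjust size and numjobs to suit):

Code:
# Sequential 1M writes, roughly comparable to the dd run, with a final fsync:
fio --name=seqwrite --directory=/mnt/tank/test --rw=write --bs=1M \
    --size=16G --numjobs=1 --ioengine=posixaio --end_fsync=1

# Random 4k writes with an fsync after every write, which is where the ZIL
# (and a SLOG, if you add one) comes into play:
fio --name=randsync --directory=/mnt/tank/test --rw=randwrite --bs=4k \
    --size=4G --numjobs=1 --ioengine=posixaio --fsync=1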
 

wrayste

Cadet
Joined
Jun 4, 2019
Messages
5
Well, to point to one specific thing: the choice of "sync=always" will completely invalidate such a test. Do you realise that, among other problems, it will cause all data to be written twice?

I did not realise that; why does it cause a second write? I thought that setting ensured the disk write had completed before a sync request was acknowledged.

@KrisBee thanks, I will have a look at those this evening. I'm not expecting ZFS to be the fastest; the numbers I was getting were so far from what I was expecting that it caused me to query this.
 

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
I did not realise that; why does it cause a second write? I thought that setting ensured the disk write had completed before a sync request was acknowledged.

Sync writes ensure the data is written to disk before the write returns, yes, but ZFS does this by writing the data immediately to the "ZFS intent log" (ZIL). That data is then discarded from the ZIL once it's written again later through the normal async write process via a subsequent transaction group.

Look through this forum's resources for info on the ZIL for lots more information.

(The usual way to ameliorate the inevitable performance hit for sync writes is to move the ZIL to dedicated fast storage, where it is called a "separate intent log" (SLOG). The terms ZIL and SLOG are frequently confused.)

Another wrinkle with using sync writes for low-level disk performance testing is that the *application's* sync write call does not return until data is written to disk, which means that notification of the successful write has to travel all the way up the stack from the disk through all the layers of hardware and firmware and software back to the user-space application, before the thread that made the sync write call can issue another write. It's really hard to predict a priori what the impact of that is, but it doesn't seem like a great start if what you are actually trying to do is see how fast the disks can go.
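
(For reference, the knobs involved look roughly like this from the CLI; "tank/data" and "nvd0" below are placeholder names, not a recommendation.)

Code:
zfs get sync,logbias tank/data     # show how the dataset currently handles sync writes
zfs set sync=standard tank/data    # honour sync only when the application asks for it
zfs set sync=always tank/data      # force every write through the ZIL (what skewed the dd test)
zpool add tank log nvd0            # attach a fast device as a SLOG (separate intent log)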
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The storage will eventually be a mixture of video, music, pdfs, etc.
For this use, which sounds like a very standard installation, I would suggest an SMB share with sync set to standard. Most applications do not call for sync writes, and asynchronous is much faster as FreeNAS can utilize a RAM cache instead of needing to use the ZIL, as previously described by others. FreeNAS / ZFS uses RAM extensively for cache, which is why we suggest the use of ECC memory to ensure reliable function of the system. It is common for the memory to remain constantly around 97% utilized. I have servers at work that have 256GB of RAM and use it all, all the time. ZFS can be fast, if you give it enough resources.
I think this is a good image to illustrate that:
[Attached image: Screenshot from 2019-06-04 12-22-20.png]
One thing to understand about ZFS pools (very generally speaking) is that a pool is made of one or more vdevs (virtual devices), and each vdev behaves much like a single instance of the constituent disks that make it up. This can vary depending on the type of vdev (vdevs can be n-way mirrors or some level of RAIDz: z1, z2, z3), which has an impact, but the simple answer is that more vdevs provide more performance. If you have only one vdev, you are roughly limited to the performance of one drive. Also note that, with all vdevs being striped together (much like a RAID-0), I hope it is clear that redundancy (resilience) is at the vdev level. Loss of any vdev due to disk failure would result in total loss of the pool.
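
To make that concrete, for a four-disk pool the two layouts KrisBee mentioned would look roughly like this from the CLI (da0-da3 are placeholder device names; on FreeNAS you would normally create the pool through the GUI rather than by hand):

Code:
# One raidz2 vdev: best capacity efficiency, but roughly the IOPS of a single disk.
zpool create tank raidz2 da0 da1 da2 da3

# A stripe of two mirrors: two vdevs, so roughly twice the IOPS, at the cost of capacity.
zpool create tank mirror da0 da1 mirror da2 da3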

There is a lot of knowledge and experience among the forum members; please ask any questions you have and someone will certainly help you, even if it is just to say that it isn't a good idea, because FreeNAS and ZFS are not always the answer to every question. ZFS was designed to be reliable first, so making it fast can be quite expensive. The question is, how fast does it need to be? Are you planning 10Gb networking or a 1Gb network, and how many users will be accessing the system simultaneously?
 

wrayste

Cadet
Joined
Jun 4, 2019
Messages
5
@wrayste ZFS was not designed to be the fastest filesystem per se, but using sync=always on a dataset in your dd test will have given a low value. In any case, the fio program is a far better way to benchmark your pool, and you need to appreciate the difference in the way ZFS handles sync and async writes (see for example: https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/ and https://www.ixsystems.com/blog/zfs-zil-and-slog-demystified/).

With just 4 drives I'd suggest you have only two realistic choices of pool layout: raidz2 or a stripe of mirrors. The latter maximises IOPS, as explained here: https://www.ixsystems.com/blog/zfs-pool-performance-1/ & https://www.ixsystems.com/blog/zfs-pool-performance-2/ So Chris Moore's question re: proposed use, such as iSCSI, SMB, or NFS shares, is very pertinent here.

Alongside the fio benchmarking program, the CLI tools gstat, zilstat and zpool iostat can be used to monitor disk/pool performance/activity.

Thanks, those links are really useful. I think I may have glanced at some of them, but the mistake I made was assuming they were only relevant when using cache drives. I'll look at using fio to investigate further.

@anmnz Thanks for the extra information; with the above this is now much clearer, and I think the numbers I saw are probably realistic given what is actually going on with those settings.

... I have servers at work that have 256GB of RAM and use it all, all the time. ZFS can be fast, if you give it enough resources.
...
The question is, how fast does it need to be? Are you planning 10Gb networking or a 1Gb network, and how many users will be accessing the system simultaneously?

Yep, I understood the importance of RAM (and ECC), which led me to go for a platform that supports RDIMMs; depending on price I could add another 128 GB later. The 10Gb is for future-proofing at the moment, as the alternative board with 4x 1Gb was not any more attractive.

Thanks all for the help. As mentioned, the fact that the ZIL is written to the pool as well as the data (when not using a separate log device) is probably the critical piece of information that I'd overlooked.
 

miercoles131

Dabbler
Joined
Apr 17, 2019
Messages
15
Silly question, just trying to figure out what to throw at a new build I'm working on with 192 GB of RAM. From what I understood from this thread, running the following would create a 100 GB file to test read/write...

Code:
# Write
dd if=/dev/zero of=tmp.dat bs=2048k count=50k
# Read
dd if=tmp.dat of=/dev/null bs=2048k count=50k

Should I use this as-is to test my system, or change it to a larger number to account for the large RAM size?
 

wrayste

Cadet
Joined
Jun 4, 2019
Messages
5
Silly question, just trying to figure out what to throw at a new build I'm working on with 192 GB of RAM. From what I understood from this thread, running the following would create a 100 GB file to test read/write...

Code:
# Write
dd if=/dev/zero of=tmp.dat bs=2048k count=50k
# Read
dd if=tmp.dat of=/dev/null bs=2048k count=50k

Should I use this as-is to test my system, or change it to a larger number to account for the large RAM size?

The simplest option I found was to turn off compression and turn off asynchronous writes (sync=always), then do the above. This should give you worst-case performance for the drives.
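
For reference, the toggles I mean look roughly like this from the CLI ("tank/bench" is a placeholder dataset name; the same options exist in the GUI dataset settings):

Code:
zfs set compression=off tank/bench    # otherwise the zeros from /dev/zero compress away to almost nothing
zfs set sync=always tank/bench        # force synchronous writes, i.e. the worst case
# ...run the dd test above, then put the settings back:
zfs set compression=lz4 tank/bench    # or whatever it was set to before
zfs set sync=standard tank/bench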

Another option is to use the 'fio' tool, which is a bit more in-depth and will possibly provide more accurate results.

With asynchronous writes you'll get much better performance, and with 192 GB of RAM a lot of headroom; I wasn't so interested in RAM performance, which is why I tested with the settings above.

Good luck with your testing; it is an art, and the most important thing I learnt was about the double writes (described in some of the links above).
 

lonelyzinc

Dabbler
Joined
Aug 8, 2019
Messages
35
root@freenas2[/mnt/tank2/smb_dataset]# dd if=/dev/zero of=tmp.dat bs=2048k count=50k; dd if=tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 54.939112 secs (1954421499 bytes/sec)
# write 1864 MB/s
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 58.866708 secs (1824022211 bytes/sec)
# read 1740 MB/s

Why would the write speed be faster on my standalone Samsung 970 NVMe drive using this benchmark? Every benchmark I have ever seen for this type of drive, directly attached to a Mac or Windows machine, has indicated that the read speed should always be noticeably faster.

The average Mac/Windows user who uses NVMe as a boot drive should be able to understand where I'm coming from.
 

purduephotog

Explorer
Joined
Jan 14, 2013
Messages
73
That's not just one drive, is it? I see from your build that you've got it set to IT mode... and it is a 970, which is uber fast. I'm guessing you've maxed out the HBA.

Just guessing though; I've got nothing to compare it to.
 

lonelyzinc

Dabbler
Joined
Aug 8, 2019
Messages
35
That's not just one drive, is it? I see from your build that you've got it set to IT mode... and it is a 970, which is uber fast. I'm guessing you've maxed out the HBA.

Just guessing though; I've got nothing to compare it to.
It's a standalone NVMe drive in a PCIe 3.0 x4 slot.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Why would the write speed be faster on my standalone Samsung 970 NVMe drive using this benchmark? Every benchmark I have ever seen for this type of drive, directly attached to a Mac or Windows machine, has indicated that the read speed should always be noticeably faster.

The average Mac/Windows user who uses NVMe as a boot drive should be able to understand where I'm coming from.

Caching - you are hitting two different write caches here:
1. The FreeNAS in-memory cache - depending on RAM size (only a part of it).
2. The drive's write cache - most NVMe drives have an internal write cache (dependent on drive size; it seems to be ~5% of the capacity). This is especially true for M.2 drives, where it can commonly be seen as a stepped drop-off in write speed when copying large amounts of data (movie collections).
It's also visible in reviews, e.g.

If you want a more realistic result, increase the amount of data written to 200 or 300 GB to reduce the impact of the write caches.
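
i.e. something like this (the same dd pattern as before, just roughly 300 GB so it blows past both the RAM and the drive's internal cache):

Code:
dd if=/dev/zero of=tmp.dat bs=2048k count=150k
dd if=tmp.dat of=/dev/null bs=2048k count=150k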
 

lonelyzinc

Dabbler
Joined
Aug 8, 2019
Messages
35
Caching - you are hitting two different write caches here:
1. The FreeNAS in-memory cache - depending on RAM size (only a part of it).
2. The drive's write cache - most NVMe drives have an internal write cache (dependent on drive size; it seems to be ~5% of the capacity). This is especially true for M.2 drives, where it can commonly be seen as a stepped drop-off in write speed when copying large amounts of data (movie collections).
It's also visible in reviews, e.g.

If you want a more realistic result, increase the amount of data written to 200 or 300 GB to reduce the impact of the write caches.
Thanks for the info, but I don't think this really explains the slowness of the read performance. I haven't had a moment to compare with other operating systems on the same machine, but there are a lot of Windows benchmarks for this drive out there, on Amazon for example.

In fact the spec page itself says reads up to 3,400 MB/s, and I seem to remember this same drive performing up there when it was in a Thunderbolt 3 enclosure connected to a Mac.

I'd be curious to see what other people are getting with a similar high-performance (PCIe 3.0 x4) single NVMe drive.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
What CPU load did you have during this test (on a single core)? Maybe it is maxed out by that.
 

phil1c

Dabbler
Joined
Dec 28, 2019
Messages
21
root@freenas2[/mnt/tank2/smb_dataset]# dd if=/dev/zero of=tmp.dat bs=2048k count=50k; dd if=tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 54.939112 secs (1954421499 bytes/sec)
# write 1864 MB/s
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 58.866708 secs (1824022211 bytes/sec)
# read 1740 MB/s

Why would the write speed be faster on my standalone Samsung 970 NVMe drive using this benchmark? Every benchmark I have ever seen for this type of drive, directly attached to a Mac or Windows machine, has indicated that the read speed should always be noticeably faster.

The average Mac/Windows user who uses NVMe as a boot drive should be able to understand where I'm coming from.

What version of Proxmox are you running, and what's your hardware (meaning RAM, CPU, chassis/mobo)? My situation isn't quite the same, as mine involves a FreeNAS box and 10Gb connections, but the root of my issue is writes that are way faster than reads (reads are anywhere from 30-60% of writes) and I'm on a hunt to find any and all information I can to figure this out.
 

RegularJoe

Patron
Joined
Aug 19, 2013
Messages
330
On Windows there are tricks the Samsung software does to enhance the speed reported to benchmark applications. Is this a fresh and blank NVMe disk? Do you have compression turned on? Is there anything else writing to the disk? If you use diskinfo -citv /dev/xxx, does it show the read speed you're expecting? If you have a device with no data on it, I think you can do write tests with diskinfo as well. I have noticed that ATTO Disk Benchmark for Windows shows the small I/O sizes not reaching the 3,400 MB/s, while the larger I/O sizes do show what Samsung states.
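
For example (/dev/nvd0 is just a placeholder for however your NVMe drive shows up):

Code:
diskinfo -citv /dev/nvd0    # command overhead, IOPS and transfer-rate tests (read-only)
diskinfo -wS /dev/nvd0      # synchronous write test; -w allows writes, so only use it on a disk with no data you care about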
 

Jerami1981

Dabbler
Joined
Jan 4, 2018
Messages
32
I feel like I have done something wrong.

root@mjolnir[~]# dd if=/dev/zero of=tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 20.990478 secs (5115375883 bytes/sec)

root@mjolnir[~]# dd if=tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 11.482589 secs (9351043237 bytes/sec)

System hardware:
SuperMicro MBD-X11SPI-TF-O
Intel Xeon Silver 4110 Skylake 8-Core BX806734110
64 GB 2133 ECC RAM
HBA x2 LSI 9305 12 Gb/s
Mellanox MNPA19-XTR 10G

Pool A
Samsung 860 1TB x3 Striped

Pool B
RaidZ2
Vdev#1 x6 10TB WD
Vdev#2 x6 8TB WD & SG
Vdev#3 x6 4TB WD & SG
Vdev#4 x6 4TB WD & SG
 

wdp

Explorer
Joined
Apr 16, 2021
Messages
52
I'll throw my hat in the ring...

Fresh build; I was looking at RaidZ2 vs mirrored performance on a 12x 18TB setup before looking at possible caching and tuning.

2x 6x RaidZ2

Write
Code:
root@anton[~]# dd if=/dev/zero of=/mnt/tank/Share1/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 153.184751 secs (700945632 bytes/sec)


Read
Code:
root@anton[~]# dd if=/mnt/tank/Share1/tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 122.100346 secs (879392941 bytes/sec)


6x Mirrored vDevs

Write
Code:
root@anton[~]# dd if=/dev/zero of=/mnt/tank/Share1/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 107.400316 secs (999756671 bytes/sec)


Read
Code:
root@anton[~]# dd if=/mnt/tank/Share1/tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 251.934330 secs (426199091 bytes/sec)


No clue why my reads took a total dive with mirrored vdevs. Maybe I shouldn't stay up until 1am building a TrueNAS box.
 