All-Flash NVMe Pool. Slow Read.

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Thanks. I'd try:
  • A clean install
  • Mirrored vdevs
  • Using a single CPU
  • Using an HBA instead of the backplane/cables you are using.
Iperf shows expected speeds.
The solnet-array-test shows regular drive speeds.
The issue has to be either in the way you execute your tests or in some kind of configuration issue... perhaps in the BIOS.

As I said, I don't understand the issue. How is CPU usage during your tests? What are the temps? How do you set up the share?
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
A clean install
OK. Same version, or is there a specific recommended version?
Mirrored vdevs
I rolled back all of the data, so I can destroy the new system's pool now. I'll try.
But in the end I need RAIDZ1; mirrors don't give me enough capacity.
Using a single CPU
In the PowerEdge R640 design, the PCIe NVMe backplane is wired to PCIe lanes on the second CPU, so I cannot remove one CPU if I want the 10-bay NVMe configuration.
Using an HBA instead of the backplate/cables you are using.
Does an NVMe HBA even exist? Are you talking about a PCIe switch?
The issue has to either be in the way you execute your tests or in some kind of configuration issue... perhaps in the BIOS.
Some settings were adjusted to reduce the electricity bill. Once the problem appeared, I disabled all power-saving settings for testing, but the result is the same.
As I said I don't understand the issue. How is CPU usage during your tests? What are the temps? How do you setup the share?
This is a bit of an interesting part: during the tests, the CPU stays almost idle.

In fact, I collected a lot of information about NVMe arrays on ZFS before starting, and most of it did not show positive examples.
To be honest, I'm starting to feel convinced that ZFS just isn't built for extreme NVMe-class performance.
I still had room for hope until I experienced it firsthand. :)
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
Alright, I clean-installed TrueNAS CORE 12.1-MASTER-202103300439,
then destroyed the pool and created a mirrored layout.

And... the result? Same problem.

I think I need to tune a few more things, but I'm getting tired.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Does an NVMe HBA even exist? Are you talking about a PCIe switch?
I was under the impression you were using SAS as the interface.

In fact, I collected a lot of information about NVMe arrays on ZFS before starting, and most of it did not show positive examples.
To be honest, I'm starting to feel convinced that ZFS just isn't built for extreme NVMe-class performance.
I still had room for hope until I experienced it firsthand. :)
I have seen quite a few threads with high speeds, but ZFS is certainly focused more on data integrity than on pure speed.

Tests have shown that the system should be able to work as intended, so I'm asking for details about your testing methodology (I recall zip archives and an SMB share?).

I think I need to tune a few more things, but I'm getting tired.
Understandably so; I'm sorry for not being of much help. @Ericloewe, do you know anything that could solve the issue, like a hidden magical tunable?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
From the information I gathered before, I heard that I need 1GB of RAM per TiB of data.
So I configured it with 64GB; is that not correct?

The 1GB per TB guideline is aimed at a typical 1Gbps ethernet filer handling common filesharing duties. It would be overshoot for lightly loaded filers, and is not sufficient for heavily loaded systems such as a busy code development or departmental fileserver, especially with lots of small files. It's just that users need to be told SOMETHING for sizing, because users typically come into ZFS not appreciating that RAM needs to scale with disk. Busy or stressy fileservers typically benefit from additional ARC and/or L2ARC.

Why is it slower than my previous 6 x 14TB SAS HDDs? Same RAIDZ1.

Good question. It could have to do with lots of things, such as low queue depth for I/O to the raw device, or how your NVMe devices are attached to the system. The reason I originally wrote solnet-array-test back in the late 90's/early 2000's was an unexplained slowness in a Dell PERC RAID system handling 75 drives, and it turned out that the RAID cards themselves simply were unable to keep up with the simultaneous I/O to so many devices. You need to consider how your array is deriving so many PCIe lanes, is there a PLX, etc.

Your individual devices seem to be getting ~1900 MBytes/sec sequentially but under stress seem to be managing more like ~2900 MBytes/sec, which suggests parallelism inside the controller, and also that the SSD is not managing to make optimal use of queueing to reduce latency. If it can go faster when two jobs are running, that means it won't go as fast with just one.
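If you want to sanity-check that, one rough sketch (the device name here is hypothetical, and this assumes fio's posixaio engine, since the default psync engine ignores iodepth) is to read a single raw device at a deeper queue depth and compare it against an iodepth=1 run:

Code:
fio --name=qd-test --filename=/dev/nvd0 --ioengine=posixaio --rw=read --bs=1M --iodepth=32 --numjobs=1 --runtime=30 --time_based --group_reporting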

This gets back to how ZFS manages I/O towards devices, and I suspect your real problem is the RAIDZ1 vdev you've created. If there is a possibility to deconstruct that and experiment with the raw pool, I recommend:

1) Try making two 4-device RAIDZ1 vdevs. My guess is you may see a significant bump in I/O speed.

2) Try making the recommended mirror vdev setup. Between this and the two-RAIDZ1-vdev layout, it may help identify whether this is a line of testing worth pursuing.
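For reference, a minimal sketch of both layouts from the command line (pool and device names are placeholders; through the TrueNAS GUI the pool would be built with gptid labels instead of raw device names):

Code:
# two 4-device RAIDZ1 vdevs in a single pool
zpool create tank raidz1 nvd0 nvd1 nvd2 nvd3 raidz1 nvd4 nvd5 nvd6 nvd7
# or four 2-way mirror vdevs
zpool create tank mirror nvd0 nvd1 mirror nvd2 nvd3 mirror nvd4 nvd5 mirror nvd6 nvd7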
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
I was under the impression you were using SAS as the interface.
It's not Tri-Mode or anything like that, haha, just pure PCIe lanes!
I have seen quite a few threads with high speeds, but ZFS is certainly focused more on data integrity than on pure speed.
Agreed, that is why I can't give up yet!
Tests have shown that the system should be able to work as intended, so I'm asking for details about your testing methodology (I recall zip archives and an SMB share?).
When measuring speed, most people use fio or dd, or for more granular testing, vdbench or HCIBench (VMware).
I know how to use these tools, but the focus of this project was "how much speed can I actually see and feel with my own eyes?"
So, nothing special: I just read and write files in the SMB-shared folder from Windows Explorer, using one zip file that I already had.
Understandably so; I'm sorry for not being of much help.
There's nothing to be sorry about. I'm honored and grateful to you for sharing new knowledge, giving me your opinions, and trying to help!

And I found something: a video by Linus.
It shows creating and serving an NVMe pool through TrueNAS CORE, just as I wanted. They use a similar number of vdevs to what I have, and they don't use separate L2ARC or SLOG devices.
I think I picked up some hints from it.
'zfs set primarycache=metadata'.
They said the RAM (ARC) does not need to hold anything other than metadata because the NVMe disks are fast enough.
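If I read it correctly, that property is applied per pool or dataset, roughly like this (a sketch using my pool name):

Code:
# keep only ZFS metadata in ARC; data blocks are read straight from the NVMe vdevs
zfs set primarycache=metadata NVMe-DATA-Pool-01
# revert to the default behaviour later if needed
zfs set primarycache=all NVMe-DATA-Pool-01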
Additionally, they did some tuning of the SMB service; I'm going to use that part as a reference.
Their result is a little lower than they expected, but it's convincing to me and it's the kind of performance I want.
Of course their server has higher specifications than mine, but what I'm hoping for is at least the performance of a single NVMe disk.
I believe this should be possible.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
You need to consider how your array is deriving so many PCIe lanes, is there a PLX, etc.
The CPUs provide sufficient PCIe lanes.
In the Dell PowerEdge R640 design, of the 10 bays (slots 0 to 9), bays 0 and 1 are connected through a PCIe expander (PLX card, x16) mounted in a PCIe slot belonging to CPU1, and bays 2 to 9 are cabled to SlimSAS connectors on the motherboard, which are wired directly to CPU2.
I'm currently using slots 2 to 9.
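For what it's worth, one way I can double-check that each drive negotiated its full link is something like this from the CORE shell (a sketch; the controller number is just an example, and the PCI-Express capability line shows the negotiated width and speed):

Code:
# look for something like "link x4(x4) speed 8.0(8.0)" in the output
pciconf -lc nvme2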
1) Try making two 4-device RAIDZ1 vdevs. My guess is you may see a significant bump in I/O speed.
OK, I'll try this one too.
2) Try making the recommended mirror vdev setup. Between this and the two-RAIDZ1-vdev layout, it may help identify whether this is a line of testing worth pursuing.
Uh... this one? If I understood it right, I already tried that and got the same result.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
1) Try making two 4-device RAIDZ1 vdevs. My guess is you may see a significant bump in I/O speed.
I modified some options while trying this:
compression=off, atime=off.
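(For reference, set with something like this on the pool:)

Code:
zfs set compression=off NVMe-DATA-Pool-01
zfs set atime=off NVMe-DATA-Pool-01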
And I got the same result over SMB.
But I found one interesting thing: when I set SMB aside for a while and ran the test locally with fio, I got a pretty good-looking result.

This one is write.
Code:
root@truenas[/mnt/NVMe-DATA-Pool-01/TEST]# fio --filename=./test --direct=1 --rw=write --bs=1M --iodepth=2 --numjobs=12 --group_reporting --name=test --size=50G
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=2
...
fio-3.19
Starting 12 processes
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
test: Laying out IO file (1 file / 51200MiB)
Jobs: 12 (f=12): [W(12)][100.0%][w=7668MiB/s][w=7667 IOPS][eta 00m:00s]
test: (groupid=0, jobs=12): err= 0: pid=6814: Thu Jun  8 06:27:09 2023
  write: IOPS=6991, BW=6992MiB/s (7331MB/s)(600GiB/87877msec)
    clat (usec): min=260, max=10806, avg=1683.61, stdev=479.59
     lat (usec): min=278, max=10824, avg=1711.13, stdev=482.52
    clat percentiles (usec):
     |  1.00th=[  873],  5.00th=[ 1045], 10.00th=[ 1139], 20.00th=[ 1287],
     | 30.00th=[ 1401], 40.00th=[ 1516], 50.00th=[ 1614], 60.00th=[ 1729],
     | 70.00th=[ 1860], 80.00th=[ 2024], 90.00th=[ 2343], 95.00th=[ 2606],
     | 99.00th=[ 3097], 99.50th=[ 3261], 99.90th=[ 3589], 99.95th=[ 3687],
     | 99.99th=[ 4293]
   bw (  MiB/s): min= 4772, max= 9518, per=100.00%, avg=6995.16, stdev=53.66, samples=2076
   iops        : min= 4770, max= 9515, avg=6988.21, stdev=53.76, samples=2076
  lat (usec)   : 500=0.08%, 750=0.22%, 1000=3.01%
  lat (msec)   : 2=75.26%, 4=21.41%, 10=0.01%, 20=0.01%
  cpu          : usr=1.86%, sys=19.73%, ctx=706575, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,614400,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
  WRITE: bw=6992MiB/s (7331MB/s), 6992MiB/s-6992MiB/s (7331MB/s-7331MB/s), io=600GiB (644GB), run=87877-87877msec


And this one is read.
Code:
root@truenas[/mnt/NVMe-DATA-Pool-01/TEST]# fio --filename=./test --direct=1 --rw=read --bs=1M --iodepth=2 --numjobs=12 --group_reporting --name=test --size=50G
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=2
...
fio-3.19
Starting 12 processes
Jobs: 1 (f=1): [_(6),R(1),_(5)][100.0%][r=1063MiB/s][r=1063 IOPS][eta 00m:00s]       
test: (groupid=0, jobs=12): err= 0: pid=6841: Thu Jun  8 06:28:13 2023
  read: IOPS=12.1k, BW=11.8GiB/s (12.7GB/s)(600GiB/50784msec)
    clat (usec): min=112, max=24685, avg=848.31, stdev=182.63
     lat (usec): min=113, max=24685, avg=848.54, stdev=182.63
    clat percentiles (usec):
     |  1.00th=[  553],  5.00th=[  619], 10.00th=[  644], 20.00th=[  693],
     | 30.00th=[  734], 40.00th=[  791], 50.00th=[  848], 60.00th=[  889],
     | 70.00th=[  930], 80.00th=[  979], 90.00th=[ 1057], 95.00th=[ 1156],
     | 99.00th=[ 1319], 99.50th=[ 1385], 99.90th=[ 1565], 99.95th=[ 1647],
     | 99.99th=[ 2442]
   bw (  MiB/s): min=11362, max=17775, per=100.00%, avg=14187.36, stdev=166.14, samples=1026
   iops        : min=11361, max=17772, avg=14181.80, stdev=166.24, samples=1026
  lat (usec)   : 250=0.01%, 500=0.20%, 750=33.05%, 1000=50.29%
  lat (msec)   : 2=16.44%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.42%, sys=43.65%, ctx=1611829, majf=0, minf=3072
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=614400,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=11.8GiB/s (12.7GB/s), 11.8GiB/s-11.8GiB/s (12.7GB/s-12.7GB/s), io=600GiB (644GB), run=50784-50784msec


And I found one strange situation on my 40G switch.
Discarded packets occur when transferring files.
Currently I use a VLAN interface on the TrueNAS box (I mean I create a 'vlan' interface in TrueNAS and use it), and I use multiple VLANs and network IDs with trunk settings on the switch.
I changed the port to access mode for testing, and I got a read speed of about 400MB/s to 600MB/s over SMB.
ZFS itself still needs some tuning, but there also seems to be a problem on the TrueNAS network side.
I say that because my traditional HDD-based TrueNAS SCALE systems use the same network configuration, and these symptoms do not occur there.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And I found one strange situation on my 40G switch.
Discarded packets occur when transferring files.

Packet loss (whether artificial or accidental) is going to be a disaster for performance.

Also I don't know if anyone has pointed you at


which will talk about congestion control algorithms that may be more suitable to high speed networking. I might suggest trying dctcp in your situation. I realize I never quite finished the entirety of that article but it will also give you some good hints as to other things that may be impactful.
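As a sketch of what trying it looks like on a FreeBSD-based CORE box (assuming the stock cc_dctcp module; on TrueNAS you would persist the setting through the Tunables UI rather than setting it by hand):

Code:
# see which congestion control algorithms are currently available
sysctl net.inet.tcp.cc.available
# load the DCTCP module if it is not listed
kldload cc_dctcp
# switch new TCP connections to DCTCP
sysctl net.inet.tcp.cc.algorithm=dctcp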
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
Packet loss (whether artificial or accidental) is going to be a disaster for performance.

Also I don't know if anyone has pointed you at


which will talk about congestion control algorithms that may be more suitable to high speed networking. I might suggest trying dctcp in your situation. I realize I never quite finished the entirety of that article but it will also give you some good hints as to other things that may be impactful.
Sweet Christmas! I think this is part of the cause of my problem. I'll try this.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
I conducted various additional tests.
The link speed is 40G for TrueNAS and 10G for the PCs.
In this case, packet discards occur when I pull files (TrueNAS -> PC).
If the link speed on the TrueNAS side is forced down to 10G, packet discards do not occur and I get approximately 600MB/s.
I've never seen anything like this before; this is heading in a strange direction.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've never seen anything like this before; this is heading in a strange direction.

Talk to us about your switching. Do you have any troublesome settings such as cut-through switching enabled? And, of course, troublesome settings such as cut-through switching NOT enabled?

When you have mixed link speeds especially with a fatter pipe up front, your 40G ethernet card can be trying to cram data out to the switch fabric at a pace faster than the fabric can handle it. In a normal store-and-forward situation, this can result in packet loss if your switchports have a weedy transmit buffer, because you can be reasonably certain it is going to be overflowed, and especially if you have "smaller" buffers it happens more often. If you have cut-thru configured, then corrupt frames can cause other various kinds of problems.

The general rule is to use the faster 40G ethernet for interswitch trunking or in situations where you can be assured that egress from the switch fabric is not going to be blocked. In many cases, it turns out that you can get better performance by restricting ingress speeds to egress speeds, in other words, use 10G for both ingress and egress. As you've discovered, this counterintuitive solution can be effective. If you have better quality switchgear with large buffers on a store-and-forward configuration, you will also get better performance out of that, which is why people pay for the high end switchgear.
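As a back-of-the-envelope illustration: traffic arriving at 40Gbps and draining at 10Gbps piles up at roughly 30Gbps, or about 3.75GBytes/sec of excess, so even a one millisecond burst at full speed leaves nearly 4MBytes that either sits in the egress port's transmit buffer or gets dropped.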

fs.com has a reasonable introduction article on the topic of c/t vs s/f switching and makes the important point that both have their places. If you absolutely need a super-low latency switching environment, c/t is the way to go but it comes at the cost of less resiliency, and sometimes the availability of c/t is used as an excuse for crappy buffering capabilities in switchgear.


That article does not touch on the topic of funneling a faster link speed into a lower link speed port. Many of us who do this professionally lived through the 1990's and had experiences with the 10Gbps->1Gbps and/or 1Gbps->100Mbps eras of pipe restriction. These were a bit different in that "standard" "modern" ethernet congestion algorithms tend to be better suited towards these link speeds. If you get unnecessary retries with newreno, for example, you really will find yourself capped rather harshly. Some of the newer algorithms do a better job of recovery at higher link speeds, which is why I suggested trying dctcp above. Fundamentally this boils down to the age old problem that you can't fit fifty pounds of crap into a five pound sack and that there will be trouble if you try.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
Do you have any troublesome settings such as cut-through switching enabled?
I'm using a Mellanox SX6012, so I think yes: it is the default behavior and I can't change it.

On Spectrum switches (such as the SN2100) it can be changed with the command 'switchmode store-and-forward'.
Unfortunately, my old switch doesn't seem to support that command.
The biggest problem might be the c/t vs. s/f issue you mentioned.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Part of the reason I don't generally like the Mellanox stuff is that it is intended for all this cluster service/infiniband/super-low-latency crap. And it's very good at that, just like a NASCAR racecar is very good at certain things, as long as they involve going fast and then turning left. But it isn't so good at a lot of the other stuff that a typical network card/ethernet switch should be good at. There's a reason that Mellanox is available so cheaply on the used market.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
Part of the reason I don't generally like the Mellanox stuff is that it is intended for all this cluster service/infiniband/super-low-latency crap. And it's very good at that, just like a NASCAR racecar is very good at certain things, as long as they involve going fast and then turning left. But it isn't so good at a lot of the other stuff that a typical network card/ethernet switch should be good at. There's a reason that Mellanox is available so cheaply on the used market.
There is a detail I forgot to mention.
The discard counters only increment on the ports associated with the TrueNAS system.
This is the first time I've seen this issue in a mixed link-speed environment, and it doesn't happen when communicating with VMs in my vSphere environment that share the same 40G vDS uplink.

This problem is like scrambled eggs made of several overlapping problems: network, ZFS tuning, SMB, etc.
I think I need to go over everything again.
First of all, I'm going to set TrueNAS aside for a moment and configure the drives with mdadm on a typical Linux distribution to see if the same problem occurs.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The discard counters only increment on the ports associated with the TrueNAS system.
This is the first time I've seen this issue in a mixed link-speed environment, and it doesn't happen when communicating with VMs in my vSphere environment that share the same 40G vDS uplink.

Well that really sounds like you're just flooding the port and the switch can't cope, so it drops the frames.

First of all, I'm going to set TrueNAS aside for a moment and configure the drives with mdadm on a typical Linux distribution to see if the same problem occurs.

It probably won't. ZFS has a tendency to use ARC to hold stuff for clients, and if data is sourced from ARC, it will definitely get sent out at maximum possible speed (i.e. 40Gbps) which causes port flooding. If you use mdadm and it has to source data from physical devices, the data will probably come in more slowly and it will work because things aren't crashing out the ethernet at 40Gbps.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
Well that really sounds like you're just flooding the port and the switch can't cope, so it drops the frames.



It probably won't. ZFS has a tendency to use ARC to hold stuff for clients, and if data is sourced from ARC, it will definitely get sent out at maximum possible speed (i.e. 40Gbps) which causes port flooding. If you use mdadm and it has to source data from physical devices, the data will probably come in more slowly and it will work because things aren't crashing out the ethernet at 40Gbps.
You have a point.
Actually, I'd like to replace the switch with a model like an Arista or the Dell Z9100, but that's still too expensive for me.
Now that I think about it, I never ran a 40G-to-40G test; I should do that test first.
I think we'll get a little closer to a conclusion.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
I never ran a 40G-to-40G test; I should do that test first.
Sadly, it seems your opinion was right.
The problem does not occur when the link speeds match.
I also activated and configured dctcp, but it doesn't help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sadly, it seems your opinion was right.
The problem does not occur when the link speeds match.
I also activated and configured dctcp, but it doesn't help.

I'm sorry to hear it, but I'm glad we got to the underlying cause. Modern systems are extremely complicated and there are so many opportunities for bottlenecks. Digging down through a variety of subsystems just reinforces how complicated this would be. I also appreciate your patience and willingness to experiment. Sometimes that is just what's needed.

You may get the best bang for your buck on your current setup reducing link speed to 10G and then making sure you follow the tuning guidance I linked earlier. Alternatively, just accept that your setup may be suboptimal for this kind of fileservice. If you have enough client systems trying to access the fileserver, there may be advantage to having the 40G even if it cannot always work optimally. I suspect the best fix is a store-and-forward switch with large transmit buffers; this is very helpful for any link where you're downsizing connections, because it moves the work from TCP's retransmit algorithm on each end into the switchgear. As long as the switch can buffer up some traffic and avoid having to drop it, this is a lot more favorable for TCP even if the packets arrive a few ms later. You will notice that there's a repeated theme in the things I'm pointing you at... increasing buffer sizes (see tuning guide), using gear with large buffer sizes, etc. My experience is such that keeping the data flowing smoothly is a path to success, much better than losing packets and playing the retries game.
 

Linuchan

Dabbler
Joined
Jun 4, 2023
Messages
27
I'm sorry to hear it, but I'm glad we got to the underlying cause. Modern systems are extremely complicated and there are so many opportunities for bottlenecks. Digging down through a variety of subsystems just reinforces how complicated this would be. I also appreciate your patience and willingness to experiment. Sometimes that is just what's needed.

You may get the best bang for your buck on your current setup reducing link speed to 10G and then making sure you follow the tuning guidance I linked earlier. Alternatively, just accept that your setup may be suboptimal for this kind of fileservice. If you have enough client systems trying to access the fileserver, there may be advantage to having the 40G even if it cannot always work optimally. I suspect the best fix is a store-and-forward switch with large transmit buffers; this is very helpful for any link where you're downsizing connections, because it moves the work from TCP's retransmit algorithm on each end into the switchgear. As long as the switch can buffer up some traffic and avoid having to drop it, this is a lot more favorable for TCP even if the packets arrive a few ms later. You will notice that there's a repeated theme in the things I'm pointing you at... increasing buffer sizes (see tuning guide), using gear with large buffer sizes, etc. My experience is such that keeping the data flowing smoothly is a path to success, much better than losing packets and playing the retries game.
I'll start by fully separating the 10G and 40G networks.
Once the network-related issues are resolved, I will revisit the performance issues.
Thank you very much for sharing your opinions and knowledge. (to @jgreco, @Davvo)
 