What is needed to achieve 10Gbps read with RAIDZ-2 (raid6)

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Wow this is insane. That's 16 disks minimum
Not really that uncommon for 10Gb networking. I put together a set of 16 disks in mirror vdevs (8 vdevs) for testing and found that I could only get about 550MB/s over a 10Gb link. The problem boiled down to the fact that I was using old drives that were individually slow, and that dragged down the performance of the entire pool. Every aspect of the system needs to be accounted for if you want it to perform the way you want.
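
To put rough numbers on that result (a back-of-the-envelope sketch in Python, not a benchmark; it assumes streaming reads split roughly evenly across the mirror vdevs):

Code:
# Back out what each mirror vdev was delivering in the 550MB/s example above
# (assumption: streaming reads spread roughly evenly across vdevs).
pool_read_mb_s = 550            # observed over the 10Gb link
vdevs = 16 // 2                 # 16 disks as 8 mirror pairs
per_vdev = pool_read_mb_s / vdevs
print(f"~{per_vdev:.0f} MB/s per mirror vdev")            # ~69 MB/s

# A 10GbE link is roughly 1250 MB/s at line rate, so each vdev would need
# to sustain about 1250 / 8 ~= 156 MB/s for the pool to fill the link.
print(f"needed per vdev for 10GbE: ~{1250 / vdevs:.0f} MB/s")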
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
But given @ElliotDierksen has indicated that even with a 2-vdev setup he is able to get 10Gbps, then I guess I can achieve that with 2 vdevs of 4 disks each (8 disks total). Yes?
It depends on the disks. What disks are you looking at getting?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Interesting. Why Cisco over say Dell / HP / Lenovo for a server or storage pod?
Is there really no way to achieve 10Gbps read speed using spinning disks if I want to go with RAIDZ2 or RAID6, limited to a 12-disk bay? One thing I'm still not certain about (I know you and @SweetAndLow are saying the vdev is what matters) is whether there is really no benefit to more disks within a vdev in terms of read speed, or whether the increase just doesn't scale as well as with a non-ZFS RAID setup. If there is really no benefit to having more disks in a vdev and I want 2-disk failure redundancy, then I can go with 4-disk RAIDZ2 vdevs. With a 12-disk bay I can squeeze in at least 3 such vdevs. But given @ElliotDierksen has indicated that even with a 2-vdev setup he is able to get 10Gbps, then I guess I can achieve that with 2 vdevs of 4 disks each (8 disks total). Yes?

You want best case or worst case? More disks give you better sequential reads; more vdevs give you better IOPS (hint: IOPS are super important). Best case, you can saturate 10Gbps using 8 7200 RPM drives in a single RAIDZ2 vdev with sequential reads. Realistically it will probably give you far less; I can't test that scenario, so I don't know what performance it would give you. The stats I posted showing about 200MB/s of read came from a pool that is 50% full and actively being used for several tasks. So right now, worst case, my pool can provide a single client with 200MB/s if that client wants it. Things get worse if more clients start to use it, and better if reads start coming out of ARC instead of disk. You can't just look at disks; you have to look at workload, memory, network, vdevs, free space, etc.

If you want easy 10-gig speeds, just build an SSD pool.
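
As a quick best-case check of the 8-wide RAIDZ2 claim above (a sketch only; the ~200 MB/s per-drive figure is an assumed number for a modern 7200 RPM disk, not a measurement):

Code:
# Best-case streaming read estimate for a single 8-wide RAIDZ2 vdev,
# using the (N - parity) * per-drive-speed rule of thumb.
n_disks = 8
parity = 2                     # RAIDZ2
per_drive_mb_s = 200           # assumed sequential rate of a 7200 RPM drive

streaming = (n_disks - parity) * per_drive_mb_s
print(f"best case: ~{streaming} MB/s (~{streaming * 8 / 1000:.1f} Gb/s) "
      f"vs 10GbE line rate of ~1250 MB/s")
# ~1200 MB/s -- enough to roughly fill a 10Gb link, but only for large
# sequential reads on a healthy, mostly empty pool; random IO will be far lower.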
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Chris was incorrect in his statement.
Not according to ixsystems: https://www.ixsystems.com/blog/zfs-pool-performance-2/
"Streaming read speed: (N - p) * Streaming read speed of single drive"
Or are you talking about IOPS?
The OP was asking about being able to randomly access any photo in his photo library. That is not sequential / streaming. The usage that was described in the discussion would be random IO, and it would probably also qualify as small files. Small files are the slowest to access because each file is a new seek, so there is seek time to contend with. The reason that streaming can be faster is that there is one seek to find the start of the file and then the disk just keeps pushing data until the read is over. When there are multiple files, each file is a seek, and the smaller the file the more time the disk spends looking for the start of the next file. Small files are the worst. I deal with a file database system at work that holds millions of files ranging from a few MB to a couple of GB, about 330 TB in total, but they are all quite small, and to get adequate performance out of the system we have ten vdevs of 6 drives each, so we have the random IOPS to get at the data quickly.
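
To illustrate how much those per-file seeks cost, here is a toy model (assumptions: ~10 ms combined seek and rotational latency per file, ~200 MB/s sequential rate once the head is positioned, one seek per file, no caching or prefetch):

Code:
# Toy model of effective single-disk throughput when every file costs a seek.
seek_s = 0.010                 # assumed avg seek + rotational latency
stream_mb_s = 200              # assumed sequential rate once positioned

for file_mb in (0.1, 1, 25, 1000):
    transfer_s = file_mb / stream_mb_s
    effective = file_mb / (seek_s + transfer_s)
    print(f"{file_mb:>6} MB file -> ~{effective:5.0f} MB/s effective")
# 0.1 MB -> ~10 MB/s, 1 MB -> ~67 MB/s, 25 MB -> ~185 MB/s, 1000 MB -> ~200 MB/s

The smaller the file, the more the seek dominates, which is why random small-file workloads want more vdevs (more independent seeks in flight) rather than wider ones.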
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Chris was incorrect in his statement.
Imprecise may have been a better term, but I see he's already responded. Performance is a rather nebulous term; it could be taken to mean either IOPS or sequential read/write throughput.

Either way, if the OP would like 6 disks (as delineated in his pool design example) to saturate reads on 10GbE consistently over time, I say go SSD. Performance isn't always cheap.
 
Elliot Dierksen

Joined
Dec 29, 2014
Messages
1,135
Interesting. Why Cisco over say Dell / HP / Lenovo for a server or storage pod?
I use them for work, so I am familiar with them. Because of the work connection, I also have easy access to software.
But given @ElliotDierksen has indicated that even with a 2-vdev setup he is able to get 10Gbps, then I guess I can achieve that with 2 vdevs of 4 disks each (8 disks total). Yes?
4 disks in a RAIDZ2 vdev gives you roughly 50% usable space. If you are going to do that, I would suggest just using mirrors instead. You will have roughly the same available space, and mirrors are always the recommendation for VM storage that needs high IOPS. I use 6 or 8 disks (depending on the chassis) in RAIDZ2 vdevs, which gives me roughly 66% and 75% usable storage respectively. My performance needs aren't high enough to do split storage. All but 3 of my 30ish VMs are lab machines. If they were production machines, I would definitely use mirrors, likely comprised of SSDs.
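
The space math above works out like this (a quick sketch of the raw (N - p)/N fractions; real ZFS usable space will be a bit lower due to padding and metadata overhead):

Code:
# Raw space efficiency for the layouts discussed in this thread
# (ignores ZFS allocation overhead).
layouts = {
    "4-wide RAIDZ2": (4, 2),
    "6-wide RAIDZ2": (6, 2),
    "8-wide RAIDZ2": (8, 2),
    "2-way mirror":  (2, 1),
    "3-way mirror":  (3, 2),   # also survives 2 failures, like RAIDZ2
}
for name, (n, p) in layouts.items():
    print(f"{name:>13}: {100 * (n - p) / n:.1f}% usable")
# 50%, 66.7%, 75%, 50%, 33.3% respectively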
 

titusc

Dabbler
Joined
Feb 20, 2014
Messages
37
No way in hell. Well, maybe if there's like GOBS of free space, like 80%++ of your space is unused. Over time, fragmentation is going to mean that you do not have the long sequential runs of blocks necessary to be able to consider drives to be running at "200MB/s". They will run SUBSTANTIALLY slower.
Good point thanks.


Chris Moore said:
It depends on the disks. What disks are you looking at getting?
3.5" LFF 7200 RPM spinning disks. I have a few on hand already that I'd like to reuse and I do not know the speed they are but I can selectively buy new ones to use if need be. What I don't want is get SSD. If there is an equation I can see how I need how the number of disks or vdevs affect IOPS and sequential read or write speeds then I can pick and choose the correct setup for the given fault tolerance I need which is 2 disk failure per vdev. I think https://www.ixsystems.com/blog/zfs-pool-performance-2/ is a very good page to give me these numbers. What this page doesn't cover however is the fact that over time as drives get full things change. One thing I have been trying to get to is whether RAIDZ in ZFS does scale similarly to normal RAID do and it does in fact. It also shows how using even only 1 vdev of 12 disks in RAIDZ3 can outperform and provide an extra disk failure of redundancy over 2 vdevs of 6 disks each in RAIDZ2 for sequential read. In fact at this point I'm wondering why bother with the trouble of ZFS if using normal RAID achieves a similar goal and performance. For example if my only requirement is to have 2 disks failure redundancy and read 10Gbps speed, then either of the following works. Both provides sequential read of 800MB/s (1Gbps minus network overhead) and 2 disks failure protection.
- ZFS with 1 RAIDZ2 vdev of 6 disks (4 data + 2 parity), each disk at 200MB/s.
- RAID6 with 6 disks (4 data + 2 parity), each disk at 200MB/s.
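
A rough sketch of the ixsystems formulas applied to the 6-wide RAIDZ2 above and to the 2 x 4-disk layout discussed earlier (the 200 MB/s per-drive streaming rate and ~250 random IOPS per vdev are assumed rule-of-thumb figures, not measurements):

Code:
# Rule-of-thumb pool estimates per the ixsystems article:
#   streaming read ~= vdevs * (N - p) * per-drive streaming speed
#   random IOPS    ~= vdevs * per-drive IOPS (a RAIDZ vdev behaves like one drive)
def estimate(vdevs, width, parity, drive_mb_s=200, drive_iops=250):
    streaming = vdevs * (width - parity) * drive_mb_s
    iops = vdevs * drive_iops
    return streaming, iops

for label, layout in {"1x 6-wide RAIDZ2": (1, 6, 2),
                      "2x 4-wide RAIDZ2": (2, 4, 2)}.items():
    mb_s, iops = estimate(*layout)
    print(f"{label}: ~{mb_s} MB/s streaming, ~{iops} random IOPS")
# Both come out at ~800 MB/s streaming, but the 2-vdev layout doubles the IOPS.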


Chris Moore said:
The OP was asking about being able to randomly access any photo in his photo library. That is not sequential / streaming. The usage that was described in the discussion would be random IO, and it would probably also qualify as small files. Small files are the slowest to access because each file is a new seek, so there is seek time to contend with. The reason that streaming can be faster is that there is one seek to find the start of the file and then the disk just keeps pushing data until the read is over. When there are multiple files, each file is a seek, and the smaller the file the more time the disk spends looking for the start of the next file. Small files are the worst. I deal with a file database system at work that holds millions of files ranging from a few MB to a couple of GB, about 330 TB in total, but they are all quite small, and to get adequate performance out of the system we have ten vdevs of 6 drives each, so we have the random IOPS to get at the data quickly.
Well, yes, that's a good point. Each photo can be at a random location, not necessarily next to the next photo in the album on disk. But given each photo is about 25MB in size, I'd have thought there is still some sequential reading there. Apologies, I think I said 25GB earlier! There will be at most 2 people working together, and 99% of the time only 1 person using the pod. It's a family NAS, so only my wife and I are going to be using it, which is why I'm reluctant to go with 12 disks if I can avoid it. At work, sure, there would be more people using it, but this is home use for family photos.


Elliot Dierksen said:
4 disks in a RAIDZ2 vdev gives you roughly 50% usable space. If you are going to do that, I would suggest just using mirrors instead. You will have roughly the same available space, and mirrors are always the recommendation for VM storage that needs high IOPS.
I see your point, although with mirrored vdevs I'd need 3 disks per vdev to survive 2 disk failures, which means I'll have only 33% usable space.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Per the explanation, RAIDZ in ZFS scales similarly to normal RAID. It shows how even a single vdev of 12 disks in RAIDZ3 can outperform 2 vdevs of 6 disks each in RAIDZ2 for sequential read, while providing an extra disk of redundancy.
From what you are saying, it sounds like you misunderstood what the article was saying. I commend your effort to learn. Keep it up.
 

titusc

Dabbler
Joined
Feb 20, 2014
Messages
37
From what you are saying, it sounds like you misunderstood what the article was saying. I commend your effort to learn. Keep it up.
Misunderstood? The article states the following:
- 1 vdev of 12 disks providing 900MB/s with 75% space efficiency, allowing for 3 disk failures.
- 2 vdevs of 6 disks each providing 800MB/s with 66.7% space efficiency, allowing for 2 disk failures per vdev.

1x 12-wide Z3:

  • Read IOPS: 250
  • Write IOPS: 250
  • Streaming read speed: 900 MB/s
  • Streaming write speed: 900 MB/s
  • Storage space efficiency: 75% (54 TB)
  • Fault tolerance: 3

2x 6-wide Z2:

  • Read IOPS: 500
  • Write IOPS: 500
  • Streaming read speed: 800 MB/s
  • Streaming write speed: 800 MB/s
  • Storage space efficiency: 66.7% (48 TB)
  • Fault tolerance: 2 per vdev, 4 total
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Misunderstood? The article states the following:
You don't have a streaming workload. For what you want to do, IOPS is more important. The reason IOPS doubles between the two examples is that the number of vdevs doubles; it is completely unrelated to the number of disks in each vdev.

I feel like you are trying to prove me wrong. I do this all day, every day, for my job. I have multiple servers with hundreds of drives running at work, and even my main home NAS has thirty drives in three pools. I don't know everything, but I used to run a photography business as a side job and I am familiar with Lightroom; I bought version 1 of the product when it first came out. I don't claim to know everything, but I know about this.
I am tired of talking to you about it.
Good Luck. Have fun. Do whatever you want.
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
A pool layout more conducive to IOPS; however, it certainly has its drawbacks as well ...

12 x WDC WD100EMAZ-00WJTA0 (completely empty)

raidZ 3x4x10.0 TB (Capacity: 72.77 TiB)
  • 128k recordsize = 1,226 MB/s
  • 1M recordsize = 1,494 MB/s
Code:
root@FreeNAS-02[~]#

zfs create Tank1/disabled
zfs set recordsize=128k compression=off sync=disabled Tank1/disabled
dd if=/dev/zero of=/mnt/Tank1/disabled/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 83.550967 secs (1285133934 bytes/sec)

dd of=/dev/null if=/mnt/Tank1/disabled/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 41.302678 secs (2599690542 bytes/sec)
zfs destroy Tank1/disabled

zfs create Tank1/disabled
zfs set recordsize=1M compression=off sync=disabled Tank1/disabled
dd if=/dev/zero of=/mnt/Tank1/disabled/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 68.533230 secs (1566746277 bytes/sec)

dd of=/dev/null if=/mnt/Tank1/disabled/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 34.156175 secs (3143624288 bytes/sec)
zfs destroy Tank1/disabled

root@FreeNAS-02[~]#

Code:
root@FreeNAS-02[~]# zpool status
  pool: Tank1
state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tank1                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/64842ac5-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/651ac7fc-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/65c0f4a8-461f-11e9-8874-000c299addec  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/665a3562-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/66f69857-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/6791d65e-461f-11e9-8874-000c299addec  ONLINE       0     0     0
          raidz1-2                                      ONLINE       0     0     0
            gptid/68340556-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/68d1d92f-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/697f52a6-461f-11e9-8874-000c299addec  ONLINE       0     0     0
          raidz1-3                                      ONLINE       0     0     0
            gptid/6a1dde4a-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/6abbceb4-461f-11e9-8874-000c299addec  ONLINE       0     0     0
            gptid/6b7eb914-461f-11e9-8874-000c299addec  ONLINE       0     0     0
errors: No known data errors
 

titusc

Dabbler
Joined
Feb 20, 2014
Messages
37
You don't have a streaming workload. For what you want to do, IOPS is more important. The reason IOPS doubles between the two examples is that the number of vdevs doubles; it is completely unrelated to the number of disks in each vdev.
Yes, now that I think about it, that is true about IOPS being more important. Sorry, I thought it was 25GB per photo when I started this thread, and that figure had been stuck in my head, which was why I asked about sequential read speed instead of IOPS. Given it's only 25MB per photo, then yes, IOPS is what matters more.

I feel like you are trying to prove me wrong. I do this all day, every day, for my job. I have multiple servers with hundreds of drives running at work, and even my main home NAS has thirty drives in three pools. I don't know everything, but I used to run a photography business as a side job and I am familiar with Lightroom; I bought version 1 of the product when it first came out. I don't claim to know everything, but I know about this.
I am tired of talking to you about it.
Good Luck. Have fun. Do whatever you want.
I'm not sure why you think that way. Again, I asked about pure sequential read speed from the start and have been treating it, aside from surviving a double disk failure, as the single metric I care about. I thought the suggestion of using multiple vdevs was for achieving that goal. Then I came across that article, which suggests that even 1 vdev can outperform 2 vdevs, so I found it interesting. It does contradict what you said earlier, which is why I pointed it out, so we could talk about it; perhaps there are specific situations or reasons I'm not aware of. When you said I misunderstood what the article says, I thought that was odd. Again, at that point I only cared about sequential read speed, which was why I copied and pasted the 1 vs 2 vdev comparison so we could talk about it.

I'm here to ask questions. As far as I'm concerned, opinions count from anyone who has used and set up a ZFS system. If you have set up a lot of systems and do this every day, great. But even if you have only set up one system, that is great as well.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
A pool layout more conducive to IOPS; however, it certainly has its drawbacks as well ...
Very nice. The drawback, for anyone that doesn't already know: in a RAIDZ1 vdev there is only one drive of redundancy, so two drive failures in the same vdev and the entire pool is lost. It is a risk.

I have a server at work running 60 of the Seagate Exos 10TB drives and in the first six months, no failures. I am pretty happy with them.
We put a server online a couple years ago with 60 of the 6TB WD Red Pro drives and had five drives fail in the first year, three of which were in the first six months.
 

titusc

Dabbler
Joined
Feb 20, 2014
Messages
37
A pool layout more conducive to IOPS; however, it certainly has its drawbacks as well ...
What your test demonstrated is intriguing. I actually didn't understand what you were trying to show, despite knowing exactly what each of the commands you typed means, until I read the following on https://blog.programster.org/zfs-record-size:
If an application such as an Innodb database, requests 16K of data, then fetching a 128K block/record doesn't change the latency that the application sees but will waste bandwidth on the disk's channel. A 100 MB channel could handle just under 800 requests at 128k, or it could handle a staggering 6,250 random 16k requests.
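
The arithmetic behind that quote is straightforward (a quick sketch; the 100 MB/s channel is the blog's example figure, not a measurement of this pool, and it uses 1 MB = 1000 KB):

Code:
# Requests per second a 100 MB/s channel can carry at different request sizes,
# reproducing the numbers in the quoted blog example.
channel_kb_s = 100 * 1000
for req_kb in (128, 16):
    print(f"{req_kb:>4}K requests: ~{channel_kb_s / req_kb:,.0f} per second")
# 128K -> ~781/s ("just under 800"), 16K -> 6,250/s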
 

MDD1963

Dabbler
Joined
May 24, 2018
Messages
12
We put a server online a couple years ago with 60 of the 6TB WD Red Pro drives and had five drives fail in the first year, three of which were in the first six months.

Ouch...an 8+% failure rate in one year is quite abominable for what are supposed to be WD Pro drives...!
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Very nice.
  • Thanks - took a lot of 'shuckin between the two nodes ;)
The drawback, for anyone that doesn't already know: in a RAIDZ1 vdev there is only one drive of redundancy, so two drive failures in the same vdev and the entire pool is lost. It is a risk.
  • You are wise to call this out o/c and I would have expanded the "drawbacks" if I wasn't so tired when I authored that.
  • I would never advise anyone to run raidz (and almost posted raidz2 as a comparison, again I was tired), but in my case:
    • FreeNAS-01 = raidZ 3x4x10.0 TB | FreeNAS-02 = raidZ 3x4x10.0 TB (I hadn't yet started replication for the new disks, thus the "empty" comment)
    • Synced to 5 min via replication | Full offsite backup at 1 hour intervals.
  • As to the risk (for me) ... quite tolerable only given particulars mentioned above:
    • Loss of data < 5 min old = quite small | Catastrophic on-site total loss = even smaller | Catastrophic global total loss = incalculably small
I have a server at work running 60 of the Seagate Exos 10TB drives and in the first six months, no failures. I am pretty happy with them.
We put a server online a couple years ago with 60 of the 6TB WD Red Pro drives and had five drives fail in the first year, three of which were in the first six months.
  • As a tangent and since you presented the opportunity to ask (and unrelated to cited failure rate) => Do you bother to burn in Enterprise drives?
  • My IT experience is limited to being a hobbyist, and I would guess no, as (1) for a 10TB HDD, the 4 badblocks patterns book-ended by SMART extended tests take the better part of a week (rough time estimate sketched below), which probably introduces a deployment "lag" where cost > benefit, and (2) I assume part of what you pay for with an enterprise HDD at a higher price is a lower risk of a drive that won't pass burn-in (Exos non-recoverable errors per bits read: 1 per 10^15, MTBF 2.5M hours).
  • Ouch regarding the 3 drives - I suppose that speaks to infant mortality.
(Image: Backblaze "bathtub curve" drive-failure-rate chart)
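
For what it's worth, the "better part of a week" figure roughly checks out (a sketch under assumptions: ~200 MB/s average across the platter, badblocks -w doing 4 write+verify patterns, plus a SMART extended test before and after, each roughly a full-disk read):

Code:
# Rough burn-in time estimate for a 10 TB drive.
capacity_tb = 10
avg_mb_s = 200                                      # assumed average transfer rate
pass_hours = capacity_tb * 1e6 / avg_mb_s / 3600    # one full-disk pass
total_hours = pass_hours * (4 * 2 + 2)              # 4 patterns x (write + read) + 2 SMART long tests
print(f"~{pass_hours:.0f} h per pass, ~{total_hours / 24:.1f} days total")
# ~14 h per pass, ~5.8 days per drive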
 