36-60 DISK TrueNas Setup - Multiple Q's

Code_Is_Law · Jun 1, 2022

Hello everyone,

First and foremost, thanks to everyone who is taking their time to assist me with some questions I have, appreciated.

So, let's jump right into it.

I have been experimenting with TrueNas 12.8 for a while now with some large scale systems (250-640TiB), and I believe that there are multiple things that I could set-up a lot better. The workload that I experience on these devices is the following:

Writes up to 1GB/s consistently throughout the day, not constant but peaks every 10 minutes.
Daily extremely high random-IO across all disks for 2 hours. Idling my writes during this period. (same hours every day - check mechanism)
Rsync (mirror) of the whole server to an offsite location. Sometimes interferes with the random-IO load process.

The current system details:

(2x) Xeon E5-2660 v3
128GB DDR4
(36x) 8 or 18TB SAS (Seagate or WD)
Raidz3 pool with 36 drives
4x 10G LAN (dedicated 10G towards the servers that do writes/IO/backup)
2x 480GB Sata (bootdrive)
Storage is used mostly for archival and backup.

My main concern is not being able to do the check mechanism due to load elsewhere, the disks should be prioritised to do these heavy random-IO tasks always. Is there any way to set this up properly? My other question is working with another layer of SSD's inside the NAS, for cache or ZIL or anything else that would benefit this setup. Also, Raidz3 with so many disks might not be ideal, what is the preferred pool size for the configuration above? The last question I have is about scrubs and deep S.M.A.R.T. test, how often would you do these and how to make sure they do not run during the random-IO load?

I am upgrading to a new server with a 60 bay JBOD attached next month and it would be best to set this NAS up properly at once. Please let me know what you would change about the configuration, everything is acceptable. Main concern is setting up the VDEV/ZFS correctly to achieve a reasonable IO and not loosing most of my capacity. Server will be fully backed up eventually, so mirrors are not needed. Same load and usage as the server mentioned above.

(2x) AMD EPYC 7413
512GB DDR4
480GB Micron 5300 PRO
Mellanox 100GbE Adapter ConnectX-6 Dx (Support?)
Broadcom HBA 9500-8e
(60x) 18TB SAS (more JBOD added later)

Thanks for assisting so far, love to chat more with you about TrueNas 13 also!

sretalla · Jun 1, 2022

I would start with looking into some of the topics here:

Performance and Tuning — OpenZFS documentation

openzfs.github.io

Code_Is_Law said:
Raidz3 with so many disks might not be ideal, what is the preferred pool size for the configuration above?

Pool size isn't so much the question, but over 16 disks in one VDEV is an issue, mostly for resilvering, as is mentioned in the tuning link. Multiple VDEVs would be better in a few ways, but would cost you additional disks of capacity to redundancy... maybe worth considering RAIDZ2 for multiple VDEVs instead to keep that number down.

Code_Is_Law said:
My other question is working with another layer of SSD's inside the NAS, for cache or ZIL or anything else that would benefit this setup

SLOG doesn't seem to be a good thing for you as you make no mention of Sync Writes... but you could check that with arc_summary (at the bottom of the output)

You may find improvement with a couple of things mentioned in this thread

ZFS "ARC" doesn't seem that smart...

Without yet resorting to adding an L2ARC, is there a "tuneable" that I can test which instructs the ARC to prioritize metadata? I upgraded from 16GB to 32GB ECC RAM. Yet there is zero change in this behavior. This keeps happening: I run regular rsync tasks from a few local clients, which is...

www.truenas.com

Using L2ARC in metadata only for some or all of the datasets might improve performance of the rsync tasks (and the link above refers to tuning ARC to not eject metadata so quickly to improve on that further).

Overall, arc_summary will help you to see if L2ARC may help or not, but you'll also need to think about how much of your data is really repetitive reads.

Code_Is_Law said:
The last question I have is about scrubs and deep S.M.A.R.T. test, how often would you do these and how to make sure they do not run during the random-IO load?

Schedules are a thing for those, so I guess that's your answer on how to avoid things, but there's no getting around the need to eventually SCRUB (which will take a really long time for your enormous pool, probably more than 24 hours) and long SMART tests (I schedule those for odd numbered disks every 2 weeks and even numbered disks the other 2 weeks... primarily to reduce/distribute the heat load). Generally, with that many disks, you shouldn't need to be too worried about spreading 1GB/s of data across the pool even when running those tasks.

Code_Is_Law · Jun 1, 2022

Thanks for your answers Sretalla, I will read the posts you linked and come back to you with new insights later today.

Code_Is_Law · Jun 1, 2022

So, talking about the VDEV options, a change to 3 x 12 disks with RAIDZ2 inside a 36 bay chassis would benefit me in what exactly?
Yes, resilvering would be a lot quicker and I will have slightly less capacity. But how about IO, reads and writes with 3 x 12 disk VDEVs?

sretalla · Jun 1, 2022

Code_Is_Law said:
So, talking about the VDEV options, a change to 3 x 12 disks with RAIDZ2 inside a 36 bay chassis would benefit me in what exactly?

3 VDEVs is 3x the IOPS (you currently get the IOPS of a single disk only).

Code_Is_Law said:
But how about IO, reads and writes with 3 x 12 disk VDEVs?

As mentioned, IOPS will be tripled (more-or-less).

Read/write speeds (throughput) may not change, but I wouldn't expect worse than current, maybe better, but remains to be seen/tested (I don't see many published test reports at those numbers)

Code_Is_Law · Jun 1, 2022

Trying to google this but can not find a lot of clues due to "why" a single VDEV has only the IOPS of a single disk, can you explain?

Code_Is_Law · Jun 1, 2022

A little more insights in this random IO task that we need to do daily.

This load is serial, not parallel. We are looking at big 32GB files where we need to read chunks of 4096 Byte (those differ day by day) and we are doing this file by file (not multiple at the same time)

Our tests show better IO performance on 36 disk VDEV compared to 10 disk VDEV. No clue why yet.

It seems to me that during these random IO times the disks are able to respond by giving the process single blocks from disk while not having to do a 36 disk rebuild of the original data (for example a 30G movie file). The disks seem to be fine with sending back 4096 Byte of data to the checking process with each using their own IOPS.

Can I share screenshots of processes or disk load? I can not seem to find any weird or odd statistics thus far. Let me know.

sretalla · Jun 2, 2022

Code_Is_Law said:
Trying to google this but can not find a lot of clues due to "why" a single VDEV has only the IOPS of a single disk, can you explain?

Without knowing exactly how far down the rabbit hole you want to go, maybe this would be a good start: https://www.reddit.com/r/zfs/comments/fd1ou7/is_this_true_one_vdev_one_drives_iops_why_or_why/

You can, of course, do some research right down to the lines of code (since OpenZFS is open source) if that's what you really want.

Short answer is the way that transactions are grouped and sent to the disks essentially means you're always waiting for the slowest disk in the VDEV to finish the transaction.

sretalla · Jun 2, 2022

Code_Is_Law said:
Our tests show better IO performance on 36 disk VDEV compared to 10 disk VDEV. No clue why yet.

There may be some logic there as you're talking about throughput and sequential reads.

I think RAIDZ3 is optimal in VDEVs of 9... (https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz), so maybe with 36, you're at 4x that and somehow the same.

As suggested in that article, look at your recordsize also.

SweetAndLow · Jun 2, 2022

You should do some testing with 36 drives setup in a strip, this will show you the best performance from the disks. Then test with mirrors and then with 6 vdevs in raidz2. Then finally your larger vdev setups. The more vdevs you get the better your performance is going to be assuming iops are usually what help the most. I think this test will give you some insight into your workflows and what you actually need.

In my 24 drive system I have 3 vdevs with 8 disks in raidz2 and I wouldn't go larger than 8 disks. Personal preference and performance needs/wants.

Important Announcement for the TrueNAS Community.

36-60 DISK TrueNas Setup - Multiple Q's

Code_Is_Law

Cadet

sretalla

Powered by Neutrality

Performance and Tuning — OpenZFS documentation

ZFS "ARC" doesn't seem that smart...

Code_Is_Law

Cadet

Code_Is_Law

Cadet

sretalla

Powered by Neutrality

Code_Is_Law

Cadet

Code_Is_Law

Cadet

sretalla

Powered by Neutrality

sretalla

Powered by Neutrality

SweetAndLow

Sweet'NASty

Similar threads