new system for use in schools

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Hi All

I'm a network manager who is more accustomed to working within a Windows environment. We have a requirement to increase our storage provision, but additional Nimble boxes are prohibitively expensive, so I have been looking at other options that could operate in a Windows environment running a Hyper-V failover cluster. Our existing storage does not have the capacity or network throughput to cope with current demand.

I was looking to repurpose a Dell PowerEdge R620 that was previously used as a virtualisation host - it has 384GB RAM and dual Xeon E5-2620 CPUs. The server has 4 x 10GbE NICs and an HBA. In addition, I am debating repurposing 3 Dell PowerVault MD1220 disk shelves, each with 24 x 600GB 10K SAS drives.

I've installed FreeNAS on the server and configured 2 on-board SATA drives in a mirror for the system. In addition, I have created a pool of 12 RAIDZ2 vdevs of 6 x 600GB SAS drives each, and added 2 x 250GB SSDs for L2ARC and 4 x 250GB SSDs as a mirror for the LOG. The network cards are configured for MPIO and the server is set up for iSCSI.

I have one Windows server connected via iSCSI and all appears to work "okay-ish" until we start to move data around. After 100GB or so, the copying halts and the server hangs. It eventually recovers, but I cannot fathom why this keeps happening.

Any help or advice would be most welcome.

Sincerely
 

Attachments

  • freenas CPU.jpg
  • pool.jpg
  • NICS.jpg

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Hi Damien (if I can presume the name)

There's going to be a bit to unpack here, so bear with me.

The R620, depending on its configuration, can be a very good or very bad candidate to be the "head unit" for a system including those MD1220 enclosures, but there may be some configuration changes needed, and potentially new hardware. Key questions are in bold.

Looking at your screenshots, it looks as if that system does not actually have an HBA (or if so, it isn't being used as one/doesn't have the right firmware) as your cache/log devices are exposed as mfisysXXX which indicates they're being provided by the mfi driver (LSI MegaRAID).

What is the exact model of the HBA/RAID controller in the R620, and is it installed in the "Mini/Integrated" slot, or regular PCIe?
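If it's easier than opening the chassis, a rough way to check from the FreeNAS shell is to see which driver attached to the card - this is only a sketch, not a definitive test:

Code:
# which storage driver claimed the controller: mfi = MegaRAID firmware, mps/mpr = LSI IT-mode HBA
dmesg | grep -iE 'mfi|mps|mpr'
# full PCI device listing to confirm the exact controller model
pciconf -lv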

The model of SSD is also very important here - SLOG devices should have power-loss protection that lets them ignore or quickly respond to a cache flush; most datacenter-class drives have this, and virtually no consumer-level drives do.

What is the exact manufacturer and model of the SSDs being used for cache/log?
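If pulling a drive to read the label is a pain, smartctl can usually report the model and serial from the shell - a sketch only; the exact device path depends on how your controller presents the disks:

Code:
# adjust the device name to match how the SSD shows up in your system; disks hidden
# behind the MegaRAID firmware may need smartctl's '-d megaraid,N' device type
smartctl -i /dev/da20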

We'll want to take a look at the network topology as well. Depending on the answer to the HBA question, you may need to use a regular PCIe slot in your system to connect to the internal drives.

Regarding your potential causes of slow performance:

1. The drives are being obfuscated by a RAID card, and you're eventually blowing out its write cache. This won't help.
2. Your SLOG devices aren't likely being used at the moment (the iSCSI default for sync is "don't bother") - a quick check is sketched below.
3. RAIDZ2 is not an optimal configuration for virtual machine hosting - strongly prefer mirrors for random access. I know this costs you space, but the performance advantage is significant.
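As a quick check on the second point, the sync policy is just a ZFS property on the zvol backing your iSCSI extent - a rough sketch, with "tank/iscsi-vol" standing in for your actual pool/zvol name:

Code:
# show the current sync policy on the zvol backing the iSCSI extent
zfs get sync tank/iscsi-vol
# force every write through the log device (only worthwhile once a proper SLOG is in place)
zfs set sync=always tank/iscsi-vol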
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Hey, thanks for the reply.

The HBA in use is a Dell 12DNW, which is a PCIe card. The R620 is making use of an internal PERC H310 in non-RAID mode. This is driving the system drive and 6 x 250GB SSDs.

The SSDs are some consumer Kingston drives that we had lying around - that said, I will happily replace them with something more suitable if the setup can be made to work well and reliably.

I'm happy with any and all suggestions that will improve performance, as this does promise to be a rather cost-effective solution.

Sincerely
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thanks for following up.

The external HBA (Dell 12DNW) translates to the Dell H200e - this will work fine, as it's based on the LSI SAS2008 chipset. You'll likely want to perform a cross-flash to the official LSI firmware by following the instructions here:

https://www.ixsystems.com/community/threads/crossflash-dell-h200e-to-lsi-9200-8e.41307/

Regarding the internal HBA (PERC H310): if that is the "PERC H310 Mini/Mono" or "Integrated" type - basically, if it's physically anything other than the standard PCIe model - it unfortunately won't be able to be flashed to the official LSI firmware. (Don't try it - the card will brick itself.) Where this throws a wrench into things is that the stock Dell firmware on that card is limited to a very low adapter queue depth (25, I believe). The log workload does not really do queued I/O (it wants to flush to stable storage ASAP), but the cache workload does, so you wouldn't be able to have the cache drives on that adapter. I'd recommend changing it to a regular PCIe H200/H310 (flashed to LSI IT firmware, of course) if you have a slot available.

Can you tell me what is in each of your system's PCIe slots, and how many are available? I know that 10GbE is an option for the onboard networking, and the mention of 4x10GbE in your opening post would suggest that you have 2x10GbE onboard and 2x10GbE from an add-in card. The 10x SFF chassis does mention that there should be 3x PCIe slots, but I'm uncertain if you have that model.
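If it helps while you're checking, pciconf from the FreeNAS shell will at least list the controllers and NICs that are present (it won't show empty slots, so a look inside the chassis is still worthwhile):

Code:
# list PCI devices with vendor/device strings; look for the SAS controllers and NICs
pciconf -lv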

Your Kingston drives, depending on their type, could be used for cache; but they are definitely no good for SLOG. And since you're talking about multiple 10GbE links, you're well beyond the point where SAS/SATA will keep up. You'll definitely want an NVMe card in a PCIe slot (Intel Optane or Intel P3700 - check my signature link about SLOG benchmarking for an idea of the performance numbers)

But unfortunately now you're at the point of wanting three expansion cards:

1. External HBA
2. Internal HBA
3. NVMe SLOG

And if your chassis only has 2 slots ... well, we have a problem, and something will have to be compromised. This might be a rare situation where I would consider saying "use a Perc H710 mini-mono, disable its write cache, and use it only for the boot device and L2ARC/cache" but I'd still rather have the answer be "I have 3 PCIe slots, I can do all of that."

Assuming we find a solution to the PCIe slot problem, my suggested layout is quite simply "36 vdevs of 2-way mirrors" - that will give you maximum random performance. Attach the NVMe SLOG, set sync=always on the volumes, and because the default volblocksize for iSCSI is 16K, everything should flow merrily through the attached log device.
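To make that layout concrete, here's a rough sketch of the pool shape from the command line - the device names and zvol name are placeholders, and in practice you'd build this through the FreeNAS GUI and repeat the mirror pairs across all of the MD1220 bays:

Code:
# striped 2-way mirrors; da0..da71 stand in for the shelf disks
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5
# once the NVMe SLOG is attached, push all iSCSI writes through it
zfs set sync=always tank/hyperv-zvol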
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Hey

You are correct, the H310 is the Mini internal variant and not a full-length PCIe version. In relation to the PCIe slots: the system has two, but an alternative PCIe riser may be available that would offer two half-height cards - I'm not sure at the moment.

As it stands, the system has one full-height slot with the HBA inserted and one half-height slot with an Intel X520-DA2 NIC inserted.

We could remove the Intel NIC in favour of an Intel P3700, reducing our network ports to 2. There appear to be a number of Intel P3700 options in relation to capacity; would capacity matter?

Thanks for your help and support with this, it is very much appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I would suggest that removing the X520 and sticking with the onboard 10GbE ports is the way to go; in a sync-write setting you're far more likely to be bound by your SLOG performance.

For the P3700 capacity question: larger capacities do gain performance at larger record sizes (and have better endurance); have a look at the graph in this post comparing the consumer-equivalent Intel 750 drives at 400G vs 1.2T:

https://www.ixsystems.com/community...inding-the-best-slog.63521/page-7#post-487107

Given that you will, I assume, be running Hyper-V VMs from this, the majority of your I/O will be smaller recordsize, so look at the early part of the chart showing 4-16K.

With the P3700 being on the purchase list, it's also worth ensuring it is converted to 4K native sector size, and under-provisioning the size of it down to 100G (or possibly even less) to improve consistency of writes and extend the endurance.
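Once the drive is installed, you can confirm which LBA format it's actually using (before and after the conversion) - a sketch, assuming it shows up as the first NVMe controller in the system:

Code:
# namespace details include the supported LBA formats and the one currently in use
nvmecontrol identify nvme0ns1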
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Okay, so, I've been trying to get hold of a new P3700, but they are end of life and I don't fancy any of the second-hand options. I have been offered an alternative, the Intel SSD DC P4800X series - does this look like a suitable alternative to the P3700?
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Hi - I now have an H710 Mini, but it won't work in a non-RAID mode - what's the best way of configuring this? (The P4800X will be with us on Monday.)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Hi - I now have an H710 Mini, but it won't work in a non-RAID mode - what's the best way of configuring this? (The P4800X will be with us on Monday.)
Go into the H710 BIOS and disable the controller write cache (just for the safety of your boot pool)

Then create individual RAID0 drives for each attached disk, and install FreeNAS on a mirror in regular ZFS mode. Add the other SSDs as L2ARC/cache devices after you create the pool.

Remember to only attach boot or L2ARC/cache devices to this controller. Don't connect a log or any data disks.
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
So a nice new shiny P4800X turned up today. As already mentioned, we will be using this pool to serve VMs running in Hyper-V - would you have any recommendations on how best to provision the P4800X?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
You'll need to temporarily install the Optane card into a Linux or Windows system and follow the guides in this Intel article to format your P4800X to use a 4K sector size:
https://www.intel.com/content/www/u...6238/memory-and-storage/data-center-ssds.html

You can then install the P4800X in your FreeNAS machine (during an appropriate downtime window) and use the following commands to create a 16GB partition on it, assuming nvd0 is the P4800X (if it's the only NVMe device in the system, it should be)

Code:
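# lay down a GPT partition table, then add a 16GB, 1MiB-aligned partition labelled 'sloga'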
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -a 1m -l sloga -s 16G nvd0


Find the gptid of the new partition:
Code:
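# list GPT labels so the gptid of the new partition can be identified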
glabel status | grep nvd0


And then finally, add it to the pool as a log device:
Code:
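# attach the new partition to the existing pool as a dedicated log (SLOG) device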
zpool add POOLNAME log gptid/big-long-gptid-goes-here


Once that's done you'll need to set sync=always on the iSCSI ZVOLs that you're presenting to the Hyper-V cluster, otherwise the SLOG will sit idle. RAIDZ2 still isn't an optimal way to configure the underlying vdevs - unfortunately changing this would require you to either have free drives (which it sounds like you don't have) or to destroy the pool and create it anew with mirror vdevs.
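Once sync=always is set, an easy way to confirm the SLOG is actually being hit is to watch per-vdev activity while the Hyper-V hosts are writing - the log device should show steady write traffic:

Code:
# refresh every 5 seconds; the log section should show non-zero write bandwidth under load
zpool iostat -v POOLNAME 5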
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Thank you for all of your help and support with this.

I will be moving all of the data off the drives today and recreating the pool based on 47 mirrors and 2 drives as spares (using the MD1220s). The PE R620 will have FreeNAS reinstalled with 2 drives for the OS; the remaining 6 bays will be populated with 250GB SSDs for L2ARC running from the PERC H710 Mini with write cache set to write-through. I'll install the P4800X after it's been formatted for 4K.

I think my only remaining question on this is: why create a 16GB partition on the P4800X?
 

ethereal

Guru
Joined
Sep 10, 2012
Messages
762
With the P3700 being on the purchase list, it's also worth ensuring it is converted to 4K native sector size, and under-provisioning the size of it down to 100G (or possibly even less) to improve consistency of writes and extend the endurance.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I think my only remaining question on this is: why create a 16GB partition on the P4800X?

ZFS has a limit on the amount of "dirty" data it will hold in RAM/SLOG - the default tunables put this at 4GB. You can adjust it upward, but it does expose you to a little more risk, as that's data that isn't yet committed to the pool. Limiting the partition size allows the SSD to use the remaining space for wear-leveling; it can merrily program new, empty pages and then tidy up the old ones later when there's a lull in writes. It's less critical on Optane, since it uses 3D XPoint rather than NAND, but it's still good practice.
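For reference, the limit described above is exposed as a tunable, so you can check (or cautiously raise) the current value from the FreeNAS shell:

Code:
# current dirty data ceiling in bytes (capped at 4GB by default)
sysctl vfs.zfs.dirty_data_max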
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Over the weekend we had two drives report an increase in their "self-test long error count" (see Alerts Error.jpg for an example). I ordered some replacement drives that came yesterday and, in line with information I could find online, went to "Disks" to check the serial number of the failing drive (so I could compare it to a spreadsheet detailing its location). I then went to Multipath to document the disk and paths before going to "Pool Status" to take the drive offline (in this instance DA177). The replacement drive was inserted into the same drive bay as the faulting drive. I checked "Disks" to see what "Name" was associated with the new drive's serial number (it was again DA177), returned to "Pool Status" and selected the "Replace" option for the offline drive. In the "Replacing disk" dialogue, I selected the member disk DA177 and clicked Replace, but got the error "Select a valid choice. Da177 is not one of the available choices" (see replace disk error.jpg).

Am I doing something wrong or missing a step or two?
 

Attachments

  • replace disk error.jpg
  • Alerts Error.jpg
  • Multipath.jpg
  • Pool Status.jpg
  • Ssheet.jpg

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Turns out that going through the same process but using the legacy interface works perfectly well; it must be a bug in the new interface.
 

damienginty

Dabbler
Joined
Jan 23, 2019
Messages
15
Hi All

I'm just looking for a bit more advice in relation to the above.

If we wanted to improve the iSCSI write speeds to the FreeNAS array, would we best achieve this by adding another Intel P4800X SSD to the pool? I realise that this would require changing the head unit from the PE R620 to a PE R720 - if we were to look at this, what other components would be best to look at, e.g. an additional HBA?

Any help or advice would be very much appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Since this thread started, there's been a major development - the "Mini/Mono" PERC H310 cards can now be flashed with the LSI IT mode firmware. However, it is not an easy process based on the comments in this thread ( https://www.ixsystems.com/community/threads/dell-perc-h310-mini-mono-flashing-to-it-mode.75956/ ) and to change it now would require you to reinstall FreeNAS. Probably best to stick with your current config until you are prepared for a rebuild.

With regards to your write speeds: what are you presently achieving, and how full is the pool? Do your write speeds start fast and then slow down after a short period? Your vdevs may not be capable of keeping up with the network ingest speed; while you do have a lot of them, and they are 10K SAS, they are still spinning disks.

You have a fast network and a fast SLOG (assuming the P4800X is working as expected) so the next link in the chain, barring network issues, is the vdev speed.
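If you want to put numbers against that before buying anything, watching the pool during one of the slow copies will show whether the spinning vdevs are the choke point - something like:

Code:
# pool capacity and fragmentation at a glance
zpool list POOLNAME
# per-vdev throughput every 5 seconds while a large copy is running
zpool iostat -v POOLNAME 5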
 