Disk Latency in mirrored ssd pool

srit · Jan 11, 2017

Hello. First time poster here on this forum. I've been using freenas on and off for quite a few years and moved my home storage array to freenas full time now as my sole array.

Looking for some help with figuring out what or even if something is in fact wrong with one of my pools. If i posted in the wrong section, i apoligize in advance.
The pool in question, VM_T1_SSD, is made up of 4 mirrored vdevs. The drives are 8 intel 530 SSD's serving 2 vm hosts.
The box has Dual Intel Xeon E5645 2.4Ghz 6 core CPUs and 32GB Ram the ssd's are connected to the motherboard through an avago 9211-8i and the backplane is sas2
The vmhosts are connecting to freenas using 2 1gbe pipes configured using mpio
The pool has 49% free space, compression = lz4, dedupe = disabled, compression ratio 1.52x, pool is healthy.
cpu's are always at about 90% idle, system load almost always around 1.54, 1.30, 1.09, network nic avg's for each approimately tx 22.6M rx 4.5M
iSCSI Read avg 5.6M/min, Write 1.3M/Min. tested with and without jumbo frames.
There are a few more other disk pools which can be found in my signature if that is needed for further diagnostics. Those other pools are giving the epected latency, iops and bandwith.

The issue im having is that i noticed on the vmhosts that the datastore was experiencing latency at times in the 100-600ms range with averages about 2-4ms read and 1-4ms write. I noticed this using performance datastore and esxtop d, DAVG with high stats. DAVG being high points to the storage array as being the culprit according to all vmsodc i could find. everything else on the esx hosts is normal. during these times on the freenas box the disks were passing under 40i/o's and less than 40KBps per disk. I noticed this while using gstat in 1s, 2s, 5s, 10s , and 60s interupts. i also saw similar results with the pool using zpool iostat poolname with the same increments. With gstat i saw the per disk latency going going in the 100+/ms per SSD disk. to me this does not make sense since those disks can handle way more io's and bandwith before the latency should be an issue. I know 2-4ms is nothing tho sneeze at and hoping for sub 1ms is the unicorn but im hoping:)

Just to throw it in there, the network switch is configured to best practices and is extremely under utilized.

My first thought is more memory based on all the posts I've read. but at io/s and bandwith that low it does not make sense for it to be more cache needed. They are SSD's
Hoping someone can help point me in the right direction on further figuring out whats causing the disk latency and how to improve it.

If it helps below are my arc stats. i know the host hasent been up for too long but the stats arent much different then what it was when it was up for a month.

8:37PM up 1 day, 23:07, 1 user, load averages: 1.47, 1.21, 1.04
14.4TiB / 43.5TiB (Pool1)
48.9GiB / 7.25TiB (Pool2)
482GiB / 1.35TiB (Pool3)
432GiB / 880GiB (VM_T1_SSD)
437GiB / 1.08TiB (VM_T3_10K_HDD)
1.89GiB / 14.9GiB (freenas-boot)
26.55GiB (MRU: 6.89GiB, MFU: 19.66GiB) / 32.00GiB
Hit ratio -> 82.22% (higher is better)
Prefetch -> 27.64% (higher is better)
Hit MFU:MRU -> 65.13%:32.38% (higher ratio is better)
Hit MRU Ghost -> 2.29% (lower is better)
Hit MFU Ghost -> 1.76% (lower is better)

cant think of any more stats to provide of hand

Thanks,

bigphil · Jan 11, 2017

I wonder if it's related to the trim issue on PCIe SSD devices. You should try disabling OS trim support and see if it helps. See this bug report for info.

srit · Jan 11, 2017

bigphil said:
I wonder if it's related to the trim issue on PCIe SSD devices. You should try disabling OS trim support and see if it helps. See this bug report for info.

Thank you for the quick reply. never done that before, ill search the forum for instructions on adding that and get back when its done with the results.

bigphil · Jan 11, 2017

srit said:
Thank you for the quick reply. never done that before, ill search the forum for instructions on adding that and get back when its done with the results.

In the GUI > System > Tunables > Add Tunable button >

srit · Jan 11, 2017

bigphil said:
In the GUI > System > Tunables > Add Tunable button >

View attachment 15362

Thanks. Does it require a reboot?

srit · Jan 11, 2017

never mind just finished reading the guide and testing. it requires a reboot. I will get back as soon as i can do the reboot. probably tomorrow sometime. I appreciate the help. :)

bigphil · Jan 11, 2017

Just an FYI, this may need to be set as a "loader" type and not "sysctl." This bug report suggested the user added it as a loader. The FreeBSD zfs tuning guide says this, "Values set in /etc/sysctl.conf are set after ZFS pools are imported" so probably best as a loader and the system rebooted.

srit · Jan 12, 2017

I changed it to loader as mentioned and rebooted. according to sysctl vfs.zfs.trim.enabled it is now disabled.

I will post an update in a few hours with an update to the performance.

srit · Jan 12, 2017

at this point i need to say there is no difference with or with out the trim disabled.

mav@ · Jan 12, 2017

I don't know whether this is your case, but I have several Intel 530's is my test lab, and I am also unhappy about their write/flush latency. I have set of older 520's in the same system, and while their top speed is slightly lower, their latency is a way better. I'd say it is just a bad series, that may be fine for desktops, but does not work well on synchronous workload. :(

srit · Jan 12, 2017

at this time i can confidently say there is definitely no improvement with the tuneable change bigphil recommended i test. Do you or anyone else have any other ideas? I am continuing to dig through the forums and the great master google for anything else i can find regarding this issue.

@mav
Thank you for that insight. have you ran any iometer tests and check the disk latency while its running? interestingly enough while i ran one test tuned for what i believe is my pool use case, i noticed that i was able to hit around 19000 iops consistently (shown with iometer and gstat) with minimal change to the latency on the pool disks. that is pretty odd to me. i would expect the latency to jump up while i am hitting higher iops. it isnt at times it looks to be lowering the latency while those higher iops are achieved.

has anyone else with these disks notices any similar issues? id hate to just blame the disks without trying to troubleshoot the issue a bit further. That is unless you or someone else has already done deeper testing and know that the disks are definitely the issue.

Holt Andrei Tiberiu · Jan 18, 2017

Backup your data first, do a FW update on the drives, if there is any available an see what happenes. all of the drives should be on the same FW ( Firmware )
I'm not saying that Intel SSD's are no good, but i never had a pleasant experince with them, for me they didn't work.

tvsjr · Jan 18, 2017

Holt Andrei Tiberiu said:
I'm not saying that Intel SSD's are no good, but i never had a pleasant experince with them, for me they didn't work.

Err? Considering Intel owns a large part of the data center SSD space, the market would seem to disagree with you.

Holt Andrei Tiberiu · Jan 18, 2017

tvsjr said:
Err? Considering Intel owns a large part of the data center SSD space, the market would seem to disagree with you.

Ithat for me intel ssd's did not work.
Sorry that i ofended you, as i gues you are a fan.
I do not care about who won the ssd space in datacenters, thoese ar not running on user built systems and free operating sistems.
And i am sure that they do not use 520 series

Ferrari is a premium car brand, Fist is not, but both are part of the same manufacturer group.

srit · Feb 2, 2017

Holt Andrei Tiberiu said:
Ithat for me intel ssd's did not work.
Sorry that i ofended you, as i gues you are a fan.
I do not care about who won the ssd space in datacenters, thoese ar not running on user built systems and free operating sistems.
And i am sure that they do not use 520 series

Ferrari is a premium car brand, Fist is not, but both are part of the same manufacturer group.

Thank you all for the replies. I was away for a bit and couldn't get to this until now. I recently upgraded my Freenas to 262 GB ram and let it run for 7 days to see if there is any improvement. There has not been any improvement. Watching the SSD disks for its i/o's on the vm disk pool i am still seeing the same latency within vmware pointing to freenas as being the culprit. This is while i am seeing almost no read i/o's in freenas on the SSD disks which tell me it is almost all cached. If i am wrong about that feel free to tell me so.

If anyone has any ideas as to where i should look next i am all ears. If it helps i have posted my most recent arcstats below.

8:42AM up 7 days, 1:27, 1 user, load averages: 0.21, 0.12, 0.09
14.9TiB / 43.5TiB (Pool1)
1.08TiB / 7.25TiB (Pool2)
549GiB / 1.80TiB (Pool3)
437GiB / 880GiB (VM_T1_SSD)
436GiB / 1.08TiB (VM_T3_10K_HDD)
1.89GiB / 14.9GiB (freenas-boot)
218.28GiB (MRU: 56.71GiB, MFU: 161.57GiB) / 256.00GiB
Hit ratio -> 87.65% (higher is better)
Prefetch -> 34.09% (higher is better)
Hit MFU:MRU -> 78.66%:18.99% (higher ratio is better)
Hit MRU Ghost -> 1.15% (lower is better)
Hit MFU Ghost -> 0.53% (lower is better)

srit · Feb 3, 2017

So something i notice when dropping the gstat refresh rate to 800ms is that the spikes in esx correlate to when a single disk in the SSD pool has a large latency spike. Not sure why its spiking from time to time on a single disk in the pool especially since the spike happens to different disks randomly and usually to a single disk. Not multiple disks.

tvsjr · Feb 3, 2017

I wonder if that drive has a problem? SSDs fail in weird ways, which aren't common to their spinning rust counterparts. Do you have a spare you could swap with?

srit · Feb 3, 2017

tvsjr said:
I wonder if that drive has a problem? SSDs fail in weird ways, which aren't common to their spinning rust counterparts. Do you have a spare you could swap with?

Thank you or the responce.
I guess i wasn't clear so let me elaborate. The issue im having is on random ssd disks its not a specific one thats having the spikes. for example, now it can be disk one spiking in 30 seconds it can be disk 9 then 3 minutes later it can be disk 4, etc. its not any specific disks or vdev within the pool. its totally random in the sense i cannot figure out the logic behind the spikes.

Hope that clears up the miscommunication.

Jim Streit · Feb 14, 2017

Did you ever get a resolution to this problem. I'm experiencing very similar results with my SSD pool that I don't experience with my HDD pool.
Thanks.

srit · Feb 16, 2017

Jim Streit said:
Did you ever get a resolution to this problem. I'm experiencing very similar results with my SSD pool that I don't experience with my HDD pool.
Thanks.

Nope. Are you using the same SSD's as i am?

Important Announcement for The TrueNAS Community.

Disk Latency in mirrored ssd pool

Dabbler

Patron

Dabbler

Patron

Dabbler

Dabbler

Patron

Dabbler

Dabbler

iXsystems

Dabbler

Contributor

Guru

Contributor

Dabbler

Dabbler

Guru

Dabbler

Cadet

Dabbler

Similar threads

Important Announcement for The TrueNAS Community.