Strange vdev performance

JaymzMac

Cadet
Joined
Mar 13, 2019
Messages
5
I'm seeing some strange behaviour for one of my vdevs and I wondered if someone here could shed some light on what is happening.

Here's my pool configuration:

Code:
zpool list -v

NAME                                     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
ex-freenas-01                            195T  91.6T   103T         -    28%    46%  1.00x  ONLINE  /mnt
  raidz2                                  65T  30.1T  34.9T         -    28%    46%
    gptid/1231bea4-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/12fd9bab-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/13c714d0-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/15d05b7c-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/17cac22c-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/19dd2db3-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
  raidz2                                  65T  30.7T  34.3T         -    28%    47%
    gptid/1b01046c-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/1be27e07-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/1cb75a2e-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/1d8901a8-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/1e5e6bea-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/1f3fc5c7-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
  raidz2                                  65T  30.7T  34.3T         -    28%    47%
    gptid/20554da7-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/21386d9c-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/2229a259-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/23172de4-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/24068c91-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/250f5938-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
log                                         -      -      -         -      -      -
  mirror                                 184G   174M   184G         -     0%     0%
    gptid/2641fb9e-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
    gptid/26cd51b6-54e1-11e8-a35e-ac1f6b8d5ba6      -      -      -         -      -      -
cache                                       -      -      -         -      -      -
  gptid/25b4e86c-54e1-11e8-a35e-ac1f6b8d5ba6   373G   357G  15.7G         -     0%    95%


For periods of time, the disks in the 2nd vdev are extremely busy and show high latency, while the other vdevs perform as expected.

As you can see from the following gstat output, the disks in the 2nd vdev are running at ~100% busy and the flush latency (the ms/o column) is over 250ms:

Code:
dT: 1.006s  w: 1.000s  filter: ^da[0-9]+$
 L(q)  ops/s    r/s     kB   kBps   ms/r    w/s     kB   kBps   ms/w    d/s     kB   kBps   ms/d    o/s   ms/o   %busy Name
    5    194     61     12    700   18.2    131     24   3107    0.7      0      0      0    0.0      2   45.6   35.1  da0
   11    274    138      8   1122   12.1    134     22   3015    0.2      0      0      0    0.0      2   21.2   42.6  da1
   14    268    109      6    708   15.4    157     21   3258    0.8      0      0      0    0.0      2   24.7   40.4  da2
    3    243     86      7    557   13.9    155     21   3198    0.2      0      0      0    0.0      2   41.0   32.3  da3
    3    298    136      7    891   15.6    160     20   3174    0.2      0      0      0    0.0      2   24.2   41.6  da4
    2    250     89     10    887   10.3    158     21   3290    0.2      0      0      0    0.0      2   24.1   34.4  da5
    -----------------------------------------------------------------------------------------------------------------------
    2    362    118     15   1786   47.2    242      7   1647    0.3      0      0      0    0.0      2  257.1   84.4  da6
    2    383    128     14   1794   56.2    253      7   1718    0.2      0      0      0    0.0      2  283.1   86.0  da7
    3    293    105     16   1683   97.6    186      8   1448    0.3      0      0      0    0.0      2  366.3   98.7  da8
    6    309    128     14   1806   65.9    179      8   1404    0.3      0      0      0    0.0      2  279.9  100.4  da9
    6    386    104     22   2291   99.5    279      7   1834    0.3      0      0      0    0.0      2  346.6   99.8  da10
    3    371    125     15   1889   67.7    244      7   1671    0.2      0      0      0    0.0      2  298.3   93.8  da11
    -----------------------------------------------------------------------------------------------------------------------
    1    362    189     22   4244    4.9    171     17   2828    0.2      0      0      0    0.0      2   25.5   33.7  da12
    1    290    126     24   3063   15.8    162     17   2812    0.4      0      0      0    0.0      2   34.7   40.0  da13
    1    240     73     15   1090   25.3    165     18   2900    0.2      0      0      0    0.0      2   33.1   32.6  da14
    1    200     48     28   1325   17.6    150     19   2788    0.2      0      0      0    0.0      2   33.3   27.7  da15
    1    290    147     29   4284   10.7    141     20   2788    0.3      0      0      0    0.0      2   28.5   40.2  da16
    1    498    327     14   4519    2.8    169     17   2916    0.3      0      0      0    0.0      2   33.9   35.4  da17
    -----------------------------------------------------------------------------------------------------------------------
    0   3152      0      0      0    0.0   1605     36  57103    0.1      0      0      0    0.0   1547    0.0   21.7  da18
    0   3152      0      0      0    0.0   1605     36  57103    0.1      0      0      0    0.0   1547    0.0   21.4  da19
    -----------------------------------------------------------------------------------------------------------------------
    0   1361   1190      4   4936    0.1    171    121  20759    0.7      0      0      0    0.0      0    0.0    6.4  da20

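For reference, the daN device names in the gstat output map back to the gptid labels shown in the pool layout; glabel should show that mapping, roughly like this:

Code:
# map the gptid labels used by the pool to the underlying daN devices
glabel status | grep gptid
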

I've made the following observations while experiencing the issue:
1. According to the gstat output, the IO size of the write operations averages around 6-8kB on the 2nd vdev, whereas on the other vdevs it is around 20-30kB.
2. The number of write operations to the disks in the second vdev is higher than to the other vdevs.
3. All the disks in the second vdev are running on average at 90-100% busy. The other disks are running at around 40-50% busy.
4. There is not one specific disk in the 2nd vdev which is causing the issue.
5. We're running with sync set to always, and when I run zilstat, there are very few operations between 4-32kB being written to the slog from the clients.
6. The issue normally lasts for a few hours before it disappears and the performance becomes uniform across all the vdevs again.

My theory is that the small write size is making the disks in the vdev extremely busy, and causing the high latency, slowing down the pool performance. But where are the small IOs coming from, if they're not coming from the clients (I can't see them in zilstat)? And why are they isolated to the 2nd vdev? Shouldn't it be spread across the vdevs assuming equal usage across the pool? Could it be some kind of internal ZFS workload that I'm not aware of causing the issue?
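One way to keep an eye on this at the vdev level (rather than per disk) would be something along these lines, which should show the per-vdev operation counts and bandwidth as they happen:

Code:
# per-vdev operations and bandwidth, refreshed every 5 seconds
zpool iostat -v ex-freenas-01 5
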

Appreciate any thoughts / ideas from the community.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Hardware? Was the pool built all at the same time, or were vdevs added later? What protocol? What workload?
 

JaymzMac

Cadet
Joined
Mar 13, 2019
Messages
5
Hardware:
- 2 x Intel Xeon E5-2620v4
- 128GB Ram
- LSI 9300-8i HBA
- Seagate EXOS 12TB capacity drives
- Seagate 1200.2 SSD 200GB for SLOG
- Seagate 1200.2 SSD 400GB for L2ARC
- 2x 10GBE Intel NIC

All VDEVs were created at the same time as the pool.
Running iSCSI to an ESXi environment.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
How is your SMART data looking for the 2nd VDEV? Error correction happens on read, so if there's somehow more work going on for that VDEV, that might be part of the explanation.

Maybe run a scrub?

output of zpool status -v would be interesting to see
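For reference, something along these lines (pool name taken from your zpool list output):

Code:
# start a scrub, then watch progress and the per-device error counters
zpool scrub ex-freenas-01
zpool status -v ex-freenas-01
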
 

JaymzMac

Cadet
Joined
Mar 13, 2019
Messages
5
I can't see anything that stands out for the devices in vdev2 in the SMART data. The number of errors corrected seems to be consistent across the board.
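To compare the counters across the disks in one go, a quick loop along these lines can be used (a Bourne-shell sketch; it assumes the data disks are da0 through da17, as in the gstat output):

Code:
# print the SCSI error counter log for each data disk
for d in $(jot 18 0); do
  echo "=== da${d} ==="
  smartctl -l error /dev/da${d} | grep -E '^(read|write):'
done
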

I haven't run a scrub in a while because we've been having some other issues and I wanted to rule out 'scrub' being the root cause.

Here's the output of zpool status -v:

Code:
zpool status -v
  pool: ex-freenas-01
 state: ONLINE
  scan: scrub canceled on Sun Jan  6 08:38:32 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        ex-freenas-01                                   ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/1231bea4-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/12fd9bab-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/13c714d0-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/15d05b7c-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/17cac22c-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/19dd2db3-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/1b01046c-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/1be27e07-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/1cb75a2e-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/1d8901a8-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/1e5e6bea-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/1f3fc5c7-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/20554da7-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/21386d9c-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/2229a259-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/23172de4-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/24068c91-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/250f5938-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
        logs
          mirror-3                                      ONLINE       0     0     0
            gptid/2641fb9e-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
            gptid/26cd51b6-54e1-11e8-a35e-ac1f6b8d5ba6  ONLINE       0     0     0
        cache
          gptid/25b4e86c-54e1-11e8-a35e-ac1f6b8d5ba6    ONLINE       0     0     0
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
The number of errors corrected seems to be consistent across the board
What does this mean? There should be zero; anything other than zero is a problem. You're not going to get much help here if you won't post the actual data. When you share it, people can help, especially with all these other problems you seem to be having. You also haven't run a scrub in two months; you should run one monthly.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Along the lines of my suggestion, you would need to be showing errors on that VDEV even without a scrub (they would show as CKSUM), so I guess we can rule that theory out based on your post.

It probably also rules out cabling or anything like that, so the question remains, why does the system hit that VDEV and not the others?

It's probably worth noting that files are not spread across VDEVs (as far as I recall from my reading), so could it be that you have one specific file (maybe a really large one) that's getting read or written a lot? Just a theory, but worth thinking about.

Maybe lsof can help to figure out which files exactly are involved.
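For example, something along these lines might show what's open under the pool (lsof isn't in the base system and would need to be installed; fstat is the built-in alternative):

Code:
# list open files under the pool mountpoint (can be slow on a large tree)
lsof +D /mnt/ex-freenas-01
# or, with the base-system tool, show files open on that filesystem
fstat -f /mnt/ex-freenas-01
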
 

JaymzMac

Cadet
Joined
Mar 13, 2019
Messages
5
What does this mean? There should be zero; anything other than zero is a problem. You're not going to get much help here if you won't post the actual data. When you share it, people can help, especially with all these other problems you seem to be having. You also haven't run a scrub in two months; you should run one monthly.

My intention is not to hide data from anyone, and I very much appreciate the help I've been given thus far. The reason I didn't provide any output was that I wasn't sure which values are 'normal' and which would indicate a problem. Here is the output for one of the devices in vdev2:

Code:
root@gs1203:~ # smartctl -a /dev/da7
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST12000NM0027
Revision:             E001
Compliance:           SPC-4
User Capacity:        12,000,138,625,024 bytes [12.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5009498855f
Serial number:        ZCH08RV50000J730Y296
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Mar 14 11:01:16 2019 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 42 of year 2017
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  213
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1323
Elements in grown defect list: 1

Vendor (Seagate) cache information
  Blocks sent to initiator = 1333266480
  Blocks received from initiator = 4269137512
  Blocks read from cache and sent to initiator = 3579705786
  Number of read and write commands whose size <= segment size = 369353500
  Number of read and write commands whose size > segment size = 8729674

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 7728.58
  number of minutes until next internal SMART test = 38

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   639567054        0         0  639567054         20      90842.588          20
write:         0        0         9         9          9      77032.904           0

Non-medium error count:    12156


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged


I thought the Errors Corrected value was high, but all the devices in my pool have similar values. I even checked a newly installed FreeNAS server (exactly the same hardware) and even there the values were high. It might be how Seagate reports these errors? (https://forums.servethehome.com/index.php?threads/sas-drives-with-high-ecc-corrected-errors.6960/)

Along the lines of my suggestion, you would need to be showing errors on that VDEV even without a scrub (they would show as CKSUM), so I guess we can rule that theory out based on your post.

It probably also rules out cabling or anything like that, so the question remains, why does the system hit that VDEV and not the others?

It's probably worth noting that files are not spread across VDEVs (as far as I recall from my reading), so could it be that you have one specific file (maybe a really large one) that's getting read or written a lot? Just a theory, but worth thinking about.

Maybe lsof can help to figure out which files exactly are involved.

We have 4 large zvols (device-based extents), but I'm fairly sure these are spread across the vdevs. Since FreeNAS is unaware of the filesystem (VMFS), it won't be possible to identify which .vmdk file is getting written to with lsof. But maybe it is as simple as one of our VMs residing only on that vdev and causing the issue?

Or could it be something to do with TRIM? I know ESXi is sending UNMAPs to FreeNAS to say which blocks can be released. Maybe when ZFS is releasing the blocks in vdev2, this is causing the disks to be very busy? Could this be a valid theory?
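If it helps to test this, I believe the FreeBSD 11 ZFS TRIM code exposes a few counters via sysctl; something like the following should show whether TRIM is enabled and how much delete activity ZFS is processing (sysctl names from memory, so treat this as a sketch):

Code:
# is ZFS TRIM enabled at all?
sysctl vfs.zfs.trim.enabled
# cumulative TRIM/UNMAP activity as seen by ZFS
sysctl kstat.zfs.misc.zio_trim
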
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Zvols are block storage, so you're right that FreeNAS knows nothing about what is being accessed in terms of file names.

Generally, blocks are levelled out across all VDEVs as they are written, so it's really a lottery as to what you may be doing that calls back a lot of the blocks written to one VDEV and fewer from the others. It's going to be hard to work out. Another way to look at it: each zvol is essentially just a large object in a dataset, so maybe we can at least narrow it down to which zvol is involved... then we could try to work out how the zvols were set up in terms of allocation (sparse or not) to understand whether ZFS would be trying to put all of one zvol on one VDEV.
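A starting point might be something like this, which should show how the zvols are laid out (a sparse zvol shows no refreservation):

Code:
# list the zvols with their size, block size and reservation
zfs list -r -t volume -o name,volsize,volblocksize,refreservation,used ex-freenas-01
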
 

JaymzMac

Cadet
Joined
Mar 13, 2019
Messages
5
While troubleshooting this issue, I noticed that our SLOG devices started receiving 'delete' ops at exactly the same time as the performance issue appeared for one of the vdevs:

Delete ops: [screenshot: Capture.JPG]

vdev disk write performance down from 8MB/s to 2MB/s: [screenshot: Capture.JPG]


In order to test my theory I need to identify which initiator is sending the TRIM commands.

Does anyone know how I can achieve this from the FreeNAS server? I've tried capturing the traffic with tcpdump and opening the pcap in Wireshark, but I can't see where the TRIMs are coming from...
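A possible refinement is to restrict the capture to the iSCSI port and then look for the SCSI UNMAP opcode (0x42) in Wireshark to see which initiator the deletes come from; roughly like this (the interface name here is only an example):

Code:
# capture iSCSI traffic for later analysis in Wireshark
tcpdump -i ix0 -s 0 -w /tmp/iscsi-unmap.pcap port 3260
# then look for SCSI UNMAP (opcode 0x42) commands in the capture
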

The delete ops are very large, as per the following gstat output:
Code:
L(q)  ops/s    r/s     kB   kBps   ms/r    w/s     kB   kBps   ms/w    d/s     kB   kBps   ms/d    o/s   ms/o   %busy Name
    0   6966      0      0      0    0.0   3535     22  78368    0.1      4 199563 797198   40.4   3426    0.1   46.9  da18
    0   6966      0      0      0    0.0   3535     22  78368    0.1      4 199563 797198   40.7   3426    0.1   43.6  da19


I thought I might be able to see this 800MB/s in trafshow, but no luck.
 