Greg10
Dabbler
- Joined
- Dec 16, 2016
- Messages
- 24
I've been building a new FreeNAS iSCSI SAN and I'm running into a problem: iSCSI connections drop and don't recover after the system has run under load for several hours.
The console is filled with the following text:
Code:
Jan 21 03:16:23 san1 daemon[6448]: 2018/01/21 03:16:23 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:18:53 san1 daemon[6448]: 2018/01/21 03:18:53 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:21:23 san1 daemon[6448]: 2018/01/21 03:21:23 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:23:53 san1 daemon[6448]: 2018/01/21 03:23:53 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:26:23 san1 daemon[6448]: 2018/01/21 03:26:23 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): waiting for CTL to terminate 1 tasks
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): tasks terminated
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): waiting for CTL to terminate 1 tasks
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): tasks terminated
Jan 21 03:27:51 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:00 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:00 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:17 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:20 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:28 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:28 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): waiting for CTL to terminate 1 tasks
Jan 21 03:28:28 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): tasks terminated
Jan 21 03:28:33 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:28:33 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): waiting for CTL to terminate 1 tasks
Jan 21 03:28:33 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): tasks terminated
Jan 21 11:29:12 san1 ctld[67350]: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): read: Connection reset by peer
Jan 21 03:29:12 san1 ctld[20887]: child process 67350 terminated with exit status 1
Jan 21 03:29:31 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:29:36 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:29:39 san1 ctld[67443]: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): read: Connection reset by peer
Jan 21 03:29:39 san1 ctld[20887]: child process 67443 terminated with exit status 1
Jan 21 11:29:45 san1 ctld[67361]: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): read: Connection reset by peer
Jan 21 03:29:45 san1 ctld[20887]: child process 67361 terminated with exit status 1
Jan 21 03:29:59 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:30:19 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:30:19 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
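To see how the drops cluster over time, the console output can be tallied per minute and per message type. The following is only a rough sketch in Python: the sample lines are copied from the log above, while the event labels (`nop_out_drop` and so on) and the function name `count_events` are hypothetical names made up for illustration, not anything FreeNAS provides.

```python
import re
from collections import Counter

# A few entries copied from the console output in this post.
SAMPLE = """\
Jan 21 03:26:23 san1 daemon[6448]: 2018/01/21 03:26:23 [WARN] Timed out (30s) running check '/usr/local/etc/consul-checks/freenas_health.sh'
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:27:45 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): waiting for CTL to terminate 1 tasks
Jan 21 03:28:00 san1 WARNING: 10.100.1.2 (iqn.1991-05.com.microsoft:db1.bleh.local): no ping reply (NOP-Out) after 5 seconds; dropping connection
Jan 21 03:29:12 san1 ctld[20887]: child process 67350 terminated with exit status 1
"""

# Event labels are hypothetical names chosen for this sketch.
PATTERNS = {
    "health_check_timeout": re.compile(r"Timed out \(\d+s\) running check"),
    "nop_out_drop": re.compile(r"no ping reply \(NOP-Out\)"),
    "ctl_task_wait": re.compile(r"waiting for CTL to terminate"),
    "ctld_child_exit": re.compile(r"child process \d+ terminated"),
}

# Captures the syslog timestamp truncated to the minute, e.g. "Jan 21 03:27".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} ")


def count_events(text):
    """Return a Counter keyed by (minute, event label)."""
    counts = Counter()
    for line in text.splitlines():
        stamp = TIMESTAMP.match(line)
        if not stamp:
            continue  # skip lines without a leading syslog timestamp
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(stamp.group(1), label)] += 1
                break
    return counts


if __name__ == "__main__":
    for (minute, label), n in sorted(count_events(SAMPLE).items()):
        print(minute, label, n)
```

Feeding it the full console log instead of `SAMPLE` would show whether the NOP-Out drops start only after the health-check timeouts begin, which is the ordering visible in the excerpt.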
This issue occurs on both FreeNAS v11 and v11.1.
Hardware:
FreeNAS:
Supermicro H8SGL-F Motherboard
AMD Opteron Processor 6128 (8 cores @ 2GHz)
16GB RAM
LSI 9240-8i (Flashed to 9220-IT Mode)
8x Samsung 850 EVO 250GB SSDs in a RAID-10 pool in FreeNAS
500GB Samsung magnetic disk for OS (I've also tried this on a 16GB USB stick with similar results.)
HP Infiniband 4X DDR Connect-X PCI-e Dual Port HBA (Flashed to Mellanox 2.9.1)
I have configured the IB HBA for Ethernet mode running at 10 Gbps, and it is directly connected to a Windows Server 2016 machine running the same card. Both sides of the link are set to MTU 9014.
Server 2016 box:
HP DL160 G6
2x Xeon E5620 @ 2.4GHz (8 cores, 16 threads)
48GB RAM
HP Infiniband 4X DDR Connect-X PCI-e Dual Port HBA (Flashed to Mellanox 2.9.1)
My test setup is this:
FreeNAS
2 targets, each configured with three 100GB file-based extents, both served from the same target IP address, which is assigned to one of the ports on the Mellanox adapter.
Server 2016
6 extents mounted as local iSCSI volumes
IOmeter configured with two workers pointing to each of the six volumes (12 workers total)
- Access specification: All in one (a variable mix of sequential and random reads and writes of varying sizes)
After kicking off the test, network traffic hits 9+ Gbps while the drives are initialized, then levels off at 2 Gbps inbound/outbound during the test, generating a stable 12,000 combined IOPS until the test dies after about four hours. During that time, the FreeNAS CPU runs at about 97% utilization with a system load of about 15.
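As a sanity check on those figures, the average transfer size implied by the steady-state throughput and IOPS can be worked out directly. This is only back-of-the-envelope arithmetic in Python: it ignores iSCSI, TCP, and Ethernet overhead, and since "2 Gbps inbound/outbound" could mean 2 Gbps combined or 2 Gbps in each direction, both readings are shown. The function name `avg_io_size_bytes` is just an illustrative helper.

```python
def avg_io_size_bytes(throughput_gbps, iops):
    """Average bytes per I/O implied by a line rate and an IOPS figure.

    Assumes decimal gigabits (1 Gbps = 1e9 bits/s) and ignores
    protocol overhead, so the result is an upper bound on payload size.
    """
    bytes_per_second = throughput_gbps * 1e9 / 8
    return bytes_per_second / iops


if __name__ == "__main__":
    iops = 12_000  # combined IOPS reported by IOmeter
    # Reading 1: 2 Gbps is the total traffic across both directions.
    # Reading 2: 2 Gbps in each direction, so 4 Gbps in total.
    for label, gbps in (("2 Gbps combined", 2.0), ("2 Gbps each way", 4.0)):
        size = avg_io_size_bytes(gbps, iops)
        print(f"{label}: {size / 1024:.1f} KiB per I/O")
```

Either way the average lands in the tens of kilobytes per I/O, which seems plausible for a mixed-size access specification.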
Memory shows 200M free and 15G wired, with swap utilization stable at 520M.
How can I make this setup stable under load?