ctld continually exits - read connection lost bug

VioletDragon

Patron
Joined: Aug 6, 2017
Messages: 251
Hi,

I believe this is a bug. My console on FreeNAS gets spammed with:

Dec 18 17:47:50 Telsa WARNING: 192.168.2.2 (iqn.2019-09.com.example:32774b20): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec 18 17:47:50 Telsa WARNING: 192.168.2.2 (iqn.2019-09.com.example:32774b20): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec 18 18:13:07 Telsa kernel: cxgb0: link state changed to UP
Dec 18 18:13:07 Telsa kernel: cxgb0: link state changed to UP
Dec 18 18:14:18 Telsa ctld[49704]: 192.168.2.2: read: connection lost
Dec 18 18:14:18 Telsa ctld[2304]: child process 49704 terminated with exit status 1
Dec 18 18:14:19 Telsa ctld[49705]: 192.168.2.2 (iqn.2019-09.com.example:32774b20): read: Connection reset by peer
Dec 18 18:14:19 Telsa ctld[2304]: child process 49705 terminated with exit status 1
Dec 18 18:14:19 Telsa ctld[49706]: 192.168.2.2: read: connection lost
Dec 18 18:14:19 Telsa ctld[2304]: child process 49706 terminated with exit status 1
Dec 18 18:14:42 Telsa ctld[49730]: 192.168.2.2: read: connection lost
Dec 18 18:14:42 Telsa ctld[2304]: child process 49730 terminated with exit status 1
Dec 18 18:14:42 Telsa ctld[49731]: 192.168.2.2: read: connection lost
Dec 18 18:14:42 Telsa ctld[2304]: child process 49731 terminated with exit status 1

This is iSCSI storage for XCP-ng. I have also tested with 1 gig and a 4-gig LAG as well as a 10 gig network, and the problem is still there; I have also tested on different hardware boards, etc. The problem is present in both 11.1 and 11.2. Someone else has had this bug, but it was closed as it couldn't be reproduced: https://redmine.ixsystems.com/issues/7891

While this happens the connection is otherwise perfectly fine and all VMs keep working; XCP-ng doesn't show anything in its logs about the link dropping.

Network cards tested that produce this issue:

HP NC360T Dual Port
HP NC364T Quad NIC
Chelsio T320E
 

Attachments

  • 79881678_2702669836446246_6060153212038545408_o.jpg (72.9 KB)

jgreco

Resident Grinch
Joined: May 29, 2011
Messages: 18,681
Please post some more detailed information about your system, especially including what you've got configured for a pool. See the forum rules, conveniently linked at the top of each page in red, for specifics that you ought to include.

Typical causes are various combinations of performance-killers such as using RAIDZ for the backing store (you need mirrors) and having insufficient RAM for iSCSI (realistic minimum of 64GB, might be able to get away with 32GB or even less if you're just running a trite do-nothing VM or two). How full is your pool? Over 40%? That could be a problem. What's your fragmentation like? For iSCSI, I also like to set vfs.zfs.txg.timeout to 1 because it helps the write throttle learn the system's performance level in a timeframe that's LESS than iSCSI's timeout.
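
On FreeBSD that's a plain runtime sysctl, so something like this from the shell should do it; add it as a System -> Tunables entry of type "sysctl" if you want it to stick across reboots:

sysctl vfs.zfs.txg.timeout=1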

The usual problem is that the other end is getting antsy waiting for a poorly designed FreeNAS system to respond, and drops the connection. This is largely dependent on what your workload is like. The other end might not even be logging a drop, but it could still be happening.

You should also make sure your network is not dropping packets or anything like that. Run a long-term ping session from the NAS to the client (several hours minimum) with typical traffic on the network. Anything more than a packet or two lost should be researched and remediated. iSCSI is one of the twitchiest protocols out there. Never forget the basics: walking has to work before you can run.
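
Something as simple as this, left running from the FreeNAS shell against the initiator's address (192.168.2.2 in your logs; the log path is just an example), will print a loss percentage in the summary when you stop it:

ping 192.168.2.2 | tee /tmp/iscsi-ping.log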

I've mostly stopped bothering to debug these as network problems, because it has really never turned out to be a network issue; it's a design issue. Having helped dozens or hundreds of people with this kind of thing here in the forums, I'd be shocked if I haven't touched on your issue(s) in this message. If you give ZFS what it needs, it works. I have absolutely no problem running 61 light-duty VMs on a 64GB two-core host with 4 x 12TB HDD in a mirror.
 

VioletDragon

Patron
Joined: Aug 6, 2017
Messages: 251
Typical causes are various combinations of performance-killers such as using RAIDZ for the backing store (you need mirrors) and having insufficient RAM for iSCSI (realistic minimum of 64GB, might be able to get away with 32GB or even less if you're just running a trite do-nothing VM or two).

Hi, I'm using a mirror for iSCSI. I have 16 GB of ECC RAM, but RAM is not an issue according to Reporting in the web UI.
 

jgreco

Resident Grinch
Joined: May 29, 2011
Messages: 18,681
I have 16 GB of ECC RAM, but RAM is not an issue according to Reporting in the web UI.

That's irrelevant. The UI isn't magic and doesn't know what you're trying to do. Everything needs to be well provisioned in order for iSCSI to work well. What's the design of your pool, and its occupancy and fragmentation rates?
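
All three are quick to check from the shell; something like this (the pool name is just an example) shows the vdev layout plus the CAP and FRAG columns:

zpool status tank
zpool list tank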
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined: Feb 6, 2014
Messages: 5,110
I have also tested with 1 gig and a 4-gig LAG as well as a 10 gig network

FYI, don't use LACP with iSCSI - it has its own internal multipathing protocol (MPIO) that should be used instead.
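
The usual MPIO layout is one subnet per port instead of a lagg: each interface gets its own address, FreeNAS publishes both as portals, and the initiator logs in over both paths. On the target side that conceptually ends up as something like this in ctl.conf (the addresses and zvol path are just examples, and on FreeNAS this file is generated from the Portals/Targets screens rather than edited by hand):

portal-group pg0 {
    listen 192.168.10.1
    listen 192.168.20.1
}

target iqn.2019-09.com.example:target0 {
    portal-group pg0
    lun 0 {
        path /dev/zvol/tank/xcp-iscsi0
    }
}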

What is the networking setup between FreeNAS and the hypervisor? A consumer switch might not be able to handle the traffic, and a low-end "access layer" switch or a vendor with interesting ideas about "preventing DoS attacks" might be mistakenly flagging the port as malicious.
 

VioletDragon

Patron
Joined: Aug 6, 2017
Messages: 251
Problem solved. The card in the FreeNAS box was overheating; adding a fan to the card solved the problem. Not sure why it was overheating, as there is plenty of airflow in the 4U chassis.
 

VioletDragon

Patron
Joined: Aug 6, 2017
Messages: 251
FYI, don't use LACP with iSCSI - it has its own internal multipathing protocol (MPIO) that should be used instead.

What is the networking setup between FreeNAS and the hypervisor? A consumer switch might not be able to handle the traffic, and a low-end "access layer" switch or a vendor with interesting ideas about "preventing DoS attacks" might be mistakenly flagging the port as malicious.

Direct link only. I haven't bothered with 10 gig switches as they are overpriced for what they are; it's a direct link from FreeNAS to XCP-ng. I'm also using the second port on the Chelsio card in the FreeNAS server for my workstation.
 

jgreco

Resident Grinch
Joined: May 29, 2011
Messages: 18,681
Direct link only. I haven't bothered with 10 gig switches as they are overpriced for what they are; it's a direct link from FreeNAS to XCP-ng. I'm also using the second port on the Chelsio card in the FreeNAS server for my workstation.

Really? Because at $125 for a 4-port 10gig switch, the MikroTik CRS305-1G-4S+IN is kinda hard to argue with.
 

VioletDragon

Patron
Joined: Aug 6, 2017
Messages: 251
Really? Because at $125 for a 4-port 10gig switch, the MikroTik CRS305-1G-4S+IN is kinda hard to argue with.

Meh, it's cheap for a reason. If I were to go with a 10 gig switch I'd rather go with a Cisco or a Ubiquiti switch. The MikroTik CRS305-1G-4S+IN is not rack-mountable.
 

jgreco

Resident Grinch
Joined: May 29, 2011
Messages: 18,681
Meh, it's cheap for a reason.

It's similar to the Ubiquiti stuff in quality. I had MikroTik send an 8-port version for eval, and I was actually quite impressed. It isn't the datacenter-grade five-figure gear that we use in production, but for a home lab it'd be amazing.
 