Help with ctl_datamove abort and RST error

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Virtualization Environment:
6 Dell VMware hosts connected via multiple 10Gb VLANs through HPE FlexFabric switches

Configuration:
Dell 730xd - 192GB RAM - [LOG] Intel Optane 900P (280GB) - [CACHE] (2) 1TB PCI Intel NVMe SSD - FreeNAS 11.3-U5

NICs
  1. Two 2-port 10GBase-T Intel cards, configured with a management network and 3 VLANs (two for storage and one for vMotion)
  2. NICs are LAGGed together and set up for failover, as everything is dual-connected for redundancy to an HPE FlexFabric switch stack.
  3. Jumbo frames are turned on and verified to be working properly across the storage and vMotion VLANs.
Controllers/Enclosures
  1. 730xd has 24 2.5" slots, which are being used for SSDs only.
  2. 2 Dell MD1200 Enclosures with 24 total slots
  3. LSI External 2 port SAS Card
2 Pools
  1. pool7K1 - 24 8TB SAS 7K drives - Configured as 12 mirrored vdevs - log is Intel Optane - cache is 1TB PCI Intel NVMe SSD - pool is HEALTHY: (57%) Used / 36.06 TiB Free
  2. poolSSD1 - 12 3.84TB SAS Dell Enterprise SSDs - Configured as 6 mirrored vdevs - no log - cache is 1TB PCI Intel NVMe SSD - pool is HEALTHY: (50%) Used / 8.27 TiB Free
Problem:
Our problem is with poolSSD1. We are suffering horrible performance on this pool, with bad read/write latency numbers. What is sad is that the 7K pool appears to be outperforming the SSD pool. We are also seeing the following messages in the FreeNAS shell:

Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549cb on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549cd on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549d0 on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549d3 on (4:17:113) aborted
Jan 25 14:12:41 dcXXXSAN02 ctl_datamove: tag 0xb55bc0 on (10:24:105) aborted
Jan 25 14:18:03 dcXXXSAN02 ctl_datamove: tag 0x3f941a on (4:25:114) aborted
Jan 25 14:34:30 dcXXXSAN02 kernel: Limiting closed port RST response from 1704 to 200 packets/sec
Jan 25 14:34:30 dcXXXSAN02 kernel: Limiting closed port RST response from 1704 to 200 packets/sec

Questions:
  1. Can someone explain what the ctl_datamove error is and what might cause it?
  2. What does the 99:99:99 refer to? From looking at it, it appears to me that the last number is the LUN with the issue.
  3. What does the tag hex value refer to?
  4. We have a second FreeNAS with an identical setup which does not suffer the same problems.
  5. Are there any suggestions, or possible issues with this setup?
Thank you again for any help guys,
Joe
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
The problem with that posting (which I have read before) is that nothing describes what the values in the error message pertain to. Any other ideas?
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Is there anything I need to add to this post to try to get more input? Just want to make sure I am giving enough information here to get some guidance or help. Thanks.
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Well, this is disappointing. We were hoping that the community would be able to give some input on the performance issues, which is the reason we decided to go with Free/TrueNAS. Can anyone else lend any input?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Use the Source, Luke! According to https://github.com/lattera/freebsd/blob/master/sys/cam/ctl/ctl_frontend_iscsi.c, ctl_datamove means a specific block failed to be sent via iSCSI. Unfortunately, tracking this down can be difficult.

Here's an old bug report suggesting that jumbo-frame compatibility issues between vendors can result in this.
Here's a FreeBSD driver bug report suggesting a compatibility issue between the LSI HBA and specific Seagate drive firmware.

As you're seeing RST messages from the initiator, you may also have a problem on the VM or ESX host side.
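If it helps narrow things down, here's a rough Python sketch that tallies the ctl_datamove aborts per x:y:z triplet out of the syslog, so you can see whether they cluster on one value or are spread around. It assumes the stock /var/log/messages location and the exact log format you quoted; adjust both if your logging is set up differently.

#!/usr/bin/env python3
# Count ctl_datamove aborts per (x:y:z) triplet from the syslog.
# Assumes the default /var/log/messages path and the log format quoted above.
import re
from collections import Counter

pattern = re.compile(r"ctl_datamove: tag 0x[0-9a-f]+ on \((\d+:\d+:\d+)\) aborted")
counts = Counter()

with open("/var/log/messages", errors="replace") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for triplet, n in counts.most_common():
    print(f"{triplet}  aborts={n}")

If one triplet dominates the output, that at least points at one specific device/connection rather than the whole setup.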
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Samuel
Thanks for the reply. We aren't using any Seagate drives, and all switching is HPE FlexFabric 5800s. I'll look through the source code and see if there is anything that might tell me some information. Appreciate your time and input!
 

atakacs

Explorer
Joined
Apr 23, 2012
Messages
92
Hi - seeing a similar issue (with NVMe SSDs).

Did you ever get a resolution?
 

belar

Cadet
Joined
Sep 13, 2022
Messages
1
Hello,
Cisco 9396PX, AMD EPYC 7232P, 7 Samsung EVO SSDs (3x 870 and 4x 860), Mellanox MCX311A-XCAT on the clients, TrueNAS and iSCSI.
The same situation is happening for us - did you ever get an understanding of the error?
 

a1a23

Cadet
Joined
Sep 19, 2023
Messages
1
Just responding to part of your question, which I guess many people were trying to find answers to.
What does the 99:99:99 refer to?
(Initiator:port:LUN)
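In case it helps anyone else reading the same log lines, a tiny Python illustration of pulling those three fields out of one of the messages quoted above, using that mapping:

# Split one ctl_datamove log line into its (initiator:port:lun) parts.
line = "ctl_datamove: tag 0x1549cb on (4:17:113) aborted"
initiator, port, lun = line.split("(")[1].split(")")[0].split(":")
print(f"initiator ID {initiator}, port {port}, LUN {lun}")
# -> initiator ID 4, port 17, LUN 113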
 