Help with ctl_datamove abort and RST error

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Virtualization Environment:
6 Dell VMware hosts connected via multiple 10Gb VLANs through HPE FlexFabric switches

Configuration:
Dell 730xd - 192GB RAM - [LOG] Intel Optane 900P (280GB) - [CACHE] (2) 1TB PCI Intel NVMe SSD - FreeNAS 11.3-U5

NICs
  1. Two 2-port 10GBase-T Intel cards, configured with a management network and 3 VLANs (two for storage and one for vMotion)
  2. NICs are LAGGed together and set up for failover, as everything is dual-connected for redundancy to an HPE FlexFabric switch stack.
  3. Jumbo frames are turned on and verified to be working properly across the storage and vMotion VLANs.
Controllers/Enclosures
  1. 730xd has 24 2.5" slots, which are being used for SSDs only.
  2. 2 Dell MD1200 Enclosures with 24 total slots
  3. LSI External 2 port SAS Card
2 Pools
  1. pool7K1 - 24 8TB SAS 7K drives - Configured as 12 mirrored vdevs - log is Intel Optane - cache is 1TB PCI Intel NVMe SSD - pool is HEALTHY: (57%) Used / 36.06 TiB Free
  2. poolSSD1 - 12 3.84TB SAS Dell Enterprise SSDs - Configured as 6 mirrored vdevs - no log - cache is 1TB PCI Intel NVMe SSD - pool is HEALTHY: (50%) Used / 8.27 TiB Free
Problem:
Our problem is with poolSSD1. We are suffering horrible performance on this pool, with bad read/write latency numbers. What is sad is that the 7K pool appears to be outperforming the SSD pool. We are also seeing the following messages in the FreeNAS shell:

Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549cb on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549cd on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549d0 on (4:17:113) aborted
Jan 25 09:38:15 dcXXXSAN02 ctl_datamove: tag 0x1549d3 on (4:17:113) aborted
Jan 25 14:12:41 dcXXXSAN02 ctl_datamove: tag 0xb55bc0 on (10:24:105) aborted
Jan 25 14:18:03 dcXXXSAN02 ctl_datamove: tag 0x3f941a on (4:25:114) aborted
Jan 25 14:34:30 dcXXXSAN02 kernel: Limiting closed port RST response from 1704 to 200 packets/sec
Jan 25 14:34:30 dcXXXSAN02 kernel: Limiting closed port RST response from 1704 to 200 packets/sec

Questions:
  1. Can someone explain what the ctl_datamove error is and what might cause it?
  2. What does the 99:99:99 refer to? From looking at it, it appears to me that the last number is the LUN with the issue.
  3. What does the tag hex value refer to?
  4. We have a second FreeNAS with an identical setup which does not suffer the same problems.
  5. Are there any suggestions, or possible issues with this setup?
Thank you again for any help guys,
Joe
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
The problem with that posting (which I have read before) is that nothing describes what the values in the error message pertain to. Any other ideas?
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Is there anything I need to add to this post to try to get more input? Just want to make sure I am giving enough information here to get some guidance or help. Thanks.
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Well, this is disappointing. We were hoping that the community would be able to give some input on the performance issues, which is the reason we decided to go with Free/TrueNAS. Can anyone else lend any input?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Use the Source, Luke! According to https://github.com/lattera/freebsd/blob/master/sys/cam/ctl/ctl_frontend_iscsi.c, ctl_datamove means a specific block failed to be sent via iSCSI. Unfortunately, tracking this down can be difficult.

Here's an old bug report suggesting that jumbo-frame compatibility issues between vendors can result in this.
Here's a FreeBSD driver bug report suggesting a compatibility issue between the LSI HBA and specific Seagate drive firmware.

As you're seeing RST messages from the initiator, you may also have a problem on the VM or ESX host side.
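If it helps narrow things down, here's a rough Python sketch that tallies the ctl_datamove aborts per x:y:z triplet out of the syslog, so you can see whether they cluster on one value or are spread around. It assumes the stock /var/log/messages location and the exact log format you quoted; adjust both if your logging is set up differently.

#!/usr/bin/env python3
# Count ctl_datamove aborts per (x:y:z) triplet from the syslog.
# Assumes the default /var/log/messages path and the log format quoted above.
import re
from collections import Counter

pattern = re.compile(r"ctl_datamove: tag 0x[0-9a-f]+ on \((\d+:\d+:\d+)\) aborted")
counts = Counter()

with open("/var/log/messages", errors="replace") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for triplet, n in counts.most_common():
    print(f"{triplet}  aborts={n}")

If one triplet dominates the output, that at least points at one specific device/connection rather than the whole setup.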
 

Joseph Bucar

Cadet
Joined
Mar 23, 2017
Messages
6
Samuel
Thanks for the reply. We aren't using any Seagate drives, and all switching is HPE FlexFabric 5800s. I'll look through the source code and see if there is anything that might tell me some information. Appreciate your time and input!
 

atakacs

Explorer
Joined
Apr 23, 2012
Messages
92
Hi - seeing a similar issue (with NVMe SSDs).

Did you ever get a resolution?
 

belar

Cadet
Joined
Sep 13, 2022
Messages
1
Hello,
Cisco 9396PX, AMD EPYC 7232P, 7 Samsung EVO SSDs (3x 870 and 4x 860), Mellanox MCX311A-XCAT on the clients, TrueNAS and iSCSI.
The same situation is happening for us - did you ever get an understanding of the error?
 

a1a23

Cadet
Joined
Sep 19, 2023
Messages
1
Just responding to part of your question, which I guess many people were trying to find answers to.
What does the 99:99:99 refer to?
(Initiator:port:LUN)
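In case it helps anyone else reading the same log lines, a tiny Python illustration of pulling those three fields out of one of the messages quoted above, using that mapping:

# Split one ctl_datamove log line into its (initiator:port:lun) parts.
line = "ctl_datamove: tag 0x1549cb on (4:17:113) aborted"
initiator, port, lun = line.split("(")[1].split(")")[0].split(":")
print(f"initiator ID {initiator}, port {port}, LUN {lun}")
# -> initiator ID 4, port 17, LUN 113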
 