Ignore START STOP UNIT command on FC adaptors?

Status
Not open for further replies.

David Riley

Cadet
Joined
Apr 21, 2017
Messages
5
I think I've finally chased down a persistent problem I've been having for nearly a year since setting up ESXi on my SAN. Some background:

I've been running an FC SAN with FreeNAS for a while longer than that, but most of what it's boiled down to has been hosting a large-ish FC slice for my Windows machine since it's on a small NVMe SSD (performance is quite good!). I have dreams of eventually booting my SPARC and Alpha machines off FC (since SCSI drives are starting to drop like flies), but some bugs in FreeBSD's FC discovery code precluded that until fairly recently (around 11.2, I think, though FreeNAS backported it to 11.1). I haven't tried since.

Around Christmas last year, I set up ESXi to work with FreeNAS and boot from the SAN, which was... an adventure. I learned more about Fibre Channel than I really cared to, and am still learning more (as well as more about ESXi's failure modes when its disks disappear unexpectedly). It *mostly* works OK, but there's been one persistent problem I run into, which is that every so often (usually no more than 24h apart), the FC disks just "space out" and don't respond to ESXi. They work fine on my Windows machine, though. Go figure. If I issue a "service ctld restart", service resumes.

I noticed that this problem never manifested when my Windows machine was fully off. I tried zoning my FC switch (a Brocade 5000 running FabricOS 6.1.0d) such that the ESXi machine had its own port on the SAN and tied only the ESXi LUNs to that port (I needed to do that anyway to boot off the SAN because some older Qlogic firmware will only boot off LUN 0). Still problematic. I pretty much let the whole thing lie fallow for a few months while I took up other things.

This past week, I decided to get serious about diagnosing it. I turned on more verbose logging in the isp driver and set about finding a reproducible cause. As it turns out: whenever the Windows machine *goes to sleep*, the array goes silent. When I take the Windows machine out of sleep, it still works fine for its own LUN that it uses for bulk storage (yes, I should probably zone it to a single LUN but I don't have enough FC ports on the SAN and FreeBSD doesn't support LUN masking by initiator yet unless I muck with virtual ports, which I don't want to do yet), but none of the other ones are alive, including the ones ESXi uses (which should theoretically be isolated, but again I'm an FC dummy, so).

After a few days of digging, I finally found what I was looking for in the logs. When the Windows machine goes to sleep, it issues a START STOP UNIT command (0x1B) to every single LUN it knows about, which (because I have all LUNs mapped on that port) is all of them. This seems to turn off *all* the units on the CTL attached to the port; a "ctladm tur <lun>" verifies that the LUNs have been turned off, and a "ctladm start <lun>" command brings them back to life.
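For anyone trying to spot the same thing in their own logs, here is a rough Python sketch of how a 6-byte START STOP UNIT CDB breaks down, going by the SBC layout. The thread doesn't show the exact bytes Windows sends, so the sample CDBs below are made up for illustration, and the names (decode_start_stop_unit, is_stop_request) are my own, not anything from CTL or the isp driver.

```python
from dataclasses import dataclass

START_STOP_UNIT = 0x1B  # SCSI opcode mentioned in the post


@dataclass(frozen=True)
class StartStopUnit:
    immed: bool           # byte 1, bit 0: return status before the operation completes
    power_condition: int  # byte 4, bits 7-4: 0 means "use the START bit"
    no_flush: bool        # byte 4, bit 2
    loej: bool            # byte 4, bit 1: load/eject
    start: bool           # byte 4, bit 0: 1 = start the unit, 0 = stop it


def decode_start_stop_unit(cdb: bytes) -> StartStopUnit:
    """Decode a 6-byte START STOP UNIT CDB (illustrative helper, not a CTL API)."""
    if len(cdb) != 6:
        raise ValueError("START STOP UNIT is a 6-byte CDB")
    if cdb[0] != START_STOP_UNIT:
        raise ValueError(f"not START STOP UNIT: opcode 0x{cdb[0]:02X}")
    return StartStopUnit(
        immed=bool(cdb[1] & 0x01),
        power_condition=(cdb[4] >> 4) & 0x0F,
        no_flush=bool(cdb[4] & 0x04),
        loej=bool(cdb[4] & 0x02),
        start=bool(cdb[4] & 0x01),
    )


def is_stop_request(cdb: bytes) -> bool:
    """True when the CDB asks the unit to stop (POWER CONDITION 0, START 0)."""
    fields = decode_start_stop_unit(cdb)
    return fields.power_condition == 0 and not fields.start


if __name__ == "__main__":
    # Made-up example CDBs: a plain stop, then a start with IMMED set.
    print(decode_start_stop_unit(bytes([0x1B, 0x00, 0x00, 0x00, 0x00, 0x00])))
    print(decode_start_stop_unit(bytes([0x1B, 0x01, 0x00, 0x00, 0x01, 0x00])))
```

A stop request is the case that matters here: with START clear and no power condition set, the target is being asked to take the unit offline, which matches the LUNs showing as not ready afterwards.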

Sooo... I know the approximate path to fix this, but I don't know the steps. Obviously, someone is being very naughty on the Windows side and telling all the LUNs attached via FC to go to sleep when it does, which is... maybe not a good thing to be doing on a SAN? I don't know how these things usually work. In any case, I can't think of a situation where I'd actually want START STOP UNIT to work on an FC LUN, but I don't see an obvious mechanism for telling the CTL to ignore it in the source. Is there a way to do that?

Meanwhile, I'm probably going to try to stuff another FC card in the box so I can isolate the Windows machine, too. I know the answer is probably "if you want it, sponsor it or do it yourself", which is actually a perfectly acceptable answer, but what's the chance of LUN masking/mapping by initiator coming to the CTL any time soon?
 

David Riley

From what I can see in the source after some further diving, it doesn't look like there's any way to tell a block backend device that it should ignore START STOP UNIT commands. I can think of several situations where one would want to do so. How do commercial FC arrays typically handle this? Turning off a logical block device doesn't really have much physical purpose, though I guess I could envision scenarios where you'd want to track it logically.
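To make the idea concrete, this is roughly the behaviour I have in mind, written as a toy Python model rather than anything resembling the real CTL code. The per-LUN ignore_start_stop flag is purely hypothetical (no such option exists as far as I can tell), and the model only looks at the START bit, ignoring the power condition field.

```python
from dataclasses import dataclass

START_STOP_UNIT = 0x1B
STATUS_GOOD = 0x00  # SCSI GOOD status


@dataclass
class Lun:
    lun_id: int
    running: bool = True
    ignore_start_stop: bool = False  # hypothetical per-LUN option, not a real CTL setting


def handle_start_stop_unit(lun: Lun, cdb: bytes) -> int:
    """Toy model: apply or ignore START STOP UNIT depending on the LUN's policy."""
    if len(cdb) != 6 or cdb[0] != START_STOP_UNIT:
        raise ValueError("expected a 6-byte START STOP UNIT CDB")
    if lun.ignore_start_stop:
        # Report success to the initiator but leave the LUN state untouched.
        return STATUS_GOOD
    # Simplification: only the START bit (byte 4, bit 0) is honoured here.
    lun.running = bool(cdb[4] & 0x01)
    return STATUS_GOOD


if __name__ == "__main__":
    stop_cdb = bytes([0x1B, 0x00, 0x00, 0x00, 0x00, 0x00])
    shared = Lun(lun_id=0, ignore_start_stop=True)
    plain = Lun(lun_id=1)
    handle_start_stop_unit(shared, stop_cdb)
    handle_start_stop_unit(plain, stop_cdb)
    print(shared)  # still running: the stop was acknowledged but ignored
    print(plain)   # stopped: default behaviour
```

Returning GOOD instead of rejecting the command is a guess at what would keep a sleeping initiator happy; whether that is the right choice for real hardware is exactly the question I'm asking.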
 

David Riley

Well, lacking other options, and having a switch that supports NPIV, I broke down and switched to virtual ports. It does work, though it makes for rather more zones on the switch than I'd like (especially to keep the host from talking to itself). I'd be interested in contributing to a proper LUN masking/mapping solution per initiator group, though, if I ever get spare time; I think I can see a pretty good path (ha ha) to that in the existing code without too much mucking around. What's the normal procedure for contributing patches?
 