David Riley
Cadet
- Joined
- Apr 21, 2017
- Messages
- 5
I think I've finally chased down a persistent problem I've been having for nearly a year since setting up ESXi on my SAN. Some background:
I've been running an FC SAN with FreeNAS for a while longer than that, but most of what it's boiled down to has been hosting a large-ish FC slice for my Windows machine since it's on a small NVMe SSD (performance is quite good!). I have dreams of eventually booting my SPARC and Alpha machines off FC (since SCSI drives are starting to drop like flies), but some bugs in FreeBSD's FC discovery code precluded that until fairly recently (around 11.2, I think, though FreeNAS backported it to 11.1). I haven't tried since.
Around Christmas last year, I set up ESXi to work with FreeNAS and boot from the SAN, which was... an adventure. I learned more about Fibre Channel than I really cared to, and am still learning more (as well as more about ESXi's failure modes when its disks disappear unexpectedly). It *mostly* works OK, but there's been one persistent problem I run into, which is that every so often (usually no more than 24h apart), the FC disks just "space out" and don't respond to ESXi. They work fine on my Windows machine, though. Go figure. If I issue a "service ctld restart", service resumes.
I noticed that this problem never manifested when my Windows machine was fully off. I tried zoning my FC switch (a Brocade 5000 running FabricOS 6.1.0d) such that the ESXi machine had its own port on the SAN and tied only the ESXi LUNs to that port (I needed to do that anyway to boot off the SAN because some older QLogic firmware will only boot off LUN 0). The problem persisted. I pretty much let the whole thing lie fallow for a few months while I took up other things.
This past week, I decided to get serious about diagnosing it. I turned on more verbose logging in the isp driver and set about finding a reproducible cause. As it turns out: whenever the Windows machine *goes to sleep*, the array goes silent. When I take the Windows machine out of sleep, it still works fine for its own LUN that it uses for bulk storage (yes, I should probably zone it to a single LUN but I don't have enough FC ports on the SAN and FreeBSD doesn't support LUN masking by initiator yet unless I muck with virtual ports, which I don't want to do yet), but none of the other ones are alive, including the ones ESXi uses (which should theoretically be isolated, but again I'm an FC dummy, so).
After a few days of digging, I finally found what I was looking for in the logs. When the Windows machine goes to sleep, it issues a START STOP UNIT command (0x1B) to every single LUN it knows about, which (because I have all LUNs mapped on that port) is all of them. This seems to turn off *all* the units on the CTL attached to the port; a "ctladm tur <lun>" verifies that the LUNs have been turned off, and a "ctladm start <lun>" command brings them back to life.
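For anyone else staring at raw CDBs in isp debug output, here is a minimal Python sketch of how a 6-byte START STOP UNIT CDB breaks down, as far as I understand the SBC layout (opcode 0x1B in byte 0, IMMED in bit 0 of byte 1, and POWER CONDITION, LOEJ and START in byte 4). The names `StartStopUnit` and `decode_start_stop_unit` are mine, not anything from CTL or the isp driver, and the sample bytes are made up for illustration rather than copied from my logs.

```python
from dataclasses import dataclass

START_STOP_UNIT = 0x1B  # SCSI operation code for START STOP UNIT


@dataclass
class StartStopUnit:
    immed: bool           # byte 1, bit 0: return status before the operation completes
    power_condition: int  # byte 4, bits 7-4: 0 means "honor the START bit"
    loej: bool            # byte 4, bit 1: load/eject (removable media)
    start: bool           # byte 4, bit 0: 1 = start the unit, 0 = stop it

    @property
    def stops_unit(self) -> bool:
        # With POWER CONDITION = 0, a cleared START bit asks the unit to stop.
        return self.power_condition == 0 and not self.start


def decode_start_stop_unit(cdb: bytes) -> StartStopUnit:
    """Decode a 6-byte START STOP UNIT CDB (hypothetical helper)."""
    if len(cdb) != 6:
        raise ValueError("START STOP UNIT is a 6-byte CDB")
    if cdb[0] != START_STOP_UNIT:
        raise ValueError(f"not START STOP UNIT: opcode 0x{cdb[0]:02X}")
    return StartStopUnit(
        immed=bool(cdb[1] & 0x01),
        power_condition=(cdb[4] >> 4) & 0x0F,
        loej=bool(cdb[4] & 0x02),
        start=bool(cdb[4] & 0x01),
    )


# Illustrative CDBs only (not captured from the SAN):
stop_cdb = bytes([0x1B, 0x00, 0x00, 0x00, 0x00, 0x00])   # START = 0
start_cdb = bytes([0x1B, 0x00, 0x00, 0x00, 0x01, 0x00])  # START = 1
```

Under those assumptions, `decode_start_stop_unit(stop_cdb).stops_unit` is the condition that would correspond to a LUN being stopped, which is the state "ctladm tur <lun>" was reporting for me.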
Sooo... I know the approximate path to fix this, but I don't know the steps. Obviously, someone is being very naughty on the Windows side and telling all the LUNs attached via FC to go to sleep when it does, which is... maybe not a good thing to be doing on a SAN? I don't know how these things usually work. In any case, I can't think of a situation where I'd actually want START STOP UNIT to work on an FC LUN, but I don't see an obvious mechanism for telling the CTL to ignore it in the source. Is there a way to do that?
Meanwhile, I'm probably going to try to stuff another FC card in the box so I can isolate the Windows machine, too. I know the answer is probably "if you want it, sponsor it or do it yourself", which is actually a perfectly acceptable answer, but what's the chance of LUN masking/mapping by initiator coming to the CTL any time soon?