SubnetMask
Contributor
- Joined
- Jul 27, 2017
- Messages
- 129
I've been having an occasional issue for a while where, during or after heavy writes, one or more FreeNAS pools become inaccessible. Typically, the only time I've seen it is in the few instances where I've had a power failure and I'm trying to suspend a bunch of VMs before my UPS gives out; during the suspends, everything grinds to a halt and crashes anyway. Fast forward to earlier this week: I was migrating a powered-off machine's storage from two points on one datastore to a single point on another datastore, and this same 'crash' occurred.

Once I got everything back up, I built a second FreeNAS machine running FreeNAS 11.1u7 on a PowerEdge R620 with 224GB of RAM, connected to a Supermicro JBOD (my main FreeNAS is a PowerEdge FC630 with 256GB of RAM and a Xeon E5-2699 v3 running 11.1u7, connected to two Supermicro JBODs). On it I created two pools, each consisting of two striped mirrors, in order to do some testing. I moved a number of 'unimportant' machines over to the new FreeNAS to test with, let it sit for a while, and then set VMware to move the disks for all 11 machines from one datastore on the test FreeNAS to the other.

It ended up doing the same 'crash', and this time I noticed that on one interface VMware was reporting five devices/five paths, but the other interface was reporting 3/3 (neither datastore on the test FreeNAS was available on one interface). I was also unable to browse the datastores on the test FreeNAS. I also found a lot of datastore errors in VMware indicating path failures and inaccessible datastores, and thought maybe it was VMware, so I rebooted the VMware host. After that it was showing 5/5, but I got sidetracked by another task.
After I finished up what I had to do and came back to this, I noticed that I still couldn't browse the datastores, despite VMware saying all paths were up. I then took my second host out of maintenance mode and found that it didn't start balancing the VM load like usual. So I rebooted the test FreeNAS machine, and as soon as it went down, VMs started migrating between hosts like normal; once the test FreeNAS machine came back up, the datastores were fully accessible again. So it seems like it's actually FreeNAS that's having issues, and that was causing the VMware hosts to hang some functions, most notably vMotion.
My first thought was that maybe the 'tunables' that are in place could have had something to do with the issues (they are autotune entries, quite possibly from as far back as when I was running FreeNAS on an R710), but there are no tunables in place on the test machine, so that likely rules that out. I then found this thread while starting my thread; while I haven't dug in to look for the errors that poster found, the symptoms seem nearly identical. In his case, he was running 11.1u4 and apparently adding a SLOG fixed it. So I'm not sure if my issue is the same as that poster's, but it seems possible. While adding a SLOG is probably not a bad idea, one would think that once it ran out of cache, the data transfer would slow down, not crash.
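In the meantime, since the first symptom I notice is that I can no longer browse the datastores, here's a minimal watchdog sketch I'm thinking of running so I can catch the hang the moment it starts instead of finding it after the fact. The path and threshold are hypothetical examples, not my real mount point; this assumes a plain POSIX shell and that a hung pool makes directory listings stall.

```shell
#!/bin/sh
# Hypothetical watchdog: warn if listing a datastore path takes too long.
# DATASTORE is an assumed example path; replace with the real mount point.
DATASTORE="${1:-/tmp}"
TIMEOUT=10   # seconds before we consider the listing "hung" (arbitrary)

start=$(date +%s)
ls "$DATASTORE" > /dev/null 2>&1
end=$(date +%s)

elapsed=$((end - start))
if [ "$elapsed" -gt "$TIMEOUT" ]; then
    echo "WARN: $DATASTORE took ${elapsed}s to list (possible pool hang)"
else
    echo "OK: $DATASTORE listed in ${elapsed}s"
fi
```

Run from cron every minute or so, a WARN line in the log would at least timestamp when the pool starts stalling relative to the heavy writes.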
Does anyone have any ideas or suggestions? Is any other info needed?