SOLVED Failing NVMe crashed FreeNAS, prevented (re)booting, and can't be removed from pool.

SamM

Dabbler
Joined
May 29, 2017
Messages
39
Last Sunday, our production FreeNAS server (HP DL380e G8, 96GB RAM, boots off tiny mirrored M.2 boot SSDs; main pool is Seagate Exos 10TB x12 in 4 sets of 3-way mirrors + mirrored 240GB NVMe SLOG w/ power loss protection + a single 500GB Crucial M.2 SATA L2ARC w/ power loss protection) crashed. The symptom is that it seemingly stops processing iSCSI traffic (Sync=Always, out of paranoia...) for our ESXi hosts. The local console is responsive, but the WebGUI is slow at best until it becomes completely unresponsive. At that point the only option we have is to *try* to reboot via the local console menu and, if that doesn't work, press-and-hold the power button... Last Sunday was the 2nd or 3rd time this has happened, except that this time the server would not reboot afterwards, showing the following error mentioning the NVMe drives.
[screenshot: boot error referencing the NVMe drives]


Unlike non-booting SATA & SAS drives, NVMe drives seem to have a dangerous ability to crash a system on a whim via kernel panics, NMIs, and/or BSODs (and the like). For example, I have a brand new WD 500GB Black NVMe drive, and it crashes a Windows server just trying to format the device, just like its predecessor, the 250GB version, did...

I got onsite, powered off the server, and removed both NVMe drives. Afterwards, the system would boot but would NOT mount the main pool. I put the NVMe pair back in and the system DID boot and attach the main pool (maybe the complete power off helped, who knows...):
[screenshot: system booted with the NVMe pair reinstalled and the pool attached]


At the time, I thought I'd rather have no SLOG (and take the performance hit) than have a SLOG that could crash the production FreeNAS at any given moment (with all the 'political'/business fallout that comes with downtime...). Simply removing them didn't work, so I figured I needed to remove them from the pool via the WebGUI.

I tried to remove the entire SLOG NVMe pair, but as the screenshot above shows, there's no option to do that. The online manual (https://www.ixsystems.com/documentation/freenas/11.2-U6/storage.html#removing-cache-or-log-devices) says that removing log devices is supposed to be possible...
[screenshot: FreeNAS manual excerpt on removing cache or log devices]


...but when I tried to remove either "nvd1p2" or "nvd0p2", I just got an error saying "Disk could not be removed" & "operation not supported on this type of pool". I tried to Google the error but found nothing recent. I tried offlining the drive and then removing it; same result.
[screenshot: "Disk could not be removed" / "operation not supported on this type of pool" error dialog]


So FreeNAS won't let me remove the NVMe's, but I can't risk another random crash either. Very weird and concerning, especially since the previous FreeNAS (albeit older software, lesser hardware, and a slightly different config) was rock-solid for years of straight use. Earlier that day, I had received (bought for a completely unrelated purpose) a pair of Kingston 500GB A2000 M.2 2280 NVMe Internal SSD PCIe Up to 2000MB/S with Full Security Suite SA2000M8/500G drives. I know these are not ideal SLOG drives, but I had no choice but to use them to replace the previous MyDigitalSSD 240GB (256GB) BPX 80mm (2280) M.2 PCI Express 3.0 x4 (PCIe Gen3 x4) NVMe MLC SSD pair, which (allegedly) had power loss protection. So I shut down the server, replaced one old 'MyDigitalSSD' with one new 'Kingston', powered the server on, and thank God FreeNAS booted and mounted the pool (albeit in a degraded state due to the missing original NVMe drive). From the WebGUI, I used the 'replace' function to replace the NVMe FreeNAS listed as missing. FreeNAS resilvered the mirror successfully and the pool was happy/green. So I shut down the server again, replaced the other old MyDigitalSSD with the other new Kingston, and repeated the replace/resilver process, which worked.
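(For reference, the CLI equivalent of that WebGUI 'replace' dance is a single zpool replace per swap; this is just a sketch with hypothetical names, since the GUI did the real work:)

  # sketch: replace the SLOG member ZFS lists as missing with the new NVMe
  # "Mech_Pool" is our pool; the GUID and device name below are made up
  zpool status -v Mech_Pool                       # note the GUID of the UNAVAIL log member
  zpool replace Mech_Pool 1234567890123456 nvd1   # old member (by GUID) -> new device
  zpool status -v Mech_Pool                       # watch the resilver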

So here I am a few days later. The FreeNAS server has been working seemingly fine, but I'm not thrilled about having to use the current Kingston NVMe's, or about the fact that FreeNAS won't let me remove the SLOG at all. My inability to find a recent article referencing this issue leads me to believe my Google-fu is weak, or that this is a very unusual, rare, and/or possibly isolated issue. I'm still trying to figure out what my next move (when the next maintenance window eventually comes up...) should be:
  1. Should I just leave it as is? (Not liking this option...)
  2. Should I replace the current Kingston NVMe's with another set of NVMe's despite their crazy-high crash risk (I've been having similar issues with NVMe's even in other/unrelated systems lately)? If so, what's the current *affordable* (i.e. $50-$200 each) fan-favorite?
    1. If I go this route, is there a better procedure than 'power off -> replace 1 -> power on -> resilver -> power off -> replace other -> power on -> resilver'?
  3. Should I look for an HP 663280-B21 rear SFF/2.5" drive cage to hang 2 more SFF drives (off the backside) to replace the NVMe's?
    1. Maybe 2 mechanical HDDs (or maybe consumer level SSDs trading speed for endurance...) behind a RAID controller as suggested by jgreco?
      An interesting but unorthodox alternative for SLOG is to use a RAID controller with battery backed write cache, along with conventional hard disks. Normally RAID controllers are frowned upon with ZFS, but here is an opportunity to take advantage of the capabilities: Since the cache absorbs sync writes and writes them as the disk allows, rotational latency becomes a nonissue, and you gain a SLOG device that can operate at the speed the drives are capable of writing at. In the case of a LSI 2208 with 1GB cache, and a pair of slowish ~50MB/sec 2.5" hard drives, it was interesting to note that a burst of ZIL writes could be absorbed by the cache at lightning speed, and then ZIL writes would slow down to the 50MB/sec that the drives were capable of sustaining. With the nearly unlimited endurance of battery-backed RAM and conventional hard drives, this is a very promising technique.
    2. Maybe 2 SSDs w/ power protection, like entry-level enterprise SSDs, or Crucial MX500 consumer-level SSDs, which are known for their "Currently unreadable (pending) sectors" issue... If I go this route, what's the current *affordable* (i.e. $50-$200 each) fan-favorite SFF SSD?
Can a SLOG be replaced with a smaller device or does it have to be equal/larger?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I'm going to need to file a bug report on this. If a log mirror can be created via GUI, it needs to be removable via GUI.

Open a console and run zpool status -v, then issue the command zpool remove poolname log-mirror-vdev-name, and that should remove your log mirror. Sync write speed will nosedive, of course.
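A sketch of what that looks like (the pool name and the "mirror-4" vdev label here are hypothetical; use whatever your own zpool status output shows):

  zpool status -v poolname
  #   logs
  #     mirror-4    ONLINE
  #       nvd0p2    ONLINE
  #       nvd1p2    ONLINE
  zpool remove poolname mirror-4    # removes the entire log mirror vdev in one go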

Now let's dig into your questions.

  1. Should I just leave it as is? (Not liking this option...)
  2. Should I replace the current Kingston NVMe's with another set of NVMe's despite their crazy-high crash risk (I've been having similar issues with NVMe's even in other/unrelated systems lately)? If so, what's the current *affordable* (i.e. $50-$200 each) fan-favorite?
    1. If I go this route, is there a better procedure than 'power off -> replace 1 -> power on -> resilver -> power off -> replace other -> power on -> resilver'?
  3. Should I look for an HP 663280-B21 rear SFF/2.5" drive cage to hang 2 more SFF drives (off the backside) to replace the NVMe's?
    1. Maybe 2 mechanical HDDs (or maybe consumer level SSDs trading speed for endurance...) behind a RAID controller as suggested by jgreco?
    2. Maybe 2 SSDs w/ power protection, like entry-level enterprise SSDs, or Crucial MX500 consumer-level SSDs, which are known for their "Currently unreadable (pending) sectors" issue... If I go this route, what's the current *affordable* (i.e. $50-$200 each) fan-favorite SFF SSD?
Can a SLOG be replaced with a smaller device or does it have to be equal/larger?

1. Definitely don't leave it as-is. Those Kingston TLC drives will toast themselves pretty quickly if they're used as SLOGs, and I can guarantee neither they nor the MyDigitalSSD has the kind of true power-loss-protection for in-flight data that's necessary to make them a high-performance SLOG. They'll be safe since they have PLP for data at rest, just not fast, since they have to commit the writes to NAND rather than being able to treat their DRAM cache as non-volatile.

2. I haven't had the same negative experience with NVMe, outside of needing to disable MSI interrupts when passing them into VMs. The answer here would be "remove mirrored SLOG, power off, remove both, install both, add mirrored SLOG." With regards to the "fan favorite" the winner by a country mile in the world of NVMe is the Intel Optane series, but those aren't exactly cheap. The 100GB P4801X might be closest to your $200 limit but it will still likely be over it.
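(For the re-add step at the end of that sequence, a sketch, assuming the new devices enumerate as nvd0/nvd1:)

  zpool add poolname log mirror nvd0 nvd1    # re-creates the mirrored SLOG on the new pair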
Question - how are you connecting those NVMe devices now? Are you using PCIe slot bifurcation?

3. If you aren't confident in NVMe you can go back to SAS/SATA but you will lose a lot of speed vs. Optane or high-end NVMe devices like Intel P-series. Don't bother with the "HDD behind RAID controller" shenanigans, pick up a pair of Intel DC S3700 in 200GB or larger size and that's about the fastest you can get on SATA. HGST Ultrastar would be an option on SAS, but make sure you get at least the 800M if not the 1600M series.

An SLOG does have to be replaced with an equal or larger device if it's in a mirror; but since you'll be temporarily dumping the SLOG entirely using the CLI commands described at the start of this post, you won't have to worry.
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
1st off, thanks for the assist.

2nd:
Question - how are you connecting those NVMe devices now? Are you using PCIe slot bifurcation?

I'm using passive (with the exception of the activity LEDs) PCIe adapter cards. I originally used the SYBA 2 Port M.2 B Key and 1 Port M.2 M Key PCI-e x4 Adapter Card (model SI-PEX40124), but when those became unavailable, I switched to the StarTech.com PEXM2SAT32N1 M.2 Adapter - 3 Port - 1 x PCIe (NVMe) M.2 - 2 x SATA III M.2. Both cards take a PCIe x4 slot and (relatively passively, though technically the board has more than just wire traces on it...) convert it to an M.2 PCIe slot. Both cards also have a pair of M.2 SATA slots that terminate to conventional SATA ports, which are in turn connected to the onboard B120 SAS/SATA controller running in AHCI mode (as opposed to 'fake RAID' mode...). This is how I get two NVMe drives and four M.2 SATA drives into these HP DL380e G8's when said servers typically don't house either drive type. The twelve LFF bays across the front house the mechanical drives used in Mech_Pool, which is the pool in question.

If push comes to shove, I can try to add that HP 663280-B21 part to hang two more SFF drives off the back of the chassis, then connect them to either the onboard controller already mentioned (supports up to 6 drives) *or* to the front backplane (connecting the twelve front-panel drives), which in turn goes to a pair of HP SAS9217-4i4e HBAs.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
The "passive" NVMe M.2 to PCIe cards should be fine; it was mostly trying to determine if there was anything else that would have interfered with your system's ability to talk to them (such as slot bifurcation being in play and maybe needing to be specified manually in the BIOS/EFI setup.) Have you checked for a BIOS update for your HP? There might be some PCIe bugs that got ironed out.

I assume nothing obvious is showing in the iLO event log as far as sensors/memory errors/etc?
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
The firmware *should* be the latest. I say "should" because I currently cannot get into my iLO for reasons unknown, which is concerning but possibly as harmless/innocent as a misplaced Cat5 cable amid all the connecting/disconnecting as of late.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
See if you can rectify that and check into your overall hardware health first; let's eliminate the easy stuff before we spend a lot of time spinning our wheels elsewhere.

Speaking of easy stuff - have you replaced the SmartArray RAID card in the HP, or forced it into HBA mode?
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
I'll try to visit the datacenter this weekend to check that out.

As for RAID controllers, the onboard B120 is set to AHCI via the system BIOS and does not have the optional battery/capacitor-backed flash module installed. As far as I know, the HP SAS9217-4i4e HBA is just a rebranded LSI 9207 HBA. The ones I have are flashed with IT-mode firmware, as opposed to IR mode.
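(If it matters, the IT-vs-IR firmware can be double-checked from the console with LSI's flash utility - assuming the sas2flash binary is present, which I believe it is on FreeNAS:)

  sas2flash -listall    # lists each SAS2 controller; the firmware product ID shows IT vs IR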

Full disclosure: there are two such HBAs in the system, one on PCIe lanes from CPU1 while the other terminates to CPU2. Each HBA has one internal port that terminates to the (same) front-panel backplane (which has 2 uplinks), and one external port that terminates to an independent controller on an external D2700 enclosure loaded with 7 Crucial 1TB MX500 SSDs, which are not currently in a pool (sitting idle for the time being). I did this for controller fault tolerance, but I'm wondering if it really works that way, because during POST one HBA seems to own most of the drives while the other HBA has the rest; but once past POST, one HBA appears to have all the drives and the other gets none, which also limits all the respective drives to one 4-lane SAS cable instead of two.

Then again, SAS seems to be really weird about multi-uplink SAS backplanes and SATA drives. On a nearby Windows DL380e G8 with 12+2 drives (12 in front, 2 in rear) off a P820 RAID controller, the HP utility still says all 12+2 are off port I5 (x)or I6 (not both), even though both are connected to the same backplane. I thought it would be smart enough to put some drives on one SAS uplink and the rest on the other, then move between the two if a link failed, but maybe that only works with dual-ported SAS drives.
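(A quick sanity check from the FreeNAS console - just a sketch of what I'd run to see which controller currently owns each disk and whether any multipathing is actually active:)

  camcontrol devlist    # lists every disk and the bus it currently hangs off
  gmultipath status     # lists active multipath geoms; empty output = no multipathing in use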
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
I'll be the first to admit ignorance about the Intel Optane drives. I've ignored them in the past because I don't understand all the fuss about them, so please bear with me...

With regards to the "fan favorite" the winner by a country mile in the world of NVMe is the Intel Optane series, but those aren't exactly cheap. The 100GB P4801X might be closest
NewEgg Marketplace (I prefer NewEgg themselves over their marketplace vendors, but beggars can't be choosers...) has this listing: Intel Optane DC P4801X 100 GB Internal Solid State Drive - PCI Express - M.2 22110 @ $281/ea.
At $281/ea, it's painful on the budget but might be workable. The lack of product info/details doesn't inspire confidence, but I'm willing to take your word for it.

Optane or high-end NVMe devices like Intel P-series
I also see this Intel Optane SSD 800P Series (60GB, M.2 80mm PCIe 3.0, 3D XPoint) - SSDPEK1W060GAXT @ $118/ea, and from NewEgg themselves.
The price is right, and I might be able to swing the $219/ea price tag of the 118GB model if 58GB isn't enough. It's also an Optane and a P-series, as you suggested.
The thing that gets me is that there's no mention of power protection, so I assume this drive does not have any. I'm also at a loss when it says "Max Sequential Read Up to 1450 MBps" & "Max Sequential Write Up to 640 MBps", which doesn't seem very fast at all for NVMe and is why I've skipped over them in the past for SLOG & other applications. On the other hand, things like "Ultra-low latency for exceptional responsiveness", "Performance saturation at queue depth of 4 and lower" (which I hear is the main reason it makes a good SLOG device, if the claim is true), and "Very High Endurance Capabilities" (another good feature in a write-intensive device) sound pretty good to me for this application.

pick up a pair of Intel DC S3700 in 200GB or larger size and that's about the fastest you can get on SATA.
This Intel DC S3700 Series 2.5" 400GB SATA III MLC Internal Solid State Drive (SSD) SSDSC2BA400G3ES @ $126/ea is another 'Marketplace only' option (which is slightly worrisome; it tells me this is an older part that may be hard to source real soon).
The $-per-GB is very competitive in my book. Speeds of 'Max Sequential Read Up to 500MB/s, Max Sequential Write Up to 460MB/s' are in line with what I'd expect from a SATA drive. And these do specifically mention "Power loss data protection" & "Multi-Level Cell with High Endurance Technology", which is great. The only drawback here is then having to source that rear SFF cage (I already have spare caddies), which runs $200-$300 at the moment...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Pardon the delay. Weekends, you know. ;)

I'll be the first to admit ignorance about the Intel Optane drives. I've ignored them in the past because I don't understand all the fuss about them, so please bear with me...

NewEgg Marketplace (I prefer NewEgg themselves over their marketplace vendors, but beggars can't be choosers...) has this listing: Intel Optane DC P4801X 100 GB Internal Solid State Drive - PCI Express - M.2 22110 @ $281/ea.
At $281/ea, it's painful on the budget but might be workable. The lack of product info/details doesn't inspire confidence, but I'm willing to take your word for it.

Here's a post in the SLOG benchmark thread from a user with that exact card.


Just bear in mind that the P4801X is a 110mm M.2 card, so while the Syba M.2 to PCIe adapter may fit it, the StarTech one seems to be limited to 80mm M.2 size.

I also see this Intel Optane SSD 800P Series (60GB, M.2 80mm PCIe 3.0, 3D XPoint) - SSDPEK1W060GAXT @ $118/ea, and from NewEgg themselves.
The price is right, and I might be able to swing the $219/ea price tag of the 118GB model if 58GB isn't enough. It's also an Optane and a P-series, as you suggested.

Sorry for my lack of clarity - I mean the "DC P-series" as in the "DC P3700" or "DC P4801X" which is how the "datacenter" drives are denoted by Intel. Notably they have both the write endurance and the official validation from a warranty perspective to be run in a server setting. If you run a consumer Optane drive (like the 800P, or even the faster 900P/905P) in a datacenter, Intel won't officially warranty it any longer as that's not considered a supported environment.

The thing that gets me is that there's no mention of power protection, so I assume this drive does not have any.

Optane cards, by design, don't cache their writes in DRAM like other SSDs do; they write straight to the 3D XPoint media. This means they have "built-in" power loss protection, in a manner of speaking. The P-series drives get the "enhanced PLP" feature, but Intel is a bit cagey about what exactly that means:

some Intel rep said:
As an enterprise part, the Intel® Optane™ SSD DC P4800X offers multiple data protection features that the Intel® Optane™ SSD 900P does not, including DIF data integrity checking, circuit checks on the power loss system and ECRC. The DC P4800X also offers a higher MTBF/AFR rating.

They're clearly intending you to use the P-series cards for datacenter workloads, probably to avoid the cheaper consumer cards cannibalizing sales of the DC P-series ones.

I'm also at a loss when it says "Max Sequential Read Up to 1450 MBps" & "Max Sequential Write Up to 640 MBps", which doesn't seem very fast at all for NVMe and is why I've skipped over them in the past for SLOG & other applications. On the other hand, things like "Ultra-low latency for exceptional responsiveness", "Performance saturation at queue depth of 4 and lower" (which I hear is the main reason it makes a good SLOG device, if the claim is true), and "Very High Endurance Capabilities" (another good feature in a write-intensive device) sound pretty good to me for this application.

The big hero numbers used for advertising and marketing are the "best case scenario" for these drives, which you'll see with big, sequential, asynchronous workloads at high queue depths. That's pretty much the polar opposite of an SLOG workload - smaller, synchronous writes at low queue depths (QD1, really) - so the "ultra-low latency" and "performance saturation at low queue depths" are what make Optane so appealing for an SLOG device.
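If you want to see that gap yourself, test at QD1 with sync writes instead of trusting the box copy. A sketch with fio (not installed by default, and the target file path here is just an example - the test file will be written to):

  # roughly the SLOG pattern: single-threaded, 4K, synchronous, queue depth 1
  fio --name=slogsim --filename=/mnt/tank/testfile --size=1g \
      --rw=write --bs=4k --ioengine=sync --sync=1 \
      --runtime=30 --time_based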

This Intel DC S3700 Series 2.5" 400GB SATA III MLC Internal Solid State Drive (SSD) SSDSC2BA400G3ES @ $126/ea is another 'Marketplace only' option (which is slightly worrisome; it tells me this is an older part that may be hard to source real soon).
The $-per-GB is very competitive in my book. Speeds of 'Max Sequential Read Up to 500MB/s, Max Sequential Write Up to 460MB/s' are in line with what I'd expect from a SATA drive. And these do specifically mention "Power loss data protection" & "Multi-Level Cell with High Endurance Technology", which is great. The only drawback here is then having to source that rear SFF cage (I already have spare caddies), which runs $200-$300 at the moment...

The S3700 is old tech, yes, but it was about the fastest you could go on the SATA bus for write latency. There's no real successor because the market shifted to NVMe for write-heavy devices such as the DC P3700. You might be able to find a SAS option like the Toshiba PX04S or HGST SSD1600M, but they will likely be more expensive (and possibly still slower) than the Optane P4801X or other NVMe choices.

$ per GB isn't important in the SLOG game, since only a very small slice of the drive is ever used - you want to look at $ per IOPS and $ per TB of write endurance. Bigger drives tend to get more of the IOPS and the write endurance, but high capacity should be considered a "side effect" of getting the IOPS and TBW you want, not a goal unto itself.
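As a purely made-up illustration: a $200 drive rated for 1,000 TBW works out to $0.20 per TB of endurance, while a $100 consumer drive rated for 100 TBW works out to $1.00 per TB - five times the cost by the metric an SLOG actually burns through, despite the lower sticker price.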

And regarding the speeds they're advertising, we already reviewed the idea of marketing's numbers vs. real-world SLOG results. :)

Simple table of MB/s below, using results from diskinfo -wS:

Write Size (KB) | Intel DC S3700 200GB | Intel Optane P4801X 100GB
4               | 42.2                 | 167.1
8               | 77.6                 | 257.0
16              | 127.9                | 354.3
32              | 198.9                | 483.5
64              | 272.6                | 602.7
128             | 288.2                | 776.1
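(If you want to reproduce these numbers: the write/sync test is destructive to data on the target device, so only point it at a disk that isn't part of a pool. The device name below is just an example.)

  diskinfo -wS /dev/nvd0    # -w = write test (destructive), -S = synchronous writes; reports throughput per write size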
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
Pardon the delay. Weekends, you know. ;)
Yeah, I can respect that. Hell, I'm BARELY getting to my data center to check things out. The good news is that the iLO was simply unplugged, probably from all the pulling of the server in and out of the rack while I was juggling NVMe's.

I checked the logs and there are no major or obvious errors. The iLO firmware is apparently one version back due to an update released about a month ago, but the BIOS is up to date.
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
Minor Update: 1 step forward, 2 steps back.

I ordered (last week) a pair of Intel Optane DC P4801X 100 GB drives, along with a pair of "PCIE M.2 SSD Adapter - M-KEY/B-KEY/ SATA7P Interface - SATA15P Power Supply - Support PCIE X4/ X8/ X16 Slot - Support 2230/2242/ 2260/2280/ 22110 (Red)" cards, since neither of my old adapters *actually* fits 22110-length devices and this was the only one I could find that can still hold 2 additional SATA devices. I'm still awaiting the arrival of the new NVMe's, which won't show up until the day I leave town, so I won't be able to try anything until at least the week that follows.

Hopefully that solves this particular issue; but I've found myself in an 'out of the pan and into the fire' situation. Those twelve "Seagate Exos Enterprise Capacity 3.5'' HDD 10TB (Helium) 7200 RPM SATA 6Gb/s 256MB Cache Hyperscale 512e Internal Hard Drive ST10000NM0016" drives (in this server; there's another 12 in its backup server) are showing the signs mentioned over in the thread "LSI (Avago) 9207-8i with Seagate 10TB Enterprise (ST10000NM0016)".

I sent an email over to Seagate about the apparent firmware issue and the lack of any updates, and am waiting for a reply. In the meantime, I'm debating whether I should swap the SAS controller pair for something newer (probably not helpful, since the 9300's seem to have the same issue and I can't find any non-LSI recommendations), or rebuild the FreeNAS server as 9.10 and try to migrate the pool over to that. I see that the 11.3 beta is out, but it doesn't seem to address any of the issues I have, including the "Error 'Initiator not found in database' when creating/editing iSCSI targets after deleting Initiators" bug.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Good to hear about the M.2 adapters. Direct some airflow at the Optane cards though, they can get toasty under sustained workload in that smaller form factor, and being double-sided means it isn't as easy as "slap on a heatsink."

Unfortunately the FLUSHCACHE issue with the bigger Seagate drives doesn't seem to have a reliable fix on FreeNAS yet. A couple of posts seem to indicate that switching to a Linux-based system has resolved things; I'm not sure if anyone's reported back on anything Illumos-based, but it seems to point to something in FreeBSD, the FreeBSD mps driver, or similar not playing nicely with the firmware on those drives. Some users did report better luck with FN 9.10 and the earlier P16 firmware, but I'm not sure if regressing that far back on the update chain is going to be better or worse (in terms of effort) than taking a stab at a different ZFS-based solution.
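In the meantime you can at least keep an eye out for the symptom. Assuming the errors land in the kernel log the way other mps timeouts do (the exact strings here are a guess on my part), something like:

  dmesg | grep -iE 'flushcache|synchronize cache'    # look for cache-flush command errors against the Seagates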
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
This issue was (mostly) resolved ages ago (I forgot to come back and say so). We replaced the NVMe's on both servers with Intel Optane DC P4801X 100GB devices, and that has worked pretty well. We're still getting occasional complaints about the 10TB mechanical HDDs, but I'm *hoping* that upgrading to FreeNAS-11.3-U2.1 (or better) will resolve that. Now on to replication issues, which is another issue for another thread...

Thanks for all the help @HoneyBadger !
 