Seagate IronWolf 10TB (ST10000VN0004) vs LSI IT firmware controllers

Quindor · Sep 8, 2019

I've been looking for the right place to post this and I believe this is it.

To be brief, I've been working on a YouTube series where I'm building a 100TB ZFS based server. Not all Enterprise grade hardware (it's for at home) and not using FreeBSD/FreeNAS (I have some different requirements) but I still hope this is the right place.

The Problem:
Since finishing the server and bringing my 8x Seagate Ironwolf 10TB pool online I've been seeing odd behavior. I have 2x LSI SAS2008 based IT firmware cards in my system and each card has 4x IronWolf 10TB disks attached. While copying my data I would occasionally see a write error on one of the disks which the LSI card would reset and it would continue happily like nothing happened.

Later, during more testing I would see a very sparse but still happening read, write or CRC error on the pool. I tried rebuilding the pool, new cables, etc.. but everything was good and should work perfectly. Although a disk would sometimes give an error the disk itself had no recollection of this in SMART, only ZFS would mention something went wrong and "dmesg" would also fill up with errors.

Finally a solution?:
Now the point of this topic, digging deeper it turns out Seagate released a firmware update for the ST10000VN0004 and ST10000NE0004 last month. They bumped from firmware SC60 to SC61 and in that topic it's stated that this is because of "flush cache timing out bug that was discovered during routine testing" in regards to Synology systems.

As it turns out, write cache (and I believe internally NCQ) had been turned off for these specific drives in Synology systems for a while now because of "stability" issues. Since this firmware update it gets turned on again and all is well.

That got me thinking, if a Synology is having this issue, maybe this was more disk firmware related then anything else. So, since I still have all my data on other drivers anyway I went ahead and flashed all my 8x ST10000VN0004 from SC60 to SC61. This worked without a problem and even a ZFS Scrub found no issues with the data still on there.

But... I was able to finish a scrub twice of 20TB on the pool now without a single read, write or CRC error. I've been hitting the drive with TBs of DD and Bonnie++ and not a single error anymore. So this might actually be a fix for topics like this one I found and this one.

Needs more testing:
I want to bring out a video on this issue and the potential fix for it (and help people how to do it) but I kind of need it better tested to be sure this fixes the issue.

Who here is still having this issue and is willing to test the firmware on their drives and report back? I have no clue about potentially voiding your warranty or anything like that, only that it's maybe a fix the issues these drives have been having!

Firmware:
The original Synology topic I found this information

Firmware for IronWolf 10TB ST10000VN0004
Firmware for IronWolf Pro 10TB ST10000NE0004

--update
Video and article released!

Ericloewe · Sep 8, 2019

Interesting information, thanks for sharing! Hard Drive firmware is quite an unpleasant land in many ways.

Quindor · Sep 8, 2019

Ericloewe said:
Interesting information, thanks for sharing! Hard Drive firmware is quite an unpleasant land in many ways.

Oh yes, I fully agree and that's why I'm being careful before releasing more information "publicly an mass". But I have lots of these drives and in some systems they are running fine (mostly motherboard ports) and on others they give random read, write or CRC errors, changing cables, controllers, controller firmware, etc. makes no difference whatsoever.

I've pulled out a drive and subjected it to Seatools or a SMART long test, 0 errors, even using HDAT2 doing a full write, read, checksum, nothing wrong but put it back in some systems, CRC errors after a while. I've been pulling my hair out as to why for a while now, I actually really love the drives but it's just in some systems that it causes issues.

So if there are others out there who have been facing this issue, specifically with the 10TB model, I'm hoping this new firmware fixes it! They don't have it available on their official download site (yet). So just my 8 drives in this system and 1,5 day with 2 scrubs (20TB) is too little data to go on so I'm hoping more people (who are having these issues) are willing to also test. :)

El Al · Sep 8, 2019

Sadly no update yet for the 14TB Ironwolfs.

Quindor · Sep 9, 2019

El Al said:
Sadly no update yet for the 14TB Ironwolfs.

Are those experiencing a similar issue? Do you have a thread or more info you could link to?

El Al · Sep 9, 2019

Quindor said:
Are those experiencing a similar issue? Do you have a thread or more info you could link to?

https://www.ixsystems.com/community...cache-command-timeout-error.55067/post-542371

I used to experience the described issue until I disabled NCQ on my drives. Issues occured with freenas 9.3 and 11.2.
I have been error free since then.

If you scour the internet you will find some more people on reddit for example that have run into synchronize cache (read,write) issues.

Quindor · Sep 9, 2019

El Al said:
https://www.ixsystems.com/community...cache-command-timeout-error.55067/post-542371

I used to experience the described issue until I disabled NCQ on my drives. Issues occured with freenas 9.3 and 11.2.
I have been error free since then.

If you scour the internet you will find some more people on reddit for example that have run into synchronize cache (read,write) issues.

Aah, ok. I did read those (if not all) topics but thought the issue was mainly related to the 10TB model because there where reports of the 12TB not having the same issue. If you are saying you are having the issues with a 14TB it might be a bit more wide spread then I thought.

I think it was because of you I read the NCQ part and that you need to script it while booting, I didn't test with that long enough to see if that solved the issue but that did make the idea or hunch that it should be fixed in disk firmware stronger. So when I found one (through non-official ways, why isn't there anything in their knowledge or download page....) I wanted to try that first since I still have backups of all my data right now. So if the firmware screws anything up, none is lost.

supercoolfool · Sep 12, 2019

Quindor said:
To be brief, I've been working on a YouTube series where I'm building a 100TB ZFS based server. Not all Enterprise grade hardware (it's for at home) and not using FreeBSD/FreeNAS (I have some different requirements) but I still hope this is the right place.

Well, thanks for the update! I actually have your video bookmarked in my Pocket, just haven't had a chance to watch. I did peruse some of you other videos on the topic as well.

This topic discusses the swap maybe needing to be flushed:

https://www.ixsystems.com/community...rashes-ironwolf-pro-10tb-st10000vn0004.59451/

I just gave up on Freenas with these drives last week and have been playing around with Storage Space as an alternative, as I'm a complete novice on Linux. I built my Freenas box in Nov 2017, and have had numerous issues with these drives, especially on heavy writes. I've tried different controllers, cables, power supplies, etc....

I even had to RMA one drive as the bad sectors kept piling up. Not sure if it's related though. Seagate was nice enough to send me the wrong drive and then charge me for an advance replacement I never got. But that's another story....

Long story short, as I have all my data backed up, I'll update the firmware on my drives and give freenas another go. Storage Spaces write speeds are terrible.

I'll update ASAP, but may not be until after the weekend.

Thanks again!

Quindor · Sep 12, 2019

Awesome @supercoolfool let us know!

In the meantime I've been torturing my setup, another full scrub of 20TB on the 8x 10TB mirror pool, writing and reading TBs of data, all is fine and well and everything is staying 0, 0, 0 as expected ever since the firmware upgrades! I'm starting to trust the pool, I know that's tricky but with how easily I was able to trigger it before, it's at least looking good! :D

Performance also seems unchanged from before. I get about 1.8GB/sec read out of the pool and 800MB/sec write which is in line with what I was getting before the upgrade from SC60 to SC61 and also inline with what I'd estimate a half full drive x4 would deliver.

I have a friend who has been having some random issues with these drives over the past year and he's going to try and firmware upgrade them this weekend and then monitor if it still occurs or not.

All in all, looking good but keep the reports coming in (and spread this topic in places I didn't find!).

supercoolfool · Sep 12, 2019

I can give a mini-status update for now. As I had to RMA one of the drives, the one I got back from Seagate is not the same model as the 5 other 1ZD101'S. The new drive's model is 2GS11L. I've had that drive since mid-July of this year, so fairly recent. (The previous drive accumulated too many bad sectors and would fail Seatools diagnostics.) Seeing as one of my drives is different, I'm not sure how accurate my experience will be.

My config is a 6x10 TB on a Raid6.

I tried desperately to get the same model from Seagate RMA. I'm still dealing with that, believe me! They state they never got my advance replacement back even though I have all the shipping confirmations, and three weeks ago billed me 600$. Good times...

Back to the drives. I completely forgot that when I abandoned Freenas I also changed the drive order in my case. I mistakenly attempted to flash the firmware to the 2GS11L, luckily the utility prevented that. The other 5 drives have all been upgraded to 61 while the 2GS11L is still on 60. I swear, if I wasn't bald already...

I'm still a total newbie to Freenas. I really wanted to learn it but this box has been so unstable it's nerve wracking. Takes all my effort just to keep it stable and once you do, you don't want to even breathe on it! I've since built a 2nd box out of a used Dell vostro, a core-i3 with 8 GB ram and 4x4TB in a raid5, that one runs beautifully, no errors, totally rock solid.

As I'm still learning Freenas I don't really know how to stress it hard. I'm currently throwing a chunk of 5.5 TB of data at it, but all I have is my gigabit LAN. It's saturated steady at 113 MB per sec and has been for the last couple hours, says it will take another 14 hrs. So far so good.

If you have any suggestions for me on how to really stress it, like doing a burn-in which I've never done, I'm game.

supercoolfool · Sep 13, 2019

Well, so far so good. Copied just over 5 TB of data to the pool with no issues. 15+ hour transfer stayed solid at 113 MB per second the entire time. In total I have another 25 TB to load back on, gonna take a couple of days as the data is spread out over 17 single drives.

Hopefully I can get a copy of all my data back on, then I'll start some scrubs.

supercoolfool · Sep 17, 2019

I've loaded approx 26 TB onto the pool over gigbit ethernet. The first 2 drives I restored may data from were 6 GB drives, those transfers each took about 14 hours.
Speeds stayed steady at about 113 MB/sec for the transfers.
I've not run into any issues at all at so far.
My data has now been loaded back on and I'll start scrubbing...

supercoolfool · Sep 21, 2019

At this point I've run a couple of scrubs, they take approx 20 hrs to run. No issues yet, even while writing to the pool. I'm reticent to declare any kind of victory yet though, but I do have high hopes.

My next test will be to put the drives back on my LSI 9211-8i controller. My 9211 is in IT mode and flashed to the most recent BIOS, but having the drives connected to that controller produced numerous CAM status and timeout errors. I purchased a 2nd set of SAS cables, same issue. It made the server so "unstable", I had to abandon the SAS card and go back to the on-board sata ports.

The drives are currently connected to the "intel" sata ports on the motherboard, a Gigabyte GA-Z68X-UD3H-B3, rev 1.

The board has 5 sata ports + 1 esata port controlled by the chipset. (There's also 2 Marvel controlled SATA ports for a total of 8, they're not is use right now, I'm booting from USB.)

I'm using an esata to sata adapter cable on the last drive in the array. Less than ideal that's for sure, but the pool has been far more reliable with this configuration than when I was using the SAS controller. I've also tried using a mix of the onboard Marvel and Intel controlled sata ports, this also produced CAM status issues. I've not had that config in over a year so memory is a little foggy.

Quindor · Sep 22, 2019

@supercoolfool sounds like a good plan. Good to hear the problems you are having in your current setup seems to have at least decreased (or gone away at this point), moving back to the LSI controller is a good step at this point.

Although I don't recognize the CAM errors I do recognize the timeout issues. Very interested to see if those are now gone with the new firmware. I can imagine you hating having that one "odd ball" drive in the pool, sucks they sent you a different model through RMA. Although specifications are probably "close enough" they are never going to be 100% the same. :(

From my side, my pool hasn't given me a single error since running on the new firmware (and the initial scrub after that). I've moved back all the data now (additional 8TB I had stores on all kinds of other computes), live edited my 2 last videos from the server and also run 2 more scrubs just to make sure, 0 errors whatsoever in zpool status, and zpool events is also 100% clear of issues.

So I'm pretty much convinced this issue has been fixed and will start working on my video and article about it to give it a bit of publicity so that other people who are facing this issue finally have a fix for it! Do keep the updates coming though, I'd like to keep this topic as the central location for this issues! :D

supercoolfool · Sep 23, 2019

Oh Quindor, the things I do for you... :)

Yeah, not super thrilled about the one-off drive in the array, but at least Seagate found the drives I RMA'd and has refunded me the 600$.

I put the LSI controller back into my server and during my first boot I got my CM status errors, here's a small sample taken from my syslog:

192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) CAM status: CCB request completed with an error
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) Retrying command
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) READ(10). CDB: 28 00 e0 40 6a c8 00 01 00 00
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) CAM status: CCB request completed with an error
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) Retrying command
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) READ(16). CDB: 88 00 00 00 00 02 2b 95 f3 00 00 00 01 00 00 00
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) CAM status: SCSI Status Error
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) SCSI status: Check Condition
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
192.168.5.10 Sep 23 11:50:37 freenas user notice (da4:mps0:0:5:0) Retrying command (per sense data)

The error is repeated about another 20 times in the log. I'm getting it on da4 now on this particular bootup, but historically I've gotten it on other ports/drives. You'll hear a noise from the drive in question right before the error. It sounds like the drive is "synching" like on initial BIOS detection.
Not on every boot and not always the same drive.

That synching noise makes me think it's the power supply, but since these drives are so flaky I'm not 100% sure. I've tested every port on my power supply with my multimeter, all voltage is right where it should be. Works fine in another PC, but I can't really duplicate the same usage scenario to really troubleshoot.

The power supply is a Cooler Master RS-550-PCAR-N1 power supply rated for 550W. More than ample enough for a Core i5 2500K with 24 GB non-ecc memory and 6 sata drives. It has 3 Molex power connectors on one cable and 6 SATA power connectors, on 2 separate cables. The drives are right now powered using the 6 SATA connectors.

But I discovered this in my troubleshooting months ago: If I don't use the 2nd set of SATA power cables, but instead use the 3 molex connectors with some cheap molex to sata power adapters, there's almost no chance I'll get abnormal noises from the drives or any more CAM status errors. So do I have flaky drives AND a flaky power supply? Every way I can test this power supply, it checks out.

I've been reticent to go out and by another power supply just because this is so odd. I have other powers supplies, but none that are suitable/powerful enough to run the server. There are plenty of times where I COULD have all the drives on the SATA connectors just fine, other times there's just no way it could boot, too many CAM status errors.

I'm going to put the drives on a separate power supply and keep testing.

SuperSpy · Sep 23, 2019

I seem to be having the exact same issue, but with the 8 TB ST8000NE0004 drives.

My current server setup:
Xeon E3-1230 v6
64GB ECC DDR4 2133
Supermicro MBD-X11SSM-F-O
2 x LSI Logic SAS 9207-8i attached to the CPU PCIe slots
12 x 8TB Ironwolf Pro ST8000NE0004 (2x6 stripe of RAIDZ2)
8 x Intel 545s 512GB (4x2 striped mirror)
2 x Intel 545s 128GB (mirror for boot volume)
Intel X520-DA2 dual SFP+ card in the bottom PCH slot

I initially had it in a Rosewill case with a hot swap backplane, but swapped that out as part of the troubleshooting process for the Supermicro case, thinking maybe my backplane was the issue, since it would affect drives randomly, and only the 3.5" drives are in the hot swap bays (the SSDs are all in a custom rack on the inside of the case).

I've been experiencing the same issue with drives randomly dropping out of the array, and if I restart the machine, or offline/online the disk (then do a zpool clear to zero out the error count) it will run fine for another day or 4, then another sometimes same, sometimes different drive will drop out for the same reason. Digging through logs will turn up the same SYNCHRONIZE CACHE(10), timeout, retry, sync/timeout/retry again, then 'retries exhausted' log entries.

Is there a firmware update available for the 8 TB drives? Checking on the Seagate website doesn't yield any updates, and the release notes for the firmwares linked above seem to be explicitly targeting the 10 TB model.

vjayer · Sep 23, 2019

Hi, total noob here about to start my journey into FreeNAS, but I am particularly concerned about the bigger Seagate Ironwolf line and FreeNAS.

I actually saw the Synology thread update before finding this and have a few questions:
The Synology workaround -- and presumably general linux workaround -- is to disable the write cache for the drives. Later apparently they narrowed it down to the 10TB model only, for which Seagate has issued the firmware update. I have the 12TB and 14TB models and AFAIK all the 12TB on the market are still on SC60 (haven't checked my 14TB yet; still testing my 12TB drives)
- but while the Synology (and linux?) seem to be OK with the > 10TB, the posts and threads referenced still seem to indicate issues with the larger drives regarding timeouts. Is that right?
- is this limited to the LSI HBA and these Ironwolves? or specifically the FreeBSD LSI mptsas/mrsas drivers?
This seems to be my impression... that it's more of an issue with general Seagate drives and LSI freebsd driver (or not..)
- does it occur with other HBAs. like Marvell, or the onboard mobo SATA controller?
- does this occur with Exos models, which were previously branded as Enterprise? or Barracuda?

question for @Quindor
- you mentioned in your video that you're using ZFS on linux with Ubuntu. Have you tried other, particularly larger Ironwolf models?
Since Synology doesn't use LSI HBA, and AFAIK just a combination of onboard SATA + Marvel or whatever cheaper HBA like ASmedia, (plus port multiplier for eSATA expansion units), I wonder if the same issue would also still occur on non-10TB Ironwolf on Linux with LSI HBAs?

Firmware SC61 seems to work for the 10TB so far on FreeNAS. But I don't have the 10TB model hence my questions.

supercoolfool · Sep 26, 2019

@Quindor, I've now run 2 full successful scrubs of the pool on the LSI controller without any issues.

24TB of data takes about 10 hours and 40 odd minutes to scrub.

I'm running a 3rd scrub now. I'm not familiar with DD or Bonnie++, but I'm reading up on them and those will be next.

Quindor · Sep 28, 2019

Just to report in, I'm still 100% issue free since the firmware update, all drives have remained 0 errors whatever I do with them, including a week of uptime or multiple reboots, no issues whatsoever anymore!

@supercoolfool Interesting, you seemed to be having multiple problems maybe, the firmware and your power supply? I at least did not have the CAM issues you mentioned. Are you using any chassis in between the drives? I noticed that the 10TB drives would actually wait for a spin-up command from the LSI controller while in my hot-swap chassis while other drives would just start as soon as there was power. Maybe something is going wrong in that regard with your CAM errors?

Your latest message says you've been able to complete lots of scrubs and stuff like that. Is that using the LSI controller still but now using the Molex-SATA cables as you mentioned? It's very weird that that makes a difference and does suggest there might be something wrong with your PSU or Cable.

@SuperSpy Interesting you are having the same issues, most you read online had to do with the 10TB drives only. You could always call Seagate support and ask for the firmware, you might have to be a bit persistent but at least for the 10TB it solves the known issues. At least the " SYNCHRONIZE CACHE" seems very similar. You could try the trick mentioned in some other topics and turn off NCQ, that also seems to "fix" the issue, that way you could test if the new firmware would fix it for your drives also!

@vjayer Some quick answers.
- From what I read it's mostly the 10TB models, although above @SuperSpy is suspecting his 8TB models
- This is not limited to LSI only but it does show up in a lot of those combinations it seems, more so then with motherboards from what I've been able to gather
- Unknown is it happens with those drives too, I haven't found any reports like that. Drive firmware updates are a lot more common in the Enterprise though so maybe it did and they issued an update long ago
- Issue is not limited to FreeNAS since I use Ubuntu with ZOL based systems

I have about 16x 10TB models all running with ZFS and have only started experiencing this issue in my current server build. I have an array of 5x10TB (jmicron external enclosure) in another server and another with 2x10TB (Intel mobo ports) in a different one, both running for 1,5 years now, 0 issues, with monthly scrubs and everything. I also have a 4x12TB server (AMD mobo ports), 0 issues.

So the problem is very hard to tackle. As said, I've been running these drives for years, 0 issues, and suddenly with this build I encounter them. I still have no clue what's different about this build vs the others. Just very glad the SC61 firmware seems to fix it. :)

supercoolfool · Sep 29, 2019

@Quindor I'm about as close as I can get to giving these drives a clean bill of health, but I've been struggling with them for 2 years now. I still need time! :)

I've narrowed my power supply issues down 1 of the 2 sets of SATA cables being unreliable. They seemed to cause the CAM status errors in this instance. As I stated earlier, they check out with my multimeter, but the only time I've had issues since the new firmware is when I was using them.

I'm not using any backplanes or hot swap cages, all drives are connected directly by SATA or SAS cables.

Here's what I've done since upgrading the firmware:

- Loaded 25 TB of data over 3-4 days via gigabit, speeds stayed steady at about 113 MB/sec.
- Scrubbed the pool 3 times, taking approx 11 hrs to complete each time.
- All completed flawlessly with no issues. Drive were connected to the motherboard's Intel SATA ports.

I then tried to use the suspect set of SATA cables and connected the drives to the LSI controller. Got CAM status errors during bootup. Rebooted, CAM status errors again. Stopped using suspect set of SATA power connectors.

With the drives still connected to the LSI controller, I scrubbed the pool another 3 times with no issues. Scrubs again took about 11 hrs each time.

The drives are currently on the LSI contoller.

So, at this point I've loaded 25 TB of data back on and scrubbed it 6 times on 2 different controllers, all with 0 issues. Not had a chance to try out DD writes or Bonnie ++ yet.

Hope that helps, Quindor. For me, I still need a little time to be sure. This is not the first time that the array has seemed to behave no matter what I do to it. I've had extended times of trouble free operation in the past. I'll now start to use this box as my daily driver, RMA my power supply and keep my fingers crossed.

Important Announcement for The TrueNAS Community.

Seagate IronWolf 10TB (ST10000VN0004) vs LSI IT firmware controllers

Dabbler

Server Wrangler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Cadet

Cadet

Dabbler

Dabbler

Dabbler

Similar threads