LSI (Avago) 9207-8i with Seagate 10TB Enterprise (ST10000NM0016)

krazos · Dec 5, 2017

Chris Moore said:
Has anyone contacted Seagate about the issue?

I have emailed them this thread and some more info about this problem and asked them to investigate this.. But I haven't got an answer yet. I might reclaim my disks and go for WD instead. Well, unless they can fix this asap.

But I still wonder if it might not be the BSD driver instead, because it worked better on Linux.

krazos · Dec 7, 2017

Update:

They answered my email and required some more information about what server I have, what OS I'm running, error messages, what raid configuration and so on, and when I gave them an answer to it I simply got:

"I can forward the details on to our team for review and testing for possible firmware updates. However, this is not a fast process and we are not able to guarantee any specific results. We do not have an estimated time frame as this team does not reply to us."

So yeah.. I don't know. I guess I'll have to reclaim my disks and go for another vendor.. because I simply cannot wait for long, I need to get this server up and running asap.

cobrakiller58 · Dec 13, 2017

Have any of these drives been tested on SATA ports to see if they still throw errors or only SAS ports?

gregnostic · Dec 20, 2017

I just wanted to throw a data point into the mix. I'm not suggesting anyone take any particular path, but perhaps you'll find this information useful.

I recently grew a vdev by swapping out six disks with 10TB Ironwolf NAS disks and as soon as the array had grown, I started experiencing the same problems as everyone else here. Drives would throw errors seemingly at random (though usually after/during maintenance tasks) and get kicked out of the zpool.

After about a week and a half of this and a few incidents where two disks were thrown out of the zpool and put my data at risk, I started weighing my options based on the information I got from these threads.

I backed up my FreeNAS (9.10.2) config, installed Debian on my server, and tested ZFS on Linux. I ran scrub after scrub after scrub to stress the array. After three days of scrubs, not one error or kicked disk. With that amount of time and under that amount of load in FreeNAS, I probably would have had four to six drive incidents. After about a week now on Debian, still no errors.

What this means for me is that I'm unfortunately abandoning FreeNAS. It's not really FreeNAS' fault, but this was the only real option I had that didn't require spending another couple of grand on disks and then, at the best case, getting hit with a restocking fee on the Ironwolf disks. So over to Linux I go. I would have preferred to stick with FreeNAS, but given that I got close to losing my data multiple times, I couldn't wait it out and hope that Seagate came up with an answer.

(Also cross-posting this to other thread for people who aren't reading both.)

kmr99 · Jan 7, 2018

On one of the two servers that I built with Seagate IronWolf 10GB and Enterprise 10GB drives, I've been getting these errors. Interestingly, on one server with an LSI 9300, I didn't get any errors. The other server has been getting errors. Initially, I had 6 drives using the motherboard (X10SDV) SATA ports and ACHI driver. I'd get about 10 FLUSHCACHE48 errors per day spread across all 6 drives. They appeared like:

Code:

Jan  6 09:00:50 ahcich34: Timeout on slot 31 port 0
Jan  6 09:00:50 ahcich34: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd c0 serr 00000000 cmd 0004df17
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): CAM status: Command timeout
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): Retrying command

So, I just put an LSI 9300 in the server hoping that it was an just a problem with the ACHI driver and the motherboard controller. Unfortunately, after switching the the LSI 9300 card today, I've already gotten my first timeout error:

Code:

Jan  7 15:42:50	(da7:mpr0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 523 Aborting command 0xfffffe0000f40fd0
Jan  7 15:42:50  mpr0: Sending reset from mprsas_send_abort for target ID 5
Jan  7 15:42:50   (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00 length 4096 SMID 277 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jan  7 15:42:50  mpr0: Unfreezing devq for target ID 5
Jan  7 15:42:50  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00
Jan  7 15:42:50  (da7:mpr0:0:5:0): CAM status: CCB request completed with an error
Jan  7 15:42:50  (da7:mpr0:0:5:0): Retrying command
Jan  7 15:42:50  (da7:mpr0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jan  7 15:42:50  (da7:mpr0:0:5:0): CAM status: Command timeout
Jan  7 15:42:50  (da7:mpr0:0:5:0): Retrying command
Jan  7 15:42:51  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00
Jan  7 15:42:51  (da7:mpr0:0:5:0): CAM status: SCSI Status Error
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI status: Check Condition
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan  7 15:42:51  (da7:mpr0:0:5:0): Retrying command (per sense data)
Jan  7 15:42:51  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d eb 78 00 00 00 30 00 00
Jan  7 15:42:51  (da7:mpr0:0:5:0): CAM status: SCSI Status Error
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI status: Check Condition
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan  7 15:42:51  (da7:mpr0:0:5:0): Retrying command (per sense data)

From my reading of this thread, as well as the thread https://forums.freenas.org/index.php?threads/synchronize-cache-command-timeout-error.55067, it appears that no solution has been found except switching to other manufacturer's drives or one poster have success switching to Linux.

As my errors occurred with two different controllers/drivers, I'm cross-posting to both the AHCI and LSI threads

Ericloewe · Jan 7, 2018

If two completely different controllers are reporting similar errors, it's clear that either the disk, cables or backplane is to blame.

kmr99 · Jan 8, 2018

Eric, thanks for your thoughts. I was thinking along a similar line. The last components that I have not swapped out are the power supply and the OS. Different cables and bypassing the backplane hasn't worked. I think it's unlikely that 6 of 6 drives would all have the same defect of of timeouts with cache flushes unless it was a firmware issue. At this time, Seagate does not have any firmware updates for these drives.

Ericloewe · Jan 8, 2018

Same PSU? Yeah, that's a big one, too. Though crap firmware on disks has also been known to wreak havoc.

kmr99 · Jan 8, 2018

Same PSU in the two servers? No, they differ.

Identical components of the two servers:
- ESXI 6.5, latest patches
- FreeNAS 11.0, now 11.1 with 32GB RAM for VM and using PCI passthrough
- Currently, LSI 9300-8i HBA (3008 controller)
- Mitxure of Seagate 10GB Enterprise, IronWolf, IronWolf Pro drives (mixed drives to reduce chance of failures due to a bad batch of identical drives)
- Intel X550 dual ethernet to 10Gbps switch
- All drives burned in/tested for 1 week, RAM tested with 24 hours of memtest, CPU and it's temperature tested with 24 hours of mprime torture test

Differing components:
- Server (main) without timeout errors: drive temperatures 25-33C, X10SRA MB with E5-2696, 192 GB ECC RAM, 8x10GB drives in 4x10GB mirror, backplane/cables, EVGA 850W ATX PSU
- Server (backup) with timeout errors: drive temperatures 31-36C, X10SDV MB with D-1541, 128 GB ECC RAM, 6x10GB drives in 3x10GB mirror, backplane/cables, Silverstone SX600-G 600W Gold SFX PSU, initially using MB-based SATA ports with PCI pass-through and ACHI driver (getting timeout errors on all 6 drives)
- VMs (and disk use) patterns

Things I've changed on server with timeouts without success: SATA cables, bypassing backplane, SATA HBA/drivers

When you say that PSU is a "big one", do you mean a frequent cause of problems, perhaps due to signal integrity secondary by high frequency noise or lower frequency voltage changes? While I've heard that can happen, as well as problems with power cables compromised by poor connections (high or varying impedance), in my personal experience with over-rated wattage, top tier PSUs, I've not yet encountered a problem that I've been able to attribute to the power rails. Unfortunately, my case for the backup server can only accommodate 6 drives, or I'd just swap the drives between the two systems to see if the timeout errors stayed with the backup server's MB/case/PSU/power cables or followed the drives. I could try a different PSU in the backup server case. Corsair sells a 600W SFX power supply that I could place in the backup server, I'd be tempted buy another EVGA 850W ATX PSU since that is one of the few differences between the servers. However, the ATX-sized PSU would likely increase drive temperatures due to less air flow from front to back in the fairly small Liam PC-Q25B mini-ITX case.

Ericloewe · Jan 8, 2018

kmr99 said:
When you say that PSU is a "big one", do you mean a frequent cause of problems, perhaps due to signal integrity secondary by high frequency noise or lower frequency voltage changes?

Both are plausible, though I don't think any vaguely functional PSU can output enough EM noise to seriously increase error rates. Conducted? Oh yeah, that'll mess with all sorts of things. And plain ol' low DC voltage can also cause localized issues, typically on peripherals attached with higher-resistance cables sucking lots of current - if it's generalized, either the PSU or the motherboard will probably shut down the whole thing.

kmr99 said:
Corsair sells a 600W SFX power supply that I could place in the backup server,

Yeah, those are good. Not Seasonic X-Series good, but Corsair RM/Seasonic G-Series good.

kmr99 · Jan 8, 2018

Thanks for the power supply recommendations, I wasn't familiar with Seasonics. I ordered a Seasonic Prime Ultra Titanium 750W ATX PS that I'll try to fit in the case as well as a Corsair 600W SFX in case the Seasonic doesn't fit. If a new PSU fixes the problem, I'll report back.

Ericloewe · Jan 8, 2018

If possible, you can also temporarily try the known-good one (though that might leave you with two less-than-functional servers instead of one).

kmr99 · Jan 8, 2018

Ericloewe said:
If possible, you can also temporarily try the known-good one (though that might leave you with two less-than-functional servers instead of one).

Thanks, that's a fine idea. However, it's easier for me to buy some new PSUs to test which I can later use in new builds if the current PSU is not the issue. Eventually, I want to move the FreeNAS systems to bare metal.

Ericloewe · Jan 8, 2018

kmr99 said:
Thanks, that's a fine idea. However, it's easier for me to buy some new PSUs to test which I can later use in new builds if the current PSU is not the issue. Eventually, I want to move the FreeNAS systems to bare metal.

Yeah, I should get a spare, myself. Maybe I'll replace my old Corsair AX850 from my workstation with a shiny new PSU, now that it's out of warranty, and keep it as a backup.

Also, holy crap, it's seven years old by now. Wow.

kmr99 · Jan 8, 2018

Ericloewe said:
Yeah, I should get a spare, myself. Maybe I'll replace my old Corsair AX850 from my workstation with a shiny new PSU, now that it's out of warranty, and keep it as a backup.

Following HDD, PSUs have been the second least reliable component in my systems.

kmr99 · Jan 9, 2018

I did find a difference between the two servers which seems more likely a key difference rather than the PSU: the difference in the 3008 firmware version

One the server without timeout errors:

Code:

root@freenas2:~ # sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

		Adapter Selected is a Avago SAS: SAS3008(C0)

		Controller Number			  : 0
		Controller					 : SAS3008(C0)
		PCI Address					: 00:13:00:00
		SAS Address					: 500605b-0-0b33-e190
		NVDATA Version (Default)	   : 05.00.00.05
		NVDATA Version (Persistent)	: 05.00.00.05
		Firmware Product ID			: 0x2221 (IT)
		Firmware Version			   : 05.00.00.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : SAS9300-8i
		BIOS Version				   : 08.11.00.00
		UEFI BSD Version			   : 06.00.00.00
		FCODE Version				  : N/A
		Board Name					 : SAS9300-8i
		Board Assembly				 : H3-25573-00H
		Board Tracer Number			: SP62305664

On the server with timeout errors:

Code:

root@freenas1:~ # sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

		Adapter Selected is a Avago SAS: SAS3008(C0)

		Controller Number			  : 0
		Controller					 : SAS3008(C0)
		PCI Address					: 00:03:00:00
		SAS Address					: 500605b-0-0150-5b80
		NVDATA Version (Default)	   : 0e.00.00.07
		NVDATA Version (Persistent)	: 0e.00.00.07
		Firmware Product ID			: 0x2221 (IT)
		Firmware Version			   : 14.00.00.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : SAS9300-8i
		BIOS Version				   : 08.11.00.00
		UEFI BSD Version			   : N/A
		FCODE Version				  : N/A
		Board Name					 : SAS9300-8i
		Board Assembly				 : N/A
		Board Tracer Number			: N/A

Perhaps the older version 5.0.0 of the 3008 firmware is more tolerant of the timeouts talking to the Seagate 10GB drives. If my server without errors had the newer firmware, I'd just update the LSI on the server with the errors to the new version. However, in this case the working server has an older version of the firmware.

Does anyone know of any negative issues with reverting an LSI 9300 card from version 14 to version 5?

kmr99 · Jan 10, 2018

There is some serious negative interaction between the Seagate firmware and the FreeBSD LSI driver. Two data points:

1. On the server that was getting timeouts with the X10SDV MB chipset and ACHI driver, I got hundreds of FLUSHCACHE48 timeouts, but never any drives faulted or any errors detected with scrub. With the LSI 3008 and mpr driver, I've been getting multiple drives faulted after the LSI driver gives up on retries.

2. Technically, I can't imagine how this occurred, but I don't think it is coincidence. On the server without any timeouts for 3 months, about 30 minutes after querying the 3008 controller with sas3flash, I started with frequent timeouts on 5 of the 8 drivers on that server so that I shut down the server for fear of losing my zpool (3 out of 8 drives faulted within 5 hours).

At this point, I don't see how I can continue to use FreeBSD/FreeNAS with my Seagate 10GB drives. Unfortunately, I'm not able to replace the 16 Seagate 10GB drives. Replacing other parts has not solved the problem.

This is regrettable, because I really like FreeNAS and FreeBSD (I started using UCB BSD in the early 80's and built my first internet company upon BSD/OS in 1993). On page 2 of this thread, mattlach mentioned his errors resolved after switching to Linux and its driver. The idea of setting everything up again on two Linux servers and losing FreeNAS's good GUI is not appealing. I suppose I'll also revisit the solaris-based choices, though I think it's clear the overall development momentum has been shifting from solaris to Linux.

cobrakiller58 · Jan 10, 2018

have you reached out to Avago or seagate at all?

kmr99 · Jan 10, 2018

cobrakiller58 said:
have you reached out to Avago or seagate at all?

I have not. Krazos, above, reported it to Seagate who replied they could forward his report to their team for review.

brianm · Jan 10, 2018

KA

"There is some serious negative interaction between the Seagate firmware and the FreeBSD LSI driver."

There is certainly some interaction but if I understand your intentions correctly your reaction seems rather illogical. You have two nearly identical systems, correct? One works properly and one does not. You have confirmed a possible problem, the controller. I fail to see why you would carry out major rebuilds on both systems rather than solve the controller problem on the malfunctioning system. Whatever you do is your choice of course but you seem to be blaming FreeBSD for what is actually a hardware/firmware problem which is rather unfair. If one system based on BSD works correctly but the other does not it is not logical to blame BSD.

I personally feel that the controller is a weak link in the FreeNAS concept, it and its associated cables are a fault waiting to happen but until we get a greatly increased number of on-board SATA connectors we are stuck with controllers.

Important Announcement for The TrueNAS Community.

LSI (Avago) 9207-8i with Seagate 10TB Enterprise (ST10000NM0016)

Dabbler

Dabbler

Guru

Dabbler

Dabbler

Server Wrangler

Dabbler

Server Wrangler

Dabbler

Server Wrangler

Dabbler

Server Wrangler

Dabbler

Server Wrangler

Dabbler

Dabbler

Dabbler

Guru

Dabbler

Dabbler

Similar threads

Important Announcement for The TrueNAS Community.