SYNCHRONIZE CACHE command timeout error

miip

Dabbler
Joined
Oct 7, 2017
Messages
15
Did you guys have any luck resolving this? We have the same issue as in this thread with Seagate 10TB Enterprise (ST10000NM0016) drives. It seems to be exactly the same problem on LSI controllers.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
Did you guys have any luck resolving this? We have the same issue as in this thread with Seagate 10TB Enterprise (ST10000NM0016) drives. It seems to be exactly the same problem on LSI controllers.

Unfortunately, no, I'm still living with the problem, and it sucks. I see from your thread that going up to an LSI 3008 solved your problem; is that still true? Because if so, I may upgrade the system to a Skylake motherboard with a 3008 controller and use the old hardware with the 4TB drives I still have.
 

miip

Dabbler
Joined
Oct 7, 2017
Messages
15
Unfortunately, no, I'm still living with the problem, and it sucks. I see from your thread that going up to an LSI 3008 solved your problem; is that still true? Because if so, I may upgrade the system to a Skylake motherboard with a 3008 controller and use the old hardware with the 4TB drives I still have.

No, the 3008 didn't solve the problem. I just sometimes get less serious errors that don't affect the zpool, but the ones that do affect the zpool still happen.

A Linux user in the other thread mentioned that he is also using an LSI controller with these drives but is not experiencing these issues, so there might be a bug in the FreeBSD drivers for the 2008 and 3008 controller series.
 

vincent99

Cadet
Joined
Sep 4, 2017
Messages
5
Sorry, I don't remember where I got that tunable, but it ended up not helping anyway.

I gave up and switched to Western Digital drives and haven't had any problems since, with everything else about the config staying the same.
 

gregnostic

Dabbler
Joined
May 19, 2016
Messages
17
I just wanted to throw a data point into the mix. I'm not suggesting anyone take any particular path, but perhaps you'll find this information useful.

I recently grew a vdev by swapping out six disks for 10TB Ironwolf NAS disks, and as soon as the array had grown, I started experiencing the same problems as everyone else here. Drives would throw errors seemingly at random (though usually during or after maintenance tasks) and get kicked out of the zpool.

After about a week and a half of this and a few incidents where two disks were thrown out of the zpool and put my data at risk, I started weighing my options based on the information I got from these threads.

I backed up my FreeNAS (9.10.2) config, installed Debian on my server, and tested ZFS on Linux. I ran scrub after scrub after scrub to stress the array. After three days of scrubs, not one error or kicked disk. With that amount of time and under that amount of load in FreeNAS, I probably would have had four to six drive incidents. After about a week now on Debian, still no errors.
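
If anyone wants to run a similar back-to-back scrub stress test, a shell loop along these lines should do it (the pool name "tank" is a placeholder, and the 5-minute poll interval is arbitrary):
Code:
#!/bin/sh
# Back-to-back scrub loop to stress the pool; stop it with Ctrl-C.
# "tank" is a placeholder pool name -- substitute your own.
while :; do
    zpool scrub tank
    # Wait for the current scrub to finish before starting the next one
    while zpool status tank | grep -q 'scrub in progress'; do
        sleep 300
    done
done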

What this means for me is that I'm unfortunately abandoning FreeNAS. It's not really FreeNAS' fault, but this was the only real option I had that didn't require spending another couple of grand on disks and then, in the best case, getting hit with a restocking fee on the Ironwolf disks. So over to Linux I go. I would have preferred to stick with FreeNAS, but given that I came close to losing my data multiple times, I couldn't wait it out and hope that Seagate came up with an answer.

(Also cross-posting this to the other thread for people who aren't reading both.)
 

kmr99

Dabbler
Joined
Dec 16, 2017
Messages
18
On one of the two servers that I built with Seagate IronWolf 10TB and Enterprise 10TB drives, I've been getting these errors. Interestingly, the server with an LSI 9300 didn't get any errors; the other server has been getting them. Initially, the problem server had 6 drives using the motherboard (X10SDV) SATA ports and the AHCI driver. I'd get about 10 FLUSHCACHE48 errors per day spread across all 6 drives. They appeared like:
Code:
Jan  6 09:00:50 ahcich34: Timeout on slot 31 port 0
Jan  6 09:00:50 ahcich34: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd c0 serr 00000000 cmd 0004df17
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): CAM status: Command timeout
Jan  6 09:00:50  (ada4:ahcich34:0:0:0): Retrying command
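
In case it helps anyone compare, a quick way to tally those timeouts per drive is to grep the system log (this assumes the default /var/log/messages location and the message format shown above):
Code:
# Count FLUSHCACHE48 timeout events per ada device
grep FLUSHCACHE48 /var/log/messages | grep -Eo 'ada[0-9]+' | sort | uniq -c | sort -rn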

So I just put an LSI 9300 in the server, hoping it was just a problem with the AHCI driver and the motherboard controller. Unfortunately, after switching to the LSI 9300 card today, I've already gotten my first timeout error:
Code:
Jan  7 15:42:50	(da7:mpr0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 523 Aborting command 0xfffffe0000f40fd0
Jan  7 15:42:50  mpr0: Sending reset from mprsas_send_abort for target ID 5
Jan  7 15:42:50   (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00 length 4096 SMID 277 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jan  7 15:42:50  mpr0: Unfreezing devq for target ID 5
Jan  7 15:42:50  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00
Jan  7 15:42:50  (da7:mpr0:0:5:0): CAM status: CCB request completed with an error
Jan  7 15:42:50  (da7:mpr0:0:5:0): Retrying command
Jan  7 15:42:50  (da7:mpr0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jan  7 15:42:50  (da7:mpr0:0:5:0): CAM status: Command timeout
Jan  7 15:42:50  (da7:mpr0:0:5:0): Retrying command
Jan  7 15:42:51  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d dd a8 00 00 00 08 00 00
Jan  7 15:42:51  (da7:mpr0:0:5:0): CAM status: SCSI Status Error
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI status: Check Condition
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan  7 15:42:51  (da7:mpr0:0:5:0): Retrying command (per sense data)
Jan  7 15:42:51  (da7:mpr0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 02 7d eb 78 00 00 00 30 00 00
Jan  7 15:42:51  (da7:mpr0:0:5:0): CAM status: SCSI Status Error
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI status: Check Condition
Jan  7 15:42:51  (da7:mpr0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan  7 15:42:51  (da7:mpr0:0:5:0): Retrying command (per sense data)

From my reading of this thread, as well as the thread https://forums.freenas.org/index.ph...h-seagate-10tb-enterprise-st10000nm0016.58251, it appears that no solution has been found except switching to another manufacturer's drives or, in one poster's case, switching to Linux.

As my errors occurred with two different controllers/drivers, I'm cross-posting to both the AHCI and LSI threads.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
Thanks for your update, kmr. I'm still fighting this as well. I've replaced one of my 8 Seagate drives with an HGST, but that still leaves 7 drives that can randomly time out.

For the moment, I definitely cannot recommend any use of 10TB Ironwolf drives with FreeNAS.
 

scastano

Cadet
Joined
Mar 24, 2018
Messages
1
Just adding to this thread and subscribing to updates to see if anything changes...

That being said, I'm having the exact same problem, which cropped up recently when I was trying to do some heavier copy operations from one zpool to another. The same IronWolf 10TB drives are throwing the SYNCHRONIZE CACHE error and getting kicked out of the pool. I have a 12-drive pool made up of two six-drive RAIDZ2 vdevs, and the errors have almost cost me my data twice in the last 24 hours.

Similar situation/experience as other users... at odd times the sync cache error comes up during writes and frequently takes out multiple drives at once, often 2 at a time. I've even seen up to 4 at the same time, which by some AMAZING luck happened to be 2 drives in each of my two vdevs, so I didn't lose any data.
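
For anyone in the same boat, a crude pool-health check from cron at least gives some warning before a second or third drive drops. A minimal sketch (the zpool-watch tag is just a made-up label; it assumes a cron entry every few minutes):
Code:
#!/bin/sh
# Crude pool-health check for cron: log a warning whenever
# 'zpool status -x' reports anything other than healthy pools.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    logger -t zpool-watch "zpool reports a problem - run 'zpool status' for details"
fi

Something like */5 * * * * /root/zpool-watch.sh in root's crontab is enough to get it into the logs every five minutes.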

Obviously this isn't something I can have happening on a regular basis. I've seen posts in multiple threads here, on the Nexenta forums, and on Reddit all confirming that FreeNAS and these 10TB Seagate IronWolf drives (ST10000VN0004) are not playing nice together, and a few reporting that the drivers in the Debian-based Linux family are a lot more stable. Since I already run a ton of Ubuntu 16.04 servers with ZFS on them and am pretty comfortable scripting backups, snapshots, etc., I'm going to rebuild on another server and give that a try.

I'm going to cross-post this to other threads as well as Reddit and hopefully it works out... but I'd like to keep monitoring other users' progress and maybe switch back to FreeNAS if a resolution comes up... and if Ubuntu doesn't resolve the issue, I may have to look at slowly replacing these drives with another brand. I've been a FreeNAS user for a long time and love the GUI, tools, etc., but I just can't afford to replace these drives or lose any data right now, so I have to try other options.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
I upgraded to 11.1U6 and the problem persists, though this was expected based on other reports in the thread. I'm going to be swapping these drives into completely new hardware soon, so we'll see if it follows them.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
Well, shit. I moved the drives into a brand new system (still FreeNAS 11.1U6) with a Supermicro X11SSL-CF that has an LSI 3008 controller, and they still time out the same as before. I really hate these damn Ironwolfs.

Code:
Oct  6 01:38:16 vault   (da6:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 932 Aborting command 0xfffffe0001088bc0
Oct  6 01:38:16 vault mpr0: Sending reset from mprsas_send_abort for target ID 6
Oct  6 01:38:16 vault   (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 41 e8 00 00 00 28 00 00 length 20480 SMID 269 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Oct  6 01:38:16 vault   (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 6d d8 00 00 00 08 00 00 length 4096 SMID 927 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Oct  6 01:38:16 vault mpr0: Unfreezing devq for target ID 6
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 41 e8 00 00 00 28 00 00
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): CAM status: CCB request completed with an error
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): Retrying command
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 6d d8 00 00 00 08 00 00
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): CAM status: CCB request completed with an error
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): Retrying command
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): CAM status: Command timeout
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): Retrying command
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 6d d8 00 00 00 08 00 00
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): CAM status: SCSI Status Error
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): SCSI status: Check Condition
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct  6 01:38:16 vault (da6:mpr0:0:6:0): Retrying command (per sense data)
Oct  6 01:38:17 vault   (da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 6e f0 00 00 00 18 00 00 length 12288 SMID 308 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
Oct  6 01:38:17 vault   (da6:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 750 terminated ioc 804b loginfo 311(da6:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 e9 d0 6e f0 00 00 00 18 00 00
Oct  6 01:38:17 vault 10e03 scsi 0 state c xfer 0
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): CAM status: CCB request completed with an error
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): Retrying command
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): CAM status: CCB request completed with an error
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): Retrying command
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): CAM status: SCSI Status Error
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): SCSI status: Check Condition
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): Error 6, Retries exhausted
Oct  6 01:38:17 vault (da6:mpr0:0:6:0): Invalidating pack
 

RFandIT

Cadet
Joined
Mar 30, 2015
Messages
2
I've had success (so far) with 11.2 STABLE and four Ironwolf 10TB drives: model ST10000VN0004-1ZD101, firmware SC60, with serials in the range of ZA209xxx, ZA20Rxxx, and ZA20Sxxx. After three days they're working well.
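
If anyone wants to compare against their own drives, smartctl reports the model, firmware, and serial directly; something like this (device names are just examples, adjust for your own daX/adaX numbering, and add -d sat if your HBA needs it):
Code:
# Print model, firmware revision, and serial for each drive (example device list)
for d in /dev/da0 /dev/da1 /dev/da2 /dev/da3; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Device Model|Firmware Version|Serial Number'
done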

I had also tried different cards and ways of connecting them and never had any luck until this FreeNAS build. So fingers crossed...
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
I've had success (so far) with 11.2 STABLE and four Ironwolf 10TB drives: model ST10000VN0004-1ZD101, firmware SC60, with serials in the range of ZA209xxx, ZA20Rxxx, and ZA20Sxxx. After three days they're working well.

I had also tried different cards and ways of connecting them and never had any luck until this FreeNAS build. So fingers crossed...

Three days isn't enough time to confirm anything, but I wish you luck. What controller are they on?
 

RFandIT

Cadet
Joined
Mar 30, 2015
Messages
2
Three days isn't enough time to confirm anything, but I wish you luck. What controller are they on?
Given the amount of problems I had with them, three days was something I could never achieve before. I could barely go 24-48 hours without a drive dropping out.

I'm currently running the typical flashed IBM card (SAS2008). dmesg shows: Firmware: 20.00.04.00, Driver: 21.02.00.00-fbsd. It's plugged into a Supermicro A1SRi and fans out to a Rosewill 4-bay drive cage. With 10.3-STABLE and 11.1-STABLE this setup was incredibly unreliable and would frequently cause the OS to reboot spontaneously. I'm also running bare metal.
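
For anyone wanting to check the same thing on their own box, those firmware/driver lines can be pulled back out of dmesg with something like this (mps is the SAS2008-family driver, mpr the SAS3008 family; the exact wording of the line can vary by driver version):
Code:
# Show the HBA firmware and driver versions reported at boot
dmesg | grep -E 'mp[sr][0-9].*(Firmware|Driver)'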

I'm going out of town for three weeks so it will have time to bake and then I'll switch to a SAS3008 (9300-8E) card I've got in my daily driver.
 
Joined
Jan 18, 2017
Messages
524
I sure hope it does. I just picked up a couple more 8TB drives; I would have liked to go larger, but I went with what I had confirmed works properly.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
I've been running 11.2U1 for nearly a month now, and I can say that the problem is vastly improved, but might not be solved. In all this time I've only had 2 failures, one right after the upgrade, and one this morning. I think this was da1 both times, so it's possible that there's legitimately something wrong with it, although it passes SMART short and long tests without issue. I'm going to keep running and see if there are additional failures, and whether or not they are confined to da1. Here's the log from this morning.

Code:
Feb  9 06:03:37 vault   (da1:mpr0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 653 Aborting command 0xfffffe0001089ab0
Feb  9 06:03:37 vault mpr0: Sending reset from mprsas_send_abort for target ID 1
Feb  9 06:03:37 vault smartd[2979]: Device: /dev/da1 [SAT], failed to read SMART Attribute Data
Feb  9 06:03:37 vault   (da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 fb 15 2f 28 00 00 00 98 00 00 length 77824 SMID 233 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Feb  9 06:03:37 vault   (da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 fb 15 2e 28 00 00 01 00 00 00 length 131072 SMID 921 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Feb  9 06:03:37 vault   (pass1:mpr0:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 496 te(da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 fb 15 2f 28 00 00 00 98 00 00
Feb  9 06:03:37 vault rminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Feb  9 06:03:37 vault   (da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 f2 5e 3f c8 00 00 00 08 00 00 length 4096 SMID 1040 terminated ioc 804b l(da1:mpr0:0:1:0): CAM status: CCB request completed with an error
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): Retrying command
Feb  9 06:03:37 vault oginfo 31130000 scsi 0 state c xfer 0
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 fb 15 2e 28 00 00 01 00 00 00
Feb  9 06:03:37 vault mpr0: (da1:mpr0:0:1:0): CAM status: CCB request completed with an error
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): Retrying command
Feb  9 06:03:37 vault Unfreezing devq for target ID 1
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 03 f2 5e 3f c8 00 00 00 08 00 00
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): CAM status: CCB request completed with an error
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): Retrying command
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): CAM status: Command timeout
Feb  9 06:03:37 vault (da1:mpr0:0:1:0): Retrying command
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): CAM status: SCSI Status Error
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): SCSI status: Check Condition
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): Error 6, Retries exhausted
Feb  9 06:03:38 vault (da1:mpr0:0:1:0): Invalidating pack
Feb  9 06:03:38 vault ZFS: vdev state changed, pool_guid=8222513563341353211 vdev_guid=5625070549600050549
Feb  9 06:03:38 vault ZFS: vdev state changed, pool_guid=8222513563341353211 vdev_guid=5625070549600050549
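
To check whether the timeouts really are confined to da1, a per-drive tally over the log works; this assumes, as earlier in the thread, that everything lands in /var/log/messages:
Code:
# Count command-timeout events per da device in the system log
grep 'CAM status: Command timeout' /var/log/messages | grep -Eo 'da[0-9]+' | sort | uniq -c | sort -rn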
 