SYNCHRONIZE CACHE command timeout error

Pheran · Jul 12, 2017

I'm not sure how to describe it, I'd have to exercise it again to listen. But now I'm getting somewhere, here's the result of a partial badblocks run on da7 that crashed and burned, though I'm confused by the apparent "0 errors" bit. Going to run another SMART test now.

Code:

1073741820
1073741821
1073741822
Too many bad blocks, aborting test
done
Reading and comparing: Too many bad blocks, aborting test1073741823/0 errors)
done
Testing with pattern 0x55: Too many bad blocks, aborting test1073741823/0 errors)
done
Reading and comparing: Too many bad blocks, aborting test1073741823/0 errors)
done
Testing with pattern 0xff: Too many bad blocks, aborting test1073741823/0 errors)
done
Reading and comparing: Too many bad blocks, aborting test1073741823/0 errors)
done
Testing with pattern 0x00: Too many bad blocks, aborting test1073741823/0 errors)
done
Reading and comparing: Too many bad blocks, aborting test1073741823/0 errors)
done

Pheran · Jul 19, 2017

Still having problems, more drives timing out/dropping out. But I just noticed a weird pattern.

Code:

Jul 5 03:04:11 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 5 03:04:11 vault (da7:mps0:0:7:0): CAM status: SCSI Status Error
...
Jul 7 03:02:28 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 7 03:02:28 vault (da5:mps0:0:5:0): CAM status: SCSI Status Error
...
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 03:02:31 vault (da7:mps0:0:7:0): CAM status: SCSI Status Error
...
Jul 19 03:02:33 vault (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 19 03:02:33 vault (da0:mps0:0:0:0): CAM status: SCSI Status Error

Why does this always happen right after 3 AM??

Pheran · Aug 7, 2017

And once again, I've lost a drive at 3:02 AM. I wish I knew what the hell is going on.

Code:

Aug  7 03:02:27 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 342 command timeout cm 0xfffffe00009d40e0 ccb 0xfffff8017d0e5000
Aug  7 03:02:27 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009d40e0
Aug  7 03:02:27 vault mps0: Sending reset from mpssas_send_abort for target ID 5
Aug  7 03:02:27 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 723 command timeout cm 0xfffffe00009f34f0 ccb 0xfffff803c548a000
Aug  7 03:02:27 vault  (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 7d 6d 7f 68 00 00 00 08 00 00 length 4096 SMID 392 terminated ioc 804b scsi 0 state c xfer 0
Aug  7 03:02:27 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 723 terminated ioc 804b scsi 0 stat(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 7d 6d 7f 68 00 00 00 08 00 00
Aug  7 03:02:27 vault e c xfer 0
Aug  7 03:02:27 vault (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Aug  7 03:02:27 vault mps0: (da5:Unfreezing devq for target ID 5
Aug  7 03:02:27 vault mps0:0:5:0): Retrying command
Aug  7 03:02:27 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug  7 03:02:27 vault (da5:mps0:0:5:0): CAM status: Command timeout
Aug  7 03:02:27 vault (da5:mps0:0:5:0): Retrying command
Aug  7 03:02:27 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug  7 03:02:27 vault (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Aug  7 03:02:27 vault (da5:mps0:0:5:0): Retrying command
Aug  7 03:02:28 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug  7 03:02:28 vault (da5:mps0:0:5:0): CAM status: SCSI Status Error
Aug  7 03:02:28 vault (da5:mps0:0:5:0): SCSI status: Check Condition
Aug  7 03:02:28 vault (da5:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug  7 03:02:28 vault (da5:mps0:0:5:0): Error 6, Retries exhausted
Aug  7 03:02:28 vault (da5:mps0:0:5:0): Invalidating pack

Pheran · Aug 7, 2017

I've just realized that the "periodic daily" scripts run at 3:01 AM. I'm now incredibly suspicious that something they are doing is triggering the drive timeouts, but I don't know what yet.

nifoc · Aug 7, 2017

Just to add my 2ct here again: Ever since I replaced one of my two IronWolf drives, I did not have any issues.

Sorry that I can't add any more insights.

Pheran · Aug 14, 2017

The saga of drive timeout hell continues. I did RMA one of the drives that I believe is actually bad, waiting for that to come back. But I cannot believe that all of these drives are bad, and bad drives don't always time out at 3:02 AM either. What I've done for now is to modify every script in /etc/periodic/daily to include a "logger" command that will syslog when its run starts. I'm hoping that this will help identify what script may be causing dropouts when the next one happens. Here's the last dropout just for posterity.

Code:

Aug 13 03:02:30 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 275 command timeout cm 0xfffffe00009ce8f0 ccb 0xfffff80584ceb800
Aug 13 03:02:30 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009ce8f0
Aug 13 03:02:30 vault mps0: Sending reset from mpssas_send_abort for target ID 2
Aug 13 03:02:30 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 501 command timeout cm 0xfffffe00009e1190 ccb 0xfffff80584d19800
Aug 13 03:02:30 vault  (da2:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 80 61 62 90 00 00 00 08 00 00 length 4096 SMID 441 terminated ioc 804b scsi 0 state c xfer 0
Aug 13 03:02:30 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 501 terminated ioc 804b scsi 0 stat(da2:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 80 61 62 90 00 00 00 08 00 00
Aug 13 03:02:30 vault e c xfer 0
Aug 13 03:02:30 vault (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Aug 13 03:02:30 vault mps0: (da2:Unfreezing devq for target ID 2
Aug 13 03:02:30 vault mps0:0:2:0): Retrying command
Aug 13 03:02:30 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 03:02:30 vault (da2:mps0:0:2:0): CAM status: Command timeout
Aug 13 03:02:30 vault (da2:mps0:0:2:0): Retrying command
Aug 13 03:02:30 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 03:02:30 vault (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Aug 13 03:02:30 vault (da2:mps0:0:2:0): Retrying command
Aug 13 03:02:30 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 03:02:30 vault (da2:mps0:0:2:0): CAM status: SCSI Status Error
Aug 13 03:02:30 vault (da2:mps0:0:2:0): SCSI status: Check Condition
Aug 13 03:02:30 vault (da2:mps0:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 13 03:02:30 vault (da2:mps0:0:2:0): Error 6, Retries exhausted
Aug 13 03:02:30 vault (da2:mps0:0:2:0): Invalidating pack

Pheran · Aug 16, 2017

Well, I had another timeout, but FreeNAS helpfully overwrote my changes to the periodic daily scripts, so I didn't get any additional intel. This time I modified both /etc/periodic/daily and /conf/base/etc/periodic/daily.

Code:

Aug 16 03:02:26 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 951 command timeout cm 0xfffffe0000a06030 ccb 0xfffff80111bbf000
Aug 16 03:02:26 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000a06030
Aug 16 03:02:26 vault mps0: Sending reset from mpssas_send_abort for target ID 2
Aug 16 03:02:26 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 458 command timeout cm 0xfffffe00009dd920 ccb 0xfffff80793ed2800
Aug 16 03:02:26 vault  (da2:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 84 a5 87 58 00 00 00 08 00 00 length 4096 SMID 607 terminated ioc 804b scsi 0 state c xfer 0
Aug 16 03:02:26 vault  (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 458 terminated ioc 804b scsi 0 stat(da2:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 84 a5 87 58 00 00 00 08 00 00
Aug 16 03:02:26 vault e c xfer 0
Aug 16 03:02:26 vault (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Aug 16 03:02:26 vault mps0: (da2:Unfreezing devq for target ID 2
Aug 16 03:02:26 vault mps0:0:2:0): Retrying command
Aug 16 03:02:26 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 16 03:02:26 vault (da2:mps0:0:2:0): CAM status: Command timeout
Aug 16 03:02:26 vault (da2:mps0:0:2:0): Retrying command
Aug 16 03:02:26 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 16 03:02:26 vault (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Aug 16 03:02:26 vault (da2:mps0:0:2:0): Retrying command
Aug 16 03:02:26 vault (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 16 03:02:26 vault (da2:mps0:0:2:0): CAM status: SCSI Status Error
Aug 16 03:02:26 vault (da2:mps0:0:2:0): SCSI status: Check Condition
Aug 16 03:02:26 vault (da2:mps0:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 16 03:02:26 vault (da2:mps0:0:2:0): Error 6, Retries exhausted
Aug 16 03:02:26 vault (da2:mps0:0:2:0): Invalidating pack

Pheran · Aug 18, 2017

Well damn, that didn't work as I intended, so I still don't know what's causing this. I won't be able to troubleshoot this for a bit. Once the Seagate RMA drive comes back, I'll get the pool fully stable again and then probably upgrade to FreeNAS 11 to see if this behavior persists.

Code:

Aug 18 03:01:00 vault root: 100.clean-disks
Aug 18 03:01:00 vault root: 110.clean-tmps
Aug 18 03:01:00 vault root: 120.clean-preserve
Aug 18 03:01:00 vault root: 130.clean-msgs
Aug 18 03:01:00 vault root: 200.backup-passwd
Aug 18 03:01:00 vault root: 210.backup-aliases
Aug 18 03:01:00 vault root: 310.accounting
Aug 18 03:01:00 vault root: 330.news
Aug 18 03:01:00 vault root: 400.status-disks
Aug 18 03:01:00 vault root: 401.status-graid
Aug 18 03:01:00 vault root: 404.status-zfs
Aug 18 03:01:00 vault root: 406.status-gmirror
Aug 18 03:01:00 vault root: 407.status-graid3
Aug 18 03:01:00 vault root: 408.status-gstripe
Aug 18 03:01:00 vault root: 409.status-gconcat
Aug 18 03:01:00 vault root: 420.status-network
Aug 18 03:01:00 vault root: 450.status-security
Aug 18 03:01:00 vault root: 480.leapfile-ntpd
Aug 18 03:01:00 vault root: 480.status-ntpd
Aug 18 03:01:00 vault root: 510.status-world-kernel
Aug 18 03:01:00 vault root: 800.scrub-zfs
Aug 18 03:01:00 vault root: 999.local
Aug 18 03:02:32 vault  (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 822 command timeout cm 0xfffffe00009fb6e0 ccb 0xfffff80154408800
Aug 18 03:02:32 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009fb6e0
Aug 18 03:02:32 vault mps0: Sending reset from mpssas_send_abort for target ID 1
Aug 18 03:02:32 vault  (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 394 command timeout cm 0xfffffe00009d8520 ccb 0xfffff80424349800
Aug 18 03:02:32 vault  (da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 01 80 94 3d c8 00 00 00 08 00 00 length 4096 SMID 384 terminated ioc 804b scsi 0 state c xfer 0
Aug 18 03:02:32 vault  (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 394 terminated ioc 804b scsi 0 stat(da1:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 01 80 94 3d c8 00 00 00 08 00 00
Aug 18 03:02:32 vault e c xfer 0
Aug 18 03:02:32 vault (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Aug 18 03:02:32 vault mps0: (da1:Unfreezing devq for target ID 1
Aug 18 03:02:32 vault mps0:0:1:0): Retrying command
Aug 18 03:02:32 vault (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 18 03:02:32 vault (da1:mps0:0:1:0): CAM status: Command timeout
Aug 18 03:02:32 vault (da1:mps0:0:1:0): Retrying command
Aug 18 03:02:32 vault (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 18 03:02:32 vault (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Aug 18 03:02:32 vault (da1:mps0:0:1:0): Retrying command
Aug 18 03:02:32 vault (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 18 03:02:32 vault (da1:mps0:0:1:0): CAM status: SCSI Status Error
Aug 18 03:02:32 vault (da1:mps0:0:1:0): SCSI status: Check Condition
Aug 18 03:02:32 vault (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 18 03:02:32 vault (da1:mps0:0:1:0): Error 6, Retries exhausted
Aug 18 03:02:32 vault (da1:mps0:0:1:0): Invalidating pack

bremnes · Aug 21, 2017

I did upgrade my FreeNas build for 3 weeks ago, and been a nightmare then i found this thread with all the same symptoms.
Before i had Seagate 4TB drives, i upgraded to 10 TB IronWolf from Seagate.
My build is Asus serverboard with 64GB ECC ram, xeon e5 2630L, Dell 9220 8i, with 20.00.07 firmware.

I got every day message like desribed in this post, i tried changing cables 2 times, changed SAS cards to IBM 1015 with 20.00.04 firmware and then upgraded to 07.
I did scrub 5 times last 3 weeks that corrected the errors, it takes never more than couple of days before its degraded again. Most scrub did fix was 84GB, (After a large copy job with several TB), normally it fix some MB's.
I only got the messages from 3 different drives, i did use SeaTools with long test for all of my drives before i started using them in FreeNAS..

After 0 errors for almost 20 hours i got an error tonight and zpool status tells me:

Code:

tid/c65f5722-7879-11e7-aee6-000c29b1b5ee  FAULTED	  6   112	 0  too many errors

The 3 drives that gave message with SYNCHRONIZE CACHE and CAM status: SCSI Status Error has 2 or 4 as value for 199 UDMA_CRC_Error_Count
After reading this thread and my own experience i conclude that the problem is the drives, i promised my self many times stop buying Seagate, i never learn....

If anyone has a clue i'm listening...

Pheran · Aug 21, 2017

bremnes said:
I did upgrade my FreeNas build for 3 weeks ago, and been a nightmare then i found this thread with all the same symptoms.
Before i had Seagate 4TB drives, i upgraded to 10 TB IronWolf from Seagate.
My build is Asus serverboard with 64GB ECC ram, xeon e5 2630L, Dell 9220 8i, with 20.00.07 firmware.

I'm sorry that you also have this problem, though I guess I'm glad I'm not completely alone. I'm very curious, do your timeouts also occur at 3:02 AM?

Thank you for posting, you're giving me even more motivation to find a way to solve this since I'll be helping more than myself.

bremnes · Aug 21, 2017

I can't find any pattern, it's different times, it does not happend when the NAS is "idle", as idle i mean no writing or Reading from CIFS or other protocols. It is always the active periods, and the activity is very modest since it not in production yet.
Maybe you have backup-job, SMART schedule or something at that time?

Code:

Aug 21 22:17:41 NAS1 (da7:mps1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 21 22:17:41 NAS1 (da7:mps1:0:10:0): CAM status: SCSI Status Error

Code:

Aug 20 15:24:01 NAS1 (da9:mps1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 20 15:24:01 NAS1 (da9:mps1:0:9:0): CAM status: Command timeout

I forgot to mention i'm running FreeNAS-11.0-U2 and 9 drives in raidz2.

vincent99 · Sep 4, 2017

I've seen the same thing with 5 IronWolf 10TB drives. Guess I'll return them and try a different brand.

bremnes · Sep 4, 2017

I wonder if anyone got this problem run FreeNAS virtualized?

I still have the problem, and it looks like more and more drives are affected. Last night i moved all the drives to another Box, no parts or cables are reused.
For the next days i will watch and see what happens, until now i got 1 error message earlier today, that is different from the sync error.

Code:

Sep  4 15:00:01 NAS1 ahcich0: Timeout on slot 24 port 0
Sep  4 15:00:01 NAS1 ahcich0: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd c0 serr 00000000 cmd 0004d817
Sep  4 15:00:01 NAS1 (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Sep  4 15:00:01 NAS1 (ada0:ahcich0:0:0:0): CAM status: Command timeout
Sep  4 15:00:01 NAS1 (ada0:ahcich0:0:0:0): Retrying command

vincent99 · Sep 4, 2017

bremnes said:
I wonder if anyone got this problem run FreeNAS virtualized?

Mine is virtualized; SuperMicro 5018d-fn4t with Xeon-D, 64gb ECC RAM. ESXi 6.5 with a LSI 9200-8E passed-through, connected to a Lenovo SA120 expander and 5 10TB IronWolfs. FreeNAS 11.0-U2.

Pretty sure each disk has individually faulted or disappeared entirely at some point in a few weeks. Everything has the newest firmware/bios and swapping SAS cables, drive bays, etc has had no effect.

I'm going to try 6*8TB HGST drives instead and see if that does any better.

bremnes · Sep 6, 2017

I believe the problem has something to do with esxi, i tried to move the hard drives another server, no problem with sync error.
Then i tried for 2 days running on the same box as esxi, running directly without hypervisor. no problems yet.
I going to try to update to esx 6.5 u1, if that does not work then i og to latest esxi 6.0.
I also read several people having trouble with CAM Messages with esxi 6.5, if they go back to 6.0 everything works fine.

vincent99 · Sep 6, 2017

FWIW I apparently have U1 already since it was just setup.. 6.5.0 Update 1 (Build 5969303)

vincent99 · Sep 10, 2017

@bremnes I haven't hit one of these in a couple days now since a) updating to FreeNAS 11.0-U3 and b) adding a loader tunable for hint.ahci.0.msi=0 . Possibly just a run of not-bad luck, but may be worth you trying.

bremnes · Sep 10, 2017

4 days ago i did go back to original setup with FreeNAS virtualized, then i upgraded to esxi 6.5 u1 (late juli), then upgraded to FreeNas 11 u3. And i also reset the bios settings to factory default. All 3 things at the same time.
No more errors, 4 days is a record so i consider it solved.

Pheran · Sep 11, 2017

This is not a virtualization problem, or at least mine isn't. Pure physical hardware. I lost another drive the other night. Of course, the drive is actually fine, if you reboot it comes right back. I do a scrub afterwards just to be sure everything is OK.

Pheran · Sep 11, 2017

vincent99 said:
@bremnes I haven't hit one of these in a couple days now since a) updating to FreeNAS 11.0-U3 and b) adding a loader tunable for hint.ahci.0.msi=0 . Possibly just a run of not-bad luck, but may be worth you trying.

I'm curious, what was your motivation for trying this tunable setting? Let us know if it seems to solve the problem.

Important Announcement for the TrueNAS Community.

SYNCHRONIZE CACHE command timeout error

Patron

Patron

Patron

Patron

Dabbler

Patron

Patron

Patron

Cadet

Patron

Cadet

Cadet

Cadet

Cadet

Cadet

Cadet

Cadet

Cadet

Patron

Patron

Similar threads