
SYNCHRONIZE CACHE command timeout error

cobrakiller58

Senior Member
Joined
Jan 18, 2017
Messages
378
Email their tech support. With over 600 reallocated sectors I have no doubt they will honor the warranty, but asking them is the best way to be sure. I honestly can't remember the last time I used their diag tool for warranty purposes, and I always pay for the advanced replacement service.
 

DGenerateKane

Member
Joined
Sep 4, 2014
Messages
90
Looks like I have to pull all the drives and use another PC; it doesn't look like the utilities can detect any of the drives connected to the backplane. PITA.
 

Pheran

Senior Member
Joined
Jul 14, 2015
Messages
274
Yikes, I guess I'd better do something about da1, since I just checked to see if anything had changed and got this. I wonder how many spare sectors these drives have?

Code:
# for i in {0..7}; do echo -n "da$i "; smartctl -a /dev/da$i | grep 'Reallocated_Sector_Ct' | awk '{ print $10 }'; done
da0 0
da1 13104
da2 16
da3 64
da4 0
da5 0
da6 0
da7 32
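
To pick out the drives worth chasing from a list like the one above, the counts can be filtered against a cutoff. This is only a sketch: `flag_reallocated` is a hypothetical helper name, and the cutoff you pass in is your own judgment call, not a vendor limit.

```shell
# flag_reallocated THRESHOLD
# Reads "device count" pairs on stdin (the format printed by the loop above)
# and prints only the devices whose count exceeds THRESHOLD.
# THRESHOLD is an illustrative assumption, not a number from the drive vendor.
flag_reallocated() {
    awk -v limit="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $1, $2 }'
}
```

Pipe the output of the loop above into it, for example `... | flag_reallocated 50`, to list only the drives above 50 reallocated sectors.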
 

cobrakiller58

Senior Member
Joined
Jan 18, 2017
Messages
378
13k and still operating; it will probably fail out right before it runs out of sectors. Is it still passing SMART tests? I'd be getting RMAs going on da1 and da3, and one month before the warranty expires I'd do da2 and da7, assuming they don't start rapidly increasing their reallocated sectors.
 

kyherv

Newbie
Joined
Jan 6, 2020
Messages
1
@mloiterman Have you tried connecting any of your disks to the SATA connectors on that MB? That's one thing I haven't tried yet as mine isn't that unstable. I have a drive dropping out every month or so on average and I just re-join it to the vdev.
This did it for me!

In case this helps anyone else, my situation:
  • Drives: Seagate Barracuda Pro 10TB drives (ST10000DM0004) (6 total)
  • SAS: LSI 2008 (Dell PERC H310 flashed to IT mode with the latest LSI firmware - 20.00.07.00)
  • Issue: Same as OP, drop-offs about every day, usually timed to a routine task (cloud sync in my case)
The NCQ queuing script did not work for me, nor are there firmware updates for my model of Seagate 10TB drives. I tried different cables and PSUs to no avail.

What did end up working was connecting the drives to the SATA ports on my motherboard instead. In my config, I have 3x3TB/3x4TB (mix of WD and Seagate) and the 6x10TB Barracuda Pros. I was able to switch it around so the 10TB drives are connected to the MB and the 3TB and 4TB drives are the ones connected through the SAS controller. Fingers crossed, but for the last week I've had no issues on either set of drives. I therefore suspect the issue is between large Seagate drives and LSI SAS controllers.
 

Gcon

Junior Member
Joined
Aug 1, 2015
Messages
14
A quick heads-up for those following this thread. Chris Mellor from Blocks & Files just published a piece about Western Digital and possibly Seagate shipping Drive-Managed Shingled Magnetic Recording (DM-SMR) drives which hide this fact when queried. Apparently this tech can be detrimental to certain high-load NAS operations. Anyway, have a read.

I'm suffering big issues of the type reported in this thread (SYNCHRONIZE CACHE, command timeouts, dropped disks from zpool), with all 6 of my new Seagate SATA 8TB ST8000VN004-2M2101 drives in a 6-disk RAID-Z2 array on an LSI 9207-8i (SAS2308 controller) in IT mode with the latest 20.00.07.00 firmware, in FreeNAS 11.3U2 (and before that U1 and also 11.2).

Came here looking for answers and it's been very informative. I'll aim to do a write up of my experience in another post.

Edit: 19th April 2020: Ars Technica have picked up on the SMR/CMR issue as well:
 

Gcon

Junior Member
Joined
Aug 1, 2015
Messages
14
OK here's my experience.

Problem: ZFS array reporting DEGRADED or UNAVAILABLE (!!) due to hard drives going into a FAULTED state. Lots of drive-related gibberish in the system logs and on the remote VGA console (seen via iDRAC login). When this first happened to a drive I RMA'd it. Then I saw that every drive suffered from the issue and I knew something bigger and more sinister was up.

My environment:
  • Dell R710 rev II chassis. 2x 870W PSU. 2x Intel X5650 CPU. 128GB Micron ECC Registered DIMMs (tested with memtest86+ and the onboard Lifecycle Controller (F10 at boot) diagnostics). All BIOS and firmware updated to the very latest revisions.
  • Intel X520-DA2 10GbE NIC (onboard Broadcom 4x 1GbE disabled in BIOS)
  • LSI 9207-8i RAID card flashed with IT-mode firmware 20.00.07.00. UEFI and BIOS erased as unnecessary.
  • Aftermarket 0.8m SAS2 SFF-8087 cables (tested fine elsewhere)
  • RAID array is 6x disk RAIDZ2. No ARC/L2ARC, or ZIL/SLOG
  • Disks are Seagate 8TB SATA 7200rpm "IronWolf NAS" ST8000VN004-2M2101 with SC60 f/w (no update available online)
  • OS is FreeNAS 11.3U2 - BIOS boot (not UEFI) off a 32GB USB3 Samsung FIT flash drive in an internal slot
Fault trigger:

I use a great backup tool called Nakivo Backup & Replication to back up VMs from VMware vSphere (vCenter 6.7 with ESXi 6.5). The issue occurs only when doing what's called a full data verification on the local onboard repository, either scheduled via "run full data verification on a schedule" or triggered manually with "Verify all backups". It's definitely not a Nakivo issue, because the fault shows up at a much lower level in the data stack. Nakivo just uses the disks in such an intense R/W way as to trigger the underlying issue. Incidentally....
  • Normal Nakivo backups are fine
  • Resilvering is fine
  • ZFS pool scrubs are fine
Go figure! It really is a corner case, but it is very noticeable, and a data verification with Nakivo can bring down the whole array. There are *always* issues from doing a Nakivo full data verification. They don't always bring down the array or even show up on the array ("zpool status"), but there are always errors showing up in /var/log/messages, sometimes spread out by tens of minutes or even hours. Sometimes they aren't enough to FAULT a drive and sometimes they are. BTW, Nakivo runs in a FreeNAS iocage jail.

CLI outputs:

Note it was often much worse than this....
NAME                                            STATE     READ WRITE CKSUM
tank                                            DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    gptid/24d54693-38cc-11ea-bb29-b8ac6f88c792  ONLINE       0     0     0
    gptid/4e94da77-38f5-11ea-bb29-b8ac6f88c792  ONLINE       0     0     0
    gptid/1b887e9c-5520-11ea-bb29-b8ac6f88c792  ONLINE       0     0     0
    gptid/ed46b8d4-3a62-11ea-bb29-b8ac6f88c792  ONLINE       0     0     0
    gptid/3125b087-628d-11ea-af30-b8ac6f88c792  FAULTED      3     0     0  too many errors
    gptid/8755b035-3815-11ea-bb29-b8ac6f88c792  ONLINE       0     0     0

Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 52 f6 82 18 00 00 00 90 00 00 length 73728 SMID 986 Aborting command 0xfffffe0001675e20
Apr 14 13:42:26 freenas01 mps0: Sending reset from mpssas_send_abort for target ID 5
Apr 14 13:42:26 freenas01 (pass5:mps0:0:5:0): LOG SENSE. CDB: 4d 00 4d 00 00 00 00 00 40 00 length 64 SMID 338 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 141 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 aa 10 00 00 01 00 00 00 length 131072 SMID 224 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 ab 10 00 00 01 00 00 00 length 131072 SMID 334 terminated ioc 804b l(da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Apr 14 13:42:26 freenas01 oginfo 31140000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 ac 10 00 00 01 00 00 00 length 131072 SMID 131 terminated ioc 804b l(da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 aa 10 00 00 01 00 00 00
Apr 14 13:42:26 freenas01 oginfo 31140000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 52 f6 85 68 00 00 00 30 00 00 length 24576 SMID 1117 terminated ioc 804b l(da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 oginfo 31140000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 a9 10 00 00 01 00 00 00 length 131072 SMID 340 terminated ioc 804b l(da5:oginfo 31140000 scsi 0 state c xfer 0
Apr 14 13:42:26 freenas01 mps0: Unfreezing devq for target ID 5
Apr 14 13:42:26 freenas01 mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 ab 10 00 00 01 00 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 ac 10 00 00 01 00 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 52 f6 85 68 00 00 00 30 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 a9 10 00 00 01 00 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 52 f6 82 18 00 00 00 90 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: Command timeout
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 03 2a 27 aa 10 00 00 01 00 00 00
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): CAM status: SCSI Status Error
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): SCSI status: Check Condition
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr 14 13:42:26 freenas01 (da5:mps0:0:5:0): Retrying command (per sense data)
Apr 14 13:42:27 freenas01 (da5:mps0:0:5:0): WRITE(10). CDB: 2a 00 00 40 01 b0 00 00 08 00
Apr 14 13:42:27 freenas01 (da5:mps0:0:5:0): CAM status: SCSI Status Error
Apr 14 13:42:27 freenas01 (da5:mps0:0:5:0): SCSI status: Check Condition
Apr 14 13:42:27 freenas01 (da5:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr 14 13:42:27 freenas01 (da5:mps0:0:5:0): Retrying command (per sense data)

Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9a f7 aa 60 00 00 00 30 00 00 length 24576 SMID 350 Aborting command 0xfffffe0001641b60
Apr 14 15:00:10 freenas01 mps0: Sending reset from mpssas_send_abort for target ID 3
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 02 b5 9d 66 60 00 00 00 08 00 00 length 4096 SMID 569 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 698 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9c 10 d6 c0 00 00 00 a8 00 00 length 86016 SMID 429 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9c 87 47 90 00 00 00 40 00 00 length 32768 SMID 167 terminated ioc 804b lo(da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 02 b5 9d 66 60 00 00 00 08 00 00
Apr 14 15:00:10 freenas01 ginfo 31140000 scsi 0 state c xfer 0
Apr 14 15:00:10 freenas01 mps0: Unfreezing devq for target ID 3
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: CCB request completed with an error
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: CCB request completed with an error
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9a f7 aa 60 00 00 00 30 00 00
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: Command timeout
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9c 10 d6 c0 00 00 00 a8 00 00
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: CCB request completed with an error
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9c 87 47 90 00 00 00 40 00 00
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: CCB request completed with an error
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9a f7 aa 60 00 00 00 30 00 00
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): CAM status: SCSI Status Error
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): SCSI status: Check Condition
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr 14 15:00:10 freenas01 (da3:mps0:0:3:0): Retrying command (per sense data)
Apr 14 15:00:11 freenas01 (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 9e 41 5c 38 00 00 00 40 00 00
Apr 14 15:00:11 freenas01 (da3:mps0:0:3:0): CAM status: SCSI Status Error
Apr 14 15:00:11 freenas01 (da3:mps0:0:3:0): SCSI status: Check Condition
Apr 14 15:00:11 freenas01 (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr 14 15:00:11 freenas01 (da3:mps0:0:3:0): Retrying command (per sense data)
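
Since these errors can be spread over hours in /var/log/messages, a per-device tally makes it easier to see whether one disk or all of them are affected. The following is a rough sketch only: `count_cam_timeouts` is a made-up helper name, and it assumes the log lines carry the same "(daN:mps0:...)" device tag as the excerpts above.

```shell
# count_cam_timeouts
# Reads syslog-style lines on stdin and prints "<device> <count>" for every
# device that logged "CAM status: Command timeout".
count_cam_timeouts() {
    awk '/CAM status: Command timeout/ {
        # The device tag looks like "(da5:mps0:0:5:0):"; keep the part
        # between the opening parenthesis and the first colon.
        if (match($0, /\([a-z]+[0-9]+:/)) {
            dev = substr($0, RSTART + 1, RLENGTH - 2)
            n[dev]++
        }
    }
    END { for (d in n) print d, n[d] }' | sort
}
```

Run it as `count_cam_timeouts < /var/log/messages`. Interleaved kernel lines like some of those above can hide a match, so treat the counts as a lower bound.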

root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da0 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 107375886362
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
root@freenas01[~/testingstuff]#
root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da1 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 103080787992
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da2 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 90195689493
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
root@freenas01[~/testingstuff]#
root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da3 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 124555952157
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da4 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 47245361163
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
root@freenas01[~/testingstuff]# smartctl -a -q noserial /dev/da5 | egrep -e "Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable"

5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 099 099 000 Old_age Always - 55835426829
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
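
The Command_Timeout raw values above look enormous, but on Seagate drives this attribute is commonly reported to pack three 16-bit counters into one 48-bit number. smartctl can print that breakdown itself with `-v 188,raw16`. The sketch below does the same split by hand. `split_raw48` is a hypothetical helper name, and what each word counts is an assumption on my part, since nothing in this thread documents it.

```shell
# split_raw48 RAW
# Prints the three 16-bit words of a 48-bit SMART raw value, highest first.
# 65535 is the mask for the low 16 bits (0xFFFF).
# Assumption: the attribute packs three independent 16-bit counters.
split_raw48() {
    echo "$(( ($1 >> 32) & 65535 )) $(( ($1 >> 16) & 65535 )) $(( $1 & 65535 ))"
}
```

For example, `split_raw48 107375886362` (the da0 value above) prints `25 26 26`, so the real counts are in the tens, not the billions.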

root@freenas01[~]# smartctl -a -q noserial /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST8000VN004-2M2101
Firmware Version: SC60
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 15 14:26:21 2020 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 559) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 715) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 071 064 044 Pre-fail Always - 12836987
3 Spin_Up_Time 0x0003 081 081 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 40
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 045 Pre-fail Always - 97489463
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1819
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 37
18 Unknown_Attribute 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 47245361163
190 Airflow_Temperature_Cel 0x0022 071 051 040 Old_age Always - 29 (Min/Max 24/31)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 27
193 Load_Cycle_Count 0x0032 098 098 000 Old_age Always - 5246
194 Temperature_Celsius 0x0022 029 049 000 Old_age Always - 29 (0 19 0 0 0)
195 Hardware_ECC_Recovered 0x001a 071 064 000 Old_age Always - 12836987
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1134 (138 237 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 33567695866
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 34588513234

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1499 -
# 2 Extended offline Completed without error 00% 1177 -
# 3 Short offline Completed without error 00% 1165 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

NB: I am testing 19.00.00.00 firmware here but will go back to 20.00.07.00 as v19 didn't help.

root@freenas01[~]# camcontrol devlist
<ATA ST8000VN004-2M21 SC60> at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST8000VN004-2M21 SC60> at scbus0 target 1 lun 0 (pass1,da1)
<ATA ST8000VN004-2M21 SC60> at scbus0 target 2 lun 0 (pass2,da2)
<ATA ST8000VN004-2M21 SC60> at scbus0 target 3 lun 0 (pass3,da3)
<ATA ST8000VN004-2M21 SC60> at scbus0 target 4 lun 0 (pass4,da4)
<ATA ST8000VN004-2M21 SC60> at scbus0 target 5 lun 0 (pass5,da5)
<PLDS DVD-ROM DS-8DBSH MD51> at scbus1 target 0 lun 0 (pass6,cd0)
<Samsung Flash Drive FIT 1100> at scbus4 target 0 lun 0 (pass7,da6)
<iDRAC LCDRIVE 0323> at scbus5 target 0 lun 0 (pass8,da7)
<iDRAC Virtual CD 0323> at scbus6 target 0 lun 0 (pass9,cd1)
root@freenas01[~]# mpsutil show all
Adapter:
mps0 Adapter:
Board Name: SAS9207-8i
Board Assembly: H5-25412-00C
Chip Name: LSISAS2308
Chip Revision: ALL
BIOS Revision: 7.37.00.00
Firmware Revision: 19.00.00.00
Integrated RAID: no

PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device
0 0001 0009 N 6.0 1.5 6.0 SAS Initiator
1 0002 000a N 6.0 1.5 6.0 SAS Initiator
2 0004 000c N 6.0 1.5 6.0 SAS Initiator
3 0003 000b N 6.0 1.5 6.0 SAS Initiator
4 N 1.5 6.0 SAS Initiator
5 N 1.5 6.0 SAS Initiator
6 0005 000d N 6.0 1.5 6.0 SAS Initiator
7 0006 000e N 6.0 1.5 6.0 SAS Initiator

Devices:
B____T SAS Address Handle Parent Device Speed Enc Slot Wdt
00 00 4433221100000000 0009 0001 SATA Target 6.0 0001 03 1
00 01 4433221101000000 000a 0002 SATA Target 6.0 0001 02 1
00 03 4433221103000000 000b 0003 SATA Target 6.0 0001 00 1
00 02 4433221102000000 000c 0004 SATA Target 6.0 0001 01 1
00 04 4433221106000000 000d 0005 SATA Target 6.0 0001 05 1
00 05 4433221107000000 000e 0006 SATA Target 6.0 0001 04 1
Enclosures:
Slots Logical ID SEPHandle EncHandle Type
08 500605b008de16d0 0001 Direct Attached SGPIO

Expanders:
NumPhys SAS Address DevHandle Parent EncHandle SAS Level

root@freenas01[~]# diskinfo -t /dev/da0
/dev/da0
512 # sectorsize
8001563222016 # mediasize in bytes (7.3T)
15628053168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
972801 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
ATA ST8000VN004-2M21 # Disk descr.
######## # Disk ident. (I hid this output)
No # TRIM/UNMAP support
7200 # Rotation rate in RPM
Not_Zoned # Zone Mode

Seek times:
Full stroke: 250 iter in 5.823103 sec = 23.292 msec
Half stroke: 250 iter in 4.550814 sec = 18.203 msec
Quarter stroke: 500 iter in 4.087809 sec = 8.176 msec
Short forward: 400 iter in 1.137892 sec = 2.845 msec
Short backward: 400 iter in 2.081851 sec = 5.205 msec
Seq outer: 2048 iter in 0.103975 sec = 0.051 msec
Seq inner: 2048 iter in 0.213326 sec = 0.104 msec

Transfer rates:
outside: 102400 kbytes in 0.588111 sec = 174117 kbytes/sec
middle: 102400 kbytes in 0.474089 sec = 215993 kbytes/sec
inside: 102400 kbytes in 0.843246 sec = 121436 kbytes/sec

My problem-solving actions:
  1. After my first FAULTED disk I RMA'd it straight away with Seagate. Got a new one back.
  2. Problems then happened on several disks - in fact all disks. My whole array became UNAVAILABLE once and I almost had a combined heart attack + brain explosion. A reboot brought everything back up as if nothing had happened (phew!) A zfs scrub of the data showed no issues. Very strange.
  3. I originally had a Dell HBA crossflashed to LSI. It had a SAS2008 controller. Suspecting some crossflash dodginess, I bought a brand new LSI 9207-8i, updated the firmware to 20.00.07.00, and flashed in the BIOS and UEFI boot ROMs.
  4. Swapped SAS2 SFF-8087 0.8m cables with new ones
  5. Swapped SAS backplane card from another chassis I had spare
  6. Swapped Intel X520-DA2 with a new one (shares IRQ with LSI card)
  7. Swapped NIC and LSI cards around in different PCIe slots, trying different IRQs (mindful that some are PCIe 2.0 x8 and some are PCIe 2.0 x4)
  8. Turned off unneeded items in the BIOS like 1Gbps Ethernet ports and serial console
  9. Changed BIOS QPI Bandwidth Priority from "Compute" to "I/O" and turned on "Maximum Performance" power setting
  10. Swapped 870W power supplies with new ones
  11. Downgraded LSI firmware to 19.00.00.00, with and without BIOS and UEFI boot ROMs
  12. You are here :)
There's no reason for me to suspect the CPU and motherboard after running motherboard, CPU and RAM tests, and the system runs stable as a rock. It's just the ZFS / disk issues I'm seeing. I run an Ubuntu Linux VM under Bhyve and have had zero issues with that.

Thoughts so far:
I really do think at this stage it's the drives, in combination with FreeNAS 11.2 and above. I didn't run FreeNAS 11.1 on this setup. I have tried 11.2, 11.3, 11.3U1 and 11.3U2, all with the issue. I think the FreeNAS BSD drivers are exposing the underlying drive issue.

There are no firmware updates for my 8TB drives, so I'm stuck with SC60, it seems.

I will attempt to turn off Native Command Queueing on the Seagates (if I can work out how to do it and confirm it's done, mindful it's not reboot persistent without a shell script).

I did have slower 5,400 rpm WD Red NAS drives in the system initially but swapped them out, both for the faster Seagates and because I wanted those slower, cooler, quieter drives in my second home NAS instead of the work array. Those drives are all "WDC WD80EFZX-68UW8N0" and I don't recall those ever being an issue. After playing around with NCQ, I may slot the WD drives into the work array (the one I'm having issues with here) one by one and then see if the issue persists after a full drive swapover.

If that's the case I'd be tempted to get WD Red Pro's in there, but only if I can be guaranteed they are CMR and not SMR drives! Ughh.. the SATA drive space really has become a system builder's nightmare these days.
 

Gcon

Junior Member
Joined
Aug 1, 2015
Messages
14
The results are in. The "camcontrol tags ..." command has worked around the issue I was having! I did a full Nakivo backups verification and it was rock solid. It did take a bit longer than usual, but at least there were no issues whatsoever. This never happens with these disks.

At long last I've finally gotten to the root of the issue!

The post-init script I use for my "ST8000VN004-2M21 SC60" 8TB SATA 7200rpm Seagate IronWolf drives is:
Code:
#!/usr/local/bin/bash

# Limit the tagged command queue to a single opening on every matching Seagate drive.
for i in $(camcontrol devlist | grep "ATA ST800" | cut -d"," -f2 | cut -d")" -f1); do
    camcontrol tags "$i" -N 1
done
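
Before running the post-init script above, it can help to preview which devices the pattern would touch. This is a sketch under assumptions: `list_da_devices` is a hypothetical helper, and it matches the daN name on either side of the comma in case the device names come out in the other order on a different system, which the `cut` pipeline would not handle the same way.

```shell
# list_da_devices PATTERN
# Reads `camcontrol devlist` output on stdin and prints the daN name of
# every line matching PATTERN, whichever side of the comma it appears on.
list_da_devices() {
    grep -- "$1" | sed -n 's/.*[(,]\(da[0-9][0-9]*\)[,)].*/\1/p'
}
```

Usage would be `camcontrol devlist | list_da_devices "ATA ST800"`. Check that the list contains only the drives you mean to change before applying `camcontrol tags`.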



For future searchers, I was initially confused that the "camcontrol tags" command wasn't disabling NCQ because of this:
root@freenas01[~]# camcontrol inquiry da0
pass0: <ATA ST8000VN004-2M21 SC60> Fixed Direct Access SPC-4 SCSI device
pass0: Serial Number ##hidden##
pass0: 600.000MB/s transfers, Command Queueing Enabled


...specifically that Command Queueing was still "Enabled". But what the "tags" command is doing is shrinking the command queue down from 255 to 1, and a queue of one is effectively not a queue.

You can verify before and after the change that the sum of "dev_openings" and "dev_active" shrinks from 255 to 1.
root@freenas01[~]# camcontrol tags /dev/da0 -q -v
(pass0:mps0:0:0:0): dev_openings 252
(pass0:mps0:0:0:0): dev_active 3
(pass0:mps0:0:0:0): allocated 3
(pass0:mps0:0:0:0): queued 0
(pass0:mps0:0:0:0): held 0
(pass0:mps0:0:0:0): mintags 2
(pass0:mps0:0:0:0): maxtags 255
root@freenas01[~]#
root@freenas01[~]# camcontrol tags /dev/da0 -N 1
(pass0:mps0:0:0:0): tagged openings now 1
(pass0:mps0:0:0:0): device openings: 1
root@freenas01[~]#
root@freenas01[~]# camcontrol tags /dev/da0 -q -v
(pass0:mps0:0:0:0): dev_openings 0
(pass0:mps0:0:0:0): dev_active 1
(pass0:mps0:0:0:0): allocated 1
(pass0:mps0:0:0:0): queued 0
(pass0:mps0:0:0:0): held 0
(pass0:mps0:0:0:0): mintags 2
(pass0:mps0:0:0:0): maxtags 255
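
The sum described above can also be checked without adding the numbers by eye. A minimal sketch, assuming the output keeps the three-column layout shown here (`queue_depth` is a hypothetical helper name):

```shell
# queue_depth
# Reads `camcontrol tags <dev> -q -v` output on stdin and prints
# dev_openings + dev_active, i.e. the effective queue depth.
queue_depth() {
    awk '$2 == "dev_openings" || $2 == "dev_active" { sum += $3 }
         END { print sum + 0 }'
}
```

With the outputs above, `camcontrol tags /dev/da0 -q -v | queue_depth` would give 255 before the change and 1 after it.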

It really is unbelievable that Seagate would take several years to come out with a fix for this *and* that they would release an SC61 firmware update for the 10TB Seagate IronWolf (+ IronWolf Pro) models ST10000VN0004 and ST10000NE0004, but not release SC61 firmware for the equivalent 8TB models ST8000VN004 (IronWolf) and ST8000NE0004 (IronWolf Pro).

IronWolf models
IronWolf Pro models

Now it's time to raise this issue with Seagate support and reference this post; the informative post by user Quindor; and the Synology forums discussion.

PS - I would also like to know what the 3-number vs 4-number suffixes to the model #'s mean, i.e. "x004" vs "x0004". Anyone happen to know?
 

bferrell

Junior Member
Joined
Dec 10, 2018
Messages
15
Thanks for all the information folks, the cache thing didn't work, but the EN02 firmware update seems to have fixed me!
 

Gcon

Junior Member
Joined
Aug 1, 2015
Messages
14
Just to put a full stop to this for my issues. I engaged with Seagate tech support and they supplied me with another firmware for my ST8000VN004 8TB drives. Good news - I could enable NCQ again and not have any issues with SMART Command_Timeout accumulation, and I could do a full Nakivo backups verification, plus ZFS scrub and not have any issues. So I consider this issue resolved.

I used the Seagate utility to boot from a USB stick to update the drives. Thankfully it worked with the LSI RAID card, so that I didn't have to tediously transpose the drives into another caddy to put into a Windows server to run. The weird thing is that the "new" firmware still says SC60 and not SC61. You'd think they'd at least give it an engineering code like SC60e or something or even SE60. Anyways - there's other IT challenges to tackle. Finally I can put this hard drive madness to rest! (touch wood).
 

Pheran

Senior Member
Joined
Jul 14, 2015
Messages
274
Could you provide a link to this so others can benefit? I don't have any of these drives myself, but this is the first time I've heard of new firmware being available for the 8TB model.
 

jkng88

Neophyte
Joined
Dec 21, 2014
Messages
6
Yes, can you share the firmware? I am facing the same issues as well.
 

xandercage78

Newbie
Joined
Jul 3, 2020
Messages
1
I was having similar issues and just recently Seagate provided me a file for my ST8000VN0002 drive. Not sure if it will work for others, because I see your model number is different, but it worked for mine and it says it's SC61. Unfortunately, I can't attach it here. If anyone knows how I can do it, let me know, or send me a message and I'll share it.
 