Pool degraded, no SMART errors

Odub · Mar 20, 2020

Hello,
I recently replaced all 8 drives in my server with new drives. (4TB HGST to shucked 10TB Seagate Barracuda Pros).

I actually had to rebuild the entire Pool to change the block size to 4k, then replicated to the new drives. I did this using two chassis and two LSI cards due to not having enough space in my primary chassis

I did have some weird device reset errors when I had the drives split up, I attributed this to the precarious nature of the setup, and the errors were on my backup chassis.

Completed the transfer and put the new drives into the primary chassis. Everything was fine until last week when I got a notification the Pool was degraded.
DA7 was showing some read/write errors and degraded, I power cycled the server and re-seated the drive, it booted up fine and the Pool was normal again.

Fast forward to yesterday when I received a notification the Pool was degraded again.
DA7 and DA5 both have read/write errors.
I have SMART tests scheduled but received no SMART warnings so I checked for errors, here is the output:

DA5


SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   064   044    Pre-fail  Always       -       44632200
  3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   045    Pre-fail  Always       -       234027207
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1800 (122 111 0)
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       20
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   069   069   000    Old_age   Always       -       31
190 Airflow_Temperature_Cel 0x0022   074   047   040    Old_age   Always       -       26 (Min/Max 24/28)
191 G-Sense_Error_Rate      0x0032   093   093   000    Old_age   Always       -       15492
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       11551
194 Temperature_Celsius     0x0022   026   053   000    Old_age   Always       -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   009   002   000    Old_age   Always       -       44632200
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       880 (239 115 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       35954624018
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       40571654043

DA7


SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   064   044    Pre-fail  Always       -       38111312
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       122729332
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       571 (199 172 0)
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   081   081   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   074   047   040    Old_age   Always       -       26 (Min/Max 24/28)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1756
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1644
194 Temperature_Celsius     0x0022   026   053   000    Old_age   Always       -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   009   001   000    Old_age   Always       -       38111312
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       483 (175 26 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28520739476
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32704312746

Disk errors reported in FreeNAS:

I'm no expert with SMART data but I don't see any read/write errors present? What would cause the ZFS to see read/write errors? The LSI card, or am I missing something perhaps? All the new drives are on the same LSI card in IT mode that was working with the old drives.

Thanks

SweetAndLow · Mar 21, 2020

What do your smart long test think?

Odub · Mar 21, 2020

The last scheduled long tests on both drives "completed without error".

I powered down the server, re-seated all of the drives and cards then rebooted. Pool is no longer degraded.

I guess I will see how long this lasts?

Odub · Jul 2, 2020

So I still have this issue, I have been limping it along by rebooting every few days but I'm out of ideas....

Anywhere from a couple hours to a couple days it will be fine, but then I will have a few read/write errors on a random disk due to SCSI power resets. If it leave it without rebooting, another disk will get random errors. This is resolved by rebooting and it re-slivers fine.

My issue seems to be identical to what was encountered in this thread, with SCSI resets:

Solved - CAM status: SCSI Status Error

I noticed these errors this morning. Any ideas? Two such drives had similar messages, but on msp2:0 (i.e. the other drive was (da19:mps2:0:12:0)). More info at https://gist.github.com/dlangille/88eac25349577aaca22a401ac08e9d1b Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): READ(10). CDB...

forums.freebsd.org

Example of the error he was having:

Code:

Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): READ(10). CDB: 28 00 32 86 d0 c8 00 00 18 00
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)

Which was resolved by replacing the chassis, but I have replaced the backplane and power supplies already so I am not sure what else could be the issue.

So far I have replaced the backplane, LSI card, all cables and both power supplies in the chassis to no avail.
I have rearranged the drive positions as well within the chassis, the errors seem random and don't have any pattern I can discern.

Here is a link to the logs I have been taking for the errors, maybe someone can see something I cannot? All Smart reports have no errors.
Google Sheets - FreeNAS SCSI Errors

I've read that one drive on the bus can cause others to reset, and there are a few drives that come up more than others, but I only have one spare drive that I can swap with. Would it be wise to replace a drive, let it re-sliver and see if the error returns? Then put the old drive back if it does and go to the next one? This would mean a lot of re-slivers if it isn't one of the first drives I swap, which could leave my pool vulnerable if a SCSI error occurs while re-slivering.

Any ideas?

SweetAndLow · Jul 3, 2020

It's a hardware problem. You need to tell us what hardware you have if you want better help.

Odub · Jul 3, 2020

Intel server R2312GL4GS
Intel mobo S2600GZ
Dual E5-2650v2 CPU
64GB ECC
Two Avago LSI SAS2008 Firm: 20.00.07.00
8 X Seagate Barracuda Pro 10TB ST10000DM0004-1Z

Everything worked fine on this hardware until I upgraded to Freenas 11 and migrated drives.

Yorick · Jul 3, 2020

Try swapping cables and see whether the error follows the cable. I can only see port, cable or drive as the issue here.

subhuman · Jul 3, 2020

Everything worked fine on this hardware until I upgraded to Freenas 11 and migrated drives.

Those last three words are important. You changed hardware. No matter how experienced or careful you are, there's always the chance for something to go wrong.
Also, this is the first mention you've made about changing FreeNAS versions. Let's get some details:
-what version did you upgrade from?
-what version did you upgrade to?

On an unrelated note,
DA5

193 Load_Cycle_Count 0x0032 095 095 000 Old_age Always - 11551

That's kinda high for 1800 power-on-hours. Not dangerously or alarmingly so, but that's roughly 6x/hour. The weird thing is DA7 is accumulating load cycles at about half that rate, which is weird if they're part of the same pool.
It's probably nothing, but the discrepancy piqued my curiousity.

Odub · Jul 3, 2020

I have tried swapping my spare cables and purchased all new cables as well, same random errors.

I believe I went from FreeNAS 9.3 to 11.1. I had to recreate the pool as my old pool was configured with 512B block size which was incompatible with my new drives. Thread here:
https://www.ixsystems.com/community/threads/expand-raidz2.81388/#post-563851

All the drives are part of the same pool and should have very similar on times. I'm thinking it must be a drive reset causing issues on the SCSI bus but how do I determine which drive it would be?
Would load cycle be an attribute of that drive reslivering more than others? Perhaps that is the bad drive?

Pheran · Jul 3, 2020

Welcome to the hell of Seagate Ironwolf. Take a look at this thread and see if there's a firmware update for your drive.

Odub · Jul 3, 2020

Pheran said:
Welcome to the hell of Seagate Ironwolf. Take a look at this thread and see if there's a firmware update for your drive.

Damn thanks for this, it looked like THE solution as it is exactly what I am experiencing but Seagate is showing no firmware updates for my particular drive (DM vs VN drives). So I need all new drives then? Damn Seagate, I said I would never buy one again but the price on these was hard to pass up...

The issue does seem to be limited to the LSI cards with these drives though, it may be cheaper for me to switch to server hardware with onboard SAS/SATA connectors.

subhuman · Jul 3, 2020

Would load cycle be an attribute of that drive reslivering more than others? Perhaps that is the bad drive?

A load cycle occurs when the heads park off the platter after the drive's been idle for some time in an attempt to reduce power consumption. Normally, drives in the same pool will have similar numbers because they're all going to be busy or all idle together. You did say these were shucked drives, so maybe they just have more aggressive parking settings than usual. But back to your question, usually a drive that is being worked harder will have a lower value. If a drive's working hard, it's rarely idle, thus no opportunity to park.
Most drives are rated for a certain number of load/unload cycles over their lifetime. That varies based on drive type, typically in the 150k-300k range.

As I said earlier, you're not accumulating them at an alarming rate (6 per hour) unlike this guy:

[SOLVED] - New Seagate NAS HDD - Extremely High Load/Unload cycle count

I bought a new Seagate 6TB NAS HDD to use as a storage upgrade in my desktop. As you will see in the picture it has less than 100hrs of use. I have noticed that it makes a sound every 5-7 seconds and have learned that the sound is the heads parking. The previous drive(Seagate 2TB FireCudaSSHD)...

forums.tomshardware.com

who was accumulating several hundred per hour and Seagate RMAd his drive.
I don't think yours is a problem, but it may behoove you to contact Seagate since there is significant variance between otherwise identical drives.

Odub · Jul 3, 2020

subhuman said:
A load cycle occurs when the heads park off the platter after the drive's been idle for some time in an attempt to reduce power consumption. Normally, drives in the same pool will have similar numbers because they're all going to be busy or all idle together. You did say these were shucked drives, so maybe they just have more aggressive parking settings than usual. But back to your question, usually a drive that is being worked harder will have a lower value. If a drive's working hard, it's rarely idle, thus no opportunity to park.
Most drives are rated for a certain number of load/unload cycles over their lifetime. That varies based on drive type, typically in the 150k-300k range.

As I said earlier, you're not accumulating them at an alarming rate (6 per hour) unlike this guy:

[SOLVED] - New Seagate NAS HDD - Extremely High Load/Unload cycle count

I bought a new Seagate 6TB NAS HDD to use as a storage upgrade in my desktop. As you will see in the picture it has less than 100hrs of use. I have noticed that it makes a sound every 5-7 seconds and have learned that the sound is the heads parking. The previous drive(Seagate 2TB FireCudaSSHD)...

forums.tomshardware.com

who was accumulating several hundred per hour and Seagate RMAd his drive.
I don't think yours is a problem, but it may behoove you to contact Seagate since there is significant variance between otherwise identical drives.

Thanks for that info, something else I just noticed was the UDMA CRC Error Count is quite high on one of the drives:

Ser	UDMA CRC Error Count	Load Cycle Count	Command Timeout
AWL	0	490	0 0 0
F0B	0	3087	19 20 79
D2W	45320	1461	0 0 473
RXB	259	1282	0 0 0
W9Z	0	389	0 0 0
Z3Y	330	11811	1 1 3
DK2	0	1864	0 0 0
Z1G	0	1030	0 0 0

Odub · Jul 12, 2020

I engaged Seagate support looking for a firmware update or any other insights they might have, after a week their feedback has been to put the drives into a windows machine and run Seatools.....

Here is the exact error message I'm seeing on the drives:

Code:


Jul 12 08:52:22 freenas  (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 952 Aborting command 0xfffffe0001904180
Jul 12 08:52:22 freenas mps1: Sending reset from mpssas_send_abort for target ID 22
Jul 12 08:52:22 freenas  (pass7:mps1:0:22:0): LOG SENSE. CDB: 4d 00 4d 00 00 00 00 00 40 00 length 64 SMID 518 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 12 08:52:22 freenas mps1: Unfreezing devq for target ID 22
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): CAM status: Command timeout
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Retrying command
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): CAM status: SCSI Status Error
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SCSI status: Check Condition
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Error 6, Retries exhausted
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Invalidating pack
Jul 12 08:52:23 freenas ZFS: vdev state changed, pool_guid=5819163530430613475 vdev_guid=11434061962767615045
Jul 12 08:52:23 freenas ZFS: vdev state changed, pool_guid=5819163530430613475 vdev_guid=11434061962767615045

Pheran · Jul 13, 2020

What firmware do you have right now? There's a bit of online discussion about an EN02 update for Ironwolf Pro drives. This may be the Pro equivalent of SC61 for the regular Ironwolf drives.

Seagate Technology - Download Finder

apps1.seagate.com

Odub · Jul 17, 2020

DN01 Firmware.
I was told by Seagate support that there is no new Firmware available for these and they still want me to put it into a windows machine.
Is there a way to get to Level 2 support or someone who will actually read and understand what the problem is?

Pheran · Jul 17, 2020

Oh, that's not an Ironwolf. Looks like you edited your original post to fix it. I have no idea whether 10TB BarraCuda Pro drives are also affected by this.

Odub · Jul 17, 2020

Yes I had the drive model wrong in the original post, sorry.

Important Announcement for The TrueNAS Community.

Pool degraded, no SMART errors

Dabbler

Sweet'NASty

Dabbler

Dabbler

Sweet'NASty

Dabbler

Wizard

Contributor

Dabbler

Patron

Dabbler

Contributor

Dabbler

Dabbler

Patron

Dabbler

Patron

Dabbler

Similar threads

Important Announcement for The TrueNAS Community.