Pool degraded, no SMART errors

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
Hello,
I recently replaced all 8 drives in my server with new ones (4TB HGST to shucked 10TB Seagate Barracuda Pros).

I actually had to rebuild the entire pool to change the block size to 4K, then replicated to the new drives. I did this using two chassis and two LSI cards, since I didn't have enough bays in my primary chassis.

I did see some weird device reset errors while the drives were split up, but I attributed those to the precarious nature of the setup, and they were on the backup chassis.

Completed the transfer and put the new drives into the primary chassis. Everything was fine until last week, when I got a notification that the pool was degraded.
DA7 was showing some read/write errors and was degraded. I power cycled the server and re-seated the drive; it booted up fine and the pool was normal again.

Fast forward to yesterday, when I received a notification that the pool was degraded again.
DA7 and DA5 both have read/write errors.
I have SMART tests scheduled but received no SMART warnings, so I checked the attributes myself; here is the output:

DA5
Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  076   064   044    Pre-fail Always      -       44632200
  3 Spin_Up_Time            0x0003  090   090   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always      -       21
  5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000f  084   060   045    Pre-fail Always      -       234027207
  9 Power_On_Hours          0x0032  098   098   000    Old_age  Always      -       1800 (122 111 0)
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always      -       20
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always      -       0
188 Command_Timeout         0x0032  100   099   000    Old_age  Always      -       4295032833
189 High_Fly_Writes         0x003a  069   069   000    Old_age  Always      -       31
190 Airflow_Temperature_Cel 0x0022  074   047   040    Old_age  Always      -       26 (Min/Max 24/28)
191 G-Sense_Error_Rate      0x0032  093   093   000    Old_age  Always      -       15492
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always      -       13
193 Load_Cycle_Count        0x0032  095   095   000    Old_age  Always      -       11551
194 Temperature_Celsius     0x0022  026   053   000    Old_age  Always      -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a  009   002   000    Old_age  Always      -       44632200
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0023  100   100   001    Pre-fail Always      -       0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline     -       880 (239 115 0)
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline     -       35954624018
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline     -       40571654043

DA7
Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  076   064   044    Pre-fail Always      -       38111312
  3 Spin_Up_Time            0x0003  091   091   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always      -       14
  5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000f  081   060   045    Pre-fail Always      -       122729332
  9 Power_On_Hours          0x0032  100   100   000    Old_age  Always      -       571 (199 172 0)
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always      -       11
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always      -       0
188 Command_Timeout         0x0032  100   100   000    Old_age  Always      -       0
189 High_Fly_Writes         0x003a  081   081   000    Old_age  Always      -       19
190 Airflow_Temperature_Cel 0x0022  074   047   040    Old_age  Always      -       26 (Min/Max 24/28)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always      -       1756
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always      -       2
193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always      -       1644
194 Temperature_Celsius     0x0022  026   053   000    Old_age  Always      -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a  009   001   000    Old_age  Always      -       38111312
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0023  100   100   001    Pre-fail Always      -       0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline     -       483 (175 26 0)
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline     -       28520739476
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline     -       32704312746

Disk errors reported in FreeNAS:
[screenshot: per-disk error counts from the FreeNAS UI]


I'm no expert with SMART data, but I don't see any read/write errors present. What would cause ZFS to see read/write errors? The LSI card, or am I missing something, perhaps? All the new drives are on the same LSI card in IT mode that was working with the old drives.
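For what it's worth, here is how I have been cross-checking what ZFS sees against the drives' own logs (just a sketch; da5 and da7 are how FreeNAS happens to enumerate my disks):

Code:
# ZFS's per-device read/write/checksum error counters
zpool status -v

# each drive's own SMART error log, separate from the attribute table above
smartctl -l error /dev/da5
smartctl -l error /dev/da7

If the drives' own error logs stay clean while ZFS keeps counting errors, the fault is usually upstream of the disk: cable, backplane, or HBA.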

Thanks
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What do your SMART long tests think?
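If you want to double-check from the shell, kick one off and read back the result once it finishes, e.g.:

Code:
smartctl -t long /dev/da5      # starts the long self-test in the background
smartctl -l selftest /dev/da5  # shows the self-test log when it's done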
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
The last scheduled long tests on both drives "completed without error".

I powered down the server, re-seated all of the drives and cards then rebooted. Pool is no longer degraded.

I guess I will see how long this lasts?
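If it degrades again, I'll try clearing the error counters and scrubbing instead of a full reboot; something like this, with tank standing in for my actual pool name:

Code:
zpool clear tank      # reset the per-device read/write/cksum error counters
zpool scrub tank      # then re-verify everything on disk
zpool status -v tank  # and watch whether the errors come back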
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
So I still have this issue. I have been limping along by rebooting every few days, but I'm out of ideas...

Anywhere from a couple of hours to a couple of days it will be fine, but then I will get a few read/write errors on a random disk due to SCSI power resets. If I leave it without rebooting, another disk will start getting random errors. Rebooting resolves it, and the pool resilvers fine.

My issue seems to be identical to what was encountered in this thread, with SCSI resets:

Example of the error he was having:
Code:
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): READ(10). CDB: 28 00 32 86 d0 c8 00 00 18 00
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Aug 25 06:10:33 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)


That thread was resolved by replacing the chassis, but I have already replaced the backplane and power supplies, so I am not sure what else could be the issue.

So far I have replaced the backplane, the LSI card, all cables, and both power supplies in the chassis, to no avail.
I have also rearranged the drive positions within the chassis; the errors seem random and don't follow any pattern I can discern.

Here is a link to the logs I have been keeping of these errors; maybe someone can see something I cannot? All SMART reports show no errors.
Google Sheets - FreeNAS SCSI Errors

I've read that one drive on the bus can cause others to reset, and there are a few drives that come up more than others, but I only have one spare drive that I can swap in. Would it be wise to replace a drive, let it resilver, and see if the error returns? Then put the old drive back if it does and move on to the next one? That would mean a lot of resilvers if it isn't one of the first drives I swap, which could leave my pool vulnerable if a SCSI error occurs while resilvering.
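To keep track of which physical drive is which while I shuffle them, I have been mapping the da numbers to serials with a quick loop (device list adjusted to however your system enumerates):

Code:
for d in /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7; do
  echo -n "$d: "
  smartctl -i $d | grep -i 'serial number'
done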

Any ideas?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It's a hardware problem. You need to tell us what hardware you have if you want better help.
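At a minimum, post the output of these so we can see the HBAs, firmware, and how the drives enumerate:

Code:
sas2flash -listall   # LSI HBA models and firmware versions
camcontrol devlist   # every drive with its controller, target, and daX name
dmesg | grep -i mps  # recent driver messages from the LSI cards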
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
Intel server R2312GL4GS
Intel mobo S2600GZ
Dual E5-2650v2 CPU
64GB ECC
Two Avago LSI SAS2008 cards, firmware 20.00.07.00
8 X Seagate Barracuda Pro 10TB ST10000DM0004-1Z

Everything worked fine on this hardware until I upgraded to FreeNAS 11 and migrated drives.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Try swapping cables and see whether the error follows the cable. I can only see port, cable or drive as the issue here.
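The tuple in those log lines tells you where to look: in (da18:mps2:0:11:0), that's device da18 on controller mps2, target 11. Note down which target each bay maps to before and after the swap:

Code:
camcontrol devlist   # prints lines like '... at scbus2 target 11 lun 0 (pass18,da18)'

If the same target keeps erroring no matter which drive sits in it, suspect the port or cable; if the errors follow a serial number, suspect the drive.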
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
Everything worked fine on this hardware until I upgraded to FreeNAS 11 and migrated drives.
Those last three words are important. You changed hardware. No matter how experienced or careful you are, there's always a chance for something to go wrong.
Also, this is the first mention you've made about changing FreeNAS versions. Let's get some details:
-what version did you upgrade from?
-what version did you upgrade to?

On an unrelated note,
DA5
193 Load_Cycle_Count 0x0032 095 095 000 Old_age Always - 11551
That's kinda high for 1800 power-on hours. Not dangerously or alarmingly so, but that's roughly 6x/hour. The odd thing is that DA7 is accumulating load cycles at about half that rate, which is strange if they're part of the same pool.
It's probably nothing, but the discrepancy piqued my curiosity.
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
I have tried swapping in my spare cables and purchased all new cables as well; same random errors.

I believe I went from FreeNAS 9.3 to 11.1. I had to recreate the pool because my old pool was configured with a 512B block size, which was incompatible with my new drives. Thread here:
https://www.ixsystems.com/community/threads/expand-raidz2.81388/#post-563851

All the drives are part of the same pool and should have very similar on-times. I'm thinking it must be a drive reset causing issues on the SCSI bus, but how do I determine which drive it is?
Would the load cycle count be an artifact of that drive resilvering more than the others? Perhaps that is the bad drive?
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
Welcome to the hell of Seagate Ironwolf. Take a look at this thread and see if there's a firmware update for your drive.

Damn, thanks for this. It looked like THE solution, as it is exactly what I am experiencing, but Seagate shows no firmware updates for my particular drive (DM vs. VN drives). So I need all new drives, then? Damn Seagate, I said I would never buy one again, but the price on these was hard to pass up...

The issue does seem to be limited to the LSI cards with these drives, though; it may be cheaper for me to switch to server hardware with onboard SAS/SATA connectors.
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
Would the load cycle count be an artifact of that drive resilvering more than the others? Perhaps that is the bad drive?
A load cycle occurs when the heads park off the platter after the drive's been idle for some time in an attempt to reduce power consumption. Normally, drives in the same pool will have similar numbers because they're all going to be busy or all idle together. You did say these were shucked drives, so maybe they just have more aggressive parking settings than usual. But back to your question, usually a drive that is being worked harder will have a lower value. If a drive's working hard, it's rarely idle, thus no opportunity to park.
Most drives are rated for a certain number of load/unload cycles over their lifetime. That varies based on drive type, typically in the 150k-300k range.
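To put a number on it: at DA5's rate of about 6.4 cycles/hour (11551 cycles in 1800 power-on hours), even the conservative 150k rating works out to 150000 / 6.4, roughly 23,000 more power-on hours, or about 2.7 years of 24/7 operation, before it becomes an endurance concern.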

As I said earlier, you're not accumulating them at an alarming rate (6 per hour), unlike this guy:
who was accumulating several hundred per hour, and Seagate RMA'd his drive.
I don't think yours is a problem, but it may behoove you to contact Seagate, since there is significant variance between otherwise identical drives.
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
A load cycle occurs when the heads park off the platter after the drive's been idle for some time in an attempt to reduce power consumption. Normally, drives in the same pool will have similar numbers because they're all going to be busy or all idle together. You did say these were shucked drives, so maybe they just have more aggressive parking settings than usual. But back to your question, usually a drive that is being worked harder will have a lower value. If a drive's working hard, it's rarely idle, thus no opportunity to park.
Most drives are rated for a certain number of load/unload cycles over their lifetime. That varies based on drive type, typically in the 150k-300k range.

As I said earlier, you're not accumulating them at an alarming rate (6 per hour), unlike this guy:
who was accumulating several hundred per hour, and Seagate RMA'd his drive.
I don't think yours is a problem, but it may behoove you to contact Seagate, since there is significant variance between otherwise identical drives.

Thanks for that info. Something else I just noticed was that the UDMA CRC Error Count is quite high on one of the drives:


Code:
Serial    UDMA CRC Error Count  Load Cycle Count  Command Timeout
AWL0490   0                     0                 0
F0B0308   719                   20                79
D2W4532   14610                 0                 473
RXB2591   2820                  0                 0
W9Z0389   0                     0                 0
Z3Y3301   18111                 1                 3
DK20186   40                    0                 0
Z1G0103   0                     0                 0
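I pulled those numbers with a loop along these lines (a sketch; the da numbering is whatever FreeNAS assigned), and since the CRC counter is cumulative as far as I know, I'm re-running it periodically to see which counters are still climbing:

Code:
for d in /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7; do
  echo "$d"
  smartctl -A $d | egrep 'UDMA_CRC_Error_Count|Load_Cycle_Count|Command_Timeout'
done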
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
I engaged Seagate support looking for a firmware update or any other insights they might have. After a week, their feedback has been to put the drives into a Windows machine and run SeaTools...

Here is the exact error message I'm seeing on the drives:

Code:
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 952 Aborting command 0xfffffe0001904180
Jul 12 08:52:22 freenas mps1: Sending reset from mpssas_send_abort for target ID 22
Jul 12 08:52:22 freenas (pass7:mps1:0:22:0): LOG SENSE. CDB: 4d 00 4d 00 00 00 00 00 40 00 length 64 SMID 518 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 12 08:52:22 freenas mps1: Unfreezing devq for target ID 22
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): CAM status: Command timeout
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Retrying command
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): CAM status: SCSI Status Error
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SCSI status: Check Condition
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Error 6, Retries exhausted
Jul 12 08:52:22 freenas (da6:mps1:0:22:0): Invalidating pack
Jul 12 08:52:23 freenas ZFS: vdev state changed, pool_guid=5819163530430613475 vdev_guid=11434061962767615045
Jul 12 08:52:23 freenas ZFS: vdev state changed, pool_guid=5819163530430613475 vdev_guid=11434061962767615045
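I'm now watching for these live rather than waiting for the degraded-pool alert; a simple tail on the console log catches them:

Code:
tail -F /var/log/messages | grep -E 'mps|CAM status|SCSI sense|vdev state'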
 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
The drives are on DN01 firmware.
I was told by Seagate support that there is no new firmware available for these, and they still want me to put one into a Windows machine.
Is there a way to get to Level 2 support, or to someone who will actually read and understand what the problem is?
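For the record, this is the firmware string FreeNAS itself reports for the drives (da6 as an example):

Code:
smartctl -i /dev/da6 | grep -i 'firmware'
# expect something like: Firmware Version: DN01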


 

Odub

Dabbler
Joined
Feb 2, 2017
Messages
15
Yes, I had the drive model wrong in the original post; sorry.
 