Pool Degraded Randomly

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
I am running FreeNAS-11.3-U4.1 and unfortunately, for some time now, I have been getting random ZFS errors. Out of nowhere, one of the devices in my pool becomes faulted.

Code:
root@freenas:~ # zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: resilvered 1.37G in 0 days 00:00:34 with 0 errors on Sun Oct 18 18:45:47 2020
config:

    NAME                                                STATE     READ WRITE CKSUM
    tank                                                DEGRADED     0     0     0
      raidz3-0                                          DEGRADED     0     0     0
        gptid/1f9588fd-181d-11ea-8a68-b9d2674e3063.eli  ONLINE       0     0     0
        gptid/14731eb5-aa63-11e7-ad99-0d581efb06db.eli  ONLINE       0     0     0
        gptid/946e300f-165a-11ea-8a68-b9d2674e3063.eli  ONLINE       0     0     0
        gptid/c3a7a5f9-a99b-11e7-a252-115519ca0956.eli  ONLINE       0     0     0
        gptid/08f95f17-3cf4-11e8-9c0e-1159604507fe.eli  ONLINE       0     0     0
        gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli  FAULTED      6   274     0  too many errors
        gptid/8a84d43a-3e38-11e8-9c0e-1159604507fe.eli  ONLINE       0     0     0
        gptid/c74ea2cb-a99b-11e7-a252-115519ca0956      ONLINE       0     0     0
    cache
      gptid/22609ef9-a583-11ea-90b4-fbdc11792c3d.eli    ONLINE       0     0     0

errors: No known data errors


After a reboot, the pool is online again and seems to behave normally - until the error reappears a while later. The faulted device is a different one each time.

I am running FreeNAS on

* Proxmox 6.2 Hypervisor
* Intel Xeon E3-1245v6 (CPU set to HOST in Proxmox)
* 64 GB Samsung ECC RAM (32 GB for FreeNAS)
* LSI SAS 9207-8i HBA (fully passed through to FreeNAS)

I am aware that running FreeNAS virtualized is discouraged, but since I am fully passing through my controller, I hope it should be fine - and it was for over two years.

What have I already done:

* Reseat all cables
* Try out a different set of cables
* Reseat the HBA
* Update Proxmox and FreeNAS
* Run SMART tests on all drives (scheduled SMART tests are in place, of course), and all drives seem to be fine
What can I do to find out where the issues are coming from? Is there any way of narrowing it down to a specific component or issue?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
I scheduled SMART tests of course, and all drives seem to be fine
How are you assessing that?

Can you run smartctl -a for the drive that is faulted right after it happens?

What do you see in dmesg right after the faulted drive is reported?

Also, why is the last drive in the pool shown without the .eli extension? If it's an encrypted pool, I would expect all members to have it.
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
I am assessing it by running smartctl -a and checking the output. Here is the output after the last failure:

Code:
root@freenas:~ # smartctl -a /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST10000VN0004-1ZD101
Serial Number:    ZA210V1Q
LU WWN Device Id: 5 000c50 0a1f28bfe
Firmware Version: SC60
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 19 14:56:07 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  575) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 932) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   044    Pre-fail  Always       -       112961745
  3 Spin_Up_Time            0x0003   088   086   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       226
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   096   060   045    Pre-fail  Always       -       4014951376
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       31221 (69 93 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       218
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   093   093   000    Old_age   Always       -       7
190 Airflow_Temperature_Cel 0x0022   065   058   040    Old_age   Always       -       35 (Min/Max 31/35)
191 G-Sense_Error_Rate      0x0032   097   097   000    Old_age   Always       -       6070
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       93
193 Load_Cycle_Count        0x0032   049   049   000    Old_age   Always       -       102371
194 Temperature_Celsius     0x0022   035   042   000    Old_age   Always       -       35 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   015   002   000    Old_age   Always       -       112961745
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       15280 (1 187 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       94739342927
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1308789890833

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     31208         -
# 2  Short offline       Completed without error       00%     31040         -
# 3  Extended offline    Completed without error       00%     30962         -
# 4  Short offline       Completed without error       00%     22596         -
# 5  Extended offline    Interrupted (host reset)      20%     22513         -
# 6  Short offline       Completed without error       00%     22428         -
# 7  Short offline       Completed without error       00%     22260         -
# 8  Extended offline    Completed without error       00%     22179         -
# 9  Short offline       Completed without error       00%     22092         -
#10  Short offline       Completed without error       00%     21876         -
#11  Extended offline    Completed without error       00%     21795         -
#12  Short offline       Completed without error       00%     21708         -
#13  Short offline       Completed without error       00%     21540         -
#14  Extended offline    Completed without error       00%     21458         -
#15  Short offline       Completed without error       00%     21371         -
#16  Short offline       Completed without error       00%     21131         -
#17  Extended offline    Completed without error       00%     21050         -
#18  Short offline       Completed without error       00%     20963         -
#19  Short offline       Completed without error       00%     20795         -
#20  Extended offline    Completed without error       00%     20714         -
#21  Short offline       Completed without error       00%     20627         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I'll run dmesg the next time the issue rears its head.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
That drive is showing 7 "High_Fly_Writes"

Although that translates to something not too concerning (for data integrity) according to the wiki, I would generally not be happy to see those appearing.


I would guess physical interference is more likely than magnetic... unless you live next to a power station or something... so could it be a vibration or bumping issue?
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
I do not live near a power station or something like that. I would assume that those are probably related to vibrations. I run 8 HDDs in the same case, and that is the limit that Seagate supports for the drive model afaik.

Do you think this could really cause the issues that I am seeing? I will write down all the "High_Fly_Write" counts of my drives to see if they increase when the issue appears.
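For the record, here is roughly how I plan to record the counts - a rough sketch only, assuming the pool disks are da0 through da7 and that the log path /root/hfw.log is an arbitrary choice of mine:

```shell
#!/bin/sh
# Sketch: append the raw High_Fly_Writes value (SMART attribute 189)
# for every disk to a log, so counts can be compared after the next fault.
# Assumptions: disks are named da0..da7; log path /root/hfw.log is writable.

hfw_from_smart() {
    # Print the raw value (last column) of attribute 189 from `smartctl -A` output.
    awk '$1 == "189" { print $NF }'
}

if command -v smartctl >/dev/null 2>&1; then
    for d in da0 da1 da2 da3 da4 da5 da6 da7; do
        v=$(smartctl -A "/dev/$d" | hfw_from_smart)
        printf '%s %s %s\n' "$(date '+%F %T')" "$d" "$v" >> /root/hfw.log
    done
fi
```

Running this from cron every few hours should make it easy to see whether the counts jump when a fault occurs.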
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Do you think this could really cause the issues that I am seeing?
At least for now it's the best idea I have to investigate further...

Watching those values would be the sensible next step, to see if you can correlate them with the faulted disks.

I don't know how SMART is recording them (in real-time or only with a test), so you may want to check that too.

You may also want to test/check to see how stable your case is... is it on the end of a floorboard that moves when you step around in another part of the room... things like that.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Just to add that I've gone through an entire array's worth of disks in the last couple of years (job lot of ex-datacentre drives), and most of them went the way you have noted. There may be CAM errors in your logs. I've put up threads asking about it before - a drive will offline itself, but be fine on a hot-replug and resilver, nothing showing in SMART results. Puzzling.

Until a day or week or two later, when it'll do it again. Eventually it gets a real fault and fails to spin up or goes into the self-test repeated click of doom.

So it certainly can be the sign of a dying disk, and SMART isn't smart enough.
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
The issue happened again, and I ran dmesg.

Code:
(da6:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 415 Aborting command 0xfffffe000100d0b0
mps0: Sending reset from mpssas_send_abort for target ID 7
mps0: Unfreezing devq for target ID 7
(da6:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:7:0): CAM status: Command timeout
(da6:mps0:0:7:0): Retrying command
(da6:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:7:0): CAM status: SCSI Status Error
(da6:mps0:0:7:0): SCSI status: Check Condition
(da6:mps0:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:7:0): Error 6, Retries exhausted
(da6:mps0:0:7:0): Invalidating pack
    (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 222 Aborting command 0xfffffe0000ffd360
mps0: Sending reset from mpssas_send_abort for target ID 1
    (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 03 2b 31 86 a0 00 00 00 08 00 00 length 4096 SMID 608 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 1
(da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 03 2b 31 86 a0 00 00 00 08 00 00
(da1:mps0:0:1:0): CAM status: CCB request completed with an error
(da1:mps0:0:1:0): Retrying command
(da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da1:mps0:0:1:0): CAM status: Command timeout
(da1:mps0:0:1:0): Retrying command
(da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 03 2b 31 86 a0 00 00 00 08 00 00
(da1:mps0:0:1:0): CAM status: SCSI Status Error
(da1:mps0:0:1:0): SCSI status: Check Condition
(da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:1:0): Retrying command (per sense data)
(da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 00 40 01 d0 00 00 08 00
(da1:mps0:0:1:0): CAM status: SCSI Status Error
(da1:mps0:0:1:0): SCSI status: Check Condition
(da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:1:0): Retrying command (per sense data)
    (da4:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 763 Aborting command 0xfffffe0001029970
mps0: Sending reset from mpssas_send_abort for target ID 5
mps0: Unfreezing devq for target ID 5
(da4:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:mps0:0:5:0): CAM status: Command timeout
(da4:mps0:0:5:0): Retrying command
(da4:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:mps0:0:5:0): CAM status: SCSI Status Error
(da4:mps0:0:5:0): SCSI status: Check Condition
(da4:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da4:mps0:0:5:0): Error 6, Retries exhausted
(da4:mps0:0:5:0): Invalidating pack
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=5229600317440, length=4096)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085637963776, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638000640, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638062080, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638033408, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638094848, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638123520, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638152192, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638180864, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638209536, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638238208, length=24576)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=9998683086848, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=9998683348992, length=8192)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638840320, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638811648, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638868992, length=24576)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7917070372864, length=4096)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638262784, length=544768)]
GEOM_ELI: g_eli_write_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[WRITE(offset=7085638893568, length=143360)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=9998683086848, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) gptid/c577a1de-a99b-11e7-a252-115519ca0956.eli[READ(offset=9998683348992, length=8192)]


I also checked "High_Fly_Writes" again - and the values did not change at all.

As the error basically happens on a random disk, I still think it might not be the disks themselves. I find it hard to believe that all the disks would start failing at the same time. Could this be a power supply issue or my HBA going bad?
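To see whether the resets cluster on one disk or spread across the bus, the CAM timeouts in the dmesg output can be tallied per device - a sketch, with the match string taken from the log lines above:

```shell
#!/bin/sh
# Sketch: count "Command timeout" CAM events per da-device in a dmesg dump.
# Lines look like: (da6:mps0:0:7:0): CAM status: Command timeout

cam_timeouts_per_disk() {
    # Split on '(' or ':' so the device name (e.g. da6) lands in field 2.
    awk -F'[(:]' '/CAM status: Command timeout/ { count[$2]++ }
                  END { for (d in count) print d, count[d] }'
}

if command -v dmesg >/dev/null 2>&1; then
    dmesg | cam_timeouts_per_disk
fi
```

If multiple disks rack up timeouts in the same incident (as da1, da4 and da6 did above), that points away from any single drive.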
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Could this be a power supply issue or my HBA going bad?
We certainly haven't eliminated the possibility.

Do all the disks show High_Fly_Writes?

Dmesg is certainly showing SCSI resets, so the controller may be struggling. Do you have an option to test with another controller?
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
Yes, all the disks have a few "High_Fly_Writes". I looked online for a while, and it seems that having a few of those is pretty normal, at least as long as they do not show up in batches. Also, with the recent error the "High_Fly_Writes" did not change for any disk.

I am trying to get my hands on a different controller (LSI SAS9300-8i) to verify this, but I guess it will take a day or two to get it.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Do you have enough SATA ports on the Mobo to try them?

A side note... you haven't shared any hardware details.
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
No, unfortunately I do not.

I ordered a different HBA that should arrive in a day or two.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
OK, good luck.
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
So, the HBA arrived and I immediately put it into the system. I fully removed the old HBA. Unfortunately, after around a day, the error was back. I guess we can rule out the HBA as the cause of the problem? What should I try next?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
the error was back
Do you mean the same GEOM_ELI errors as posted above?

Is it still a different disk every time?

The next area of investigation might need to be on the method of connection/backplane (again, you haven't shared hardware details such as the chassis, so I don't have a lot to go on)
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
The errors are the same as posted above.

There is no backplane. The drives are connected to the HBA by SAS-to-4x-SATA breakout cables, and I already checked the connections and tried a different set of cables.

Yes, it still seems to be random which drive is affected.
 
Joined
Jan 18, 2017
Messages
524
If you search the forum for ST10000VN0004-1ZD101 there should be two threads about CAM control errors. It was a while ago, but I believe one user received a firmware update for the drives from Seagate.
 

naglerrr

Dabbler
Joined
Apr 28, 2017
Messages
13
Thanks! This is a great lead! I read several threads based on your hint, and found the new firmware you also mentioned. I will try to update my drives one-by-one, and run a scrub in between. Hopefully, this solves the issue for good.
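To keep track of which drives still run the old firmware, I will check the reported revision before and after each flash - a quick sketch, again assuming device names da0 through da7:

```shell
#!/bin/sh
# Sketch: print the firmware revision of each disk, to verify which
# drives have been updated and which still run the old firmware (SC60).

fw_from_smart() {
    # Extract the value after "Firmware Version:" from `smartctl -i` output.
    awk -F': *' '/^Firmware Version/ { print $2 }'
}

if command -v smartctl >/dev/null 2>&1; then
    for d in da0 da1 da2 da3 da4 da5 da6 da7; do
        printf '%s %s\n' "$d" "$(smartctl -i "/dev/$d" | fw_from_smart)"
    done
fi
```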
 
Joined
Jan 18, 2017
Messages
524
Be sure to update us on whether or not it works out.
 