Storage Pool keeps going into degraded state

Qowy · Jun 12, 2023

Hi,
we have a TrueNAS-13.0-U4 system on Supermicro hardware with
an AMD EPYC 7232P, 128GB ECC RAM and
12 TOSHIBA MG07ACA1 SATA HDDs in 3 RAIDZ1 vdevs.

It is mainly used as temporary storage for not really important data (mainly images and stuff that can be recreated, additionally there is a backup) which is why I am still somewhat relaxed about this.

However: Random disks keep entering the FAULTED state with the message "too many errors".

Code:

root@storage[~]# zpool status
  pool: Data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:22:21 with 0 errors on Sun Jun  4 00:22:22 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/4a1e4ddd-427e-11ea-92ae-3cecef40151a  FAULTED      6    36     0  too many errors
            gptid/4b57c150-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/4c8e853b-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/4dbcd280-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            gptid/4eeafc3d-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/1cde9ff1-a091-11ed-a30f-3cecef40151a  ONLINE       0     0     0
            gptid/516d8ec2-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/52aef0ca-427e-11ea-92ae-3cecef40151a  FAULTED      3     7     0  too many errors
          raidz1-2                                      ONLINE       0     0     0
            gptid/53e8d7a9-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/55257da6-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/5659ddc4-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
            gptid/5796471c-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0

errors: No known data errors
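
For anyone who wants to pull the faulted members out of output like this programmatically, here is a minimal Python sketch. It is only an illustration: the function name `faulted_devices` and the regular expression are my own assumptions, the sample rows are copied from the output above, and the pattern only handles plain integer error counters (ZFS abbreviates large counters, which this sketch would skip).

```python
import re

# A few rows copied from the `zpool status` output in this post.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Data                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/4a1e4ddd-427e-11ea-92ae-3cecef40151a  FAULTED      6    36     0  too many errors
            gptid/4b57c150-427e-11ea-92ae-3cecef40151a  ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            gptid/52aef0ca-427e-11ea-92ae-3cecef40151a  FAULTED      3     7     0  too many errors
"""

# name, state, then the READ / WRITE / CKSUM counters (plain integers only).
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def faulted_devices(status_text):
    """Return (name, read, write, cksum) for every row whose state is FAULTED."""
    found = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) == "FAULTED":
            found.append(
                (m.group(1), int(m.group(3)), int(m.group(4)), int(m.group(5)))
            )
    return found


for name, rd, wr, ck in faulted_devices(SAMPLE):
    print(f"{name}: read={rd} write={wr} cksum={ck}")
```

Logging this after each incident would show whether the same members keep faulting or whether it really is spread across all disks.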

The disks themselves look fine. SMART does not detect anything, and even the long SMART tests that I ran in the beginning came back clean.

Code:

root@storage[~]# smartctl -a /dev/da0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    89K0A0X5F94G
LU WWN Device Id: 5 000039 998cb5c41
Firmware Version: 0101
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jun 12 09:00:03 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1392) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7795
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       29558
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       30 (Min/Max 18/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       1835008
222 Loaded_Hours            0x0032   027   027   000    Old_age   Always       -       29541
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       593
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

By now I go in about once every two weeks and do the following on the faulted disk:

Code:

root@storage[~]# zpool offline Data gptid/52aef0ca-427e-11ea-92ae-3cecef40151a
root@storage[~]# zpool clear Data gptid/52aef0ca-427e-11ea-92ae-3cecef40151a
root@storage[~]# zpool online Data gptid/52aef0ca-427e-11ea-92ae-3cecef40151a

Then it resilvers for about 5 seconds (as I said it is mostly lightly used) and everything is fine again for a few days.

Any ideas what might be causing this behavior? Is there some system bottleneck that keeps data from reaching the disks during load peaks?
Is there any logging that can tell me when these write/read errors occurred?

sretalla · Jun 12, 2023

You haven't given the full details of your hardware.

How are the drives connected to the system? (which model of HBA? or is it the onboard SATA ports?)

The one disk you shared smartctl output from has never recorded a self-test in roughly 3.5 years of power-on time... have you got scheduled tests set up?

I suspect that you may find that the disks involved are either actually having problems (we may need to see output from smartctl on the correct disk(s)) or the controller is somehow not a good one.

sretalla · Jun 12, 2023

Qowy said:
Is there any logging that can tell me when these write/read errors occurred?

dmesg

You would see CAM STATUS messages when reads/writes fail.
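
If the log is long, a rough tally per device can help spot whether one disk or all of them are affected. The following is a minimal Python sketch under some assumptions: the function name `cam_status_counts` is hypothetical, the sample lines merely illustrate the format the FreeBSD CAM layer prints for a `da` device behind an `mpr` controller, and the text would in practice come from saved `dmesg` output. Plain `dmesg` output carries no wall-clock time; the same kernel messages typically also end up in /var/log/messages with syslog timestamps, which is why the pattern searches anywhere in the line rather than only at its start.

```python
import re
from collections import Counter

# Illustrative sample lines in the format the kernel prints for CAM errors.
SAMPLE = """\
(da10:mpr0:0:18:0): CAM status: Command timeout
(da10:mpr0:0:18:0): Retrying command, 3 more tries remain
(da10:mpr0:0:18:0): CAM status: CCB request completed with an error
(da0:mpr0:0:8:0): CAM status: Command timeout
(da0:mpr0:0:8:0): CAM status: SCSI Status Error
"""

# (device:controller:bus:target:lun): CAM status: <text>
CAM_LINE = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): CAM status: (.+)$")


def cam_status_counts(log_text):
    """Count CAM status messages, keyed by (device, status text)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CAM_LINE.search(line)
        if m:
            counts[(m.group(1), m.group(2).strip())] += 1
    return counts


for (device, status), n in sorted(cam_status_counts(SAMPLE).items()):
    print(f"{device}: {status} x{n}")
```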

Qowy · Jun 12, 2023

As I said, it is all the disks, seemingly at random. Here is the smartctl output for one where I still bothered with SMART tests.

Code:

root@storage[~]# smartctl -a /dev/da5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    89K0A0W3F94G
LU WWN Device Id: 5 000039 998cb5bd3
Firmware Version: 0101
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jun 12 10:38:26 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1370) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7929
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       29560
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       33 (Min/Max 18/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       1966080
222 Loaded_Hours            0x0032   027   027   000    Old_age   Always       -       29543
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       536
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     26109         -
# 2  Short offline       Completed without error       00%     26088         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It is a Supermicro SuperChassis 836BE1C-R1K03B with an H11SSL-i board.
The backplane is a BPN-SAS3-836EL1: https://www.supermicro.com/manuals/other/BPN-SAS3-836EL.pdf
It supplies 16x SAS3/SATA3 from a Broadcom 9300-4i connected with 4x SAS3.
Currently there are 12x SATA3 disks connected.

Qowy · Jun 12, 2023

Qowy said:
It supplies 16x SAS3/SATA3 from a Broadcom 9300-4i connected with 4x SAS3.

I am actually going to double-check that on the real hardware. While this is what my documentation tells me, there is no Broadcom HBA on my invoice, so I will make sure this is correct.

The most relevant dmesg output seems to be this:

Code:

        (da10:mpr0:0:18:0): WRITE(16). CDB: 8a 00 00 00 00 04 e3 0c ec c0 00 00 00 08 00 00 length 4096 SMID 1585 Command timeout on target 18(0x0010), 60000 set, 60.36479848 elapsed
mpr0: At enclosure level 0, slot 10, connector name (    )
mpr0: Sending abort to target 18 for SMID 1585
        (da10:mpr0:0:18:0): WRITE(16). CDB: 8a 00 00 00 00 04 e3 0c ec c0 00 00 00 08 00 00 length 4096 SMID 1585 Aborting command 0xfffffe02021c17f8
        (da10:mpr0:0:18:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 989 Command timeout on target 18(0x0010), 60000 set, 60.196108183 elapsed
mpr0: At enclosure level 0, slot 10, connector name (    )
mpr0: Controller reported scsi ioc terminated tgt 18 SMID 989 loginfo 31130000
(da10:mpr0:0:18:0): WRITE(16). CDB: 8a 00 00 00 00 04 e3 0c ec c0 00 00 00 08 00 00
mpr0: Finished abort recovery for target 18
(da10:mpr0:0:18:0): CAM status: Command timeout
(da10:mpr0:0:18:0): Retrying command, 3 more tries remain
(da10:mpr0:0:18:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:mpr0:0:18:0): CAM status: CCB request completed with an error
(da10:mpr0:0:18:0): Retrying command, 0 more tries remain
(da10:mpr0:0:18:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:mpr0:0:18:0): CAM status: SCSI Status Error
(da10:mpr0:0:18:0): SCSI status: Check Condition
(da10:mpr0:0:18:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da10:mpr0:0:18:0): Error 6, Retries exhausted
(da10:mpr0:0:18:0): Invalidating pack
        (da0:mpr0:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 633 Command timeout on target 8(0x0006), 60000 set, 60.20493298 elapsed
mpr0: At enclosure level 0, slot 0, connector name (    )
mpr0: Sending abort to target 8 for SMID 633
        (da0:mpr0:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 633 Aborting command 0xfffffe020216a1b8
        (da0:mpr0:0:8:0): WRITE(10). CDB: 2a 00 d3 9b a1 b8 00 00 08 00 length 4096 SMID 863 Command timeout on target 8(0x0006), 60000 set, 60.78730763 elapsed
mpr0: At enclosure level 0, slot 0, connector name (    )
mpr0: Controller reported scsi ioc terminated tgt 8 SMID 863 loginfo 31130000
mpr0: Finished abort recovery for target 8
(da0:mpr0:0:8:0): WRITE(10). CDB: 2a 00 d3 9b a1 b8 00 00 08 00
(da0:mpr0:0:8:0): CAM status: CCB request completed with an error
(da0:mpr0:0:8:0): Retrying command, 3 more tries remain
(da0:mpr0:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da0:mpr0:0:8:0): CAM status: Command timeout
(da0:mpr0:0:8:0): Retrying command, 0 more tries remain
(da0:mpr0:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da0:mpr0:0:8:0): CAM status: SCSI Status Error
(da0:mpr0:0:8:0): SCSI status: Check Condition
(da0:mpr0:0:8:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpr0:0:8:0): Error 6, Retries exhausted
(da0:mpr0:0:8:0): Invalidating pack
        (da7:mpr0:0:15:0): WRITE(16). CDB: 8a 00 00 00 00 04 f3 dc 5b a0 00 00 00 08 00 00 length 4096 SMID 1385 Command timeout on target 15(0x000d), 60000 set, 60.258719397 elapsed
mpr0: At enclosure level 0, slot 7, connector name (    )
mpr0: Sending abort to target 15 for SMID 1385
        (da7:mpr0:0:15:0): WRITE(16). CDB: 8a 00 00 00 00 04 f3 dc 5b a0 00 00 00 08 00 00 length 4096 SMID 1385 Aborting command 0xfffffe02021af238
        (da7:mpr0:0:15:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1320 Command timeout on target 15(0x000d), 60000 set, 60.418112088 elapsed
mpr0: At enclosure level 0, slot 7, connector name (    )
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1320 loginfo 31130000
(da7:mpr0:0:15:0): WRITE(16). CDB: 8a 00 00 00 00 04 f3 dc 5b a0 00 00 00 08 00 00
mpr0: Finished abort recovery for target 15
(da7:mpr0:0:15:0): CAM status: Command timeout
(da7:mpr0:0:15:0): Retrying command, 3 more tries remain
(da7:mpr0:0:15:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:mpr0:0:15:0): CAM status: CCB request completed with an error
(da7:mpr0:0:15:0): Retrying command, 0 more tries remain
(da7:mpr0:0:15:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:mpr0:0:15:0): CAM status: SCSI Status Error
(da7:mpr0:0:15:0): SCSI status: Check Condition
(da7:mpr0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da7:mpr0:0:15:0): Error 6, Retries exhausted
(da7:mpr0:0:15:0): Invalidating pack

Qowy · Jun 12, 2023

Qowy said:
I am actually going to double-check that on the real hardware. While this is what my documentation tells me, there is no Broadcom HBA on my invoice, so I will make sure this is correct.

I checked, and there definitely is a Broadcom 9300-4i HBA connected to the backplane with a Mini-SAS-HD cable. Btw, is there an edit function?

sretalla · Jun 12, 2023

OK, it may be interesting to know the firmware level of the card.

sas3flash -list

Depending on what version you have, it may be that you will benefit from the fixes in the private .12 release:

LSI 9300-xx Firmware Update

Hey Community, If you are using an LSI 9300 HBA with FreeNAS or the soon-to-be TrueNAS CORE, you may experience some performance issues causing the controller to reset when using SATA HDDs. After working with Broadcom, we’ve come up with a...

www.truenas.com

Qowy said:
Btw is there an edit function?

Only after you go over the minimum number of posts needed to count as a real forum user... shouldn't be too many more.

Qowy · Jun 12, 2023

sretalla said:
OK, it may be interesting to know the firmware level of the card.

sas3flash -list

Depending on what version you have, it may be that you will benefit from the fixes in the private .12 release:

Code:

root@storage[~]# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3004(C0)

        Controller Number              : 0
        Controller                     : SAS3004(C0)
        PCI Address                    : 00:83:00:00
        SAS Address                    : 500605b-0-0dee-1350
        NVDATA Version (Default)       : 05.00.00.05
        NVDATA Version (Persistent)    : 05.00.00.05
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 05.00.00.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9300-4i
        BIOS Version                   : 08.11.00.00
        UEFI BSD Version               : 06.00.00.00
        FCODE Version                  : N/A
        Board Name                     : SAS9300-4i
        Board Assembly                 : H3-25473-00G
        Board Tracer Number            : SP83318728

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

So I would say 05.00 is definitely below 16 :D
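
The same check can be done mechanically. This is a minimal Python sketch, not part of any Broadcom tooling: `firmware_version` and `parse_version` are hypothetical helper names, the sample is trimmed from the `sas3flash -list` output above, and `TARGET` is only a stand-in for "16", since the exact version string of the fixed release is not quoted in this thread.

```python
# Lines trimmed from the `sas3flash -list` output in this post.
SAMPLE = """\
        Controller                     : SAS3004(C0)
        PCI Address                    : 00:83:00:00
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 05.00.00.00
        BIOS Version                   : 08.11.00.00
"""

# Stand-in threshold (assumption): the thread only says "16" / ".12".
TARGET = "16.00.00.00"


def firmware_version(sas3flash_output):
    """Return the value of the 'Firmware Version' line, or None if absent."""
    for line in sas3flash_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware Version":
            return value.strip()
    return None


def parse_version(text):
    """Split a dotted version such as '05.00.00.00' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


current = firmware_version(SAMPLE)
if current is not None and parse_version(current) < parse_version(TARGET):
    print(f"firmware {current} is older than {TARGET}")
```

Comparing integer tuples instead of raw strings avoids surprises when one field has more digits than the other.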

Should I first try to update via "official" releases or directly apply the described method?

sretalla · Jun 12, 2023

I see references to those drives as "near-line" storage in the promotional material... I wonder if they are aggressively spinning down and taking too long to spin back up, causing timeouts.

sretalla · Jun 12, 2023

Qowy said:
Should I first try to update via "official" releases or directly apply the described method?

The private version in the thread is an official version from Broadcom, just not made available on their website directly.

I think it would be safe to use that.

Make sure if you're planning to do that in TrueNAS, that you first export the pool.

Otherwise, the option is to boot from a Linux live install and flash it from there.

Qowy · Jun 12, 2023

sretalla said:
Make sure if you're planning to do that in TrueNAS, that you first export the pool.

Otherwise, the option is to boot from a Linux live install and flash it from there.

To make sure that the disks are not actively in use while the flash occurs? I guess a live image would be the most painless method.

I will try this firmware, monitor whether any errors occur again over the next weeks, and try to report back.

I just checked our other storage servers and they use different HBAs (Broadcom 3008) and different disks, so for now I cannot draw any conclusions about whether these disks idle too aggressively. However, the server was put together (without OS) by a company with quite a big business, so I would guess they know their storage disks and how they perform, but you never know.
Anyway, those other servers have developed a different problem after 3 years, but I will have to monitor that, and it is definitely a different topic that I might open a separate thread for.

Thanks for the help so far!

sretalla · Jun 12, 2023

Qowy said:
To make sure that the disks are not actively in use while the flash occurs?

Exactly.

Either way is "fine", but as you say, temporarily booted to a different OS is possibly "simpler" (make sure not to do anything to the disks while in that OS though).

sretalla · Jun 12, 2023

May or may not prove to be true, but the .12 private fix specifically addresses something that looks like what you're seeing.

Qowy · Jun 28, 2023

So far so good. Flashing the firmware seems to have solved the issue. At least it hasn't popped up again since then.

Thanks.
