Intermittent WD 8TB drive issue - Pool Degraded/Healthy

atlantic · Feb 19, 2021

I'm in the process of upgrading my pools to a single 40TB Z2 Vdev consisting of 7 shucked 8TB WD drives.

However one of these 8TB drives has started acting up.

The other week its pool could not be found - I reseated the drive (in an MD1000) and rebooted and it was found fine. No other problems until today when I got this alert:

New alerts:
* Pool Arctic state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

SMART status is OK, and the drive recently passed an extended test. However, the disk did not display its serial number or other info in the Storage/Disks GUI.
TrueNAS reported some numbers in the pool status for that disk (see grab), but since I rebooted those are cleared and the pool is reported as Healthy.

I've just received these SCSI errors though.

The drive is in a different MD1000 slot than before, so I'm confident it's not the slot.

Is the drive going bad? A power issue?

Here's the SMART report:

Code:

=== START OF INFORMATION SECTION ===
Model Family:     WDC HGST Ultrastar He10
Device Model:     WDC WD80EMAZ-00WJTA0
Serial Number:    ---------
LU WWN Device Id: 5 000cca 27dc47f6b
Firmware Version: 83.H0A83
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Feb 19 12:50:42 2021 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   93) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1067) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   127   127   054    Old_age   Offline      -       120
  3 Spin_Up_Time            0x0007   145   145   024    Pre-fail  Always       -       443 (Average 464)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       165
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2979
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       165
22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       331
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       331
194 Temperature_Celsius     0x0002   004   004   000    Old_age   Always       -       25 (Min/Max 2/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2976         -
# 2  Short offline       Completed without error       00%      2955         -
# 3  Short offline       Completed without error       00%      2933         -
# 4  Short offline       Completed without error       00%      2847         -
# 5  Extended offline    Completed without error       00%      2795         -
# 6  Extended offline    Interrupted (host reset)      80%      2677         -
# 7  Extended offline    Interrupted (host reset)      90%      2673         -
# 8  Extended offline    Interrupted (host reset)      90%      2670         -
# 9  Extended offline    Interrupted (host reset)      70%      2670         -
#10  Extended offline    Interrupted (host reset)      90%      2666         -
#11  Extended offline    Interrupted (host reset)      70%      2665         -
#12  Extended offline    Completed without error       00%      2630         -
#13  Extended offline    Completed without error       00%      2608         -
#14  Extended offline    Completed without error       00%      2565         -
#15  Extended offline    Completed without error       00%      2521         -
#16  Extended offline    Completed without error       00%      2446         -
#17  Extended offline    Completed without error       00%      2412         -
#18  Extended offline    Completed without error       00%      2328         -
#19  Extended offline    Completed without error       00%      2290         -
#20  Extended offline    Completed without error       00%      2245         -
#21  Extended offline    Completed without error       00%      2194         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

(Edited for a bit less waffle)
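For readers who want to check a report like the one above without reading every row, here is a minimal Python sketch that pulls the raw values of a few attributes out of the smartctl attribute table. The list of watched attributes is an editorial choice (not an official one), the function names are hypothetical, and the parser assumes the column layout shown in this post.

```python
import re

# Attributes often checked for media or link trouble (an assumption, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def watched_raw_values(smart_text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if m and m.group("name") in WATCHED:
            found[m.group("name")] = int(m.group("raw"))
    return found


def nonzero(found):
    """Return the sorted names of watched attributes whose raw value is not 0."""
    return sorted(name for name, raw in found.items() if raw != 0)


# Rows copied from the report in this post.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2979
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    values = watched_raw_values(SAMPLE)
    print(values)
    print("non-zero:", nonzero(values))
```

On the rows from this report every watched raw value is 0, which is consistent with SMART showing no media problems on the drive at this point.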

Yorick · Feb 19, 2021

atlantic said:
A power issue?

I'm assuming there's no 3.3V going to the slot at all, otherwise none of these drives would spin up. If you worked around the 3.3V with kapton tape on the drive connector, maybe make sure it's seated well and the drive can't receive a "stray" 3.3V at any time.

Since the issue follows the drive when moving it to a different slot, it's likely to be the drive, although SMART shows no errors.

You can try to clean contacts on the drive. Other than that I suppose it'd be to replace it. Warranty is likely not a thing, since it's been shucked.

atlantic · Feb 19, 2021

Yorick said:
I'm assuming there's no 3.3V going to the slot at all, otherwise none of these drives would spin up. If you worked around the 3.3V with kapton tape on the drive connector, maybe make sure it's seated well and the drive can't receive a "stray" 3.3V at any time.

Since the issue follows the drive when moving it to a different slot, it's likely to be the drive, although SMART shows no errors.

You can try to clean contacts on the drive. Other than that I suppose it'd be to replace it. Warranty is likely not a thing, since it's been shucked.

Yes, no tape was needed to get them to spin up thankfully. The MD1000 has been solid.

I've removed the drive, destroyed its pool and run WD Dashboard on it from my PC, which said it was absolutely fine. It's now on an extended SMART test again. I've read that WD will still honour the warranty. I have the external case for it anyway so it shouldn't be an issue if I RMA, but it'll take a few weeks, which I want to avoid.

The only other thing I can think of is that I didn't use an interposer card on that drive (nor on some of the others). The cable to the server is over 2m long, which I've learned is not good for SATA drives, and I should be using interposer cards on all the SATA drives in the MD1000. Could that be the reason?

Those SCSI errors are a worry... could it be the HBA?

atlantic · Feb 20, 2021

Yorick said:
...and the drive can't receive a "stray" 3.3V at any time.

I may have underappreciated this bit of info. I'm going to tape up all the WD drives just to rule it out.

Yorick · Feb 20, 2021

I doubt 3.3V is the issue here, the drives wouldn't spin up at all.

I don't know the MD1000 so I'm not understanding the "no interposer card" comment. I'd have expected HBA in TrueNAS, then HBA Expander in MD1000, then drives. Is that not how it functions?

atlantic · Feb 20, 2021

Yorick said:
I doubt 3.3V is the issue here, the drives wouldn't spin up at all.

I don't know the MD1000 so I'm not understanding the "no interposer card" comment. I'd have expected HBA in TrueNAS, then HBA Expander in MD1000, then drives. Is that not how it functions?

Interposer cards go between SATA drives and the SAS backplane of the MD1000 (perc 710 in IT mode, SAS HBA in r720 > MD1000 controller card > Drives). From what I've gathered (from the post in the link below), SATA only supports <1m cables, while SAS can do up to 10m, and the cards allow SATA drives to work like SAS. I've got them all in now and created the pool, and I'm actually seeing higher transfer rates: an average of 150 MB/s compared to 100 MB/s without them.

Interposers - what are they for?

None of those SCSI errors have appeared yet either. I'm going to see what happens.

Yorick · Feb 20, 2021

Interesting. So what I'm taking from that is that the SAS expander doesn't refresh the signal in any way and it all just looks like one big cable. Hence the interposer in the tray, which does a SAS-to-SATA conversion.

Learn something every day.

atlantic · Feb 20, 2021

Yorick said:
Interesting. So what I'm taking from that is that the SAS expander doesn't refresh the signal in any way and it all just looks like one big cable. Hence the interposer in the tray, which does a SAS-to-SATA conversion.

Learn something every day.

And you get blinky lights on the MD1000 bays

atlantic · Feb 21, 2021

OK got an error this morning, same drive, da12, in yet another slot in the MD1000. However I don't see where those errors are listed in TrueNAS except in the alert. Pool status shows no errors (were they cleared?). Could it be a problem with the disk's circuit board rather than the platters perhaps?

zfs send has been in progress overnight with >11TB transferred to the pool and about 3TB left to go. I guess I should let it finish, then offline da12 and get it RMA'd. Is it OK to continue to use the pool in this degraded state? Should I export the pool while waiting for the new drive so I can still use the other pools?

Alert:
Device: /dev/da12 [SAT], ATA error count increased from 0 to 1.

Code:

pool: Caspian
state: ONLINE
  scan: scrub in progress since Sun Feb 21 00:00:04 2021
    8.76T scanned at 283M/s, 6.06T issued at 196M/s, 8.76T total
    0B repaired, 69.19% done, 04:01:12 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    Caspian                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/52e575c5-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/5330f985-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/5300a07a-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/534c5591-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/538c1b0b-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/53c3340f-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
        gptid/53bae92b-7372-11eb-949d-90b11c4b7a7b  ONLINE       0     0     0
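The READ, WRITE and CKSUM columns above are all 0, which is why the pool is still reported as healthy despite the SMART alert. As a minimal Python sketch (the function name is hypothetical, and it assumes the column layout shown above), the same check can be automated by scanning the config section for any row with a non-zero counter. The device name `gptid/a` and the counters in the second sample are made up for illustration.

```python
import re

# A counter cell is a plain number, optionally abbreviated (e.g. "1.2K").
COUNTER = re.compile(r"^\d+(\.\d+)?[KMGTP]?$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    bad = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # device rows follow the column header
            continue
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        if not all(COUNTER.match(c) for c in counters):
            continue  # not a device row (e.g. a trailing "errors:" line)
        if any(c != "0" for c in counters):
            bad[name] = tuple(counters)
    return bad


# Shortened, hypothetical device name; counters as in the status output above.
HEALTHY = """\
    NAME          STATE     READ WRITE CKSUM
    Caspian       ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        gptid/a   ONLINE       0     0     0
errors: No known data errors found here
"""

# Hypothetical output with made-up counters, to show the failing case.
SUSPECT = """\
    NAME          STATE     READ WRITE CKSUM
    Caspian       ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        gptid/a   ONLINE       3     0    12
"""

if __name__ == "__main__":
    print(devices_with_errors(HEALTHY))
    print(devices_with_errors(SUSPECT))
```

Counters are kept as strings because zpool may abbreviate large values, so the sketch only tests whether a cell differs from "0".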

Also this (can anyone shed light on what this means - is this a hardware fault in addition to the drive problem??):

And this is worrying (Information unit iuCRC error detected)

Edit: Smart error is reported in smartctl:

Code:

SMART Error Log Version: 1
ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 3016 hours (125 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 00 30 60 26 40 00      18:09:22.174  READ FPDMA QUEUED
  60 00 00 30 5f 26 40 00      18:09:22.172  READ FPDMA QUEUED
  60 00 00 30 5e 26 40 00      18:09:22.170  READ FPDMA QUEUED
  60 00 00 30 5d 26 40 00      18:09:22.167  READ FPDMA QUEUED
  60 00 08 30 5c 26 40 00      18:09:22.165  READ FPDMA QUEUED
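The "84 41" register pair in the error entry above can be decoded by hand. The minimal Python sketch below does so with a partial table of the ATA error and status register bits (only the commonly seen bits are listed, and the helper name is hypothetical). ICRC is an interface CRC error, which usually points at the link to the drive (cable, backplane, connector or interposer) rather than at the platters.

```python
# Partial bit tables for the ATA Error and Status registers.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error on the link
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error, see the Error register
}


def decode(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


if __name__ == "__main__":
    er, st = 0x84, 0x41  # ER and ST columns from the log entry above
    print("ER:", decode(er, ERROR_BITS))
    print("ST:", decode(st, STATUS_BITS))
```

Decoding 0x84 gives ICRC and ABRT, matching the "Error: ICRC, ABRT" text smartctl printed for this entry.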

atlantic · Feb 21, 2021

Did some digging and found a post with similar issues with the HBA resetting. It was a heat issue which he fixed by redoing the thermal paste on the card.

lsi-9201-16e-resetting-under-heavy-loads

I have a faulty fan on my r720 so it's running with 1 less fan installed. And, perhaps stupidly, I have my r720 set to operate with reduced fan speed because of the noise. I've run it with lower fan speed for a year with no problems though the dead fan is a bit more recent. I originally replenished the CPU paste in the 720 so I will also do so for all the other cards in the case (the perc, the HBA and the 10G nic).

Attached my logs, some other errors in there too.

atlantic · Feb 21, 2021

Things are getting worse... just had an ATA error on a DIFFERENT disk, da23. So my Z2 pool has two drives with errors. But FreeNAS is not reporting a problem, still says pool is Healthy.

Oh, and the zfs send recv has stalled at 12.2TiB... a scrub is in progress on the pool, so I will wait for it to complete, then shut down and have a think. The stall corresponded with the timestamp of one of the CAM errors, but previous ones did not stall the replication.

I think these SCSI errors must be related. Can they damage drives? I need to make sense of this situation. It could be the cable, the HBA or something else...

Any Help greatly appreciated!!

Ericloewe · Feb 21, 2021

Oh, you're running P20.00.04 firmware on your LSI HBA. You'll want to update to P20.00.07, which is the latest and last version released by LSI.
The early P20 series was fairly buggy, I can't promise it'll fix your issues, but it's an important step.

atlantic said:
But FreeNAS is not reporting a problem, still says pool is Healthy.

Which means that:

Your data is likely to be fine, regardless of what happens to the controller
The errors are recovered quickly enough not to bother ZFS too much.

So, my first recommendation is a deep breath. The cooling situation you've identified could be a problem, too, so definitely investigate it once the send is done.

Ericloewe · Feb 21, 2021

atlantic said:
From what I've gathered (from the post in the link below), SATA only supports <1m cables, while SAS can do up to 10m, and the cards allow SATA drives to work like SAS.

Yorick said:
Interesting. So what I'm taking from that is that the SAS expander doesn't refresh the signal in any way and it all just looks like one big cable. Hence the interposer in the tray, which does a SAS-to-SATA conversion.

Learn something every day.

That's not strictly correct. Interposers exist for two reasons (in order):

Extract money from enterprise customers
Allow for some half-assed dual porting of SATA drives, I guess

SAS controllers and expanders can natively speak SATA to interact with SATA disks. But that only applies to the last mile, or in this case, to the link with the SATA drive. All communications between a SAS controller and a SAS expander are SAS. The SATA stuff from the drive is encapsulated in SCSI commands and forwarded to the controller.
Electrically, this means that the link between SAS controller and expander is not subject to any SATA limitations, only SAS limitations.

Of course, some asterisks apply. LSI SAS3 expanders only support SATA 6Gb/s and SATA 3Gb/s, SATA support is probably an optional feature that some vendors can lock away because it's inconvenient to their business model, etc.

atlantic · Feb 21, 2021

Ericloewe said:
Oh, you're running P20.00.04 firmware on your LSI HBA. You'll want to update to P20.00.07, which is the latest and last version released by LSI.
The early P20 series was fairly buggy, I can't promise it'll fix your issues, but it's an important step.

Which means that:

Your data is likely to be fine, regardless of what happens to the controller

The errors are recovered quickly enough not to bother ZFS too much.

So, my first recommendation is a deep breath. The cooling situation you've identified could be a problem, too, so definitely investigate it once the send is done.

Thanks for the reassurance! The send has stalled and not recovered though. I'm assuming that after I shut down and address the cooling issues it will resume where it left off if I reissue the command... or not?

~~(There are multiple datasets so most will have completed OK - also an assumption.)~~
(EDIT: I was able to resume the dataset that had not completed coming across, but not before I deleted three of its snapshots on the destination. zfs helpfully told me which three and that they were the cause of the 'broken pipe'.)

I bought the perc H310 HBA ready flashed to IT mode. Since it's flashed to LSI is it relatively easy to update to 07 firmware?

Actually I'm a little confused about which board is the actual 'HBA' that's resetting. There's the H310 and also the PCIe SAS HBA board that takes the external cable to the MD1000... does the H310 in passthrough mode mean that the 'HBA' we're talking about is actually the PCIe card, or is that the 'expander'?

mpsutil output lists the H310 as mps0, but the errors are mps1.

Code:

root@freenas[~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: someguy
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.27.01.01
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0001        0009       N         6.0     1.5    6.0    SAS Initiator
2       0001        0009       N         6.0     1.5    6.0    SAS Initiator
3       0001        0009       N         6.0     1.5    6.0    SAS Initiator
4       0001        0009       N         6.0     1.5    6.0    SAS Initiator
5       0001        0009       N         6.0     1.5    6.0    SAS Initiator
6       0001        0009       N         6.0     1.5    6.0    SAS Initiator
7       0001        0009       N         6.0     1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
          500056b37789abff 0009    0001      SMP Target    6.0   0002 00    8
00   08   5000cca02e0cc449 000a    0009      SAS Target    6.0   0002 08    1
00   20   5000cca02e0cd6e9 000b    0009      SAS Target    6.0   0002 09    1
00   11   5000cca02e092d25 000c    0009      SAS Target    6.0   0002 01    1
00   12   5000cca02e0cd911 000d    0009      SAS Target    6.0   0002 00    1
00   13   5000cca02e0cc415 000e    0009      SAS Target    6.0   0002 02    1
00   14   5000cca02e0929d1 000f    0009      SAS Target    6.0   0002 03    1
00   15   5000cca02e092971 0010    0009      SAS Target    6.0   0002 04    1
00   16   5000cca02e092aa5 0011    0009      SAS Target    6.0   0002 05    1
00   17   5000cca02e093729 0012    0009      SAS Target    6.0   0002 06    1
00   18   5000cca02e094295 0013    0009      SAS Target    6.0   0002 07    1
00   19   500056b37789abfd 0014    0009      SEP Target    6.0   0002 20    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    5d4ae520b137dc00               0001     Direct Attached SGPIO
  38    500056b36789abff    0014       0002     External SES-2

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level
  26    500056b37789abff    0009       0001     0002       1

     Phy  RemotePhy  DevHandle  Speed   Min    Max    Device
     00     04         0001     6.0  1.5  6.0  SAS Initiator
     01     05         0001     6.0  1.5  6.0  SAS Initiator
     02     06         0001     6.0  1.5  6.0  SAS Initiator
     03     07         0001     6.0  1.5  6.0  SAS Initiator
     04     00         000a     6.0  1.5  6.0  SAS Target  
     05     00         000b     6.0  1.5  6.0  SAS Target  
     06                                1.5  6.0  No Device   
     07                                1.5  6.0  No Device   
     08                                1.5  6.0  No Device   
     09                                1.5  6.0  No Device   
     10                                1.5  6.0  No Device   
     11                                1.5  6.0  No Device   
     12     00         000c     6.0  1.5  6.0  SAS Target  
     13     00         000d     6.0  1.5  6.0  SAS Target  
     14     00         000e     6.0  1.5  6.0  SAS Target  
     15     00         000f     6.0  1.5  6.0  SAS Target  
     16     00         0010     6.0  1.5  6.0  SAS Target  
     17     00         0011     6.0  1.5  6.0  SAS Target  
     18     00         0012     6.0  1.5  6.0  SAS Target  
     19     00         0013     6.0  1.5  6.0  SAS Target  
     20     03         0001     6.0  1.5  6.0  SAS Initiator
     21     02         0001     6.0  1.5  6.0  SAS Initiator
     22     01         0001     6.0  1.5  6.0  SAS Initiator
     23     00         0001     6.0  1.5  6.0  SAS Initiator
     24     00         0014     6.0  6.0  6.0  SEP Target  
     25                                6.0  6.0  No Device

atlantic · Feb 21, 2021

I'm being dumb, there are two HBAs, the H310 is for the r720 internal drives, and the SAS HBA is for the MD1000, right?

I had a look in /data and saw a file 'hba_firmware_update.log' contents below.

So mps0 is the H310 and it's running 07 firmware, while the PCIe HBA card is mps1 and running 04. IDK what the other revision numbers refer to, but I did read that mismatched HBA firmware can cause problems, in addition to the 04 FW not being up to snuff. I guess I need to flash that card.

Code:

root@freenas[/data]# more hba_firmware_update.log
2021-02-14 15:26:31,251 Checking SAS92xx HBAs firmware
2021-02-14 15:26:35,211 0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:03:00:00
2021-02-14 15:26:35,211 Up to date firmware version 20
2021-02-14 15:26:35,211 1  SAS2008(B2)     20.00.04.00    14.01.00.07      No Image      00:42:00:00
2021-02-14 15:26:35,212 Up to date firmware version 20
2021-02-14 15:26:35,212
2021-02-14 15:26:35,212 Checking SAS93xx HBAs firmware
2021-02-14 15:26:35,227
2021-02-14 15:26:35,227 Checking HBA94xx HBAs firmware
2021-02-14 15:26:35,294
2021-02-14 15:26:35,294 HBA firmware check complete
2021-02-14 15:26:35,295

I have to run 'mpsutil show adapters' to get info on both of them, 'show all' doesn't actually show all adapters which is why I was a bit confused I think:

Code:

root@freenas[/data]# mpsutil show adapters
Device Name          Chip Name        Board Name        Firmware
/dev/mps0          LSISAS2008       SAS9211-8i        14000700
/dev/mps1          LSISAS2008       SAS9200-8e        14000400

For anyone interested, here's the link to the Broadcom download page for the LSI9200-8E firmware 20.00.07.00.
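The Firmware column in the 'mpsutil show adapters' output looks like the dotted version packed into hex, one byte per field: 14000700 lines up with the 20.00.07.00 that hba_firmware_update.log reports for the first controller. That reading is inferred from comparing the two outputs in this post, not taken from mpsutil documentation. A minimal Python sketch of the conversion, with a hypothetical helper name:

```python
def decode_fw(hex_word):
    """Split an 8-hex-digit firmware word into four dotted decimal fields."""
    if len(hex_word) != 8:
        raise ValueError("expected 8 hex digits")
    # Each pair of hex digits is one byte of the version.
    parts = [int(hex_word[i:i + 2], 16) for i in range(0, 8, 2)]
    return ".".join(f"{p:02d}" for p in parts)


if __name__ == "__main__":
    for dev, word in (("mps0", "14000700"), ("mps1", "14000400")):
        print(dev, decode_fw(word))
```

Under that assumption, mps0 decodes to 20.00.07.00 and mps1 to 20.00.04.00, which agrees with the log above.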

Ericloewe · Feb 21, 2021

atlantic said:
I bought the perc H310 HBA ready flashed to IT mode. Since it's flashed to LSI is it relatively easy to update to 07 firmware?

Yeah, standard sas2flash. Just don't run it while the pool is imported, bad things may happen.

atlantic said:
I'm being dumb, there are two HBAs, the H310 is for the r720 internal drives, and the SAS HBA is for the MD1000, right?

That would be a typical setup, yes. You can use sas2ircu LIST to get a list of controller IDs and
sas2ircu ${CONTROLLER_ID} DISPLAY to get a list of devices attached to said controller.

atlantic said:
IDK what the other revision numbers refer to,

BIOS and UEFI extension ROMs, probably. The No Image column certainly is one of them. They're optional, unless you boot from the SAS controllers. In which case, my suggestion is to have one controller empty/with the option ROM disabled in system firmware and only have the option ROM on the other controller. A single option ROM will handle all LSI HBAs for its generation, so it's fine even if you needed to do something silly like boot from internal and external disks.

atlantic · Feb 22, 2021

Ericloewe said:
Yeah, standard sas2flash. Just don't run it while the pool is imported, bad things may happen.

Great, thanks. I redid the thermal paste on both HBAs and have had the system running since yesterday with higher fan speeds, and have not seen any of those SCSI errors. That was only a few hours of high IO though, so I will try it for longer today so I can monitor it.

So, the big question is what do I do about the disk errors? TWO drives in the Z2 pool are now suspect, da12 and da23. That's not good, but I don't know enough about SMART to interpret what's happened.

Can I trust ZFS when it says the pool is healthy (I want to)? The two drives each have 1 SMART error logged when I run smartctl, but in TrueNAS GUI a short test says passed. More deep breathing?

Ericloewe · Feb 22, 2021

Scrub the pool. If no errors are reported, you're good to go.

atlantic · Feb 22, 2021

Ericloewe said:
Scrub the pool. If no errors are reported, you're good to go.

Yep did that, it was clean as a whistle. Amazing, thanks again. I love ZFS!
