Endless Resilvering and some CAM Status: SCSI Status Error

hysel

Explorer
Joined
Apr 11, 2020
Messages
69
Hi everyone

I just bought an HP DL380 G8 with 8x2TB SAS drives.

I was able to install and configure FreeNAS-11.3-U2 and create one RAIDZ2 pool using all 8 drives.

In addition, I have added one M.2 NVMe drive for caching and one SSD for logging.

I wanted to test the server's hot-swap capability and the drive replacement procedure, so I removed one of the drives, booted another OS and formatted that drive. I then booted back into FreeNAS and re-added the disk. The system started a replacement and the resilvering process began.

Code:
pool: FR4G
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 09:33:29 2020
        7.07T scanned at 3.22G/s, 256G issued at 261M/s, 7.07T total
        8.59G resilvered, 3.53% done, 0 days 07:36:58 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        FR4G                                              DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/1f782fce-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20024e2c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20c6721c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            replacing-3                                   DEGRADED     0     0     0
              5134065479397326922                         UNAVAIL      0     0     0  was /dev/gptid/20e82911-8198-11ea-b536-2c44fd830388
              gptid/bebdfb0e-8266-11ea-899c-2c44fd830388  ONLINE       0     0     0
            gptid/210337fe-8198-11ea-b536-2c44fd830388    ONLINE       0     0    12
            gptid/212dd557-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/21487bc4-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/2115e086-8198-11ea-b536-2c44fd830388    ONLINE       0     0    13
        logs
          gptid/685d7224-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0
        cache
          gptid/68c9e10b-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2     ONLINE       0     0     0

errors: No known data errors
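
For anyone reading along: two of the disks that were never pulled show non-zero CKSUM counts in the output above. A minimal sketch for picking those rows out of saved `zpool status` text is below. The function name and the sample string are illustrative only, and the regex assumes plain integer counters (ZFS abbreviates large counts, e.g. `1.2K`, which this would skip).

```python
import re

# Matches a device row of `zpool status`: name, state, READ/WRITE/CKSUM counters.
# Assumption: counters are plain integers (abbreviated values are not handled).
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|UNAVAIL|FAULTED|OFFLINE|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, blank lines, "errors:" lines, etc.
        name, _state, read, write, cksum = m.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged

# A few rows copied from the pool output in this post.
SAMPLE = """\
        NAME                                              STATE     READ WRITE CKSUM
        FR4G                                              DEGRADED     0     0     0
            gptid/210337fe-8198-11ea-b536-2c44fd830388    ONLINE       0     0    12
            gptid/212dd557-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/2115e086-8198-11ea-b536-2c44fd830388    ONLINE       0     0    13
"""

for name, counts in devices_with_errors(SAMPLE).items():
    print(name, counts)
```

On a live system the text would come from running `zpool status` and capturing its output; the parsing is the same.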


I know that this should have taken a few hours, but I noticed that it was taking too long. Looking at the server log, I saw that the resilvering process was restarting after it got to ~4%.
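
One rough way to confirm a restart without watching the log is to save `zpool status` output periodically and compare the `scan:` lines. The sketch below assumes the wording shown in the output above ("resilver in progress since ...", "N% done"); the second snapshot is a made-up example, not output from this system, and the function names are hypothetical.

```python
import re

SINCE = re.compile(r"resilver in progress since (.+)$", re.MULTILINE)
DONE = re.compile(r"([\d.]+)% done")

def parse_scan(status_text):
    """Return (start_time_string, percent_done), or None if no resilver is running."""
    since = SINCE.search(status_text)
    done = DONE.search(status_text)
    if not since or not done:
        return None
    return since.group(1).strip(), float(done.group(1))

def count_restarts(snapshots):
    """Count how often the resilver start time changed or the progress went backwards."""
    restarts = 0
    previous = None
    for text in snapshots:
        current = parse_scan(text)
        if current is None:
            continue  # snapshot without a running resilver: ignore it
        if previous is not None and (
            current[0] != previous[0] or current[1] < previous[1]
        ):
            restarts += 1
        previous = current
    return restarts

# First snapshot: taken from the output in this post.
SNAP_1 = """\
  scan: resilver in progress since Tue Apr 21 09:33:29 2020
        8.59G resilvered, 3.53% done, 0 days 07:36:58 to go
"""
# Second snapshot: hypothetical, for illustration only.
SNAP_2 = """\
  scan: resilver in progress since Tue Apr 21 10:05:00 2020
        1.02G resilvered, 0.40% done, 0 days 08:10:00 to go
"""

print(count_restarts([SNAP_1, SNAP_2]))
```

Comparing the raw start-time strings avoids any date parsing; a changed string is enough to show that the scan began again.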

Looking at the server output, I see the following errors:
Code:
Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11181200
Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 4
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): READ(10). CDB: 28 00 02 5c a0 60 00 01 00 00
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - recommend reassignment)
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Info: 0x25ca090
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Field Replaceable Unit: 1
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11040400
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 7


The interesting part here is that da5 is not even the drive I removed (that was da7).
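
To see at a glance which device the kernel is complaining about, the CAM messages can be tallied per device and sense key. This is only a sketch: the regex is based on the log lines shown above and may need adjusting for other drivers or message formats, and `tally_sense` is a name made up for this example.

```python
import re
from collections import Counter

# Based on lines such as:
# (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - ...)
SENSE = re.compile(
    r"\((da\d+):[^)]*\): SCSI sense: (.+?) asc:([0-9a-f]+),([0-9a-f]+)"
)

def tally_sense(log_text):
    """Count SCSI sense reports per (device, sense key)."""
    counts = Counter()
    for m in SENSE.finditer(log_text):
        device, sense_key = m.group(1), m.group(2)
        counts[(device, sense_key)] += 1
    return counts

# Lines copied from the log excerpt in this post.
LOG = """\
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - recommend reassignment)
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Info: 0x25ca090
"""

for (device, sense_key), n in tally_sense(LOG).items():
    print(device, sense_key, n)
```

Run against the full system log, this would show whether the errors are confined to da5 or spread across several disks, which matters when deciding between a bad drive and a controller or driver problem.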

Here is the output of the smart test for da5:

Code:
smartctl -a /dev/da5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0001
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5004129ea67
Serial number:        Z1P1KMH500009232N4PY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:51:09 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===

Current Drive Temperature:     32 C
Drive Trip Temperature:        68 C

Manufactured in week 09 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  49
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  49
Elements in grown defect list: 3539

Vendor (Seagate Cache) information
  Blocks sent to initiator = 873378287
  Blocks received from initiator = 638587946
  Blocks read from cache and sent to initiator = 37263257
  Number of read and write commands whose size <= segment size = 633436
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64723.22
  number of minutes until next internal SMART test = 41

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1542301948      605         0  1542302553        711        447.178         106
write:         0        0         0         0          0        328.511           0
verify:   554223      145         0    554368        190          0.000          45

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64713                 - [-   -    -]
# 2  Background short  Completed                   -   64701                 - [-   -    -]


Here is the same output for da7:

Code:
smartctl -a /dev/da7
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0002
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500573e7883
Serial number:        Z1P66CMH0000940661R9
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:53:32 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===

Current Drive Temperature:     30 C
Drive Trip Temperature:        68 C

Manufactured in week 35 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  54
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  54
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 462698561
  Blocks received from initiator = 1051950447
  Blocks read from cache and sent to initiator = 3971534
  Number of read and write commands whose size <= segment size = 1939050
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 50599.40
  number of minutes until next internal SMART test = 33

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   860180076        0         0  860180076          0        236.902           0
write:         0        0         0         0          0        538.893           0

Non-medium error count:        2

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50589                 - [-   -    -]
# 2  Background short  Completed                   -   50577                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -   50577                 - [-   -    -]


I tried swapping the drives between bays to make sure it is not a cable problem, but I wonder whether there is a disk problem here that I need to consider.
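
The two reports differ sharply: da5 lists 3539 elements in its grown defect list plus 106 uncorrected read errors and 45 uncorrected verify errors, while da7 lists none. A small sketch for pulling those figures out of saved `smartctl -a` text for SAS drives follows. It assumes the layout shown above, where the last column of each error counter row is the total uncorrected errors; `summarize` is a hypothetical helper name.

```python
import re

DEFECTS = re.compile(r"Elements in grown defect list:\s*(\d+)")
COUNTER_ROW = re.compile(r"^(read|write|verify):\s+(.*)$", re.MULTILINE)

def summarize(smart_text):
    """Extract the grown defect count and uncorrected error totals per operation."""
    m = DEFECTS.search(smart_text)
    summary = {
        "grown_defects": int(m.group(1)) if m else None,
        "uncorrected": {},
    }
    for operation, rest in COUNTER_ROW.findall(smart_text):
        fields = rest.split()
        # Assumption: the last column is "Total uncorrected errors".
        summary["uncorrected"][operation] = int(fields[-1])
    return summary

# Lines copied from the da5 report in this post.
DA5 = """\
Elements in grown defect list: 3539

read:   1542301948      605         0  1542302553        711        447.178         106
write:         0        0         0         0          0        328.511           0
verify:   554223      145         0    554368        190          0.000          45
"""

print(summarize(DA5))
```

Running the same function over every disk's report makes outliers such as da5 easy to spot.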

Here are more details on the system:
  • HP Proliant 665553-B21 DL380p Gen8
  • 2 x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  • 128GB RAM
  • 8 x Seagate ST2000NM0001 2TB SAS 3.5-inch drives
  • 1 x Samsung SSD 850 EVO 500GB for log drive
  • 1 x PC401 NVMe SK hynix 512GB for cache
Thanks in advance

Itamar
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
SMART support is: Unavailable - device lacks SMART capability.
How are you connecting the disks to the system? is there a RAID card involved?
 

sretalla

HP Smart Array P420i in passthrough mode
There's a good chance that this is preventing SMART data from being passed correctly... You may be a little stuck there as the controller may be passing each disk through inside a container, so just replacing the controller with an HBA may not help.

Unless you can get to proper SMART data, I wouldn't trust your setup to keep your data.
 

hysel

There's a good chance that this is preventing SMART data from being passed correctly... You may be a little stuck there as the controller may be passing each disk through inside a container, so just replacing the controller with an HBA may not help.

Unless you can get to proper SMART data, I wouldn't trust your setup to keep your data.

That sounds serious enough. Have you ever encountered this before with this type of server?

Should I even try the HBA option? If so, do you have a recommendation?
 

hysel

Update: I am going to replace that drive just to be sure. I have taken it OFFLINE and removed it from the enclosure. I hope this will allow the disks to resilver properly.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That sounds serious enough. Have you ever encountered this before with this type of server?

Should I even try the HBA option? If so, do you have a recommendation?

Other users have reported issues with this specific controller, because the FreeBSD ciss driver it uses is rather flaky. This might be the root cause of the failure to resilver. Are you sure it's in full HBA mode (did you use hpssacli or a similar program to set HBAmode=on?) and not just passing through unconfigured disks by default?


You can use an HP H220 HBA for a "genuine HP option" or any of the other ones based on the LSI SAS2008 or SAS2308 chipset.

  • 1 x Samsung SSD 850 EVO 500GB for log drive
  • 1 x PC401 NVMe SK hynix 512GB for cache

Neither of these is a good SLOG device. They will both make fine cache options though. I would suggest using the 850 EVO for cache and getting a better NVMe-based card for SLOG. Check the thread here for your options, but I would recommend an Optane P4801X if your present NVMe adaptor fits 110mm cards.

 

HoneyBadger

I was looking at the HP H220 HBA but it has only one port and I don't want to get two.

There are definitely models that have two SAS ports - you shouldn't need to get two independent cards.

Regarding buying from random sellers on Amazon - I wouldn't do this. I would be much, much more confident in buying a used-pull from a server refurbisher than I would picking up a piece of third-shift hardware with questionable history.
 

hysel

There are definitely models that have two SAS ports - you shouldn't need to get two independent cards.

Regarding buying from random sellers on Amazon - I wouldn't do this. I would be much, much more confident in buying a used-pull from a server refurbisher than I would picking up a piece of third-shift hardware with questionable history.

Thanks, I was about to pull the trigger :)
 

hysel

Update

I installed CentOS 8.1 on a USB disk and imported the zpool into it. Now when I run the smartctl command, I see this:

Code:
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0023
Revision:             0006
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005760a8c3
Serial number:        Z1X12ZBX00009410BJ4F
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Apr 22 10:50:46 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled


This time, I can see that the SMART support is available.
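
The difference between the FreeBSD and CentOS reports comes down to the `SMART support is:` lines. A quick sketch for checking them across a batch of saved `smartctl` outputs is below. The function name is made up for this example, and the check relies only on the exact wording seen in this thread.

```python
def smart_support(smart_text):
    """Return (available, enabled) based on the 'SMART support is:' lines."""
    values = [
        line.split(":", 1)[1].strip()
        for line in smart_text.splitlines()
        if line.startswith("SMART support is:")
    ]
    available = any(v.startswith("Available") for v in values)
    enabled = any(v == "Enabled" for v in values)
    return available, enabled

# Wording as reported under FreeBSD behind the P420i (from the first post).
FREEBSD_REPORT = """\
Transport protocol:   SAS (SPL-3)
SMART support is:     Unavailable - device lacks SMART capability.
"""
# Wording as reported under CentOS (from this post).
CENTOS_REPORT = """\
Transport protocol:   SAS (SPL-3)
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
"""

print(smart_support(FREEBSD_REPORT))
print(smart_support(CENTOS_REPORT))
```

This only reads what smartctl printed; it cannot tell whether the drive, the controller or the driver is responsible for an "Unavailable" result.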

I haven't replaced the controller yet; this is still running on the P420i I have in the machine.

So, could it be, as you mentioned, a problem with the FreeBSD driver for this controller?

Thanks in advance

Itamar
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's always going to be a sucky experience at best with that controller. Some form of SMART support is an improvement, but the recommendation is always to use a known-good HBA (which, unfortunately, means basically exclusively LSI SAS 2008, 2308 and 3008 controllers).
 

hysel

It's always going to be a sucky experience at best with that controller. Some form of SMART support is an improvement, but the recommendation is always to use a known-good HBA (which, unfortunately, means basically exclusively LSI SAS 2008, 2308 and 3008 controllers).

Will an HPE H220 Host Bus Adapter be sufficient?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

hysel

Final Update

Got my H220 controller today.

I updated its firmware to v20 in IT mode.

Now all the disks are reporting the S.M.A.R.T status.

Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0002
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500573e7883
Serial number:        Z1P66CMH0000940661R9
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Apr 27 20:39:30 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        68 C

Manufactured in week 35 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  85
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  85
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 477826913
  Blocks received from initiator = 1080263202
  Blocks read from cache and sent to initiator = 5904455
  Number of read and write commands whose size <= segment size = 2612670
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 50658.58
  number of minutes until next internal SMART test = 34

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   912415471        0         0  912415471          0        244.647           0
write:         0        0         0         0          0        554.433           0

Non-medium error count:        8


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50622                 - [-   -    -]
# 2  Background short  Completed                   -   50589                 - [-   -    -]
# 3  Background short  Completed                   -   50577                 - [-   -    -]
# 4  Background short  Aborted (by user command)   -   50577                 - [-   -    -]

Long (extended) Self-test duration: 18500 seconds [308.3 minutes]


And you know what, call me crazy, but the server is working much better now that the SAS disks are not connected to the P420i controller.

Thank you all for all the help.
 

HoneyBadger

Final Update

Got my H220 controller today.

I updated its firmware to v20 in IT mode.

Now all the disks are reporting the S.M.A.R.T status.

And you know what, call me crazy, but the server is working much better now that the SAS disks are not connected to the P420i controller.

Thank you all for all the help.

Glad to hear it. Question - did you have a cache module on your P420i before? I'm reading mixed reports that if there isn't one (or it's not working properly) the queue depth is absolutely brutal, which would explain some of the poor performance.
 

hysel

Glad to hear it. Question - did you have a cache module on your P420i before? I'm reading mixed reports that if there isn't one (or it's not working properly) the queue depth is absolutely brutal, which would explain some of the poor performance.

I do, a 512MB one. Honestly, I am not sure what the situation with it is. The fact of the matter is that once I disconnected the hard drives from it, the server started to work properly.
 