Endless Resilvering and some CAM Status: SCSI Status Error

hysel · Apr 21, 2020

Hi everyone

I just bought an HP DL380 G8 with 8x2TB SAS drives.

I was able to install and configure FreeNAS-11.3-U2 and create one RAIDZ2 pool using all 8 drives.

In addition, I have added one M.2 NVMe drive for caching and one SSD for logging.

I wanted to test the server's hot-swap capability and drive replacement procedure, so I removed one of the drives, booted another OS, and formatted that drive. I then booted back into FreeNAS and re-added the disk. The system started a replacement, and the resilvering process began.

Code:

  pool: FR4G
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 09:33:29 2020
        7.07T scanned at 3.22G/s, 256G issued at 261M/s, 7.07T total
        8.59G resilvered, 3.53% done, 0 days 07:36:58 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        FR4G                                              DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/1f782fce-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20024e2c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20c6721c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            replacing-3                                   DEGRADED     0     0     0
              5134065479397326922                         UNAVAIL      0     0     0  was /dev/gptid/20e82911-8198-11ea-b536-2c44fd830388
              gptid/bebdfb0e-8266-11ea-899c-2c44fd830388  ONLINE       0     0     0
            gptid/210337fe-8198-11ea-b536-2c44fd830388    ONLINE       0     0    12
            gptid/212dd557-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/21487bc4-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/2115e086-8198-11ea-b536-2c44fd830388    ONLINE       0     0    13
        logs
          gptid/685d7224-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0
        cache
          gptid/68c9e10b-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2     ONLINE       0     0     0

errors: No known data errors

I expected this to take a few hours, but it was taking far too long. Looking at the server log, I noticed that the resilvering process kept restarting after reaching ~4%.
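
As a side note, the zpool status output above already shows non-zero CKSUM counters on two disks that were not part of the replacement. A minimal Python sketch for pulling such rows out of the text might look like this. The sample rows are copied from the output above, and the function name `devices_with_errors` is hypothetical, chosen only for illustration.

```python
import re

# Rows copied from the `zpool status` output earlier in this post (trimmed).
STATUS = """\
        NAME                                              STATE     READ WRITE CKSUM
            gptid/20c6721c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/210337fe-8198-11ea-b536-2c44fd830388    ONLINE       0     0    12
            gptid/212dd557-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/2115e086-8198-11ea-b536-2c44fd830388    ONLINE       0     0    13
"""

# Assumption: counters are plain integers. Abbreviated values such as "1.2K"
# would not match this pattern and would be skipped.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|UNAVAIL|FAULTED|OFFLINE|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and anything that is not a device row
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            flagged[m.group(1)] = (read, write, cksum)
    return flagged

flagged = devices_with_errors(STATUS)
for name, counts in flagged.items():
    print(name, counts)
```

With the sample above, only the two disks showing 12 and 13 checksum errors are reported.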

Looking at the server output, I see the following errors:

Code:

Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11181200
Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 4
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): READ(10). CDB: 28 00 02 5c a0 60 00 01 00 00
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - recommend reassignment)
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Info: 0x25ca090
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Field Replaceable Unit: 1
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11040400
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 7

The interesting part here is that da5 is not even the drive I removed (that was da7).
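
To see at a glance which device the sense errors belong to, the log lines can be tallied per device. The following is a rough Python sketch, assuming the log format shown above. The sample lines are copied from that log, and `sense_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Lines copied from the server log shown above.
LOG = """\
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): READ(10). CDB: 28 00 02 5c a0 60 00 01 00 00
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - recommend reassignment)
"""

# Captures the device (e.g. da5), the controller instance (e.g. ciss0),
# and the sense key text that precedes "asc:".
SENSE = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): SCSI sense: (.+?) asc:")

def sense_counts(log_text):
    """Count 'SCSI sense' lines per (device, sense key) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = SENSE.search(line)
        if m:
            device, _controller, sense_key = m.groups()
            counts[(device, sense_key)] += 1
    return counts

for (device, sense_key), n in sense_counts(LOG).items():
    print(device, sense_key, n)
```

Run against the full log rather than this excerpt, a tally like this would show whether the recovered errors are confined to da5 or spread across several disks.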

Here is the smartctl output for da5:

Code:

smartctl -a /dev/da5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0001
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5004129ea67
Serial number:        Z1P1KMH500009232N4PY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:51:09 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===

Current Drive Temperature:     32 C
Drive Trip Temperature:        68 C

Manufactured in week 09 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  49
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  49
Elements in grown defect list: 3539

Vendor (Seagate Cache) information
  Blocks sent to initiator = 873378287
  Blocks received from initiator = 638587946
  Blocks read from cache and sent to initiator = 37263257
  Number of read and write commands whose size <= segment size = 633436
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64723.22
  number of minutes until next internal SMART test = 41

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1542301948      605         0  1542302553        711        447.178         106
write:         0        0         0         0          0        328.511           0
verify:   554223      145         0    554368        190          0.000          45

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64713                 - [-   -    -]
# 2  Background short  Completed                   -   64701                 - [-   -    -]

Here is the same output for da7:

Code:

smartctl -a /dev/da7
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0002
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500573e7883
Serial number:        Z1P66CMH0000940661R9
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:53:32 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===

Current Drive Temperature:     30 C
Drive Trip Temperature:        68 C

Manufactured in week 35 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  54
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  54
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 462698561
  Blocks received from initiator = 1051950447
  Blocks read from cache and sent to initiator = 3971534
  Number of read and write commands whose size <= segment size = 1939050
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 50599.40
  number of minutes until next internal SMART test = 33

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   860180076        0         0  860180076          0        236.902           0
write:         0        0         0         0          0        538.893           0

Non-medium error count:        2

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50589                 - [-   -    -]
# 2  Background short  Completed                   -   50577                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -   50577                 - [-   -    -]

I tried swapping the bays to rule out a cable problem, but I wonder if there is a disk problem here that I need to consider.
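
For comparing the two drives, the most telling fields in the smartctl output above are the grown defect list and the "Total uncorrected errors" column. A small Python sketch that extracts them could look like the following. The sample text is copied from the two outputs above, and `summarize` is a hypothetical name used only here.

```python
import re

# Relevant lines copied from the smartctl output for da5 and da7 above.
DA5 = """\
Elements in grown defect list: 3539
read:   1542301948      605         0  1542302553        711        447.178         106
write:         0        0         0         0          0        328.511           0
verify:   554223      145         0    554368        190          0.000          45
"""
DA7 = """\
Elements in grown defect list: 0
read:   860180076        0         0  860180076          0        236.902           0
write:         0        0         0         0          0        538.893           0
"""

def summarize(smart_text):
    """Extract the grown defect count and per-operation uncorrected errors."""
    defects = None
    uncorrected = {}
    for line in smart_text.splitlines():
        m = re.match(r"\s*Elements in grown defect list:\s*(\d+)", line)
        if m:
            defects = int(m.group(1))
            continue
        m = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if m:
            # The last column of each row is "Total uncorrected errors".
            uncorrected[m.group(1)] = int(m.group(2).split()[-1])
    return {"grown_defects": defects, "uncorrected": uncorrected}

for name, text in (("da5", DA5), ("da7", DA7)):
    print(name, summarize(text))
```

On these samples, da5 shows 3539 grown defects with 106 uncorrected read errors, while da7 shows none of either.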

Here are more details on the system:

HP Proliant 665553-B21 DL380p Gen8
2 x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
128GB RAM
8 x Seagate ST2000NM0001 2TB SAS 3.5-inch drives
1 x Samsung SSD 850 EVO 500GB for log drive
1 x PC401 NVMe SK hynix 512GB for cache

Thanks in advance

Itamar

sretalla · Apr 21, 2020

hysel said:
SMART support is: Unavailable - device lacks SMART capability.

How are you connecting the disks to the system? Is there a RAID card involved?

hysel · Apr 21, 2020

sretalla said:
How are you connecting the disks to the system? Is there a RAID card involved?

I am using the onboard HP Smart Array P420i in passthrough mode

sretalla · Apr 21, 2020

hysel said:
HP Smart Array P420i in passthrough mode

There's a good chance that this is preventing SMART data from being passed correctly... You may be a little stuck there as the controller may be passing each disk through inside a container, so just replacing the controller with an HBA may not help.

Unless you can get to proper SMART data, I wouldn't trust your setup to keep your data.

hysel · Apr 21, 2020

sretalla said:
There's a good chance that this is preventing SMART data from being passed correctly... You may be a little stuck there as the controller may be passing each disk through inside a container, so just replacing the controller with an HBA may not help.

Unless you can get to proper SMART data, I wouldn't trust your setup to keep your data.

That sounds serious. Have you ever encountered this before with this type of server?

Should I even try the HBA option? If so, do you have a recommendation?

hysel · Apr 21, 2020

Update: I am going to replace that drive just to be sure. I have set it OFFLINE and removed it from the enclosure. I hope this will allow the disks to resilver properly.

HoneyBadger · Apr 21, 2020

hysel said:
That sounds serious. Have you ever encountered this before with this type of server?

Should I even try the HBA option? If so, do you have a recommendation?

Other users have reported issues using this specific HBA, because the FreeBSD driver ciss that is used is rather flaky. This might be the root cause of the failure to resilver. Are you sure it's in the full HBA mode (you used the hpssacli or similar program to set it to HBAmode=on?) and not just passing through unconfigured disks by default?

HP DL360p Gen8 - P420i controller

Humor me for a few here.... I'm well aware of the specifics of why a controller like the HP smartarray series is not necessarily a good fit for ZFS. I have a different production 9.10 box running an M1015 in IT mode and 6 1tb WD reds. However, I have this Dl360p Gen8 that I rescued from the...

www.ixsystems.com

HPE DL380p Gen8 25SFF Server - CISS driver

Hi all, I am looking at installing a newly acquired DL380p 25SFF server as my new freenas SSD box. The server has an intergrated P420i RAID controller. I have updated all firmware and put the RAID controller into HBA mode. I have some questions/issues, there were some previous topics but none...

www.ixsystems.com

FreeNAS only detecting 2 drives of Smart Array P420i

looked to see if I could disable the P420i in BIOS, but I was not able to find an option to disable it. I It's there...under PCI devices on the main menu in the BIOS you can disable the P420i.

www.ixsystems.com

You can use an HP H220 HBA for a "genuine HP option" or any of the other ones based on the LSI SAS2008 or SAS2308 chipset.

hysel said:
1 x Samsung SSD 850 EVO 500GB for log drive

1 x PC401 NVMe SK hynix 512GB for cache

Neither of these are good SLOG devices. They will both make fine cache options though. I would suggest using the 850 EVO for cache, and getting a better NVMe-based card for SLOG - check the thread here for your options but I would recommend an Optane P4801X if your present NVMe adaptor fits the 110mm cards.

SLOG benchmarking and finding the best SLOG

I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic since this is not well understood by many people. In short, there's 3 things that you need for a "great" SLOG: 1. Fast throughput 2...

www.ixsystems.com

hysel · Apr 21, 2020

Thank you very much. I can confirm that I used hpssacli to set the HBAmode option to on.

I was looking at the HP H220 HBA but it has only one port and I don't want to get two.

That said, I saw this on Amazon:

https://www.amazon.com/MFU-9211-8i-...:2470955011&rnid=2470954011&rps=1&sr=8-4&th=1

This is based on SAS2308

I will give it a try if nothing else works.

Thanks again!

Itamar

HoneyBadger · Apr 21, 2020

hysel said:
I was looking at the HP H220 HBA but it has only one port and I don't want to get two.

There are definitely models that have two SAS ports - you shouldn't need to get two independent cards.

Regarding buying from random sellers on Amazon - I wouldn't do this. I would be much, much more confident in buying a used-pull from a server refurbisher than I would picking up a piece of third-shift hardware with questionable history.

hysel · Apr 21, 2020

HoneyBadger said:
There are definitely models that have two SAS ports - you shouldn't need to get two independent cards.

Regarding buying from random sellers on Amazon - I wouldn't do this. I would be much, much more confident in buying a used-pull from a server refurbisher than I would picking up a piece of third-shift hardware with questionable history.

Thanks, I was about to pull the trigger :)

hysel · Apr 22, 2020

Update

I installed CentOS 8.1 on a USB disk and imported the zpool there. Now when I run the smartctl command, I see this:

Code:

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0023
Revision:             0006
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005760a8c3
Serial number:        Z1X12ZBX00009410BJ4F
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Apr 22 10:50:46 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

This time, I can see that the SMART support is available.

I haven't replaced the controller yet; this is still running on the P420i in the machine.

So, could it be, as you mentioned, a problem with the FreeBSD driver for this controller?

Thanks in advance

Itamar

Ericloewe · Apr 22, 2020

It's always going to be a sucky experience at best with that controller. Some form of SMART support is an improvement, but the recommendation is always to use a known-good HBA (which, unfortunately, means basically exclusively LSI SAS 2008, 2308 and 3008 controllers).

hysel · Apr 22, 2020

Ericloewe said:
It's always going to be a sucky experience at best with that controller. Some form of SMART support is an improvement, but the recommendation is always to use a known-good HBA (which, unfortunately, means basically exclusively LSI SAS 2008, 2308 and 3008 controllers).

Will an HPE H220 Host Bus Adapter be sufficient?

Ericloewe · Apr 22, 2020

Looks like that's an option: https://www.ixsystems.com/community/threads/does-the-hpe-h220-hba-need-to-be-in-it-mode.58978/

hysel · Apr 22, 2020

Ericloewe said:
Looks like that's an option: https://www.ixsystems.com/community/threads/does-the-hpe-h220-hba-need-to-be-in-it-mode.58978/

Thank you, I should get it in the next few days and I will try it.

hysel · Apr 27, 2020

Final Update

Got my H220 controller today.

I updated its firmware to v20 in IT mode.

Now all the disks are reporting the S.M.A.R.T status.

Code:

smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0002
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500573e7883
Serial number:        Z1P66CMH0000940661R9
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Apr 27 20:39:30 2020 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        68 C

Manufactured in week 35 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  85
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  85
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 477826913
  Blocks received from initiator = 1080263202
  Blocks read from cache and sent to initiator = 5904455
  Number of read and write commands whose size <= segment size = 2612670
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 50658.58
  number of minutes until next internal SMART test = 34

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   912415471        0         0  912415471          0        244.647           0
write:         0        0         0         0          0        554.433           0

Non-medium error count:        8


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50622                 - [-   -    -]
# 2  Background short  Completed                   -   50589                 - [-   -    -]
# 3  Background short  Completed                   -   50577                 - [-   -    -]
# 4  Background short  Aborted (by user command)   -   50577                 - [-   -    -]

Long (extended) Self-test duration: 18500 seconds [308.3 minutes]

And you know what, call me crazy, but the server is working much better now that the SAS disks are no longer connected to the P420i controller.

Thank you all for all the help.

HoneyBadger · Apr 28, 2020

hysel said:
Final Update

Got my H220 controller today.

I updated its firmware to v20 in IT mode.

Now all the disks are reporting the S.M.A.R.T status.

And you know what, call me crazy, but the server is working much better now that the SAS disks are no longer connected to the P420i controller.

Thank you all for all the help.

Glad to hear it. Question - did you have a cache module on your P420i before? I'm reading mixed reports that if there isn't one (or it's not working properly) the queue depth is absolutely brutal, which would explain some of the poor performance.

hysel · Apr 28, 2020

HoneyBadger said:
Glad to hear it. Question - did you have a cache module on your P420i before? I'm reading mixed reports that if there isn't one (or it's not working properly) the queue depth is absolutely brutal, which would explain some of the poor performance.

I do, a 512MB one. Honestly, I am not sure what the situation is with it. The fact of the matter is that once I disconnected the drives from it, the server started to work properly.
