SOLVED Degraded pool after non-graceful reboot, rrdcached problem

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
Dear all,

I am experiencing what may be a rather broken system (even though most things still seem to work), and I would really appreciate input on what is going on and what steps I should take.
I am not even sure whether all the errors below are related, and I am grateful for any help.
Also, please let me know if I should post this in another sub-forum.

A little background on my situation:
After not actively managing my NAS for a while, I recently upgraded from FreeNAS 11.3-U2 to 11.3-U5 and then to TrueNAS 12.
I am not sure whether this was caused by the update, but I experienced a hung system that I could no longer access via either the web UI or SSH.
I then tried to shut the system down gracefully ("orderly") via IPMI, but this failed - only an "immediate" reboot did the job.

For more context on my system:
  • Supermicro X10SDV-4C-7TP4F
  • Intel® Xeon® processor D-1518, Single socket FCBGA 1667; 4-Core, 8 Threads, 35W
  • 48 GB RAM
  • 4xWD Red 2TB for storage, 2x32GB SanDisk USB SSD drives for system

In any case, I am experiencing the following errors:

The first one concerns my freenas-boot pool, which holds the system:

Code:
zpool status -v
  pool: freenas-boot
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:08:45 with 0 errors on Sat Mar 13 03:53:45 2021
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  DEGRADED     0     0     0
      mirror-0    DEGRADED     0     0     0
        da4p2     DEGRADED     0     0     0  too many errors
        da5p2     ONLINE       0     0     0

errors: No known data errors


Tracing it back, this problem was already there before the update to TrueNAS.

Secondly, my main storage pool "vault" also has a problem:

Code:
pool: vault
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 639M in 00:01:08 with 36622 errors on Fri Mar 19 13:27:23 2021
config:

    NAME                                                STATE     READ WRITE CKSUM
    vault                                               DEGRADED     0     0     0
      mirror-0                                          DEGRADED 35.8K     0     0
        gptid/e3a11d9e-a2e1-11e7-ad5e-0025905e1638.eli  REMOVED      0     0     0
        gptid/e4bf2724-a2e1-11e7-ad5e-0025905e1638.eli  ONLINE       0     0 71.5K
      mirror-1                                          ONLINE       0     0     0
        gptid/e60143a6-a2e1-11e7-ad5e-0025905e1638.eli  ONLINE       0     0     0
        gptid/e721c833-a2e1-11e7-ad5e-0025905e1638.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /var/db/system/syslog-76c11d7f8a944b3d8e42fe35420dbaa3/log/utx.lastlogin
        /var/db/system/syslog-76c11d7f8a944b3d8e42fe35420dbaa3/log/maillog.0.bz2
        /var/db/system/syslog-76c11d7f8a944b3d8e42fe35420dbaa3/log/console.log.0.bz2
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/zfs_arc_v2/gauge_arcstats_raw_mru-mfu_ghost_hits.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/aggregation-cpu-sum/cpu-idle.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/zfs_arc/memory_throttle_count.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/df-mnt-vault-apps-transmission/df_complex-reserved.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/zfs_arc_v2/gauge_arcstats_raw_counts-allocated.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/df-mnt-vault-backups/df_complex-free.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/aggregation-cpu-sum/cpu-interrupt.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/df-mnt-vault-archive/df_complex-reserved.rrd
        [...]
        [200+ more *.rrd files in /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/]
        [...]
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/zfs_arc/mutex_operations-miss.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/zfs_arc/hash_collisions.rrd
        /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/localhost/df-mnt-vault-apps-tautulli/df_complex-reserved.rrd
        /var/db/system/configs-76c11d7f8a944b3d8e42fe35420dbaa3/TrueNAS-12.0-U2.1/20210319.db
        /mnt/vault/lingames/Steam/steamapps/shadercache/35720/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
        /mnt/vault/lingames/Steam/steamapps/shadercache/512900/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
        /mnt/vault/lingames/Steam/steamapps/shadercache/945360/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
        vault/vm_images/valheim_lgsm-v4li08:<0x1>


Also, something seems to be wrong with "rrdcached" (sorry, I don't know what this is or does), since my terminal is endlessly flooded with this error:
Code:
Mar 19 15:22:14 heimii 1 2021-03-19T15:22:14.454128+01:00 heimii.lan collectd 5514 - - rrdcached plugin: Failed to connect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: Connection refused (status=61)


Further, mounting my NFS shares on my Linux desktop throws "duplicate cookie" errors like this:
Code:
kernel: FS-Cache: Duplicate cookie detected
   kernel: FS-Cache: O-cookie c=000000001e72b895 [p=0000000089da8da7 fl=222 nc=0 na=1]
   kernel: FS-Cache: O-cookie d=00000000c3a2cbed n=00000000f757123a
   kernel: FS-Cache: O-key=[10] '040002000801c0a805c3'
   kernel: FS-Cache: N-cookie c=00000000ea48db1d [p=0000000089da8da7 fl=2 nc=0 na=1]
   kernel: FS-Cache: N-cookie d=00000000c3a2cbed n=000000000f72327e
   kernel: FS-Cache: N-key=[10] '040002000801c0a805c3'


I am not sure whether this is related.
Even though I have had my FreeNAS system running for over two years, I am a newbie. So far I have used it mostly for playing around with VMs and for storage on my LAN.
Honestly, I am quite overwhelmed by this, am not sure what to do, and would love any input.

I have also ordered two new 4TB HDDs to use for backups, in case I need to completely recreate/replace the old pool, which spans all of the physical drives.
If the system is completely broken, I am considering backing up all data and simply starting anew. However, if this can be avoided, I would be glad.

Let me know if I can provide any more information.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
rrdcached is related to collectd, which collects performance data from the CPU, memory, disks, etc.

You can probably safely delete that structure (maybe stopping collectd first) and it should be recreated, but easier than that is moving your system dataset to another pool (even if only for a minute) under System | System Dataset in the GUI.
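
If you go the manual route, this is roughly what I mean (a sketch only; the directory name is taken from your error list, so double-check it on your system before moving anything):
Code:
# stop the collector so nothing writes to the RRD files while you work
service collectd stop

# move the damaged RRD tree aside rather than deleting it outright
mv /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3 /var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3.broken

# restart collection; the structure should be recreated
service collectd start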

You can then run a zpool clear on the pool and see what comes back in the next scrub... perhaps these will:
/mnt/vault/lingames/Steam/steamapps/shadercache/35720/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
/mnt/vault/lingames/Steam/steamapps/shadercache/512900/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
/mnt/vault/lingames/Steam/steamapps/shadercache/945360/mesa_shader_cache_sf/c1516fe0adc2164672ac79ffa0d26cd3/AMD RADV VEGA10 (ACO)/foz_cache_idx.foz
vault/vm_images/valheim_lgsm-v4li08:<0x1>

You may need to manually delete the steam stuff and check/replace the VM image file.
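
For reference, the sequence I have in mind is roughly this (a sketch only; the snapshot name and the backup target below are placeholders, not anything your system already has):
Code:
# reset the error counters and the logged file list
zpool clear vault

# run a fresh scrub and review what it reports
zpool scrub vault
zpool status -v vault

# before touching the affected zvol, snapshot it and copy the stream somewhere safe
zfs snapshot vault/vm_images/valheim_lgsm-v4li08@rescue
zfs send vault/vm_images/valheim_lgsm-v4li08@rescue > /path/to/backup/valheim_rescue.zfs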
 

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
Thanks @sretalla for the reply.

rrdcached is related to collectd, which collects performance data from the CPU, memory, disks, etc.
Ok, good to know.

You can probably safely delete that structure (maybe stopping collectd first) and it should be recreated, but easier than that is moving your system dataset to another pool (even if only for a minute) under System | System Dataset in the GUI.
So I did
Code:
zpool clear freenas-boot
and no further errors were listed.
I then moved the system dataset to that pool and cleared/scrubbed the pool "vault".
I then wanted to restart the system, and now it hangs showing this:
Screenshot from 2021-03-19 17-30-49.png

Something seems really wrong here, and I guess I should check the cables on the HDDs?
Or does that mean that the SATA controller on the board is broken?
Is it safe to force a shutdown now?
 

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
Update: The machine ultimately restarted.
The error messages keep repeating on the system after the restart.
Everything seemed OK though, so I tried restarting the VM that was using this zvol
Code:
 vault/vm_images/valheim_lgsm-v4li08 
, in order to get the data off the zvol.
It seemed to start OK, but then this happened:
Screenshot from 2021-03-19 19-13-33.png

Also, now the status of the pool looks much worse:
Code:
pool: vault
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub in progress since Fri Mar 19 17:21:43 2021
    1.68T scanned at 450M/s, 90.8G issued at 23.8M/s, 2.16T total
    0B repaired, 4.11% done, 1 days 01:20:08 to go
config:

    NAME                                                STATE     READ WRITE CKSUM
    vault                                               UNAVAIL      0     0     0  insufficient replicas
      mirror-0                                          UNAVAIL     36     2     0  insufficient replicas
        gptid/e3a11d9e-a2e1-11e7-ad5e-0025905e1638.eli  REMOVED      0     0 8.65K
        gptid/e4bf2724-a2e1-11e7-ad5e-0025905e1638.eli  REMOVED      0     0 3.88K
      mirror-1                                          DEGRADED     0     0     0
        gptid/e60143a6-a2e1-11e7-ad5e-0025905e1638.eli  REMOVED      0     0 1.30K
        gptid/e721c833-a2e1-11e7-ad5e-0025905e1638.eli  ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended


Now I am really afraid that I am going to lose all my data on that pool.
Any advice on what to do?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The pool may be fine.

I suspect a problem with the HBA. It seems to be failing to complete the automatic firmware update. You might need to boot another OS to do the firmware update, or the HBA might be suspect and need replacement.

It won't hurt to check the cables, but I don't expect any magic there.
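
To see exactly which controller and firmware you have before doing anything, you can query it from the TrueNAS shell (assuming the onboard controller is driven by mps, which is typical for that board):
Code:
# show controller model, chip and firmware revision
mpsutil show adapter

# list the drives attached to it
mpsutil show devices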
 

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
The pool may be fine.
Good to hear.


I suspect a problem with the HBA. It seems to be failing to complete the automatic firmware update.
Is this automatic update a new thing?
I did not change anything apart from upgrading from FreeNAS 11.3 to TrueNAS 12.
I didn't change any hardware.

You might need to boot another OS to do the firmware update, or the HBA might be suspect and need replacement.

I guess the HBA is integrated on my Supermicro board (sorry, I am an IT amateur). Can I maybe do something via the "Firmware" tab of the Supermicro IPMI?

Screenshot from 2021-03-19 19-41-10.png

Or do you mean I should try to boot into a previous version of FreeNAS/TrueNAS and then somehow initiate the automatic firmware update?

Also, should I just wait and let the current scrub finish before I do anything else? At the moment it says
Code:
 1 days 12:52:33 to go 
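
For what it's worth, I am watching the progress with zpool status; if it turns out I should not let it run, I assume pausing or stopping it would look something like this (just my guess from the man page):
Code:
# check scrub progress
zpool status vault

# pause or stop the running scrub if needed
zpool scrub -p vault
zpool scrub -s vault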


Thanks for your help and sorry for my most certainly dumb questions...
 

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
Update: After I disconnected most network access and rebooted, the system eventually resilvered.
As soon as I accessed the NAS storage over the network, data got corrupted again.

The behaviour seems similar to that:
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You need to get that firmware update done.


Can I maybe do something via the "Firmware" Tab of the SuperMicro IPMI?
I think the screen you shared there refers to the firmware of the IPMI itself, not the HBA.

2x32GB SanDisk USB SSD drives for system
Actually these may be part of or at the heart of the problem.

Do you have another option? I think these are subject to the TRIM bug and that may be what's causing the corruption to show up with a scrub.
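
If you want to check whether TRIM is even enabled or has been issued against the boot pool (I'm assuming TrueNAS 12 with OpenZFS 2.0 here, and guessing at the mechanism), these will show it:
Code:
# is automatic TRIM turned on for the boot pool?
zpool get autotrim freenas-boot

# per-device TRIM status
zpool status -t freenas-boot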
 

beralt

Dabbler
Joined
Jan 8, 2019
Messages
18
I think it really was a hardware problem. I physically reseated the SATA/power cables and now it seems to be running fine again.
Thanks for the tips on the USB SSD drives. I'll try to get my HBA's firmware updated (rough plan sketched after the output below), but it is a different controller than the one suggested in that link:
Code:
# mpsutil show adapter
mps0 Adapter:
       Board Name: LSI2116-IT
   Board Assembly: 
        Chip Name: LSISAS2116
    Chip Revision: ALL
    BIOS Revision: 7.37.01.00
Firmware Revision: 19.00.02.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 1.5    6.0    SAS Initiator 
1                              N                 1.5    6.0    SAS Initiator 
2                              N                 1.5    6.0    SAS Initiator 
3                              N                 1.5    6.0    SAS Initiator 
4                              N                 1.5    6.0    SAS Initiator 
5                              N                 1.5    6.0    SAS Initiator 
6                              N                 1.5    6.0    SAS Initiator 
7                              N                 1.5    6.0    SAS Initiator 
8       0001        0011       N         6.0     1.5    6.0    SAS Initiator 
9       0002        0012       N         6.0     1.5    6.0    SAS Initiator 
10      0003        0013       N         6.0     1.5    6.0    SAS Initiator 
11      0004        0014       N         6.0     1.5    6.0    SAS Initiator 
12                             N                 1.5    6.0    SAS Initiator 
13                             N                 1.5    6.0    SAS Initiator 
14                             N                 1.5    6.0    SAS Initiator 
15                             N                 1.5    6.0    SAS Initiator 
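
For the record, my rough plan for the firmware update, assuming sas2flash handles this onboard SAS2116 the same way it does a regular SAS2 HBA, and assuming Supermicro provides an IT firmware image for the X10SDV (the file names below are placeholders, not the actual download names):
Code:
# list the controllers sas2flash can see
sas2flash -listall

# flash the IT firmware and BIOS (placeholder file names)
sas2flash -o -f firmware_it.bin -b mptsas2.rom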



I will mark this thread as solved.
 