Migrating Data Between Pools (Carefully)

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
I'm in the middle of upgrading my server setup and moving my data from an existing pool (mix of drive sizes and no redundancy) to a new one (identical drives in raidz1 configuration).

I've read other posts here that led me to a plan: set up the new drives as a temporary pool, copy all of the data across, then disconnect both pools and reconnect the new pool under the old pool's name, to hopefully pick up all of the config that the old pool had. I ran into a bit of trouble doing that, however. I initially tried using cp -a to copy the files. This worked fine for my smaller dataset (photos, about 32 GiB), but during a copy of the larger dataset (media, 14.6 TiB) it wasn't long before the system rebooted itself and reported one of my 4TB drives as degraded, though a quick SMART test didn't show any issues with it. I then tried to set up a one-off replication task in the UI to move the media dataset, and ran into the same reboot issue.

I searched around the forums a bit, and figured out I needed to run a scrub task on the pool. It came up with about 74 errors, which it seems to have fixed. I didn't manage to grab the details of the errors, as I ended up rebooting afterwards to tweak some fan speed settings in the BIOS, and it doesn't seem that the system keeps track of previously cleaned errors (if it does, I'd be grateful for a pointer to where to find them). I'm about 10 hours into a second scrub task just to be safe; it's 80% done with no errors found yet, so I should be good to run zpool clear on the pool once it finishes.

Now, the advice I'm looking for is how to proceed with copying the data over to my new pool. Given the issue I ran into, and given that this existing pool has no redundancy, I want to be cautious. What's the safest way for me to transfer the files from one pool to the other? It's pretty much all large mkv and mp4 files at this point for my Plex server. Should I set that one-off dataset replication running, cp smaller chunks of data, or copy as much as I can through an intermediary PC over the network? The total capacity of the original pool is 18.03 TiB, and I've used 14.7 TiB of that. I assume that creating a snapshot needs at least a decent proportion of space on the same pool as it is snapshotting, so I doubt that I could fit that in the remaining 3.34 TiB available, unless I'm misunderstanding how snapshots work?

Any advice would be greatly appreciated.

System info is in my signature, but putting the details of the two relevant pools here too for reference:

Running TrueNAS-SCALE-23.10.2

nas_pool_1 (existing pool):
x2 4TB Seagate Ironwolf ST4000VN006-3CW104 + x1 12TB Seagate Ironwolf ST12000VN0008-2YS101 SATA HDDs
STRIPE formation (no redundancy o_O)
Currently storing Plex media and Photos

tmp_pool_1 (new pool):
x3 12TB Seagate Ironwolf Pro ST12000NE0008-1ZF101 SATA HDDs
RAIDZ1 formation
Intended replacement for nas_pool_1
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The total capacity of the original pool is 18.03 TiB, and I've used 14.7 TiB of that. I assume that creating a snapshot needs at least a decent proportion of space on the same pool as it is snapshotting, so I doubt that I could fit that in the remaining 3.34 TiB available, unless I'm misunderstanding how snapshots work?
You misunderstand: if you take a snapshot of your pool now, it will need practically no space, since it just records the blocks that are currently written. A snapshot only starts to take up space once you alter or delete data afterwards, because it keeps holding on to the old blocks; the amount of changed data dictates how much space it uses.
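To make that concrete, here is a toy model in Python. It is only a sketch of the copy-on-write idea, not how ZFS is actually implemented: the `ToyPool` class, the file names and the sizes are all made up for illustration.

```python
from typing import Dict, FrozenSet, Set


class ToyPool:
    """Toy copy-on-write model (hypothetical, not real ZFS internals)."""

    def __init__(self) -> None:
        self.block_size: Dict[int, int] = {}        # block id -> size in GiB
        self.live: Dict[str, int] = {}              # file name -> block id
        self.snapshots: Dict[str, FrozenSet[int]] = {}
        self._next_id = 0

    def write(self, name: str, size_gib: int) -> None:
        # Copy-on-write: new data always lands in a fresh block,
        # the old block is never overwritten in place.
        self.block_size[self._next_id] = size_gib
        self.live[name] = self._next_id
        self._next_id += 1

    def delete(self, name: str) -> None:
        del self.live[name]

    def snapshot(self, snap: str) -> None:
        # Taking a snapshot copies no data; it only records which
        # blocks are referenced at this moment.
        self.snapshots[snap] = frozenset(self.live.values())

    def snapshot_used(self, snap: str) -> int:
        # Space that stays allocated only because this snapshot
        # still references it.
        still_needed: Set[int] = set(self.live.values())
        for other, blocks in self.snapshots.items():
            if other != snap:
                still_needed |= blocks
        unique = self.snapshots[snap] - still_needed
        return sum(self.block_size[b] for b in unique)


pool = ToyPool()
pool.write("movie.mkv", 40)
pool.write("show.mp4", 10)
pool.snapshot("before-migration")
print(pool.snapshot_used("before-migration"))  # 0: nothing has changed yet

pool.write("show.mp4", 12)                     # replaced file, old block kept
print(pool.snapshot_used("before-migration"))  # 10

pool.delete("movie.mkv")                       # deleted file, old block kept
print(pool.snapshot_used("before-migration"))  # 50
```

For a pool that mostly holds media files that never change, the snapshot would stay close to zero for as long as the data is left alone.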

Since you already set up a replication task, you have already made a snapshot (replication tasks send snapshots).

Replication would be the preferred method.

How are the drives connected?
What PSU do you have?

Please attach your SMART test results and zpool status output. If I had to guess, I'd say your SATA controller crashes under prolonged load or your drives have indeed started to fail.

Someone please educate me, but how would a scrub repair a damaged file without redundancy?

Without additional information I'd start backing up the most important data manually. Manually copying 30 GB chunks for 14 TB of data seems tedious...

Back up everything irreplaceable and provide the requested information. If it is indeed the SATA controller, you may want to replace it before attempting the migration. If your drives really have failed, I fear you're already looking at data loss anyway.
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
I think you might be right about the SATA controller. Overnight I had an alert email through saying that in addition to that one drive in the nas pool, my boot pool is degraded - both of which are connected via my SATA controller. Current connection setup is:

All 12TB drives (the 3 in the new pool, plus the one in the old pool) connected directly to motherboard SATA ports
Both 4TB drives plus the boot drive connected to the SATA controller. SATA controller is ACTIMED PCIE SATA Card 6 port (ASM1064 & JB575 chip)

PSU is 750W Thermaltake Toughpower SFX 7VM1F

This is my current zpool status output:

Code:
  pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:17 with 0 errors on Thu Mar 21 03:45:19 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          sdg3      DEGRADED     0    19     0  too many errors

errors: No known data errors

  pool: nas_pool_1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 15:43:40 with 0 errors on Fri Mar 22 01:10:55 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        nas_pool_1                              DEGRADED     0     0     0
          6d537eeb-e895-41f0-8f66-6f34f231d606  DEGRADED     0     0     0  too many errors
          9a04e590-6235-435a-8308-105899171263  ONLINE       0     0     0
          272c8d8b-9112-42e4-819b-2baeb00a295f  ONLINE       0     0     0

errors: No known data errors

  pool: ssd_pool_1
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        ssd_pool_1                              ONLINE       0     0     0
          e7628bcc-9804-4c71-8fec-50619e5971eb  ONLINE       0     0     0

errors: No known data errors

  pool: tmp_pool_1
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        tmp_pool_1                                ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            a1fd373e-f81a-4b61-8a16-0a9c9d44bb23  ONLINE       0     0     0
            cbf01c3f-a0f5-4579-bec0-d75996ac5fd6  ONLINE       0     0     0
            68a42e22-f418-4da3-b9d7-b6fd4621f8ed  ONLINE       0     0     0

errors: No known data errors


Running a manual short SMART test on both of the drives comes back as a success, with Errors: N/A. If there's more detail I can get from those smart test results than what the UI is showing me, happy to grab that if I can be pointed in the right direction.


I'm going to go ahead and see about finding a new SATA controller, since both of the issues I've had have been drives connected to that - and I'll probably also move the boot drive over to one of the motherboard SATA ports, since it makes sense to keep that one connected somewhere safer than it is.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I think you might be right about the SATA controller. Overnight I had an alert email through saying that in addition to that one drive in the nas pool, my boot pool is degraded - both of which are connected via my SATA controller.
It's just an idea; we don't know for sure right now. LSI HBAs are often recommended around here. Flash one to IT mode, and you can also buy used. We could try to investigate further before spending money.
I don't know the specific controller though, so it may not be unwise to replace it nonetheless. On the other hand, I also use an unrecommended SATA card with no problems. But I know the risks, and I only connected one drive from each mirrored pair of my boot pool and SSD pool to it. Neither pool is mission critical, and the other drive of each pair is connected directly to the mainboard.

If there's more detail I can get from those smart test results than what the UI is showing me, happy to grab that if I can be pointed in the right direction.
Yes, please post the output of smartctl -a /dev/sdg and the same for the drive in your nas_pool_1 (you should be able to identify it with zpool status -LP).
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
SMART test status for boot pool drive:
Code:
smartctl -a /dev/sdg3
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37240G
Serial Number:    50026B72830AD4F5
LU WWN Device Id: 5 0026b7 2830ad4f5
Firmware Version: SAJ20104
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 25 13:56:41 2024 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3441
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       1
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Lat/Erl      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   027   033   000    Old_age   Always       -       27 (Min/Max 21/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
231 SSD_Life_Left           0x0000   094   094   000    Old_age   Offline      -       94
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       4537
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       2983
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       132
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       60
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       121
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       99624

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 28 20 f8 b4 07 00 00      00:00:00.000  WRITE FPDMA QUEUED
  61 18 80 68 60 40 00 00      00:00:00.000  WRITE FPDMA QUEUED
  61 10 28 a0 0d 65 00 00      00:00:00.000  WRITE FPDMA QUEUED
  61 10 b0 b0 0d 65 00 00      00:00:00.000  WRITE FPDMA QUEUED
  61 18 80 68 60 40 00 00      00:00:00.000  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3429         -
# 2  Short offline       Completed without error       00%      3405         -
# 3  Short offline       Completed without error       00%      3381         -
# 4  Short offline       Completed without error       00%      3365         -
# 5  Short offline       Interrupted (host reset)      90%      3357         -
# 6  Short offline       Completed without error       00%      3333         -
# 7  Short offline       Completed without error       00%      3317         -
# 8  Short offline       Completed without error       00%      3294         -
# 9  Short offline       Completed without error       00%      3270         -
#10  Short offline       Completed without error       00%      3246         -
#11  Short offline       Completed without error       00%      3222         -
#12  Short offline       Completed without error       00%      3198         -

Selective Self-tests/Logging not supported

The above only provides legacy SMART information - try 'smartctl -x' for more


And for the drive that had issues in nas_pool_1:
Code:
smartctl -a /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN006-3CW104
Serial Number:    WW60T5ZD
LU WWN Device Id: 5 000c50 0f1793709
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar 25 13:59:15 2024 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 461) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       158756176
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   076   060   045    Pre-fail  Always       -       44272865
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3451
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   062   040    Old_age   Always       -       32 (Min/Max 28/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       122
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       229
194 Temperature_Celsius     0x0022   032   040   000    Old_age   Always       -       32 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       158756176
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       3398 (68 129 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10445665544
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       57541191290

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3439         -
# 2  Short offline       Completed without error       00%      3415         -
# 3  Short offline       Completed without error       00%      3391         -
# 4  Short offline       Completed without error       00%      3375         -
# 5  Short offline       Interrupted (host reset)      40%      3367         -
# 6  Short offline       Completed without error       00%      3343         -
# 7  Short offline       Completed without error       00%      3327         -
# 8  Short offline       Completed without error       00%      3316         -
# 9  Short offline       Completed without error       00%      3295         -
#10  Short offline       Completed without error       00%      3271         -
#11  Short offline       Completed without error       00%      3247         -
#12  Short offline       Completed without error       00%      3223         -
#13  Short offline       Completed without error       00%      3199         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I'm not the expert in interpreting SMART results, unfortunately. The output for nas_pool_1 doesn't look too worrying from my perspective, i.e. nothing really stands out to me.

Maybe the SATA card is the best bet at this moment. How far are you with your research on that end?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
On the boot drive, you have a CRC error. That's a hardware error, perhaps a SATA cable or perhaps the controller, but it's just 1. You've also had some UDMA CRC errors on the Seagate in your pool, which is not likely good. You have not reached the threshold of bad, but it's definitely not normal. The latter is a drive error, but thus far it's been able to recover the data. It definitely needs monitoring. Pay attention to the value and worst columns for 195. If they keep dropping, it's getting worse.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
You've also had some UDMA CRC errors on the Seagate in your pool, this is not likely good.
Where did you see this?

Also, these are all short tests; throw in a long test for good measure.

I assume your most critical data is backed up by now?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
195 Hardware_ECC_Recovered 0x001a 082 064 000 Old_age Always - 158756176

For Seagate, the values to pay attention to are the 082 and the 064, NOT the raw value on the right. Lower is worse.
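As a rough illustration of reading those columns, here is a small Python sketch that pulls VALUE, WORST and THRESH out of one row of the smartctl attribute table and reports how far the worst value still is from the threshold. The helper names (`parse_attribute_line`, `headroom`) are made up for this example, and "headroom" is just a convenient number to watch over time, not an official SMART metric.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # current normalized value (lower is worse)
    worst: int   # lowest normalized value recorded so far
    thresh: int  # vendor threshold for this attribute
    raw: str     # raw value, encoding is vendor specific


def parse_attribute_line(line: str) -> SmartAttribute:
    """Parse one row of the attribute table printed by `smartctl -a`."""
    fields = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=" ".join(fields[9:]),
    )


def headroom(attr: SmartAttribute) -> int:
    """Distance between the worst normalized value and the threshold."""
    return attr.worst - attr.thresh


# Row copied from the smartctl output posted earlier in this thread.
line = ("195 Hardware_ECC_Recovered  0x001a   082   064   000    "
        "Old_age   Always       -       158756176")
attr = parse_attribute_line(line)
print(attr.name, attr.value, attr.worst, attr.thresh, headroom(attr))
# Hardware_ECC_Recovered 82 64 0 64
```

Running this against a fresh smartctl output every so often and comparing the numbers would show whether VALUE and WORST keep dropping.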
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Thanks for clarifying, I was looking at

199 UDMA_CRC_Error_Count

I converted the raw value for 195 and it returned 0, that's why I assumed that one was okay. Good to know that you need to compare the other columns. Pretty confusing, but that's why I said I'm not the guy to judge the results @cbamatt ;)
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
The raw value of 195 is 158756176, where did you get 0?

And, it's very confusing. Different drives can use the same attribute differently.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The raw value of 195 is 158756176, where did you get 0?
It's a 48-bit value, you need to convert it, or at least I thought so.

Use an online converter, or add -v X,raw48:54 to your smartctl command, where X is 1, 7 or 195.
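For reference, the conversion itself is just bit arithmetic. The sketch below splits the 48-bit raw value into its upper 16 bits and lower 32 bits. Reading the upper part as an error count and the lower part as an operation count is the commonly cited interpretation for Seagate drives and is an assumption here, not something confirmed in this thread; the function name is made up.

```python
from typing import Tuple


def split_raw48(raw: int) -> Tuple[int, int]:
    """Split a 48-bit SMART raw value into (upper 16 bits, lower 32 bits)."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    upper = raw >> 32          # assumed: error count on Seagate drives
    lower = raw & 0xFFFFFFFF   # assumed: number of operations
    return upper, lower


# Raw values taken from the /dev/sdf output posted above.
for attr_id, raw in [(1, 158756176), (7, 44272865), (195, 158756176)]:
    upper, lower = split_raw48(raw)
    print(attr_id, upper, lower)
# 1 0 158756176
# 7 0 44272865
# 195 0 158756176
```

All three raw values are below 2^32, so the upper 16 bits come out as 0, which matches the 0 I got from the converter.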
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
Maybe the sata card is the best bet at this moment. How far are you with your research on that end?
I've managed to find a Dell HBA330 with an LSI 9300 chip for a decent enough price, that's also already been flashed to IT mode, so I'll give that a shot when it arrives.

I assume your most critical data is backed up by now?
For the most part, yeah. The majority of the data on these drives is media files from ripping DVDs that I own, so all of it can be recovered, if a little time consuming. Any critical data I've already moved onto my new raidz pool.

I've also ordered duplicates for both the NVMe that's forming my ssd pool and my boot drive; I figure it can't hurt to turn those pools into mirrors while I'm in the process of making sure I'm doing things properly.

I'll make sure to set a long test going on each of these (and probably set up one on my other drives at some point too, to be safe). Thank you both for your help in this so far :)
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
It's a 48-bit value, you need to convert it, or at least I thought so.

Use an online converter, or add -v X,raw48:54 to your smartctl command, where X is 1, 7 or 195.
You're right. Still, it's the 2 values I mentioned that would matter. They are not what should be expected. I am presuming, though not 100% sure, that since they are somewhat low, that they corrected the error and so it does not increment the raw value. Not 100% certain on that part.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I've managed to find a Dell HBA330 with an LSI 9300 chip for a decent enough price, that's also already been flashed to IT mode, so I'll give that a shot when it arrives.
That's great, but it's only one part of the process. The second part is to use the utility to determine the firmware version; older firmware versions (and there's a surprisingly large number of those around, even though the card is really old!) have problems that cause errors as well. So make sure the card is running current IT-mode firmware, then proceed. You'll get through it and all will be fine in the end.

Are you sure the drive that got your 74 errors is sdf, though? Each reboot can change which drive letter it is.
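One way to avoid relying on the drive letter is to match on the serial number, which stays the same across reboots. A minimal Python sketch that pulls the model and serial out of saved smartctl -a text follows; the function name is made up, and the sample text is just the first lines of the information section posted earlier for /dev/sdf.

```python
import re
from typing import Dict, Optional


def identity_from_smartctl(text: str) -> Dict[str, Optional[str]]:
    """Extract 'Device Model' and 'Serial Number' from `smartctl -a` text."""
    found: Dict[str, Optional[str]] = {
        "Device Model": None,
        "Serial Number": None,
    }
    for line in text.splitlines():
        match = re.match(r"^(Device Model|Serial Number):\s+(.*\S)\s*$", line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


# Excerpt of the information section posted above for /dev/sdf.
sample = """\
Device Model:     ST4000VN006-3CW104
Serial Number:    WW60T5ZD
LU WWN Device Id: 5 000c50 0f1793709
"""

info = identity_from_smartctl(sample)
print(info["Device Model"], info["Serial Number"])
# ST4000VN006-3CW104 WW60T5ZD
```

Comparing that serial against the label on the physical drive, or against what the UI lists for the degraded disk, confirms which device is which regardless of the current letter.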
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
Are you sure the drive that got your 74 errors is sdf, though? Each reboot can change which drive letter it is.
I'm pretty sure it is, yeah - cross-referenced the drive serial number. Just in case though, this is the smartctl data for the other 4TB drive:
Code:
smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN006-3CW104
Serial Number:    WW60T7QD
LU WWN Device Id: 5 000c50 0f1792a5f
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 25 22:09:47 2024 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 463) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   064   006    Pre-fail  Always       -       22500960
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       53871822
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3459
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   057   040    Old_age   Always       -       34 (Min/Max 30/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       122
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       246
194 Temperature_Celsius     0x0022   034   043   000    Old_age   Always       -       34 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   074   064   000    Old_age   Always       -       22500960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       3404 (144 54 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10878143264
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       57513074938

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3439         -
# 2  Short offline       Completed without error       00%      3415         -
# 3  Short offline       Completed without error       00%      3391         -
# 4  Short offline       Completed without error       00%      3367         -
# 5  Short offline       Completed without error       00%      3343         -
# 6  Short offline       Completed without error       00%      3327         -
# 7  Short offline       Completed without error       00%      3295         -
# 8  Short offline       Completed without error       00%      3271         -
# 9  Short offline       Completed without error       00%      3247         -
#10  Short offline       Completed without error       00%      3223         -
#11  Short offline       Completed without error       00%      3199         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
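
For anyone comparing several drives' output like the above, here is a minimal Python sketch that pulls the attribute table out of pasted `smartctl -a` text and flags the rows that usually matter. The `WATCHED` list and the function names are illustrative assumptions, not part of smartmontools. It compares the normalized VALUE against THRESH rather than judging large raw numbers, since raw values for attributes such as Raw_Read_Error_Rate are vendor-encoded on some drives.

```python
# Attributes whose raw count is usually expected to stay at zero.
# This watch list is an illustrative assumption, not an official set.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",
}

# A few rows copied from the output above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   064   006    Pre-fail  Always       -       22500960
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   057   040    Old_age   Always       -       34 (Min/Max 30/43)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {name: (value, worst, thresh, raw)} from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID, have a hex flag in the
        # third column, and carry at least ten columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        if not parts[2].startswith("0x"):
            continue
        value, worst, thresh = int(parts[3]), int(parts[4]), int(parts[5])
        raw = " ".join(parts[9:])  # raw value may contain spaces
        attrs[parts[1]] = (value, worst, thresh, raw)
    return attrs


def flag_problems(attrs):
    """List attributes at or below threshold, or watched ones with a non-zero raw count."""
    problems = []
    for name, (value, worst, thresh, raw) in attrs.items():
        if thresh > 0 and value <= thresh:
            problems.append(
                f"{name}: normalized value {value} is at or below threshold {thresh}"
            )
        if name in WATCHED:
            first = raw.split()[0]
            if first.isdigit() and int(first) > 0:
                problems.append(f"{name}: raw value {first} is non-zero")
    return problems


if __name__ == "__main__":
    found = flag_problems(parse_attributes(SAMPLE))
    print("\n".join(found) if found else "no problems flagged")
```

An empty result only means none of these particular checks tripped; it does not rule out controller or cabling faults.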


That's great, but it's only one part of the process. The second part is to use the utility to determine the firmware version: older firmware (and there's a surprisingly large number of versions, even though the card is really old!) has problems that cause errors as well. So make sure the card is running current firmware in IT mode, then proceed. You'll get through it and all will be fine in the end.

The seller of the card has put up a screenshot with these details - I think the firmware version is the right one from searching around the forums?
Code:
Controller Number:                 0
Controller:                        SAS3008(C0)
PCI Address:                       00:01:00:00
SAS Address:                       500605C-9-7E65-9458
NVDATA Version (Default):          0E.01.00.39
NVDATA Version (Persistent):       0E.01.00.39
Firmware Product ID:               0x2221 (IT)
Firmware Version:                  16.00.11.00
NVDATA Vendor:                     LSI
NVDATA Product ID:                 Dell HBA 330 Adp
BIOS Version:                      N/A
UEFI BSD Version:                  N/A
FCODE Version:                     N/A
Board Name:                        Dell HBA 330 Adp
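
For checking a reported firmware version against whatever version the forum guidance currently recommends, here is a small Python sketch. It assumes the version fields are plain decimal numbers separated by dots (true of the Firmware Version line above, but not of the NVDATA lines, which contain hex such as 0E). The `PLACEHOLDER_MINIMUM` value is purely hypothetical and is not a recommendation; substitute the version you find in your own search.

```python
def parse_version(text):
    """Turn a dotted version such as '16.00.11.00' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_at_least(installed, required):
    """True if the installed version is the same as or newer than required."""
    # Tuples compare field by field, so 16.00.11.00 < 16.01.00.00.
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    installed = "16.00.11.00"             # from the screenshot above
    PLACEHOLDER_MINIMUM = "16.00.00.00"   # hypothetical, replace with the real target
    print(is_at_least(installed, PLACEHOLDER_MINIMUM))
```

Comparing as integer tuples rather than strings avoids mistakes when fields differ in width, for example 9 versus 16.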
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I don't have one of those, so I'm not sure, but searching should give you the right answer. The other drive looks about the same as the first one. You bought them used or new?

I don't see anything in the SMART data that should cause 70-some errors. If you buy the new controller, keep in mind that a new controller alone will not resolve any cabling issues, and cabling could be the problem for the boot drive at least.
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
The other drive looks about the same as the first one. You bought them used or new?
I believe they were listed as new (ordered them end of October), and if memory serves they did come in sealed packaging (not to say that there's a 0% chance they were used/refurbed).

I should have some additional SATA cables around somewhere that I can swap in for the boot drive to try to rule that out as an issue. The boot drive will also likely end up connected directly to the motherboard when I reconfigure, since I'll be using 2 mini-SAS to SATA breakout cables to connect the controller to the backplane, and it's simpler to run all of those cables to the backplane and keep the mobo connectors for any drives not connected through it.
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
The last part I needed arrived today, so I reconfigured the server with the HBA330 in it - all looks to be working perfectly. I've got the replication task running now, and it's going fine (on the old card the server bombed out after a minute or less).

I also got duplicates of my nvme drive for my system + applications pool and the ssd for my boot pool, and I've got both of those pools in a mirror config now.

If all goes well, the replication task should be finished by the end of today or tomorrow, and I can then look at swapping the pools around for config/naming (I'll be following this post; it looks to be quite an old one based on the forum it's in, but none of the instructions look wildly out of sync with the current state of things).
 

cbamatt

Cadet
Joined
Mar 21, 2024
Messages
8
Replication task finished and all data seems available (only time will tell if any of the video files picked up any corruption, but they can be replaced easily enough).

Following the instructions in that post didn't work exactly - I couldn't detach the old pool because it kept complaining that it was in use, so I just renamed tmp_pool_1 to tank and manually switched the config over to it. I then had to reboot the server before it would let me detach the old pool. After that I made sure everything was working as intended, then reimported the old pool to destroy/wipe it.
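
For reference, a rename at the ZFS level is normally an export followed by an import under the new name. The Python sketch below only builds and prints that command pair as a dry run; it does not execute anything. The helper name `rename_pool_commands` is made up for illustration, and whether these should be run from the shell or through the TrueNAS UI (which keeps its own record of pools) is an assumption to check for your version. The new name must not already be in use by an imported pool.

```python
import shlex


def rename_pool_commands(old_name, new_name):
    """Return the zpool commands that rename a pool via export and re-import."""
    return [
        ["zpool", "export", old_name],            # pool must not be in use
        ["zpool", "import", old_name, new_name],  # import under the new name
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in rename_pool_commands("tmp_pool_1", "tank"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to double-check the pool names before touching anything.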

Many thanks to you both for helping me work through this and narrow down what the issue likely was. I've set up a weekly scheduled long SMART test in addition to the daily short tests, and I'll keep my eye on both the drives that did report issues. Both of them are now in configurations with some redundancy, so I should be good.
 