raidz - degraded - investigation

phier · Nov 29, 2023

@Davvo thanks! will do this weekend.

Also, regarding that calculation and UCE:

rvassar said:
A couple thoughts:
1. The uncorrectable error spec for the HC550 is 1 in 10^15 bits read, which gives you some headroom for UCE, but I suspect most of us here would advise against RAIDZ1. You're looking at one UCE for roughly every 120 TB read, but for 32 TB of data, that's pretty thin.

Are we talking here about https://jro.io/r2c2/?
I ran simulations for r5 with 3 drives and r6 with 4 or 6 drives, and here are the results from r2c2.
Or am I mixing apples with pears? :)

My use case would be:
NAS: not sure whether r5 with 3 drives or r6 with 4 drives.
Daily replication to NAS2: also not sure which layout would be good for that. r5?

In the article ZFS_Storage_Pool_Layout_White_Paper_February_2022.pdf they point out that
a 6-wide r6 (6 drives) is better than a 4-wide r6 (4 drives), but based on the calculation above it seems the 4-wide r6 is better...

Thanks

rvassar · Nov 29, 2023

phier said:
@Davvo thanks! will do this weekend.

Also, regarding that calculation and UCE:

Are we talking here about https://jro.io/r2c2/?
I ran simulations for r5 with 3 drives and r6 with 4 or 6 drives, and here are the results from r2c2.
Or am I mixing apples with pears? :)

Thanks

A quick casual read... I suspect that calculator is addressing complete drive failure leading to pool failure. I was pointing out the odds of a single uncorrectable error per drive, which at 1 in 10^15 bits works out to about one every 120 TB read by my dinner-napkin math. Given you have 32 TB of data, you have a margin of about 3.75 footprints. The problem only occurs when two drives experience an error in the same LBA; ZFS RAIDZ1, being only single-fault tolerant, then can't recover your data to map out the failure. You then lose that file or zvol, but probably not the entire pool, though I don't think that can be excluded. So... no, it's not the same thing; we're mixing apples with pears.
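For anyone who wants to redo the napkin math, here is a minimal Python sketch. It assumes the spec is an average rate of one error per 10^15 bits read (the datasheet figure is really an upper bound), that errors are independent, and that "32 TB" means decimal terabytes. The function names are made up for illustration. Done exactly, the figure lands nearer 125 TB per expected error and a margin of about 3.9, which is close to the rounded numbers above.

```python
import math

URE_RATE_PER_BIT = 1e-15   # assumed average rate: 1 unrecoverable error per 10^15 bits read
DATA_BYTES = 32e12         # 32 TB of data, decimal terabytes (assumption)


def bytes_per_expected_error(rate_per_bit):
    """Bytes read, on average, per unrecoverable error."""
    return 1.0 / rate_per_bit / 8.0


def p_at_least_one_error(bytes_read, rate_per_bit):
    """Probability of at least one error while reading bytes_read,
    i.e. 1 - (1 - rate)^bits, computed in a numerically stable way."""
    bits = bytes_read * 8.0
    return -math.expm1(bits * math.log1p(-rate_per_bit))


tb_per_error = bytes_per_expected_error(URE_RATE_PER_BIT) / 1e12
margin = tb_per_error / (DATA_BYTES / 1e12)
p_one_pass = p_at_least_one_error(DATA_BYTES, URE_RATE_PER_BIT)

print(f"TB read per expected error: {tb_per_error:.1f}")
print(f"margin in 32 TB footprints: {margin:.2f}")
print(f"P(at least one error in one full 32 TB read): {p_one_pass:.3f}")
```

Keep in mind this only models a single drive reading the whole data set once; it says nothing about whole-drive failures, which is what the r2c2 calculator appears to cover.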

Davvo · Nov 30, 2023

Please read the following resource:

Assessing the Potential for Data Loss

This guide was written to be read from top to bottom without jumps, with the intent of spreading awareness to both new and experienced users; the author of this document assumes the understanding of the concepts explained in the following...

www.truenas.com

phier · Jan 4, 2024

Davvo said:
Check power and data connections for evident damage or not properly latched cables;

Run zpool clear universalsoldier and then run a scrub;

After the scrub finishes, check if any errors were found;

Hello @Davvo, I finally got physical access, so I did 1), 2) and 3).

Even before I had physical access to do 1), this notification:
* Pool universalsoldier state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk ST18000NM000J-2TV103 ZR52K9D5 is REMOVED

disappeared a long time ago. I have no clue why, and I am not sure whether that's a good or a bad sign.

New results are here:
  pool: universalsoldier
 state: ONLINE
  scan: scrub repaired 0B in 1 days 02:25:01 with 0 errors on Wed Dec 20 03:25:06 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        universalsoldier                                ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a691e97e-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0
            gptid/a68cf3cc-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0
            gptid/a696a973-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0

errors: No known data errors
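If you want to check this kind of output automatically rather than by eye, a rough Python sketch follows. It assumes you have saved the text of zpool status to a string, and it only understands the simple NAME/STATE/READ/WRITE/CKSUM layout shown above. The function name and the DEGRADED sample are hypothetical, made up for illustration.

```python
def unhealthy_devices(status_text):
    """Return (name, state, counters) for every config row that is not
    ONLINE or that has a non-zero READ/WRITE/CKSUM counter."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME" and "STATE" in fields:
            in_config = True           # header row: device rows follow
            continue
        if fields[0] == "errors:":
            break                      # end of the config section
        if in_config and len(fields) >= 5:
            name, state, *counters = fields[:5]
            # Counters may be abbreviated (e.g. "1.2K"), so compare as text.
            if state != "ONLINE" or any(c != "0" for c in counters):
                problems.append((name, state, tuple(counters)))
    return problems


HEALTHY = """
config:
        NAME                STATE     READ WRITE CKSUM
        universalsoldier    ONLINE       0     0     0
          raidz1-0          ONLINE       0     0     0
            gptid/a691e97e-ffaa-11ec-aba1-000c29f60559  ONLINE  0  0  0
errors: No known data errors
"""

# Hypothetical degraded sample, not taken from this thread.
DEGRADED = """
config:
        NAME                STATE     READ WRITE CKSUM
        universalsoldier    DEGRADED     0     0     0
          raidz1-0          DEGRADED     0     0     0
            gptid/example-disk  REMOVED  0  0  0
errors: No known data errors
"""

print(unhealthy_devices(HEALTHY))
print(unhealthy_devices(DEGRADED))
```

For the status posted above this finds nothing to report, which matches the "0 errors" scrub result.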

Previously it showed -> scan: resilvered 133G in 00:57:28 with 0 errors on Sun Nov 19 07:06:19 2023,
so I hope those 133G were "corrected" and there is no data corruption.

The smartctl long test results are attached as well, but once again the test for the drive that was previously removed multiple times:

Disk ST18000NM000J-2TV103 ZR52K9D5 is REMOVED

was interrupted :( I am going to re-run the long test.
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)  00%        11629            -
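To spot interrupted runs in a longer self-test log, something like the Python sketch below could help. It assumes the log rows look like the one above and that the test description is always two words (as in "Extended offline" or "Short offline"), which may not hold for every drive. The regex and function name are my own, and the second sample row is hypothetical.

```python
import re

# One self-test log row: "# N  <two-word description>  <status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<desc>\S+ \S+)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def interrupted_tests(log_text):
    """Return the parsed rows whose status starts with 'Interrupted'."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("status").startswith("Interrupted"):
            rows.append(match.groupdict())
    return rows


SAMPLE_LOG = """\
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)  00%        11629            -
# 2  Extended offline    Completed without error   00%        11500            -
"""

for row in interrupted_tests(SAMPLE_LOG):
    print(row["num"], row["desc"], row["status"], row["hours"])
```

A "host reset" interruption points at something on the host side resetting the link during the test, so by itself it does not say the disk surface is bad.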

Based on the provided articles, I think it might be better to go with a Z2 configuration with 4x18TB
instead of my current setup, i.e. 3x18TB (Z1). I hope my assumption is correct and it's not overkill.
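A quick back-of-the-envelope comparison of the two layouts, as a Python sketch. It only counts data drives times drive size and ignores RAIDZ padding, metadata, reserved space, and the TB/TiB difference, so real usable space will be lower. The function name is made up for illustration.

```python
def raidz_summary(drives, parity, drive_tb):
    """Naive RAIDZ figures: usable space is (drives - parity) * drive size."""
    data_drives = drives - parity
    return {
        "usable_tb": data_drives * drive_tb,
        "tolerated_failures": parity,
        "efficiency": data_drives / drives,
    }


z1_3wide = raidz_summary(drives=3, parity=1, drive_tb=18)
z2_4wide = raidz_summary(drives=4, parity=2, drive_tb=18)

print("3-wide RAIDZ1:", z1_3wide)
print("4-wide RAIDZ2:", z2_4wide)
```

Under these assumptions both layouts give the same nominal 36 TB; the fourth drive buys the ability to survive a second failure, not more space.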

Thanks a lot

Davvo · Jan 4, 2024

I wouldn't consider a 4-wide RAIDZ2 VDEV overkill in a remote system; some folks use it in standard "at hand" systems too.

phier · Jan 5, 2024

Davvo said:
I wouldn't consider a 4-wide RAIDZ2 VDEV overkill in a remote system; some folks use it in standard "at hand" systems too.

@Davvo got it. Regarding the issue with the drive,
the interruptions during the smartctl long test, the removal of the drive from the pool...
is it not possible to say anything?

thanks

Davvo · Jan 5, 2024

Well I'm not clear about your hardware and how you are running the tests, so I won't comment much about it beyond suggesting to check the cables.

phier · Jan 5, 2024

Davvo said:
Well I'm not clear about your hardware and how you are running the tests, so I won't comment much about it beyond suggesting to check the cables.

@Davvo Supermicro board, with the drives attached directly to its SATA ports.

By "how do I run the tests", what do you mean? I execute smartctl -t long /dev/adaX directly from TrueNAS via SSH.

I'm running ESXi with TrueNAS as a VM, but the Supermicro's native SATA controller is passed through to the TrueNAS VM.

I can try replacing the data and power cables, but it sounds strange that it all stopped without any manual intervention. I assume that if the cable were faulty, the problem would keep recurring.
thanks

Davvo · Jan 5, 2024

If you do not provide a complete hardware list as per the forum rules, which was requested a few times, you are not helping US help you.
The same goes for not making it immediately clear that this is a virtualized instance.

Troubleshooting a virtualized system requires specialized knowledge usually not found in "common" users.

To future readers:
DO SPECIFY RIGHT AWAY IF YOU ARE VIRTUALIZING: NOT DOING SO MIGHT RESULT IN DATA LOSS!

phier · Jan 5, 2024

Davvo said:
If you do not provide a complete hardware list as per the forum rules, which was requested a few times, you are not helping US help you.
The same goes for not making it immediately clear that this is a virtualized instance.

Troubleshooting a virtualized system requires specialized knowledge usually not found in "common" users.

To future readers:
DO SPECIFY RIGHT AWAY IF YOU ARE VIRTUALIZING: NOT DOING SO MIGHT RESULT IN DATA LOSS!

Why data loss? I am confused.

SETUP:
https://www.supermicro.com/en/products/motherboard/x11ssl-f
Pentium processor
64GB ECC RAM

ESXi, TrueNAS in a VM (16GB RAM);
X11SSL-F SATA ports passed through to the VM;
3x 18TB as mentioned: 2x WD, 1x Seagate

and a few VMs running on ESXi.
ESXi is installed on an SSD plugged into a PCI slot (via an adapter).

thanks

phier · Jan 8, 2024

The smartctl long test on the problematic drive (the Seagate) has finished; the report is attached.

@Davvo do I need to provide any additional info?
thanks

Davvo · Jan 8, 2024

Drive looks clean. Read the following resource.

Resource - "Absolutely must virtualize TrueNAS!" ... a guide to not completely losing your data.

[---- 2024/01/16: Still relevant. Virtualization really doesn't change much. Updates made as appropriate. ----] [---- 2018/02/27: This is still as relevant as ever. As PCIe-Passthru has matured, fewer problems are reported. I've updated some...

www.truenas.com

Are you passing through the SATA controller?

phier · Jan 8, 2024

Davvo said:
Drive looks clean.

@Davvo thank you for the check.

Davvo said:
Are you passing through the SATA controller?

Precisely.
The whole SATA controller is passed through to the VM where TrueNAS is running.

ESXi boots from an NVMe drive plugged into a PCI slot.

phier · Jan 9, 2024

@Davvo it seems I am stuck here... I also saw that jgreco mentioned in his post: "After moving the disks, I *still* ran into a platform issue that resulted in a disk dropping and needing to resilver."
https://www.truenas.com/community/t...ompletely-losing-your-data.109256/post-458737

Davvo · Jan 9, 2024

I have very little knowledge about virtualization and honestly little clue of what to do in order to troubleshoot the issue.

phier · Jan 12, 2024

Hello @jgreco,
would you mind chiming in? I mean, is it somehow possible to find out which Supermicro boards with VT-d are safe to run with ESXi?

thanks!

phier · Jan 22, 2024

@Davvo it seems that device ada2 is broken?

phier · Jan 22, 2024

Ah, maybe it's some bug ;/ https://github.com/AnalogJ/scrutiny/issues/522

scrutiny/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md at master · AnalogJ/scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds - AnalogJ/scrutiny

github.com

Davvo · Jan 22, 2024

All the other values are perfectly fine; I wouldn't be so quick to mark it as broken over command timeout errors.

rvassar · Jan 22, 2024

Just a thought... And I admit I haven't gone back to the first page to check, but... Have you checked your TLER settings for this drive?

Checking for TLER, ERC, etc. support on a drive

One of the problems with consumer-grade hard drives is that most of them will hang in the event that they run into an error, and will internally retry the operation, possibly for a minute or more. For a desktop PC, where redundancy does not...

www.truenas.com
