Recommended topology for a "fusion" pool?

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Interesting. Unexpected, and interesting.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Hmmm ... for me that was precisely what I expected.

First, it is inherited like almost any other ZFS property.
Second, the docs say:

This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while greater blocks will be assigned to the regular class.

So if I set special_small_blocks=4k, I fully expect all blocks <= 4k to end up on the special vdev.
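As a minimal command-line sketch of that behaviour (pool, dataset, and device names here are made up for illustration):

```
# Route blocks of 4K or smaller (plus metadata) to the special allocation class.
zfs set special_small_blocks=4K tank/vms

# The property is inherited by child datasets unless they override it.
zfs get -r special_small_blocks tank/vms

# This only matters if the pool actually has a special vdev, e.g.:
zpool add tank special mirror /dev/ada4 /dev/ada5
```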

I thought that was the point ;)
Patrick
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Well, maybe I misjudged the evidence and it was just metadata piling up on the SSD.
Unfortunately I cannot test more at the moment, at least not with actual hardware. All systems are in production now.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
@Yorick You are probably right and I interpreted my tests wrongly. I could not reproduce the assumed distribution of zvol blocks with a test pool on stock FreeBSD 12.1 and the OpenZFS port.
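For anyone who wants to repeat that kind of experiment, here is a rough sketch using throwaway file-backed vdevs (all names and sizes below are arbitrary; it only illustrates the approach, not the exact test above):

```
# Create throwaway backing files for a test pool.
truncate -s 1G /tmp/data0 /tmp/data1 /tmp/special0 /tmp/special1
zpool create testpool mirror /tmp/data0 /tmp/data1 \
    special mirror /tmp/special0 /tmp/special1

# Send small blocks (and metadata) to the special vdev for this dataset.
zfs create -o recordsize=128K -o special_small_blocks=4K testpool/test

# After writing a mix of small and large files / zvol data,
# compare how much space each vdev has allocated.
zpool list -v testpool
```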
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
Hello,

This thread shed some light on the new options, although some things are still hazy for the time being (the possibility of data going onto that vdev, etc.).

I have a couple of storage devices in my home lab and have been experimenting with things.
So for the time being I have 3 pools:
- one is an 8 TB Seagate Enterprise self-encrypting drive that I keep for backups (dedup would be nice there but I never got it to work; I guess it's because it's an SMR drive or something; see the dedup sketch after this list).
This does not worry me as these are rather warm backups that I offload to another box for safekeeping (but deduplicating the data here would be a plus).


- one pool of 3 × 10 TB (WD Red) in RAIDZ for mild workloads (although it seems more than capable of saturating 10G); I think I'll buy 2 more drives and expand it to 5 spindles in the future.

- the third and last is comprised of 10 × 6 TB SAS HGST hybrid drives (they have onboard NAND flash for caching), 4Kn, plus 1 hot spare (I have another on hand just in case). The vdev created is RAIDZ2, as Z3 seemed a little overkill since they rebuild very quickly and have an extremely low error rate.
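On the dedup note in the first item above: dedup is a per-dataset property, so the drive being SMR shouldn't by itself stop it from working. A minimal sketch of how it is usually evaluated and enabled (pool and dataset names are placeholders):

```
# Simulate dedup first to estimate the table size and achievable ratio.
zdb -S backup

# Enable it only on the dataset that holds the backups.
zfs set dedup=on backup/warm

# The DEDUP column shows the ratio actually achieved.
zpool list backup
```

Keep in mind the dedup table has to fit comfortably in RAM for this to stay usable.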

For the time being I have ordered a couple of 16 GB Optane NVMe drives that I found dirt cheap, just to see how they would perform as a SLOG (ZIL device). I guess I can partition them into 2 × 4 GB on each and mirror them for the SLOG of both pools. I would be left with roughly 8 GB on each that I have no idea how to use.
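A rough sketch of that partitioning on FreeBSD/FreeNAS, assuming the Optane sticks show up as nvd0 and nvd1 (device names, labels, and pool names are illustrative only):

```
# Carve each stick into two 4 GB SLOG partitions with GPT labels.
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 4G -l slog-poolA-0 nvd0
gpart add -t freebsd-zfs -s 4G -l slog-poolB-0 nvd0
# ...repeat on nvd1 with labels slog-poolA-1 and slog-poolB-1...

# Mirror each pool's SLOG across the two sticks.
zpool add poolA log mirror gpt/slog-poolA-0 gpt/slog-poolA-1
zpool add poolB log mirror gpt/slog-poolB-0 gpt/slog-poolB-1
```

Note that a SLOG only helps synchronous writes (sync NFS, or iSCSI with sync=always); 4 GB per pool is plenty, since only a few seconds' worth of in-flight writes ever live there.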
If this works well, I guess I will be looking in the near future at the 1 TB H10 (which has 32 GB of Optane for the SLOG), so I can partition them at 300-500 GB (so they work as MLC drives) for cache, but I have to look in detail at the hardware quirks on them. But this is just a thought.

Then I have 4 more 480 GB SSDs that I'm in a bit of a dilemma about how to use... so I'm open to opinions... They are MLC drives (I'm a little maniacal about not keeping up with the trend, as MLC drives are, I think, the only ones that provide overall good life and performance).

The OS is on a pair of 240 GB SSDs, the network is a pair of 10G converged cards (100G InfiniBand seemed a little overkill for this), plus an HP220 HBA and an LSI 16-port expander card; the CPU is a 6-core 2nd-gen Ryzen with 64 GB of RAM.

My initial thought was to partition them as 100 + 200 GB on each and assign them to cache (L2ARC), but seeing that I can now create hybrid pools, I do not know if they would be put to better use in a different architecture... I do not plan to partition off more than 300 GB on each of them, for endurance and performance's sake.
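For comparison, the two ways those SSD partitions could be attached (all names are placeholders):

```
# Option A: L2ARC. Cache devices hold no unique data, so no redundancy is needed
# and they can be added or removed at any time.
zpool add tank cache gpt/ssd-l2arc-0 gpt/ssd-l2arc-1

# Option B: special vdev for metadata and small blocks. This holds real pool data,
# so it must be redundant (losing it loses the pool), hence the mirror.
zpool add tank special mirror gpt/ssd-special-0 gpt/ssd-special-1
zfs set special_small_blocks=16K tank/vms
```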

My use case is iSCSI for VMs and NFS/CIFS shares (general-purpose traffic goes through the 1G card, iSCSI goes through the 10G ports and is aggregated on the VMware side, as aggregation refuses to work on this card in FreeNAS).
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
But there are two factors to consider...

A) A physical issue with the disk. In this case, SMART shows an error and increments the value showing a portion of the disk is unusable. This happens only when data is read and can't be, or when data is being written and can't be...

Excuse my necromancy, but even disks which look bad (unable to read/write & SMART errors) tend to leave many options. While there are some issues which are big problems (Service Area, drop damage, R/W heads, etc.), here's a little flowchart of diagnostics & solutions.

Serious / global errors are often in the Service Area (or 'SA'): BIOS info, logic, and initialization code, i.e. firmware written to the platters for the manufacturer's convenience, on the outermost tracks of most HDs (usually 0 and 1, with backups on others). This is where data like the G-List (Grown list: errors accumulated after manufacture) and the P-List (Permanent list: defects in the substrate at the time of manufacture) are written. All drives must successfully read the SA before the armature can begin 'flying'. Usually, SA issues don't preclude operation (but they can), as most modules aren't critical, and those which are will often be available in duplicate. But there are particular manufacturers or series that truly suck, such as WD Passports (garbage: no SATA interface & encrypted no matter what!) and Seagate, whose consumer drives are crap.
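For what it's worth, on SCSI/SAS drives those defect lists can be read directly under FreeBSD with camcontrol; SATA drives generally don't expose them this way (the device name is an example):

```
# Grown defect list (G-List): defects remapped since manufacture.
camcontrol defects da0 -f block -G

# Primary defect list (P-List): defects recorded at the factory.
camcontrol defects da0 -f block -P
```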

The HD issues you WANT appear as ATA commands reporting blocks that read slowly, in binary 'pass/fail' terms. It may in fact only be a block that reads slowly; it could mean that the parity for that block couldn't be read (it can also be a firmware issue). It could have been erroneously added to the G-List (the error table in the HD Service Area). While I'm unsure what FreeBSD's timeout rule is, the Windows definition seems to be about average for OSs (perhaps from the driver), and is predicated on blocks which take longer than 600 ms.

HFS+ & APFS may have the same timeout as NTFS... However annoying it is that macOS doesn't natively speak NTFS, and however counterintuitive, NTFS grew out of HPFS (the 'high performance' FS), so it wouldn't be surprising if the rules were recycled. But they are just that: rules.

Unlike the laws of physics, for which there are no penalties for transgressing because they're arranged such that violations are impossible, our rules can fortunately be altered, and they are. I've heard free tools may allow you to modify the timeout duration (maybe ddrescue; HDDSuperTool definitely does), but I'm more familiar with professional tools such as the DDI 4 (DeepSpar Disk Imager 4), which lets you manually control the timeout of every block read. You'd do the first pass with fast timeouts to avoid taxing the drive in one zone before reading the... 97%..? which may be accessible. With a DDI 4, you just grab the tougher data in successive passes. Set the timeouts to 6 s if you want, and combine that with other I.T.T.T. rules (aka If This Then That): e.g., if the previous 10 blocks failed to read with 1000 ms timeouts, skip 100 or 1,000 blocks and try again, maximizing the data read without risking global failure. Importantly, we can also instruct it to IGNORE PARITY in a later pass: for blocks whose parity can't be read for whatever reason, we can read those blocks _x_ times and average the result. Weirdly, my more expensive tool (PC-3000 Express, which costs ~5x a DDI, or ~$17,000) doesn't have that feature, but is nonetheless more capable, perhaps performing some of these tasks more autonomously.
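The free-tool version of that multi-pass strategy looks roughly like this with GNU ddrescue (device and file names are examples; it can't tune the drive's internal timeouts the way a DDI can, it just skips trouble spots and returns to them later, tracking progress in the mapfile):

```
# Pass 1: grab the easily readable bulk quickly, skipping slow areas.
ddrescue -n -b 512 --min-read-rate=1M /dev/da1 disk.img disk.map

# Later passes: go back for the difficult areas with direct access and retries.
ddrescue -d -r3 -b 512 /dev/da1 disk.img disk.map
```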

Obviously, some issues involve a weak or failed transducer / R/W head, or the preamp the heads are connected to (also on the armature). HDD data isn't always digital: were you to hear what the R/W head sends, you'd hear a waveform where a high pitch = 1 & a low pitch = 0. Those are discrete values, versus the frequency itself, which can take any value in that range (i.e. analogue)... it only becomes discrete once the digital state is read via the transducer. If a single head is bad, we can disable it, too, & recover from the remaining heads.

Hope this was useful for someone ...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
the Windows definition seems to be about average for OSs (perhaps from the driver), and is predicated on blocks which take longer than 600 ms.
600 ms under what circumstances? Disks are known to retry sectors for much longer than that: a few seconds with TLER, who-knows-how-long without it. Practical experience with failing disks also doesn't seem consistent with such a short timeout.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
600 ms under what circumstances? Disks are known to retry sectors for much longer than that: a few seconds with TLER, who-knows-how-long without it. Practical experience with failing disks also doesn't seem consistent with such a short timeout.

Good observation. Several of us discussed this back and forth after I read your message, to try to figure out why we all agreed on and are familiar with the 600 ms timeout duration. I didn't find the specs I was looking for. The article others seemed to think 'relevant' (which claims it could take 1-8 minutes as multiple subsystems waited for a series of timeouts before the system reported to the user) only seems partly germane, as it is about the drive timeout, not the timeout for a block.

It also deals with RAID systems, with their own timeout definitions, which reinforces the idea that it's the driver, not the HD, following a rule. (I wish this were a subject where I could just ask and get an answer every time I wanted one, but usually I get marketing from data recovery companies, how-to instructions rather than actual definitions, etc.)
Microsoft 'blog' of some sort on timeouts

The consensus is that the AHCI/ATA driver for Windows takes about 600 ms to time out per block, but it will retry blocks, which extends the overall duration. Were your priority getting any data you possibly can off a drive, you'd obviously want it to be tenacious. From the perspective, however, of preserving the drive for recoverability? You'd want quick timeouts, as the longer this goes on, the more difficult recovery is likely to be.
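Related to the TLER point above: on drives that support SCT Error Recovery Control, the drive-side recovery time can at least be inspected and capped from the OS (the device name is an example; values are in tenths of a second):

```
# Show the current ERC/TLER setting, if the drive supports it.
smartctl -l scterc /dev/ada0

# Cap read and write error recovery at 7.0 seconds each.
smartctl -l scterc,70,70 /dev/ada0
```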

The firmware logs sectors the OS was unable to read. The drive will later attempt re-reading them... If successful (but slow), the sector is relocated and the bad block is added to the SMART stats. There are still many other reasons for which data within a sector that's readable may not be available to the OS: relocation, slow reads, etc. For these reasons, blocks on disk are more than 512 bytes: they also carry the ECC data, the address marker, and some null space, and if any of these can't be read, the data will be ignored (no doubt a simplification, as I already have questions about what controls that decision flow). That info must accompany the data for it to be 'valid' and 'eligible' to be sent or accepted. Again, I'm not sure what manages those requirements, but they are prerequisites for the data to be read.

That's because the total data between one LBA and the next has to include the ECC and the address marker (AM) as well as some null space for the head to reset (as it was explained to me). The total size, if I recall, is between 580 and 600 bytes. Apparently the transducer needs a gap to 'reset' (not that that's a satisfying word) to become prepared to read the next data; I'd assume it also helps the head easily see the boundary where a block begins, as a reference. And yes, the ECC is a checksum written for every block, so the system knows whether the integrity of a block's contents has degraded over time. ZFS, by comparison, has it per file, and the vdevs have it per block; the file-level one is good because the drive's firmware could screw up and hand the abstraction layers the wrong info when relocating a sector, even if it believes everything is perfect.
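The remapping bookkeeping described above is what surfaces in the familiar SMART attributes, which can be checked from the OS side (the device name is an example):

```
# Reallocated_Sector_Ct: sectors already remapped (roughly, the grown defect list).
# Current_Pending_Sector: sectors the drive could not read and wants to remap.
smartctl -A /dev/ada0 | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
```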

I'm attaching a few pictures from the manual for the DeepSpar Disk Imager 4 (DDI 4).

These images are all very intuitive, and although it's obvious there's considerable vernacular to get used to, I'm showing this instead of PC-3000, as PC-3000 is about 100-1,000x more complicated... and there's literally NOTHING about it that's intuitive. :)

I do hope some info is useful or interesting, if nothing else. :)


[Attachments: nine screenshots from the DDI 4 manual]
 