Mixed non-critical data & partial data recovery from stripe pool

kiler129

Dabbler
Joined
Apr 16, 2016
Messages
22
The problem
I'm exploring a reasonable solution for low-cost bulk storage of small video files with metadata. I analyzed data characteristics of the current dataset:
  • Metadata files: ~400,000/TB
    • each file 1-20kB, with 95% at 5-6kB
    • heavily random access patterns with a large number of queries all the time
    • critical for application performance
  • Image thumbnails: ~5,000/TB
    • each file 50-1300kB, with 95% at <=500kB
    • moderate UX degradation with high access latencies
  • Video files: ~70,000/TB
    • each file 0.8-30MB in size
    • accessed only sometimes and semi-randomly
    • minimal UX degradation even with large access latencies

How is the data used?
Due to financial reasons, it's not practical to keep the whole pool on SSDs, but keeping the pool purely on HDDs makes the application almost unusable. The data access patterns prevent any sane caching: small metadata files are requested in semi-random chunks of 10,000 files at once, but the same metadata file is rarely hit more than once a week. Once the correct meta-file is found, a stream of sequential video reads is initiated, for which HDDs are more than suitable.


Possible solution?
Since the data loss of even 100% of the dataset isn't critical, but at worst annoying and inconvenient, I'm thinking about using:
  • Bulk data
    • Stripe vdev with 2-3 HDDs with "recordsize=1M"
    • This will keep video files and store them efficiently
  • Special device/fusion drive
    • Mirror vdev with 2 small SSDs with "special_small_blocks=512K"
    • Allows for fast and random access of ZFS metadata
    • Will keep all application metadata files on SSD, making the app snappy
    • Will keep the majority of image thumbnails as well, decreasing perceived latency
Note: any changes to the application aren't possible. The application intermixes all data types in an obfuscated & deep filesystem tree.
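The layout above can be sketched with standard ZFS commands. This is a hedged sketch only: the pool name "tank" and the device names are assumptions, not from this thread, and the properties can equally be set at creation time with -O.

```shell
# Striped HDD data vdev (no redundancy) plus a mirrored SSD special vdev.
# Device names (sda..sdc, nvme0n1/nvme1n1) are placeholders - adjust to your system.
zpool create tank sda sdb sdc special mirror nvme0n1 nvme1n1

# Large records so bulk video files are stored efficiently on the HDDs.
zfs set recordsize=1M tank

# Route all blocks <=512K (app metadata files, most thumbnails) to the SSD mirror.
# Note: this must stay below recordsize, or ALL data lands on the special vdev.
zfs set special_small_blocks=512K tank
```

With recordsize=1M, any file whose blocks are 512K or smaller lives entirely on the SSD mirror, while larger video files spill onto the HDD stripe.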

I tested this solution with 1x HDD vs. 1x HDD + 1x SSD special device vs. 1x SSD. With just the HDD the application is unusable, while the performance difference between the hybrid setup and the full-SSD one is barely perceptible.


Questions :)
  1. I think my setup is sane, unless I overlooked something major with using a special device. Maybe someone has a better idea and I'm overcomplicating it?
  2. The only issue I see is using HDDs in a stripe vs. even RAIDZ1, since the main goal is to maximize storage space with a limited footprint. If a single HDD fails, the whole pool will presumably be lost. But will damage to some ZFS blocks (e.g. due to bad sectors) be recoverable? Given the nature of the application, loss of a subset of files is a non-issue and may not even be noticeable by end users. The loss of the whole pool would be disruptive and annoying, but not critical enough to even consider backups, due to fast data rotation.
    Searching Google only turned up SEO spam from data-recovery firms.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
If you have a Striped HDD data vDev, and just some bad ZFS blocks, the file(s) affected will be in essence garbage. However, ZFS CAN continue operation for all good files.

If your Striped HDD data vDev disk with the bad block(s) still has spare sectors, you can remove the affected file, (zpool status -v will tell you which they are). Then you restore just THAT file(s). After that, all good again to await the next bad block.
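The recovery procedure described above can be sketched as follows. The pool name "tank" and the file path are assumptions for illustration; the actual paths come from the zpool status output.

```shell
# List permanent (unrecoverable) errors; on a pool with redundancy-less data
# vdevs, this prints the paths of files whose blocks could not be read.
zpool status -v tank

# Remove (or restore over) each affected file - its data is garbage anyway.
# Hypothetical example path:
rm "/tank/videos/damaged-file.mp4"

# Reset the error counters, then scrub so ZFS drops the cleared error entries.
zpool clear tank
zpool scrub tank
```

The rest of the pool stays fully usable throughout; only the listed files are lost.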


I have a miniature fanless desktop as my media server. It has a 1TB mSATA SSD & 2TB SATA laptop HDD. I take 40GBs from each for a Mirrored ZFS OS pool, then Stripe the rest for my ZFS media pool. Works perfectly fine for my use, (1 user streaming 1080p at most).

When I have lost a block (and this has happened perhaps a dozen times), I simply restore the file from backups. (I keep multiple backups.) Generally the file is a larger video file. Since video files take up more blocks, statistically they have a higher chance of hitting a failure.
 

kiler129

Dabbler
Joined
Apr 16, 2016
Messages
22
@Arwen thank you for the insights. I still need to test how it will react with data corruption (by doing my favorite thing - deliberately introducing corruption ;p) and write down recovery procedures for that. This seems however like a very acceptable tradeoff.

To update this thread, in case someone finds it later: I went ahead and tested the system with 12TB of data. It works wonderfully and the application is snappier than it has ever been. It seems to be a perfect solution, with the special device filling at a rate of ~1.2-1.3% of the total data size. With 2x500GB SSDs I can safely scale it to around 25+ TB, which is more than enough.
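The scaling claim above is easy to sanity-check with back-of-the-envelope arithmetic, assuming the observed worst-case ratio of 1.3% holds as the pool grows:

```shell
# Rough special-vdev sizing check (assumption: usage stays near 1.3% of data).
data_tb=25          # target pool data size in TB
ratio_tenths=13     # 1.3% expressed in tenths of a percent
special_gb=$(( data_tb * 1000 * ratio_tenths / 1000 ))
echo "${special_gb} GB needed on the special vdev"   # 325 GB
```

At 25TB that is ~325GB of small blocks, which fits comfortably in a 500GB mirror (usable capacity of a 2-way mirror equals one drive).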
 