kiler129 · Dabbler · Joined Apr 16, 2016 · Messages: 22
The problem
I'm exploring a reasonable solution for low-cost bulk storage of small video files with metadata. I analyzed the characteristics of the current dataset:
- Metadata files: ~400,000/TB
- each file 1-20kB, with 95% at 5-6kB
- heavily random access pattern with a constant stream of queries
- critical for application performance
- Image thumbnails: ~5,000/TB
- each file 50-1300kB, with 95% at <=500kB
- moderate UX degradation with high access latencies
- Video files: ~70,000/TB
- each file 0.8-30MB in size
- accessed only occasionally and semi-randomly
- minimal UX degradation even with large access latencies
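To sanity-check how much SSD capacity the small files actually need, here is a back-of-envelope calculation using the 95th-percentile figures from the list above (worst-case sizes, before ZFS's own metadata overhead):

```python
# Rough per-TB sizing of the small-file working set, using the
# 95th-percentile figures from the dataset breakdown above.
KB = 1024
GB = 1024 * 1024 * KB

metadata_files_per_tb = 400_000
metadata_file_size = 6 * KB        # 95% of metadata files are 5-6 kB

thumb_files_per_tb = 5_000
thumb_file_size = 500 * KB         # 95% of thumbnails are <= 500 kB

metadata_total = metadata_files_per_tb * metadata_file_size
thumb_total = thumb_files_per_tb * thumb_file_size

print(f"app metadata per TB : {metadata_total / GB:.1f} GiB")
print(f"thumbnails per TB   : {thumb_total / GB:.1f} GiB")
print(f"small-file total    : {(metadata_total + thumb_total) / GB:.1f} GiB")
```

So even generously rounded, the latency-critical files amount to only a few GiB per TB of pool, which is what makes a small SSD pair attractive.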
How is the data used?
For financial reasons, it's not practical to keep the whole pool on SSDs, yet keeping it purely on HDDs makes the application almost unusable. The access patterns defeat any sane caching: small metadata files are requested in semi-random batches of 10,000 files at once, but the same metadata file is rarely hit more than once a week. Once the correct meta-file is found, a stream of sequential video reads is initiated, making HDDs more than suitable for that part.
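To illustrate why caching falls apart, a rough estimate of the reuse rate; the 10 TB pool size here is my hypothetical example, not a measured figure:

```python
# Why a RAM/L2ARC cache barely helps: each batch pulls 10,000
# semi-random files out of the full metadata set, and a given file
# is rarely re-read within a week.
pool_tb = 10                  # hypothetical pool size, for illustration
files_per_tb = 400_000        # metadata files per TB, from the breakdown
batch = 10_000                # files requested per query

total_files = pool_tb * files_per_tb
# Chance that any one cached file shows up in the next
# (assumed uniform-random) batch:
reuse_prob = batch / total_files
print(f"{total_files:,} metadata files; per-file reuse chance per batch: {reuse_prob:.2%}")
```

With a reuse chance that low per batch, a cache would mostly hold files that are never asked for again before they are evicted.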
Possible solution?
Since losing even 100% of the dataset wouldn't be critical, merely annoying and inconvenient, I'm thinking about using:
- Bulk data
- Stripe vdev with 2-3 HDDs with "recordsize=1M"
- Holds the video files and stores them efficiently
- Special device/fusion drive
- Mirror vdev with 2 small SSDs with "special_small_blocks=512K"
- Allows for fast and random access of ZFS metadata
- Will keep all application metadata files on SSD, making the app snappy
- Will keep the majority of image thumbnails as well, decreasing perceived latency
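For concreteness, the layout above might be created roughly like this. This is a sketch only: the device names (`da0` etc.) and the pool name `tank` are hypothetical placeholders, not a tested command line.

```shell
# Stripe of 3 HDDs plus a mirrored SSD special vdev (device names
# are placeholders -- substitute your own).
zpool create tank \
    da0 da1 da2 \
    special mirror nvd0 nvd1

# Large records for the bulk video data:
zfs set recordsize=1M tank

# Route blocks up to 512K (i.e. the metadata files and most
# thumbnails) to the special vdev instead of the HDDs:
zfs set special_small_blocks=512K tank

# Verify the properties took effect:
zfs get recordsize,special_small_blocks tank
```

Note that `special_small_blocks` only affects newly written data, so the properties should be set before the dataset is populated.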
I tested this solution with 1x HDD vs. 1x HDD + 1x SSD special device vs. 1x SSD. With just the HDD the application is unusable, while the performance difference between the hybrid setup and the full-SSD one is barely perceptible.
Questions :)
- I think my setup seems sane, unless I overlooked something major with using a special device. Maybe someone has a better idea, and I'm overcomplicating it?
- The only issue I see is using HDDs in a stripe vs. even RAIDZ1, since the main goal is to maximize storage space within a limited footprint. With the failure of a single HDD, the whole pool will ostensibly be lost. But will damage to some ZFS blocks (e.g. due to bad sectors) be recoverable? Given the nature of the application, the loss of a subset of files is a non-issue and may not even be noticeable to end users. The loss of the whole pool would be disruptive and annoying, but not critical enough to even consider a backup, given the fast data rotation.
Googling this only turned up SEO spam from data-recovery firms.
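As far as I can tell, even without vdev redundancy ZFS checksums will at least pinpoint which files a bad block has hit, rather than silently serving corrupt data. A hypothetical check (pool name `tank` is a placeholder):

```shell
# On a stripe with no redundancy ZFS cannot repair damaged data blocks,
# but a scrub will detect them and name the affected files.
zpool scrub tank

# After the scrub completes, list the casualties:
zpool status -v tank
# "Permanent errors have been detected in the following files:"
# shows the paths of the unrecoverable files.

# Once the damaged files have been deleted or restored, reset the counters:
zpool clear tank
```

That would match the "losing a few files is fine, losing the pool is merely annoying" trade-off described above.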