OpenZFS Developer Summit 2017 Recap
The 5th annual OpenZFS Developer Summit was held in San Francisco on October 24-25. Hosted by Delphix at the Children’s Creativity Museum in San Francisco, over a hundred OpenZFS contributors from a wide variety of companies attended and collaborated during the conference and developer summit.
iXsystems was a Gold sponsor and several iXsystems employees attended the conference, including the entire Technical Documentation Team, the Director of Engineering, the Senior Analyst, a Tier 3 Support Engineer, and a Tier 2 QA Engineer.
Day 1 of the conference had 9 highly detailed, informative, and interactive technical presentations from companies which use or contribute to OpenZFS. The presentations highlighted improvements to OpenZFS developed “in-house” at each of these companies, with most improvements looking to be made available to the entire OpenZFS community in the near to long term. There’s a lot of exciting stuff happening in the OpenZFS community and this post provides an overview of the presented features and proof-of-concepts.
The keynote was delivered by Mark Maybee who spoke about the past, present, and future of ZFS at Oracle. An original ZFS developer, he outlined the history of closed-source ZFS development after Oracle’s acquisition of Sun. ZFS has a fascinating history, as the project has evolved over the last decade in both open and closed source forms, independent of one another. While Oracle’s proprietary internal version of ZFS has diverged from OpenZFS, it has implemented many of the same features. Mark was very proud of the work his team had accomplished over the years, claiming Oracle’s ZFS products have accounted for over a billion dollars in sales and are used in the vast majority of Fortune 100 companies. However, with Oracle aggressively moving into cloud storage, the future of closed source ZFS is uncertain. Mark presented a few ideas to transform ZFS into a mainstream and standard file system, including adding more robust support for Linux.
Allan Jude from ScaleEngine talked about ZStandard, a new compression method he is developing in collaboration with Facebook. It offers compression comparable to gzip, but at speeds fast enough to keep up with hard drive bandwidth. According to early testing, it improves both the speed and compression efficiency over the current LZ4 compression algorithm. It also offers a new “dictionary” feature for improving image compression, which is of particular interest to Facebook. In addition, when using ZFS send and receive, it will adapt the compression ratio to make the most efficient use of the network bandwidth.
Currently, deleting a clone on ZFS is a time-consuming process, especially when dealing with large datasets that have diverged over time. Sara Hartse from Delphix described how “clone fast delete” speeds up clone deletion. Rather than traversing the entire dataset during clone deletion, changes to the clone are tracked in a “live list” which the delete process uses to determine which blocks to free. In addition, rather than having to wait for the clone to finish, the delete process backgrounds the task so you can keep working without any interruptions. Sara shared the findings of a test they ran on a clone with 500MB of data, which took 45 minutes to delete with the old method, and under a minute using the live list. This behavior is an optional property as it may not be appropriate for long-lived clones where deletion times are not a concern. At this time, it does not support promoted clones.
Olaf Faaland from Lawrence Livermore National Labs demonstrated the progress his team has made to improve ZFS pool imports with MMP (Multi-Modifier Protection), a watchdog system to make sure that ZFS pools in clustered High Availability environments are not imported by more than one host at a time. MMP uses uberblocks and other low-level ZFS features to monitor pool import status and otherwise safeguard the import process. MMP adds fields to on-disk metadata so it does not depend on hardware, such as SAS. It supports multi-node HA configs and does not affect non-HA systems. However, it does have issues with long I/O delays so existing HA software is recommended as an additional fallback.
Jörgen Lundman of GMO Internet gave an entertaining talk on the trials and tribulations of porting ZFS to OS X. As a bonus, he talked about porting ZFS to Windows, and showed a working demo. While not yet in a usable state, it demonstrated a proof-of-concept of ZFS support for other platforms.
Serapheim Dimitropoulos from Delphix discussed Faster Allocation with the Log Spacemap as a means of optimizing ZFS allocation performance. He began with an in-depth overview of metaslabs and how log spacemaps are used to track allocated and freed blocks. Since blocks are only allocated from loaded metaslabs but freed blocks may apply to any metaslab, over time logging the freed blocks to each appropriate metaslab with every txg becomes less efficient. Their solution is to create a pool-wide metaslab for unflushed entries.
Shailendra Tripathi from Tegile presented iFlash: Dynamic Adaptive L2ARC Caching. This was an interesting talk on what is required to allow very different classes of resources to share the same flash device–in their case, ZIL, L2ARC, and metadata. To achieve this, they needed to address the following differences for each class: queue priority, metaslab load policy, allocation, and data protection (as cache has no redundancy).
Isaac Huang of Intel introduced DRAID, or parity declustered RAID. Once available, this will provide the same levels of redundancy as traditional RAIDZ, providing the administrator doubles the amount of options for providing redundancy for their use case. The goals of DRAID are to address slow resilvering times and the write throughput of a single replacement drive being a bottleneck. This solution skips block pointer tree traversal when rebuilding the pool after drive failure, which is the cause of long resilver times. This means that redundancy is restored quickly, mitigating the risk of losing additional drives before the resilver completes, but it does require a scrub afterwards to confirm data integrity. This solution supports logical spares, which must be defined at vdev creation time, which are used to quickly restore the array.
Prakash Surya of Delphix described how ZIL commits currently occur in batches, where waiting threads have to wait for the batch to complete. His proposed solution was to replace batch commits and to instead notify the waiting thread after its ZIL commit in order to greatly increase throughput. A new tunable for the log write block timeout can also be used to log write blocks more efficiently.
Overall, the quality of the presentations at the 2017 OpenZFS conference was high. While quite technical, they clearly explained the scope of the problems being addressed and how the proposed solutions worked. We look forward to seeing the described features integrated into OpenZFS. The videos and slides for the presentations should be made available over the next month or so at the OpenZFS website.