dxun · Explorer · Joined Jan 24, 2016 · Messages: 52
I am in the process of designing a storage array for a 5-yr mission in my home(lab). Due to hardware limitations of my current platform (and the current/perceived lack of need for any changes), I am about to make a few compromises, and I would appreciate a discussion/characterisation/opinions/inputs on the alternatives ahead of me. I have an existing ~12 TB pool which I plan to replicate to the backup instance of TrueNAS 12, then reconfigure the new pool and replicate the pool contents back to the new TrueNAS 12 instance.
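For the shuffle itself I am planning roughly the following - a sketch only; the backup pool name (Backup) and the snapshot label are placeholders:
Code:
# replicate everything to the backup instance
zfs snapshot -r Primary@pre-rebuild
zfs send -R Primary@pre-rebuild | zfs recv -F Backup/Primary-copy
# ...destroy and re-create Primary with the new layout, then replicate back
zfs send -R Backup/Primary-copy@pre-rebuild | zfs recv -F Primary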
Ideally, I'd like to prevent any costly issues I might be overlooking right now, rather than having to suffer through them later.
As you can see from my configuration (in signature), this is a venerable platform whose limits I am reaching - but given the mission time frame, I believe it should serve me well.
The storage array will be backed either by the existing 4x10 TB Exos drives in RAIDZ1 or (if I happen to stumble upon a good Black Friday deal) by a 5x10 TB array (the hardware maximum, as the chassis has a 5-unit drive bay).
The expected traffic will be an eclectic mix of:
- several (but < 10) Kubernetes cluster nodes (each node represented by a Proxmox VM)
- syslog/Graylog traffic
- security camera surveillance feeds (BlueIris)
- some occasional light Plex duty (but no transcoding)
I _believe_ this storage array ought to be able to handle the traffic, but using the existing Optane 900p drive efficiently is what I am trying to optimise here.
Aside from myself (and the VMs) there won't be additional concurrent users.
My intent is to partition the drive into two (or perhaps even three) partitions and have it wear multiple caching hats (a command-level sketch follows the list):
- 32 GB SLOG partition
- (optional) 64 GB L2ARC partition
- (rest - between 200 and 260 GB) metadata vdev partition
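Concretely, I imagine the carve-up looking roughly like this on TrueNAS 12 (FreeBSD) - a sketch only, assuming the Optane shows up as nvd0, using my pool name Primary:
Code:
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 32G -l slog0 nvd0    # SLOG partition
gpart add -t freebsd-zfs -s 64G -l l2arc0 nvd0   # optional L2ARC partition
gpart add -t freebsd-zfs -l special0 nvd0        # remainder: metadata vdev
zpool add Primary log gpt/slog0
zpool add Primary cache gpt/l2arc0
# -f needed: zpool objects to a non-redundant special vdev next to raidz1,
# which is exactly the option 1) risk discussed below
zpool add -f Primary special gpt/special0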
From what I have read so far, this would be a tall order for most SSDs but Optane should be able to handle this (?).
As I mentioned before, the hardware is already pretty much maxed out and has no room for much growth - e.g. I am using 12 out of the maximum 16 PCIe lanes, and the RAM is maxed out at 32 GB.
My concerns/questions mostly revolve around that Optane - I know that ideally this should be a mirrored drive but I am not yet prepared to make that jump.
Here is what I see as my options:
1) partition Optane as above but without any mirroring - introducing a single point of failure for the whole pool (if the Optane dies, the metadata vdev dies, and the pool with it)
2) partition Optane with SLOG and L2ARC partitions only and use a pair of Samsung 870 EVO SATA SSDs for metadata vdev
3) partition Optane as above but with mirroring - the ideal solution
As of right now, I am mostly leaning towards 1) - I am prepared to take that risk, as I'll have a backup of the whole pool (or at least of the critical data). I am not fully sure how inconsistent the I/O would be (given the Optane sharing the same NVMe namespace across three partitions with mixed reads/writes) - but I am willing to test this out.
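For that test, I would probably start with something like the following fio run - a sketch only; /dev/gpt/scratch0 stands in for a throwaway partition on the Optane (fio will happily destroy whatever is there):
Code:
# mixed random read/write on one partition, approximating SLOG-style writes
# and L2ARC-style reads contending on the same namespace
fio --name=mixed-hats --filename=/dev/gpt/scratch0 --ioengine=posixaio \
    --rw=randrw --rwmixread=50 --bs=8k --iodepth=4 --numjobs=2 \
    --runtime=120 --time_based --group_reporting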
I am not strongly in favour of 2) - I (intuitively) speculate that, in my case, the difference between the likelihood of a single Optane failing and that of a total mirror failure of both Samsung 870 EVOs is not significant (am I wrong here?).
As for 3), the financial cost of purchasing a Squid PCIe carrier board and a second, identical Optane 900p would exceed 600 USD, and I am not even sure that card would play nicely with the Supermicro board from 2014 (especially as I'd be using the full maximum of all available lanes and there might be... unforeseen problems with this).
So that remains a last resort.
As for the sizing of the metadata vdev: from what I have read and understood, that is still a bit of an art - nevertheless, I think a 200-260 GB partition should be able to host the metadata for ~30 TB of data (especially after I reconfigure the new pool with dataset record sizes appropriate to the file sizes).
I'd appreciate any thoughts and critiques of my thinking here.
To be totally clear - I am not so much concerned about speed as about the reliability reasoning. My understanding is that Optane should be able to handle it. Am I wildly over-estimating its reliability?
As a last item, I am posting the `zdb` output of my existing pool - I have tried to use it to estimate the future metadata vdev size, but even after reading Wendell's articles/posts I am still unclear on what my existing metadata size is - I am not sure what I am looking at, or for. Help/guidance here would also be appreciated; my own tentative reading follows the dump.
Code:
root@truenas[~]# zdb -U /data/zfs/zpool.cache -Lbbbs Primary

Traversing all blocks ...

11.7T completed (19535MB/s) estimated time remaining: 0hr 00min 00sec
        bp count:               68609607
        ganged count:                  0
        bp logical:       8716583483904      avg: 127046
        bp physical:      8554463491584      avg: 124683     compression: 1.02
        bp allocated:    12850438782976      avg: 187297     compression: 0.68
        bp deduped:                    0    ref>1:      0    deduplication: 1.00
        Normal class:    12850438782976     used: 42.89%

        additional, non-pointer bps of type 0:    1260670
         number of (compressed) bytes:  number of bps
                 14:    270 *
                 15:    174 *
                 16:     28 *
                 17:    451 *
                 18:    182 *
                 19:     98 *
                 20:     39 *
                 21:    179 *
                 22:    129 *
                 23:    207 *
                 24:     67 *
                 25:     52 *
                 26:    137 *
                 27:     83 *
                 28:   2919 *
                 29:   4984 *
                 30:     90 *
                 31:    186 *
                 32:    278 *
                 33:     94 *
                 34:    191 *
                 35:    150 *
                 36:   9095 **
                 37:   1758 *
                 38: 326526 ****************************************
                 39:  59143 ********
                 40:    188 *
                 41:    143 *
                 42:     98 *
                 43:    208 *
                 44:     52 *
                 45:   1229 *
                 46:    101 *
                 47:    316 *
                 48:    504 *
                 49:   1543 *
                 50: 148100 *******************
                 51:   3077 *
                 52:   3742 *
                 53:  20066 ***
                 54: 119891 ***************
                 55:  69274 *********
                 56: 128335 ****************
                 57: 307760 **************************************
                 58:    645 *
                 59:    852 *
                 60:    816 *
                 61:    891 *
                 62:   1180 *
                 63:    963 *
                 64:    876 *
                 65:    956 *
                 66:   1042 *
                 67:   1133 *
                 68:    895 *
                 69:    568 *
                 70:    767 *
                 71:    609 *
                 72:    645 *
                 73:    632 *
                 74:   1055 *
                 75:    581 *
                 76:   1052 *
                 77:   1680 *
                 78:    643 *
                 79:    513 *
                 80:    645 *
                 81:    948 *
                 82:    476 *
                 83:    604 *
                 84:   1328 *
                 85:    524 *
                 86:   1075 *
                 87:   2109 *
                 88:    629 *
                 89:    540 *
                 90:    499 *
                 91:    565 *
                 92:    492 *
                 93:    468 *
                 94:   1113 *
                 95:    537 *
                 96:    550 *
                 97:    900 *
                 98:    520 *
                 99:   3047 *
                100:   6092 *
                101:    499 *
                102:    592 *
                103:    660 *
                104:    541 *
                105:    350 *
                106:    505 *
                107:    428 *
                108:    343 *
                109:    410 *
                110:    452 *
                111:    706 *
                112:    392 *
        Dittoed blocks on same vdev: 466811

  Blocks   LSIZE   PSIZE   ASIZE     avg    comp  %Total  Type
       -       -       -       -       -       -       -  unallocated
       2     32K      8K     48K     24K    4.00    0.00  object directory
       2      1K      1K     48K     24K    1.00    0.00  object array
       1     16K      4K     24K     24K    4.00    0.00  packed nvlist
       -       -       -       -       -       -       -  packed nvlist size
       2     64K     24K    120K     60K    2.67    0.00  L1  bpobj
     422   52.6M   3.28M   19.7M   47.7K   16.06    0.00  L0  bpobj
     424   52.7M   3.30M   19.8M   47.8K   15.97    0.00  bpobj
       -       -       -       -       -       -       -  bpobj header
       -       -       -       -       -       -       -  SPA space map header
     108   2.80M    452K   2.65M   25.1K    6.34    0.00  L1  SPA space map
   2.22K   9.47M   9.21M   55.1M   24.8K    1.03    0.00  L0  SPA space map
   2.33K   12.3M   9.65M   57.7M   24.8K    1.27    0.00  SPA space map
       1     12K     12K     24K     24K    1.00    0.00  ZIL intent log
      28   3.50M    112K    448K     16K   32.00    0.00  L5  DMU dnode
      28   3.50M    112K    448K     16K   32.00    0.00  L4  DMU dnode
      28   3.50M    112K    448K     16K   32.00    0.00  L3  DMU dnode
      28   3.50M    112K    448K     16K   32.00    0.00  L2  DMU dnode
      97   12.1M   3.19M   9.82M    104K    3.80    0.00  L1  DMU dnode
   59.8K    957M    239M    957M   16.0K    4.00    0.01  L0  DMU dnode
   60.0K    983M    243M    969M   16.1K    4.05    0.01  DMU dnode
      29     58K     58K    472K   16.3K    1.00    0.00  DMU objset
       -       -       -       -       -       -       -  DSL directory
      25   13.5K      2K     48K   1.92K    6.75    0.00  DSL directory child map
       -       -       -       -       -       -       -  DSL dataset snap map
      26     44K      8K     48K   1.85K    5.50    0.00  DSL props
       -       -       -       -       -       -       -  DSL dataset
       -       -       -       -       -       -       -  ZFS znode
       -       -       -       -       -       -       -  ZFS V0 ACL
     141   4.41M    564K   2.20M     16K    8.00    0.00  L3  ZFS plain file
   8.20K    263M   37.8M    151M   18.4K    6.94    0.00  L2  ZFS plain file
    348K   10.9G   3.27G   13.1G   38.4K    3.33    0.11  L1  ZFS plain file
   64.3M   7.92T   7.78T   11.7T    186K    1.02   99.88  L0  ZFS plain file
   64.7M   7.93T   7.78T   11.7T    185K    1.02   99.99  ZFS plain file
   4.05K    130M   16.2M   64.8M   16.0K    8.00    0.00  L1  ZFS directory
    709K    511M   69.8M    558M     804    7.32    0.00  L0  ZFS directory
    713K    641M   86.1M    622M     893    7.45    0.01  ZFS directory
      22     22K     22K    352K     16K    1.00    0.00  ZFS master node
       -       -       -       -       -       -       -  ZFS delete queue
       -       -       -       -       -       -       -  zvol object
       -       -       -       -       -       -       -  zvol prop
       -       -       -       -       -       -       -  other uint8[]
       -       -       -       -       -       -       -  other uint64[]
       -       -       -       -       -       -       -  other ZAP
       -       -       -       -       -       -       -  persistent error log
       2    256K     20K    120K     60K   12.80    0.00  SPA history
       -       -       -       -       -       -       -  SPA history offsets
       -       -       -       -       -       -       -  Pool properties
       -       -       -       -       -       -       -  DSL permissions
       -       -       -       -       -       -       -  ZFS ACL
       -       -       -       -       -       -       -  ZFS SYSACL
       -       -       -       -       -       -       -  FUID table
       -       -       -       -       -       -       -  FUID table size
       1   1.50K   1.50K     24K     24K    1.00    0.00  DSL dataset next clones
       -       -       -       -       -       -       -  scan work queue
       -       -       -       -       -       -       -  ZFS user/group/project used
       -       -       -       -       -       -       -  ZFS user/group/project quota
       -       -       -       -       -       -       -  snapshot refcount tags
       -       -       -       -       -       -       -  DDT ZAP algorithm
       -       -       -       -       -       -       -  DDT statistics
       -       -       -       -       -       -       -  System attributes
       -       -       -       -       -       -       -  SA master node
      23   34.5K   34.5K    368K     16K    1.00    0.00  SA attr registration
      44    704K    176K    704K     16K    4.00    0.00  SA attr layouts
       -       -       -       -       -       -       -  scan translations
       -       -       -       -       -       -       -  deduplicated block
       -       -       -       -       -       -       -  DSL deadlist map
       -       -       -       -       -       -       -  DSL deadlist map hdr
       1   1.50K   1.50K     24K     24K    1.00    0.00  DSL dir clones
       -       -       -       -       -       -       -  bpobj subobj
      12    304K     48K    288K     24K    6.33    0.00  L1  deferred free
      19    402K     86K    528K   27.8K    4.67    0.00  L0  deferred free
      31    706K    134K    816K   26.3K    5.27    0.00  deferred free
       -       -       -       -       -       -       -  dedup ditto
      10   37.5K     15K    144K   14.4K    2.50    0.00  other
      28   3.50M    112K    448K     16K   32.00    0.00  L5  Total
      28   3.50M    112K    448K     16K   32.00    0.00  L4  Total
     169   7.91M    676K   2.64M     16K   11.98    0.00  L3  Total
   8.23K    266M   37.9M    151M   18.4K    7.01    0.00  L2  Total
    353K   11.0G   3.29G   13.1G   38.2K    3.35    0.11  L1  Total
   65.1M   7.92T   7.78T   11.7T    184K    1.02   99.89  L0  Total
   65.4M   7.93T   7.78T   11.7T    183K    1.02  100.00  Total

Block Size Histogram

  block   psize                 lsize                 asize
   size   Count   Size   Cum.   Count   Size   Cum.   Count   Size   Cum.
    512:   159K  79.4M  79.4M    159K  79.4M  79.4M       0      0      0
     1K:  84.6K   103M   182M   84.6K   103M   182M       0      0      0
     2K:  72.3K   189M   372M   72.3K   189M   372M       0      0      0
     4K:   359K  1.45G  1.81G   65.6K   362M   734M       0      0      0
     8K:   508K  5.34G  7.15G   68.8K   740M  1.44G    453K  3.54G  3.54G
    16K:   587K  12.6G  19.8G    112K  2.02G  3.46G    577K  10.8G  14.3G
    32K:   379K  16.7G  36.5G    392K  12.6G  16.1G    841K  35.1G  49.4G
    64K:   647K  58.6G  95.1G   24.3K  2.17G  18.3G    457K  40.5G  89.8G
   128K:  61.5M  7.69T  7.78T   63.3M  7.91T  7.93T   62.0M  11.6T  11.7T
   256K:      0      0  7.78T       0      0  7.93T       0      0  11.7T
   512K:      0      0  7.78T       0      0  7.93T       0      0  11.7T
     1M:      0      0  7.78T       0      0  7.93T       0      0  11.7T
     2M:      0      0  7.78T       0      0  7.93T       0      0  11.7T
     4M:      0      0  7.78T       0      0  7.93T       0      0  11.7T
     8M:      0      0  7.78T       0      0  7.93T       0      0  11.7T
    16M:      0      0  7.78T       0      0  7.93T       0      0  11.7T

                           capacity   operations   bandwidth  ---- errors ----
description          used  avail      read  write  read  write  read write cksum
Primary             11.7T  15.6T       924      0  6.14M     0      0     0     0
  raidz1            11.7T  15.6T       924      0  6.14M     0      0     0     0
    /dev/gptid/73288854-d4ab-11e9-b86b-0cc47a0b7772.eli     307      0  2.04M     0      0     0     0
    /dev/gptid/74117816-d4ab-11e9-b86b-0cc47a0b7772.eli     309      0  2.05M     0      0     0     0
    /dev/gptid/74f89b34-d4ab-11e9-b86b-0cc47a0b7772.eli     307      0  2.05M     0      0     0     0
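If I am reading the Blocks table correctly, the existing metadata footprint is everything except the "L0 ZFS plain file" row (that row being the actual file data) - roughly:
Code:
L1 ZFS plain file       13.1G  (ASIZE)
L2/L3 ZFS plain file    ~153M
DMU dnode (total)        969M
ZFS directory (total)    622M
SPA space map           57.7M
everything else         <100M
-----------------------------
total                  ~15 GB  for ~11.7 TB allocated (~0.13%)
Scaled linearly to ~30 TB, that would be on the order of 40 GB of metadata, which would make even a 200 GB partition generous - unless I also push small blocks onto it via special_small_blocks. Am I reading this right?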