For example, the ZIL appears to cache only writes smaller than 64KB. So if you copy a 10GB file to the server, the ZIL will not be used at all. But if you copy a 32KB PDF, that might go into the ZIL.
Okay, sorry, I've read past this three or four times trying to convince myself that this isn't confusing the issue. You go on to a better explanation, but we should probably not use the word "cache" in conjunction with "ZIL", but rather "log". And I'm not sure where the 64KB idea comes from. And if you copy a 10GB file, the ZIL will probably still be used, but probably only for metadata updates, which are small and relatively trivial. But if you ask for that copy as a sync write, then the ZIL is definitely involved.
Key point: The ZIL is NOT IN THE DATA PATH. In general, nothing ever gets read out of the ZIL during normal operation. Only written.
POSIX provides ways for an application (or anything else) to guarantee that data is committed to stable storage. This is called a synchronous write. What it means is that if I issue the system's write call with the sync flag set, then once that call returns, even if the power fails just a microsecond later, the data is guaranteed to be retrievable when the system comes back up.
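To make that concrete, here's a minimal Python sketch of the two usual ways an application asks for this (the file names are made up for illustration):

import os

# Open with O_SYNC: each write() returns only once the data is on
# stable storage, so losing power right after the call can't lose it.
fd = os.open("/tmp/sync-demo.dat", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"committed before write() returns")
os.close(fd)

# The other common pattern: write normally, then force the commit.
fd = os.open("/tmp/fsync-demo.dat", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"may sit in a buffer...")
os.fsync(fd)  # ...until fsync() returns; then it's guaranteed on stable storage
os.close(fd)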
The problem is, disk is super slow, and if you're asking for a lot of stuff to be written sync, performance tanks. Some things - particularly ESXi - have no clue what they're reading or writing because the I/O is caused by VMs, so they generally ask for EVERYTHING to be written sync.
So ZFS honors sync requests. But it does it cleverly. The POSIX mandate is basically that sync-written data not be lost. Without a dedicated ZIL device, ZFS has a small part of your pool set aside as ZIL. It puts the data there as fast as it can (and then basically ignores it), at which point it can return success to the calling process. But the data is also queued to be written to the actual pool. Since it's already fulfilled its requirement to commit the data to stable storage, ZFS is free to let that pool write happen "a bit later". The in-pool ZIL write still incurs a penalty, though, because it competes with the pool's other I/O. Moving the ZIL to a dedicated device fixes that.
Now, in the event of a crash or reboot or whatever, the pool import is when the ZIL is actually read. ZFS has to make sure that the data that was promised to be committed actually ends up in the pool. So the ZIL is rewound, read back, and ZFS makes sure that those changes are reflected in the pool.
With that having been said...
Because it must also anticipate that it may suddenly need to read data from the L2ARC.
I think you meant "
you must also anticipate" because I don't see any logic to do that in ZFS.
So you are limited to certain write speeds to the L2ARC (we're talking a few MB/sec if I remember correctly). ZFS doesn't do much 'read ahead', so even if you start watching a streaming movie, don't expect the movie to be dumped into the L2ARC and the drives to then spin down from being idle.
This is controlled by several variables.
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
norw: if this is set to 1, it suppresses reads from the L2ARC device while it is being written to.
noprefetch: if this is set to 1, it suppresses L2ARC caching of prefetch buffers (data ZFS read ahead speculatively).
headroom: the number of buffers' worth of headroom the L2ARC tries to maintain. If the ARC is under pressure and there's insufficient headroom, the L2ARC may not get some stuff that it would have been good to get.
The rest of this is complicated and works together.
write_max is the maximum size of an L2ARC write. Typically such a write happens every feed_secs seconds (1 by default). Do NOT set write_max to a very large number without understanding all of the rest of this.
When the L2ARC is cold and no reads have yet happened, write_max is augmented by write_boost. The theory is that if nothing's being read, it's not disruptive to write at a higher rate.
If feed_again is set to 1, ZFS may actually write to the L2ARC as frequently as every feed_min_ms milliseconds; with the default value of 200, that means up to 5x per second.
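To put rough numbers on it, here's a back-of-the-envelope Python sketch of the worst-case feed bandwidth the defaults above allow (assuming the boost applies on every feed while the L2ARC is cold):

# Worst-case L2ARC feed bandwidth from the default tunables above.
l2arc_write_max = 8 * 2**20                 # 8388608 bytes per feed
l2arc_write_boost = 8 * 2**20               # extra allowance per feed while cold
l2arc_feed_min_ms = 200                     # minimum feed interval with feed_again=1
feeds_per_sec = 1000 / l2arc_feed_min_ms    # up to 5 feeds per second

warm = l2arc_write_max * feeds_per_sec
cold = (l2arc_write_max + l2arc_write_boost) * feeds_per_sec
print(f"warm: {warm / 2**20:.0f} MB/sec, cold: {cold / 2**20:.0f} MB/sec")
# warm: 40 MB/sec, cold: 80 MB/sec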
So now, as an administrator, you have to use your head and figure this all out.

Here's the thing. The 8MB write_max is very conservative. But you can't just say "oh yeah my SSD can write at 475MB/sec! I'll set it to THAT!" An L2ARC is only useful if it's offloading a lot of read activity from your main pool, so an easy call is that it would make no sense to be using more than half its bandwidth for writing. Further, ZFS already bumps up the write speed automatically while the L2ARC is cold, through the write_boost mechanism. And the feed_again mechanism allows multiple feeds per second when there's sufficient demand, so at the 200ms minimum, each feed only needs to be one fifth of your per-second budget. So you can safely set write_max to 1/2 of 1/5 of what your SSD can write at and still have it all work very well; for a 475MB/sec SSD, that's 47.5MB/sec. Probably best to pick a power of two, though, so pick 32MB or 64MB. More does NOT make sense.
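For the skeptical, that sizing rule worked out in Python (475MB/sec is just the hypothetical SSD from above, not a recommendation):

ssd_write_speed = 475            # MB/sec the hypothetical SSD can sustain
read_share = 0.5                 # leave at least half the bandwidth for reads
feeds_per_sec = 5                # feed_again with feed_min_ms=200

per_feed_budget = ssd_write_speed * read_share / feeds_per_sec
print(per_feed_budget)           # 47.5 MB per feed

# Round down to a power of two, and express it in bytes for the sysctl:
l2arc_write_max = 32 * 2**20     # 33554432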