ZFS and lots of files

jpaetzel

FreeNAS Core Team
iXsystems
Joined
May 27, 2011
Messages
226
Thanks
89
#1
We were recently asked how many files ZFS can support. The theoretical limits of ZFS are impressive, but as they say, in theory, theory is better than practice, but in practice it rarely is.

We decided to build a test rig to see how ZFS would fare with lots and lots of files.

After two days of running a shell script we had a pool with one million directories, with one thousand files in each directory, for a total of one billion files.

Our test system had 192 GB of RAM, and even with a 150 GB ARC the metadata cache was blown out of the water. An ls -f in the directory with one million directories took about 30 minutes. The same operation over gigE NFS took around 7 hours. A find . -type f took several hours locally, and ran overnight via NFS.
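For anyone who wants to repeat the listing test at a smaller scale, here is a minimal sketch. It is not the script used for the test above, and the count_entries name is made up for illustration. It relies on ls -f listing entries unsorted (which also implies -a), so nothing has to be held in memory for sorting.

```sh
#!/bin/sh
# count_entries DIR: count the entries in DIR without sorting them.
# "ls -f" prints names in directory order and implies -a, so "." and ".."
# show up in the listing and are subtracted here.
# Assumes no file name contains a newline.
count_entries() {
    echo $(( $(ls -f "$1" | wc -l) - 2 ))
}
```

Running it under time(1), e.g. time count_entries /mnt/tank/bigdir (a hypothetical path), gives a number comparable to the ls -f timing quoted above.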

I guess we expected that sort of behavior. The good news is that operations that did not hit that big directory were not painful at all. Snapshot creation and deletion operated normally, and metadata operations on the 1000-file subdirectories had decent performance. All in all, I'd say ZFS handles the pathological situation fairly well.
 

cyberjock

Moderator
Joined
Mar 25, 2012
Messages
19,156
Thanks
1,835
#2
Impressive! My server has just short of 1 million files. I can't imagine trying to organize 1 billion files :p
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,428
Thanks
2,767
#3
That's lots of files in one directory (okay, lots of directories in one directory, if you want to be technical). While that's a vaguely interesting performance test, from my point of view, it'd be more interesting to have two levels of a thousand directories each and then a thousand files in each, and see how that compares speedwise. This would be much more representative of the way many storage systems actually work, since pooling a million entries in a single directory is uncommon. And of course it'd be interesting to keep going beyond that.
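A rough sketch of what that two-level layout could look like as a script. The make_tree name and its arguments are hypothetical, and the sizes are parameters so it can be tried small before anyone commits two days to it.

```sh
#!/bin/sh
# make_tree ROOT FANOUT FILES
# Creates FANOUT x FANOUT directories under ROOT (two levels deep) and
# FILES empty files in each leaf directory.
# "make_tree ROOT 1000 1000" would be the 1000 x 1000 x 1000 layout.
# Names are zero-padded to three digits, so keep FANOUT and FILES <= 1000.
make_tree() {
    root=$1 fanout=$2 files=$3
    a=0
    while [ "$a" -lt "$fanout" ]; do
        b=0
        while [ "$b" -lt "$fanout" ]; do
            leaf=$(printf '%s/%03d/%03d' "$root" "$a" "$b")
            mkdir -p "$leaf"
            (
                # work inside the leaf so files are created by short name
                cd "$leaf" || exit 1
                c=0
                while [ "$c" -lt "$files" ]; do
                    : > "$(printf '%03d' "$c")"
                    c=$((c + 1))
                done
            )
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}
```

Something like make_tree /mnt/tank/test 10 10 (hypothetical path) creates 100 leaf directories with 10 files each, which is enough to check the layout before scaling up.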

Without any further insight into your test, interesting questions you could have answered go unanswered, such as

Did you precreate the million directories, or build them as you went? Did your script require traversal of the million-entry directory for each entry, or did you chdir into each directory?

i.e. the following scripts are VERY different

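#! /bin/sh -
# Variant 1: every touch names the file by its full relative path, so each
# create has to look the subdirectory up in the million-entry parent.

pd=foo
i=0

while [ "$i" -lt 1000000000 ]; do
    # 9-digit index split as DDDDDD/FFF: directory 000000-999999, file 000-999
    f=`printf %09d "$i" | sed 's:^\(......\):\1/:'`
    d=`dirname "${f}"`
    # create the subdirectory the first time it is seen
    if [ "${pd}" != "${d}" ]; then
        mkdir "${d}"
        pd="${d}"
    fi
    touch "${f}"
    i=`expr "${i}" + 1`
done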


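#! /bin/sh -
# Variant 2: chdir into each subdirectory once and create its files by short
# name, so the big parent directory is only touched once per subdirectory.

i=0

while [ "$i" -lt 1000000 ]; do
    d=`printf %06d "$i"`
    mkdir "${d}"
    cd "${d}" || exit 1
    j=0
    while [ "$j" -lt 1000 ]; do
        f=`printf %03d "$j"`
        touch "${f}"
        j=`expr "${j}" + 1`
    done
    cd ..
    i=`expr "${i}" + 1`
done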


The primary directory traversal in the first one will be a bit of a killer :) but is also interesting information about how efficient ZFS actually is; people like me are the reason that things like UFS_DIRHASH were invented, hahaha.
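To make the path mapping in the first script easier to see, here is the same printf/sed step pulled out into a small function. The path_for name is just for illustration.

```sh
#!/bin/sh
# path_for N: print the relative path the first script uses for file number N.
# The index is zero-padded to 9 digits, then a "/" is inserted after the
# first 6 digits, giving <directory>/<file>.
path_for() {
    printf '%09d\n' "$1" | sed 's:^\(......\):\1/:'
}

path_for 0          # 000000/000
path_for 1234567    # 001234/567
path_for 999999999  # 999999/999
```

Every one of those paths goes through the top-level directory, which is why the first script keeps hitting the million-entry directory while the second one does not.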
 
Joined
May 21, 2012
Messages
6
Thanks
0
#4
Would it be possible to put together a table of system requirements, i.e. for X files (in defined steps) you need Y amount of RAM? That could be useful to many people when building a system.

Greetings
Gregor
 

bmcclure937

FreeNAS Experienced
Joined
Jul 13, 2012
Messages
110
Thanks
3
#5
Crazy experiment! That is so awesome. :D
 
Joined
Aug 23, 2012
Messages
21
Thanks
0
#6
AYE, a performance sub forum. Exactly what I was after. So I'm having an issue with copying many small files.

So, after doing some research and many many tests, I can push to and receive from FreeNAS (8.2 and now 8.3b2, with 6 x 3tb 2d 5400rpm), maxing out gigabit in both directions. This is when transferring large files.

If I swap to transferring many small files (e.g. backing up c:\users\... in Win7, where there are a LOT of small files in various directories) there is an epic bottleneck. Transfer speed remains excellent when it hits a large file; however, the creation or copy of small files compares to copying to an old-school 8gb USB thumb drive. That's the best comparison I can give.

I'm from a Microsoft background with some dabbling in Linux and have done quite a bit of reading on the tunables, all the various settings I can mess with, etc. Nothing seems to be my silver bullet, however.

For some context: I can copy and paste, over the network, the directory I'm trying to copy to the NAS to ANY computer, regardless of the connection (wireless was included in my testing), faster than to the FreeNAS setup.

The issue seems to be purely with the number of files/directories being sent to the NAS at once. As mentioned, single file transfers are amazing. If this is by design I'll quit trying to figure it out, else someone please suggest something.
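One way to put hard numbers on this is to write the same amount of data as many small files and as one large file, and time both. Below is a minimal sketch for a Unix-like client with the share mounted, or for the NAS itself to take the network out of the picture. The make_small_files name and the paths in the usage note are hypothetical.

```sh
#!/bin/sh
# make_small_files DIR COUNT BYTES: write COUNT files of BYTES bytes each
# into DIR (created if missing). File contents are zeros.
make_small_files() {
    dir=$1 count=$2 bytes=$3
    mkdir -p "$dir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/zero of="$dir/small.$i" bs="$bytes" count=1 2>/dev/null
        i=$((i + 1))
    done
}
```

For example, compare time make_small_files /mnt/share/test 10000 4096 with time dd if=/dev/zero of=/mnt/share/test/big bs=4096 count=10000. Both write about 40 MB. Note that zeros compress extremely well, so with compression enabled on the dataset the large-file number will look better than it would for real data.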

Cheers!
 

cyberjock

Moderator
Joined
Mar 25, 2012
Messages
19,156
Thanks
1,835
#7
There is no easy silver bullet for small files. My understanding of CIFS is that it opens a connection, sends the file, then closes the connection for each file. The opening and closing takes time to be acknowledged. AFAIK there isn't much you can do about it since it's a protocol issue.

If you don't have >6TB of RAM that can cause performance issues. There is a penalty for lots of small files, but since you didn't provide system specs, FreeNAS version, or hard numbers on speed I really can't tell you much more than that.
 
Joined
Aug 23, 2012
Messages
21
Thanks
0
#8
Heya NS, ah sorry for not including all the specs, I'll include the setup below. The reasoning for the slowness using CIFS sounds acceptable, being that it is Linux-based. I then assume Windows does it differently, because I don't even get the chance to read the file names when copying to another Windows box heh.

Core i3, 8gb memory, 6 x 3tb in RaidZ1, 32gb usb3 for system, gigabit network.
 

cyberjock

Moderator
Joined
Mar 25, 2012
Messages
19,156
Thanks
1,835
#9
> Heya NS, ah sorry for not including all the specs, I'll include the setup below. The reasoning for the slowness using CIFS sounds acceptable, being that it is Linux-based. I then assume Windows does it differently, because I don't even get the chance to read the file names when copying to another Windows box heh.
>
> Core i3, 8gb memory, 6 x 3tb in RaidZ1, 32gb usb3 for system, gigabit network.
You don't have >6TB of RAM. I just noticed the TB instead of GB, but I think anyone reading this thread will figure it out ;).

Yeah, sounds to me like it's just Windows giving you problems. I know there are some registry tweaks you can do that may help (how much depends...), but I don't typically recommend people do registry tweaks unless they understand in great detail what the change does. Some of those tweaks are a tradeoff between various possible situations and you shouldn't go changing stuff just because you want "moar speed!".

I just suck it up when I do a large number of small files on my Windows box.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,428
Thanks
2,767
#10
> My understanding of CIFS is that it opens a connection, sends the file, then closes the connection for each file.
Um? It shouldn't. Generally speaking, when you attach to the CIFS server, you create a TCP connection, and accesses happen over that connection.

That connection is indeed a limiting factor; any time you use TCP, you get into questions about rtt, anything else that impacts TCP, and in the case of Samba, also questions about single client performance, since Samba uses a one process per client model, so a single client cannot take advantage of multiple cores on the server.

In theory ZFS could be very fast for smaller files. I don't know how it works out in practice.
 
Joined
Jun 22, 2011
Messages
12
Thanks
0
#11
What Phonoflux sees is the difference between how FreeNAS and Windows handle file writes.

Windows delays writes to disk, which is fast even with many small files.
FreeNAS (and *NIX operating systems in general) handles file writes differently: if it isn't safely on disk, it didn't happen at all.
This means whatever you are writing has to be safely written to disk, according to the filesystem and the OS, before your Windows client will receive an ack (and rtt on a local gigabit LAN is quite fast). Writing to mechanical disks, this obviously takes several milliseconds per file, but otherwise your data wouldn't be safe in case of a power failure (in Windows the data in the write cache is quite volatile).
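A back-of-the-envelope sketch of why that per-file cost adds up. The est_seconds name and the 10 ms figure are purely illustrative assumptions, not measurements from this system.

```sh
#!/bin/sh
# est_seconds FILES MS_PER_FILE: total seconds if every file has to wait
# MS_PER_FILE milliseconds for its write to be committed, one after another.
est_seconds() {
    echo $(( $1 * $2 / 1000 ))
}

est_seconds 100000 10   # 1000
```

So 100,000 small files at an assumed 10 ms each is already well over a quarter of an hour, no matter how fast the network is or how small the files are.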

Windows by default has its write cache disabled for USB thumb drives; that's why the old-school 8GB thumb drives are that slow.

So in fact your FreeNAS machine appears to be working perfectly normally.

A small factor might be the somewhat odd configuration of 6 disks in RaidZ1, as now every piece of data is striped over 5 locations + parity in a 6th. Computers like numbers like 2+1, 4+1, 8+1 better.

(edit: actually, writes are cached in the ZIL in RAM, but that cache is small; when writing a mix of some large and many small files it fills up fast.)
 
Joined
Sep 5, 2012
Messages
26
Thanks
1
#12
> What Phonoflux sees is the difference between how FreeNAS and Windows handle file writes.
>
> Windows delays writes to disk, which is fast even with many small files.
> FreeNAS (and *NIX operating systems in general) handles file writes differently: if it isn't safely on disk, it didn't happen at all.
This is filesystem and storage subsystem dependent. OSX and Windows can do the same thing depending on how you set things up.

> This means whatever you are writing has to be safely written to disk, according to the filesystem and the OS, before your Windows client will receive an ack (and rtt on a local gigabit LAN is quite fast). Writing to mechanical disks, this obviously takes several milliseconds per file, but otherwise your data wouldn't be safe in case of a power failure (in Windows the data in the write cache is quite volatile).
Don't assume that drive vendors always tell the truth.

> Windows by default has its write cache disabled for USB thumb drives; that's why the old-school 8GB thumb drives are that slow.
Not true. `Performance mode` is (or can be) now enabled by default on Vista and 7 depending on the device, and once again, this is filesystem dependent (FAT* in general requires fsck'ing more than NTFS because it's not a journaled FS). USB flash is just slow because it's slow tech.

> So in fact your FreeNAS machine appears to be working perfectly normally.
>
> A small factor might be the somewhat odd configuration of 6 disks in RaidZ1, as now every piece of data is striped over 5 locations + parity in a 6th. Computers like numbers like 2+1, 4+1, 8+1 better.
There's some misinformation/oversimplification here. See the ZFS Best Practices Guide for the definitive answer on what you should use with ZFS.

> (edit: actually, writes are cached in the ZIL in RAM, but that cache is small; when writing a mix of some large and many small files it fills up fast.)
Ultimately, what you're describing is the fact that Samba and the underlying filesystem are forcing stable writes / commits instead of allowing for unstable writes, whereas Windows (and FreeBSD when used with UFS) allows unstable writes to be done on local media. That and the network gear in the middle might not be super fantastic, amongst other things.

jgreco is correct: samba will keep one connection open per client (single threaded) and send data over that connection (see tcpdump/wireshark for references). There are several other connections opened for NMB/SMB management traffic (and SMB is super chatty), but there's one devoted data socket per connection with samba.

Assuming parameters are set correctly with samba per your network connection and depending on your machine, you could get ok-ish (saturating multiple GB) speeds.
 
Joined
May 24, 2013
Messages
2
Thanks
0
#13
Reviving this. I just stumbled across this post and it gave me a chuckle.

I'm the one that asked about a billion directories/files. Rest assured we're a little more sane with our directory structure than what was tested here. :D
We've created 62 filesystems and put a directory structure like /[a-zA-Z0-9]^7
With a final subdir that is a 10 character name where the files go. So no more than 3844 subdirs at the end of the branch.

It's not.. uhh.. feasible to count all the directories/files. But our current inode count is 413880422

This is all currently on a mirrored pair of generic FreeBSD 9.0 machines running ZFS, and we're in the process of copying it all over to a mirrored pair of TrueNAS hosts. Due to the lack of NFSv4 in TrueNAS, we're having to consolidate all 62 filesystems down to one, which means we're rsyncing everything over rather than using zfs send/receive. This is unfortunate for several reasons.

Problem we're having right now is the very slow read io performance on the source. We're averaging about 8MB/sec read and ~300 read ops/s. I've tried setting primarycache=metadata in hopes it would speed up rsync. It's only got 12GB of memory so I also tried secondarycache=metadata but neither setting has improved things.
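For reference, the property changes mentioned above look roughly like the commands below. The sketch only prints them for review and does not run them. cache_cmds and the tank/data dataset name are placeholders. Note that secondarycache only has an effect if the pool has an L2ARC (cache) device.

```sh
#!/bin/sh
# cache_cmds DATASET: print (not run) the zfs commands for checking the
# current cache settings and restricting both caches to metadata.
cache_cmds() {
    ds=$1
    echo "zfs get primarycache,secondarycache $ds"
    echo "zfs set primarycache=metadata $ds"
    echo "zfs set secondarycache=metadata $ds"
}

cache_cmds tank/data
```

Setting either property back to all restores the default behaviour.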

So I'm fishing for suggestions. :)

I'm starting to wonder if I would have been better off doing a zfs send over to the secondary TrueNAS and then rsyncing it back to the primary TrueNAS from the secondary *and* the old primary in parallel. I haven't tested an rsync between the two yet, so no clue if that would yield sufficient performance.
 

MtK

FreeNAS Experienced
Joined
Jun 22, 2013
Messages
471
Thanks
29
#14
This scenario is not so theoretical. Take, e.g., a backup tool like BackupPC.
Inside its TopDir, it creates 5 directories:
  1. pc - the actual backups, keeping a certain amount of history.
  2. log - logs of the successful and failed backups.
  3. trash - constantly filled with old data, which is then removed in the process.
  4. cpool/pool - 2 directories consisting of hardlinks to maintain a list of duplicate files.
Those hardlinks are hashed and stored in a 3-level directory structure, where the directory at level i is named after the i-th character of the hardlink hash. So, for example, hash abc1234567890 would be stored as a/b/c/abc1234567890.
This gives us a structure of 2 * 16^3 directories + those of the actual backups (the pc dir).
And therefore, yes, doing an ls/du/find/cp/tar/mv/etc. is very time consuming.
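The hash-to-path mapping described above can be sketched as follows. This follows the description in this post, not BackupPC's actual code, and the pool_path name is made up.

```sh
#!/bin/sh
# pool_path HASH: print the 3-level path for HASH, using its first three
# characters as the directory names.
pool_path() {
    h=$1
    c1=$(printf '%s' "$h" | cut -c1)
    c2=$(printf '%s' "$h" | cut -c2)
    c3=$(printf '%s' "$h" | cut -c3)
    printf '%s/%s/%s/%s\n' "$c1" "$c2" "$c3" "$h"
}

pool_path abc1234567890        # a/b/c/abc1234567890

# 16 possible hex characters at each of 3 levels, for both pool and cpool:
echo $(( 2 * 16 * 16 * 16 ))   # 8192
```

So the pool side alone is 8192 leaf directories, before counting anything under pc.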
I just (tried to) move a 2TB backup pool from a single-drive ext3 onto ZFS-over-NFS, which was roughly estimated at 500+ hours (i.e. < 10MB/s transfer rate), probably because of the bad IO of the single disk.
But with roughly 10 machines to back up (= 350GB for a single full backup with compression=3), even if I start a clean pool on the ZFS-over-NFS, I get the same transfer rate (see attached image) and the same IO bottleneck on ls/du/find/cp/tar/mv/etc., even when running them directly on the ZFS machine.
Just as an addition: the largest backup, at 114GB, contains 2061995 files and took 580 minutes to complete, so... 3.28MB/s. And to continue the discussion started above: keep 4 full backups and 6 incrementals, as I do for each machine, don't forget to add the pool+cpool hardlinks, and you'll get... a lot of files!

I don't have the actual file count (it might take a few days), but size wise:
  • 71 full backups of total size 2323.22GB (prior to pooling and compression).
  • 254 incr backups of total size 875.11GB (prior to pooling and compression).
So an estimate would be:
if 115GB represents 2,000,000 files, we are talking about 50,000,000 files (prior to pooling and compression).
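The arithmetic behind those two figures, as a small sketch. The function names are made up, and 1GB is taken as 1000MB, which is what reproduces the 3.28MB/s figure.

```sh
#!/bin/sh
# est_files_millions TOTAL_GB SAMPLE_GB SAMPLE_FILES
# Scale the file count of one sample backup up to the total size; result
# is in millions of files.
est_files_millions() {
    awk -v total="$1" -v sample="$2" -v files="$3" \
        'BEGIN { printf "%.1f\n", total / sample * files / 1000000 }'
}

# mb_per_sec GB MINUTES: average transfer rate, with 1GB = 1000MB.
mb_per_sec() {
    awk -v gb="$1" -v min="$2" \
        'BEGIN { printf "%.2f\n", gb * 1000 / (min * 60) }'
}

# 2323.22GB of fulls + 875.11GB of incrementals = 3198.33GB
est_files_millions 3198.33 115 2000000   # 55.6
mb_per_sec 114 580                       # 3.28
```

That works out to roughly 55 million files, in the same ballpark as the estimate above, and this is before the pool/cpool hardlinks are added.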
 
