Diaries of a n00b.

Huib · Nov 6, 2016


I have no idea where to post this or if it is even appreciated, but I guess it won't hurt anyone.

So this is my little story about how I thought I could quickly replace my windows file server with a freenas box.

Or rather (because this was my first idea) complement my windows file server with a freenas box. The idea was that the windows box would simply replicate to the freenas box, so I could use snapshots to restore files that got deleted by users, and have a backup by means of a second off-site box and server replication…..

I had no idea that this would get so complicated!!!
I mean.. how hard can it be? Get a box, put in some drives and turn it on right?

So after watching some youtube movies (I know) and reading several best practice topics on the forums I was happily testing on a virtual box. I had 6 virtual drives in raidz2 as my freenas test machine. I was making snapshots and had active directory working! Life was good!

Until…..

I realized that in order to make a snapshot instantaneously you need to have the data on the freenas box… That sounds logical and simple.. just copy your data over there. But I hit a small bottleneck. In order to copy over the data in an efficient manner, you first need to scan for changes. I’ve tried rsync and syncthing (getting these working as a freenas n00b is a topic on its own). The problem was that the scanning part of the process took HOURS and HOURS for just a subset of the data on my windows machine (about 100 gigs out of 1 TB of total data). I would be better off making a daily backup!!

That’s not what I was going for so I then thought I’ll just copy everything over to the freenas and let everyone work on that instead of the windows box. Problem solved!

So being quite proud of myself I posted my intended hardware with the note that I intended to actually use it as the filesystem that everyone would work on.
I got slaughtered….

I posted an x11 system with a i3 6700 and 32 gigs of ram. This because I read that clock is more important than cores and that even if 1 gig per TB is recommended, you can never have enough ram. I intended to use raidz2 with 6 drives (powers of 2 with redundancy on top of that. I did my homework, even if I now see that this rule goes out of the window if you use compression. But I digress).

Note that this system would have about 10 TB of usable space, where we accumulated about 1 TB of data in the last few years. The amount of space is overkill!

Little did I know that the performance of raidz2 apparently is sub optimal for IOPS. The general advice was: use mirrors!!!

Use mirrors.. What the h3ll does that mean? Back to the forums and a lot of other resources on the internet, and after a lot of reading I concluded (I am a n00b so I will not say I found out, since I’m not 100% sure that what I jot down here is correct) that raidz2 generally has the IOPS performance of a single drive, and that if you use several mirrors in one pool, you start to get the benefits of striping in raid.

That made me think.
What’s the actual situation on my network now? (I’m a manager, not a sysadmin.) So I found out that we have two kinds of use cases.
1) engineers working on autocad files that are between 2 and 15 MB per file (ballpark)
2) engineers that work in an engineering platform that had the great idea to write thousands and thousands of files with extremely small file sizes. I knew this but I never really checked how small these files are.

Use case 1 I’m not worried about. Even with 20 MB/s transfer speeds over a 1 gig lan this is no problem whatsoever. Less than a second to open a file is acceptable for me as they will work for many minutes before opening the next file. It’s not a workflow bottleneck.

Use case 2 needed some investigation.
How small are the files? How many IOPS does this actually come down to?
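
Both questions can be answered by measuring a directory directly. A minimal sketch, assuming Python 3 is available on the machine that holds the data; `survey_tree` and `describe` are made-up helper names, not part of any tool mentioned in this thread:

```python
import os

def survey_tree(root):
    """Return (file_count, total_bytes) for every regular file under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                total_bytes += os.path.getsize(full)
            except OSError:
                continue  # skip files that vanish or cannot be read
            file_count += 1
    return file_count, total_bytes

def describe(root):
    """Summarise a directory as a one-line string."""
    count, size = survey_tree(root)
    average = size / count if count else 0.0
    return "%d files, %.1f MB total, %.0f bytes per file on average" % (
        count, size / 1e6, average)
```

Calling `describe()` on a project folder gives the file count and average file size in one go, which is more telling than the folder size alone.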

So I took a look at one typical directory on which an engineer would work. I found a 100 MB folder. Nice and small… no big deal. Only 76000 files… wait what!!!!
Then I found that a spindle can handle about 100 IOPS. Time to open a beer…..
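
To put those two numbers together, here is a rough back-of-the-envelope sketch. It assumes the worst case of one disk operation per file and no caching at all, which is a simplification for illustration and not a measurement:

```python
def seconds_to_touch(files, iops):
    """Time to touch every file once at one I/O operation per file."""
    return files / iops

files = 76000          # files in the 100 MB folder
folder_bytes = 100e6   # about 100 MB, taken here as 100 * 10^6 bytes
spindle_iops = 100     # rough figure for a single spinning disk

average_size = folder_bytes / files
duration = seconds_to_touch(files, spindle_iops)
print("average file size: %.0f bytes" % average_size)
print("one pass at %d IOPS: %.0f s (about %.1f minutes)"
      % (spindle_iops, duration, duration / 60))
```

Under those assumptions the files average roughly 1.3 KB each, and a single pass over the folder takes close to 13 minutes on one spindle.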

After some checking I found out that not all files are read by the software for every action, but it was more than clear that IOPS and not throughput is the problem here. IOPS is probably also the time consuming part for backups. Keep in mind that this 100 MB of 76000 files is only a subset of several tens of gigabytes of data.

So I decided to make a ram drive on my workstation and copy the folder there. It took about 5 minutes from my windows box. Copying it to a freenas box with a raidz of 3 drives took about 2 minutes, and copying it back took about 3 minutes. So I’m getting roughly 300 to 800 KB/s throughput. However, that is many more IOPS than expected.
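
The timings above can be turned into rates with a few lines. This is only a sketch using the rounded numbers from this post (100 MB, 76000 files, "about" 5, 2 and 3 minutes), so the results are ballpark figures:

```python
def rates(total_bytes, files, seconds):
    """Return (KB per second, files per second) for one copy run."""
    return total_bytes / 1000 / seconds, files / seconds

folder_bytes = 100e6   # about 100 MB
files = 76000
runs = [("from the windows box", 5), ("to freenas", 2), ("back from freenas", 3)]
for label, minutes in runs:
    kb_s, files_s = rates(folder_bytes, files, minutes * 60)
    print("%s: %.0f KB/s, %.0f files/s" % (label, kb_s, files_s))
```

Even the slowest run works out to a few hundred files per second, which is well above the roughly 100 IOPS a single spindle is supposed to deliver.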

Now I’m in the here and now of my “simple replacement of my windows box”

I’m now considering the following implementation:

2 zpools. One with just a single mirror of ssd drives (500 gig each and hoping for about 10000 IOPS) for the IOPS workload.

The second zpool will be a raidz2 pool of 6 drives. The idea is to make a mount point for the rest of the workloads with the bigger files and a backup mount point for the ssd vdevs zpool. That way the immediate reads and writes for the engineers that need high IOPS would use the ssd drives and in the background this would replicate to the slower zpool.
The other workload would be on this second zpool directly, and then the entire slow zpool would be replicated to a second (off site) box.

I thought about adding multiple ssd mirrors to increase IOPS even further, but I don’t believe it will help since it’s going over a lan. Ethernet also adds latency. Speaking of that, I’ve learned on this forum that realtek uses the cpu a lot and Intel nics don’t. So I’m replacing all nics in our work stations. Any call to the cpu adds latency so if this has to be done for the transfer of each file realtek nics are just not acceptable….

I hope you guys had a good laugh at my investigation, all the mistakes I made and all the misconceptions I still have. I’m trying to learn and the forums here are giving me great pointers. I thought the least I could do is to write down my adventure here for entertainment :)

Signing off for now.

Huib.

p.s. Depending on the reactions to this post I might post test results on my actual implementation later. After probably much more reading…..

Stux · Nov 6, 2016

No laughs :)

Good research though.

Huib · Nov 6, 2016

Stux said:
No laughs :)

Good research though.

Thanks. I'm doing my best :)

Huib · Nov 22, 2016

Small update.

I tried something else, on an old borrowed HP server with 8 gigs of ram and 3 drives in raidz (still just testing. I don't have hardware lying around doing nothing).

I first applied this http://marc.info/?l=samba&m=139336252926228&w=2

Then I copied 153 MB of files (about 10.000 files) to the FreeNAS box using just windows explorer.
This takes about a minute.

Then I ssh'd into the box and using the command line, made a copy of that directory in the same root directory.
Interestingly enough this takes 1 second. All that data has to be read and written for the copy, so that's roughly a 60x speedup.
I could see and access the data on my windows box immediately.
So it looks like ZFS is handling the caching "pretty" well, and I suppose this is fully done in RAM since this would never be possible with my spindles.
But copying the same data to a RAM drive on my windows box takes a minute again (over the network).

This leads me to the following conclusion. The bottleneck is not the disks (since as long as I apply enough RAM, most data that is actually worked on should find its way to the ARC at some point).
It's just the crappy way Windows copies files over a network connection (overhead; I assume checking rights etc. etc.), so using mirrored SSDs shouldn't help much in most cases.

It's the same effect as deleting a directory over a windows network. It takes ages if you have many files (about 100 files per second can be deleted).
If you go to the shell on FreeNAS and type rm -R directory (containing 310000 files) it takes 180 seconds. That's about 1700 files per second, so roughly 17x faster.
It's just network overhead.....
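
For anyone who wants to repeat this comparison, here is a small sketch that times a recursive delete and reports files per second. The idea is to run it once against the share from a client and once locally on the box. `timed_delete` and `files_per_second` are hypothetical helpers, and be careful: the function really deletes the directory you give it.

```python
import os
import shutil
import time

def timed_delete(directory):
    """Delete directory recursively; return (files_deleted, seconds)."""
    # Count the files first so the rate can be computed afterwards.
    files = sum(len(filenames) for _, _, filenames in os.walk(directory))
    start = time.perf_counter()
    shutil.rmtree(directory)
    return files, time.perf_counter() - start

def files_per_second(files, seconds):
    """Deletion rate; guards against a zero-length timing."""
    return files / seconds if seconds > 0 else float("inf")
```

With the numbers above, `files_per_second(310000, 180)` comes out at about 1722.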

In other words, it looks like "tuning" the NAS won't help too much :(.....

Signing out again for now.

Donny Davis · Jan 16, 2017

With raidz(2,3) you effectively get fewer IOPS.
https://www.ixsystems.com/blog/why-zil-size-matters-or-doesnt/

Consider a solid slog device. A good SSD based slog may help with those small files.

@jgreco has said more than once a good raid controller with two disks in a mirror behind the raid card will make a great slog. Or you could also use a supported SSD. The choice in raid card or SSD is important, and this has been covered very well on here.

Let us know what you find out

Corrected

Stux · Jan 16, 2017

Donny Davis said:
With raidz(2,3) you effectively get the write speed of a single disk.

That's not really true. You get the IOPS of a single disk, but sequential read/write speed scales up as you add more disks.

What is true is that all vdevs give you the IOPS of a single disk, but with mirrors you have more vdevs.

In a mirror you also get the write speed of a single disk, so two mirrors is double speed but a RAIDZ2 of 4 disks would actually be triple speed.
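
As a rough illustration of that rule of thumb, here is a toy model. It assumes each vdev delivers the random IOPS of one disk and that streaming write speed scales with the number of data disks (total minus parity for RAIDZ, one disk's worth per mirror). Real pools will deviate, and the function name and the default per-disk numbers are placeholders, not measurements:

```python
def pool_estimate(vdevs, disks_per_vdev, parity, disk_iops=100, disk_mb_s=150):
    """Toy model: (random IOPS, streaming write MB/s) for a pool.

    parity is 1, 2 or 3 for RAIDZ; use parity=None for mirror vdevs.
    """
    if parity is None:
        data_disks = 1                      # a mirror writes one disk's worth
    else:
        data_disks = disks_per_vdev - parity
    return vdevs * disk_iops, vdevs * data_disks * disk_mb_s

# The same six disks arranged two ways
print("3 x 2-way mirror: ", pool_estimate(3, 2, None))
print("1 x 6-disk RAIDZ2:", pool_estimate(1, 6, 2))
```

In this model the three mirrors give three times the IOPS of the single RAIDZ2 vdev, while the RAIDZ2 layout comes out ahead on streaming writes.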

Donny Davis · Jan 17, 2017

Stux said:
That's not really true. You get the IOPS of a single disk, but sequential read/write speed scales up as you add more disks.

What is true is that all vdevs give you the IOPS of a single disk, but with mirrors you have more vdevs.

In a mirror you also get the write speed of a single disk, so two mirrors is double speed but a RAIDZ2 of 4 disks would actually be triple speed.

Correct, and I should have made that more clear. Yes you effectively pay more for IOPS with RAIDZ. The OP seemed to be concerned about IOPS

Holt Andrei Tiberiu · Jan 18, 2017

If you have that many files, soon you will want IOPS more than anything, so consider raid 10.
Also, what specs does your FreeNAS box have?

I have several accounting companies for which I had to build FreeNAS storage, but I had to ditch raid z2 and z3 for raid 10, because the accounting programs used by many of them create a lot of individual files: each transaction is 1 file.
2 of my clients had 6.000.000+ files in their accounting software (for the last 12 years of operation).
Can you imagine how an upgrade went? It took days until all files were re-indexed and the variables in them updated to the new version.
What took days on raid z2 took hours on raid 10.
I even tried 2 x raidz2: faster, but not fast enough.
So take your time and think about what your scenario will be in 2-3 years, and build your box accordingly.

Huib · Jan 20, 2017

Hi guys,

I've not followed this thread for some time due to some distractions, but I'm pleasantly surprised by the reactions.

So herewith my update.
I've bought a
supermicro x11ssm-f
32 gigs of ram
an I3-6700 (cpu load is fine.... except for one weird situation)
4 x 3 TB WD reds in raid 10
and some other non performance impacting stuff like case, coolers etc....

I had two ssds lying around so I used those for testing also.

I'm not going to digress regarding our larger file workloads. This is a non interesting workload that works great with freenas. No need to beat on this subject.

So the many files with small sizes is the interesting part. (at least for me)
So I actually now have the setup where the data directories for the smaller files are in a separate striped mirror array of ssd drives (hold on before you scold me). This striped set is replicated every 5 minutes to a zpool with 4 spinning drives that are in raid 10. In turn this is replicated to an offsite freenas.

Keep in mind I'm still not in production. It's just for testing.

I've noticed some crazy stuff.... and as mentioned before I still think that active directory and windows networking protocols are killing the performance. I think this because if I copy an 80 gig dataset from my windows box (with millions of small files) the transfer rate is about 60 to 100 files per second, REGARDLESS OF WHETHER I COPY THEM TO THE SSD OR THE SPINDLE zpool. So my conclusion remains that I will not increase performance with this. I only gain the ability to recover to a previous filesystem state in a short period of time.

A second thing I've found out is that if you copy millions of small files over the network over smb, in some cases your cpu hits 100% after a random amount of time (I found several hits on the forums regarding this, but also that it should have been solved a long time ago). Then the copy speed drops from 60 to 100 files a second with cpu usage of about 25% (1 core) to 1 to 4 files a second with cpu usage of 100% (1 core). That's a problem when you are trying to copy 1.3 million files. This is however not a normal workload for us and can be solved by zipping the files, moving the zip to the freenas, unzipping it and voilà. Note that the unzip on the striped ssds took less time than getting rid of my breakfast, so let's say less than half an hour. The replication to the raid 10 spindle pool is snappy too!
This issue I plan to reproduce and (re)submit as a bug.
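
The zip workaround can be scripted. A minimal sketch with Python's standard `shutil` module, where the paths and the `bundle`/`unbundle` names are placeholders. The unpack step is meant to run on the NAS side (or at least against a local path), so that the share itself only ever sees one big file:

```python
import shutil

def bundle(source_dir, archive_base):
    """Pack source_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)

def unbundle(archive_path, target_dir):
    """Unpack the zip archive into target_dir (created if it is missing)."""
    shutil.unpack_archive(archive_path, target_dir, "zip")
```

The archive produced by `bundle()` can be moved with a normal file copy, which sidesteps the per-file overhead described above.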

Now for a small funny story.
I noticed that the freebsd logo has some horns.. so why not roll with it

Since one of my main objectives was to be able to recover a previous state quickly, I started to make snapshots... just before I deleted major parts of a test project without informing my engineers (they knew I was testing something but not what).....
Then I waited for about 5 minutes and restored the filesystem to the last snapshot.
Rinse and repeat.....
This is mostly fun if you actually have the engineering department behind the next door in the corridor, so you can see their reactions when you go there to discuss the schedule while they are looking at their monitors in pure confusion. "I lost my work... but now it's back again.... Wait no.... It's gone..." etc.

Maybe that's a horrible way to test recovery of previous states but I promise you it's fun.

In conclusion, I AM going to put this in production. The speed during normal work is sufficient and it has flexibility as well as backup advantages through replication. I'm a bit disappointed that I did not manage to increase performance for the small file workload, but it seems to be out of freenas' control.

Have a good weekend!

Stux · Jan 21, 2017

Did you disable atime support on the SMB share?

Also, there's a performance tweak involving case sensitivity that you can do.

Huib · Jan 22, 2017

Thanks Stux.
I forgot to mention that.

I did both of those tweaks. Also I tried it with and without auto tuning but I didn't see a difference.

Huib · Feb 9, 2017

The next chapter arrives.

Generally everything is working fine but the small files kept bugging me.
Since I have a box at home now it's easier to play around a bit in the evenings.

So I wrote some python scripts to test SMB file shares.
One was just writing files to a directory until the speed dropped to 1 or 2 files per second.

Code:

import os
import time

def readwritefast(n, path):
    """Write n-1 one-byte files into the existing folder path, reporting speed every 1%."""
    start = time.time()
    step = max(1, n // 100)  # integer math, so no percentage step is skipped by float rounding
    for x in range(1, n):
        # 'with' closes the file; the original 'f.close' never actually called close()
        with open(os.path.join(path, str(x) + '.txt'), 'w') as f:
            f.write("1")
        if x % step == 0:
            timed = time.time() - start
            print('{}% done {} files processed in {} second; {} files per second'.format(
                round(x / n * 100, 1), x, round(timed, 2), round(x / timed, 2)))

path = 'f:/temp'  # set this yourself depending on your share; mine was 'f:/temp'
readwritefast(200000, path)

The second did the same, but it writes them in sub directories to keep the files per folder reasonable.

Code:

import os
import time

def readwritefast(n, path):
    """Like the first script, but start a new sub directory after every 1% of the files."""
    start = time.time()
    step = max(1, n // 100)  # files per sub directory
    fullpath = os.path.join(path, '0')
    os.mkdir(fullpath)  # fails if the folder is left over from an earlier run
    for x in range(1, n):
        # 'with' closes the file; the original 'f.close' never actually called close()
        with open(os.path.join(fullpath, str(x) + '.txt'), 'w') as f:
            f.write("1")
        if x % step == 0:
            timed = time.time() - start
            print('{}% done {} files processed in {} second; {} files per second'.format(
                round(x / n * 100, 1), x, round(timed, 2), round(x / timed, 2)))
            fullpath = os.path.join(path, str(x // step))
            os.mkdir(fullpath)

path = 'f:/temp'  # set this yourself depending on your share; mine was 'f:/temp'
readwritefast(200000, path)

The second script performed at 200 files per second over the network consistently (200.000 files). The first script started at 200 but dropped linearly, and after 5000 to 10000 files it went straight to "zero" with 100% cpu for smb.

For reference, on my laptop using an SSD (local, not over the network) the output of these scripts looks something like this:

Code:

1.0% done 2000 files processed in 0.74 second; 2706.36 files per second
2.0% done 4000 files processed in 1.55 second; 2580.65 files per second
3.0% done 6000 files processed in 2.35 second; 2552.11 files per second
4.0% done 8000 files processed in 3.2 second; 2496.88 files per second
5.0% done 10000 files processed in 4.03 second; 2483.85 files per second
6.0% done 12000 files processed in 4.89 second; 2453.99 files per second

Conclusion:
The checking if a file exists is the killer. The more files you have in a directory, the longer the addition of a new file takes. Keep hammering the folder and the cpu just can't keep up. As soon as you hit 100% it's game over.

A second test I did with the second script (with sub folders) was how much this http://marc.info/?l=samba&m=139336252926228&w=2 (case sensitivity) actually makes a difference. So I made a share accordingly and the speed increased to 500 files per second, so about a 2.5X speedup.

So in my opinion the small files are not the big problem. It's too many small files in the same folder.
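
If you control the program that writes the files, one way around this is to spread them over sub directories so that no single folder grows too large. A sketch, assuming numbered files like in the test scripts above; the 2000 files per folder default mirrors the test run, and `bucket_path` and `write_bucketed` are made-up helpers:

```python
import os

def bucket_path(root, index, per_folder=2000):
    """Return the path for file number index, grouping per_folder files per directory."""
    bucket = index // per_folder
    return os.path.join(root, str(bucket), str(index) + ".txt")

def write_bucketed(root, index, data="1"):
    """Write one small file, creating its bucket directory on demand."""
    target = bucket_path(root, index)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write(data)
    return target
```

Because the bucket is derived from the file number, a reader can find a file again without scanning any directory.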

I hope you found this interesting.

see you another day!

edit: even with the second script, don't go too high with the number of files. 1% of the files gets written to each folder, so with 200.000 that's 2000 per folder and this works great. Going much higher might slow it down significantly. I did that on purpose and I just exit python if speeds go downhill.
Secondly, I'm not checking if folders exist, so if you run the test you need to manually remove all files and folders before running the script again or it will crash. Sorry, I'm lazy.

anodos · Feb 9, 2017

the checking if a file exists is the killer. The more files you have in a directory the longer the addition of a new file takes. Keep hammering the folder and the cpu just cant keep up. As soon as you hit 100% it's game over.

That's more or less my observation as well. CPU quickly becomes the bottleneck in Samba when you have lots of files. Get an E5-1650. :D

It would be interesting to perform the same battery of tests with ZFS on Linux to see if there is any performance difference.

A second test I did with the second script (with sub folders) was how much this http://marc.info/?l=samba&m=139336252926228&w=2 (case sensitifity) actually makes a difference. So I made a share accordingly and the speed increased to 500 files per second. so about a 2.5X speedup.

So in my opinion the small files is not the big problem. It's too many small files in the same folder.

Noticed that as well. It seems to be where you hit the wall with smbd processes being single-threaded. See discussion here: http://marc.info/?l=samba&m=146175647222729&w=2

There is probably some hope for this improving in the future, but it's not a simple fix.

Huib · Feb 9, 2017

Thanks for the confirmation of my finding ;).

anodos said:
That's more or less my observation as well. CPU quickly becomes the bottleneck in Samba when you have lots of files. Get an E5-1650. :D

At the risk of sounding like a n00b again.... why would I use that cpu? The base clock is lower than my i3 and samba is single threaded. Am I missing something?

anodos said:
Noticed that as well. It seems to be where you hit the wall with smbd processes being single-threaded. See discussion here: http://marc.info/?l=samba&m=146175647222729&w=2

I've found similar msgs but this one is regarding reading. I don't experience performance problems while reading big directories. Only when I'm hammering a directory with writes while it already has many small files in it. I could retest reads if you are interested, but for small files like this I think I was getting around 500 files per second. Reasonable for 1 Byte files on a single mirror of spindles I think.

I understood that it's so complicated that they are not putting multithreading on their priority list, and for me that's not a problem. Now that I understand that this "problem" only happens in (for me) rare cases I'm fine with it. But I will surely upgrade if they implement it :D

Huib · Jun 27, 2017

Hi guys,

small update... might be a bit boring
1) I don't consider myself a full n00b anymore. I feel I have a grasp of the basics of copy on write filesystems, snapshots (and their pitfalls) and replication.
2) My systems are in my sig. They are working fine and stable and under 1% utilization :p

I can not begin to explain how much this somewhat painful transition improved my possibilities in my company.
The snapshot feature and the possibility to make a "backup" in about 1 second before you try something stupid is just awesome.
Knowing your snapshots will replicate offsite adds so much peace of mind, especially if you find out that it actually works!!! You can recover from them!
And it takes just a moment...
You don't have to search for tapes. You just make a share of a clone if you have to recover a few files. Or you just restore to an old snapshot if you did something really stupid (didn't happen yet, but I tested it).

Anyway, the point is that I love it.
And I want to thank everyone for the help they gave me when I couldn't figure it out on my own. So consider yourself thanked :)

See you on the forums!

H.
