Datasets with max (gzip-9) compression fail on large writes

Status
Not open for further replies.

sirjorj

Dabbler
Joined
Jun 13, 2015
Messages
42
Original thread here.
Hardware detailed here.

I did some more troubleshooting. Here is the summary.

I have had no issues with datasets using the default compression. The problems only start when I create a dataset with max compression.

My main test data for this is a folder containing ~20 GB across 63 files. Transfer speeds are estimated averages.

I started with AFP, as that is what I plan on using the most. When writing the test folder, it goes at ~100 MB/s, but it eventually stalls and fails.

I redid the test using SFTP instead. It went at ~35 MB/s and did not fail.

Doing the test with SMB went at ~65 MB/s and also failed, though it took longer to fail than AFP did.

The message dump for the AFP failure is this:

Jan 20 18:23:40 freenas afpd[37036]: transmit: Request to dbd daemon (volume emc) timed out.
Jan 20 18:24:37 freenas cnid_dbd[37037]: read: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: error reading message header: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: read: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: error reading message header: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: read: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: error reading message header: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: read: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: error reading message header: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: read: Connection reset by peer
Jan 20 18:24:37 freenas cnid_dbd[37037]: error reading message header: Connection reset by peer

I looked at the log for the SMB failure but didn't notice anything other than SMB going down and coming back up again. The message in the OS X error window is this:
The Finder can’t complete the operation because some data in “2014-05-04.wav” can’t be read or written.
(Error code -36)

Another interesting observation: after a failed write, trying to delete the dataset fails with 'device is busy'. If I reboot or wait a while (about as long as it took to write up this report), I can then delete it.
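For reference, the command-line version of that error (pool/dataset names here are made up) is roughly:

zfs destroy tank/gzip9test
cannot destroy 'tank/gzip9test': dataset is busy

and on FreeBSD, fstat -f /mnt/tank/gzip9test should list whatever still has files open on that filesystem (presumably afpd/cnid_dbd hanging on after the failed copy).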

I have no idea how FreeNAS works under the hood, but I would guess that if you give it too much data too fast at high compression, it gets overloaded and crashes.

To reproduce this, make a max-compression dataset and throw a large amount of data at it as fast as you can. If you can push ~100 MB/s at it, you should see a failure in less than 20 GB!
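If anyone wants to set up the same test from the shell, the dataset side is just something like this (the pool name is a placeholder; the actual writes were done over AFP/SMB from a Mac):

zfs create -o compression=gzip-9 tank/gzip9test
zfs get compression tank/gzip9test

Then share it out and copy the ~20 GB folder into it as fast as the network allows.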

I looked at the bug reporting page and saw that I have to register an account to 'officially' report this. I really don't want to make yet another account. I just want to be a FreeNAS user...
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Hmmm... I think this was brought up a couple of months ago, and the conclusion was basically: yeah, gzip-9 is slow and the client times out waiting for acks.
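A rough way to see the difference locally (pool/dataset names are just examples) is to time the same copy into an lz4 dataset and a gzip-9 dataset:

zfs create -o compression=lz4 tank/lz4test
zfs create -o compression=gzip-9 tank/gzip9test
/usr/bin/time cp -R /mnt/tank/testdata /mnt/tank/lz4test/
/usr/bin/time cp -R /mnt/tank/testdata /mnt/tank/gzip9test/

The gzip-9 copy should take noticeably longer once write throttling kicks in; over AFP/SMB that extra latency is what the client ends up timing out on.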
 

sirjorj

Dabbler
Joined
Jun 13, 2015
Messages
42
Hmm... So the client doesn't get the ack until the data is fully compressed and on disk? Interesting...

When I was playing around with FreeNAS to get a feel for it, I noticed that several minutes after writing a large file (to a dataset with the default compression setting), one of the CPU cores suddenly hit 100% for a little while on gzip or some other compression program (I don't remember exactly which one it was). I assumed that the data was written uncompressed to get it on disk ASAP and then a later process would make a pass to compress it. I guess that isn't the case.

In writing to a max-compression dataset, I have never seen the CPU (8-core Atom) go above about 50%. Could the compression be made to take better advantage of multicore systems? Or should the max setting be removed because it isn't stable?
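If it helps anyone reproduce the CPU observation, the per-core load during a write can be watched with the stock FreeBSD top, including kernel threads (my understanding is that the compression work happens inside ZFS's kernel I/O threads, so it shows up as system time rather than a separate gzip process):

top -SHP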
 

rsquared

Explorer
Joined
Nov 17, 2015
Messages
81
This sounds more like a case of the classic "Unix gives you just enough rope to hang yourself -- and then a couple more feet, just to be sure."

In other words, sure, there are tools there that can cause you problems if used incorrectly. There are also valid uses for those tools. It's on the user to know the implications of using a non-default setting.

In this case, gzip 9 is known to be slow, so it's up to you to test and decide whether the extra time is worth the space saved. If so, there are also workarounds you could probably implement. If remote clients are timing out and you really want the extra compression, perhaps a "staging" share/dataset with standard lz4 to drop the files into, followed by a move to the gzip dataset at the command line.
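Sketched out (all names below are placeholders), that workflow would look something like:

zfs create -o compression=lz4 tank/staging
zfs create -o compression=gzip-9 tank/archive
# share /mnt/tank/staging over AFP/SMB and copy into it, then locally:
mv /mnt/tank/staging/* /mnt/tank/archive/

Because the mv crosses dataset boundaries it's really a copy-and-delete, so the gzip-9 work happens there, where nothing is waiting on network acks.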
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I assumed that the data was written uncompressed to get it on disk ASAP and then a later process would make a pass to compress it. I guess that isn't the case.
Right, all the compression options are on-the-fly. The great thing about the default, LZ4, is that it's fast enough to saturate spinning hard drives and gigabit ethernet without hogging the CPU.
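You can see what a given dataset is doing, and how much it's actually saving, with zfs get (the dataset name is just an example):

zfs get compression,compressratio tank/mydata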
 

sirjorj

Dabbler
Joined
Jun 13, 2015
Messages
42
perhaps a "staging" share/dataset with standard lz4 to drop the files into, followed by a move to the gzip dataset at the command line.

That is an interesting idea. Since the compression delay will then hit the mv command (which doesn't time out) instead of the AFP transfer (which can and does time out), that shouldn't cause problems.

Makes sense to me! Thanks!
 