Is there any possibility this is a zfs issue?

pedz · Oct 28, 2022

This bug has me stumped so I'm grasping at straws.

From Ruby code, I do (more or less the equivalent of) a printf, followed by a call to link(from, to) (the system call), followed by another printf.

In the output, I get the first printf but not the second, and the link fails with EEXIST: the "to" link already exists. And, sure enough, the "to" link does exist and it is linked to "from" just like it is supposed to be. The name of the "to" link includes the process ID, so I know that this process created the link. The application is not multi-threaded.
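For readers following along, the sequence described above can be sketched roughly like this. The real application code is not shown in the thread, so the method name `print_link_print` and the messages are hypothetical stand-ins; the only real call here is Ruby's `File.link`, which wraps link(2).

```ruby
require "tmpdir"

# Hypothetical stand-in for the application code, which the thread does not show.
def print_link_print(from, to, out: $stdout)
  out.puts "linking #{from} -> #{to}"  # first "printf"
  File.link(from, to)                  # link(2); raises Errno::EEXIST if `to` already exists
  out.puts "linked #{to}"              # second "printf"; skipped when File.link raises
end
```

When `to` already exists, the first message is written, `File.link` raises `Errno::EEXIST`, and the second message never appears, which matches the symptom reported here.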

This is executing in a jail and the file system for the link is mounted from the main platform. I'm running the latest TrueNAS release.

Could this be some kind of weird timing bug with the jail and the underlying ZFS mount point?

jgreco · Oct 28, 2022

pedz said:
Could this be some kind of weird timing bug with the jail and the underlying ZFS mount point?

Sounds more like a Ruby thing. But this is very vague and it's hard to say. Jails have been around two decades plus, ZFS has been around nearly that long, UNIX itself has been around much longer. It's hard to imagine that you'd be bumping up against something new at the system level.

I am often doing work with C code that does not lend itself to easy debugging with a debugger, and I find that it is easiest to start throwing in liberal fprintf(stderr) and system() calls to test various theories. Then run the thing with truss, and see if the syscalls make sense. There should be some approximate equivalent for Ruby.
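A rough Ruby counterpart to the `fprintf(stderr)` approach might look like the sketch below. The helper name `trace_path` and the log format are made up for illustration; `File.lstat`, `ino`, and `nlink` are standard Ruby. Calling it on both paths immediately before and after the link attempt gives lines that can be compared against the truss output.

```ruby
# Hypothetical helper: writes one line per call, tagged with a timestamp and the pid.
def trace_path(label, path, io: $stderr)
  stamp = Time.now.strftime("%H:%M:%S.%L")
  st = File.lstat(path)  # lstat so a symlink is reported as itself, not its target
  io.puts format("%s pid=%d %s %s ino=%d nlink=%d",
                 stamp, Process.pid, label, path, st.ino, st.nlink)
rescue Errno::ENOENT
  # The path does not exist at this moment; record that instead of failing.
  io.puts format("%s pid=%d %s %s missing", stamp, Process.pid, label, path)
end
```

If the "to" path shows up as missing right before the call and present with the same inode as "from" right after the EEXIST, that would narrow down where the link is actually being created.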

pedz · Oct 28, 2022

Yeah, I don't disagree. Ruby at this point is 25+ years old as well, and I looked at the C code in Ruby that implements the call and it's drop-dead simple.

I had a similar issue between Docker containers, macOS, its file system APFS, and "Docker Desktop" which is a VM stuffed in between. In that case, PostgreSQL (also very old and reliable) kept saying (infrequently) it didn't have access to a directory when in fact it did. That setup was considerably more complicated though.

And of course, it takes a week or so to hit this error. So I've rigged up more debugging and will see what happens.

The other idea: TrueNAS is built on top of BSD, and BSD has restartable system calls. I can't remember all the ins and outs, but maybe Ruby hasn't fully implemented all the nuances of that. For example, the link call creates the link, but before it returns to the application a timer pops. The link system call is then restarted and at that point discovers the link is already there. I had to deal with that type of thing many years ago. I don't recall all the details, but I know there were a few tricks depending on various settings like signal masks. Plus, I was dealing with BSD 4.2 and I don't know what it looks like today.
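Whether or not the restart theory holds (it is unconfirmed in this thread), one defensive option is to treat EEXIST as success when the existing "to" is already a hard link to "from". This is only a sketch: the wrapper name `link_idempotent` and its return values are hypothetical, and it works around the symptom rather than explaining it. `File.identical?` is standard Ruby and reports whether two paths refer to the same file.

```ruby
# Hypothetical wrapper: tolerate EEXIST only when `to` is already the same file as `from`.
def link_idempotent(from, to)
  File.link(from, to)
  :created
rescue Errno::EEXIST
  # A different file sitting at `to` is still a real conflict, so re-raise.
  raise unless File.identical?(from, to)
  :already_linked
end
```

Logging the `:already_linked` case would also show how often the odd EEXIST actually happens.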
