Is there any possibility this is a zfs issue?

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
This bug has me stumped so I'm grasping at straws.

From Ruby code, I do (more or less the equivalent of) a printf, followed by a call to link(from, to) (the system call), followed by another printf.

In the output, I get the first printf but not the second, and the link fails with EEXIST: the "to" link already exists. And, sure enough, the "to" link does exist and it is linked to "from" just like it is supposed to be. The "to" name includes the process ID, so I know that this process created the link. The application is not multi-threaded.
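For reference, the sequence amounts to roughly the following. This is a minimal sketch, not the actual application code: the method name, the file names, and the temporary directory are made up for illustration. It calls the method twice on the same names only to show what the symptom looks like when link(2) is attempted against a target that is already there.

```ruby
require "tmpdir"

# Hypothetical stand-in for the application code: print, link, print.
def print_link_print(from, to)
  puts "about to link #{from} -> #{to}"
  File.link(from, to) # calls link(2); raises Errno::EEXIST if `to` exists
  puts "link done"
  :ok
rescue Errno::EEXIST
  :eexist
end

OUTCOMES = Dir.mktmpdir do |dir|
  from = File.join(dir, "source")
  to   = File.join(dir, "target.#{Process.pid}") # pid in the name, as described
  File.write(from, "payload")
  # The second attempt prints the first message, skips the second,
  # and ends in EEXIST, which matches the reported output.
  [print_link_print(from, to), print_link_print(from, to)]
end

p OUTCOMES
```

The first call returns `:ok` and the second returns `:eexist`. In the real application there is only one call, which is what makes the failure puzzling.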

This is executing in a jail and the file system for the link is mounted from the main platform. I'm running the latest TrueNAS release.

Could this be some kind of weird timing bug with the jail and the underlying ZFS mount point?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could this be some kind of weird timing bug with the jail and the underlying ZFS mount point?

Sounds more like a Ruby thing. But this is very vague and it's hard to say. Jails have been around two decades plus, ZFS has been around nearly that long, UNIX itself has been around much longer. It's hard to imagine that you'd be bumping up against something new at the system level.

I am often doing work with C code that does not lend itself to easy debugging with a debugger, and I find that it is easiest to start throwing in liberal fprintf(stderr) and system() calls to test various theories. Then run the thing with truss, and see if the syscalls make sense. There should be some approximate equivalent for Ruby.
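A rough Ruby equivalent of the fprintf(stderr) approach might look like the sketch below. The helper name `traced_link` and the log format are assumptions for illustration; only `File.link`, `File.exist?` and `File.stat` are standard Ruby calls. It records whether the target already existed just before the call and how many links the source has before and after, which helps separate "something created the target earlier" from "the call itself created it and then reported failure".

```ruby
require "tmpdir"

# Hypothetical debug helper: log to stderr what the file system looks like
# immediately before and after the link call.
def traced_link(from, to, log: $stderr)
  pid = Process.pid
  log.puts "[#{pid}] pre: target_exists=#{File.exist?(to)} links=#{File.stat(from).nlink}"
  File.link(from, to)
  log.puts "[#{pid}] post: ok links=#{File.stat(from).nlink}"
  true
rescue SystemCallError => e
  # Covers EEXIST and any other errno raised by the call.
  log.puts "[#{pid}] post: #{e.class} target_exists=#{File.exist?(to)} links=#{File.stat(from).nlink}"
  false
end

Dir.mktmpdir do |dir|
  from = File.join(dir, "source")
  to   = File.join(dir, "target.#{Process.pid}")
  File.write(from, "payload")
  traced_link(from, to)
end
```

Running the process under truss at the same time would show whether the kernel saw one link call or more than one.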
 

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
Yeah. I don't disagree. Ruby at this point is 25+ years old as well, and I looked at Ruby's C implementation of the call and it's drop-dead simple.

I had a similar issue between Docker containers, macOS, its file system APFS, and "Docker Desktop" which is a VM stuffed in between. In that case, PostgreSQL (also very old and reliable) kept saying (infrequently) it didn't have access to a directory when in fact it did. That setup was considerably more complicated though.

And... of course, for this error, it takes a week or so to hit it. So I rigged up more debugging and will see what happens.

The other idea: TrueNAS is built on top of BSD, and BSD has restartable system calls. I can't remember all the ins and outs, but maybe Ruby hasn't fully implemented all the nuances of that. For example, the link call creates the link, but before it returns to the application a timer pops. The link system call is then restarted and at that point discovers the link is already there. I had to deal with that type of stuff many years ago. I don't recall all the details, but I know there were a few tricks depending on various settings such as signal masks. Plus, I was dealing with BSD 4.2 and I don't know what it looks like today.
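If the restart theory turned out to be right (that is a guess at this point, nothing here confirms it), one defensive workaround would be to treat EEXIST as success whenever the target is already a hard link to the source. A minimal sketch, with `link_idempotent` and the file names being hypothetical; `File.identical?` is the standard Ruby check for two paths naming the same file:

```ruby
require "tmpdir"

# Hypothetical wrapper: treat EEXIST as success when `to` is already
# a hard link to `from`; re-raise in every other case.
def link_idempotent(from, to)
  File.link(from, to)
  :created
rescue Errno::EEXIST
  raise unless File.identical?(from, to)
  :already_linked
end

CHECKS = Dir.mktmpdir do |dir|
  from  = File.join(dir, "source")
  to    = File.join(dir, "target.#{Process.pid}")
  other = File.join(dir, "other")
  File.write(from, "payload")
  File.write(other, "a different file")

  first  = link_idempotent(from, to) # creates the link
  second = link_idempotent(from, to) # same link is already there
  third  = begin
    link_idempotent(from, other)     # an unrelated file is in the way
  rescue Errno::EEXIST
    :conflict
  end
  [first, second, third]
end

p CHECKS
```

This only hides the symptom. It would not explain why the call reported failure after doing its work, so the extra debugging is still worth having.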
 