cp: bad address on Freenas 11.2-U6

skek · Feb 6, 2020

Hi All,

I wanted to discuss this before considering a bug report. I think its possibly a freebsd/zfs interaction bug but the end result is corrupted data. There is evidence others have seen this, but it's rare.

Yesterday I copied a directory of about 400 jpg files about 1-10MB each locally on my freenas server (ssh in, and "cp -r <source> <dest>" where source and dest were on the same dataset).

cp failed on 5 files reporting cp: bad address. Those 5 files were truncated. From a practical perspective they were corrupted (not by by zfs, at least solely).

Repeating the copy to a new dest the same corrupted result on the same 5 files. Copying the files other ways, ie, using "cat <source> <dest>" resulted in a bit perfect file. Reboot and the original "cp" completes without error. Both tests prove the original file was ok. System had been up for 93 days before reboot. pool scrubs clean, no issues found/corrected. all disks smartctl and regular tests report ok. nothing in the logs. memory is tested good, its ECC and no panics etc. so the corrupted file was consistent on disk, it was just that cp failed to do its job. System is Xeon on supermicro board, LSI controller and wd red 8th disks in raid2. I don't think this is related to hardware.

I took a truss of a cp of a single file before rebooting which I have attached.

I have seen this once before some time ago, on an earlier iteration of Freenas 11 (can't remember version). It was just one file and I put it down to something funky with that file at the time and just deleted it.

This appears to be the same bug as #47668: https://redmine.ixsystems.com/issues/47668

Now what is scary here is that if I had not been watching the terminal session and seen the "cp: bad address" lines, those 5 files would lie there corrupt and eventually fall off my snapshot/replication/offsite backup scheme if left undetected.

I can't be sure, but the issue may not affect NFS/SMB/rsync/scp etc so this might only show up if manipulating files directly on the host. But. My fear is that if it occurs over one of those protocols would it error in a way that the user is aware?.

The issue appears to be mmap related, so it could affect other protocols (or it could be just in how cp uses mmap). I did find another information thread somewhere on a freebsd forum discussing the possibility of an mmap bug in cp but I can't find it right now. I'll add it if I re-find it. It implies, as does the source for cp, that mmap is only used for files <8MB.

I have no idea how to proceed from here. I can't think of anything I can do personally to troubleshoot except wait for it to happen again and me notice it and then do some more informed tests/diagnoses. Looking for input on what that might be, or if anyone knowledgable can advise on some diagnostics to do now.

I think this warrants a bug report, either here or upstream, but looking for advice on how to make that effective.

Wondering is there a way to detect this automatically. I've read some hints that there may be a way to force a panic for such a failure, maybe.. how and would this be wise?

Any other thoughts? Obviously this is a worrying situation for me given I rely on this platform for data integrity and backups can only go so far (backing up data since corrupted is not good), but I'm also hopeful to get to the bottom of it for the greater good. I think there is a sneaky edge case bug between cp/mmap/zfs to do with memory allocation perhaps (a reboot fixing it is worrying however). If I *knew* it was only cp affected I'd just not use it, my fear is silent corruption using nfs/smb etc.

Here's the truss:

Code:

root@vault:/mnt/mass1/piwigo/x # truss cp ../upload/2020/02/02/20200202174652-706ab1cc.jpg .
mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366177280 (0x800624000)
issetugid()                     = 0 (0x0)
lstat("/etc",{ mode=drwxr-xr-x ,inode=2,size=8128,blksize=4096 }) = 0 (0x0)
lstat("/etc/libmap.conf",{ mode=-rw-r--r-- ,inode=118,size=145,blksize=4096 }) = 0 (0x0)
openat(AT_FDCWD,"/etc/libmap.conf",O_RDONLY|O_CLOEXEC,00) = 3 (0x3)
fstat(3,{ mode=-rw-r--r-- ,inode=118,size=145,blksize=4096 }) = 0 (0x0)
mmap(0x0,145,PROT_READ,MAP_PRIVATE,3,0x0)     = 34366210048 (0x80062c000)
close(3)                     = 0 (0x0)
munmap(0x80062c000,145)                 = 0 (0x0)
openat(AT_FDCWD,"/var/run/ld-elf.so.hints",O_RDONLY|O_CLOEXEC,00) = 3 (0x3)
read(3,"Ehnt\^A\0\0\0\M^@\0\0\0\M^S\0\0"...,128) = 128 (0x80)
fstat(3,{ mode=-r--r--r-- ,inode=204,size=275,blksize=4096 }) = 0 (0x0)
lseek(3,0x80,SEEK_SET)                 = 128 (0x80)
read(3,"/lib:/usr/lib:/usr/lib/compat:/u"...,147) = 147 (0x93)
close(3)                     = 0 (0x0)
access("/lib/libc.so.7",F_OK)             = 0 (0x0)
openat(AT_FDCWD,"/lib/libc.so.7",O_RDONLY|O_CLOEXEC|O_VERIFY,00) = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=3071,size=1680112,blksize=131072 }) = 0 (0x0)
mmap(0x0,4096,PROT_READ,MAP_PRIVATE|MAP_PREFAULT_READ,3,0x0) = 34366210048 (0x80062c000)
mmap(0x0,3878912,PROT_NONE,MAP_GUARD,-1,0x0)     = 34368278528 (0x800825000)
mmap(0x800825000,1634304,PROT_READ|PROT_EXEC,MAP_PRIVATE|MAP_FIXED|MAP_NOCORE|MAP_PREFAULT_READ,3,0x0) = 34368278528 (0x800825000)
mmap(0x800bb3000,49152,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_FIXED|MAP_PREFAULT_READ,3,0x18e000) = 34372005888 (0x800bb3000)
mmap(0x800bbf000,102400,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_FIXED|MAP_ANON,-1,0x0) = 34372055040 (0x800bbf000)
munmap(0x80062c000,4096)             = 0 (0x0)
close(3)                     = 0 (0x0)
munmap(0x80062b000,4096)             = 0 (0x0)
mmap(0x0,102400,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366205952 (0x80062b000)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
readlink("/etc/malloc.conf",0x7fffffffd6c0,1024) ERR#2 'No such file or directory'
issetugid()                     = 0 (0x0)
__sysctl(0x7fffffffd560,0x2,0x7fffffffd5b0,0x7fffffffd5a8,0x800984053,0xd) = 0 (0x0)
__sysctl(0x7fffffffd5b0,0x2,0x7fffffffd674,0x7fffffffd668,0x0,0x0) = 0 (0x0)
mmap(0x0,2097152,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34372157440 (0x800bd8000)
munmap(0x800bd8000,2097152)             = 0 (0x0)
mmap(0x0,4190208,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34372157440 (0x800bd8000)
munmap(0x800bd8000,163840)             = 0 (0x0)
munmap(0x800e00000,1929216)             = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
sigaction(SIGINFO,{ 0x4020e0 SA_RESTART ss_t },{ SIG_DFL 0x0 ss_t }) = 0 (0x0)
stat(".",{ mode=drwxr-xr-x ,inode=2933,size=2,blksize=131072 }) = 0 (0x0)
umask(0x1ff)                     = 18 (0x12)
umask(0x12)                     = 511 (0x1ff)
mmap(0x0,2097152,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34374418432 (0x800e00000)
fstatat(AT_FDCWD,"../upload/2020/02/02/20200202174652-706ab1cc.jpg",{ mode=-rw-r--r-- ,inode=421,size=3474211,blksize=131072 },0x0) = 0 (0x0)
stat("./20200202174652-706ab1cc.jpg",0x7fffffffe888) ERR#2 'No such file or directory'
openat(AT_FDCWD,"../upload/2020/02/02/20200202174652-706ab1cc.jpg",O_RDONLY,00) = 3 (0x3)
openat(AT_FDCWD,"./20200202174652-706ab1cc.jpg",O_WRONLY|O_CREAT|O_TRUNC,0100644) = 4 (0x4)
mmap(0x0,3474211,PROT_READ,MAP_SHARED,3,0x0)     = 34376515584 (0x801000000)
write(4,"\M^?\M-X\M^?\M-aQ\M^SExif\0\0II*"...,3474211) ERR#14 'Bad address'
cp: write(2,"cp: ",4)                 = 4 (0x4)
./20200202174652-706ab1cc.jpgwrite(2,"./20200202174652-706ab1cc.jpg",29)     = 29 (0x1d)
: write(2,": ",2)                     = 2 (0x2)
Bad address
write(2,"Bad address\n",12)             = 12 (0xc)
munmap(0x801000000,3474211)             = 0 (0x0)
close(4)                     = 0 (0x0)
close(3)                     = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)         = 0 (0x0)
exit(0x1)
process exit, rval = 1

skek · Feb 10, 2020

Lots of views but not much interest. After 5 days uptime I can reproduce this bug.

A bit more testing suggests a file that gives the bad error message using "cp" on the freenas box can be read and copied without error via NFS and SMB shares (at least with the settings I have). It's not conclusive as I've just tested with the one file, perhaps other files will trigger the bug using those protocols. Still, my confidence in the system is somewhat shaken.

~~I've also noticed SMB shares being extremely slow at the same time this bug has reappeared. Not sure if that's related.~~ (edit) a reboot of OSX fixed that.

Finally, it appears to only affect files less than 8MB, and it appears cp uses memmap for such files.

I still find this all extremely concerning. Should I look to open a bug with openbsd? or reopen the above freenas bug id?

2bob · Mar 3, 2020

I think i've just seeen the same issue on 11.3-U1

Attempting an mv on a directory from one dataset to another, one file gave 'Bad address'

A cp on the same file again gave 'Bad address' and the target file was truncated (smaller size).

A cp on the same file from a snapshot worked and the size appeared normal - although i have no way to verify that the contents are correct.

Copying and moving the same file using SMB appears ok based on file size.

KernelPanic · May 20, 2020

We have managed to trigger this as well on a heavily used 11.3 Release system.. Similar conditions.

cp triggered the "bad address" and resulted in corrupted files. rsync worked fine. We only walked into this one yesterday. System uptime of 94 days.

This is a rather nasty one! Happy to provide debug information if needed.

KernelPanic · May 20, 2020

Machine will be undergoing reboot shortly, but grabbed this while I could:
stat("<output file>",0x7fffffffe938) ERR#2 'No such file or directory'
openat(AT_FDCWD,"<input file>",O_RDONLY,00) = 3 (0x3)
openat(AT_FDCWD,"<output file>",O_WRONLY|O_CREAT|O_TRUNC,0100644) = 4 (0x4)
mmap(0x0,134681,PROT_READ,MAP_SHARED,3,0x0) = 34366312448 (0x800645000)
write(4,"/*=============================="...,134681) ERR#14 'Bad address'
write(2,"cp: ",4) = 4 (0x4)
write(2,"<output file trimmed>"...,83) = 83 (0x53)
write(2,": ",2) = 2 (0x2)
write(2,"Bad address\n",12) = 12 (0xc)
munmap(0x800645000,134681) = 0 (0x0)
futimens(4,{ 1589987039.485719267, 1589986667.732862967 }) = 0 (0x0)
fstat(4,{ mode=-rwxrwxrwx ,inode=2849020,size=131072,blksize=131072 }) = 0 (0x0)
fchown(4,138,11) = 0 (0x0)
fchmod(4,0644) = 0 (0x0)
fpathconf(0x3,0x40) = 1 (0x1)
__acl_get_fd(3,ACL_TYPE_NFS4,0x800ebc000) = 0 (0x0)
close(4) = 0 (0x0)
close(3) = 0 (0x0)

Had to redact the file names used.

KernelPanic · May 20, 2020

Ive gone ahead and created a ticket for this one. https://jira.ixsystems.com/browse/NAS-106185

However, Im guessing that after a reboot, our problem will go away, and I'll be unable to assist with debugging until the conditions trigger again.

Brezlord · Jul 26, 2020

I have the same problem on TrueNAS-12.0-BETA.

KernelPanic · Jul 26, 2020

Brezlord said:
I have the same problem on TrueNAS-12.0-BETA.

Ok, that sucks. I was hoping Truenas 12 would fix it!

donalm · Aug 6, 2020

Also seeing this on FreeNAS-11.3-U1. A small datapoint to add is that if I try to copy my failing 529KB file out of the zfs pool (to /tmp or /var), a zero-length file is created there. If I try to copy it to seemingly any other destination in the zfs pool, a 512KB file is created. If I try to copy it to /home (which isn't on zfs) a 512KB file is created there.

This behaviour seems to be perfectly reproducible.

Default 128KB record size on the zfs pool.

Slennox · Aug 15, 2020

I've also experienced this. It was copying data between two datasets using system's basic cp.

It's very worrying since this bug could introduce silent data corruption in the pools.

svoltaire2 · Jan 12, 2021

Unfortunately, it still happen on the last release (trueNAS 12), any bug track?

Important Announcement for the TrueNAS Community.

cp: bad address on Freenas 11.2-U6

skek

Dabbler

skek

Dabbler

2bob

Cadet

KernelPanic

Dabbler

KernelPanic

Dabbler

KernelPanic

Dabbler

Brezlord

Contributor

KernelPanic

Dabbler

donalm

Cadet

Slennox

Cadet

svoltaire2

Cadet

Similar threads