ZFS replication causing kernel panic

Lebesgue · Aug 15, 2019

Hi everyone,
I have setup two pools and replication in between, however shortly after this is initiated I encounter a kernel panic causing the server to reboot with approximately 10 minutes intervals. Replication has been enabled for months since v9.10 however kernel panic only happened recently on v11.2U5.

I have now disabled ZFS replication instead resorting to less efficient rsync. Since this server has been stable having neither SMART nor scrub errors reported.

The error is "panic: dva_get_dsize_sync() bad DVA []".
I have not been able to find a bug describing this - have anyone else experienced this?

Rgds. Thomas

toadman · Aug 15, 2019

Hi. I have not experienced it. A quick google search seems to indicate a sparse few others have seen it. Not sure on the root cause, but seems to imply disk corruption.

Lebesgue · Aug 28, 2019

Hi,
update on my side still pointing to this being a FreeNAS/BSD kernel or ZFS software issue rather than hardware related.
Initiated full disk write overwriting with zero on the 4 disks on the backup pool from within FreeNAS GUI. Ran for app. 4 days without any errors reported nor the server rebooting as previously.
Recreated the pool and created ZFS replication job. This was running for some days as well.

The pools are now in sync and server have uptime of 14 days.

Rgds. Thomas

flashero · Aug 30, 2019

Hi, how are you. I have experiencing, the same error, on TWO separate systems (that are replicating from a main server). Is this a bug?.

Best regards

flashero · Aug 30, 2019

All systems are running FreeNAS-11.2-U5. One server is remote, and another local.

flashero · Aug 30, 2019

All systems are running FreeNAS-11.2-U5. One server is remote, and another local.

rvassar · Aug 30, 2019

flashero said:
Hi, how are you. I have experiencing, the same error, on TWO separate systems (that are replicating from a main server). Is this a bug?.

Best regards

I would call this a bug. You need to submit the full stack trace. It may be in the end, a device driver that's not well supported, or flakey hardware. But if it crashes the kernel, and provides a stack trace, it's likely something the iX crew want to see.

flashero · Aug 30, 2019

rvassar said:
I would call this a bug. You need to submit the full stack trace. It may be in the end, a device driver that's not well supported, or flakey hardware. But if it crashes the kernel, and provides a stack trace, it's likely something the iX crew want to see.

Thank you very much forma the reply. How should I submit the full stack trace?

Best regards!

rvassar · Aug 30, 2019

flashero said:
Thank you very much forma the reply. How should I submit the full stack trace?

Best regards!

Ultimately, it needs to go into the iX Redmine system. But you might start by posting it over in the Bug discussion forum, here:

https://www.ixsystems.com/community/forums/bug-reporting-discussion.7/

Lebesgue · Sep 1, 2019

Hi flashero,
what I did at first was simply to disable the ZFS replication job I had enabled in the FreeNAS GUI. This prevented the regular 10 min. server server crash intervals, I had experienced prior to this.
I subsequently setup rsync to stress the disks, which worked as well and had it running for some days.
Subsequently as described above I eventually wiped all disks at the receiving side overwriting all block with zero (took days), before recreating the pool and enabling ZFS replication again.
My FreeNAS server has been stable since, although I fail to explain why.

In /data/crash you should find the tar'ed crash logs. Attach these if you submit a bug report.

Rgds. Thomas

flashero · Sep 3, 2019

Lebesgue said:
Hi flashero,
what I did at first was simply to disable the ZFS replication job I had enabled in the FreeNAS GUI. This prevented the regular 10 min. server server crash intervals, I had experienced prior to this.
I subsequently setup rsync to stress the disks, which worked as well and had it running for some days.
Subsequently as described above I eventually wiped all disks at the receiving side overwriting all block with zero (took days), before recreating the pool and enabling ZFS replication again.
My FreeNAS server has been stable since, although I fail to explain why.

In /data/crash you should find the tar'ed crash logs. Attach these if you submit a bug report.

Rgds. Thomas

Hi, I have installed a fresh server, with new disks, and the panics remain. I have allready submited the bug report, and waiting.

Best regards!

ykhodo · Sep 11, 2019

I'm seeing this during replication as well...

flashero · Sep 12, 2019

ykhodo said:
I'm seeing this during replication as well...

Hi, please file a bug report in https://jira.ixsystems.com/secure/Dashboard.jspa

Best regards

appliance · Dec 13, 2019

tickets to watch. same crash on 11.3beta1 and RC. #PanicFest

dwoodard3950 · Feb 18, 2020

Similar crash here which results in a reboot. The message is;
panic: dva_get_dsize_sync(): bad DVA ...
cpuid = 6
KDB: stack backtrace:
db_trace_self_wrapper() at ...

I'm able to reliably duplicate the crash with a specific snapshot on a specific dataset.

Unfortunately, not sure where to do with it, other than wait for this snapshot to rotate out of the source data.

Machine in question:
Destination server is Supermicro X10SDV-6C-TLN4F with 64GB ECC.

Henry L · Mar 2, 2020

Same crash issue here. Same panic: dva_get_dsize_sync(): bad DVA

My work around, while not ideal delete the last 2 snapshots, then replication runs fine.

styno · Apr 10, 2020

It looks like that nasty bug is finally squashed! ( commit here ) HAPPY DAYS.
I hope there will be some sort of hotfix and we don't have to wait for the next update cycle.

Important Announcement for the TrueNAS Community.

ZFS replication causing kernel panic

Lebesgue

Dabbler

toadman

Guru

Lebesgue

Dabbler

flashero

Cadet

flashero

Cadet

flashero

Cadet

rvassar

Guru

flashero

Cadet

rvassar

Guru

Lebesgue

Dabbler

flashero

Cadet

ykhodo

Explorer

flashero

Cadet

appliance

Explorer

dwoodard3950

Dabbler

Henry L

Dabbler

styno

Patron

Similar threads