Unexpected reboots with zfs send/recv and a full L2ARC

Silverstar24

Dabbler
Joined
Jan 22, 2020
Messages
11
Hi all

I have a problem with zfs send/recv and an L2ARC cache device on FreeNAS 11.3 U3.2. I am replicating snapshots over a WAN connection to another FreeNAS 11.3 U3.2 system.

For a few months now, the remote FreeNAS has been rebooting unexpectedly. First of all I checked for updates and upgraded from 11.3 U1 to 11.3 U3.2. In the logs I can only find the message "The system is back online after an unexpected (Date/Time)", but nothing about an error. So I replaced the mainboard, CPU and memory with identical parts, but this didn't solve the issue. Sometimes the system reboots two or more times in one day.

Next I rebuilt the whole setup on my side from scratch: new hardware, a fresh installation and a fresh configuration, exactly matching the system I use on the remote side. I ran into the same issue with the new hardware and a default installation/configuration, except that on the new system the reboots occur within a few minutes.

The difference is that the remote side uses a 50/50 Mbit/s WAN connection, while internally I use a 1 Gbit/s LAN. I experimented with the push replication settings (dedup/compression), but nothing solved the issue. I noticed that the transfer rate dropped from about 900 Mbit/s to about 600 Mbit/s after 120 GB had been transmitted. A few minutes later the FreeNAS had another unexpected reboot.

120 GB is the size of the SSD that I use as L2ARC on the remote side. So I removed it from the pool, and the problem no longer occurs: network usage stays around 900 Mbit/s and I can transfer 5 TB of data without any issue.
This is probably why only the remote side shows this behavior; locally I use an all-flash system.
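
To put the 120 GB figure in perspective, here is a rough back-of-the-envelope sketch of how long it takes to transmit as much data as the L2ARC SSD holds at the link speeds mentioned above. It assumes decimal units (1 GB = 8000 Mbit) and a fully saturated link, and the helper name seconds_to_transfer is made up for illustration.

```shell
# Seconds needed to transfer SIZE_GB gigabytes at RATE_MBIT megabits per second.
# Assumes decimal units (1 GB = 8000 Mbit) and a saturated link.
# Integer arithmetic, so the result is rounded down.
seconds_to_transfer() {
    size_gb=$1
    rate_mbit=$2
    echo $(( size_gb * 8000 / rate_mbit ))
}

seconds_to_transfer 120 900   # LAN at about 900 Mbit/s
seconds_to_transfer 120 50    # 50/50 Mbit/s WAN
```

That works out to roughly 18 minutes on the LAN and a bit over 5 hours on the WAN. If the reboot really is tied to the amount of data transferred, this would fit with the local test system failing much sooner than the remote one.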

Can anyone explain this behaviour? I haven't been able to find any information on a bug.


My Systems:

Local side:

CISCO C240 M4X2
2x E5-2690v4
512GB RAM
24 Port LSI 12G SAS HBA (CISCO brand no RAID/Cache)
10x 3.8TB SAMSUNG SATA Enterprise SSD
2x 800GB SAMSUNG SAS 12G Enterprise SSD
2x 120GB SATA INTEL SSD for VMs
2x 10Gb/s CISCO NIC

FreeNAS Guest
VM Version 15
8vCores
64GB RAM
HBA Passthrough with all 12 Disks
2x 10Gbit/s VMXNET3 NIC

ZFS Pool Config:
5 Disk as RAIDz1 / LZ4 Compression enabled.
5 Disk as RAIDz1 / LZ4 Compression enabled.
2 Disk as RAIDz1 / LZ4 Compression enabled.

Tunables:
kern.ipc.maxsockbuf 4194304 sysctl
kern.ipc.nmbclusters 4085768 sysctl
net.inet.tcp.delayed_ack 0 sysctl
net.inet.tcp.mssdflt 1448 sysctl
net.inet.tcp.recvbuf_inc 524288 sysctl
net.inet.tcp.recvbuf_max 16777216 sysctl
net.inet.tcp.recvspace 262144 sysctl
net.inet.tcp.sendbuf_inc 16384 sysctl
net.inet.tcp.sendbuf_max 16777216 sysctl
net.inet.tcp.sendspace 262144 sysctl
vfs.zfs.arc_max 61803000000 sysctl
vfs.zfs.l2arc_headroom 2 sysctl
vfs.zfs.l2arc_noprefetch 0 sysctl
vfs.zfs.l2arc_norw 0 sysctl
vfs.zfs.l2arc_write_boost 40000000 sysctl
vfs.zfs.l2arc_write_max 10000000 sysctl
vfs.zfs.metaslab.lba_weighting_enabled 1 sysctl
vfs.zfs.zfetch.max_distance 33554432 sysctl

scrub on vmware os disk disabled!

Remote side:
Intel I7-3770
16GB RAM
16GB SANDISK USB SSD for FreeNAS
1x 120GB Toshiba A100 SSD as L2ARC cache
3x 5TB Toshiba X300 SATA disk
1x Intel 1 Gbit/s NIC

Pool Config:
3 Disk as RAIDz1 with 1 SSD L2ARC / LZ4 Compression enabled

Tunables:
kern.ipc.maxsockbuf 2097152 sysctl
kern.ipc.nmbclusters 2097152 sysctl
net.inet.tcp.delayed_ack 0 sysctl
net.inet.tcp.mssdflt 1448 sysctl
net.inet.tcp.recvbuf_inc 524288 sysctl
net.inet.tcp.recvbuf_max 16777216 sysctl
net.inet.tcp.recvspace 131072 sysctl
net.inet.tcp.sendbuf_inc 16384 sysctl
net.inet.tcp.sendbuf_max 16777216 sysctl
net.inet.tcp.sendspace 131072 sysctl
vfs.zfs.arc_max 13418774528 sysctl
vfs.zfs.l2arc_headroom 2 sysctl
vfs.zfs.l2arc_noprefetch 0 sysctl
vfs.zfs.l2arc_norw 0 sysctl
vfs.zfs.l2arc_write_boost 40000000 sysctl
vfs.zfs.l2arc_write_max 10000000 sysctl
vfs.zfs.metaslab.lba_weighting_enabled 1 sysctl
vfs.zfs.zfetch.max_distance 33554432 sysctl
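
For reference, the L2ARC-related values above and the current L2ARC fill level can be read back on a running system with sysctl. This is a minimal sketch, assuming the sysctl names of the FreeBSD-based FreeNAS 11.3; the choice of the kstat.zfs.misc.arcstats counters is an assumption about what is relevant here, and l2arc_report is a made-up helper name.

```shell
# Print the L2ARC tunables and two L2ARC counters:
#   l2_size      bytes of data currently held in the L2ARC
#   l2_hdr_size  bytes of RAM used by the headers that index the L2ARC
l2arc_report() {
    for name in \
        vfs.zfs.l2arc_write_max \
        vfs.zfs.l2arc_write_boost \
        vfs.zfs.l2arc_headroom \
        vfs.zfs.l2arc_noprefetch \
        kstat.zfs.misc.arcstats.l2_size \
        kstat.zfs.misc.arcstats.l2_hdr_size
    do
        printf '%s = %s\n' "$name" "$(sysctl -n "$name")"
    done
}
```

Calling l2arc_report every few minutes during a replication would show whether l2_size approaches the capacity of the SSD around the time the throughput drops.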
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
From your description, it's likely the SSD is going bad.
 

Samuel Tai

When you replace the SSD, you could try setting the L2ARC to metadata-only on the remote system by running zfs set secondarycache=metadata <name of your pool>. This will prevent the L2ARC from quickly filling up with replication data.
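
As a sketch of that suggestion, the property can be set and then read back to confirm it took effect. The pool name tank in the usage line is hypothetical and the function names are only for illustration; zfs set, zfs get -H -o value and zfs inherit are standard zfs subcommands.

```shell
# Restrict the L2ARC to metadata for the given pool (datasets that do not
# set the property themselves inherit it), then print the resulting value.
l2arc_metadata_only() {
    pool=$1
    zfs set secondarycache=metadata "$pool" &&
        zfs get -H -o value secondarycache "$pool"
}

# Revert the pool's root dataset to the default (secondarycache=all).
l2arc_restore_default() {
    zfs inherit secondarycache "$1"
}

# Usage (hypothetical pool name):
#   l2arc_metadata_only tank
```

Data already in the L2ARC is not removed by this change; the setting only affects what is written to the cache device from then on.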
 

Silverstar24

Sorry, I phrased that badly.

I built a second, identical system with the same hardware and software, so I have an exact copy of the remote NAS here with me.
And I have exactly the same problem on both systems.

By the way, SMART is enabled and no issues or warnings have come up.
 

Samuel Tai

According to the datasheet for the Toshiba A100 SSD, it is set up with an SLC write cache and a 3-bit/cell MLC data area. In effect, this SSD is configured like an SMR drive, which doesn't work well with FreeNAS workloads. I'd replace this SSD with another model.
 

Silverstar24

OK, that is strange. I think the issue first appeared after upgrading to FreeNAS 11.3. The system had been running well since 2016 and I never had a problem with it.

Thanks, I'll try replacing the SSD with an HPE 64GB SLC SSD.
 