FreeNAS as FC target: Large write causes kernel panic

S.co.tt

Hi all,

First, I want to give a big thanks to aran kaspar on here for the awesome writeup on setting up an FC target. It worked perfectly and immediately.

The first thing I did was expose a 1TB file extent as a LUN to one of my VM servers (CentOS 6 / Xen), create an ext4 filesystem on it, and so on, and I was able to read and write just fine. As a test, I ran a backup script that copies a raw snapshot of every VM on that server to the LUN -- about 850GB in total.
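In case the context helps, the initiator-side setup was nothing exotic; it was roughly the following (the host number, device name, and mount point are from my setup, so adjust as needed):

# rescan the FC HBA so the new LUN shows up (host1 is the lpfc HBA on this box)
echo "- - -" > /sys/class/scsi_host/host1/scan

# one big partition, ext4 on top, mounted as the backup target
parted -s /dev/sdc mklabel gpt mkpart primary 0% 100%
mkfs.ext4 /dev/sdc1
mount /dev/sdc1 /mnt/freenas-lun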

All went well for the first ~65GB; then I started seeing kernel: Buffer I/O error on device sdc1 repeatedly on the initiator (sdc was the device name of the LUN).

On the target (FreeNAS), ctladm dumpooa showed that at any given time during the transfer there was one command taking a very long time to complete (the rest were on the order of 0-100ms) -- so long that it caused the HBA driver on the initiator to time out. Increasing the timeout didn't help, because eventually some command would take longer than whatever the timeout was (I went as high as 2 minutes).
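For reference, this is more or less how I was watching the queue on the target and bumping the timeout on the initiator (device and host names are from my setup):

# on the FreeNAS target: dump the outstanding (OOA) commands every couple of seconds
while true; do ctladm dumpooa; sleep 2; done

# on the CentOS initiator: raise the SCSI command timeout for the LUN to 120 seconds
echo 120 > /sys/block/sdc/device/timeout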

I thought I'd narrowed it down to a bad disk (which didn't show as bad from smartd, but it was absurdly slow and the whole array was waiting on it). After the disk was replaced, I ran the same backup script again. It ran great: nothing in the command queue waited longer than 2-3 seconds in the absolute worst case, and most wait times were in the 0-15ms range.
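For anyone chasing a similar symptom: something like the following makes a slow member disk stand out pretty quickly, even when SMART says it's healthy (daX is a placeholder for the suspect drive):

# on FreeNAS: one drive showing huge ms/w while its siblings sit near zero is the culprit
gstat -p -I 2s

# then a full attribute dump and a long self-test on the suspect
smartctl -a /dev/daX
smartctl -t long /dev/daX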

At this point, in an act of hubris, I prematurely wrote up a blog post on the "fix" while the backup was progressing into the hundreds of GBs. That post has a lot more background.

About 750GB into the 850GB backup (which, honestly, was just a dd of a snapshot of each raw logical volume), the following errors showed up on the initiator:

Feb 28 20:50:42 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 20:52:43 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 20:54:44 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 20:56:45 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 20:58:46 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 21:00:47 hp-vm02 kernel: lpfc 0000:07:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2002
Feb 28 21:00:47 hp-vm02 kernel: sd 1:0:0:1: [sdc] Unhandled error code
Feb 28 21:00:47 hp-vm02 kernel: sd 1:0:0:1: [sdc] Result: hostbyte=DID_REQUEUE driverbyte=DRIVER_OK
Feb 28 21:00:47 hp-vm02 kernel: sd 1:0:0:1: [sdc] CDB: Write(10): 2a 00 55 57 20 3f 00 02 00 00
Feb 28 21:00:47 hp-vm02 kernel: end_request: I/O error, dev sdc, sector 1431773247
Feb 28 21:00:47 hp-vm02 kernel: Buffer I/O error on device sdc1, logical block 178971648
Feb 28 21:00:47 hp-vm02 kernel: Buffer I/O error on device sdc1, logical block 178971649

The kernel: Buffer I/O error on device sdc1, logical block error was repeated hundreds of times, with the rest of the above errors popping up intermittently. I believe this is because the FreeNAS box slowed to a crawl (and eventually crashed).

On the FreeNAS server, all that /var/log/messages held around that time period was:

Feb 28 13:30:32 champ smartd[52957]: http://knowledge.seagate.com/articles/en_US/FAQ/223651en
Feb 28 19:22:34 champ ctlfedone: got XPT_IMMEDIATE_NOTIFY status 0x36 tag 0xffffffff seq 0xffffffff
Feb 28 19:22:34 champ (ctl0:isp0:0:-1:-1): ctlfe_done: returning task I/O tag 0xffffffff seq 0xffffffff
Feb 28 21:12:33 champ syslog-ng[3083]: syslog-ng starting up; version='3.5.6'

Not much help: as you can see, it logged a routine smartd warning at 13:30 (about out-of-date firmware on a drive), a non-fatal error at 19:22, and a boot-up at 21:12. However, I believe the entire storage subsystem crapped out, because I happened to have the console open in an IP KVM session on another monitor at the time, and at about 20:52 I saw this pop up:

[Screenshot: FreeNAS_Crash_20160228-2052.png]


These were the same errors I'd been getting in the first place, which I'd chalked up to the array hanging on a bad drive. I immediately went to run ctladm dumpooa, but my SSH session was frozen, as was the console. I thought the system was completely hung -- no more messages at the console and no visible disk activity -- but then, at about 21:04 (12 minutes later):

[Screenshot: FreeNAS_Crash_20160228-2104.png]


Then nothing for about 10 minutes, until I heard the fans throttle up because the box was POSTing. It looks like a kernel panic and reboot, or... something. IPMI shows no problems at the hardware level.

It booted back up just fine, as if nothing had happened. I'm able to read and write to the LUN again (after an issue_lip and a rescan on the HBA), and as I write this I've read about 200GB back off the LUN (I'm kind of afraid to re-run the write test).
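For the record, the issue_lip and rescan was just the usual sysfs poke on the initiator (host1 is the lpfc HBA on this box):

# force a loop/fabric re-login, then rescan for the LUN
echo 1 > /sys/class/fc_host/host1/issue_lip
echo "- - -" > /sys/class/scsi_host/host1/scan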

But I could really use some guidance here. I had a similar setup on Solaris w/ZFS for a few years, but I'm a FreeNAS / FreeBSD noob.

I'd love to dump more logs and would be happy to throw you guys more info; I just don't know where to start digging for anything useful, so if you have any ideas, please let me know what you'd like to see.

FWIW, I really don't think it's hardware or driver related (aside, perhaps, from the HBA's driver). I've had the FreeNAS box up and running for about 2 months now and have transferred about 15TB onto it with no problems (other than slow 10Gb CIFS performance, but that's a sad story for another day). It's been rock solid, as was all of the same FC hardware in its previous incarnation.

But it's weird that it could take a transfer of ~750GB without a hiccup and then suddenly fail, right?

Here's the FreeNAS hardware, BTW:
  • That giant Supermicro 36 x 3.5" case
  • X8DTH-iF mobo
  • 1 x X5687 3.6GHz QC
  • 96GB PC3-12800R
  • 5 x PERC H310 (flashed with LSI firmware to IT mode)
  • Pool01: 10 x 3TB RAIDZ2 + 10 x 3TB RAIDZ2 (this is the pool in question)
  • Pool02: 3 x 1TB mirror
  • Intel 10Gb XF SR
  • QLE2460 4Gb HBA (flashed to IT mode) to Brocade 200E switch
The initiator machine in question is a Dell PE610 / CentOS 6 / Emulex LPe1150-E.

Oh, and apologies for the wall of text and thanks in advance!
 

dlavigne

Please create a bug report at bugs.freenas.org that contains this info and post the issue number here. It is also useful to attach a debug to the bug report (generated after a recent crash). You can create one from System -> Advanced -> Save Debug.
 

cyberjock

This thread was emailed to me. Just a little advice:

1. Still put in a bug ticket regardless of what you read below. Attaching a debug file (System -> Advanced -> Save Debug) would be useful in this case, I think.
2. Using RAIDZ2 with the kind of workload you're attempting is just going to go very, very badly. With RAIDZ2 it doesn't take much load before you're so overwhelmed with I/O that the zpool cannot keep up. I wouldn't be too surprised if you're simply pushing the zpool beyond its design limits, but let the devs figure that out via the bug ticket. There may be a bug here despite my guesstimate.

Quite seriously, the only recommended zpool layout for VM workloads is mirrors. That's it. No RAIDZ-anything, because the IOPS just aren't there to support VM workloads. This is probably a big buzzkill, since RAIDZ2 gives very good data integrity relative to the amount of disk space set aside for parity/redundancy. But even if there is a bug and the issue isn't caused by RAIDZ(x), as soon as you start loading the zpool you're going to be very disappointed at how little workload the system can handle. I tried to run two VMs on a RAIDZ2 myself and it was deplorable; ultimately I couldn't do it, and sometimes the VMs would fail to finish booting because the I/O was so bad that things were timing out.
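To be concrete, by mirrors I mean a pool of striped two-way mirrors, something along these lines (pool and disk names are placeholders):

# every mirror vdev you add contributes roughly another disk's worth of random IOPS
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

You give up half the raw capacity, but random I/O scales with the number of vdevs, which is what VM workloads actually need.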

If you're okay with providing a debug file, please PM it to me. You can post it in the forums if you wish, but it may contain sensitive data (IP addresses, etc.) that you don't want public. ;)
 

S.co.tt

Thanks for the replies!

I will submit a bug report shortly and update the thread.

cyberjock,

We're on the same page as far as RAIDZ and high-IOPS workloads go -- the VM servers have their own sets of spindles running hardware RAID10 across either 6 or 8 disks. The FreeNAS LUNs on those servers are only for temp space, log archiving, dev machines, etc. -- stuff that's usually very low IOPS, unless I decide to manually push nearly a TB to them, of course. :)

That being said, I've never met an array* that crashed when I pushed it as hard as I wanted, as opposed to just... being slow. So I'm inclined to think it's a bug. Then again, I know I'm outside the realm of standard/intended use cases for FreeNAS, so I'll understand if it's a case of it's-not-a-bug-it's-a-feature.

*Faulty controllers and beta versions of software RAID aside ;)

You're really going to hate me when I tell you that the disks in the RAIDZ2 vdevs are a completely heterogeneous mix of brands and models.

However, local to the FreeNAS server (champ):

[root@champ] /mnt/pool01/temp# dd if=/dev/zero of=testfile bs=1M count=300000 &
[1] 49517
[root@champ] /mnt/pool01/temp# zpool iostat pool01 30
              capacity     operations    bandwidth
pool       alloc   free   read  write   read  write
---------- -----  -----  -----  -----  -----  -----
pool01     18.7T  35.8T    112    241   566K  1.69M
pool01     18.7T  35.8T      0  5.00K      0   628M
pool01     18.8T  35.7T      3  4.95K   489K   618M
pool01     18.8T  35.7T      5  6.04K   665K   755M
pool01     18.8T  35.7T      2  9.50K   109K  1.16G
pool01     18.9T  35.6T      0  8.98K      0  1.10G
pool01     18.9T  35.6T      0  8.55K    273  1.05G
pool01     19.0T  35.5T      0  8.98K  3.73K  1.10G
^C
[root@champ] /mnt/pool01/temp# iostat -w 30
       tty             da0              da1              da2             cpu
 tin  tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
   0     7  23.21   12  0.27  24.02   11  0.27  23.23   12  0.27   0  0  1  0 99
   0     8 113.47  665 73.73 103.16  744 74.90  93.20  841 76.52   0  0 53  2 45
   0     3  79.91  949 74.07  80.00  948 74.09  79.92  949 74.06   0  0 52  2 46
300000+0 records in
300000+0 records out
314572800000 bytes transferred in 313.159204 secs (1004513985 bytes/sec)
^C
[1] + Done dd if=/dev/zero of=testfile bs=1M count=300000

And the same sort of thing (just a lower count, because this particular extent is only 20G) on the VM server (hp-vm02) against the LUN:

root@hp-vm02 [/mnt/freebsd-temp-lun]# dd if=/dev/zero of=testfile bs=1M count=20000

[root@champ] /mnt/pool01/temp# zpool iostat pool01 30
              capacity     operations    bandwidth
pool       alloc   free   read  write   read  write
---------- -----  -----  -----  -----  -----  -----
pool01     19.1T  35.4T    112    250   567K  2.80M
pool01     19.1T  35.4T    651    953  81.4M  87.3M
pool01     19.1T  35.4T    885  1.32K   111M   127M
^C
[root@champ] /mnt/pool01/temp# iostat -w 30
       tty             da0              da1              da2             cpu
 tin  tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
   0     7  24.72   12  0.29  25.54   12  0.30  24.73   12  0.29   0  0  1  0 99
   0     8  25.92  560 14.18  26.02  557 14.14  26.01  559 14.20   1  0 10  1 88
   0     3  25.72  605 15.21  25.61  608 15.21  25.73  605 15.21   0  0 11  1 88
   0     3  26.07  609 15.50  26.12  610 15.56  26.15  606 15.49   0  0 10  1 89

dd: writing `testfile': No space left on device
19929+0 records in
19928+0 records out
20896350208 bytes (21 GB) copied, 175.609 s, 119 MB/s

When the crash occurred I was doing the same thing, except that instead of if=/dev/zero the input was an LVM snapshot. The performance was approximately the same, and it was the only workload on the array at the time. Anyway, I'm just trying to show that the FC stack is pushing far fewer IOPS than the pool can handle.
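For completeness, each per-VM copy boiled down to something like this (the VG/LV names and mount point here are illustrative, not the real ones):

# snapshot the running VM's volume, stream it raw onto the LUN, then clean up
lvcreate -s -L 20G -n vm01-snap /dev/vg_vms/vm01
dd if=/dev/vg_vms/vm01-snap of=/mnt/freenas-lun/vm01.raw bs=1M
lvremove -f /dev/vg_vms/vm01-snap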

Apologies for another long post. I'm not trying to argue for the sake of arguing; I just had to test this for myself to be sure I wasn't going crazy. I still can't figure out where the bottleneck is on the FC side, which is another puzzle. Maybe I'll have to take the switch out of the picture and just run the LC connectors as a crossover. (At work at the moment, though.)
 