zio->io_cv messages on boot

Status
Not open for further replies.

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
I've been testing my server. It's been working great so far but it looks like I broke something in the OS.

I can no longer boot.

I was stressing my drives using @jgreco 's test here:

https://forums.freenas.org/index.ph...esting-your-freenas-system.17750/#post-148773

It was working well but overnight my main machine kicked off an rsync backup to FreeNAS.

I got up this morning to find my main machine locked in iowait. The FreeNAS web page wasn't responding anymore and I couldn't ssh in. I tried to power FreeNAS down via IPMI, but IPMI reported that this had failed because the operating system was taking too long. I then sent a reset signal, which worked.

I got the console up while it was rebooting. It reported the file system as clean but got stuck at the following (see attachment).

It looks like it's hanging at a tunable added by Autotune? Should I unset all tunables as described in the manual? Autotune has never caused a problem before, so I'm wondering why it would now.
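(In case it's relevant, the values Autotune typically sets can be checked from a shell once the box is up; the two below are just the usual ZFS-related ones, so adjust to whatever the Tunables page actually lists:)

Code:
# Check the current values of tunables Autotune commonly sets
sysctl vfs.zfs.arc_max vm.kmem_size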

Thanks for any help or advice!
 

Attachments

  • boot.jpg

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So you managed to have both your main machine and your FreeNAS box lock up? Something seems wrong/unlikely/bad there. Does the APC Smart-UPS indicate any power events overnight?

As for FreeNAS rebooting, what you're seeing there is just console spillover. It doesn't really indicate what it is doing, which is something that annoys me a bit. You can try hitting ^T (control-T) to see what exactly it thinks it is doing, and probably you shouldn't do anything to it until we are really convinced that it is Stuck(tm) or Broken(tm).
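(For what it's worth, ^T works by sending SIGINFO to the foreground process on FreeBSD, and the kernel also prints a one-line status for it. If you ever have a shell on the box while something is wedged, the signal can be sent to a process explicitly; the PID below is just a placeholder:)

Code:
# Send SIGINFO explicitly (placeholder PID); many base utilities print progress on it
kill -INFO 1234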

If Broken(tm), a new flash disk and copying over the configuration will be the likely fix.

If Stuck(tm), need more info.
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
Thanks for the help!

Weird, but it seems to have recovered now. I rebooted it. Thanks for the ^T tip. After the part shown above, there were multiple entries for "zfs-zio" (it looked like it was trying to complete something?), which scrolled by for quite a while, and finally an ominous:

Code:
Oct 24 13:13:14 Minas-Tirith kernel: Solaris: WARNING: ZFS replay transaction error 45, dataset volume1/Aileens_Backup, seq 0x13ea4, txtype 13


But then it started up normally. It indicates the pool is healthy.

I would have copied the entire boot sequence here (in particular those zfs-zio entries), but I can't seem to find it logged anywhere. Oddly, I noticed that the drive stress test was running an identical process after its initial dd processes. Anyway, I looked for the boot log in /var/log: I got the part about replay transaction error 45 from /var/log/messages, but nothing about zfs-zio. I also poked around /etc and couldn't find anything there.
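For what it's worth, this is roughly how I searched; the dataset and pool names are from my own system:

Code:
# Kernel messages from the current boot
dmesg | grep -i zfs

# Anything ZFS/Solaris-related that made it into syslog
grep -iE 'zfs|solaris|replay transaction' /var/log/messages

# Pool health and any errors recorded against it
zpool status -v volume1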

My UPS was showing one event, but it must have been brief because nothing else in the house had reset. The main machine is also on a UPS.

What I suspect happened is that everything was going fine until the rsync kicked off at midnight. My SNMP data shows things kept running for about an hour, then my main machine stopped collecting data on everything. I think rsync was having a hard time writing to the disks on FreeNAS while they were so busy, and it just kept trying until it locked up both machines (to be honest, my main machine hadn't totally locked up, but all network functions had stalled and iowait was sitting at 50% of CPU). It's unwise to continue normal operations while stressing the drives, but this had worked fine through some rather intensive iozone tests, just slower than it should have been.

SMART indicates the disks are fine, but it may also hint at what caused this:

Code:
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Interrupted (host reset)  10%  381  -
# 2  Short offline  Completed without error  00%  356  -


Running a short SMART test while the drives were under stress and rsync was trying to write to them may have been too much.
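That log came from smartctl; for completeness, this is roughly how to pull it and kick off a fresh short test once the pool is idle (the device name is just an example, repeat per disk):

Code:
# Self-test history for one drive
smartctl -l selftest /dev/ada0

# Start a new short self-test now that the drives are quiet
smartctl -t short /dev/ada0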

Things seem to be OK now, and if you don't feel any more forensics would uncover anything, I'm happy to leave it. I guess the stress testing finally exposed a weakness, but it seems to have recovered fully.

My next step would have been to re-install the OS, import the pool, and restore from a backup config. I was confident this would work: the pool seemed to be fine and I had an up-to-date config backup file.
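For anyone following along later, that recovery path would have been roughly the following; the pool name is from my setup, and the config restore can equally be done through the web UI's config-upload option:

Code:
# List pools visible to a fresh install, then import the existing one
zpool import
zpool import -f volume1

# After that, restore the saved configuration file through the web UI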

Again, thank you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I imagine that the SMART status is due to the IPMI reset.

SMART tests should not significantly impact ongoing user operations. Just like a ZFS scrub, it should at most take a fraction of the capacity for itself and still be responsive for other needs.

You probably want to keep an eye on it and do further testing. Sometimes "stuff happens" in this business, but in my experience that is often a precursor to "bad stuff happens," so if it were me, I'd be playing it a little on the paranoid side.
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
jgreco said:
I imagine that the SMART status is due to the IPMI reset.

The SMART short test was initiated about 6-1/2 hours prior and usually takes only a few minutes, so the system must have been unresponsive for a long time.
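(As an aside, the drive itself reports how long a self-test should take and whether one is still in progress; the device name is just an example:)

Code:
# Recommended polling times for short/extended self-tests
smartctl -c /dev/ada0

# Current self-test execution status
smartctl -a /dev/ada0 | grep -A 1 'Self-test execution'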

Definitely more testing then. Thanks.
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
I've changed the title because it happened again, and now I can only boot after a long wait.

I was stressing the drives this time using iozone. It's the most intensive test I could find; the system had passed all the others perfectly. However, this time the web interface became unresponsive, and then I could no longer log in by SSH. I reset again, but this time I captured the following.

It's about 15 minutes' worth of zio->io_cv messages, followed by a few minutes of tx->tx_sync_done_cv messages. It then boots and runs fine.

I looked these up online and didn't find much. It sort of looks like the system is retrying failed writes, i.e. writes that were not completed when I hit reset, and it's finishing the remainder and syncing up the drives. If so, that's a good thing.
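(If it wedges like that again while I can still get a shell, my plan is to look at where the kernel threads are blocked; a minimal sketch, assuming the stock FreeBSD tools:)

Code:
# Kernel stack traces for all threads; waiters on zio->io_cv show up in zio_wait
procstat -kk -a | grep -B 1 zio_wait

# Per-disk busy % and latency, to see whether one drive is stalling the pool
gstat -p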

What is a bad thing, though, is that my system becomes unresponsive via the web interface and over SSH under moderate load. I did get a console over IPMI; it was indicating the interface was unset. I tried a few commands and eventually it locked up as well.

The CPU is loaded during these tests, but there should still be capacity to run the web interface and other services. As you can see, it's 40-50% idle, so it should still have plenty of resources. (I've restarted the test and it's still responding so far.)
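(For reference, this is roughly how I've been watching it during the tests; nothing FreeNAS-specific here:)

Code:
# Per-thread view including kernel threads, to see where the CPU time is going
top -SH

# Extended disk statistics every 5 seconds
iostat -x 5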

This is strange; I don't know what component is failing, or whether it's a failure at all. I've tested all the components individually.
 

Attachments

  • zio-io_cv.jpg
  • tx->tx_sync_done_cv.jpg
  • CPU usage.png