HELP! vm_fault and swap_pager errors!

danjng · Jul 31, 2017

After seeing these errors, the GUI dies and all my VMs seem to die as well (which pull the Docker containers down too). During these errors, it seems that one of my Backup pool drives (not the one that contains the VMs) goes into a degraded state and says it was removed by an administrator (I did not do any sort of removal). Just the other day, the server became unresponsive (maybe due to the same issue) and after a hard reset, it went through the resilvering process which completed without any issues.

Anyone have any thoughts or ideas? I wouldn't think that a bad drive would cause a catastrophic failure like this, so there's got to be something else. zpool status reports all other drives and pools to be OK (with the exception of the Backup pool, of course). It doesn't seem to be shutting down either. I can still SSH to it and issue the shutdown command, but it's not doing anything. It might be in the middle of a scrub - would that make it unable to shut down?

Thanks for any help!!

Edit: If the Backup pool has some sort of swap or something critical to operations on it, but it is a mirrored pool, should this happen if only one drive goes bad?

Logs:

Code:

Jul 31 10:24:02 sandcrawler swap_pager: I/O error - pagein failed; blkno 525333,size 4096, error 6
Jul 31 10:24:02 sandcrawler vm_fault: pager read error, pid 1934 (daemon)
Jul 31 10:24:02 sandcrawler swap_pager: I/O error - pagein failed; blkno 525332,size 4096, error 6
Jul 31 10:24:02 sandcrawler vm_fault: pager read error, pid 1934 (daemon)
Jul 31 10:24:02 sandcrawler kernel: Failed to fully fault in a core file segment at VA 0x800622000 with size 0x23000 to be written at offset 0x4000 for process daemon
Jul 31 10:24:02 sandcrawler kernel: Failed to fully fault in a core file segment at VA 0x800622000 with size 0x23000 to be written at offset 0x4000 for process daemon
Jul 31 10:24:02 sandcrawler swap_pager: I/O error - pagein failed; blkno 525333,size 4096, error 6
Jul 31 10:24:02 sandcrawler vm_fault: pager read error, pid 1934 (daemon)
Jul 31 10:24:02 sandcrawler kernel: Failed to fully fault in a core file segment at VA 0x800dcb000 with size 0x19000 to be written at offset 0x36000 for process daemon
Jul 31 10:24:02 sandcrawler kernel: Failed to fully fault in a core file segment at VA 0x800dcb000 with size 0x19000 to be written at offset 0x36000 for process daemon
Jul 31 10:24:02 sandcrawler kernel: pid 1934 (daemon), uid 0: exited on signal 11 (core dumped)

Stux · Jul 31, 2017

Swap is spread across all drives. When a drive fails and you have swap on it, you crash. Dumb. I know.

See this for a Work around
https://forums.freenas.org/index.ph...ny-used-swap-to-prevent-kernel-crashes.46206/

danb35 · Jul 31, 2017

danjng said:
During these errors, it seems that one of my Backup pool drives (not the one that contains the VMs) goes into a degraded state and says it was removed by an administrator (I did not do any sort of removal).

This, I think, is likely to be the core of your problem. By default, every disk that's part of a pool has a 2 GB swap partition created on it, and all those swap partitions are then used by the system--and currently, swap is not mirrored. If swap is being used (which shouldn't be the case in normal operation with adequate resources), and the disk on which the swap is located drops offline, you'll see those errors.

danjng · Jul 31, 2017

I really appreciate the quick responses! This certainly sheds some light on things. So even in a raid1 mirror, swap is not mirrored?

I'm not sure why swap is in use for me in the first place - I've got 16GB RAM.

I will try that script when my system comes back up. First boot after hard shutdown was throwing these crazy errors talking about something being already queued but I wasn't smart enough to jot it down, so I'm letting it 'rest'...

danb35 · Jul 31, 2017

danjng said:
even in a raid0 mirror, swap is not mirrored?

Correct, swap is not mirrored. And a mirror is RAID 1, not RAID 0.

danjng · Jul 31, 2017

danb35 said:
Correct, swap is not mirrored. And a mirror is RAID 1, not RAID 0.

Oops. My mistake! I stand corrected.

I'm curious why it's designed that way. I don't wanna open a new can on this thread but if there's a short answer, I'd really appreciate it!

So far so good. It's back up and running now. Resilvering again. Script is not yielding anything yet since the swap is not being used (yet). I'm planning on doing some smart tests when the resilvering is done. Any other ways I can diagnose the drive? I guess pull it out and see if it still does the same thing?

I'm trying to think of the root cause. The only thing I've changed recently is getting a UPS with USB connection so that FreeNAS knows when to shut down gracefully. Any experience or any chance that it's due to that?

Sent from my iPhone using Tapatalk

Stux · Jul 31, 2017

Dodgy Sata/power cables/cabling

danjng · Jul 31, 2017

Stux said:
Dodgy Sata/power cables/cabling

Thanks! I haven't touched the internals in months, but I'll shut 'er down for a bit and do a quick dusting and cable check when I get a chance.

danjng · Jul 31, 2017

So just before the crash this time, I got a snapshot error for my Farm pool and then a complete failure of the Farm pool due to IO failures.

The screen continuously scrolls:

Code:

Can't allocate CCB to send AC_BUS_RESET

and

Code:

(noperiph:ahcich15:0:-1:ffffffff):rescan already queued

Does this change anything? Am I experiencing two drive failures or is the failure of one of the Backup pool drives causing the IO failures in the Farm pool?

SMART did not report anything indicative of failure or impending doom.

I've been running that script every hour and so far every hour I've been getting a little here and there:

Code:

2017-07-31 16:00:00: Paging in 672 KiBs on /dev/ada7p1.eli
2017-07-31 16:00:00: Paging in 1320 KiBs on /dev/ada3p1.eli
2017-07-31 16:00:00: Paging in 1096 KiBs on /dev/ada1p1.eli
2017-07-31 16:00:00: Paging in 1176 KiBs on /dev/ada0p1.eli
2017-07-31 16:00:00: Paging in 1104 KiBs on /dev/ada5p1.eli
2017-07-31 16:00:00: Paging in 1272 KiBs on /dev/ada4p1.eli
2017-07-31 16:00:00: Paging in 1280 KiBs on /dev/ada2p1.eli
2017-07-31 16:00:01: Paging in 832 KiBs on /dev/ada6p1.eli

Should I run it every 10 minutes instead? This catastrophic error happened at 4:40 before anything could be caught by the cron job.

Any additional thoughts or suggestion are welcome. I do still have to check the cabling and stuff.

Thanks!!!!

danjng · Jul 31, 2017

I have yet to investigate, but could a faulty PCIx RAID card cause this?

Stux · Jul 31, 2017

You need to fix your HD issues. I do run the script every 10 minutes... but it really is a band aid, based on the idea that swap doesn't happen *that* often, and if you ensure that there's only swap for x minutes... then the chances of an hd failure occurring at the same time as swap being present are reduced.

BUT you need to fix your HD failure issue.

SATA cables go bad. Especially cheaper ones. My advise, is to forget about your swap related crashes, and focus on resolving your HD issues.

danjng · Aug 1, 2017

Stux said:
You need to fix your HD issues. I do run the script every 10 minutes... but it really is a band aid, based on the idea that swap doesn't happen *that* often, and if you ensure that there's only swap for x minutes... then the chances of an hd failure occurring at the same time as swap being present are reduced.

BUT you need to fix your HD failure issue.

SATA cables go bad. Especially cheaper ones. My advise, is to forget about your swap related crashes, and focus on resolving your HD issues.

Good call on the cable checking. Looks like one of them was just a tiny bit crooked and also folded almost 175 degrees on one end. I replaced the cable and so far so good. I'll give it another day to see if that actually did the trick. I've also increased the recurrence of the job so that it's done every 15 minutes, so hopefully that helps.

I'll update with the long term health, but so far it looks healthy. Thanks, @Stux and @danb35, for the help and suggestions!!

Ericloewe · Aug 1, 2017

danjng said:
I'm curious why it's designed that way. I don't wanna open a new can on this thread but if there's a short answer, I'd really appreciate it!

Because nobody really cared back in the day. It is inelegant and nasty, so it's scheduled to be changed.

It was changed for Corral, but we all know how that ended...

danjng · Aug 1, 2017

So we're about 13 hours in and I just saw my first set of errors. No catastrophic failure just yet, but are these errors indicative of drive failures? And what sort of test would be definitive for this? I had run a short SMART test and it passed just fine. Running a long one now and so I'll know the results in the morning (hopefully, if it's still up). I've never had a drive die on me before and I'm a little bit surprised since this is a WD Red 5400RPM, but I suppose that this is around the time it dies? It's got 15,700 hours on its clock (although I don't remember buying this drive 2 years ago...). As usual, any thoughts or suggestions are helpful. I'll be looking out for the next 4TB deal...

Code:

Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 b8 d3 49 40 7e 01 00 00 00 00
Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): Retrying command
Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 b8 d3 49 40 7e 01 00 00 00 00
Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:22 sandcrawler (ada3:ahcich13:0:0:0): Retrying command
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 b8 d3 49 40 7e 01 00 00 00 00
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): Retrying command
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 68 60 e8 82 40 28 00 00 00 00 00
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): Retrying command
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 68 88 10 0c 40 21 01 00 00 00 00
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:23 sandcrawler (ada3:ahcich13:0:0:0): Retrying command
Aug  1 20:09:23 sandcrawler (aprobe0:ahcich13:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Aug  1 20:09:23 sandcrawler (aprobe0:ahcich13:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug  1 20:09:23 sandcrawler (aprobe0:ahcich13:0:0:0): Retrying command
Aug  1 20:09:55 sandcrawler ahcich13: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Aug  1 20:10:25 sandcrawler ahcich13: Timeout on slot 6 port 0
Aug  1 20:10:25 sandcrawler ahcich13: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd 80 serr 00000000 cmd 0000e617
Aug  1 20:10:25 sandcrawler (aprobe0:ahcich13:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Aug  1 20:10:25 sandcrawler (aprobe0:ahcich13:0:0:0): CAM status: Command timeout
Aug  1 20:10:25 sandcrawler (aprobe0:ahcich13:0:0:0): Error 5, Retries exhausted

Stux · Aug 1, 2017

They're indicative of a faulty sata cable.

They indicate that between the SATA port and the HD the data got corrupted.

It could be the SATA port, the cable or the HD.

Guess which it normally is ;)

danjng · Aug 1, 2017

Stux said:
They're indicative of a faulty sata cable.

They indicate that between the SATA port and the HD the data got corrupted.

It could be the SATA port, the cable or the HD.

Guess which it normally is ;)

Hey, I really appreciate your attention on this. People like you make the community a great place to grow and learn. You da real MVP.

That said, I'll try some more cables and see what happens. Any recommendations on quality brands? I only have unlabeled SATA cables left. I did some research and they were saying that a SATA cable is a SATA cable whether it's labeled SATA2 or 3. The only difference is the quality. Is that true in your experience?

I guess I can always try swapping SATA ports as well and see if I can narrow down the issue.

Thanks!

Sent from my iPhone using Tapatalk

Stux · Aug 1, 2017

Yes. SATA2 or 3 have the same electrical specifications, so as long as the cable is actually electrically compliant to SATA1/2 it should work with 3. BUT if the cable is not,it might just 'work' with SATA2 but not 3.

What is more important is the quality/thickness/construction of the cable and connectors. Unfortunately I have no brand advise on this, except to say that I recently had similar issues, and tossed a bunch of sata cables with thinner plastic on the connectors... and used the Supermicro cables that I've been collecting from their motherboards.

I wish it was easier to buy quality.

Stux · Aug 1, 2017

danjng said:
I'm trying to think of the root cause. The only thing I've changed recently is getting a UPS with USB connection so that FreeNAS knows when to shut down gracefully. Any experience or any chance that it's due to that?

zero chance. More likely it happened when you moved the case.

danjng · Aug 6, 2017

UPDATE

So it's actually the SATA port on the motherboard itself. Sucks royally, but at least I've found the issue. I am using a port on my SATA card instead and so far so good. Seemed to be getting worse, so I am considering upgrading motherboards or maybe a case that isn't so cramped.

Thanks again for all the support!

danjng · Aug 10, 2017

So everything is definitely running better now, but about once every two days, my iohyve VMs will crash with an error stating that it ran out of swap space. Is this because I keep paging in all the swap into physical memory?

Also, if I do turn this cron job off, my swap usage climbs and climbs. After about a day or so of turning it off, the VMs haven't crashed, but I'm at 337MB of swap usage right now. On a 16GB machine, is that anything to worry about?

Important Announcement for the TrueNAS Community.

HELP! vm_fault and swap_pager errors!

Explorer

MVP

Hall of Famer

Explorer

Hall of Famer

Explorer

MVP

Explorer

Explorer

Explorer

MVP

Explorer

Server Wrangler

Explorer

MVP

Explorer

MVP

MVP

Explorer

Explorer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "HELP! vm_fault and swap_pager errors!"

Similar threads