SOLVED: FreeNAS reporting errors in alert system, pool status is clear


Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
FreeNAS is reporting errors on 2 disks in the alert system, but a zpool status check shows no errors. These 2 disks are a mirrored pair that only holds my jails (Plex, CouchPotato, SickRage, SABnzbd, and Transmission). The SABnzbd jail pushes a lot of traffic through a VPN connection pretty regularly, at about 100-120 Mb/s down, and I host a fair amount of Plex traffic outside of my network, with 3-5 concurrent streams at about 25-30 Mb/s up. There is some additional history that may be relevant to the issue:
1) I was getting lots of disk errors on all disks connected to the LSI 9211-8i (the 2 mirrored drives in question plus the 6-disk RAIDZ2 pool).
2) Ran a full memory test; it came back clean.
3) Changed the cables between the LSI card and the drives; still getting errors.
4) Moved the LSI card to another PCIe slot on the motherboard; still getting errors.
5) Replaced the LSI card. No errors are reported on the RAIDZ2 volume anymore, but errors are still being thrown in the alert system for the mirrored volume, while nothing shows in the volume status or in zpool status in the terminal. I should note that before the card was changed, errors appeared in the alert system, in the volume status, and in zpool status.
6) Ran scrubs plus short and long SMART tests on all volumes and disks; all clear, with no errors to report (roughly the commands sketched below).
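
For reference, the checks in steps 5 and 6 were done with the standard tools from the FreeNAS shell, something like this (da0 is just an example device name; repeat for each disk):

zpool status -v              # pool health plus per-device read/write/checksum error counters
smartctl -t short /dev/da0   # queue a short SMART self-test
smartctl -t long /dev/da0    # queue a long (full surface) SMART self-test
smartctl -a /dev/da0         # full SMART report, including the self-test results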

Everything has been humming along great since I swapped the LSI card out; it's been flawless other than these annoying errors in the UI. Does anyone have any ideas why I would see errors in the alert system but not in the pool status? It's been driving me crazy.

System specs:
FreeNAS 9.10.2-U6
Dell T5500 motherboard repurposed for the FreeNAS build
Intel Xeon E5507 (2.27 GHz)
48 GB ECC RAM
LSI 9211-8i HBA (flashed to IT mode)
2× 2TB WD RE4, mirrored, serving jails
6× 4TB WD Red, RAIDZ2, serving storage
4× 2TB WD RE4, mirrored and striped, serving an ESXi box via iSCSI
1 Intel GbE NIC on the board
1 four-port Intel GbE NIC (2 ports in LAGG1 for services, 2 ports in LAGG2 for iSCSI to ESXi on a separate VLAN)
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Everything has been humming along great since I swapped the LSI card out; it's been flawless other than these annoying errors in the UI. Does anyone have any ideas why I would see errors in the alert system but not in the pool status? It's been driving me crazy.
What error are you getting?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The pool errors are cleared on reboot, but the FreeNAS GUI errors are not, so I guess you're just seeing "ghost" errors and you need to clear them manually.
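
From the shell, clearing the pool-side counters looks something like this ("tank" is a placeholder for your pool name; the GUI alerts themselves are dismissed from the alert icon in the web UI):

zpool clear tank        # reset the read/write/checksum error counters on every device in the pool
zpool clear tank da0p2  # or clear a single device, using the device name shown by zpool status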
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
  • CRITICAL: Jan. 5, 2018, 12:31 a.m. - Device: /dev/da0 [SAT], ATA error count increased from 2947 to 2948
  • CRITICAL: Jan. 5, 2018, 2:31 a.m. - Device: /dev/da1 [SAT], ATA error count increased from 115 to 116
I ran zpool clear on them, shut down the jails, unmounted the volume, rebooted, remounted it, then rebooted again with the jails set to auto-start.
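
For what it's worth, those alerts come from smartd tracking the drives' own ATA error logs rather than from ZFS, which would explain why zpool status stays clean. The underlying log can be dumped with something like:

smartctl -l error /dev/da0   # print the drive's ATA error log (the count the alert is tracking)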

It also appears that I am getting loads of CAM status errors on all of the drives connected to the LSI Card:
Jan 5 13:24:18 nas01 (da5:mps0:0:5:0): READ(10). CDB: 28 00 77 be 18 58 00 01 00 00
Jan 5 13:24:18 nas01 (da5:mps0:0:5:0): CAM status: SCSI Status Error
Jan 5 13:24:18 nas01 (da5:mps0:0:5:0): SCSI status: Check Condition
Jan 5 13:24:18 nas01 (da5:mps0:0:5:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jan 5 13:24:18 nas01 (da5:mps0:0:5:0): Retrying command (per sense data)
Jan 5 13:24:48 nas01 (da5:mps0:0:5:0): READ(10). CDB: 28 00 77 c0 7f 28 00 00 c0 00
Jan 5 13:24:48 nas01 (da5:mps0:0:5:0): CAM status: SCSI Status Error
Jan 5 13:24:48 nas01 (da5:mps0:0:5:0): SCSI status: Check Condition
Jan 5 13:24:48 nas01 (da5:mps0:0:5:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jan 5 13:24:48 nas01 (da5:mps0:0:5:0): Retrying command (per sense data)
Jan 5 13:24:54 nas01 (da2:mps0:0:2:0): READ(10). CDB: 28 00 77 c1 00 b0 00 00 80 00
Jan 5 13:24:54 nas01 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Jan 5 13:24:54 nas01 (da2:mps0:0:2:0): SCSI status: Check Condition
Jan 5 13:24:54 nas01 (da2:mps0:0:2:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jan 5 13:24:54 nas01 (da2:mps0:0:2:0): Retrying command (per sense data)
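
To check whether the drives themselves are logging these as interface CRC errors, SMART attribute 199 (UDMA_CRC_Error_Count) can be read per drive, for example:

smartctl -A /dev/da5 | grep -i crc   # attribute 199; a rising raw value points at the link (cables/backplane), not the platters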
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Well, those errors indicate that communication between the drive and the system is faulty.
I would start troubleshooting over with the cables. I have seen a kink in the SAS cable between the drive and the card cause CRC errors on a single drive while all the rest were working fine. If a drive gets too many errors, ZFS may drop it from the pool as failed.
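
Before chasing cables, it helps to map each daX device to a physical drive so you know which cable and bay go with which errors; from the shell, something like:

camcontrol devlist                      # list the daX devices with their controller/target positions
smartctl -i /dev/da5 | grep -i serial   # match a device to the serial number printed on the drive label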
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
Could a single cable issue cause CRC errors on all of the drives connected to that card? I only pasted some of the errors, but I'm getting them on every drive connected to that card. I'm only asking because I've already replaced the cables once and want to make sure I don't buy a third set if something else could be the culprit.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
What power supply are you using in this build? Did you move the system board to a different chassis?
Let's have some more detail about this setup, if you will.
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
I'm using a Corsair HX1000i 1000 W power supply (80+ Platinum). The power supply has 12 SATA power cables and those go directly to each of the drives; I'm not using any splitter cables. The power supply is connected to an APC Smart-UPS 1500VA LCD for battery backup.

The board was moved from its original chassis. It started off in a Dell T5500 workstation I had lying around; I moved it into a Rosewill RSV-L4412 for this build.

The 4 other disks we haven't been discussing (those serving ESXi via iSCSI) are connected directly to the motherboard and are not reporting any CRC errors. I'm using a SanDisk Cruzer Glide for the OS in the internal motherboard USB slot.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That should be plenty of power, so that's not likely the answer...
Are the disks actual SAS drives, or SATA drives running over the SAS connection?
Can this system be taken offline for some testing, or is it a critical system for you?
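
(If you're not sure, smartctl can tell you: a SATA drive reports a "SATA Version is:" line, while a true SAS drive reports "Transport protocol: SAS".)

smartctl -i /dev/da0   # identity block: model, serial number, and SATA/SAS transport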
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
They're all SATA drives. I can bring it down for some testing. This is a home system used to serve home services only, not anything used in a professional production environment.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
It started off in a Dell T5500 workstation I had lying around; I moved it into a Rosewill RSV-L4412 for this build.
No one has asked, so I will.
I looked up the chassis model and it shows hot-swap bays as part of the system (is this correct?)
Also, you mention ESXi; does this mean you're running FreeNAS virtualized?
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
Yes, there are hot-swap bays in the chassis. It's not virtualized, however; FreeNAS is the only thing that runs on this box. FreeNAS serves storage to a separate ESXi server over iSCSI.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Yes, there are hot-swap bays in the chassis.
So you mentioned earlier that you have tried new cables for the HBA card, but have you tried switching bays?
Better yet, hook the drives up directly, with the hot-swap backplane out of the mix, and test that out...
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
You got me thinking a little bit more about what's going on at the connection between the power supply and the hot-swap bays. I was so focused on the SATA cables that I wasn't thinking much about the power connections there. So, I removed and reseated all of the power connections at the hot-swap bays. I rebooted the server, and no CAM errors so far. It's been running for 15 minutes; CAM errors would have been thrown at startup, and at the rate I was getting them before, they would have been coming pretty frequently. I assume that when I was swapping in the new card and SATA cables, I jiggled something loose. I'll keep an eye on it in the meantime, but I think we can call this one solved.
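
In the meantime I'm watching the system log for new CAM entries, along these lines:

tail -f /var/log/messages | grep -i cam   # follow the system log, showing only CAM-related lines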

Thanks for your help, Chris & BigDave.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
If the CAM status errors return, my last troubleshooting suggestion would be to check the HBA heatsink/chipset temperature, as it may be getting too hot in the new enclosure.
I had an HBA that would start wigging out when the server ran with the cover off for too long; it just seemed to need lots of moving air.
 

Louc918

Dabbler
Joined
Jul 31, 2015
Messages
19
I'll check for that. I've been using this build for some time; only recently did I have the issue with disk errors. The case stays surprisingly cool considering the stress I put on this system: the WD Reds rarely get above 32 °C during a scrub, and the RE4s get to 34 °C during a scrub. The HBA card is pretty close to an exhaust fan, so there should be some pretty good airflow moving through. I'll probably tackle a new build this year considering the age of this system. Thanks again for your support.
 