CAM Status Errors with only spinning disks not SSD?

Status
Not open for further replies.

Bezerker

Dabbler
Joined
Jan 14, 2017
Messages
20
Hello all,

I'm still troubleshooting my build. As mentioned previously, I have the following hardware:
Code:
Chassis: Supermicro Superchassis SC216E16-R800LPB
Backplane/Expander: Supermicro SAS/SATA Backplane BPN-SAS2-216EL1 with SAS2 expander 
Motherboard: TYAN S7012 (S7012GM4NR) Motherboard
CPU: Dual Intel Xeon E5645 2.4GHz 6-core CPUs (12 cores total)
Memory: 64GB DDR3 ECC RAM
HBA: LSI 9211-8i JBOD HBA card (appears to be a Dell H200 reflashed to show up as an Avago card)
PSU: Dual Supermicro 800W power supplies (I have 4 total and tried all 4.)


sas2flash reports the following for my HBA and sees all drives:
Code:
Adapter Selected is a LSI SAS: SAS2008(B2)

		Controller Number			  : 0
		Controller					 : SAS2008(B2)
		PCI Address					: 00:04:00:00
		SAS Address					: 5848f69-0-ecb3-4d00
		NVDATA Version (Default)	   : 14.01.00.08
		NVDATA Version (Persistent)	: 14.01.00.08
		Firmware Product ID			: 0x2213 (IT)
		Firmware Version			   : 20.00.07.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : SAS9211-8i
		BIOS Version				   : 07.39.02.00
		UEFI BSD Version			   : N/A
		FCODE Version				  : N/A
		Board Name					 : SAS9211-8i
		Board Assembly				 : N/A
		Board Tracer Number			: N/A


I have the following disks in this machine currently:
Code:
6x Micron M500 960GB SSD 
8x WDC WD7500BPKT-2 1A01
8x ST4000LM016-1N21 0003


I have 3 pools set up for testing:

Code:
ssdpool: 6x Micron M500s in 3 mirrored vdevs
8x WDC WD7500s in 4 mirrored vdevs
8x ST4000s in a raidz2


I am running FreeNAS-9.10.2-U1 (86c7ef5).

The problem I am having is that both spinning-disk pools (no matter the configuration; I've tried raw stripes too) throw all sorts of disk errors when used by ZFS. The WD7500s will throw errors similar to these (http://pastebin.com/cq3gi61m) on any disk in the pool when reading from or writing to the pool. In early testing with only 4 disks at a time, I saw no errors; the issue consistently appeared only once I added more than 4 disks to a pool. They are completely unusable in my system. The ST4000s (4TB) are newer, and while they throw the above errors less often, they will randomly drop out of the raidz with errors and resilver. SMART detects increasing counts of Current_Pending_Sector and Offline_Uncorrectable errors that do not go away until the disk is zeroed out. Either way, both pools of spinning disks are completely unusable in this state due to I/O issues.

What is really strange is that I am able to use my SSD pool without any issue at all. I have NFS sharing against it and am able to run iozone against it without problems. I never experience any errors with this pool.


I am able to dd to and from each disk, as well as run badblocks against them, without issue or error. In fact, I am currently re-zeroing all the disks at the same time (16 spinners at 100% busy with writes). I only get these errors when trying to use the disks in pools.

One of the ST4000s DID report the following errors in its internal disk log at the time, according to SMART: http://pastebin.com/wyV7RrfE

I have tested the following on the hardware end:
Code:
I have attempted to use another LSI controller I had lying around, a 9260-4i; in JBOD mode I get similar (though obviously different) styles of errors from the WD7500s. During this test phase I also tried another OS (a Linux ISO) and experienced the same issues.

I have swapped the backplane/expander for an identical model from eBay. Issues remain.

I have tried 2 other PSUs, individually one at a time, and in pairs.

I have verified with a multimeter that the rail voltages are in line with spec. Properly checking them all would mean disconnecting each Molex connector (the backplane takes 4), so I only tested the two unused, unconnected ones. No voltage drops or deviation, however, even under load.

Swapped PCI-e slots for the card. I've verified the bus itself is capable and fine by fully using my 10G card, and the SSD pool obviously works.

Replaced the SFF-8087 cable between the controller and backplane, and tried different ports on both. One concern I have: the cable is exactly 1m long. Could the backplane traces push the total length just over spec? But wouldn't the SATA SSDs also fail if that were the problem?


As of now, I feel I've replaced nearly everything hardware-wise and have to wonder what else is left. I'm going to test the Molex connectors more in depth for power-related issues, but I expect them to check out fine.


It should be noted that I've tested the disks in various enclosure slots, and no matter what, the same errors occur.

I would appreciate any and all help people can recommend. This is making me insane. :)

Thanks everyone in advance,

Bezerker
 

BetYourBottom

Contributor
Joined
Nov 26, 2016
Messages
141
While I am sorry that you are having issues, it makes me feel a little better knowing that someone else is going through a similar problem. Hopefully both of our issues will be solved soon (I'm a month and a half into my build and no joy yet).
 

dlavigne

Guest
Have you created a bug report that includes a debug file? If so, what is the ticket number?
 

Bezerker

I have not yet created a bug report that includes a debug file, but I will do so. I don't inherently suspect a FreeNAS bug itself; I suspect something hardware related.
 

Bezerker

Further updates:

The following errors appeared in the dmesg log from the raidz2 pool (Seagate 4TBs) while syncing data from my old Linux NAS to the pool via rsync:
http://pastebin.com/WYUEwtNJ


Again, these SMART errors appear and I start seeing disks drop out/fail (my raidz2 is now faulted).

And again, all disks can be fully badblocks'd and dd'd, which clears the SMART errors and reports no bad sectors.
 

Bezerker

Info on my pool that shows the above spinning-disk errors:

Code:
 
	NAME											STATE	 READ WRITE CKSUM
		largepool									   DEGRADED	 0   506	 0
		  raidz2-0									  DEGRADED	 0   201	 0
			gptid/036d0236-e8a4-11e6-9098-00e081c5ab78  ONLINE	   3   233	 0
			gptid/0454228b-e8a4-11e6-9098-00e081c5ab78  ONLINE	   0	 0	 0
			gptid/053a924b-e8a4-11e6-9098-00e081c5ab78  ONLINE	   0	 0	 0
			gptid/06317ca7-e8a4-11e6-9098-00e081c5ab78  FAULTED	  9 32.4K	 0  too many errors
			gptid/0717dab9-e8a4-11e6-9098-00e081c5ab78  ONLINE	   0	 0	 0
			6734745724954972106						 REMOVED	  0	 0	 0  was /dev/gptid/0801c564-e8a4-11e6-9098-00e081c5ab78
			gptid/08f1b8e5-e8a4-11e6-9098-00e081c5ab78  ONLINE	   0	 0	 0
			gptid/09eb1af9-e8a4-11e6-9098-00e081c5ab78  ONLINE	   0	 0	 0


Correct me if I am wrong, but CKSUM errors indicate that the data on disk != what the system thinks it wrote, right?
 

Bezerker

Ok, I have an update regarding this:

It appears that this only occurs with NCQ enabled. I disabled NCQ on the 750GB drives with "camcontrol tags da[X] -N 1" and the problem was solved. I can now use these disks.

This is not normal, however, and clearly points to an underlying issue.
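For anyone following along, here is a sketch of setting and checking the tag depth on a single device. The device name da2 is hypothetical (enumerate real ones with camcontrol devlist), and the CAM variable is just an indirection I added so the commands can be dry-run with CAM=echo:

```shell
#!/bin/sh
# Sketch, untested on your exact setup: set and then display the CAM tag
# depth for one device. "da2" is a placeholder device name.
CAM=${CAM:-camcontrol}

set_tag_depth() {            # usage: set_tag_depth <device> <depth>
    $CAM tags "$1" -N "$2"   # -N 1 = one outstanding command, NCQ effectively off
    $CAM tags "$1" -v        # show the resulting openings/queue counters
}

# Example: set_tag_depth da2 1
```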
 

Bezerker

Confirmed. Further reading finds that there is a known issue with SATA NCQ on the SAS2X36 expander and various drive firmwares. Not all of them, just most of them.

I'm disabling NCQ across the board with camcontrol on all the pass devices just to be safe, but does anyone know if there's a way to do it via loader.conf or anything? In theory, hot-swapping causes the tag count to be reset to 255 every time a disk returns.
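A one-shot sketch of "across the board" could look like the following, assuming you build the device list yourself from camcontrol devlist (the da names below are placeholders, and CAM=echo gives a dry run):

```shell
#!/bin/sh
# Sketch: force tag depth 1 (NCQ effectively off) on every named device.
CAM=${CAM:-camcontrol}

disable_ncq() {
    for d in "$@"; do
        $CAM tags "$d" -N 1
    done
}

# Example: disable_ncq da0 da1 da2 da3
```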
 

Bezerker

BAM. Figured it out.

Why was I seeing this only with spinners? The ada driver turns off queueing for SSDs by default.

kern.cam.sort_io_queue
 

Bezerker

Thanks! It turns out the kernel tuning options didn't actually solve it, but camcontrol tags -N 1 did. Unfortunately, that setting is lost on hot-swap if I change disks or anything, so I wrote a startup script that sets it in a while loop every second, plus a cron job that ensures the script is always running.

(My concern was that I'd start seeing ATA resets during a disk replacement. It's probably overkill, but it doesn't seem to impact performance at all.)
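A minimal sketch of the startup watchdog described above: reapply tag depth 1 every second so a hot-swapped disk doesn't come back with its queue depth reset to 255. The DISKS list, the 1-second interval, and the CAM dry-run indirection are all my assumptions; adapt them to your pool members:

```shell
#!/bin/sh
# Sketch: periodically pin every listed device to tag depth 1.
CAM=${CAM:-camcontrol}
DISKS=${DISKS:-"da0 da1 da2 da3"}   # placeholder list; see `camcontrol devlist`

pin_tags() {
    for d in $DISKS; do
        # A disk that is mid-swap may not exist yet; ignore those errors.
        $CAM tags "$d" -N 1 2>/dev/null || true
    done
}

# Uncomment to run forever, as the cron-supervised script does:
# while :; do pin_tags; sleep 1; done
```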
 