SOLVED Possible incompatibility with Supermicro?

Status
Not open for further replies.

macxs

Dabbler
Joined
Nov 7, 2013
Messages
21
Hi Folks,

I have been running 2 FreeNAS systems for 6 months now. They are identical except for the HDDs:
Supermicro X10DRH LN4
2x Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
64 GB RAM
Avago MPT SAS3 SAS Controller (v8.25.00.00) with expander backplane (the controller seems to be LSI3008-IR, FW revision 10.00.03.00-IR)
HDDs are all SATA
[edit: latest FreeNAS 11.1]

System 1 has 6x 6TB WD black (WD6001FZWX),
in system 2 (backup) there are 12 older disks:
6x 3TB WD3000F9YZ and
6x 2TB (2x WD2003FYYS, 2x Hitachi HUS723020ALS640, 2x Hitachi HUA723020ALA640)

System 1 is used for NAS and SAN storage. System 2 for backup (zfs send).

Both systems run fine until they are put under some load. Then the zpools become degraded because some disks fault:
Code:
NAME                                            STATE     READ WRITE CKSUM
tank6x6                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    gptid/36dc3ffc-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/3794a11f-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/384b3fed-9fbe-11e7-964a-ac1f6b2067f0  FAULTED      7     1     0  too many errors
    gptid/38fe5371-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/39b445ef-9fbe-11e7-964a-ac1f6b2067f0  FAULTED      6    81     0  too many errors
    gptid/3a6df18f-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
  raidz2-1                                      ONLINE       0     0     0
    gptid/3b5af8ea-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/3c221515-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/3d147697-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/3f147bb8-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/41e7d938-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0
    gptid/43896bdb-9fbe-11e7-964a-ac1f6b2067f0  ONLINE       0     0     0


...even with data corruption!

The disks that are faulting are SMART-OK:
Code:
~ # glabel status
...
gptid/384b3fed-9fbe-11e7-964a-ac1f6b2067f0	 N/A  da2p2
...
gptid/39b445ef-9fbe-11e7-964a-ac1f6b2067f0	 N/A  da4p2
...

Code:
~ # smartctl -a /dev/da2
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
# 1  Short offline	   Completed without error	   00%	 24823		 -
...


Now the strange thing:
First I thought it was a problem with the WD Blacks, as they were in system 1 and at least 2 of them were constantly faulting under higher load. I ordered 2 new disks (WD Red Pro, I think, but I'm not sure), replaced the 2 faulting disks one after another, and resilvered. During resilvering it got worse, with data errors...
I then moved the config to the boot HDDs and swapped the boot HDDs between system 1 and system 2. On system 2 (now logically system 1) I switched the former backup datasets to read/write and used this system as the storage system. As you can see in the first example, it now also faults under higher load. There were no zpool errors while it was the backup system. Since it became the primary system and started taking load, HDDs have been faulting at random across the pool: sometimes some of the 2TB drives, and after a zpool clear, some other drives. Most of the time 2 disks are faulting. Errors also occur during resilver, probably because of the higher load.

Please let me know if you need additional data.

Thank you!

Bye!
Marco
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
While IR mode should work, ideally the controllers should be in IT mode. They should also be on the latest firmware. See if that helps.
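
To make the two checks concrete (IR vs. IT, and whether the firmware is behind), here is a small sketch that works on a revision string in the form quoted in the first post, `10.00.03.00-IR`. The function names are hypothetical, the `-IT` suffix is an assumption by analogy with `-IR`, and the target version you compare against has to come from the vendor's download page; the `16.00.00.00` in the comment is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse a revision string you supply.

# Report the firmware type from a string such as "10.00.03.00-IR".
fw_mode() {
    case "$1" in
        *-IR) echo "IR" ;;
        *-IT) echo "IT" ;;   # assumed suffix for IT firmware
        *)    echo "unknown" ;;
    esac
}

# Exit 0 if dotted version $1 is older than dotted version $2.
# Any "-IR"/"-IT" suffix is ignored for the comparison.
fw_older_than() {
    awk -v a="${1%%-*}" -v b="${2%%-*}" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        for (i = 1; i <= (n > m ? n : m); i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # equal versions are not "older"
    }'
}

# Example (16.00.00.00 is a placeholder, not a known release):
# fw_mode "10.00.03.00-IR"
# fw_older_than "10.00.03.00-IR" "16.00.00.00" && echo "update available"
```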
 

macxs

Dabbler
Joined
Nov 7, 2013
Messages
21
OK, I should flash the LSI3008 to IT-FW. Would that be the solution?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
It won't hurt, in any event. Older versions of the 2008/2308 firmware have been known to cause problems, so it's possible that's what's going on with you as well.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
macxs said: "Avago MPT SAS3 SAS Controller (v8.25.00.00) with expander backplane (the controller seems to be LSI3008-IR, FW revision 10.00.03.00-IR)"
The first thing I thought when reading this was that the IR firmware would cause problems. I agree with @danb35 , that should be the first thing to change but it may not be the only problem.
How are the drives connected to the SAS controller?
 