BUILD SuperMicro X10SRL-F + 3 846 Chassis + 72 Disks

Status
Not open for further replies.

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
I wasn't able to flash the firmware using the GUI, so I used the xflash command line utility instead.

You just run these 2 commands:

stusas2flash.JPG


Here's a zip file with the xflash command and the latest ROM image for the SAS2 expander.

http://www.cstone.net/~dk/SMCTools.zip

Place the folder directly off the C: drive (yes, Windows) and add it to your PATH.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Finally completed copying all my media files over to the FreeNAS server. I then moved the 10 4TB drives over and added them to the existing pool. I trust these drives, as I have been running SMART tests against them on a regular basis and have moved a lot of data around on them without any issues.

So here's what the LSI controller looks like now:

chas1-0-14.PNG

chas1-15-23.PNG

chas2-0-14.PNG

chas2-15-23.PNG

chas3-0-1.PNG


Ahhh, lots of space now:

50diskvolumeoverview.PNG


Currently in the process of copying 41TiB of data from the 'media1' datastore over to the 'media' one. I have 8 media directories within the datastore, and I'm just doing a cp -v -R source target from an ssh session. I figured that will be faster than dragging and dropping the folders from my Windows machine over CIFS.

Not sure of the exact speed, but I kicked it off about 2 hours ago and have 3.6TiB copied over so far, so a little under 2 TiB per hour it would seem.
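
As a rough check on that estimate, here is a minimal sketch of the arithmetic. It assumes the "about 2 hours" is exactly 2.0 hours and that the copy rate stays constant for the rest of the run, neither of which is guaranteed; the function names are made up for illustration.

```python
def copy_rate_tib_per_hour(copied_tib, elapsed_hours):
    """Average copy rate so far, in TiB per hour."""
    return copied_tib / elapsed_hours


def hours_remaining(total_tib, copied_tib, rate_tib_per_hour):
    """Time left if the average rate holds (an assumption)."""
    return (total_tib - copied_tib) / rate_tib_per_hour


# Figures from the post: 41 TiB total, 3.6 TiB copied in roughly 2 hours.
rate = copy_rate_tib_per_hour(3.6, 2.0)
remaining = hours_remaining(41.0, 3.6, rate)
mib_per_s = rate * 1024 * 1024 / 3600  # TiB/h -> MiB/s

print(f"rate: {rate:.2f} TiB/h ({mib_per_s:.0f} MiB/s)")
print(f"time left at this rate: {remaining:.1f} h")
```

With those inputs the average works out to 1.8 TiB per hour, which is consistent with "a little under 2 TiB per hour".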

It's a good workout for the drives, which is also my new avatar. :D

50disk-01.JPG
 
Joined
Oct 2, 2014
Messages
925
mhmmmmmm, if you have a money tree in your back yard, can you send some seeds my way? or cash, cash works too. It's soooo purty!
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Lol, I sure wish I had a money tree in my back yard! While the X10 mobo, cpu and ram are new, everything else is recycled from my old servers, or purchased / traded on eBay (like my IBM 1015s, sold in exchange for LSI 9200s and 9211).

So I do see a couple of things in my log that concern me. Here are the entries since first booting it up with all 50 drives attached:

Code:
Jun 11 17:22:49 freenas smartd[20837]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 17:53:52 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 18:23:53 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 18:53:53 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 19:23:51 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 19:53:51 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 20:23:51 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 20:53:50 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 21:23:54 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 21:53:52 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 22:23:44 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): READ(10). CDB: 28 00 56 96 a4 d8 00 00 40 00
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): CAM status: SCSI Status Error
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): SCSI status: Check Condition
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): Info: 0x5696a517
Jun 11 22:41:09 freenas (da26:mps2:0:8:0): Error 5, Unretryable error
Jun 11 22:53:52 freenas smartd[20922]: Device: /dev/da42 [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.


Corresponding SMART stats:

Code:
[root@freenas] ~# smartctl -A /dev/da42

ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  117  099  006  Pre-fail  Always  -  147085816
  3 Spin_Up_Time  0x0003  091  091  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  099  099  020  Old_age  Always  -  2021
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000f  062  060  030  Pre-fail  Always  -  1705807
  9 Power_On_Hours  0x0032  081  081  000  Old_age  Always  -  17260
 10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  95
183 Runtime_Bad_Block  0x0032  099  099  000  Old_age  Always  -  1
184 End-to-End_Error  0x0032  098  098  099  Old_age  Always  FAILING_NOW 2
187 Reported_Uncorrect  0x0032  100  100  000  Old_age  Always  -  0
188 Command_Timeout  0x0032  100  009  000  Old_age  Always  -  0 0 113
189 High_Fly_Writes  0x003a  095  095  000  Old_age  Always  -  5
190 Airflow_Temperature_Cel 0x0022  073  044  045  Old_age  Always  In_the_past 27 (0 47 27 22 0)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  41
193 Load_Cycle_Count  0x0032  095  095  000  Old_age  Always  -  11561
194 Temperature_Celsius  0x0022  027  056  000  Old_age  Always  -  27 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0010  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x003e  200  175  000  Old_age  Always  -  307
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  6109h+19m+30.683s
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  9236204700
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  5187621965

[root@freenas] ~# smartctl -A /dev/da26

ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  8
  3 Spin_Up_Time  0x0027  161  154  021  Pre-fail  Always  -  8950
  4 Start_Stop_Count  0x0032  099  099  000  Old_age  Always  -  1052
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  037  037  000  Old_age  Always  -  46599
 10 Spin_Retry_Count  0x0032  100  100  000  Old_age  Always  -  0
11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  108
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  106
193 Load_Cycle_Count  0x0032  196  196  000  Old_age  Always  -  14089
194 Temperature_Celsius  0x0022  119  102  000  Old_age  Always  -  33
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  200  200  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  1
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0


I'll be keeping an eye on these drives for sure and already have replacements ready to go, should they start showing additional errors.
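
For anyone wanting to pick the problem rows out of `smartctl -A` output automatically, here is a minimal sketch. It assumes the column layout shown above (ID, name, flag, VALUE, WORST, THRESH, type, updated, WHEN_FAILED, raw value) and flags a row when WHEN_FAILED is not "-" or when the normalized VALUE is at or below a non-zero THRESH. The function name and the sample string are illustrative only; the sample rows are copied from the da42 output.

```python
def failing_attributes(smartctl_output):
    """Return (id, name, value, thresh, when_failed) for suspicious rows."""
    flagged = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header line and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name = int(fields[0]), fields[1]
        value, thresh = int(fields[3]), int(fields[5])
        when_failed = fields[8]
        # A threshold of 0 means the attribute can never trip.
        if when_failed != "-" or (thresh > 0 and value <= thresh):
            flagged.append((attr_id, name, value, thresh, when_failed))
    return flagged


# A few rows from the da42 output in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
184 End-to-End_Error  0x0032  098  098  099  Old_age  Always  FAILING_NOW 2
190 Airflow_Temperature_Cel 0x0022  073  044  045  Old_age  Always  In_the_past 27 (0 47 27 22 0)
199 UDMA_CRC_Error_Count  0x003e  200  175  000  Old_age  Always  -  307
"""

for row in failing_attributes(SAMPLE):
    print(row)
```

On the sample rows this flags attribute 184 (currently failing) and attribute 190 (failed in the past). It does not look at raw values, so something like the raw UDMA_CRC_Error_Count of 307 still has to be judged by eye.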
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Got another one:

Code:
Jun 12 00:00:48 freenas  (da38:mps2:0:20:0): WRITE(10). CDB: 2a 00 46 bc 16 d8 00 00 08 00 length 4096 SMID 845 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 12 00:00:48 freenas  (da38:mps2:0:20:0): READ(10). CDB: 28 00 03 fc 28 a8 00 00 20 00 length 16384 SMID 436 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 12 00:00:48 freenas  (da38:mps2:0:20:0): READ(10). CDB: 28 00 03 fc 28 f0 00 00 20 00 length 16384 SMID 367 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): READ(10). CDB: 28 00 03 fc 28 40 00 00 20 00
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): CAM status: SCSI Status Error
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): SCSI status: Check Condition
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): Info: 0x3fc2841
Jun 12 00:00:48 freenas (da38:mps2:0:20:0): Error 5, Unretryable error


SMART:

Code:
[root@freenas] ~/scripts# smartctl -A /dev/da38

ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000b  099  099  016  Pre-fail  Always  -  2
  2 Throughput_Performance  0x0005  133  133  054  Pre-fail  Offline  -  91
  3 Spin_Up_Time  0x0007  131  131  024  Pre-fail  Always  -  437 (Average 437)
  4 Start_Stop_Count  0x0012  100  100  000  Old_age  Always  -  3960
  5 Reallocated_Sector_Ct  0x0033  100  100  005  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000b  100  100  067  Pre-fail  Always  -  0
  8 Seek_Time_Performance  0x0005  135  135  020  Pre-fail  Offline  -  26
  9 Power_On_Hours  0x0012  096  096  000  Old_age  Always  -  30291
 10 Spin_Retry_Count  0x0013  100  100  060  Pre-fail  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  185
192 Power-Off_Retract_Count 0x0032  097  097  000  Old_age  Always  -  4288
193 Load_Cycle_Count  0x0012  097  097  000  Old_age  Always  -  4288
194 Temperature_Celsius  0x0002  193  193  000  Old_age  Always  -  31 (Min/Max 14/39)
196 Reallocated_Event_Count 0x0032  100  100  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0022  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0008  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x000a  200  200  000  Old_age  Always  -  0


Interestingly enough, these last 2 errors are both from me copying over my TV episodes. I wonder if there were just some junk files in the original source that got copied into the temp datastore? The TV episodes hadn't been copied yet when I ran the initial scrub.
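
The CDB bytes in those kernel messages can be decoded to see which blocks each failed command covered. This is a minimal sketch assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control) and assuming the "Info:" value is the LBA of the failing block, which is how it is typically reported for a medium error. The function name is illustrative.

```python
def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB into (operation, start LBA, block count)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    op = "READ(10)" if cdb[0] == 0x28 else "WRITE(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return op, lba, blocks


# (device, CDB, Info value) taken from the log entries in this thread.
errors = [
    ("da26", "28 00 56 96 a4 d8 00 00 40 00", 0x5696A517),
    ("da38", "28 00 03 fc 28 40 00 00 20 00", 0x3FC2841),
]

for dev, cdb_hex, info in errors:
    op, lba, blocks = decode_rw10(cdb_hex)
    offset = info - lba
    print(f"{dev}: {op} start LBA {lba:#x}, {blocks} blocks, "
          f"failing block at offset {offset}")
```

Both Info values fall inside the range their command requested (offset 63 of 64 blocks on da26, offset 1 of 32 on da38). The "length 16384" logged for the 32-block reads also implies 512-byte logical sectors.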
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Why would you trade an M1015 for an SAS9211? With the M1015 crossflashed, they're the same thing.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
The SAS9211 has its SFF-8087 connectors closer to the backplane, and they face it. The IBM1015 has its SFF-8087 connectors furthest away from the backplane, and they face up. So the SAS9211 makes for slightly neater cable routing. :D
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That makes sense. I assumed something similar was the reason. :p
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I did the same thing and got a 9211 because it fits better in the 846.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Got 4 spare 2TB Hitachi drives in today. They are the newer design with 6Gbps interfaces. $49.95 each with free shipping and a 1 year warranty.

hitachi2tbs.JPG


Given what I have seen in my logs over the last 24 hours, these are going through a full suite of testing as I type this, before they go into the production server, replacing the ones that are throwing SMART errors.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
While the 4 new drives are running badblocks on another rig I set up, the copy of all my media over to the new datastore completed.

So all my data should now be evenly spread across all 50 disks. I kicked off another scrub, and it seems to be supporting that notion.

scrub50disks.PNG


So the scrub is running at 1.84G/s. Not too shabby I think.

Prior to the next scrub, I'll add those 2nd SFF-8087/8 cables to each backplane SAS2 expander to see if it makes a difference one way or another.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
So my 4 spare 2TB drives all completed the burn-in testing with no issues. I popped in the first drive to replace the one with the most SMART errors, then went to storage -> volume -> select volume -> volume status. I clicked the drive I wanted to replace, verified the serial number, and then chose the replacement disk from the drop down.

So now the resilver is trucking along at 2.2G/s. My question is, once it is done, I'd like to pull out the old drive and insert the new one in its place. I can just do that live, right? The volume will go into degraded status for just a short while until it picks up on the disk change, correct?

The alternative would be to shut the whole system down, but I'd rather keep it online and just do a hot swap.

Is there a risk doing it the hot swap way?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Hum... I think the proper way to do that is to offline the drive first in the GUI but wait for a confirmation ;)

BTW, I hate hot-swapping, so many problems...
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
I'm following the steps outlined here in the online user guide:

8.1.11. Replacing Drives to Grow a ZFS Pool

It suggests that the old drive will automatically be offlined once the resilver completes.

I suppose the safe next step would be to shut down the system, pull out the old drive, move the new one into its place, and power back up. Just curious if anyone does this last step without shutting down the system.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yep, but I was talking about moving the new drive without offlining it first, not sure it's a great thing to do.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Ah gotcha. I thought you were referring to the old one.

So the safest approach would be:

1. Let resilver complete which will offline old disk
2. Offline new disk
3. Shut down server
4. Pull old drive and insert new drive in its place
5. Power back up
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
No, if you shut down the server you don't need to offline it, you can shuffle all the drives without a problem if you want :)
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
(Side note - When you ordered the Supermicro 846's from Mr Rackable, where did you get the drive screws?)
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
@depasseg:

http://www.ebay.com/itm/321127593861?_trksid=p2060353.m2749.l2649&ssPageName=STRK:MEBIDX:IT

Ok, I'll probably just do it with the system running then.

So what's the difference between offlining a good disk vs. just pulling it out and inserting it in another bay?

The resilvering is flying along:

Code:
pool: v1
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 15 17:38:32 2015
26.3T scanned out of 55.0T at 2.64G/s, 3h5m to go
652G resilvered, 47.77% done
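
The "3h5m to go" figure can be reproduced from the other numbers in that status output. This is a minimal sketch assuming the T and G suffixes are binary multiples (1 T = 1024 G) and that the scan rate stays constant; the function name is illustrative.

```python
def resilver_eta(scanned_t, total_t, rate_g_per_s):
    """Return (hours, minutes) left, assuming a constant scan rate."""
    remaining_g = (total_t - scanned_t) * 1024  # assumed binary units
    seconds = remaining_g / rate_g_per_s
    hours, rem = divmod(int(seconds), 3600)
    return hours, rem // 60


# Figures from the status output: 26.3T of 55.0T scanned at 2.64G/s.
hours, minutes = resilver_eta(26.3, 55.0, 2.64)
print(f"{hours}h{minutes}m to go")
```

This gives 3h5m, matching the status output. Treating the units as decimal (1 T = 1000 G) gives about 3h1m instead, which is why the binary assumption looks like the right one here.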
 