BUILD Advice and comment welcome on my proposed FreeNAS build

Status
Not open for further replies.

trionic

Explorer
Joined
May 1, 2014
Messages
98
be sure to read the manual.
If the enclosure came with one...

I'll read the LSI SAS controller manual instead.

Can I go wrong though if I just connect all eight SATA tails to all eight LSI SATA ports?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If the enclosure came with one...

I'll read the LSI SAS controller manual instead.

Can I go wrong though if I just connect all eight SATA tails to all eight LSI SATA ports?

No, don't think there's a problem. I imagine they're using an LSI 36-channel expander (that popular Intel expander uses a 24-channel LSI expander chip), so you should have a straightforward "8 channels in, 24 out" setup.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Just taken delivery of six 4TB Western Digital Red (WD40EFRX) drives and two 3TB Western Digital Red (WD30EFRX) drives. With the drives already in stock, and the drives recycled once data has been transferred, that should give me enough for three VDEVs: one 6x4TB RAID-Z2 and two 6x3TB RAID-Z2.
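
As a rough back-of-envelope check on that layout: RAID-Z2 gives up two drives' worth of space per vdev to parity, so usable space is roughly (drives - 2) x drive size. The sketch below assumes decimal TB and ignores ZFS metadata, padding and free-space headroom, so real figures will come out lower. The function name is made up for illustration.

```python
def raidz2_usable_tb(drives: int, size_tb: float) -> float:
    """Approximate usable capacity of one RAID-Z2 vdev, in decimal TB.

    Two drives' worth of space goes to parity. ZFS metadata, allocation
    padding and free-space headroom are ignored here.
    """
    if drives <= 2:
        raise ValueError("a RAID-Z2 vdev needs more than two drives")
    return (drives - 2) * size_tb


# Layout from the post: one 6x4TB vdev and two 6x3TB vdevs.
layout = [(6, 4.0), (6, 3.0), (6, 3.0)]
total_tb = sum(raidz2_usable_tb(n, size) for n, size in layout)
print(f"approx. usable: {total_tb:.0f} TB before overhead")
```

The same function can be reused for other vdev widths by changing the `layout` list.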

I have tested the in-stock drives and will test the eight new drives during the next few days.

Here's how to do (b): link
Just successfully flashed the LSI controller to IT mode. Thanks for the link, pbucher!

Now... shall I flash the Supermicro motherboard BIOS or leave sleeping dogs alone?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Just taken delivery of six 4TB Western Digital Red (WD40EFRX) drives and two 3TB Western Digital Red (WD30EFRX) drives. With the drives already in stock, and the drives recycled once data has been transferred, that should give me enough for three VDEVs: one 6x4TB RAID-Z2 and two 6x3TB RAID-Z2.

I have tested the in-stock drives and will test the eight new drives during the next few days.


Just successfully flashed the LSI controller to IT mode. Thanks for the link, pbucher!

Now... shall I flash the Supermicro motherboard BIOS or leave sleeping dogs alone?

My standard procedure is to flash new motherboards to the latest available BIOS (this way, if something goes wrong, exchanges are easier) before they're put into production use. After they're in production, they're only updated on an as-needed basis, if problems arise.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Thanks. Sounds sensible and that is what I will do.

On a fan/heatsink related note: the Supermicro SNK-P0050AP4 cooler does comfortably fit heightwise within the XCase RM424 Pro case.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Supermicro motherboard BIOS successfully upgraded to v7.05. The process to add custom applications to Ultimate Boot CD was too much of a faff and so I grabbed a simple FreeDOS image from here:
http://goebelmeier.de/bootstick/

Simply copy the extracted contents of X9SRH3_705 BIOS.zip into the flash/ directory, read the readme, boot-up and follow the instructions.

I have also catalogued all disks which will make up the first two VDEVs. I scanned the drive lids (yes really) so that I know exactly what's in the server. I added the models and serial numbers for all drives to a simple spreadsheet. Onto that I will record drive tests executed and their results.

I executed WDIDLE3.exe against all twelve drives. I used the Ultimate Boot CD image, run from a USB stick. However, nothing I tried would get the utility to work on drives attached to the NAS server, even when the drives were connected to the Supermicro board's SATA ports. A different consumer-grade PC worked fine.

As expected, all Reds were set to 300 seconds and all Greens to 8 seconds. However, the "recertified" drive sent as a 3.0TB Red RMA replacement had its idle value at 8 seconds. Did I get a refurbed Green?

Next up is to dd and badblocks test all eight new hard disks. Need to make a decision on the power supply because this Zippy is really noisy. Then I might be ready for a FreeNAS dry run on physical hardware (for weeks I have been messing around with FreeNAS in VirtualBox).

Must return that M1015 too...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Supermicro motherboard BIOS successfully upgraded to v7.05. The process to add custom applications to Ultimate Boot CD was too much of a faff and so I grabbed a simple FreeDOS image from here:
http://goebelmeier.de/bootstick/

I have also catalogued all disks which will make up the first two VDEVs. I scanned the drive lids (yes really) so that I know exactly what's in the server. I added the models and serial numbers for all drives to a simple spreadsheet. Onto that I will record drive tests executed and their results.

I executed WDIDLE3.exe against all twelve drives. I used the Ultimate Boot CD image, run from a USB stick. However, nothing I tried would get the utility to work on drives attached to the NAS server, even when the drives were connected to the Supermicro board's SATA ports. A different consumer-grade PC worked fine.

As expected, all Reds were set to 300 seconds and all Greens to 8 seconds. However, the "recertified" drive sent as a 3.0TB Red RMA replacement had its idle value at 8 seconds. Did I get a refurbed Green?

Next up is to dd and badblocks test all eight new hard disks. Need to make a decision on the power supply because this Zippy is really noisy. Then I might be ready for a FreeNAS dry run on physical hardware (for weeks I have been messing around with FreeNAS in VirtualBox).

The refurb (disgusting practice, replacing infant mortality victims with refurbs...) could be one of the Reds from that period when they came with the timer set at 8s (around December/January).
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Could be.

I am puzzled by the fan behaviour on this case/mobo combo. Fan behaviour is controlled via IPMI instead of the BIOS or a 3rd party application, yet the chassis fan speed does not increase in line with hard disk temperatures. This is not good when I see hard disk temperatures hitting 40 deg C.

In IPMIView's IPM Device tab I see five settings for "Fan speed mode": Standard, Optimal, Full, Heavy IO and PUE Opt. Only Standard, Optimal and Full are enabled and selectable. Full sounds like a jet engine taking off and there is precisely zero fan speed difference between Standard and Optimal.

There's no shortage of fan power here. On full fan speed the "sheet of paper" test passes. You could get a plank of timber to stick to the front of the case. There's just nothing in between "inadequate" and "antisocial".

SMART long tests are running overnight (9 hours for a 4TB disk!). I'll leave the fan on full and see how the temperatures are in the morning.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Getting the fans to be regulated by the HDD temperature is definitely not trivial, unfortunately...

Isn't there an option to manually set the PWM duty cycle for the fans?
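
For illustration only, this is the sort of mapping people usually have in mind when they talk about regulating fans by HDD temperature: take the hottest drive and scale the duty cycle linearly between two thresholds. The thresholds and duty limits below are assumptions picked for the example, not Supermicro values, and the function name is hypothetical. Actually pushing a duty cycle to the BMC is board-specific and is deliberately left out.

```python
def fan_duty_percent(hdd_temps_c, low_c=35.0, high_c=45.0,
                     min_duty=30.0, max_duty=100.0):
    """Map the hottest drive temperature to a PWM duty cycle in percent.

    Below low_c the fans idle at min_duty, above high_c they run at
    max_duty, and in between the duty rises linearly.
    """
    if not hdd_temps_c:
        # No readings available: fail safe by running the fans flat out.
        return max_duty
    hottest = max(hdd_temps_c)
    if hottest <= low_c:
        return min_duty
    if hottest >= high_c:
        return max_duty
    fraction = (hottest - low_c) / (high_c - low_c)
    return min_duty + fraction * (max_duty - min_duty)


# Example: one drive at 40 deg C sits halfway between the thresholds.
print(fan_duty_percent([33, 36, 40]))
```

Returning full speed when there are no readings is a deliberate choice: a broken temperature probe should make the box loud, not hot.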
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, Supermicro boards regulate fan speed by CPU temp. Does it suck? Yes. Basically I set them to medium speed and leave them there year-round.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Of course... in IPMIView there is no sensor data for hard disk temperature, thus no regulation possible.

Can't see any way to manually set the PWM duty cycle.

I set them to medium speed and leave them there year-round.
cyberjock, did you use IPMIView for that? To which fan speed mode does "medium speed" correspond? Any idea why two of the fan speed mode options are disabled?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I set the fan speed in the BIOS. I know my board (X9SCM-F) has like 10 speed options that have varying fan speeds for a given temperature. Then there's something like 2 or 3 settings that let you just force a fan speed regardless of temp. I did the middle of the road since my fans at 100% draw an extra 100W!
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
Couldn't find any whiff of fan speed control in the X9SRH-7F's BIOS.

Most fan stuff is controlled via the IPMI for the Supermicro boards I've got (all in the X9 series). Check out this thread about fooling around with fans & IPMI.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Spent a fun evening recovering a Western Digital 3.0TB Green hard disk with its heads stuck to the platters. Removed the lid (no, I don't have a clean room), unseized the heads by gently applying torque to the motor spindle, and while turning the platter stack returned the heads to the landing area. Removed the inevitable few specks of dust that had landed on the platter (using the end of a hair and a light breeze from a camera lens dust cleaner thing), replaced the lid and off we go.

The drive was from a MyBook enclosure and thus is transparently encrypted (whether you ask for it or not). I used ddrescue (from a SystemRescue CD boot USB) to create a first copy from the raw encrypted drive, then created a second copy from the first copy (I now have two encrypted copies, one as a backup).

Hook the USB bridge adapter to the second encrypted copy, use cp to copy the unencrypted files to a first unencrypted copy and then finally create a second unencrypted copy from the first unencrypted copy (I now have two unencrypted copies, one as a backup). Now when I mount the drive under Windoze, if it mangles file structures I have an unencrypted backup to rely on.

ddrescue reported just 93Kb of errors after cloning the original drive. Not bad considering the potential for total drive loss when attempting this particular repair. ddrescue rocks.

A quick mount of the unencrypted disk under Linux reveals boot image, partition table and root directory structures all intact. I'll know more when I have a chance to properly inspect the second unencrypted copy which will be finished in a few hours.

Well worth doing. I was ready to chuck the drive in the trash but thought I may as well try repairing it. I was prepared even to attempt a head swap but that turned out not to be necessary. So I read everything I could find about unseizing heads from platters, watched loads of YouTube videos and finally gave it a go.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Just bought another six Western Digital 4TB drives and returned an infant mortality for replacement. Now the first vdev will begin at 12x4TB RAIDZ2. The drives are undergoing testing now.

Also, since a recent lightning storm destroyed my router, I'll be shopping for a decent ethernet surge suppressor. The APC SmartUPS has one built in but it's only 100Mbit. APC sell 1Gbit modular surge suppressors for use in a 19-inch chassis. I suspect that I may be able to replace the SmartUPS's built-in suppressor with the innards from a 1Gbit module.

For anyone who buys an old RS232-based APC SmartUPS, be aware that a standard RS232 cable will not work. In order to make customers buy an expensive accessory, APC monkeyed with the pin-outs on the RS232 connector, thus making it proprietary. The cables retail for £35 but pattern-parts are available. I got lucky and bought the factory accessory from eBay for £3.

One of these days I'll finish this project.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
That £3 eBay serial cable did not work because it was a 940-0020 instead of a 940-0024c, which is what the APCSMART driver expects to see. With a 940-0024c cable the UPS is working fine, although due to loading and output quality I will replace it with a SmartUPS 3000VA.

The server chassis is now fully populated with 24 Western Digital 4TB Red hard disks.

All is well apart from the SMART error email which I just received. I now have a failing Western Digital 4TB Red (WD40EFRX-68WT0N0) which I will replace with a tested cold spare. I have just ordered two new cold spares, which are on their way (I decided that having just one cold spare on the shelf was inadequate now that the ZFS pool has expanded to fill the chassis).
Code:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  1004
  3 Spin_Up_Time  0x0027  177  176  021  Pre-fail  Always  -  8116
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  12
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3888
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  12
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  11
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  40
194 Temperature_Celsius  0x0022  117  111  000  Old_age  Always  -  35
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  200  200  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  198  198  000  Old_age  Offline  -  1187
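
A minimal sketch of how a table like this could be checked programmatically, assuming the whitespace-separated ten-column layout shown above. smartctl treats an attribute as failed when its normalised VALUE is at or below THRESH, so that is what the helper looks for. The sample rows are copied from the output above, and the function names are made up for illustration.

```python
SAMPLE = """\
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  1004
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  198  198  000  Old_age  Offline  -  1187
"""


def parse_attributes(text):
    """Parse SMART attribute rows into a dict keyed by attribute ID."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and anything that is not a data row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],  # kept as text; some drives append extra detail
        }
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose normalised value has reached the threshold."""
    return [a["name"] for a in rows.values()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


attrs = parse_attributes(SAMPLE)
print(at_or_below_threshold(attrs))
```

For the rows above the list comes back empty: no attribute has crossed its threshold even though the extended self-test reported a read failure, so the self-test log needs checking as well as the attribute table.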


Code:
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  10%  3887  3135531688


From /var/log/messages:
Code:
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 c2 ac 98 e0 00 00 00 40 00 00 length 32768 SMID 760 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 b3 f2 05 20 00 00 01 00 00 00 length 131072 SMID 362 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 b3 f2 06 20 00 00 01 00 00 00 length 131072 SMID 645 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): WRITE(16). CDB: 8a 00 00 00 00 01 71 b1 84 88 00 00 00 08 00 00 length 4096 SMID 1023 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 b3 f2 04 20 00 00 01 00 00 00
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): CAM status: SCSI Status Error
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): SCSI status: Check Condition
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): Info: 0x1b3f20477
Mar  8 13:46:44 zfs kernel: (da13:mps0:0:21:0): Error 5, Unretryable error
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 b3 f2 06 20 00 00 01 00 00 00 length 131072 SMID 354 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 c2 ac 98 e0 00 00 00 40 00 00 length 32768 SMID 850 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 c2 ac 99 20 00 00 00 80 00 00 length 65536 SMID 193 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 604 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00 length 8192 SMID 444 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00 length 8192 SMID 147 terminated ioc 804b scsi 0 state 0 xfer 0
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): READ(16). CDB: 88 00 00 00 00 01 b3 f2 05 20 00 00 01 00 00 00
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): CAM status: SCSI Status Error
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): SCSI status: Check Condition
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): Info: 0x1b3f20597
Mar  8 13:46:50 zfs kernel: (da13:mps0:0:21:0): Error 5, Unretryable error
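
For anyone wanting to read those CDB lines: in a READ(16)/WRITE(16) command, bytes 2-9 are the starting LBA and bytes 10-13 are the transfer length in blocks, both big-endian. The sketch below decodes the failing READ(16) from the log and checks that the LBA in the "Info:" line falls inside it. It assumes 512-byte logical sectors, as reported by smartctl for this drive. The function name is hypothetical.

```python
def decode_rw16(cdb_hex: str):
    """Return (start_lba, block_count) from a READ(16)/WRITE(16) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("expected a 16-byte READ(16) or WRITE(16) CDB")
    start_lba = int.from_bytes(cdb[2:10], "big")
    block_count = int.from_bytes(cdb[10:14], "big")
    return start_lba, block_count


# First failing command and its "Info:" value, copied from the log above.
lba, blocks = decode_rw16("88 00 00 00 00 01 b3 f2 04 20 00 00 01 00 00 00")
info_lba = 0x1B3F20477

print(f"request: LBA {lba} for {blocks} blocks ({blocks * 512} bytes)")
print(f"reported LBA is {info_lba - lba} blocks into the request")
assert lba <= info_lba < lba + blocks
```

The decoded size (256 blocks x 512 bytes = 131072) matches the "length 131072" that the driver prints for the neighbouring requests of the same size.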


Full output of smartctl -a /dev/da13:
Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  WDC WD40EFRX-68WT0N0
LU WWN Device Id: 5 0014ee 25fea43c0
Firmware Version: 80.00A80
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Sun Mar  8 17:07:50 2015 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
  was aborted by an interrupting command from host.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  ( 113) The previous self-test completed having
  the read element of the test failed.
Total time to complete Offline
data collection:  (54000) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 540) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  1004
  3 Spin_Up_Time  0x0027  177  176  021  Pre-fail  Always  -  8116
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  12
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3888
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  12
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  11
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  40
194 Temperature_Celsius  0x0022  117  111  000  Old_age  Always  -  35
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  200  200  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  198  198  000  Old_age  Offline  -  1187

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  10%  3887  3135531688
# 2  Short offline  Completed without error  00%  3874  -
# 3  Short offline  Completed without error  00%  3851  -
# 4  Short offline  Completed without error  00%  3826  -
# 5  Short offline  Completed without error  00%  3802  -
# 6  Short offline  Completed without error  00%  3778  -
# 7  Short offline  Completed without error  00%  3754  -
# 8  Short offline  Completed without error  00%  3730  -
# 9  Short offline  Completed without error  00%  3706  -
#10  Short offline  Completed without error  00%  3682  -
#11  Short offline  Completed without error  00%  3658  -
#12  Short offline  Completed without error  00%  3636  -
#13  Short offline  Completed without error  00%  3612  -
#14  Short offline  Completed without error  00%  3589  -
#15  Short offline  Completed without error  00%  3565  -
#16  Extended offline  Completed without error  00%  3552  -
#17  Short offline  Completed without error  00%  3540  -
#18  Short offline  Completed without error  00%  3516  -
#19  Short offline  Completed without error  00%  3492  -
#20  Short offline  Completed without error  00%  3468  -
#21  Short offline  Completed without error  00%  3444  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


All drives were fine at the time of the last scrub, on Sat 7th March. I just hope more drives don't fail during the resilver.

How many cold spares do other folk keep on the shelf?
 