Citadel - Build Plan and Log

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Woke up this morning and almost had a heart attack. smartd had logged Temperature_Celsius at numbers of 110+, it took a few minutes to figure out that I was seeing an attribute value, not the actual temperature...
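For anyone else tripped up by this, here is a minimal Python sketch of telling the two numbers apart. It assumes the ten-column row layout that smartctl prints in its attribute table (the sample row is taken from the drive output posted later in this thread). The helper names `parse_smart_attribute` and `temperature_celsius` are made up for illustration, and reading the first number of the raw field as degrees Celsius is an assumption that holds for this drive but is vendor-specific.

```python
def parse_smart_attribute(line):
    """Split one row of the smartctl attribute table into named fields."""
    # RAW_VALUE can contain spaces, e.g. "36 (Min/Max 25/37)", so cap the splits.
    parts = line.split(None, 9)
    if len(parts) < 10:
        raise ValueError("not a SMART attribute row: %r" % line)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),    # normalized, vendor-scaled, unitless
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": parts[9].strip(),   # raw field; holds the real reading
    }


def temperature_celsius(attr):
    # Assumption: the raw field starts with the current temperature in C.
    return int(attr["raw"].split()[0])


row = "194 Temperature_Celsius 0x0002 180 180 000 Old_age Always - 36 (Min/Max 25/37)"
attr = parse_smart_attribute(row)
print("normalized value:", attr["value"])
print("actual temperature:", temperature_celsius(attr), "C")
```

The normalized VALUE column is only meaningful relative to THRESH; the temperature you care about is in RAW_VALUE.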
 

ramar

Explorer
Joined
Jan 8, 2018
Messages
53
It is highly recommended that you not use a hardware RAID controller
OK, I'll read Uncle Fester's basic ... I think it was the brother of the actor who played Uncle Fester on TV in the 60s that was the manager for the Byrds or the Turtles. I think he had a 60' two-masted schooner.
 

Evertb1

Guru
Joined
May 31, 2016
Messages
700
Woke up this morning and almost had a heart attack. smartd had logged Temperature_Celsius at numbers of 110+, it took a few minutes to figure out that I was seeing an attribute value, not the actual temperature...
Breathe in, breathe out :)
 

ctag

The disks passed the short and extended SMART tests.

I cloned the disk-burnin-and-testing repo, and started the shell script for each in tmux. Should be done sometime this weekend.

And the server just arrived several days ahead of schedule!
[Photo]


I'll check that it boots and get it started on memtest sometime tonight.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I don't know what method you used to insert a photo, but it was not the right one, because that didn't work. Use the 'upload a file' button below the message window.
 

ctag

Ugh, Google Photos strikes again! Should be fixed now, thanks!
 

Chris Moore

If you hit f12 during the BIOS screen, it will bring up a boot menu and there is a built in diagnostics utility you can run.
 

ctag

If you hit f12 during the BIOS screen, it will bring up a boot menu and there is a built in diagnostics utility you can run.
Looks like I need to scrounge up an old graphics card and hook up a monitor first. Will give that a shot when I do though!
 

Chris Moore

Looks like I need to scrounge up an old graphics card and hook up a monitor first. Will give that a shot when I do though!
Also, f2 gets you into the BIOS config screen.
Even an old PCI video card will do.
 

ctag

Approaching the halfway mark for burn-in testing...

I don't think I mentioned it yet, but I picked up two more 8 TB drives. I've been burning in three of them, and once that's done I'll swap them out and run the next three.
[Photo]


A little over 120 hours on the first round of that burn-in script, and badblocks has finally finished. Sometime tomorrow the last extended SMART test will conclude and the script will be done. So far so good.

Oh, yeah, I found a graphics card! So at the same time I've been running memory tests on the server. It passed its own BIOS memtest easily:
[Photo]


And for the past ~120 hours it's been running memtest86+ on loop. I finally stopped it after 20 rounds.
[Photo]


Now I want to switch to CPU/system stress tests for it to churn on while the next three drives are being burned in. To start off, I'm running the Arch Linux stress utility.
[Photo]

The command used is stress --cpu 24 --vm 80 --timeout 28800s. The timeout is so it should automatically finish before I leave for work tomorrow, so I can check on it.
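For reference, 28800 seconds works out to 8 hours. A small Python sketch of building the same command from a duration in hours is below; the helper name `stress_command` is hypothetical, and it only assembles the argument list rather than launching anything.

```python
import shlex


def stress_command(cpu_workers, vm_workers, hours):
    """Build the stress argument list with the timeout given in hours."""
    timeout_s = int(hours * 3600)  # stress accepts a value with an "s" suffix
    return [
        "stress",
        "--cpu", str(cpu_workers),
        "--vm", str(vm_workers),
        "--timeout", f"{timeout_s}s",
    ]


cmd = stress_command(24, 80, 8)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` to actually start the run.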
 

Chris Moore

The timeout is so it should automatically finish before I leave for work tomorrow, so I can check on it.
If that Dell workstation is like the ones that I've used before it could churn away for weeks and not have a problem.

What did your parts list end up including?
 

ctag

If that Dell workstation is like the ones that I've used before it could churn away for weeks and not have a problem.

What did your parts list end up including?
It's been pretty solid so far, but this morning I started up mprime and the CPU temperature jumped to 74C, so I canned it... Should I try replacing the thermal paste or something?

I pretty much got the parts you suggested. The new parts list.

Total cost: $1300
 

Chris Moore

CPU temperature jumped to 74C, so I canned it... Should I try replacing the thermal paste or something?
If nobody else ever replaced the thermal paste, it is probably all dry and crusty. I find that (depending on the brand) it needs replacement every two or three years. If you put some good new paste on, it will probably drop the temp by a good 10 degrees. I did that on one of the systems at work that chews on math problems all day most days. It made a nice difference in max temp and actually prevented that system from thermal throttling. Definitely worth trying.
 

ctag

Hrm, desktop crashed when I started the second batch of hard drive burn-in tests.

I cleaned the surfaces and replaced the thermal paste with some Arctic Silver. CPU temp is still hanging out around 74C while crunching, though. lm-sensors shows the stock high value at 79C, and critical at 89C.

Edit: Temps got up to 81C and I shut mprime down again.
 

Chris Moore

Hrm, desktop crashed when I started the second batch of hard drive burn-in tests.
The Dell system or another computer?
lm-sensors shows the stock high value at 79C, and critical at 89C.
Do you have the side of the case open while this is running?
Is the air duct installed between the front fan and the heatsink?
Did the 120mm case fan at the front kick up to high speed?
 

ctag

The Dell system or another computer?
A different computer, I'm not sure what happened, but I'm giving it another shot.

Do you have the side of the case open while this is running?
Is the air duct installed between the front fan and the heatsink?
Did the 120mm case fan at the front kick up to high speed?
I had been leaving the side off, but I put it back on after applying the new thermal paste. The duct that goes from the front fan to the CPU heatsink is in place, but the fan stayed at 700 RPM the whole time.
 

Chris Moore

but the fan stayed at 700 RPM the whole time.
Those systems are normally very responsive to processor heat. I am not familiar with the test you are using, but under real-world use in Windows, you can run something like Open Hardware Monitor, which will show the temperature per core, and you can watch it thermal throttle before it gets too hot. Also, the fan normally cranks up to a high rate to try to keep it cool. It makes me wonder if there isn't something wrong with the testing utility.
 

ctag

Maybe. It's been a while since I've had a computer that didn't just have automatic fan control under Linux.

There isn't any option for fan settings in the BIOS. I can see the temperatures and fan RPMs easily enough with lm_sensors, but by default the fan RPMs stay low and static. I can manually echo PWM values to hwmon1/pwm1, which is the CPU fan, and it'll spin up, but I haven't been able to get fancontrol to handle it automatically for me yet... Will keep playing with it.
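As a stopgap until fancontrol cooperates, the manual echo could be wrapped in a simple linear fan curve. This is only a rough Python sketch under several assumptions: the full sysfs path `/sys/class/hwmon/hwmon1/pwm1` is a guess based on the hwmon1/pwm1 name above, the 50C starting point and minimum PWM of 80 are arbitrary, and the 79C upper end is just the "high" value lm-sensors reports. On many drivers the matching `pwm1_enable` file also has to be set to 1 (manual) before writes take effect. The function names are made up for illustration.

```python
from pathlib import Path

# Assumption: hwmon1/pwm1 lives under the usual sysfs hwmon directory.
PWM_PATH = Path("/sys/class/hwmon/hwmon1/pwm1")


def temp_to_pwm(temp_c, low_c=50.0, high_c=79.0, min_pwm=80, max_pwm=255):
    """Map a CPU temperature to a PWM duty value with a linear ramp."""
    if temp_c <= low_c:
        return min_pwm
    if temp_c >= high_c:
        return max_pwm
    fraction = (temp_c - low_c) / (high_c - low_c)
    return int(min_pwm + fraction * (max_pwm - min_pwm))


def set_pwm(value, path=PWM_PATH):
    """Write a PWM value (0-255) to the fan control file. Needs root."""
    if not 0 <= value <= 255:
        raise ValueError("PWM value must be between 0 and 255")
    path.write_text(f"{value}\n")


# Show the curve without touching the hardware.
for temp in (40, 60, 74, 81):
    print(temp, "C ->", temp_to_pwm(temp))
```

Calling `set_pwm(temp_to_pwm(current_temp))` in a loop every few seconds would approximate what fancontrol does, minus its safety handling, so treat it as an experiment rather than something to leave running unattended.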
 

ctag

One of the disks in the second batch was lagging way behind in the badblocks progress, so I cancelled it and went to take a look at it. Turns out the short SMART tests succeed, but the extended tests all report "Aborted by host", so something's weird. It also reports '426' as the spin-up time, when the other disks report something like 20.

The bad disk:
Code:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.9.75-1-lts] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD80EFAX-68LHPN0
Serial Number:	7SGM2G3C
LU WWN Device Id: 5 000cca 252c8ac44
Firmware Version: 83.H0A83
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jan 21 15:18:52 2018 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 249)	Self-test routine in progress...
					90% of test remaining.
Total time to complete Offline 
data collection:		 (   93) seconds.
Offline data collection
capabilities:			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 958) minutes.
SCT capabilities:			(0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0004   129   129   054	Old_age   Offline	  -	   113
  3 Spin_Up_Time			0x0007   156   156   024	Pre-fail  Always	   -	   420 (Average 426)
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   43
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000a   100   100   067	Old_age   Always	   -	   0
  8 Seek_Time_Performance   0x0004   128   128   020	Old_age   Offline	  -	   18
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   88
 10 Spin_Retry_Count		0x0012   100   100   060	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   43
 22 Helium_Level			0x0023   100   100   025	Pre-fail  Always	   -	   100
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   78
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   78
194 Temperature_Celsius	 0x0002   180   180   000	Old_age   Always	   -	   36 (Min/Max 25/37)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%		87		 -
# 2  Extended offline	Aborted by host			   90%		23		 -
# 3  Short offline	   Completed without error	   00%		22		 -
# 4  Extended offline	Aborted by host			   90%		 0		 -
# 5  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
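
A check for aborted extended tests could be scripted so it doesn't have to be spotted by eye. The Python sketch below parses self-test log rows like the ones above (copied here with whitespace normalized to spaces) and lists the extended tests that did not complete. The regular expression and function names are my own and only assume the column layout shown in this output; other smartctl versions or drive types may format the log differently.

```python
import re

# Rows copied from the self-test log above, with tabs replaced by spaces.
SELFTEST_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        87         -
# 2  Extended offline    Aborted by host               90%        23         -
# 3  Short offline       Completed without error       00%        22         -
# 4  Extended offline    Aborted by host               90%         0         -
# 5  Short offline       Completed without error       00%         0         -
"""

# Columns: number, description, status, remaining %, lifetime hours, first error LBA.
ROW = re.compile(r"#\s*(\d+)\s+(.*?)\s{2,}(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$")


def parse_selftest_log(text):
    """Return one dict per self-test log row; header lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        num, description, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": description,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


def aborted_extended_tests(entries):
    """Pick out extended tests whose status starts with 'Aborted'."""
    return [
        e for e in entries
        if e["description"].startswith("Extended")
        and e["status"].startswith("Aborted")
    ]


entries = parse_selftest_log(SELFTEST_LOG)
for entry in aborted_extended_tests(entries):
    print(entry["num"], entry["status"], "at", entry["lifetime_hours"], "hours")
```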

 

Chris Moore

I don't know what the host is, but if it is going to 'sleep' in some power-saving mode, that will abort the long test. I always disable all power-saving features before doing any testing. If the disk was not working in lockstep with the others, it probably has some kind of defect. The quickest solution might be to box it back up nice and neat and take it back to the dealer for a refund, then buy another one.
If you can't do that, you should RMA it with WD. Ideally, you want all the disks to have about the same level of performance, and one disk being slower when they are supposed to be the same probably indicates a defect of some kind.
 