A couple of possibly unrelated issues

sirjorj · Nov 2, 2018

I have been running FreeNAS for almost 3 years now and it has been fantastic (my build info is here: https://forums.freenas.org/index.php?threads/my-build.40342/).

However, I have recently noticed some issues that I have some questions about.

1) A few months ago I started noticing the red light on when I logged into the web interface. Clicking on it gave me this message:

Code:

Device: /dev/ada3, 3 Currently unreadable (pending) sectors

When I looked into the volume I saw no evidence that my pool was degraded, so I didn't panic, but I was a little concerned. (I'm running a RAIDZ2 pool, so I can withstand up to two drive failures. Also, I have a replacement drive here, so I could swap that one out at any time.)

2) More recently, I noticed a yellow light as well, telling me my boot volume (two 16 GB USB flash drives in a mirrored pool) was over 80% full. Today I noticed that it had snapshots (or whatever they are called) of every update I did in the last couple of years, so I deleted a bunch of the 9.x ones and got it down to 33%, so that one is remedied.

3) The most alarming issue started today: my NAS has rebooted itself about 5 times. I am guessing 5 because I see info.0 through info.4 in /data/crash, though the most recent is .2, so maybe it overwrites them after 5 and has rebooted more than that. info.2 contains this:

Code:

  Architecture: amd64
  Architecture Version: 1
  Dump Length: 555008
  Blocksize: 512
  Dumptime: Sat Nov  3 01:29:56 2018
  Hostname: midna.local
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.1-STABLE #0 r321665+9902d126c39(freenas/11.1-stable): Tue Aug 21 12:24:37 EDT 2018
	root@nemesis.tn.ixsystems.com:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/
  Panic String: NMI indicates hardware failure
  Dump Parity: 296205855
  Bounds: 2
  Dump Status: good
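
To work out which of the info.N files is actually the newest, the Dumptime field can be compared across them. Below is a minimal Python sketch, assuming every file uses the `Key: value` layout shown above. The helper names `parse_crash_info`, `dump_time` and `newest_dump` are made up for illustration, and on a real system the texts would be read from the files in /data/crash.

```python
from datetime import datetime


def parse_crash_info(text):
    """Parse 'Key: value' lines from a crash info file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # lines without a 'Key: value' shape (e.g. the build path) are skipped
            fields[key] = value.strip()
    return fields


def dump_time(fields):
    """Turn the Dumptime field into a datetime."""
    # Collapse the double space that pads single-digit days ("Nov  3").
    stamp = " ".join(fields["Dumptime"].split())
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def newest_dump(infos):
    """infos maps a file name to its text; return the name with the latest Dumptime."""
    return max(infos, key=lambda name: dump_time(parse_crash_info(infos[name])))


# Abbreviated copy of the info.2 contents posted above.
SAMPLE = """\
  Architecture: amd64
  Dumptime: Sat Nov  3 01:29:56 2018
  Panic String: NMI indicates hardware failure
  Bounds: 2
"""

fields = parse_crash_info(SAMPLE)
print(fields["Panic String"], "at", dump_time(fields))
```

Passing a dict such as `{"info.0": text0, "info.2": text2}` to `newest_dump` returns the name of the most recent dump, which would show whether the numbering has wrapped around.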

I'm not sure how to further debug this but it looks like I may be troubleshooting hardware in the near future. Any advice on how to proceed from here?

Thanks.

sirjorj

garm · Nov 3, 2018

Well, you didn’t tell us what version of FreeNAS you run, but if one of your drives is dying and you are on a version that doesn’t mirror the swap partitions, that could cause unscheduled reboots.

It’s worrying that you don’t have email notifications set up. How about S.M.A.R.T.? Do you have scheduled tests on all drives?

Please post the output of smartctl -a /dev/ada3 in code tags.

sirjorj · Nov 3, 2018

You're right - I forgot the version! Sorry! I am running FreeNAS-11.1-U6

FreeNAS puts swap partitions on the data drives? I didn't realize that. Seems like that could make the system unstable even though the redundant drives are able to maintain data integrity of the pool.

You are right about me being bad about not having email notifications set up! It's been on my todo list since I got this up and running but I never got around to it.

As for SMART, I think I have it scheduled to run. Under Services -> SMART, the check interval is 30. Does that mean it will run or is that set up someplace else?

Here is the output of that command:

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Serial Number:	WD-WXA1D65E3FLK
LU WWN Device Id: 5 0014ee 2b757760d
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5700 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Sat Nov  3 08:48:38 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 ( 4244) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 696) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:			(0x303d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   202   190   021	Pre-fail  Always	   -	   8900
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   42
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24564
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   42
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   20
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   224
194 Temperature_Celsius	 0x0022   120   105   000	Old_age   Always	   -	   32
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   3
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	   240		 -
# 2  Extended offline	Completed: read failure	   80%	   120		 1676686960
# 3  Conveyance offline  Completed without error	   00%	   118		 -
# 4  Short offline	   Completed without error	   00%	   118		 -
# 5  Conveyance offline  Completed without error	   00%	   116		 -
# 6  Short offline	   Completed without error	   00%	   116		 -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

kdragon75 · Nov 3, 2018

sirjorj said:
Panic String: NMI indicates hardware failure

I'm no expert, but with only this information I would need to see the full hardware specs (model numbers are required) to make any guesses.

sirjorj · Nov 3, 2018

That is all here: https://forums.freenas.org/index.php?threads/my-build.40342/

kdragon75 · Nov 3, 2018

Can you log into your motherboard's management page and check for logs indicating any memory errors? It could also be a bad PSU.

sirjorj · Nov 3, 2018

Yeah, that would be a good idea. I haven't messed with IPMI for a while so I will have to relearn how to do that. I will give that a try today.

Thanks for the suggestions!

garm · Nov 3, 2018

Okay, in 11.1-U6 the swap should be mirrored, so the reboot is probably something else. Having said that:

The latest long SMART test was run on that drive at hour 240; it’s now at 24k hours. You also had a read failure at 120 hours. You need to run long SMART tests.
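
The gap between the drive's current Power_On_Hours and the LifeTime(hours) of its last extended test can be computed straight from the smartctl -a text. Below is a minimal Python sketch, assuming the column layout shown in the output above. `power_on_hours`, `long_tests` and `hours_since_last_long_test` are hypothetical helpers, not part of smartmontools.

```python
import re

# Matches rows of the self-test log such as
# "# 1  Extended offline  Completed without error  00%  240  -"
LONG_TEST = re.compile(
    r"^#\s*\d+\s+Extended offline\s+(?P<status>.+?)\s+(?P<remaining>\d+)%"
    r"\s+(?P<lifetime>\d+)\s+(?P<lba>\S+)\s*$"
)


def power_on_hours(report):
    """Return the raw Power_On_Hours value (attribute ID 9)."""
    for line in report.splitlines():
        cols = line.split()
        if len(cols) >= 10 and cols[0] == "9" and cols[1] == "Power_On_Hours":
            return int(cols[9])
    raise ValueError("Power_On_Hours not found")


def long_tests(report):
    """Yield (status, lifetime_hours) for each extended self-test, in log order."""
    for line in report.splitlines():
        match = LONG_TEST.match(line)
        if match:
            yield match["status"], int(match["lifetime"])


def hours_since_last_long_test(report):
    """Hours between the first-listed (newest) extended test and now, or None."""
    tests = list(long_tests(report))
    if not tests:
        return None  # no extended test in the log at all
    return power_on_hours(report) - tests[0][1]


# Lines copied from the smartctl output earlier in the thread.
SAMPLE = """\
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24564
# 1  Extended offline    Completed without error       00%       240         -
# 2  Extended offline    Completed: read failure       80%       120         1676686960
# 3  Conveyance offline  Completed without error       00%       118         -
"""

print(hours_since_last_long_test(SAMPLE))  # 24564 - 240 = 24324
```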

What about scrubs? Do you do those? What is the output of zpool status -v?

sirjorj · Nov 3, 2018

Heh. While I was looking at the back of the server to find the IPMI port, I noticed that while the server was running, the fan in the power supply wasn't! I am going to replace the power supply. And maybe this time I can find a higher-efficiency one in the power/form factor that I need, because when I built this, I could only go up to Gold.

Thanks for the feedback and I will let you know if this fixes it.

sirjorj · Nov 3, 2018

In digging into this, I'm pretty convinced that the PS fan is not the issue: it spins a bit when the system powers up and I can spin it easily with a can of compressed air, so I'm pretty sure that the PS will just spin up the fan when it needs it.

The zpool status command resulted in this:

Code:

% zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:03:07 with 0 errors on Fri Nov  2 15:07:06 2018
config:

	NAME		STATE	 READ WRITE CKSUM
	freenas-boot  ONLINE	   0	 0	 0
	  mirror-0  ONLINE	   0	 0	 0
		da0p2   ONLINE	   0	 0	 0
		da1p2   ONLINE	   0	 0	 0

errors: No known data errors

  pool: pool
 state: ONLINE
  scan: scrub repaired 1.53M in 0 days 09:14:37 with 0 errors on Sun Oct 14 09:14:38 2018
config:

	NAME											STATE	 READ WRITE CKSUM
	pool											ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/0d26a770-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0
		gptid/0e499a4c-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0
		gptid/0f727e28-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0
		gptid/1099d0da-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0
		gptid/11b51065-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0
		gptid/12dc6451-bf04-11e5-aa17-002590f1c118  ONLINE	   0	 0	 0

errors: No known data errors

danb35 · Nov 3, 2018

sirjorj said:
scrub repaired 1.53M

That generally isn't a good thing...
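
For anyone who wants to watch for this automatically, the scan line of zpool status can be checked for a nonzero repaired amount. Below is a rough Python sketch, assuming the scan line keeps the wording shown in the output above (it differs between ZFS versions). `parse_scrub_line` is a made-up name.

```python
import re

SCRUB = re.compile(
    r"scan: scrub repaired (?P<repaired>\S+) in (?P<duration>.+?) "
    r"with (?P<errors>\d+) errors on (?P<finished>.+)$"
)


def parse_scrub_line(line):
    """Return a dict describing a finished scrub, or None if the line does not match."""
    match = SCRUB.search(line.strip())
    if not match:
        return None
    info = match.groupdict()
    info["errors"] = int(info["errors"])
    # "0" means nothing needed fixing; anything else (e.g. "1.53M") means the
    # scrub found bad data and repaired it from the pool's redundancy.
    info["needed_repair"] = info["repaired"] not in ("0", "0B")
    return info


# The two scan lines from the zpool status output above.
BOOT = "  scan: scrub repaired 0 in 0 days 00:03:07 with 0 errors on Fri Nov  2 15:07:06 2018"
POOL = "  scan: scrub repaired 1.53M in 0 days 09:14:37 with 0 errors on Sun Oct 14 09:14:38 2018"

for name, line in (("freenas-boot", BOOT), ("pool", POOL)):
    info = parse_scrub_line(line)
    print(name, info["repaired"], "needs a look" if info["needed_repair"] else "clean")
```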

sirjorj said:
Does that mean it will run or is that set up someplace else?

No, it doesn't. http://doc.freenas.org/11/tasks.html#s-m-a-r-t-tests

sirjorj · Nov 4, 2018

So I unracked the server, opened it up, blew out the dust, and reseated the RAM SODIMMs. I then cleared the IPMI logs and just left the thing running on my table to see when it would reboot again. It has been running for over a day just fine, so I'm starting to wonder if that was just a fluke.

As for the drive, how do you tell when a drive needs to be replaced? Under Tasks -> SMART Tests, I now have a long one running every month and a short one running about every week. I ssh-ed into the machine and ran the smartctl command on a few of the drives in it and really didn't notice anything different on the one that is giving those error messages (on the red light thing in the web interface as well as spamming the console with them). When a drive has 'currently unreadable' sectors, is that a 'replace the drive ASAP' issue, or can the drive just map those as bad and not try to use them in the future?
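
One way to make differences between drives easier to spot is to pull out only the attributes that relate to bad sectors. Below is a minimal Python sketch, assuming the attribute table layout from the smartctl -a output earlier in the thread. The choice of IDs 5, 196, 197 and 198 and the name `sector_attributes` are illustrative assumptions.

```python
# Attribute IDs assumed to be the interesting ones for bad sectors.
WATCHED_IDS = {5, 196, 197, 198}


def sector_attributes(report):
    """Map attribute name -> raw value for the watched IDs in a smartctl -a report."""
    found = {}
    for line in report.splitlines():
        cols = line.split()
        # Attribute rows have 10 columns: a numeric ID, a name, a hex flag, ...
        if len(cols) < 10 or not cols[0].isdigit() or not cols[2].startswith("0x"):
            continue
        if int(cols[0]) in WATCHED_IDS:
            found[cols[1]] = int(cols[9])
    return found


# Rows copied from the /dev/ada3 output posted earlier.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24564
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

for name, raw in sector_attributes(SAMPLE).items():
    print(f"{name:<24} {raw}")
```

Running the same function over the report from each drive and comparing the dicts would highlight the one drive whose Current_Pending_Sector is not zero.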

danb35 · Nov 4, 2018

sirjorj said:
When a drive has 'currently unreadable' sectors, is that a 'replace the drive ASAP' issue

Opinions vary a bit on that. IMO, a middle-aged disk with a small (i.e., single-digit) number of bad sectors that's otherwise asymptomatic can stay in my system, though I'll be keeping an eye on it. If the number starts increasing, I replace the disk. If it fails SMART self-tests, I replace the disk. If I start seeing pool problems, I probably replace the disk. If the disk is very new or very old and starts to show bad sectors, I probably replace the disk.
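
Those rules of thumb could be written down roughly as follows. This is only one reading of the opinion above, sketched in Python. The `DiskState` fields, the hour cut-offs for "very new" and "very old", and the function name `should_replace` are all assumptions made for illustration, not thresholds stated in the thread, and the "probably" cases are simplified to a plain yes.

```python
from dataclasses import dataclass


@dataclass
class DiskState:
    bad_sectors: int            # e.g. the Current_Pending_Sector raw value
    previous_bad_sectors: int   # the same counter at the last check
    failed_self_test: bool
    pool_errors: bool
    power_on_hours: int


# Assumed cut-offs; the post only says "very new" and "very old".
VERY_NEW_HOURS = 1_000
VERY_OLD_HOURS = 50_000


def should_replace(disk):
    """Return True when the rules of thumb above point towards replacing the disk."""
    if disk.bad_sectors > disk.previous_bad_sectors:
        return True   # the number is increasing
    if disk.failed_self_test or disk.pool_errors:
        return True   # failed SMART self-test or pool problems
    if disk.bad_sectors >= 10:
        return True   # no longer a small, single-digit number
    unusual_age = not (VERY_NEW_HOURS <= disk.power_on_hours <= VERY_OLD_HOURS)
    if disk.bad_sectors > 0 and unusual_age:
        return True   # bad sectors on a very new or very old disk
    return False      # keep it in the system, but keep an eye on it


# The drive from this thread: 3 pending sectors at 24564 hours. The previous
# count of 3 is an assumption, since the thread does not say whether it changed.
print(should_replace(DiskState(3, 3, False, False, 24564)))
```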

sirjorj · Nov 7, 2018

An update. After the system ran for a few days without rebooting, I decided to replace the questionable drive. I have never done this on FreeNAS before and had a guess of how it should be done. The reality matched that expectation almost perfectly, so good job on making that intuitive! I then set out to secure wipe the drive before sending it back for warranty replacement, but just trying to use it caused tons of read/write errors, so that drive was worse than I thought. The new drive is now resilvered and all is well. Today I will rerack the server and see if the reboot issues come back.

Thanks for the assistance on this!
