RaidZ2 endless resilvering

Status
Not open for further replies.

gh123man

Cadet
Joined
Apr 20, 2018
Messages
4
Hi all,

I am running FreeNAS-11.1-U1 and have 16 drives (a mix of 1.5 TB and 2.0 TB) in a RAIDZ2 pool.

I recently had what I believe was a faulty SATA cable causing a device to drop offline and come back online over and over. I have replaced the affected disk and the SATA cable, but this has left my pool in an odd state: it seems to resilver forever. Unlike other cases of this that I've read about, though, it has not reported any data loss. It approaches 100% and at some point starts over from 0% again.

Here is my zpool status:

Code:
  pool: Storage
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 20 09:38:15 2018
	2.50T scanned at 2.12G/s, 997G issued at 845M/s, 13.5T total
	60.9G resilvered, 7.22% done, 0 days 04:18:52 to go
config:

	NAME											  STATE	 READ WRITE CKSUM
	Storage										   DEGRADED	 0	 0	 0
	  raidz2-0										DEGRADED	 0	 0	 0
		replacing-0								   DEGRADED	 0	 0	 5
		  9457191272304144993						 UNAVAIL	  0	 0	 0  was /dev/gptid/41aeca34-42b2-11e8-95eb-0025902019d2
		  gptid/c3021635-443b-11e8-aad4-0025902019d2  ONLINE	   0	 0	 0  (resilvering)
		gptid/347e7427-0302-11e8-ba38-0025902019d2	ONLINE	   0	 0	 0
		gptid/cd7f1522-d10e-11e6-8314-d05099ac059a	ONLINE	   0	 0	 0
		gptid/544bb2a5-03a2-11e8-bfe2-0025902019d2	ONLINE	   0	 0	 0
		gptid/b11aaf92-d045-11e6-9e1e-d05099ac059a	ONLINE	   0	 0	 0
		gptid/83187daa-8865-11e7-86cc-90e2ba20b6cd	ONLINE	   0	 0	 0
		gptid/3e513b9e-6a44-11e7-b2c1-d05099ac059a	ONLINE	   0	 0	 0
		gptid/b513319e-d045-11e6-9e1e-d05099ac059a	ONLINE	   0	 0	 0
		gptid/b653f9f5-d045-11e6-9e1e-d05099ac059a	ONLINE	   0	 0	 0  (resilvering)
		gptid/b79e0383-d045-11e6-9e1e-d05099ac059a	ONLINE	   0	 0	 0
		gptid/39750405-b780-11e7-8eb6-0025902019d2	ONLINE	   0	 0	 0
		gptid/b992fb2c-d045-11e6-9e1e-d05099ac059a	ONLINE	   0	 0	 0
		gptid/620055b3-54ed-11e7-a60d-d05099ac059a	ONLINE	   0	 0	 0
		gptid/a86410ae-3b57-11e7-88a5-d05099ac059a	ONLINE	   0	 0	 0
		ada3p2										ONLINE	   0	 0	 0  (resilvering)
		gptid/5d6f5f8b-54eb-11e7-a60d-d05099ac059a	ONLINE	   0	 0	 0
	logs
	  gptid/012aec3f-c7e4-11e7-adb8-0025902019d2	  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:49 with 0 errors on Thu Apr 19 03:47:49 2018
config:

	NAME											STATE	 READ WRITE CKSUM
	freenas-boot									ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/f045845b-c4b5-11e6-a12b-d05099ac059a  ONLINE	   0	 0	 0
		gptid/2068d1a8-c47f-11e6-88d8-d05099ac059a  ONLINE	   0	 0	 0

errors: No known data errors



I wrote a script to monitor the resilver every 10 seconds. Here is what it prints:
Code:
269G resilvered, 31.95% done, 0 days 02:47:08 to go
270G resilvered, 32.01% done, 0 days 02:47:04 to go
270G resilvered, 32.06% done, 0 days 02:47:03 to go
271G resilvered, 32.10% done, 0 days 02:47:02 to go
271G resilvered, 32.16% done, 0 days 02:46:57 to go
271G resilvered, 32.22% done, 0 days 02:46:53 to go
272G resilvered, 32.28% done, 0 days 02:46:46 to go
272G resilvered, 32.33% done, 0 days 02:46:44 to go
273G resilvered, 32.39% done, 0 days 02:46:36 to go
273G resilvered, 32.43% done, 0 days 02:46:38 to go
273G resilvered, 32.43% done, 0 days 02:46:59 to go
273G resilvered, 32.46% done, 0 days 02:47:08 to go
0 resilvered, 0.00% done, 4 days 16:12:55 to go
0 resilvered, 0.00% done, 4 days 23:14:09 to go
491M resilvered, 0.06% done, 0 days 12:34:46 to go
1019M resilvered, 0.12% done, 0 days 08:34:00 to go
1.55G resilvered, 0.19% done, 0 days 07:01:33 to go
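
For reference, a minimal sketch of such a 10-second monitor (not necessarily the exact script used; the pool name is passed as an argument, and the parsing is factored out so it works without ZFS present):

```shell
#!/bin/sh
# Hypothetical reconstruction of the resilver monitor described above.
# Polls `zpool status` every 10 seconds and prints just the progress line.

# Extract the "... resilvered, ... done, ... to go" line from
# `zpool status` output read on stdin, stripping leading whitespace.
progress_line() {
    grep 'resilvered,' | sed 's/^[[:space:]]*//'
}

# Only loop when a pool name is given and zpool exists on this system.
if [ -n "$1" ] && command -v zpool >/dev/null 2>&1; then
    while true; do
        zpool status "$1" | progress_line
        sleep 10
    done
fi
```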

Any suggestions would be greatly appreciated. I recognize I may be in a situation where I will experience data loss, since three drives are reporting issues and RAIDZ2 can only tolerate losing two. I would like to address any data issues after I get the pool back into a healthy state.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
I don't have any suggestions, other than have good backups.

On the issue of 3 drives having problems: ZFS handles disk block failures differently than other RAID schemes. In theory you could have bad blocks on all 16 drives, as long as each data stripe still has enough data or parity to rebuild it.

However, 16 drives in a single vDev is not recommended. Partly it's the rebuild speed, since all the data is striped across all disks. Partly it's the reduced reliability of having lots of data disks (14) versus very few parity disks (2). A reasonable limit for a RAID-Z2 or -Z3 vDev is 10 or 12 disks. I'd only go to 12 if that was the maximum the chassis supported and I needed that much storage.
 

gh123man

Cadet
Joined
Apr 20, 2018
Messages
4
Yeah, I understand that I put too many drives into this vDev - newbie mistake. But that is an issue I can't exactly address right now.
Are there any logs detailing the resilvering process and why it may have restarted?

The other two drives reporting that they are resilvering never reported any issues, nor did they go offline. They just started showing that status when the one flaky drive was trying to rebuild. What is also interesting is that those two drives are not reporting any write activity - though I don't know if the resilvering process is sequential or parallel. So I am not convinced I have lost data yet; I'm just trying to diagnose and resolve this bad state it seems to be in.
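
One way to check whether those members are actually being written to is to watch per-vdev bandwidth with `zpool iostat -v`, filtered to just the disks in question. A sketch (the gptid fragments come from the status output earlier in the thread; adjust for your pool):

```shell
#!/bin/sh
# Watch write activity on just the "(resilvering)" members of the pool.
# Guarded so it is a no-op on a system without ZFS.

# Keep only iostat lines for the named members (reads stdin). The
# identifiers are from the zpool status output above.
resilvering_members() {
    grep -E 'c3021635|b653f9f5|ada3p2'
}

# Six 10-second samples of per-vdev bandwidth.
if command -v zpool >/dev/null 2>&1; then
    zpool iostat -v Storage 10 6 | resilvering_members
fi
```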
 

Agi

Dabbler
Joined
Feb 26, 2016
Messages
14
What hardware, and how long is "endless/forever"?
Days is 'normal' for a RAIDZ rebuild, depending on hardware, pool fill, and width.
 

gh123man

Cadet
Joined
Apr 20, 2018
Messages
4
Endless as in it keeps starting over at 0% (probably 4 or 5 times now); it resets around 32%.

Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
24529MB ECC
Supermicro board (need to look up the exact model)

All but 4 drives are connected directly to the motherboard; the other 4 are connected via a PCI SATA controller that exposes the disks directly.

I guess a better question is: is it normal for a resilver to restart itself?
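
The only resilver logging I know of is the pool's internal history, which records internally logged events such as scan starts. A sketch of pulling the recent scan-related entries (guarded so it's a no-op without ZFS; output format varies by release):

```shell
#!/bin/sh
# Dump recent scan/scrub/resilver entries from the pool's internal
# history log for the pool named "Storage".

# Keep only scan-related history lines (reads stdin).
scan_entries() {
    grep -Ei 'scan|scrub|resilver'
}

if command -v zpool >/dev/null 2>&1; then
    # -i includes internally logged events alongside user commands.
    zpool history -i Storage | scan_entries | tail -n 20
fi
```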
 

Agi

Dabbler
Joined
Feb 26, 2016
Messages
14
I understand now. Perhaps the underlying issue with the SATA cable isn't resolved? Double- and triple-check the connections and see if that helps.
 

gh123man

Cadet
Joined
Apr 20, 2018
Messages
4
I went ahead and replaced all 6 SATA cables in that part of the NAS. And one of the resilvering drives is not in that block of cables that I replaced, so I am thinking something else must be amiss. I wonder if the PCI SATA card is bad or acting up. I have a spare, so I may try swapping that out.
 

Agi

Dabbler
Joined
Feb 26, 2016
Messages
14
The problem is that you don't know what is causing the resilver to restart, so it could just as easily be one of the replacement cables as an existing one.
Swapping in your spare card is a good idea. Unfortunately, it's going to be a laborious task of working through it all methodically.
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
First, back up all the data if you can; it is in extreme danger of being lost (three drives with problems against two-drive redundancy never ends well).

What do the SMART results say about your drives?
Have you been periodically running long SMART tests?
Are any errors showing?
How many hours are on the drives?
Are your drives NAS drives?
Have you been periodically running scrubs?
Did you burn in the drives?
Does your SATA card use port multipliers?
What is your SATA card's make/model and chipset?
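
For the SMART questions above, the usual commands look roughly like this (/dev/ada0 is a placeholder device name; repeat per drive):

```shell
#!/bin/sh
# Example smartctl invocations for the checklist above. /dev/ada0 is a
# placeholder device. Guarded no-op where smartctl is not installed.

# Pull hours-on and reallocated-sector counts out of `smartctl -A`/-a output.
smart_summary() {
    grep -E 'Power_On_Hours|Reallocated_Sector_Ct'
}

if command -v smartctl >/dev/null 2>&1; then
    smartctl -a /dev/ada0 | smart_summary    # key wear/error attributes
    # smartctl -t long /dev/ada0             # start a long self-test
fi
```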

If the drives and controllers are good, @Agi is right: this could take a long time to fix. Replace one component at a time with known-good parts until your server is fixed.

Lastly, with 16 drives, use two vdevs when rebuilding the volume!

Good Luck and Have Fun
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
When I had a similar problem, it turned out to be a bad SATA connection (a cable wasn't plugged in securely) and also a bad power connector on another drive. I had used some Molex-to-SATA power adapters, and one of them was bad. It turns out that bad Molex-to-SATA power connectors are fairly common.

I replaced all the power adapters and my problem went away.
 

PhilipS

Contributor
Joined
May 10, 2016
Messages
179
Just to add another wrench to the mix: I've had a bad SATA port connector on the motherboard itself - it didn't matter which cable I used. I kept having intermittent failures and tracked it down to the port by jiggling the cables while writing to the drive.
 