Is it safe to reboot during a resilver?

Alan W. Smtih · Jun 22, 2017

I just replaced a hard drive. It's in the process of resilvering with an ETA of 40+ hours.

My question: Is ZFS/FreeNAS designed to safely accommodate shutdown/rebooting during a resilver?

---

Footnotes

- Obviously, it would be better to let the process finish. For times when that would be problematic, I'd like to know if the system is designed to gracefully accommodate a restart during a resilver.

- My reason for asking is based on an issue I'm having with drives becoming "REMOVED" over time. I'm troubleshooting that here, but was thinking if I could reboot the machine every few hours it might limit/stop the drives disappearing until I find the permanent fix.

- I'm making this its own thread since it seems like a good question to know the answer to in general.

- I've rebooted a couple times (unintentionally at first) already during the resilver because of the "REMOVED" problem. It seems like the resilver picked up where it left off and things are working. Part of what I'm after is figuring out if I've just been lucky or if ZFS/FreeNAS is specifically designed to accommodate mid-resilver shutdowns/reboots.

danb35 · Jun 22, 2017

I don't think that FreeNAS as such is designed for this, but I believe that ZFS itself is.

m0nkey_ · Jun 22, 2017

In an ideal world, you should leave it to finish the resilver. However, if you're satisfied that you have a recent working backup (in case things go wrong), then go ahead and reboot.

rs225 · Jun 23, 2017

Answer is yes. It may be set back slightly by a reboot, but it doesn't start over and it can handle it. Even a crash.

Alan W. Smtih · Jun 23, 2017

rs225 said:
Answer is yes. It may be set back slightly by a reboot, but it doesn't start over and it can handle it. Even a crash.

Good stuff. I expected that would be the case, but it's nice to hear confirmation.

(I'm still troubleshooting the "REMOVED" issue. It seems to take some time to manifest. I'm going to try rebooting every few hours to see if that prevents it as a temporary measure while I move files off that I don't want to have to re-rip.)

Ericloewe · Jun 24, 2017

In the near future, optimizations to resilver/scrub will mean that up to a few minutes of scrub or resilver progress will be lost when the system is rebooted.

CraigD · Jun 24, 2017

Alan W. Smtih said:
It's in the process of resilvering with an ETA of 40+ hours.

How wide is your vdev?

Alan W. Smtih · Jun 26, 2017

I'm happy to report that rebooting during the resilver appears to have worked as expected (i.e. nothing blew up).

CraigD said:
How wide is your vdev?

I've got eleven 4TB drives in RAIDZ3.

rs225 said:
Answer is yes. It may be set back slightly by a reboot, but it doesn't start over and it can handle it. Even a crash.

Thanks for the confirmation. It calmed my nerves during the process. (I could have gotten most of the data back, but it would have been a huge pain.)

For posterity's sake, here's some snapshots of what I saw in case it's useful to someone else:

-- June 21, 2017 --

- This is when the drives first went haywire.

Code:

  pool: z
 state: DEGRADED
status: One or more devices has been removed by the administrator.
  Sufficient replicas exist for the pool to continue functioning in a
  degraded state.
action: Online the device using 'zpool online' or replace the device with
  'zpool replace'.
  scan: scrub repaired 0 in 49h40m with 0 errors on Tue Jun 20 01:40:44 2017
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               DEGRADED     0     0     0
    raidz3-0                                      DEGRADED     0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      8431330591788471819                         REMOVED      0     0     0  was /dev/gptid/c701c120-4668-11e4-b49f-d05099264f68
      17501151778533256183                        REMOVED      0     0     0  was /dev/gptid/c779c670-4668-11e4-b49f-d05099264f68
      333950894800270571                          REMOVED      0     0     0  was /dev/gptid/c7efacfe-4668-11e4-b49f-d05099264f68
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
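
For anyone who wants to catch this state from a script instead of by eye, here is a minimal sketch that lists the devices reported as REMOVED. It assumes the column layout of the `zpool status` output above (NAME, STATE, READ, WRITE, CKSUM); `removed_devices` is a hypothetical helper name, not part of ZFS or FreeNAS, and the sample text is trimmed from the output above.

```python
def removed_devices(status_text):
    """Return the names of devices whose STATE column reads REMOVED."""
    removed = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
        if len(fields) >= 5 and fields[1] == "REMOVED":
            removed.append(fields[0])
    return removed


# Trimmed copy of the status output above, used as sample input.
sample = """
  NAME                                            STATE     READ WRITE CKSUM
  z                                               DEGRADED     0     0     0
    raidz3-0                                      DEGRADED     0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      8431330591788471819                         REMOVED      0     0     0  was /dev/gptid/c701c120-4668-11e4-b49f-d05099264f68
      17501151778533256183                        REMOVED      0     0     0  was /dev/gptid/c779c670-4668-11e4-b49f-d05099264f68
"""

print(removed_devices(sample))
```

On a live system the input would come from running `zpool status` and capturing its text; the parsing is whitespace-based, so it does not depend on exact column alignment.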

-- June 21, 2017 (an hour or so later) --

- I thought "REMOVED" meant the drives were dead. So, I replaced one.

- I brought the machine back up and did the drive replacement via the WebGUI.

- Checking the status, I discovered that the other two drives reported "ONLINE".

Code:

  pool: z
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
		90.9G scanned out of 34.3T at 171M/s, 58h12m to go
		7.95G resilvered, 0.26% done
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               ONLINE       0     0     0
    raidz3-0                                      ONLINE       0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     3
      gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     3
      gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     0  (resilvering)
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors

-- June 22, 2017 --

- I left the machine running overnight.

- When I checked it in the morning, the drives had gone REMOVED again.

Code:

  pool: z
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
		10.9T scanned out of 34.3T at 124M/s, 54h57m to go
		824G resilvered, 31.72% done
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               DEGRADED     0     0     0
    raidz3-0                                      DEGRADED     0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      8431330591788471819                         REMOVED      0     0     0  was /dev/gptid/c701c120-4668-11e4-b49f-d05099264f68
      17501151778533256183                        REMOVED      0     0     0  was /dev/gptid/c779c670-4668-11e4-b49f-d05099264f68
      11668331602481738125                        REMOVED      0     0     0  was /dev/gptid/0d977d05-569b-11e7-8777-d05099264f68  (resilvering)
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors

-- June 22, 2017 --

- After a reboot, the drives showed ONLINE again and the resilver picked back up.

- I'm not sure what to make of the "CKSUM" going to zero for the other two drives. (Not the highest priority at the time so I didn't bother to investigate.)

Code:

freenas% zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Sat May 27 03:46:40 2017
config:

  NAME                                            STATE     READ WRITE CKSUM
  freenas-boot                                    ONLINE       0     0     0
    gptid/07bec8cd-0ab3-11e7-9ecb-d05099264f68    ONLINE       0     0     0

errors: No known data errors

  pool: z
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
		11.3T scanned out of 34.3T at 178M/s, 37h45m to go
		857G resilvered, 32.92% done
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               ONLINE       0     0     0
    raidz3-0                                      ONLINE       0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     0  (resilvering)
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
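
The scan lines in the two June 22 snapshots are enough to check that the resilver resumed instead of starting over. Here is a minimal sketch that pulls the progress figures out of the status text. It assumes the wording printed by this FreeNAS version ("scanned out of", "resilvered, N% done"); other ZFS releases may word these lines differently. `parse_resilver_progress` is a hypothetical helper name, not part of any ZFS tool.

```python
import re

# Patterns for the progress lines as printed in the status output above, e.g.
#   "10.9T scanned out of 34.3T at 124M/s, 54h57m to go"
#   "824G resilvered, 31.72% done"
SCANNED_RE = re.compile(r"([\d.]+[KMGTP]) scanned out of ([\d.]+[KMGTP])")
DONE_RE = re.compile(r"([\d.]+[KMGTP]) resilvered, ([\d.]+)% done")


def parse_resilver_progress(status_text):
    """Return (scanned, total, resilvered, percent_done), or None if no
    in-progress resilver lines are found in the text."""
    scanned = SCANNED_RE.search(status_text)
    done = DONE_RE.search(status_text)
    if not (scanned and done):
        return None
    return scanned.group(1), scanned.group(2), done.group(1), float(done.group(2))


# Scan sections copied from the two June 22 snapshots above.
before_reboot = """
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
        10.9T scanned out of 34.3T at 124M/s, 54h57m to go
        824G resilvered, 31.72% done
"""
after_reboot = """
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
        11.3T scanned out of 34.3T at 178M/s, 37h45m to go
        857G resilvered, 32.92% done
"""

pct_before = parse_resilver_progress(before_reboot)[3]
pct_after = parse_resilver_progress(after_reboot)[3]
print("resumed past the earlier point:", pct_after > pct_before)
```

Both snapshots also report the same "in progress since" timestamp, which is consistent with one resilver continuing across the reboot.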

-- June 25, 2017 --

- I probably rebooted 7-8 times during the resilver.

- The drives went "REMOVED" during 3-4 of those.

- The first few times, it seemed to take ~10 hours before the drive problem showed up, but one time it was only three hours. (I was rsyncing stuff off the box throughout.)

- I'd let it run for several hours, then reboot before the drives had a chance to go REMOVED again. (That seemed less precarious than having three drives, the full RAIDZ3 parity margin, out of the mix.) I caught it most of the time, except when they went REMOVED after just a few hours.

- I also shut it down overnight to lower the risk of something going wrong when I wasn't keeping as close an eye on it.

- The resilver finished up and seems to have done its job.

Code:

  pool: z
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
  attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
  using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 2.45T in 104h26m with 0 errors on Sun Jun 25 20:29:44 2017
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               ONLINE       0     0     0
    raidz3-0                                      ONLINE       0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     2
      gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     1
      gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     1
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors

-- Footnotes --

- Two drives had thrown S.M.A.R.T. warnings last week. So, when I saw the "REMOVED", I thought they bit the dust and took another one with them. I didn't think to reboot the machine to see what effect that had.

- If I could go back, I'd start with the reboot. Evidence suggests that would have brought the drives back ONLINE. From there, I'd troubleshoot the main issue and wait to do the drive replacements once that was fixed.

- I still need to fix the underlying issue. I didn't want to be messing with that while the machine was resilvering (i.e. the "only change one thing at a time when troubleshooting" idea). If you want to play along at home, here's the original issue thread.

CraigD · Jun 26, 2017

I can't see your system specs, so I have to ask:

Are you using any USB drives in your pool?

Have Fun

Alan W. Smtih · Jun 27, 2017

CraigD said:
Are you using any USB drives in your pool?

Good question, but nope. They're all SATA drives connected directly to the motherboard.

(It's an Asrock C2550D4I which I picked for the twelve onboard ports. This is also a good reminder that I need to add specs to my signature when I get home and am not brain-fried.)

CraigD · Jun 27, 2017

Alan W. Smtih said:
It's an Asrock C2550D4I which I picked for the twelve onboard ports.

About a year ago the hardware gurus here told me to avoid this board at all costs!

I hope the SATA problem is the only fault you have

Good Luck

danb35 · Jun 27, 2017

CraigD said:
About a year ago the hardware gurus here told me to avoid this board at all costs!

Interesting, as it's the board iX uses in the FreeNAS Mini.

eldo · Jun 27, 2017

danb35 said:
Interesting, as it's the board iX uses in the FreeNAS Mini.

CraigD: There have been a few issues with this board (I use it), including the BMC failure issue and the Intel Atom kerfuffle, which resulted in many dead systems and an iX extended warranty.

Alan W. Smtih said:
Good question, but nope. They're all SATA drives connected directly to the motherboard.

(It's an Asrock C2550D4I which I picked for the twelve onboard ports. This is also a good reminder that I need to add specs to my signature when I get home and am not brain-fried.)

Alan,
when I was planning my FreeNAS install, I chose the same board as you have.

Remember, only 6 of the drives are using the Intel SATA controllers. The other 6 are on the Marvell controllers, and Marvell, at least during the 9.3 days, wasn't recommended because it was much flakier than many people found acceptable.

Board Specs
Storage
SATA Controller - Intel® C2750 : 2 x SATA3 6.0 Gb/s, 4 x SATA2 3.0 Gb/s
Additional Storage Controller - Marvell SE9172: 2 x SATA3 6.0 Gb/s, support RAID 0, 1
- Marvell SE9230: 4 x SATA3 6.0 Gb/s, support RAID 0, 1, 10

Alan W. Smtih · Jul 4, 2017

CraigD said:
About a year ago the hardware gurus here told me to avoid this board at all costs!

Yeah, there have definitely been issues with the board in recent history.

I built mine in 2014. It seemed like a good choice at the time.

(I wanted 11 drives and liked the idea of having all the ports directly on the motherboard to eliminate the complexity of additional parts.)

We didn't know about the BMC/watchdog issue yet. (Thankfully, I was lucky enough to hear about it before it ate my board.)

The USB drive with FreeNAS on it went out a few months back. I didn't have a backup copy of the config or have it mirrored (which was dumb and I know better, but I did that thing where I "put it off until later" where "later" == "indefinitely").

It certainly seems like my issues (which further digging suggests are related to ahcich2 Timeout issues) started to manifest at that point.

There's no way to tell, but it definitely seems like if I'd had the USB drive mirrored (as per basic best practice) and never changed anything, it would still be humming along nicely.

This is not to excuse the current issue (or my lack of following best practices), I just want to give the board due credit for years of stability.

eldo said:
Remember, only 6 of the drives are using the Intel SATA controllers. The other 6 are on the Marvell controllers, and Marvell, at least during the 9.3 days, wasn't recommended because it was much flakier than many people found acceptable.

I'm seeing a lot of references to that in my troubleshooting.

While I'm still working on a solution for the problem, I've also started thinking about adding hardware to move off the Marvell ports. (Running off the assumption that that's doable...)

eldo · Jul 4, 2017

Alan W. Smtih said:
This is not to excuse the current issue (or my lack of following best practices), I just want to give the board due credit for years of stability.

Agreed. I've had mine for a few years without issue (yet), except that my IPMI has gone completely out.

Alan W. Smtih said:
I'm seeing a lot of references to that in my troubleshooting.

While I'm still working on a solution for the problem, I've also started thinking about adding hardware to move off the Marvell ports. (Running off the assumption that that's doable...)

I think what you'll be looking for is an HBA that allows pass-through (IT mode), or a PCIe Intel SATA controller.

I think I've seen some models of LSI HBAs mentioned on the forums with good results, but they require flashing the firmware.


Alan W. Smtih · Jul 4, 2017

eldo said:
Agreed. I've had mine for a few years without issue (yet), except that my IPMI has gone completely out.

Some Java security updates broke IPMI for lots of folks (i.e. it's insecure based on the latest Java updates, but there's a workaround.)

Check out these two threads for potential solutions if that's what's causing you problems:

- Thread: PSA: Java 8 Update 131 breaks ASRock's IPMI Virtual console - and the comment where screamer shows the fix (I put some more notes below showing the before and after)

- How to use ipmitool SOL with Asrock motherboards (alternative to JViewer)

eldo · Jul 4, 2017

Alan W. Smtih said:
Some Java security updates broke IPMI for lots of folks (i.e. it's insecure based on the latest Java updates, but there's a workaround.)

Check out these two threads for potential solutions if that's what's causing you problems:

- Thread: PSA: Java 8 Update 131 breaks ASRock's IPMI Virtual console - and the comment where screamer shows the fix (I put some more notes below showing the before and after)

- How to use ipmitool SOL with Asrock motherboards (alternative to JViewer)

When I go into the BIOS, I don't have a MAC address for the IPMI interface, and the IPMI static address doesn't even ping.

In order to even flash a new IPMI firmware I have to boot into the IPMI first... :-/

I should probably get hold of ASRock unless there's some magic to revive it.


Alan W. Smtih · Jul 4, 2017

Might be worth checking to see if the IP address changed for some reason.

I've used Fing for that in the past and it worked great.

Other than that, yeah, sounds like a support request is in order.

eldo · Jul 4, 2017

Alan W. Smtih said:
Might be worth checking to see if the IP address changed for some reason.

I've used Fing for that in the past and it worked great.

Other than that, yeah, sounds like a support request is in order.

I've finged and nmapped.
Only thing I didn't do that might work is throw Wireshark on it.

Good idea though.


Alan W. Smtih · Jul 30, 2017

Following up on the original question about the safety of rebooting during a resilver:

I just finished resilvering a second drive. I let the resilver go for a couple hours then shut all the way down for a few minutes before starting up and letting it go some more.

Probably did that cycle 40+ times with no apparent problems.

So, to confirm the consensus, it certainly appears safe under normal circumstances.

Of course, it takes way longer :)

Code:

  pool: z
 state: ONLINE
  scan: resilvered 3.11T in 264h14m with 0 errors on Sun Jul 30 20:06:20 2017
config:

  NAME                                            STATE     READ WRITE CKSUM
  z                                               ONLINE       0     0     0
    raidz3-0                                      ONLINE       0     0     0
      gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/25a33983-6cdd-11e7-9c11-d05099264f68  ONLINE       0     0     0
      gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     0
      gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
      gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
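
To put a rough number on "way longer", the two completion lines can be turned into average rates. This is only a sketch under stated assumptions: that "T" in the output means TiB, and that the reported duration is wall-clock time that includes the powered-off periods (the June figure of 104h26m does match the gap between that resilver's start and finish timestamps). The function names `hours` and `avg_mib_per_s` are made up for this example.

```python
import re


def hours(duration):
    """Convert a zpool-style duration such as '104h26m' to hours."""
    h, m = re.fullmatch(r"(\d+)h(\d+)m", duration).groups()
    return int(h) + int(m) / 60


def avg_mib_per_s(tib_resilvered, duration):
    """Average rate of data written to the new disk over the whole period.

    Assumes 'T' means TiB; downtime counts against the average.
    """
    mib = tib_resilvered * 1024 * 1024
    return mib / (hours(duration) * 3600)


first = avg_mib_per_s(2.45, "104h26m")   # June resilver, 7-8 reboots
second = avg_mib_per_s(3.11, "264h14m")  # July resilver, 40+ shutdown cycles
print(f"first: {first:.1f} MiB/s, second: {second:.1f} MiB/s")
```

Because the downtime is counted, these are averages over the whole elapsed period, not a measure of how fast the disks ran while the machine was up.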
