I'm happy to report that rebooting during the resilver appears to have worked as expected (i.e. nothing blew up).
I've got eleven 4TB drives in RAIDZ3.
Answer is yes. It may be set back slightly by a reboot, but it doesn't start over and it can handle it. Even a crash.
Thanks for the confirmation. It calmed my nerves during the process. (I could have gotten most of the data back, but it would have been a huge pain.)
For posterity's sake, here are some snapshots of what I saw, in case they're useful to someone else:
-- June 21, 2017 --
- This is when the drives first went haywire.
Code:
  pool: z
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 49h40m with 0 errors on Tue Jun 20 01:40:44 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        z                                               DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            8431330591788471819                         REMOVED      0     0     0  was /dev/gptid/c701c120-4668-11e4-b49f-d05099264f68
            17501151778533256183                        REMOVED      0     0     0  was /dev/gptid/c779c670-4668-11e4-b49f-d05099264f68
            333950894800270571                          REMOVED      0     0     0  was /dev/gptid/c7efacfe-4668-11e4-b49f-d05099264f68
            gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
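- (Aside: the 'zpool online' route suggested by the action text would look something like the following, with the drive picked out by the GUID from the listing above. I went the reboot/replace route instead, so treat this as an untested sketch.)
Code:
# Try to reattach a REMOVED drive, identified by the GUID that
# 'zpool status' prints for it.
zpool online z 8431330591788471819

# Then confirm it came back and watch for a resilver to start.
zpool status z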
-- June 21, 2017 (an hour or so later) --
- I thought "REMOVED" meant the drives were dead. So, I replaced one.
- I brought the machine back up and did the drive replacement via the WebGUI.
- Checking the status afterward, I discovered that the other two REMOVED drives now reported "ONLINE".
Code:
  pool: z
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
        90.9G scanned out of 34.3T at 171M/s, 58h12m to go
        7.95G resilvered, 0.26% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        z                                               ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     3
            gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     3
            gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     0  (resilvering)
            gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
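- (Aside: the WebGUI replacement corresponds roughly to a 'zpool replace' of the old drive's GUID with the new disk. On FreeNAS the GUI is the right way to do it, since it also partitions the disk and sets up the gptid; the device name below is made up.)
Code:
# Rough CLI equivalent of the WebGUI "Replace" action: swap the drive
# that was REMOVED (by its GUID) for the new disk.
# /dev/ada5 is a hypothetical device name.
zpool replace z 333950894800270571 /dev/ada5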
-- June 22, 2017 --
- I left the machine running overnight.
- When I checked it in the morning, the drives had gone REMOVED again.
Code:
  pool: z
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
        10.9T scanned out of 34.3T at 124M/s, 54h57m to go
        824G resilvered, 31.72% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        z                                               DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            8431330591788471819                         REMOVED      0     0     0  was /dev/gptid/c701c120-4668-11e4-b49f-d05099264f68
            17501151778533256183                        REMOVED      0     0     0  was /dev/gptid/c779c670-4668-11e4-b49f-d05099264f68
            11668331602481738125                        REMOVED      0     0     0  was /dev/gptid/0d977d05-569b-11e7-8777-d05099264f68  (resilvering)
            gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
-- June 22, 2017 --
- After a reboot, the drives showed ONLINE again and the resilver picked back up. (Note that the scan line still shows the same start time, Wed Jun 21 12:02:55 2017, and the percent done kept climbing, so it resumed rather than starting over.)
- I'm not sure what to make of the "CKSUM" counts going back to zero for the other two drives. (My best guess is that the per-device error counters are kept in memory and reset when the pool is reopened at boot. It wasn't the highest priority at the time, so I didn't investigate.)
Code:
freenas% zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Sat May 27 03:46:40 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/07bec8cd-0ab3-11e7-9ecb-d05099264f68  ONLINE       0     0     0

errors: No known data errors

  pool: z
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun 21 12:02:55 2017
        11.3T scanned out of 34.3T at 178M/s, 37h45m to go
        857G resilvered, 32.92% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        z                                               ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     0  (resilvering)
            gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
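- (Aside: a crude way to keep an eye on things between reboots, instead of running 'zpool status' by hand. Untested sketch; the 10-minute interval is arbitrary.)
Code:
# Print the pool state and resilver progress every 10 minutes.
while true; do
    date
    zpool status z | grep -E 'state:|scanned|resilvered|REMOVED'
    sleep 600
done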
-- June 25, 2017 --
- I probably rebooted 7-8 times during the resilver.
- The drives went "REMOVED" on 3-4 of those occasions.
- The first few times, it took ~10 hours before the drive problem showed up, but one time it was only three hours. (I was rsyncing stuff off the box throughout.)
- I'd let the resilver run for several hours, then reboot preemptively before the drives went REMOVED again. (That seemed less precarious than having all three parity drives out of the mix.) I caught it most of the time, except when they went REMOVED after just a few hours.
- I also shut it down overnight to lower the risk of something going wrong while I wasn't keeping a close eye on it.
- The resilver finished up and seems to have done its job.
Code:
  pool: z
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 2.45T in 104h26m with 0 errors on Sun Jun 25 20:29:44 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        z                                               ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/c6171db6-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c68a4186-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c701c120-4668-11e4-b49f-d05099264f68  ONLINE       0     0     2
            gptid/c779c670-4668-11e4-b49f-d05099264f68  ONLINE       0     0     1
            gptid/0d977d05-569b-11e7-8777-d05099264f68  ONLINE       0     0     1
            gptid/c8688176-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c8e2f1e9-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/c9cc35d5-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/ca449ed0-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cabb7d11-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0
            gptid/cb39f216-4668-11e4-b49f-d05099264f68  ONLINE       0     0     0

errors: No known data errors
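- (Aside: the leftover CKSUM counts are what the 'zpool clear' in the action text above is for. Once the drives check out, clearing them is a one-liner; it can also be scoped to a single device.)
Code:
# Reset the error counters for the whole pool.
zpool clear z

# Or clear just one device, using the gptid from the status output.
zpool clear z gptid/c701c120-4668-11e4-b49f-d05099264f68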
-- Footnotes --
- Two drives had thrown S.M.A.R.T. warnings last week. So, when I saw the "REMOVED", I assumed they'd bit the dust and taken another drive with them. It didn't occur to me to reboot the machine to see what effect that had. (There's a smartctl sketch at the end of this post for checking drives after warnings like these.)
- If I could go back, I'd start with the reboot. The evidence suggests that would have brought the drives back ONLINE. From there, I'd troubleshoot the main issue and hold off on the drive replacement until that was fixed.
- I still need to fix the underlying issue. I didn't want to mess with that while the machine was resilvering (i.e. the old "only change one thing at a time when troubleshooting" idea). If you want to play along at home, here's the original issue thread.
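- For anyone in the same boat: checking a drive after a S.M.A.R.T. warning looks something like the following (the device name is hypothetical; adjust for your controller and disks).
Code:
# Dump SMART health, attributes, and the error log for one drive.
smartctl -a /dev/ada2

# Kick off a long self-test; results show up later via 'smartctl -l selftest'.
smartctl -t long /dev/ada2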