Hello,
Forgive the long post, but I'll give a complete synopsis of what's going on, with all the data I have, to hopefully provide enough info for some insight into things...
I've been running a 9.3.x system (FreeNAS-9.3-STABLE-201605170422) for years with no issues. The pool was getting quite full (~92%), and drives have about 50k hours on them, so I elected to replace them. I don't have a spare port in this system (see specs below), and thus, I have been following the process to replace one drive at a time, to eventually get an expanded pool when I finish the replacement.
I've had an issue that's making me a bit nervous. I have all vital data backed up, so losing the pool would be an inconvenience, not a disaster, but I'd really prefer to not have to restore everything from backup.
The pool is a 6 disk raidz2 made of 2TB WD Reds, being replaced with 4TB WD Reds. Each replacement drive has been burned in with SMART short/conveyance/long tests, followed by a 2 pass (sequential then random) write/read surface scan using Hard Disk Sentinel on a Win7 box, then another SMART long test.
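For reference, the burn-in sequence per drive was roughly the following (device name is just an example; the surface scan itself was done in Hard Disk Sentinel on the Win7 box, not from the FreeNAS shell):

```shell
# SMART short, conveyance, and long self-tests on a new drive
# (example device: ada6 -- substitute the actual device)
smartctl -t short /dev/ada6       # ~2 minutes
smartctl -t conveyance /dev/ada6  # ~5 minutes
smartctl -t long /dev/ada6        # several hours on a 4TB drive

# check attributes and self-test log after each test completes
smartctl -a /dev/ada6

# ...then the 2-pass write/read surface scan in Hard Disk Sentinel (Win7),
# followed by one more long self-test and a final 'smartctl -a' check
```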
First disk replacement went fine (offlined in GUI, shutdown, remove/replace, boot, replace device, resilver).
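In case it helps anyone following along, the per-disk procedure I used is the GUI equivalent of roughly this (the gptid shown is a placeholder, not one of my actual labels):

```shell
# take the outgoing disk out of the pool (done via the FreeNAS GUI in my case)
zpool offline bluemesa gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# shut down, physically swap the 2TB drive for the 4TB drive, boot back up

# then use the GUI's "Replace" on the offlined member -- under the hood this
# partitions the new disk and runs a 'zpool replace' onto the new partition,
# which kicks off the resilver
zpool status -v bluemesa   # watch resilver progress
```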
Second disk replacement went fine, except that somewhere near the end of the resilver, both new drives threw a checksum error (a single one each). These first two replacements are on a Highpoint controller, so I figured hmm, maybe something got screwy with the cabling, or the controller had a hiccup, or something.
I scrubbed the pool, which had no issues, then cleared the errors ('zpool clear').
Post-scrub, pre-clear zpool status:
Code:
[root@freenas] /mnt/bluemesa/media# zpool status -v
  pool: bluemesa
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 35h49m with 0 errors on Sat Jan  5 05:31:19 2019
config:

	NAME                                            STATE     READ WRITE CKSUM
	bluemesa                                        ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/5dce97b0-0d3b-11e9-b5ad-7054d21693cc  ONLINE       0     0     1
	    gptid/541ad802-0ec0-11e9-b79b-7054d21693cc  ONLINE       0     0     1
	    gptid/ba00b169-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
	    gptid/ba756e6a-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
	    gptid/bae4d348-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
	    gptid/55892d30-c89c-11e2-9452-7054d21693cc  ONLINE       0     0     0

errors: No known data errors
On the 3rd disk (now replacing a disk on the motherboard ports, ada2), something went a bit wonkier. The resilver itself was fine, but at the end of it, the array showed a small number of checksum errors at the pool level, with one corrupted file. I didn't save the console output for this one, but in the next spoiler tags you'll see the current status of the pool, which includes the checksum errors and the damaged file.
As the file was in a snapshot, I decided I didn't care about that and would delete it later... I think this was potentially a mistake. I also didn't rescrub the pool: since ZFS had already decided the damage was done, and no other checksum errors were showing at the device level, I figured why bother (maybe another mistake...?).
So, I pressed forward, and offlined another disk (ada3, 4th replacement). The disk went offline, and at the same time, the pool started resilvering again immediately (I had not removed the drive yet, or powered down the server). All of a sudden, my pool had no redundancy (gulp).....
This is where I stand now (you can see the resilver in progress, with one drive offline...it's resilvering to ada2, the 3rd disk I'd replaced, on a motherboard port):
Code:
[root@freenas] ~# zpool status -v
  pool: bluemesa
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jan  9 22:53:22 2019
        6.37T scanned out of 10.2T at 83.4M/s, 13h26m to go
        1.06T resilvered, 62.33% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	bluemesa                                        DEGRADED     0     0     2
	  raidz2-0                                      DEGRADED     0     0     4
	    gptid/5dce97b0-0d3b-11e9-b5ad-7054d21693cc  ONLINE       0     0     0
	    gptid/541ad802-0ec0-11e9-b79b-7054d21693cc  ONLINE       0     0     0
	    gptid/ce054d2f-1118-11e9-9c8a-7054d21693cc  ONLINE       0     0     0  (resilvering)
	    6099859449148961522                         OFFLINE      0     0     0  was /dev/gptid/ba756e6a-ae44-11e2-a357-7054d21693cc
	    gptid/bae4d348-ae44-11e2-a357-7054d21693cc  ONLINE       0     0     0
	    gptid/55892d30-c89c-11e2-9452-7054d21693cc  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        bluemesa/media@auto-20190105.0000-2w:/Video/Movies/Finding Nemo (2003)/Finding Nemo (2003).mkv
So, not sure what happened here. It feels like either the pool was unhappy due to the preexisting error in the snapshotted file and/or the checksum errors at the pool level, or offlining ada3 somehow momentarily bounced ada2. Either way, it feels like I'm toying with pool loss, as there is no redundancy until this resilver completes. Any ideas? Did I make an error by not deleting the file or clearing the pool's error status before offlining a disk, or are there other gremlins at play here?
My current plan is to wait until the resilver completes, and then I'll attempt to delete that snapshot. Does this make sense? I've seen some evidence in older forum posts of pools acting strangely with corruption in a snapshot, but could be all anecdotal.
Should I at that point scrub the pool (even with a device missing)? I thought the resilver should basically touch all the data, so I don't see a reason to unless I misunderstand that point; I'm thinking not...
Assuming I don't scrub, the plan is: delete that snapshot, do a 'zpool clear', then shut down, pull the offlined disk (ada3), and replace it. Hopefully at that point I'm on 5 disks resilvering to the replacement with no issues...
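Concretely, assuming the resilver finishes clean, the plan would be something like this (the snapshot name is the one flagged in the zpool status output; please sanity-check me on the ordering):

```shell
# destroy the snapshot holding the damaged file
zfs destroy bluemesa/media@auto-20190105.0000-2w

# reset the error counters and the permanent-error list
zpool clear bluemesa

# confirm the pool looks sane before touching hardware again
zpool status -v bluemesa

# then: shutdown, pull ada3, install the next 4TB drive, boot,
# and do the GUI replace -> resilver as before
```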
Really not sure what is going on here, but it does make me a bit nervous. Any input you all have would be appreciated. I'm planning to upgrade to 11.2 once I get through this pool expansion; 9.3 has been solid for years, but it's past due even for my uses (light media serving, storage, local backups).
Thanks!
Camcontrol output:
Code:
[root@freenas] ~# camcontrol devlist
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD20EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada3)
<WDC WD20EFRX-68AX9N0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD20EFRX-68AX9N0 80.00A80>    at scbus5 target 0 lun 0 (pass5,ada5)
<MUSHKIN MKNUFDMH8GB PMAP>         at scbus7 target 0 lun 0 (pass6,da0)
glabel status output:
Code:
[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/5dce97b0-0d3b-11e9-b5ad-7054d21693cc     N/A  ada0p2
gptid/541ad802-0ec0-11e9-b79b-7054d21693cc     N/A  ada1p2
gptid/ba756e6a-ae44-11e2-a357-7054d21693cc     N/A  ada3p2
gptid/bae4d348-ae44-11e2-a357-7054d21693cc     N/A  ada4p2
gptid/55892d30-c89c-11e2-9452-7054d21693cc     N/A  ada5p2
gptid/abc9b96a-99cc-11e4-8b6f-7054d21693cc     N/A  da0p1
gptid/ce054d2f-1118-11e9-9c8a-7054d21693cc     N/A  ada2p2
gptid/ba660cbb-ae44-11e2-a357-7054d21693cc     N/A  ada3p1

[root@freenas] ~# gpart status
  Name  Status  Components
ada0p1      OK  ada0
ada0p2      OK  ada0
ada1p1      OK  ada1
ada1p2      OK  ada1
ada3p1      OK  ada3
ada3p2      OK  ada3
ada4p1      OK  ada4
ada4p2      OK  ada4
ada5p1      OK  ada5
ada5p2      OK  ada5
da0p1       OK  da0
da0p2       OK  da0
ada2p1      OK  ada2
ada2p2      OK  ada2
[root@freenas] ~#
Specs:
Code:
Case: Lian Li PC-Q25B
PSU: SeaSonic SS-300ET (OEM)
MB: Intel DBS1200KPR
RAM: 16GB Kingston ECC ValueRAM (KVR13E9K2/16I - 2x8GB kit)
CPU: Intel Celeron G555
Flash: Mushkin Mullholland 8GB (USB FreeNAS boot)
HDDs: 6x Western Digital 2TB Red (WD20EFRX)
HBA: Highpoint Rocket 620 (2 ports)
HSF: Noctua NH-L9i
Fan #1: Noctua NF-P14 FLX (Intake 140mm)
Fan #2: Noctua NF-S12B FLX (Outflow 120mm)