Need to replace drives on mirror, but...

ChrisNAS (Explorer) - Joined: Apr 14, 2017 - Messages: 71
Hello,

So today my pool of 2 x 4 TB hard drives in a mirror setup got filled up by an unexpected Windows backup and another user. The root volume says 96%, but every dataset within says 100%.

To top things off, while trying to delete some files, I've noticed some deletes are very slow and at least one of the drives is not sounding so good. At this point I'm nervous about the best approach to safely get out of this mess.

I would very much appreciate some guidance here.

Current Knowns:

- FreeNAS-11.1-U6
- Motherboard has 4 SATA ports: 2 are used for the mirror, 1 for the boot SSD, 1 for an optical drive.
- Existing drives: 2 x 4 TB IronWolf NAS
- New drives: 2 x 8 TB Exos Enterprise
- Deleted files manually over SMB to regain some space.
- Turned off snapshot tasks.
- Rebooted to see if the reported size updated after the deletes. Startup took much longer than usual; the console showed the pool importing and lots of drive activity. At least one of the existing drives does not sound so good... I think.
- Once startup completed, the console just kept repeating: check_create_dir: mkdir ...... no space left on device
- Even though I've deleted some large files, the GUI does not change from 100% on the datasets.
- The SMB service did not start on boot and will not start when clicking Start Now.
- Accessing and copying some files from FreeNAS to my local system over SFTP seems normal, speed- and sound-wise.
- Some deletes are very slow and come with drive noise (not sure if it's bad or normal, since the case is usually closed and very quiet).
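
In case it helps with diagnosis, my understanding is that the real space numbers can be pulled from a shell session like this (I haven't run these myself yet; "tank" is just a placeholder for my actual pool name):

zpool list tank            # overall pool capacity and free space
zfs list -r -o space tank  # per-dataset breakdown, including space held by snapshots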

Current Unknowns / To Be Confirmed:

I've searched, and lots of the info on replacing drives seems to cover everything except a simple 2-drive mirror. I'm assuming the process is similar?

- Offline a drive (the one making noticeably more noise than the other) using the GUI.
- Shut down and replace that old 4 TB drive with a new 8 TB drive.
- Start the system and use the GUI to replace the offlined drive with the new one.
- Wait for a (hopefully) successful resilver.
- If successful (what's the best way to confirm?), repeat these steps for the second old drive.

Are the above steps correct?
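
For what it's worth, my understanding of the command-line equivalent of those steps is roughly the following (I'd actually use the GUI; "tank", "ada1", and "ada2" are placeholders, not my real pool or device names):

zpool offline tank ada1       # take the noisier old disk out of the mirror
# shut down, swap the 4 TB disk for an 8 TB disk, boot back up
zpool replace tank ada1 ada2  # resilver onto the new disk
zpool status tank             # watch resilver progress; ONLINE with no errors when done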

And to minimize risk and increase the chance of a successful outcome, should I:

A: Use SFTP (since SMB suddenly isn't working) to copy the files that are not yet backed up over to a separate drive, and then attempt the drive replacements? Or would this decrease my chances of a successful replacement by stressing a possibly failing drive? (A rough copy command is sketched after option D below.)

B: Jump right into replacing the drives one at a time, using the existing in-use SATA ports.

C: Add one of the new drives on the cable the optical drive is currently on, boot up, and then replace a drive, shifting which cable each drive is connected to.

D: I'm not sure yet whether this is possible or safe, but it would be nice to move FreeNAS to a USB stick, disconnect the optical drive, and add the 2 new drives as a new mirror. Can this be done?
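
Regarding option A, the rough kind of copy I have in mind (pulled from my desktop; the host name, user, and paths below are made up for illustration) would be something like:

rsync -avh --progress chris@freenas.local:/mnt/tank/important/ /backup/important/  # pull over SSH, resumable if interrupted

or plain sftp/scp if rsync isn't an option.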

Thank you in advance for your input.
 
Joined: Oct 18, 2018 - Messages: 969
- Some deletes are very slow and come with drive noise (not sure if it's bad or normal, since the case is usually closed and very quiet).
Do you run regular SMART tests?

A: Use SFTP (since SMB suddenly isn't working) to copy the files that are not yet backed up over to a separate drive, and then attempt the drive replacements? Or would this decrease my chances of a successful replacement by stressing a possibly failing drive?
If one of your drives is indicating a failure, you should resilver it, I think. If you don't, and you then experience any disk issue with the other disk(s), you could lose your data.

B: Jump right into replacing the drives one at a time, using the existing in-use SATA ports.
If your pool is already 100% full, you may want to try to resolve that prior to resilvering. Are there old snapshots you can delete, or anything like that?
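
If you want to see which snapshots are holding space, something along these lines should work from the shell (the pool, dataset, and snapshot names here are just examples):

zfs list -t snapshot -o name,used -s used  # list snapshots, biggest space users last
zfs destroy tank/dataset@auto-20181001     # delete one specific snapshot

Double-check the name before destroying; zfs destroy is not reversible.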

C: Add one of the new drives on the cable the optical drive is currently on, boot up, and then replace a drive, shifting which cable each drive is connected to.
This would probably be the safest way to resilver.

D: I'm not sure yet whether this is possible or safe, but it would be nice to move FreeNAS to a USB stick, disconnect the optical drive, and add the 2 new drives as a new mirror. Can this be done?
Absolutely; many people boot from USB. I do not recommend making this change while you're having drive and pool issues, though. If the only reason you're looking to move to booting from USB is to free up a SATA port, consider picking up a cheap used HBA to give you more ports so you can continue to boot from the SSD; it is quite a bit more reliable than a USB stick.
 

ChrisNAS (Explorer) - Joined: Apr 14, 2017 - Messages: 71
Thank you for your replies.

I did make some progress in reducing from 100%. I tried nulling a large file and it did not work. I deleted a few larger snapshots and that did work, bringing me down about 40 GB. I was also able to start SMB after deleting those few snapshots.

This was helpful: https://www.thegeekdiary.com/how-to-delete-files-on-a-zfs-filesystem-that-is-100-full/

I then tried deleting more files, but I'm not sure where the space goes, since the GUI does not show any reduction just from deleting files. I think it has to do with snapshots: a deleted file's space isn't actually freed if the file is still referenced by a snapshot, or something like that. So it seems the only way to get out of this and actually reduce usage is to delete snapshots.
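
If I understand it right, this is how one could check how much space is pinned by snapshots versus the live files on a dataset (the dataset name is just an example, and I haven't run this yet):

zfs get usedbysnapshots,usedbydataset tank/mydata  # space held by snapshots vs. current files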

The system has been sitting on since last night and seems normal. The repeating mkdir error hasn't been an issue since the size reduction. To answer your question about SMART tests: I have not run any, at least not for a long time. And I'd rather not stress the drives right now.

I did, however, just read the SMART status on both drives:

ada0

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 064 044 Pre-fail Always - 218225144
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 088 060 045 Pre-fail Always - 669650378
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22078 (50 6 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 20
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 063 040 Old_age Always - 32 (Min/Max 24/33)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 69
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 98
194 Temperature_Celsius 0x0022 032 040 000 Old_age Always - 32 (0 23 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 22716 (212 185 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 40942727703
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 123144990566

ada1

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 064 044 Pre-fail Always - 142725408
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 087 060 045 Pre-fail Always - 461349882
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22078 (14 129 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 20
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 061 040 Old_age Always - 32 (Min/Max 24/34)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 69
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 96
194 Temperature_Celsius 0x0022 032 040 000 Old_age Always - 32 (0 23 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 22714 (199 240 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 40943666359
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 123141838241
 
Joined: Oct 18, 2018 - Messages: 969
Have you run any SMART tests on them, such as smartctl -t short /dev/<device>? If not, the results may be a bit misleading. The results, as they are, look good, but SMART can give false negatives.
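
A minimal way to kick off a test and then read the result, assuming your disks really are ada0 and ada1 as in your output:

smartctl -t short /dev/ada0  # quick self-test, a couple of minutes
smartctl -t long /dev/ada0   # full surface scan, several hours on a 4 TB disk
smartctl -a /dev/ada0        # afterwards, check the self-test log and attributes

Run the same for /dev/ada1. The long test does read the whole disk, so if you're worried about stressing a failing drive, you may prefer to wait until after you've replaced it.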

If you're able to use the pool and have gotten usage lower (what percentage is it at now?), you may want to consider increasing the size of your pool. Deleting files is only a temporary fix because it will just fill up again.
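
Since you already have the 8 TB disks, note that once both halves of the mirror have been replaced and resilvered, the pool can grow into the extra space. Roughly (the pool and device names are placeholders, and the GUI normally handles this for you):

zpool set autoexpand=on tank  # allow the pool to use the larger disks
zpool online -e tank ada0     # expand a device to its full size if it doesn't grow automatically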
 