Replacing A Failed Single Drive With A Larger Hard Disk

pab49162 · Aug 19, 2019

Hi,

I would appreciate comments/confirmation regarding replacing a failed single drive with a larger hard disk.

I am running FreeNAS-11.2-U5 on a Dell T20 with four 3TB WD Red drives in a raidz2 configuration. Last night, I received an alert saying that my main pool’s state was degraded and one drive had been removed by the administrator.

I pulled that disk this morning and the Western Digital diagnostics found the drive has a lot of bad sectors. Unfortunately, it is 4 months past the warranty. :(

Instead of replacing this drive with another 3TB drive, I am wondering if I should go to something bigger.

If I understand some other postings and the user’s manual correctly, the smallest drive in a volume dictates the available size of the volume. So even if I replaced the bad drive with a 6TB drive, the volume’s available size wouldn’t change until the other 3 drives were replaced with larger drives.

I am thinking that as the remaining drives fail, I could replace the failed drive with 6TB drives. Then once the last drive is replaced and the resilvering is complete, the available size of my volume would basically double from 5.1TB to 10.2TB.

I would appreciate if one of the experts on this forum would correct/confirm my understanding on this.

Thanks in advance, Paul

PhiloEpisteme · Aug 19, 2019

pab49162 said:
I pulled that disk this morning and the Western Digital diagnostics found the drive has a lot of bad sectors. Unfortunately, it is 4 months past the warranty. :(

Depending on the type of error, if you configure your system to run automatic SMART self tests you'll likely be able to catch bad sectors early and replace a drive before it fully fails.

pab49162 said:
Instead of replacing this drive with another 3TB drive, I am wondering if I should go to something bigger.

I would suggest not, unless you plan to actively increase the size of your pool. You are correct that if you have three 3TB drives and one 6TB drive, the entire vdev will behave as if it has four 3TB drives. The reason I suggest not going with a 6TB drive unless you plan to actively increase the size of your pool is that, as you replace more and more failed drives, you'll likely end up with some of your 6TB drives failing before your last 3TB drive fails, and thus you'll have wasted money on that 6TB drive. Imagine how bummed you'd be if you had one pesky 3TB drive that decided to live for 8 years just to spite you and several of your 6TB drives died after 2 or 3 years. Anyway, just my suggestion. Save money and go with a 3TB drive and decide on your growth plan for when you need it, i.e. either replacing all of the drives with larger drives or adding another vdev to the pool.
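The sizing rule discussed here can be sketched in a few lines of Python. This is only a rough illustration, not how ZFS itself accounts for space: it assumes every disk in a RAIDZ vdev contributes only as much as the smallest member and that raidz2 gives up two disks' worth of space to parity. The function name `raidz_usable_tb` is made up for this example, and the result ignores metadata overhead and TB-versus-TiB conversion, which is presumably why the reported 5.1TB is lower than the 6TB computed below.

```python
def raidz_usable_tb(drive_sizes_tb, parity=2):
    """Approximate usable capacity of one RAIDZ vdev (same units as the input).

    Assumption: each member counts only as much as the smallest drive,
    and `parity` drives' worth of space is consumed by redundancy.
    """
    if len(drive_sizes_tb) <= parity:
        raise ValueError("need more drives than parity disks")
    smallest = min(drive_sizes_tb)
    return (len(drive_sizes_tb) - parity) * smallest


print(raidz_usable_tb([3, 3, 3, 3]))  # 6: the current four 3TB drives
print(raidz_usable_tb([3, 3, 3, 6]))  # 6: one 6TB drive changes nothing
print(raidz_usable_tb([6, 6, 6, 6]))  # 12: capacity doubles once all are replaced
```

The middle case is the one that matters for the purchase decision: the extra 3TB on a single larger drive stays unused until the last small drive is gone.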

pab49162 · Aug 19, 2019

PhiloEpisteme - Thank you so much for the quick reply to my post.

PhiloEpisteme said:
Depending on the type of error, if you configure your system to run automatic SMART self tests you'll likely be able to catch bad sectors early and replace a drive before it fully fails.

Great comment -- I have short and long SMART self tests running on a regular schedule but neither identified any issues the last time they ran. The actual alert didn't mention bad sectors but instead said the following:

Code:

HomeNAS.local kernel log messages:
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <WDC WD30EFRX-68EUZN0 82.00A82> s/n WD-WCC4N7ZRL370 detached
> GEOM_MIRROR: Device swap1: provider ada0p1 disconnected.
> (ada0:ahcich0:0:0:0): Periph destroyed
> GEOM_ELI: Device mirror/swap1.eli destroyed.
> GEOM_MIRROR: Device swap1: provider destroyed.
> GEOM_MIRROR: Device swap1 destroyed.
> (aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
> (aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted

-- End of security output --

I did a little googling and found that the most likely cause was a bad drive. So that is why I ran the Western Digital diagnostics. Any thoughts on if such an error could really be caused by bad sectors?
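As a rough aid for reading the alert, the two hex bytes in the `ATA status: d1 ... error: 04` lines can be decoded bit by bit. The sketch below is an assumption-laden illustration: the names BSY, DRDY, SERV, ERR and ABRT are taken from the log itself, while the remaining bit names are filled in from the usual ATA register layout as I recall it and should be checked against the standard before relying on them. The helper `decode_bits` is hypothetical.

```python
# Bit masks for the ATA status and error registers.
# Names confirmed by the log: BSY, DRDY, SERV, ERR (status) and ABRT (error).
# The other names are assumptions based on the common ATA register layout.
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI",
}


def decode_bits(value, names):
    """Return the names of all bits set in `value`, highest bit first."""
    return [name for mask, name in sorted(names.items(), reverse=True)
            if value & mask]


print(decode_bits(0xD1, ATA_STATUS_BITS))  # ['BSY', 'DRDY', 'SERV', 'ERR']
print(decode_bits(0x04, ATA_ERROR_BITS))   # ['ABRT']
```

Read this way, the alert reports an aborted command (ABRT) rather than the uncorrectable-data bit (UNC in the table above) that one might expect for an unreadable sector. That does not rule out a failing drive, but it is at least consistent with a communication problem.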

PhiloEpisteme said:
Save money and go with a 3TB drive and decide on your growth plan for when you need it, i.e. either replacing all of the drives with larger drives or adding another vdev to the pool.

Not sure I want my wife to read this as I already got "spouse approval" to buy a bigger drive. ;) Seriously, I was initially going to buy a 3TB drive but Amazon has the 4TB on sale for $1 more than the 3TB. Figured for the difference, it was probably worth it.

Thanks again for the info - Paul

PhiloEpisteme · Aug 19, 2019

pab49162 said:
I did a little googling and found that the most likely cause was a bad drive. So that is why I ran the Western Digital diagnostics. Any thoughts on if such an error could really be caused by bad sectors?

Hard to say given my limited knowledge there. If you still have the drive and can run a long SMART test and put the results here, that may show something.

pab49162 said:
Amazon has the 4TB on sale for $1 more than the 3TB.

I know the feeling. If you think you'll need more space soonish it wouldn't be ridiculous to take advantage of a sale. :)

pab49162 · Aug 19, 2019

PhiloEpisteme said:
If you still have the drive and can run a long SMART test and put the results here, that may show something.

Unfortunately, I don't have an easy way to run a SMART test since I removed the hard drive from the box. I tried to use a USB disk adapter but it apparently doesn't support running SMART tests. :(

Thanks again for all of your assistance. I just ordered a 4TB WD Red and it should arrive on Wednesday.

PhiloEpisteme · Aug 19, 2019

pab49162 said:
Unfortunately, I don't have an easy way to run a SMART test since I removed the hard drive from the box. I tried to use a USB disk adapter but it apparently doesn't support running SMART tests.

No worries, it was mostly a curiosity.

pab49162 · Aug 20, 2019

After I posted my reply last night, I tried a different approach to run a long SMART test and was successful. Instead of running it via Windows, I did it on Linux using the smartctl utility with the same USB disk adapter.

The results were interesting - the hard drive passed both a short and long SMART test with no errors. Here is a dump of a portion of the output from smartctl showing recent SMART test results including the last long (extended) test:

Code:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     23867         -
# 2  Short offline       Completed without error       00%     23860         -
# 3  Short offline       Completed without error       00%     23860         -
# 4  Conveyance offline  Completed without error       00%     23846         -
# 5  Short offline       Completed without error       00%     23816         -
# 6  Short offline       Completed without error       00%     23768         -
# 7  Short offline       Completed without error       00%     23720         -
# 8  Short offline       Completed without error       00%     23672         -
# 9  Short offline       Completed without error       00%     23624         -
#10  Extended offline    Completed without error       00%     23583         -
#11  Short offline       Completed without error       00%     23528         -
#12  Short offline       Completed without error       00%     23480         -
#13  Short offline       Completed without error       00%     23432         -
#14  Short offline       Completed without error       00%     23361         -
#15  Short offline       Completed without error       00%     23312         -
#16  Short offline       Completed without error       00%     23264         -
#17  Short offline       Completed without error       00%     23216         -
#18  Extended offline    Completed without error       00%     23176         -
#19  Short offline       Completed without error       00%     23120         -
#20  Short offline       Completed without error       00%     23082         -
#21  Short offline       Completed without error       00%     23034         -

and here are the SMART attributes showing Power On Hours of 23868 which correlates with the running of the last couple of SMART tests:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   178   021    Pre-fail  Always       -       5116
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       87
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23868
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       87
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       175
194 Temperature_Celsius     0x0022   115   108   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   001   000    Old_age   Always       -       5570
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

So while the disk successfully passed the long SMART test, the Western Digital Diagnostics extended test found "Too many bad sectors detected" as shown in this log:

Code:

Test Option: QUICK TEST
Model Number: WDC WD30EFRX-68EUZN0
Unit Serial Number: WD-WCC4N7ZRL370
Firmware Number: 82.00A82
Capacity: 3000.59 GB
SMART Status: PASS
Test Result: PASS
Test Time: 09:06:25, August 19, 2019

Test Option: EXTENDED TEST
Model Number: WDC WD30EFRX-68EUZN0
Unit Serial Number: WD-WCC4N7ZRL370
Firmware Number: 82.00A82
Capacity: 3000.59 GB
SMART Status: PASS
Test Result: FAIL
Test Error Code: 08-Too many bad sectors detected.
Test Time: 17:35:16, August 19, 2019

I guess the bottom line is that you should run both the SMART tests and the manufacturer's test utility if you think a hard drive has issues.

PhiloEpisteme · Aug 20, 2019

So, looking at your SMART results, the value of UDMA_CRC_Error_Count suggests an issue communicating with the controller, such as a bad cable or bad connection. Though, to be honest, USB enclosures etc. add a bit of an uncontrolled variable and could possibly themselves be causing those errors and thus be a red herring.

What is really interesting to me is that the WD tool reports Test Error Code: 08-Too many bad sectors detected. and yet SMART is reporting zero for all of Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable.

I do wonder if the WD utility was running tests and had issues communicating with the drive and treated it as a bad sector when in fact perhaps a cable or connection issue precluded communicating with the controller reliably?

Anyway, certainly not an expert on these things but it is interesting. If it was a bad cable or bad connection it is entirely possible the drive is still good. The drive has been powered on for a few months shy of three years though, perhaps something went bad on the drive causing those UDMA_CRC_Error_Count errors.

I wonder what your other drives would report for UDMA_CRC_Error_Count after a long self-test.
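For checking these counters on several drives, a small Python sketch along the following lines could pull the raw values out of saved smartctl attribute output. It assumes the ten-column table layout shown in the dump above; the function names and the `WATCHED` list are my own choices for illustration, not anything smartctl provides.

```python
# Attributes discussed in this thread; the selection is an assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_raw_values(smartctl_text):
    """Map attribute name -> raw value string from a smartctl attribute table."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values


def nonzero_watched(values):
    """Return the watched attributes whose raw value is not zero."""
    return {name: int(values[name]) for name in WATCHED
            if name in values and int(values[name]) != 0}


sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   001   000    Old_age   Always       -       5570
"""
print(nonzero_watched(parse_raw_values(sample)))  # {'UDMA_CRC_Error_Count': 5570}
```

With the rows from the dump above, only the CRC counter stands out, which matches the observation that the sector-related counters are all zero.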

pab49162 · Aug 20, 2019

PhiloEpisteme - I think you may be right about the UDMA_CRC_Error_Count being caused by the USB adapter. After I ran the WD tool, I noticed that this count was about 2700. After running the long SMART test on Linux, I noticed this count was its present value of 5570. So potentially, the adapter is causing those problems.

As I am periodically running the long SMART test on all of the drives, I just checked the UDMA_CRC_Error_Count on the 3 drives still in my FreeNAS server. The count is 0 for all three of those drives.

It is interesting that the WD tool may have misinterpreted the communications errors with the USB adapter as bad sectors. I previously used the same adapter and WD tool to test another of the 3TB WD Red drives which failed after 6 months. In that case, the extended test worked just fine and did not report any errors.

So now I am wondering if the drive is actually good and maybe I have a bad cable. However, the cable is only 3 years old and has never been touched after building the system until I removed the hard drive yesterday.

What would you think about me zeroing this drive, then reinstalling it back into the system and then doing a "Replace" to add it back into the pool? It would be interesting to see if the disk works or if FreeNAS flags it with more errors.

If I did this, do you see any potential issue where I could cause damage to the data stored on this system?

Thanks, Paul

PhiloEpisteme · Aug 20, 2019

pab49162 said:
As I am periodically running the long SMART test on all of the drives, I just checked the UDMA_CRC_Error_Count on the 3 drives still in my FreeNAS server. The count is 0 for all three of those drives.

Great! Your controller is probably not busted.

pab49162 said:
What would you think about me zeroing this drive, then reinstalling it back into the system and then doing a "Replace" to add it back into the pool? It would be interesting to see if the disk works or if FreeNAS flags it with more errors.

If I did this, do you see any potential issue where I could cause damage to the data stored on this system?

If I were you I'd give it a shot. I always have backups of my data and I use RAIDZ2 so the disk is very very unlikely to ruin my system. If you can get that drive hooked back up directly to the board or HBA and run the long self-test again and see what it says, you may be fine. I don't think the count for UDMA_CRC_Error_Count gets reset so it may hang at the same 5570. You'll want to take note of the value before and after you run the test once you get it plugged in, just to be sure it hasn't changed. If you wanted to be even more sure of the health of the drive you could run badblocks on it followed by a final long self-test. I'm not sure if I would personally do the badblocks test on a drive I was already using, but I'm sure some folks would argue it is worth doing. I suppose that is up to you.
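The before-and-after check can be as simple as comparing two snapshots of the raw values. Here is a minimal Python sketch, assuming the values have already been copied into dictionaries by hand or by a script; `changed_counters` is a hypothetical helper, and the example numbers are the roughly 2700 and 5570 readings reported earlier in the thread while the drive was on the USB adapter.

```python
def changed_counters(before, after):
    """Return {name: (old, new)} for every counter whose raw value differs."""
    return {name: (before[name], after[name])
            for name in before
            if name in after and after[name] != before[name]}


# Raw values noted before and after a test run (illustrative snapshots).
before = {"Reallocated_Sector_Ct": 0, "UDMA_CRC_Error_Count": 2700}
after = {"Reallocated_Sector_Ct": 0, "UDMA_CRC_Error_Count": 5570}

print(changed_counters(before, after))  # {'UDMA_CRC_Error_Count': (2700, 5570)}
```

An empty result after a long self-test on the direct SATA connection would suggest the CRC errors came from the adapter rather than the drive.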

pab49162 said:
So now I am wondering if the drive is actually good and maybe I have a bad cable. However, the cable is only 3 years old and has never been touched after building the system until I removed the hard drive yesterday.

The bad cable idea isn't the only possible cause, though it is a common one. It is possible that something went wrong on the plug or PCB on the drive, causing a communication error as well. Of course, if this is the case and you hook it up to the machine directly, run the long self-test, and then check the value, it should have gone up, indicating the drive is bad.

My personal approach would be to use the drive if I could convince myself it was not about to fail. I would keep the replacement drive as a spare for when one of your data drives inevitably fails. If possible you'll want to burn in the replacement drive: run a short self-test, followed by a long self-test, followed by badblocks, followed by a long self-test. After all of that no errors should be reported and the values for the SMART fields I listed above should all be 0. If the backup drive passes that, it is ready to go. This is my personal burn-in process; I'm sure others have some they like better.
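The burn-in order described above could be written down as a command plan. This Python sketch only builds the command lists and does not run anything. The device path `/dev/ada0` is a placeholder taken from the kernel log earlier in the thread, the final `smartctl -A` step is my addition for reading back the attribute values, and the `badblocks -ws` step is a destructive write test that erases the drive. Note that `smartctl -t` only starts a self-test, so each test has to finish before the next step begins.

```python
def burn_in_commands(device):
    """Return the burn-in steps, in order, as argument lists (nothing is executed)."""
    return [
        ["smartctl", "-t", "short", device],  # start short self-test
        ["smartctl", "-t", "long", device],   # start long self-test
        ["badblocks", "-ws", device],         # DESTRUCTIVE write test, shows progress
        ["smartctl", "-t", "long", device],   # final long self-test
        ["smartctl", "-A", device],           # read back raw attribute values
    ]


for step in burn_in_commands("/dev/ada0"):
    print(" ".join(step))
```

Keeping the plan as data makes it easy to review the destructive step before anything touches a disk. On large drives badblocks may also need a bigger block size via its `-b` option.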

pab49162 · Aug 20, 2019

PhiloEpisteme said:
My personal approach would be to use the drive if I could convince myself it was not about to fail ... If possible you'll want to burn in the replacement drive: run a short self-test, followed by a long self-test, followed by badblocks, followed by a long self-test.

I like your recommendation and that is the direction I am going ...

I have already reinstalled the drive in my FreeNAS box and ran short and conveyance self-tests. Both passed with no errors. I now have a long self-test running and it will take about 7 hours to complete. If it has no errors, I will run a badblocks check, which takes about 50 hours to complete. If no problems are found there, I will run another long self-test. (This is basically your burn-in process and is similar to what I have used in the past.)

So if everything passes with no errors and the UDMA_CRC_Error_Count hasn't increased, I will then do a "Replace" to add it back into the pool. Then I will monitor what happens over the next week or two, especially the next time the system does a scrub.

One question, do you think I should wipe the drive before I do the replace? This probably isn't necessary as I am guessing the resilvering process is smart enough to ignore the existing old data. But I am wondering if this might be a good stress test for the drive.

PhiloEpisteme said:
I would keep the replacement drive as a spare for when one of your data drives inevitably fails.

As far as the new replacement drive goes, I am not sure if I will keep it or return it. While it would be convenient to have a spare lying around, I can easily get a replacement drive in about 2 hours. I would hate to have the spare drive sitting unused for a year or more burning through its 3-year warranty.

I will post another update if I find a problem or after the burn-in process is complete.

Thanks again for all of your assistance - Paul

PhiloEpisteme · Aug 20, 2019

pab49162 said:
One question, do you think I should wipe the drive before I do the replace? This probably isn't necessary as I am guessing the resilvering process is smart enough to ignore the existing old data. But I am wondering if this might be a good stress test for the drive.

If you run badblocks it'll wipe the drive anyway.

pab49162 said:
So if everything passes with no errors and the UDMA_CRC_Error_Count hasn't increased, I will then do a "Replace" to add it back into the pool. Then I will monitor what happens over the next week or two, especially the next time the system does a scrub.

Yeah, that sounds right. Keep in mind that smartctl will not necessarily say a test failed even if there are bad sectors etc. It's best to look at the raw values of the fields we discussed earlier. If I were you I'd run a scrub right after the resilvering is complete as well.

pab49162 said:
As far as the new replacement drive goes, I am not sure if I will keep it or return it. While it would be convenient to have a spare lying around, I can easily get a replacement drive in about 2 hours. I would hate to have the spare drive sitting unused for a year or more burning through its 3-year warranty.

Fair enough, especially if you can hold onto it unopened until you confirm your other drive is either good or bad.

Anyway, hope all works out. I'm curious to see how your tests work out when you're done. Report back. :)

pab49162 · Aug 20, 2019

PhiloEpisteme said:
If you run badblocks it'll wipe the drive anyway.

Thanks for the reminder - I forgot that badblocks is a destructive test.

PhiloEpisteme said:
Keep in mind that smartctl will not necessarily say a test failed even if there are bad sectors etc. It's best to look at the raw values of the fields we discussed earlier.

Thanks - I will definitely look at the raw values. I saved a copy of the values before I started the short and conveyance self-tests.

PhiloEpisteme said:
Fair enough, especially if you can hold onto it unopened until you confirm your other drive is either good or bad.

That is my plan - I have 30 days to return it so I should be good.

PhiloEpisteme said:
Anyway, hope all works out. I'm curious to see how your tests work out when you're done. Report back

Thanks - I will definitely report back once the tests are done (probably 4 or 5 days).

pab49162 · Aug 24, 2019

PhiloEpisteme - Here is an update ...

It took almost 3 days for the badblocks check to finish and it completed with no errors. I then ran two long SMART tests and the second one completed late last night. Both of them also completed with zero errors.

I just looked at the SMART data -- none of the error counts changed since I reinstalled the drive and started all of the testing, including the badblocks check. So I am thinking that the drive may be fine and the original problem was some type of fluke.

Based on these results, I just did a "Replace" of the disk within FreeNAS and the resilvering process is at 4%. One interesting thing was that FreeNAS saw that the old drive was ZFS formatted and I had to check a "Force" box to initiate the replace. I wasn't expecting that, as I thought that badblocks would have destroyed both the data and formatting.

I am going to see how things go over the next week or so. If I don't have any issues, I will probably return the new 4TB drive. Its box is sitting here unopened and, hopefully, will stay that way.

pschatz100 · Aug 24, 2019

Personally, I do not mess around with questionable drives. If a drive begins to show errors, and I know it is not due to something else like a cable not plugged correctly, etc., I just replace it and do not look back. According to your SMART tests, the drives are almost three years old. Why take the risk?

At the end of the day, the cost of a drive is not a big deal to me considering how much time and effort I put into organizing and curating my data.

As for replacing the disk with a larger one, no harm there. I did just that a few years ago. As it so happened, about six months after I replaced the questionable drive, the larger disks went on sale. I used the opportunity to replace all of the drives and increase my storage. I set up my old drives as external backup. Nothing was wasted.

pab49162 · Aug 24, 2019

pschatz100 said:
Personally, I do not mess around with questionable drives. If a drive begins to show errors, and I know it is not due to something else like a cable not plugged correctly, etc., I just replace it and do not look back. According to your SMART tests, the drives are almost three years old. Why take the risk?

Thanks for the feedback and I totally agree -- I normally don't mess with questionable drives. If I have any doubt about a drive, I replace it.

That said, the original error I got with this drive was weird and didn't make much sense. It seemed to be a communication error rather than the usual bad sector or totally dead drive. So I decided to do more testing in an attempt to isolate the problem to the hard drive, the disk controller itself or even the cable.

My initial guess was that the drive was bad, so I removed the drive and did further testing to hopefully confirm that the drive was bad. However, as I did the further testing, I couldn't find anything wrong with the drive. So I put the drive back in the original system to do even more testing under the original conditions. But even after doing this, I didn't see problems.

At this point, I hope that putting things back together will either reproduce the error or I won't see any more issues. If it is the former, I will try a couple different things to hopefully isolate the problem to a specific hardware component.

One quick question for you - do you have any thoughts on what would cause this error:

Code:

> (aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
> (aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted

As I mentioned earlier, I think it could be the hard drive, the disk controller itself or even the cable. Any suggestions on how to find the root cause would be appreciated.

Thanks, Paul

pschatz100 · Aug 25, 2019

Over many years, I have had exactly two problems related to disks, other than the disk itself going bad: 1) a SATA cable was not making a good connection, and 2) the power cable to a disk was flaky.

In the first, I rearranged the SATA cables - and the problem followed the bad SATA cable. In case you are not aware, you can re-plug the disks in any order and the system will still access them OK.

In the second, I had a Molex to SATA power adapter fail. This was a difficult one to troubleshoot because one would not expect problems with a simple power adapter. After replacing SATA cables, checking my motherboard, and even replacing the power supply itself - I replaced the power adapter and the problem went away.

pab49162 · Aug 25, 2019

pschatz100 - Thanks for the information on rearranging SATA cables. I was wondering about that last night and did a bit of googling on that question. It is good to have someone knowledgeable state that I can indeed swap SATA ports and the system will still be happy.

If the problem reoccurs, my plan will be to first swap two of the SATA cables right at the disk drive. This is easy since two of the drives are right next to each other. Depending on the results, I would then probably swap cables at the motherboard ports. I think I would then have a better idea of what component is faulty.
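The two-step swap plan can be written out as a small decision sketch in Python. It rests on simplifying assumptions that are mine, not established in the thread: exactly one component is faulty, the error reproduces reliably, and the power connection stays with the drive throughout. The function `isolate_fault` and its argument names are hypothetical.

```python
def isolate_fault(moved_after_drive_end_swap, moved_after_board_end_swap=None):
    """Narrow down the faulty part from the results of the two cable swaps.

    moved_after_drive_end_swap: True if, after swapping two SATA cables at
        the drive end, the errors show up on the other drive.
    moved_after_board_end_swap: True if, after then swapping those cables
        at the motherboard end, the errors move again; None if not tried.
    """
    if not moved_after_drive_end_swap:
        # The errors stayed with the drive despite a different cable and port.
        return "drive (or its power connection)"
    if moved_after_board_end_swap is None:
        return "cable or motherboard port; swap at the board end to narrow it down"
    if moved_after_board_end_swap:
        # The errors followed the port rather than the cable.
        return "motherboard port / controller"
    return "SATA cable"


print(isolate_fault(False))       # drive (or its power connection)
print(isolate_fault(True, False)) # SATA cable
print(isolate_fault(True, True))  # motherboard port / controller
```

An intermittent fault would need several observations per step before trusting any branch of this logic.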

As far as the power adapter, I have a couple of those in other systems so I will have to remember that they can go bad. Like yourself, I sure wouldn't have expected one of those to go bad.

Thanks again for the tips - Paul
