.rrd files all failed writing at once - I/O write failure during scrub causes pool degredation

fossnik · May 22, 2019

I'm experiencing a very disturbing problem with my FreeNAS rig, which has never occurred before. I'm not an expert, but I am seeking help about how to proceed.

Allow me to describe the situation. I set up this system 3 or 4 years ago - AsRock Rack C2550D4I 16GB ECC dimms, FreeNAS 11.1 U7 on 32GB USB Flash, and a couple 4TB WD Red HDDs organized as an 8TB pool without redundancy. I have never in that time had any critical errors or hardware faults. I always kept it updated, had it on a UPS battery backup, and it ran like champ, clocking 3 or 4 months uptime between fairly regular updates I would perform, scrubbing the volume every 35 days, and never producing any checksum mismatches. I use it for my own personal file storage, over SMB, and most of the time system load was about zero - extremely light usage.
So, I was highly distressed when I logged into the HTTP webclient and saw the flashing red light indicating a critical error, and that apart from the critical warning, the actual web interface was unresponsive (Stuck at "Loading...")(see screenshot). After re-booting the system, I found that many of my reporting databases had been degraded. (see screenshot). None of my other files were active at the time this occurred, so they're probably not affected.
I saw that one of my drives had failed during a recently initiated automatic scrub. It was about 20% through the scrub, which resumed after reboot.
Each time I executed zpool status -v I noticed that the checksum errors (97 in this screenshot) continued to increase, although the actual number of files affected remained the same. It made me nervous that the checksum error count kept rising, because I know that FreeNAS seems to update those reporting databases incessantly, so I just powered down the system. Now I'm not really sure what to do.

My basic concerns can be summarized:

is the HDD actually in failure or pre-failure and in need of replacement? (SMART readouts in screenshot)
how do I get the system running normally again?
if the .rrd files, which had been appended to for years, just failed in this one moment in time to be written to, wouldn't that leave them mostly intact (perhaps recoverable even)?

Thanks in advance for any help rendered - It is GREATLY appreciated!

sretalla · May 22, 2019

Part of your problem will be that your pool has no redundancy. It's a stripe of 2 disks.

The SMART data certainly indicates a disk with problems (I guess it's /dev/ada0), a huge number of UDMA CRC errors, so that will be the degraded one. You will not have a good time if you continue to try to use that disk.

With that drive failed, your pool is gone, so get your data off the pool immediately while it is still running. Backup, destroy pool, insert new disk(s) build new pool, restore.

When you get a replacement disk, consider how much you love your data and going with a mirrored pool instead.

There is no option to simply replace the failed disk as removing the failed one will kill the pool.

Don't worry about the reporting databases... they will get recreated when you replace the pool.

The fact that your reaction was to scrub the pool indicates you may think there's something that the system can do to fix filesystem errors... in order for that to be true, you need to have more than one copy of your data (or at least some parity data), which isn't the case in a stripe. ZFS has checksums on your files, so it can see some errors due to that when you run a scrub, but has no parity data/additional copies to recover from when errors are found (i.e. scrubs will only ever find permanent errors for you).

Chris Moore · May 22, 2019

sretalla said:
UDMA CRC errors

Sometimes CRC errors are just bad cabling. You might try new cables first.

fossnik said:
a couple 4TB WD Red HDDs organized as an 8TB pool without redundancy.

You should not have created a pool with no redundancy. You are guaranteed that eventually one of the two drives will fail and when that happens, you will loose all the data on the pool.

Things to try from the command line:
You could try the command zpool clear redzpoole to clear the error state.
https://docs.oracle.com/cd/E19253-01/819-5461/gazge/index.html
After that you could try to use the command zpool online redcpoole but you need to add to the end of that command the full gptid/bunch of numbers letters and dashes for the device that is offline.
https://docs.oracle.com/cd/E19253-01/819-5461/gazgk/index.html

I highly recommend adding a mirror to each of those drives to give you redundancy, if this works, so your data is better protected going forward.

sretalla · May 22, 2019

fossnik said:
I set up this system 3 or 4 years ago

fossnik said:
a couple 4TB WD Red HDDs

The disks have very different numbers of power-on hours... did you add them both at the beginning, or was the 2nd drive (the one that is not failing) added about 2 years ago?

Chris Moore said:
Sometimes CRC errors are just bad cabling. You might try new cables first.

For it to be a cabling error after about 58'000 hours of power-on, it would seem a little odd to me unless there was some kind of physical intervention in the box to bring about some kind of change to the status quo.

Chris Moore · May 22, 2019

fossnik said:
is the HDD actually in failure or pre-failure and in need of replacement? (SMART readouts in screenshot)

The two hard drives you gave us partial information on are of different ages. One is reporting 58005 power on hours, if I am reading that photo correctly while the other is reporting 25434 power on hours.
Just to give you some perspective on that, 58000 hours works out to 6.6 years of power on time for the drive. That is beyond the expected life of most hard drives and that is the drive that appears to be faulted. If you are able to get this pool back online, you need to make a backup and start with some new drives.

Chris Moore · May 22, 2019

sretalla said:
For it to be a cabling error after about 58'000 hours of successful spin, it would seem a little odd to me unless there was some kind of physical intervention in the box to bring about some kind of change to the status quo.

Perhaps it is wishful thinking on my part, but I have seen a reseat of a connection clear up a data error. It is grasping at straws, but the only hope I see.

fossnik · May 22, 2019

sretalla said:
The disks have very different numbers of power-on hours... did you add them both at the beginning, or was the 2nd drive (the one that is not failing) added about 2 years ago?

For it to be a cabling error after about 58'000 hours of power-on, it would seem a little odd to me unless there was some kind of physical intervention in the box to bring about some kind of change to the status quo.

I think I started with one disk and added another later on (if I remember correctly). So it might not actually be a stripe.

Chris Moore · May 22, 2019

fossnik said:
So it might not actually be a stripe.

No, it is, you showed us the zpool status so there is no question about that.

fossnik · May 22, 2019

Oh, that's right - I think I did set up a stripe, but I had started out with just one HDD, and then created the striped pool when I bought a second drive some years later... I kind of regret buying the more expensive WD Red HDDs now, given that any drive will fail at some point anyway.
I guess now I need to buy an 8TB drive, copy the whole pool over to it, and then probably mirror it with another 8TB drive (when I'm able to afford a second one). I have been meaning to do so for a while, but it's just rather expensive for me.
It's sort of ironic that my reasoning for putting the reporting databases on the HDD pool was that I thought the frequent writes would probably burn out the USB flash drive. I'm curious how others have addressed this issue. Would I have been better off leaving it on the NAND Flash?

Chris Moore · May 22, 2019

fossnik said:
I kind of regret buying the more expensive WD Red HDDs now, given that any drive will fail at some point anyway.

There are so many problems with this.
The problem here is not the kind of drive. The problem is that you did not set the system up with redundancy, and you did not monitor the system for faults, and you used a drive for 6.6 years with no thought that a mechanical device might eventually fail. So take that finger you are pointing at the hard drive and turn it around and point it at yourself. If you are going to do a thing, you should learn enough to do it right. You do not drive a car without ever learning how, if you did you would crash. Same for maintaining a NAS.
Hard drive failure rates go up significantly after the five year mark. That is a well-documented statistic in the industry. They are engineered to a certain standard and the expectation is they will be replaced on some kind of schedule, like tires on a car. They do not last forever, no mechanical device does.

fossnik said:
I thought the frequent writes would probably burn out the USB flash drive.

It would have.

fossnik said:
I'm curious how others have addressed this issue.

We don't suggest using USB flash drives any more.

fossnik said:
Would I have been better off leaving it on the NAND Flash?

USB flash drives do not use the same quality of memory that SSDs use. The modern guidance is to use a SSD for the boot drive.

fossnik · May 22, 2019

You're totally right. I can't believe 6 years went by and I never managed to implement redundancy. That's the reason I bought a second hdd, but I changed my mind and decided I would rather have the extra space instead. That's why human error is the cause of 90% of data loss.

fossnik · Jun 16, 2019

I think the disk drive may not have been the issue. Working with an entirely different drive, I encounter the same error. Some of the SATA ports appear to work better than others, and I think the motherboard is probably to blame.

Chris Moore · Jun 16, 2019

fossnik said:
I think the motherboard is probably to blame.

Very possible, which is the reason for this post:
https://www.ixsystems.com/community/threads/forum-guidelines.45124

If we know what hardware you are working with, it makes troubleshooting much simpler.

fossnik · Jun 16, 2019

I have added HW information to my signature.
I have the AsRock Rack C2550D4I with Intel(R) Atom(TM) CPU C2550 @ 2.40GHz
The C2550D4l has a lot of SATA - Some of the ports seem to work fine, but I'm not sure how I can know for sure.

2 x SATA3 6.0Gbps, 4 x SATA2 3.0Gbps by C2550
4 x SATA3 6.0Gbps by Marvell SE9230, 2 x SATA3 6.0Gbps by Marvell SE9172

Chris Moore · Jun 16, 2019

fossnik said:
I have added HW information to my signature.

Please keep in mind that anyone viewing the forum from a mobile device (phone or tablet) will not see the signature. Please be polite if someone asks.

fossnik said:
I have the AsRock Rack C2550D4I with Intel(R) Atom(TM) CPU C2550 @ 2.40GHz

These systems were having a lot of failures a couple years ago. Did your system board get replaced?

fossnik · Jun 16, 2019

Chris Moore said:
Please keep in mind that anyone viewing the forum from a mobile device (phone or tablet) will not see the signature. Please be polite if someone asks.

These systems were having a lot of failures a couple years ago. Did your system board get replaced?

I bought it about 6 years ago and it was never replaced.

Chris Moore · Jun 16, 2019

fossnik said:
I bought it about 6 years ago and it was never replaced.

You might not be having the same (or even similar) fault, but you might still want to read up on it:
https://www.ixsystems.com/community/threads/asrock-rack-c2550d4i-second-rma.60940/post-433056

Try some searches of the forum.

Chris Moore · Jun 16, 2019

As I recall, the problem is simply a CRC error count that continues to increase. You can, if you have not yet, replace the cable from the drive to the system board and you can plug the drives into different connectors on the system board. FreeNAS doesn't care which port the drive is plugged into as long as it can find the dive when it goes searching at boot.

fossnik · Jun 16, 2019

Yeah, it's probably a similar issue.
So I have the (2x) 4TB drives and I want to copy all the extents of the that striped storage pool onto my new 8TB disk. What is the best way to do this?

Chris Moore · Jun 16, 2019

fossnik said:
Yeah, it's probably a similar issue.
So I have the (2x) 4TB drives and I want to copy all the extents of the that striped storage pool onto my new 8TB disk. What is the best way to do this?

Do you want to just copy all the current data over to the 8TB drive and use it as a single drive pool? Is your system physically large enough for you to put all the drives in the system at the same time?

Important Announcement for The TrueNAS Community.

.rrd files all failed writing at once - I/O write failure during scrub causes pool degredation

Dabbler

Attachments

Powered by Neutrality

Hall of Famer

Powered by Neutrality

Hall of Famer

Hall of Famer

Dabbler

Hall of Famer

Dabbler

Hall of Famer

Dabbler

Dabbler

Attachments

Hall of Famer

Dabbler

Hall of Famer

Dabbler

Hall of Famer

Hall of Famer

Dabbler

Hall of Famer

Similar threads

Important Announcement for The TrueNAS Community.