How screwed am I? Or where do I go from here?

Status
Not open for further replies.

caddy013

Dabbler
Joined
Sep 6, 2018
Messages
11
Hey everyone,

Before I start out, system info:

Build 11.1-U5
i3-4130T
16GB ECC RAM
ASROCK E3C226D2I Mobo
6x WD Red 3TB drives (well, one is now a 4TB, but I'll get to that) set up as 3 mirrored vDevs all in one large zpool.


Been running more or less problem-free since 2014 (weekly scrubs and regular short SMART tests), until now.

Okay, here goes...


So, last week I started getting CRITICAL alert e-mails from my box. Started checking it out and I was getting pending unreadable sectors. Ordered a new drive (a 4TB Red) and was going to replace it when one of my other drives just dropped out of the vDev and my pool started showing as degraded. So I (probably too hastily) rushed and replaced that drive instead. Since then, and after the resilver, I've been getting some weird errors and things just aren't stable. Still getting the pending sectors error on ada2, but also, ada2 and ada3 (a mirrored pair) have both dropped out of the pool twice, making my entire zpool unavailable (on different reboots), and ada5 is starting to throw a lot of AHCI errors and occasionally drops out of the pool. I shut it down for a couple of days while I tried to wrap my head around what could be going on. Did a lot of reading in the forums, some of which led me to check cables and connections several times. Once, when ada5 dropped out, I found some *fantastic* advice that recommended force mounting. Didn't find out until later that this might have been a bad idea. And I found out during all of this that I should have been running regularly scheduled long SMART tests along the way.

I'm guessing things are going to quickly get worse, so I'm trying to make sure I save as much data as I can, if possible. (This is where I mention that my backups aren't what you'd call backups...) However, troubleshooting-wise, I'm really not sure where to go from here first in order to nail down what's really going on. My original plan was to get as much data off as I can quickly, then start small with the cables first, then purchase an HBA card to see if it's the SATA ports on my mobo before just throwing new HDDs at the problem. I'm attaching output from smartctl -x, /var/log/messages, and zpool status -v. Anything else y'all need me to post, just let me know. It's late, so I'm sure there are things I've left out... Crossing my fingers, but trying to steel myself for what I think the answer probably is.
 

Attachments

  • smartctl x noserial.txt
    85.8 KB · Views: 214
  • zpool status -v.txt
    4.9 KB · Views: 244
  • var log messages -noserial.txt
    40.2 KB · Views: 376
Joined
Jul 3, 2015
Messages
926
Is the pool mountable and are you currently able to access your data?
 

caddy013

Dabbler
Joined
Sep 6, 2018
Messages
11
Yes, the pool is mounted and I can access it, but it's resilvering two drives that fell out of the pool earlier (ada2 and ada5). I wasn't sure if trying to do massive copies would tax it too much. Interesting fact: at the beginning of this last boot, it said it was resilvering (didn't label which drives) and that it was going to take approx. 30 minutes. At about 3 this morning, it was in the low 60%s done and said around an hour remaining. Just checked, and now it's over three hours remaining and still in the low 60s for percentage complete.

So here's my new question: would you recommend shutting down, unplugging the two resilvering drives, booting back up to copy data, then trying to reattach them after I have everything I wanted copied?
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I don't think I'd bother unplugging the resilvering drives. Just start the copy. The data copying will slow down the resilver, but I wouldn't expect it to damage anything. Though I don't have any experience with mirrors.

With the number of errors you've been seeing though, I'd suspect something like loose or bad cables. Pending sectors on their own don't bother me much, unless you're seeing a lot of them.

I will say that this is an example of why I personally prefer RAIDZ over mirrors. Imagine if ada2 and ada5 had been part of the same mirror. All that data would be gone and that failing mirror would take the rest of the pool with it. I understand that it's all trade-offs though and there are speed and expansion advantages to mirrors.

I do note that 'zpool status' shows permanent data errors with some files. It doesn't look like you've lost anything critical, but I'd also probably be considering recreating the affected data (possibly recreating those jails). I'm not familiar with what the '<0x0>' means after some of the lines. Degraded metadata?
 

icsy7867

Contributor
Joined
Dec 31, 2015
Messages
167
If you can copy I would go ahead and snag your important stuff. Remember raid is not a backup! I personally have two datasets...

A "data" and a "backup". Anything critical gets backed up to Backblaze (about 500GB of personal photos, videos, and oVirt virtual machines).

Count your blessings that you can still access your data! However, if the resilvering finishes and you still have errors, check your cables, back up your config, and reinstall FreeNAS.

Also I think mirrors are fine with at least 2+ mirrors. Obviously the more you stripe together the better your chances. However if you have a couple of drives mirrored you better have some good backups!

Good luck! I hope it goes well for you.
 

caddy013

Dabbler
Joined
Sep 6, 2018
Messages
11
I’ll call this one solved for now. Last I checked, resilver was at just over 94% and copying data over was working swimmingly with good speeds. /begin backing up all the things...

I’m still thinking about pulling the trigger on the Intel-branded LSI card though to see if that fixes some of the errors and to have something more reliable than the onboard SATA ports. Any gotchas I need to consider when finally making the move? Backing up config and data beforehand, but other than that?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I personally have two datasets...
Datasets or pools? You would need two pools to be safe from a vdev failure.
Obviously the more you stripe together the better your chances.
This is the opposite of the truth. In no way does any form of stripe make your data safer, only parity or mirrors. A 3-way mirror would be safer than a 2-way mirror. 2 2-way mirrors striped together are less safe than 1 2-way mirror, because losing either mirror loses the whole pool. Math, son.
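
A rough way to put numbers on the stripe-versus-mirror point is sketched below. It assumes every drive fails independently with the same probability p over some period, which is a simplification (real failures are often correlated). The function name and the 5% figure are made up for illustration only.

```python
def pool_loss_probability(p: float, mirror_width: int, mirror_count: int) -> float:
    """Chance of losing a pool built from striped mirrors.

    Assumption (not measured data): each drive fails independently with
    probability p. A mirror is lost only if all of its drives fail, and the
    pool is lost if any single mirror is lost.
    """
    mirror_loss = p ** mirror_width
    return 1.0 - (1.0 - mirror_loss) ** mirror_count


p = 0.05  # hypothetical per-drive failure probability

one_2way = pool_loss_probability(p, mirror_width=2, mirror_count=1)
two_2way = pool_loss_probability(p, mirror_width=2, mirror_count=2)
three_2way = pool_loss_probability(p, mirror_width=2, mirror_count=3)
one_3way = pool_loss_probability(p, mirror_width=3, mirror_count=1)

print(f"1 x 2-way mirror : {one_2way:.6f}")
print(f"2 x 2-way mirrors: {two_2way:.6f}")
print(f"3 x 2-way mirrors: {three_2way:.6f}")
print(f"1 x 3-way mirror : {one_3way:.6f}")
```

Under this model, every extra mirror striped into the pool raises the chance of losing the pool, while widening a mirror lowers it.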
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I’ll call this one solved for now. Last I checked, resilver was at just over 94% and copying data over was working swimmingly with good speeds. /begin backing up all the things...

I’m still thinking about pulling the trigger on the Intel-branded LSI card though to see if that fixes some of the errors and to have something more reliable than the onboard SATA ports. Any gotchas I need to consider when finally making the move? Backing up config and data beforehand, but other than that?
I'm not sure how you are considering this solved. I see a few issues...
1) When you post results about your hard drives, use the serial number. If you don't want to post the entire serial number then post the last 4 digits so we all can track them in a positive way.

2) You are not running routine SMART Extended tests. A Short test is a quick Go/NoGo test; it is not complete enough for proper testing. I run my Short tests daily and a Long test weekly for all my drives and have been for over 5 years. I highly recommend that you include some routine SMART Long testing for all your drives.

3) Your drive ada2 definitely looks to be failing as indicated by ID values 1, 197, and 200.

4) You do not have any SMART data for drive ada5, I suspect it dropped out again. If it's working then you should still provide that data.

5) If your drive is dropping out of the pool then I'd recommend, after the resilvering completes, that you swap the data cable between drives ada5 and ada1 to see if the problem moves to ada1. Of course shut down the system when you do this and track the drives by serial number, not adaX. When you reconnect, it is quite possible for the drive that was ada1 to now be ada5. If the problem moves then you know you either have a bad cable or controller; odds are a cable. Yes, they can and do go bad without any user intervention. It's happened to me and many others.

6) Once you have the problem solved and corrected then you may want to rethink how your pool is configured, but then again maybe you really wanted mirrored pools. If you do not have a specific requirement to use mirrors then you could reconfigure after backing up all your data and configure a RAIDZ2. This would increase your capacity and also allow you to tolerate any two drive failures.
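
For point 3, here is a minimal sketch of pulling attribute IDs 1, 197, and 200 out of a `smartctl -A` style attribute table. The sample table and its values are invented for illustration and are not taken from the attached logs. Treating any nonzero raw value as suspicious is also a simplification, since some vendors report large raw values for attribute 1 on healthy drives.

```python
from typing import Dict, Tuple

WATCHED_IDS = {1, 197, 200}  # the attribute IDs called out in point 3


def parse_smart_attributes(text: str) -> Dict[int, Tuple[str, int]]:
    """Return {id: (name, raw_value)} from a `smartctl -A` style table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


# Made-up sample in the usual column layout (NOT real data from this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       33000
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3
"""

attributes = parse_smart_attributes(SAMPLE)
flagged = {
    attr_id: attributes[attr_id]
    for attr_id in sorted(WATCHED_IDS)
    if attr_id in attributes and attributes[attr_id][1] > 0
}

for attr_id, (name, raw) in flagged.items():
    print(f"ID {attr_id} ({name}): raw value {raw}")
```

Feeding it the saved output for each drive, keyed by serial number, would give a quick per-drive summary to post.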

Just my two cents.
 

caddy013

Dabbler
Joined
Sep 6, 2018
Messages
11
I'm not sure how you are considering this solved. I see a few issues...
....
Thanks for the reply and the recommendations. I'll try to address all of your points.

I guess by considering it "solved for now", I felt I had enough info to go off of to be able to start safely-ish pulling data from the server, but you're right, my issues aren't really solved in any sense of the word.

1) Regarding the serial numbers, I didn't know they could be useful to others. I had masked them because I saw someone else in another post somewhere mask theirs (maybe not even on this forum tbh), so I went with "when in Rome" logic. Here they are for reference:

ada0 WD-xxxxL5D1
ada1 WD-xxxxXKJ3
ada2 WD-xxxx6299
ada3 WD-xxxx9088
ada4 WD-xxxxXEU2
ada5 WD-xxxxXRPT

2) Correct. When I was initially setting it up 4 years ago, I didn't realize I should be doing that. I'll admit to having read maybe a setup guide or two from somewhere, but probably went in a lot more ignorant than I should have. That's being remedied tonight.

3) Check, and thanks. I have a spare 3TB I can replace it with.

4/5) I don't have a long test for that one, and it has indeed fallen out again as of 17:17 this evening. Running the test now and will post results after completion. I'll try swapping cables afterwards as well.

6) When I was doing research, I was after max performance (and complete overkill). I don't do anything that requires near that throughput though, and after this recent experience, I've been reading a lot about RAIDZ2. I do remember reading some things about RAID6 when I was doing initial research, and being scared away by a few things (especially restore times, etc.), but looking at things practically, without having the capacity or $$ to go to RAIDZ3...Z2 will probably fill my needs just fine. Maybe someday if I ever move to 10G...but that's no time soon.
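
As a back-of-the-envelope check on the capacity side of the mirrors versus RAIDZ2 question, a sketch with the six drives from the first post is below. It ignores ZFS metadata, padding, and TB/TiB differences, and which mirror holds the 4TB drive is a guess that does not change the result.

```python
def mirror_capacity_tb(vdevs):
    """Usable space of striped mirrors: each mirror holds as much as its smallest drive."""
    return sum(min(vdev) for vdev in vdevs)


def raidz_capacity_tb(drives, parity):
    """Usable space of one RAIDZ vdev: (drive count - parity) times the smallest drive."""
    return (len(drives) - parity) * min(drives)


# Drive sizes in TB; placement of the 4TB drive is assumed for illustration.
mirrors = [[3, 3], [3, 3], [3, 4]]
drives = [size for vdev in mirrors for size in vdev]

mirror_tb = mirror_capacity_tb(mirrors)
raidz2_tb = raidz_capacity_tb(drives, parity=2)

print(f"3 x 2-way mirrors: about {mirror_tb} TB usable")
print(f"6-drive RAIDZ2   : about {raidz2_tb} TB usable")
```

The mirror layout survives two failures only if they land in different mirrors, while RAIDZ2 survives any two.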

Thanks for the recommendations and the 2 cents. I really do need to get a backup solution in place. I've been eyeing Backblaze B2, but for now I have a student Google Drive account that I can use. My other problem is duplicates...duplicates everywhere. That's another story for another post, but I'll be trying to sort through that as I copy stuff to the cloud...

Thanks again!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
No problem. I do what I can to offer sound and good advice just like all the regular forum members here. When looking at your hard drives after running a SMART Long test on them, reference my little Hard Drive Troubleshooting Guide link in my signature. It will help you decipher much of the data that we are looking for with respect to failing indicators.

As for local backups of your data, I find it very handy to have a few external USB hard drives where I could copy off my data and then I could blow away the FreeNAS pool and rebuild it. Then I copy the data back on. The external USB drives would be connected to my Windoze computer and the copy process would be somewhat a manual operation since it would be a one time event. You could have something like that automated if you wanted to do it periodically but that is another discussion.

The reason some folks do not post the drive serial numbers is because they think that someone will steal the serial number and use it for a warranty trade-in. I don't have faith that it would work. I'm sure someone could get an Advanced RMA, but that charges their credit card before the replacement goes out the door, and if the failed drive isn't returned within a certain period of time then they have purchased a refurbed hard drive at full cost. If you needed a warranty replacement and your serial number had been used, well, you may not be able to use Advanced RMA, but they would still honor the warranty once they received the hard drive. I went ahead and edited your posting to remove the first pieces of the serial number. I wouldn't do it for myself but you might feel differently.

Good luck!
 

caddy013

Dabbler
Joined
Sep 6, 2018
Messages
11
Hey guys, just an update on what's been going on...

I was finally able to muck around tonight with the insides and I found a replacement SATA cable to try out. So far, ada5 (WD-xxxxXRPT) is online again and resilvering like a champ. Received a few CAM errors for ada2 and ada3 during the reboot, so a new set of cables is on the way. Planning to run a long SMART test on all the drives once the resilver is done, and I found my spare 3TB that I can swap in for ada2 (the one with the pending sectors) once the resilver and long SMART tests complete.

Going to do some more research on regular maintenance and take a look at some scripts I can run on the regular to assist with keeping me informed.
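
As a rough sketch of that kind of script: `zpool status -x` prints "all pools are healthy" when nothing needs attention, so a small check can key off that message. The function names, the pool name "tank", and the abbreviated degraded sample below are hypothetical. How the result gets delivered (e-mail, cron output) is left out.

```python
import subprocess

HEALTHY_MESSAGE = "all pools are healthy"


def read_zpool_status() -> str:
    """Run `zpool status -x` and return its text (needs the zpool binary on the host)."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def pools_healthy(status_output: str) -> bool:
    """True when the output carries the standard all-clear message."""
    return HEALTHY_MESSAGE in status_output


def report(status_output: str) -> str:
    """Build a short message suitable for a cron job or e-mail body."""
    if pools_healthy(status_output):
        return "OK: " + HEALTHY_MESSAGE
    return "ATTENTION: zpool reports a problem\n" + status_output


# Made-up, abbreviated sample of unhealthy output (hypothetical pool name).
sample_degraded = "  pool: tank\n state: DEGRADED\n"

print(report("all pools are healthy\n"))
print(report(sample_degraded))
```

On the NAS itself, `report(read_zpool_status())` would be the call to schedule.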

Thanks for all the help with the troubleshooting. Looks like it's time to put on my adulting pants and actually get all of this stuff (at least the important bits) legitimately backed up....
 