SOLVED Pool status unknown after faulty drive(s)

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Hi,

Earlier this week I got a warning that my pool status was degraded. One drive showed as unavailable along with its gptid, but I could not match the gptid to a serial number of any of my drives. The next day I checked again, and now it said one drive was unavailable and one was faulted.

I have 7 HDDs, but both glabel status and a script I found on this forum showed only 4 drives with their gptid/serial:

Screen Shot 2020-10-09 at 10.49.09 PM.png
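For reference, this is roughly how I tried to match gptids to serial numbers in the shell (ada0 here is just an example device name, not necessarily one of mine):

Code:
# List ZFS partition labels (gptid) against their device nodes
glabel status
# For each device the above mentions, read off the drive's serial number
smartctl -i /dev/ada0 | grep -i serial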


Since I couldn't figure out which of the 3 remaining drives were bad, I turned off the server. Today I wanted to try to figure it out again, but after turning it back on, it now says pool status unknown and lists only 2 drives:

Screen Shot 2020-10-15 at 10.54.20 PM.png


zpool import shows this:

Screen Shot 2020-10-15 at 10.54.35 PM.png



Any idea what's wrong? I'm a complete noob when it comes to FreeNAS.

Thanks!
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Also, I forgot to mention that I removed the 1 drive I assumed to be faulty (da0). I don't know if that was a mistake or not, but I reconnected it and still have the same issue.
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
I'm not able to edit my post, but here are the system specs, hoping someone can help:

Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
Supermicro X10SL7-F
32GB RAM
7 WD Red 6 TB
FreeNAS-11.2-U5

Thanks
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
The system is correct when it declares that "one or more devices are missing" - it cannot see enough of the drives in your Pool1. They're not showing up in FreeBSD, so naturally ZFS can't see them.

How are they physically connected? I see the motherboard has a bunch of SATA ports, so if you're using those and all was well, you should be able to see all the drives in the BIOS. Through a SATA/SAS card? It seems like it may have come loose or died. Are they all spinning up? Have any cables worked their way loose?

If you get to see the drives in the BIOS (and/or addin card BIOS), we'll move forward to further troubleshooting.
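If you want to check from the FreeNAS shell as well, these read-only commands show what FreeBSD itself can see, independent of ZFS:

Code:
# List every disk device the OS currently has attached
camcontrol devlist
# Kernel log lines about disks attaching or detaching
dmesg | grep -E 'ada[0-9]|da[0-9]'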
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Thanks a lot for the reply!

The drives are connected to the SATA ports on the mobo, I do not have a SATA card.

I ended up removing all the drives from the server, bought enclosures for all of them, connected them to my Mac running Ubuntu, and was able to import the pool.

I have 7 drives in total (RAIDZ2); smartctl would not run for 2 of the 7 drives, and the other 5 said SMART status passed.

This is what zpool status shows:

IMG_6387.jpeg



So if I understand correctly, based on the info above, 4 drives are about to fail and 1 has failed.

I am not sure why it started resilvering, as I did not replace any of the original drives... I ended up just disconnecting the drives, since it said it would take 10 days and I could not keep it running because I need my Mac for work.


What's the best way to proceed from here?

I will try installing the drives in the server again with different cables. Should I then replace the 1 unavailable drive first? If I need to replace 5 drives in total, would it make sense to just create a new pool with larger drive capacities (currently my drives are 6 TB)?

Thanks for your help!
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Wow, that's pretty horrific. Given the disk move, I think we can trust that those disks probably are going bad. The "37817 data errors" are blocks that have been lost, with no readable replica copies left on the remaining drives. Resilvering won't help those, as there's no source unless more of the drives come back online. I've been there. Now is the time to start planning for full data loss, I'm afraid.

The READ/WRITE error columns aren't trustworthy - they're logged progressively as errors happen, but only per session (a reboot, remount, export/import or 'zpool clear' will reset them). CKSUM is non-persistent too, but it fills in as the scrub/resilver progresses and annotates the issues found.

I suspect the original resilvering was due to one drive dropping out for a moment, then recovering - that still triggers a resilver to get the dropout drive back up to date.

Do you have a backup of the stuff you want off the pool?

If no: Gather a small stack of HDDs to use to fix this. You'll need at least one, maybe five. Put them into the supermicro and run the burnin tests suggested by jgreco in the Hardware forum stickies - https://forums.freenas.org/index.php?resources/solnet-array-test.1/ You don't want the replacements dying while you're trying to salvage the array. Test and set them all to one side.

Put all the Pool1 disks back into the Supermicro so you can at least do the same check there, and run 'zpool status -v' to get the broken-files list and see what is lost. Last time I had three disks go bad out of a RAIDZ1, I was insanely fortunate in that the files lost were all in my 'backups of other systems' dataset, so they were actually disposable. I recovered by removing the UNAVAIL disk, because it really was dead (do this by elimination - find the serial number missing from the Storage/Disks GUI page - to ensure you pull the right one!), then adding progressively more hot spares, which replaced the failed (UNAVAIL) and failing (DEGRADED) disks as they resilvered. After that completed I 'detach'ed the DEGRADED disks and removed them, which promotes the spares to full replacements. Finally I deleted the files still named by zpool status -v, and the array fully recovered.

If you have a backup OR the broken files list makes recovery pointless, give up. Run all the disks through burnin tests and toss any with bad results from SMART short/conveyance/long tests, or with reallocated sectors, or that trip up in the solnet.sh testing. Make an array from the remaining disks and any addins, and start again. Design your system so you can have a backup of anything that is essential.
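For what it's worth, the hot-spare recovery I described boils down to something like this - a rough sketch, where Pool1, the gptid, and ada7 are placeholders for your own names from zpool status:

Code:
# Add a tested drive as a hot spare; ZFS uses it to stand in for a failing member
zpool add Pool1 spare /dev/ada7
# Or replace one failing member outright and let it resilver
zpool replace Pool1 gptid/xxxx /dev/ada7
# Once resilvered, detach the old disk to promote the spare to a full member
zpool detach Pool1 gptid/xxxx
# And list the files with unrecoverable errors
zpool status -v Pool1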
 
Last edited:

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Thanks a lot, very helpful. I do not have a backup... it was mostly a Plex server, but I did have some home videos on there, so that's the main thing I'm trying to save. I will run the script and try to recover the pool, but this is a big lesson learned. I think in the future I will use something like AWS Glacier to store additional backups of files I care about. I'll report back on how it goes.

Thanks again, really appreciate your help!
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
No problem, it's horrible when it happens to you. I went through a big box of failure-prone 4 TB ex-datacentre drives over the last two years, so I got quite practiced. As my sig below says, I had already built a second NAS purely to be a backup of the content of the first one... I have it wake up every night, do a backup by ZFS replication, then power itself down again. Old server hardware is surprisingly cheap (and more reliable than old server HDDs, as it turns out!).
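If you go the same route, the heart of the nightly job is just a snapshot plus an incremental zfs send/receive over SSH - a minimal sketch, assuming a dataset called tank/media and a backup box reachable as backupnas:

Code:
# Snapshot tonight's state
zfs snapshot tank/media@nightly-2020-10-16
# Send only the changes since last night's snapshot to the backup NAS
zfs send -i tank/media@nightly-2020-10-15 tank/media@nightly-2020-10-16 | \
  ssh backupnas zfs receive -F backup/media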

Follow up if you have any more questions, and good luck!
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Just realised I didn't actually link to the jgreco sticky I mentioned, but only to the script page. If you haven't seen it, go look in the Hardware forum next door and read the "Building, Burn-In, and Testing your FreeNAS system" post. All of it is good, but the disk testing is the main thing for you.
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Thank you, I found that article and ran the script on the new drive. I was able to pull the files I cared about off the pool and back them up after connecting the drives to my Mac running Ubuntu again :smile: The files listed by zpool status -v are all Plex-related, but none of them are the actual media files.

I tried installing the drives in my FreeNAS server again (replaced one defective drive with a new one), but neither the pool nor the individual drives show up... do you think it could be an issue with the SATA cables or any of the hardware? Do I need to import the drives again, or should they show up as long as they are connected properly?

Thanks!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I think we can trust that those disks probably are going bad.
This is a premature statement. All you have proven here is that you have lost data, not that the drives are to blame. It's very possible that your old computer was the cause, since you could import your pool, although damaged, into Ubuntu on a different computer.

Here is a plan of attack that I recommend, but whatever you do, make a plan and follow it.

Test your hard drives:

1. Put the drives back into your Mac and try to backup your data.
2. Run smartctl -a /dev/ada0 (where ada0 = the drive assignment) for each drive and report the results for each drive. Use the link in my signature to help you read what the results mean.
3. Run smartctl -t long /dev/ada0 for each drive; note that a Long/Extended test will take many hours to complete, so just let it run until done, then post the output of step 2 again. Read the results.
4. Do you really have a bad hard drive? (Steps 2 and 3 can be scripted; see the sketch below this list.)
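If it saves you some typing, steps 2 and 3 look like this scripted (the device list is an example - substitute whatever names your drives actually get):

Code:
# Kick off a long self-test on every drive
for d in ada0 ada1 ada2 ada3 ada4 ada5 ada6; do
  smartctl -t long /dev/$d
done
# Many hours later, collect a full report per drive to post here
for d in ada0 ada1 ada2 ada3 ada4 ada5 ada6; do
  smartctl -a /dev/$d > smart_$d.txt
done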

Test your FreeNAS computer:

1. With your drives connected boot up an Ubuntu Live ISO.
2. Does Ubuntu recognize it or do you have issue?
3. If you cannot import your pool or have other issues, odds are you have a computer failure. Run the burn-in tests (CPU stress test & MEMTEST86) with your hard drives connected (to put a load on the power supply), but feel free to run them without the drives as well.
4. If the burn-in tests pass but steps 1 and 2 failed, odds are you have a motherboard chipset failure; typically a failing power supply, CPU, RAM, or Northbridge would show up during the burn-in tests.
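For the CPU/RAM part, on an Ubuntu live session something like stress-ng and memtester will do as a first pass (package names are Ubuntu's; MEMTEST86 proper boots from its own USB stick and is the more thorough option):

Code:
sudo apt install stress-ng memtester
# Load every CPU core for an hour
stress-ng --cpu 0 --timeout 1h
# Exercise 4 GB of RAM for 3 passes
sudo memtester 4G 3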

So these are things you can try for now, just report your results and if something fails, be very clear what failed and all the indications.

Good Luck!
 
Last edited:

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Thank you, I found that article and ran the script on the new drive. I was able to pull the files I cared about off the pool and back them up after connecting the drives to my Mac running Ubuntu again :smile: The files listed by zpool status -v are all Plex-related, but none of them are the actual media files.

Excellent news!

I tried installing the drives in my FreeNAS server again (replaced one defective drive with a new one), but neither the pool nor the individual drives show up... do you think it could be an issue with the SATA cables or any of the hardware? Do I need to import the drives again, or should they show up as long as they are connected properly?

Given you've got the data off them in Ubuntu, the drives showing as "bad" in the Supermicro aren't as bad as all that, as Joe points out (thanks for the sanity check).

What I'd do next is connect everything back up to the Supermicro, but boot off the Ubuntu LiveCD and see if the pool is accessible. If it is, there's something weird in your FreeNAS install - save the config, write a fresh boot drive, and check on importing the pool; if that works, reload the config and see if you can still mount the pool.

If the pool is not mountable from Ubuntu in the Supermicro, you can work out what's busted by swapping things around - take a known-working drive and cable and connect them to each SATA port in turn; if you can see the drive afterwards, the port is okay. Then test all the cables similarly between a known-good port and a known-good drive. If the Supermicro SATA isn't hot-swap, it's a long process with reboots in between everything.
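One more tip for the Ubuntu check: import the pool read-only so nothing further gets written while the hardware is suspect - a sketch, assuming Ubuntu's ZFS tools are installed and the pool is still named Pool1:

Code:
sudo apt install zfsutils-linux
# -f because the pool was last used on another system; readonly to be safe
sudo zpool import -f -o readonly=on Pool1
zpool status -v Pool1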
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Thanks for all the help!

I booted the FreeNAS server into Ubuntu and could only get 2 drives to show up briefly (a minute or two) before they disconnected. It did not matter which SATA port I used for each drive; only those 2 drives would show up. I tried again with FreeNAS and the same thing happened: 1-2 drives would briefly appear, then disappear, and the log said the drives detached ("Periph destroyed"). I think I can rule out FreeNAS being the issue.

I also tried installing a brand-new PSU, but that did not help, so I can rule out the SATA power cables/PSU. Tomorrow I should get some new SATA cables and will see if that makes a difference; I will also try to test each SATA port individually. At this point, though, I think it's the motherboard.

The motherboard I use, the X10SL7-F, seems to have been discontinued and is quite expensive used, so I will have to find an equivalent motherboard to use with my CPU (E3-1246 v3).

I did run smartctl when I had all the drives attached to my Mac, and 5 of 7 passed without any errors; 2 drives would not run it at all. I did not save the results, but will try to run it again later this week when I have time.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I did run smartctl when I had all the drives attached to my Mac, and 5 of 7 passed without any errors; 2 drives would not run it at all. I did not save the results, but will try to run it again later this week when I have time.
If you are going to troubleshoot these problems, then you should keep good records of your observations and test results. For example, you should be tracking drive failures by drive serial number and SATA port. The drive serial number identifies the physical drive; the SATA port points at the motherboard, HBA, or data cable. 5 of 7 passed - I'm curious what didn't pass on the two drives.
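A sketch of how to capture that record on FreeNAS, so each test run ties a serial number to a port (ada0 is an example; note the physical port by hand alongside it):

Code:
# Which devices are attached, and on which bus/target
camcontrol devlist >> test_log.txt
# Serial number of the drive under test
smartctl -i /dev/ada0 | grep -i serial >> test_log.txt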

In my signature line is a link to the Hard Drive Troubleshooting Guide; you should read it, I think it will help you a lot here.

Something you could try is this...

1. Using FreeNAS, do the following...
2. Connect a single pool hard drive to SATA port 1 (pick any port, keep track of what you are using).
3. Boot up FreeNAS, your pool of course will not mount but with luck your drive is recognized.
4. Verify you now have a drive connected, go into the shell, and grab some SMART data: run smartctl -t short /dev/ada0 and, after 3 minutes, smartctl -a /dev/ada0, then post the output to this forum thread. This runs the SMART short test and then gathers the results. And yes, post all of the output from the commands; do not truncate the data thinking we don't want it.
5. Shutdown FreeNAS.
6. Remove the hard drive and connect up the next hard drive to check out.
7. Go to step 3; repeat until all drives have been checked.
8. If you find a drive that fails to work, set it to the side. I expect two drives to fall into this category.

So let's say the testing above works fine and you have no drive failures; then I would think you have a SATA port that is not working. Here is a step-by-step procedure to identify what is not working...

1. Ensure FreeNAS is shutdown.
2. Choose a single drive and connect it to SATA port 2 (one you have not previously used).
3. Boot up FreeNAS, your pool of course will not mount but with luck your drive is recognized.
4. Verify you now have a drive connected, go into the shell, and grab some SMART data with smartctl -a /dev/ada0, then post it to this forum thread. This just verifies comms are working with the drive.
5. Shutdown FreeNAS.
6. Disconnect the hard drive from the SATA port, move it to the next SATA port, and return to step 3; repeat until all SATA ports have been checked.
7. If you find some ports that fail to work, great - now you know which ones work and which ones do not. In that situation you could easily buy a cheap HBA to make up for the bad ports and save yourself a lot of money trying to find another motherboard.

If you do not find any drives or ports that fail, we should do this test all over again, but maybe a little differently; we may need to add some non-destructive hard drive testing into the mix, as the SMART Extended/Long test is really a hard drive internal test, not a complex data-transfer test. Let's cross that bridge when we get there.
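When we do get there, one simple non-destructive data-transfer test is a full surface read with dd - it exercises the drive, cable, and port together without writing anything:

Code:
# Read the entire drive and discard the data; transfer errors show up in dmesg
dd if=/dev/ada0 of=/dev/null bs=1m
# (bs=1m is FreeBSD's dd syntax; on Linux use bs=1M)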

Burn-in testing: let's say you do identify a port failure. I would also recommend some stress testing of the CPU and RAM, to make sure nothing else is a problem.

Good luck
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
I was able to test the drives one at a time, as one of the SAS ports would work for a few minutes... enough time to boot into Ubuntu and run the short test before the drive detached. Based on the wiki link, I am not seeing anything critical, so it seems it was mainly a motherboard issue.

I have replaced the motherboard and the drives do show up when connected to the SATA ports, but I am having some issues with the SAS controller... I made a new thread about that issue, but I think once I get the controller working everything should be fine again.

Thanks for the help!
 

Attachments

  • ux8.txt (6.1 KB)
  • xy9.txt (7.6 KB)
  • x36.txt (6.8 KB)
  • saa.txt (7.6 KB)
  • yjt.txt (6.8 KB)
  • b9l.txt (5.3 KB)
  • hwh.txt (6.8 KB)

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So, looking at your SMART data, it's obvious that many of the drives experienced a communications failure (ID 199, UDMA CRC Error Count). Unfortunately this is a cumulative value, meaning you will never see it at zero again once it has started incrementing. So yes, you did have a comms issue. I'd keep track of the SMART data, and if you see that value increment, you need to troubleshoot the problem. It could be a bad HBA/controller or even the data cables.
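If you want to watch it over time, that attribute can be pulled per drive like this; if the raw value grows between checks, the comms problem is still live (ada0 is an example):

Code:
smartctl -A /dev/ada0 | grep -i udma_crc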

EDIT: I posted your answer in your other thread, I'm sure you are using the wrong IT Flash image.
 
Last edited:

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
So, looking at your SMART data, it's obvious that many of the drives experienced a communications failure (ID 199, UDMA CRC Error Count). Unfortunately this is a cumulative value, meaning you will never see it at zero again once it has started incrementing. So yes, you did have a comms issue. I'd keep track of the SMART data, and if you see that value increment, you need to troubleshoot the problem. It could be a bad HBA/controller or even the data cables.

EDIT: I posted your answer in your other thread, I'm sure you are using the wrong IT Flash image.

Thanks for looking at the data, I will monitor the SMART data in the future. I have replaced the cables, so I can rule those out. As for the flash image, I am using the latest from the Supermicro website, version 20.00.07.00.
 

tc9999

Dabbler
Joined
Oct 15, 2020
Messages
15
Received a replacement motherboard today and was able to get back up running again.

Screen Shot 2020-11-10 at 2.42.05 AM.png


My plan is to do an extended test for the 2 degraded drives next.

Edit: The two drives are fine.
 
Last edited: