Critical storage issues

mikehextall · Jan 27, 2019

Hi,

I am getting the below warnings in FREENAS and I am unable to access some of my storage via the network. I have read online that SMART long self-test will help me fix it along with a couple of other steps but the test has been running for about 14 hours on a 3TB drive, what am I doing wrong, is this normal? The test itself said it would take 393 minutes.

CRITICAL: Jan. 26, 2019, 5:02 p.m. - Device: /dev/ada1, 32 Currently unreadable (pending) sectors
CRITICAL: Jan. 26, 2019, 5:02 p.m. - Device: /dev/ada1, 32 Offline uncorrectable sectors
CRITICAL: Jan. 26, 2019, 5:32 p.m. - Device: /dev/ada1, Self-Test Log error count increased from 1 to 3
CRITICAL: Jan. 26, 2019, 5:02 p.m. - The volume STORAGE state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

Heracles · Jan 27, 2019

Hi Mike,

What exact model of hard drive do you have in that server ? From the symptom, I guess that you used standard desktop drives in your NAS. If I am right, you are now experiencing one consequence of not using the proper hardware...

Desktop hard drives are designed considering they are the only one holding the data. As such, whenever they face a problem, they will keep fighting their internal mechanic as long as possible, trying to recover the data. They do so because if they don't get the data, they consider nobody will.

You should use hard drives designed for NAS instead, like the Seagate Ironwolf I am using. These drives are designed for 24 / 7 operations, to sustain the vibration of many adjacent drives and also not to waste time this way. Because they are meant to be used in a NAS where other drives host the same data and can recover it, these drives will return an error quickly should they have a problem like yours. When FreeNAS receives the error, it goes to the other drives and recover the data from them right away.

So that disk ada1 is either dead or dying. Replace it ASAP and I recommend you start using hard drives designed for NAS. Also, should you replace all your drives with bigger ones, ZFS will auto-expand your pool to the new maximum size possible once all drives are replaced. So you may have an opportunity to easily expand your pool should you wish more storage.

A SMART test, long or short, will not --fix-- a problem. It only tests the drive and tells you if any sign of coming trouble are detected. Here, you already know that they are.

An option would be :
--Power off FreeNAS
--Unplug that failed drive
--Reboot without that failed drive

Doing that will put you back online and you will not suffer the delays from that non-reliable hard drive.

Of course, order a new hard drive ASAP and as soon as you have it, install it and re-silver your pool.

Also, know that when one hard drive fails, often other drives will fail soon. The reason is that most time, they are about the same age and they endured about the same load, so they suffer the same fate at about the same time. It may be a good idea to order more than 1 drive to be ready should you loose another drive. Also, be sure you backup your data properly so you are ready to face the worst.

Good luck,

mikehextall · Jan 27, 2019

The HD is a Seagate NAS HDD

https://www.amazon.co.uk/Seagate-ST3000VN000-Drive-5900rpm-Internal/dp/B00D1GYNU8

Heracles · Jan 27, 2019

Hi again,

Good for you that you already use the proper kind of hard drive then. So it was the long smart test that slowed down everything.

In all cases, the only thing to do is to replace that drive ASAP... Not only there are errors in the drive, but the number of errors increases. Usually, such an increase is exponential : it increases faster and faster until the disk fails definitely.

mikehextall · Jan 27, 2019

What do you recommend for a small server? I don't store a massive amount, pictures, documents, videos, a few films. I only currently use about 1TB of the 3TB I've got.

Heracles · Jan 28, 2019

Hi Mike,

As for the model, the Seagate Ironwolf I have answers my need perfectly. How many drives can you fit your server ? How much space do you need ? I would recommend against RaidZ1, so I suggest a minimum of 5 drives and RaidZ2 for a reasonable usable space and good protection. Just size the disk according to your need. That gives you the usable space of 3 drives, so 5x 1TB will give you 3 TB usable when you say you need about 1. That may be a good starting point. Also, should you need more in the future, bigger hard drives will be easy to find and buy. Once all the drives replaced with bigger ones, ZFS will auto-expand the pool to whatever the new size will be, like 30 TB if you go with 5x 10 TB.

pro lamer · Jan 29, 2019

Heracles said:
as soon as you have it, install it

Maybe a burn-in?

Sent from my phone

Heracles · Jan 29, 2019

Hi Pro lamer,

Burn-in is surely a good practice, but it is only when you have the time to do it... Here, the disk is already failing. As such, I do not think it is appropriate to do an offline burn-in.

The actual drive is known to be failing. The new drive may potentially be failing. As such, what protects the pool the better ? To keep the known defective drive a little longer or to give a shot at the supposed-to-be good new drive despite the evidence was not provided ?

Here, I would skip the offline burn-in and replace the confirmed defective drive ASAP, even if it is an un-confirmed new good drive.

pro lamer · Jan 29, 2019

One can consider a downtime until has a burnt-in drive for replacement ...

Edit: OTOH some more drives may fail to spin up after downtime :/

Sent from my phone

Chris Moore · Jan 29, 2019

Heracles said:
From the symptom, I guess that you used standard desktop drives in your NAS

Nothing about what the OP described would indicate the type of drives being used.

Chris Moore · Jan 29, 2019

Heracles said:
Also, know that when one hard drive fails, often other drives will fail soon. The reason is that most time, they are about the same age and they endured about the same load, so they suffer the same fate at about the same time

You didn't ask for anything about the OP storage pool that would indicate any age related failure. You are purely guessing about what is the problem here because you didn't ask any questions about this failure event. The drives in this system could be new and this might be an 'infant mortality' that has nothing to do with drive age or wear.
Ask questions, please don't guess like this.

Heracles · Jan 29, 2019

Hi Chris,

Please, go back and read again the very first sentence I wrote in my answer. It is the very question that you said I did not ask and everything after the question is clearly stated as valid only if my guess was right. So up to me to ask you to read and mind your intervention please...

Chris Moore · Jan 29, 2019

Heracles said:
What exact model of hard drive do you have in that server ? From the symptom, I guess that you used standard desktop drives in your NAS. If I am right, you are now experiencing one consequence of not using the proper hardware...

Just asking the model of the drive ...

Heracles said:
Please, go back and read again the very first sentence I wrote in my answer.

That is not telling you the type of pool or the number of drives or the age of the drives.

Heracles · Jan 29, 2019

The type of pool has no effect on a failing drive. The drive is to be replaced ASAP no matter the kind of pool.

You have your style of working and I have mine : when the one asking a question did not provide all the information, I do what I can to help him with the information I have and clearly express what is assumed / possible / guess so the one receiving the answer can evaluate if indeed this is his situation or not.

Just like here : the guy provided no informations about his drive, so I had to guess. To see a single drive fighting forever to get some data was compatible with desktop hard drives and the use of desktop hardware is a common decision.
I got it wrong and the guy provided the required information after that. So no harm to anyone and we keep working the case.

You would rather leave the guy without info unless he provided all the info possible ? So do it that way because it is your style. But do not blame someone else for working differently.

I did asked the question about the drive, did mentioned that I assumed one answer, did explained the logic between the symptom, ... The guy understood and we kept the ball rolling. If this thread is not of your style, stay on the line and pick another. There are enough for all of us here.

Chris Moore · Jan 29, 2019

Heracles said:
The type of pool has no effect on a failing drive.

It does if the drive is still limping along and only having data errors and the person that created the pool set it up as a stripe, meaning that there is no redundancy at all and you tell them to take the drive out. Then they have no data because YOU didn't bother to find out what kind of pool it is before you say snatch that drive out.

Heracles said:
You have your style of working and I have mine

You just need to understand that not everyone has their system setup the way you think they should. I have seen all kinds of setups here and many of them are so fragile that bad advice can destroy them instantly.

Heracles said:
Just like here : the guy provided no informations about his drive

You should keep asking until the information needed is provided.

Heracles said:
So do it that way because it is your style. But do not blame someone else for working differently.

I would be happy to have another contributor to the forum, but you need to be more cautious and ask more questions before making assumptions.

Heracles said:
If this thread is not of your style, stay on the line and pick another. There are enough for all of us here.

It isn't the thread, it is you. Bad advice will make the forum look bad. We don't need that.

Chris Moore · Jan 29, 2019

mikehextall said:
I am getting the below warnings in FREENAS and I am unable to access some of my storage via the network. I have read online that SMART long self-test will help me fix it along with a couple of other steps but the test has been running for about 14 hours on a 3TB drive, what am I doing wrong, is this normal? The test itself said it would take 393 minutes.

Please show us the output of zpool status from the command line. It should look something like this:

Code:

  pool: Emily
 state: ONLINE
  scan: scrub repaired 0 in 0 days 03:42:46 with 0 errors on Sun Jan 20 21:54:20 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        Emily                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/af7c42c6-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b07bc723-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b1893397-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b2bfc678-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b3c1849e-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b4d16ad2-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/bc1e50e5-c1fa-11e8-87f0-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/a03dd690-c1fb-11e8-87f0-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/a6ed2ed5-c240-11e8-87f0-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/b9de3232-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/baf4aba8-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
            gptid/bbf26621-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE       0     0     0
        logs
          gptid/ae487c50-bec3-11e8-b1c8-0cc47a9cd5a4    ONLINE       0     0     0
        cache
          gptid/ae52d59d-bec3-11e8-b1c8-0cc47a9cd5a4    ONLINE       0     0     0

errors: No known data errors

Heracles · Jan 30, 2019

The type of pool has no effect on a failing drive.

It does if the drive is still limping along and only having data errors and the person that created the pool set it up as a stripe, meaning that there is no redundancy at all and you tell them to take the drive out. Then they have no data because YOU didn't bother to find out what kind of pool it is before you say snatch that drive out.

No it does not. Again, go back and read before blaming.

The procedure I gave him was :

An option would be :
--Power off FreeNAS
--Unplug that failed drive
--Reboot without that failed drive

So should the drive be part of a pool without redundancy, the pool will just not mount and the failed drive will be preserved up to the moment where a raw copy of it can be made to the new drive.

So stop blaming others for not doing the same as you.

pro lamer · Jan 30, 2019

Heracles said:
the pool will just not mount and the failed drive will be preserved up to the moment where a raw copy of it can be made to the new drive.

Unless the failing drive refuses to spin-up anymore... For ~~the~~ some other imaginary scenario (for example in-place hot drive replace without power down depending on the details we don't know yet) - we would need to know if @mikehextall has hardwsre capable of hot-swap/hot plug...

Sent from my phone

Heracles · Jan 30, 2019

Hi pro lamer,

The situation indeed is a two edged sword. To stop the disk may be the way to preserve the last fragment of life in it, as much as killing it once and for all. Unfortunately, there is no way to know in which of the two cases this is until it is too late.

In all cases, should that pool had no redundancy from day 1, then to loose it in such a situation is just normal and by design. To preserve it would be more of a miracle than proper operations.

Considering there is no way to know which of the two scenarios will happen in such a case, I consider both ways are good. This is also the very purpose of such a forum : when there is more than one option, a forum like this helps one to realize about all of them thanks to the opinion expressed by different people.

Should the pool be protected by a minimum level of redundancy, the objective of not suffering the long delay anymore is achieved and the urge to replace the drive remains. Without redundancy, the extremly shaky situation created by design remains extremely shaky and to handle it by powering down the drive or keep it hot are two options that only luck will separate.

At the end, I stand by my answer despite Chris's direct personal blame and welcome other's opinion like yours as to what other options can be considered in the context addressed in the ticket.

Have a nice day,

pro lamer · Jan 30, 2019

Still regarding a hypothetical case of a failing drive scaring-of-not-wanting-to-spin-up-after-power-off-anymore:

pro lamer said:
hot-swap/hot plug

If not, but still wanting to avoid power down and in a very unhappy case, one can even try temporarily attaching the new disk by eSATA port (if the OP has one) or an USB attached enclosure (but USB attached drives also are not recommended in our forums and some/most eSATA controllers neither :( )

Again:

Heracles said:
The new drive may potentially be failing. As such, what protects the pool the better ?

I guess the OP has to decide - I can imagine a not burnt-in drive may be DOA or dodgy and I can imagine attaching such one may make lots of mess :( (just guessing, not from my experience)

BTW:
@mikehextall

mikehextall said:
I only currently use about 1TB of the 3TB I've got.

Is it a single drive pool? Anyway answering this question

Chris Moore said:
Please show us the output of zpool status from the command line

will help a lot...

Sent from my phone

Important Announcement for the TrueNAS Community.

Critical storage issues

Cadet

Wizard

Cadet

Wizard

Cadet

Wizard

Guru

Wizard

Guru

Hall of Famer

Hall of Famer

Wizard

Hall of Famer

Wizard

Hall of Famer

Hall of Famer

Wizard

Guru

Wizard

Guru

Similar threads