Degraded pool - bad or missing disk?

James S

Explorer
Joined
Apr 14, 2014
Messages
91
The system reports a critical alert:
CRITICAL: July 22, 2019, 11:05 a.m. - The volume VM-Datastore state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
I am trying to work out if a disk is at fault and, if so, which disk it is.
Background:
I've configured FreeNAS 11.1-U7 as an iSCSI datastore for a separate VMware machine. There are three datasets: two contain VMs and the third is a backup (I'm working on this, hence the odd setup of a backup located here). The backup dataset (VM-Backup) had filled up completely, although I've now deleted some files.
datastore.PNG

Running zpool status shows,
gptid/f8b9d75d-aecb-11e8-8fb0-ac1f6b2542fe
as the faulted drive
datapool error.jpg


Using glabel status I cannot then find the drive listed (i.e., I'm missing "1" above):
drive list.jpg

I don't follow why both ada4 and da1 appear twice. Or am I reading something wrong?
For clarity this is the GUI:
gui disks.PNG

Since I can identify da1, da2 and da3, it seems da0 is potentially at fault. However, it passes SMART.
The recommendations seem to be to keep lots of free space on iSCSI drives (maxing out at, say, 60%). Since I've filled one datapool, is this the root cause rather than a physical drive problem?
Any help on how to progress would be really appreciated!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
You are seeing write errors (and a few read errors). It's not a case of a full pool.

You should review the details of the SMART data from ada0 (if you can, as it seems the disk is no longer present?).

The forum will help to interpret it if you post it here. (in code tags please)
 

James S

The drive is available for testing -- but I cannot find it as part of the pool.
SMART reports show an overall pass
smart - 1.PNG


General SMART values
smart - 2.PNG


Errors on drive

smart - 3.PNG


End of report
smart - 4.PNG
 


sretalla

Your drive is dying, get it replaced.

You have read errors, seek errors and multi-zone errors. Unless you don't care about your data, waiting for those numbers to increase until the whole show is over is not productive.
 

James S

Thanks!
I've replaced the disk following burn-in and then resilvered. Just before I started the process another disk in the array was marked degraded. I'm surprised since these disks are 6 months old (WD Red 3TB).

I'm also wondering if the hardware configuration is a problem here.
Motherboard: Asus KCMA-D8 with an Asus RAID card (Pike 2008) configured as JBOD. I'm running a striped mirror across four 3TB Reds. This is the datastore for a VMware machine.
The chipset in the Pike card is the LSI SAS2008. Another FN user had a similar build and had not reported problems. I have run the system without disk problems for 6 months. I saw a suggestion, though, to flash the card to IT mode (https://www.ixsystems.com/community/threads/confused-about-that-lsi-card-join-the-crowd.11901/).

I'm curious whether this might be a hardware problem or whether this is really two dead disks.

Any ideas would be appreciated!
 

sretalla

You're certainly right to start thinking about other factors after a second case appears... the percentages say that most disks either fail shortly after arrival/during burn-in or right at the end of their MTBF period, rarely in between, so twice in quick succession might be some kind of indicator for another common component in bad shape.

Since SMART reporting is based on tests run from the PCB attached to the bottom of the disk, there's nothing in the path to question in terms of the tests themselves (maybe a dud batch of the PCBs?).

Given that, we might want to consider other environmental factors like power supply stability (the SMART data hasn't shown anything there, but maybe something difficult for it to pick up would be voltage spiking... how good is your power from the outlet to the NAS? How good is the power supply in the NAS?)

Otherwise, things like movement or vibrations may be worth a thought (again no real indicators in the SMART data other than the read/write errors, but perhaps there has been a head strike due to a physical event while in operation?... could happen to two or more disks in the same enclosure from the same strike/event if both disks were in the middle of writes at that moment and others were parked and hence avoided damage... just a theory).

At 6 months, those drives should still be under warranty; perhaps consider sending them back.

RAID cards in IT mode behave more predictably, but I doubt that's a root cause here. By all means get to it though if you can.

I see there was a huge gap between SMART tests... make sure they are running on all your disks so you don't miss another bad one until it's too late.

Perhaps if you shared your full hardware spec, the forum could put a bit of attention on spotting anything I've missed by not knowing what you have there.
 

James S

Thanks for the quick response. Since the disks were all bought together the "dud batch" theory could well play out here. (Do you, for example, buy disks from multiple vendors to try and overcome this potential problem?) On install I was naive about the need for drive burn-in which might have picked these problems up sooner.

I'm also not seeing the raid card as a root problem since it hasn't given problems (well that I've seen - which is not saying much). I assume getting it to run in IT mode would mean flashing it?

I'd seen "SMART enabled" in the FreeNAS GUI and thought all was well. So thanks for the reminder to set up regular SMART tests (I've done the same for scrubs already).

In terms of a full spec:
Motherboard: Asus KCMA-D8 dual socket
CPU: AMD Opteron 4122 (4 core, no threading) x2
RAM: 32GB Kingston ECC
RAID card: Asus Pike 2008, running the LSI 2008 chipset in JBOD mode
Power: FSP Twins (500W) connected to UPS
HDD: 4 x 2TB WD Red (RAIDZ2), 4 x 2TB WD Red (RAIDZ1), 4 x 3TB WD Red (striped mirror)
System disks: 2 x 120GB Intel SSDs
System: FreeNAS 11.1-U7
 

sretalla

You're probably OK to leave the RAID card as it is.

Looks like power is not the issue.

I don't buy drives separately on purpose, but I do watch new drives carefully.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
James S said:
Thanks!
I've replaced the disk following burn-in and then resilvered. Just before I started the process another disk in the array was marked degraded. I'm surprised since these disks are 6 months old (WD Red 3TB). ~snip~

TBH, I’ve yet to find a logical answer to how hard disks fail. Burn-ins are not an absolute test either: they guard against premature death, but they don’t mean a disk will last longer. I think of disks as being like tyres bought from a local dealer. You can’t be sure how long they will last.
Personally I have my system running on an Asus board for 4+ years.
Here is a report of my NAS

+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|ada1 ?|WD-WCC4E4JL9NDZ| 38 |35203|  268|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   2|
|ada2 ?|WD-WCC7K2VDVPKT| 34 |13006|   53|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   2|
|ada3 ?|WD-WCC4E4KC79HK| 37 |35203|  269|    0|      0|      5|       0|     0|   N/A|   N/A|    N/A|   2|
|ada4 ?|WD-WCC4E0ESU744| 35 |29051|  125|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   2|
|ada5 ?|WD-WCC4E2VSE6NP| 37 |35203|  262|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   2|
|ada6 ?|WD-WCC4E1FSUL4N| 37 |26541|  107|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   2|
|ada7 ?|WD-WCC4E2FL7TSK| 36 | 5639|   18|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A| 235|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+

Not sure where ada0 went, but as you can see there is quite a range of power-on hours across the disks in my system. I think I’ve had a total of 6 disks fail over 4 years. All were RMA’d and I always kept 1 cold spare.
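
For anyone struggling to line the headings up with the data, here is a minimal Python sketch that reads the rows of the report above and flags anything worth a second look. It is not the script that produced the report. The column names, the `parse_rows` and `warnings_for` helpers, and the threshold of 30 for "Last Test Age" are all assumptions made for illustration, and the unit of that column is not stated in the report.

```python
# Column names are a hypothetical reading of the three-line header above.
COLUMNS = [
    "device", "serial", "temp", "power_on_hours", "start_stop_count",
    "spin_retry_count", "realloc_sectors", "pending_sectors",
    "offline_uncorrectable", "udma_crc_errors", "seek_errors",
    "high_fly_writes", "command_timeout", "last_test_age",
]

# Data rows copied from the report in this post.
REPORT_ROWS = """\
|ada1 ?|WD-WCC4E4JL9NDZ| 38 |35203| 268| 0| 0| 0| 0| 0| N/A| N/A| N/A| 2|
|ada2 ?|WD-WCC7K2VDVPKT| 34 |13006| 53| 0| 0| 0| 0| 0| N/A| N/A| N/A| 2|
|ada3 ?|WD-WCC4E4KC79HK| 37 |35203| 269| 0| 0| 5| 0| 0| N/A| N/A| N/A| 2|
|ada4 ?|WD-WCC4E0ESU744| 35 |29051| 125| 0| 0| 0| 0| 0| N/A| N/A| N/A| 2|
|ada5 ?|WD-WCC4E2VSE6NP| 37 |35203| 262| 0| 0| 0| 0| 0| N/A| N/A| N/A| 2|
|ada6 ?|WD-WCC4E1FSUL4N| 37 |26541| 107| 0| 0| 0| 0| 0| N/A| N/A| N/A| 2|
|ada7 ?|WD-WCC4E2FL7TSK| 36 | 5639| 18| 0| 0| 0| 0| 0| N/A| N/A| N/A| 235|
"""


def parse_rows(text):
    """Turn each |-delimited row into a dict keyed by COLUMNS."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != len(COLUMNS):
            continue  # skip borders or anything that is not a data row
        row = dict(zip(COLUMNS, cells))
        row["device"] = row["device"].split()[0]  # "ada1 ?" -> "ada1"
        rows.append(row)
    return rows


def warnings_for(row, max_test_age=30):
    """List sector counters above zero and a stale last-test age (assumed threshold)."""
    notes = []
    for key in ("realloc_sectors", "pending_sectors", "offline_uncorrectable"):
        if row[key].isdigit() and int(row[key]) > 0:
            notes.append(f"{key}={row[key]}")
    age = row["last_test_age"]
    if age.isdigit() and int(age) > max_test_age:
        notes.append(f"last_test_age={age}")
    return notes


rows = parse_rows(REPORT_ROWS)
flagged = {row["device"]: warnings_for(row) for row in rows if warnings_for(row)}
for device, notes in flagged.items():
    print(device, ", ".join(notes))
```

With these assumptions it flags ada3 (5 pending sectors) and ada7 (last test age 235); the other five disks show zeros in the sector columns.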
 

James S

Interesting... bit confused why you are missing a disk (7 instead of the 8TiBs). Also I'm struggling to align the column heads to the data (14 columns of data v. a pile of column headings). I'm kind of assuming those zeros line up with critical SMART variables...?

I'm hoping my other tyres have still got plenty of rubber on them yet ;)
 

Bhoot

I use a lot of scripts written by @Bidule0hm; this is the output of one of them. I'm also on the verge of changing my hard disks. Don't wanna end up with all tyres going flat at the same time.
 

James S

After replacing the disk the system reports it has resilvered successfully.
Code:
  pool: VM-Datastore
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 3.77M in 0 days 00:00:00 with 0 errors on Wed Aug 14 18:34:09 2019
config:

However, a second disk is also reported degraded. I'm just about to get a replacement, burn-in and install.

Today, however, I had two errors while the machine was running, one fatal so the machine froze. The first seems to be an attempt to automatically resize the disk I'd just replaced (da0). (Is this due to the second faulty drive?) The second seems to be a memory fault?
The attached shows the details -- sorry for the bad picture. I had a colleague take a picture of the terminal and mail it to me.
When I remote in I'm not clear why it shows 17 disks when I've got 14. I assume it has automatically partitioned disks?
Code:
root@freenas:~ # glabel status
                                      Name  Status  Components
gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da0p2
gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da1p2
gptid/025b0bdd-c146-11e8-9d80-ac1f6b2542fe     N/A  da2p2
gptid/03993d48-c146-11e8-9d80-ac1f6b2542fe     N/A  da3p2
gptid/04da7480-c146-11e8-9d80-ac1f6b2542fe     N/A  da4p2
gptid/063a9ab5-c146-11e8-9d80-ac1f6b2542fe     N/A  da5p2
gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe     N/A  da6p2
gptid/01dd6f7b-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada0p2
gptid/024dced5-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada1p2
gptid/02bcb705-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada2p2
gptid/0329aca0-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada3p2
gptid/3a133126-917a-11e8-8f5a-ac1f6b2542fe     N/A  ada4p1
gptid/3a179f20-917a-11e8-8f5a-ac1f6b2542fe     N/A  ada4p2
gptid/72fdefd5-9222-11e8-8b26-ac1f6b2542fe     N/A  ada5p1
gptid/f9a6002a-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da0p1
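
One way to read the `glabel status` listing above: each row is a labelled partition rather than a disk, so a disk with two labelled partitions (da0 and ada4 here) appears twice. The Python sketch below is not part of FreeNAS; it only groups the pasted rows by disk. `labels_by_disk` is a hypothetical helper name, and the device-name pattern (letters, a unit number, then `pN`) is an assumption that happens to fit this output.

```python
import re
from collections import defaultdict

# Rows copied from the `glabel status` output above.
GLABEL_STATUS = """\
                                      Name  Status  Components
gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da0p2
gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da1p2
gptid/025b0bdd-c146-11e8-9d80-ac1f6b2542fe     N/A  da2p2
gptid/03993d48-c146-11e8-9d80-ac1f6b2542fe     N/A  da3p2
gptid/04da7480-c146-11e8-9d80-ac1f6b2542fe     N/A  da4p2
gptid/063a9ab5-c146-11e8-9d80-ac1f6b2542fe     N/A  da5p2
gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe     N/A  da6p2
gptid/01dd6f7b-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada0p2
gptid/024dced5-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada1p2
gptid/02bcb705-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada2p2
gptid/0329aca0-ccf5-11e3-8e0e-60a44ce69c93     N/A  ada3p2
gptid/3a133126-917a-11e8-8f5a-ac1f6b2542fe     N/A  ada4p1
gptid/3a179f20-917a-11e8-8f5a-ac1f6b2542fe     N/A  ada4p2
gptid/72fdefd5-9222-11e8-8b26-ac1f6b2542fe     N/A  ada5p1
gptid/f9a6002a-aecb-11e8-8fb0-ac1f6b2542fe     N/A  da0p1
"""


def labels_by_disk(text):
    """Group gptid labels by the disk that holds the labelled partition."""
    disks = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a gptid entry.
        if len(fields) != 3 or not fields[0].startswith("gptid/"):
            continue
        label, _status, partition = fields
        # "ada4p1" -> disk "ada4"; the trailing pN is the partition index.
        match = re.fullmatch(r"([a-z]+\d+)p\d+", partition)
        if match:
            disks[match.group(1)].append((partition, label))
    return dict(disks)


disks = labels_by_disk(GLABEL_STATUS)
label_count = sum(len(parts) for parts in disks.values())
print(f"{label_count} labels on {len(disks)} disks")
for disk, parts in sorted(disks.items()):
    print(disk, [partition for partition, _label in parts])
```

On this paste it counts 15 labels on 13 disks. That is one fewer than the 14 disks in the spec, which would be consistent with one disk not being presented to the OS. Treat that as a hint rather than proof, since glabel only lists labels that are currently active.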


I need help to understand how to effectively handle the replacement. Has something gone wrong here? The disks are still in my hand as I've not done the RMA with WD.
For the second problem: is this (memory?) issue connected with the disk issue? Is there something to be done here, too?

My system has been running smoothly until now - all input really appreciated.
 

Attachments

  • freenas 11 errors.JPEG

sretalla

I think we should see zpool status and assess what's happening with the pool right now.
 

James S

Zpool status shows
Code:
root@freenas:~ # zpool status
  pool: NAS2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:07:10 with 0 errors on Sun Jul 14 02:07:11 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/01dd6f7b-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/024dced5-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/02bcb705-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/0329aca0-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0

errors: No known data errors

  pool: NAS2_data
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:12:45 with 0 errors on Sun Jul 14 00:12:48 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2_data                                       ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/025b0bdd-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/03993d48-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/04da7480-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/063a9ab5-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0

errors: No known data errors

  pool: VM-Datastore
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 3.77M in 0 days 00:00:00 with 0 errors on Wed Aug 14 18:34:09 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      DEGRADED     0     0     0
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
            4751023271579736273                         UNAVAIL      0     0     0  was /dev/gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:22 with 0 errors on Fri Aug  9 03:45:22 2019
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada5p2      ONLINE       0     0     0

errors: No known data errors
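
To tie the `zpool status` output above back to physical devices, the members that are not ONLINE can be cross-checked against the labels that `glabel status` listed earlier in the thread. The Python sketch below does that for the VM-Datastore section only. `unhealthy_members` is a hypothetical helper, and recognising leaf devices by a `gptid/` prefix or an all-digit GUID is an assumption that fits this paste rather than a general rule.

```python
import re

# VM-Datastore config section copied from the `zpool status` output above.
ZPOOL_CONFIG = """\
        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      DEGRADED     0     0     0
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
            4751023271579736273                         UNAVAIL      0     0     0  was /dev/gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe
"""

# Labels from the earlier `glabel status` paste that belong to this pool.
VISIBLE_LABELS = {
    "gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe": "da0p2",
    "gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe": "da1p2",
    "gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe": "da6p2",
}


def unhealthy_members(config_text):
    """Return (name, state, former_label) for each leaf device that is not ONLINE."""
    found = []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        # Assumption: leaf devices are gptid labels or bare numeric GUIDs.
        if not (name.startswith("gptid/") or name.isdigit()):
            continue
        if state == "ONLINE":
            continue
        was = re.search(r"was /dev/(\S+)", line)
        found.append((name, state, was.group(1) if was else None))
    return found


for name, state, former in unhealthy_members(ZPOOL_CONFIG):
    label = former or name
    device = VISIBLE_LABELS.get(label, "not listed by glabel")
    print(f"{state}: {label} -> {device}")
```

For this paste the only unhealthy member is the UNAVAIL device that used to be gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe, and that label does not appear in the glabel listing, so the OS may not currently see that partition at all.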
 

sretalla

It looks to me like you haven't replaced the disk yet (in the GUI). Perhaps you physically replaced it and did nothing else yet? I see your comment about burn-in, so that's probably it.

What's odd is the small amount of data resilvered in that pool... not even 4MB... that sounds like a "catch-up" resilver which can happen when re-connecting a drive that went offline or unavailable for a short time.
 

Bhoot

I would suggest just running a scrub again from the CLI/GUI for the datastore. As far as I know FreeNAS is a great OS, with most errors being user-driven rather than caused by the OS. It's unforgiving, unlike most other OSes you may know, so discipline is a must. The difference between a successful, long-running FreeNAS system and a failed one is people's ability to follow instructions when they don't understand something.
 

James S

Ok... (thanks!) I guess that is why I'm here. Having followed the instructions, I don't understand why the system reports a resilver but has only completed a microscopic proportion of the disk. Grrrr

I guess it is unusual to have two failed disks... but what is going on here?

I've started a scrub, which has not shown up in the GUI as in progress. It seems others have questions about getting scrub schedules to work:
https://www.ixsystems.com/community/threads/how-to-fix-scrub-schedule.54678/

...a look at the terminal with "zpool status -v" actually shows the scrub in progress.
 

James S

Well the net result of this process is
Code:
  pool: VM-Datastore
 state: UNAVAIL
status: One or more devices are faulted in response to persistent errors.  There are insufficient replicas for the pool to
        continue functioning.
action: Destroy and re-create the pool from a backup source.  Manually marking the device
        repaired using 'zpool clear' may allow some data to be recovered.
  scan: resilvered 0 in 0 days 00:00:00 with 0 errors on Thu Aug 15 17:42:52 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    UNAVAIL    733    68     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      DEGRADED  1001   190   523
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  DEGRADED   284   283   566  too many errors
            gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe  FAULTED      0 1.18K 2.04K  too many errors

Really not quite sure what has gone so horribly wrong here.

Why am I lacking "replicas" for this pool to continue functioning? The whole idea of a striped mirror was to avoid this kind of meltdown.

I'd like to figure out what I've done wrong here. Any input to avoid Fukushima mark 2 would be great.
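
On the "replicas" question, a striped mirror keeps its redundancy inside each mirror: data is spread across mirror-0 and mirror-1, so the pool needs both vdevs, and each vdev needs at least one working member. The Python sketch below enumerates every two-disk failure for a layout like this one. The disk names A to D and the `pool_survives` helper are hypothetical, and the model simply treats a disk as either working or failed.

```python
from itertools import combinations

# Hypothetical disk names; same shape as the pool in this thread:
# two 2-way mirrors striped together.
POOL = {"mirror-0": {"A", "B"}, "mirror-1": {"C", "D"}}


def pool_survives(pool, failed):
    """The stripe needs every vdev; a mirror vdev needs at least one working member."""
    failed_set = set(failed)
    return all(len(members - failed_set) > 0 for members in pool.values())


all_disks = sorted(disk for members in POOL.values() for disk in members)
outcomes = {
    failed: pool_survives(POOL, failed)
    for failed in combinations(all_disks, 2)
}
lost = [failed for failed, ok in outcomes.items() if not ok]
survived = [failed for failed, ok in outcomes.items() if ok]

for failed, ok in outcomes.items():
    print(failed, "survives" if ok else "pool lost")
```

Of the six possible two-disk failures, four are survivable and two are not: the ones where both failed disks sit in the same mirror. In the output above both members of mirror-1 are reporting errors (one DEGRADED, one FAULTED), which is that unlucky case if the DEGRADED member can no longer return good data.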
 