Pool Switching from 'Degraded' to 'Healthy,' no HD Issues

toc

Dabbler
Joined
Jun 29, 2019
Messages
12
Hi,

I recently inherited responsibility for a backup server at my job, and I have been experiencing some issues with it. Here are the details:

FreeNAS 11.2-U5
Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
Memory 16265MB
6 x WD Red 3TB NAS hard drives
16 GB memory stick for boot drive

The server originally had four hard drives, which wasn't enough to back up all the data from our other NAS server. I added two more drives, added them to the existing pool, and began getting alerts similar to this:

Pool Backup state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

It was one drive that kept detaching. Things I tried: swapping SATA cables, making sure all the cables were connected properly, and rebooting. The problem would eventually return.

I decided to start over, so I did a fresh install of FreeNAS, created a new pool, and began replicating a snapshot from our primary NAS.

I came into work this morning, and had alerts in my inbox:

Pool Backup state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Then I received a 'gone' alert saying it was no longer an issue. I ran zpool status, and this is what I got:

config:

NAME                                            STATE     READ WRITE CKSUM
Backup                                          ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/3c8955c4-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0
    gptid/3d63b6b4-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0
    gptid/3e3dd1dd-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0
    gptid/3f187dc2-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0
    gptid/3fe9b4cb-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0
    gptid/40c81797-99b6-11e9-b1f7-d05099c05094  ONLINE       0     0     0

errors: No known data errors

pool: freenas-boot
state: ONLINE
scan: none requested
config:

NAME            STATE     READ WRITE CKSUM
freenas-boot    ONLINE       0     0     0
  da0p2         ONLINE       0     0     0

errors: No known data errors
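
For anyone who wants to catch the DEGRADED window before the 'gone' alert arrives, here is a minimal Python sketch, assuming the `zpool status` layout shown above. It scans the text and reports every config row whose STATE is not ONLINE. The function name `unhealthy_devices` and the sample rows are hypothetical and made up for the demo; they are not output from this server.

```python
def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows whose STATE is not ONLINE."""
    bad = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "config:":
            in_config = True
            continue
        if fields[0] == "errors:":
            in_config = False
            continue
        # Skip everything outside a config section, plus the column header.
        if not in_config or fields[0] == "NAME":
            continue
        # Rows look like: NAME STATE READ WRITE CKSUM
        if len(fields) >= 2 and fields[1] != "ONLINE":
            bad.append((fields[0], fields[1]))
    return bad


# Made-up sample: the DEGRADED and REMOVED rows are for illustration only.
sample = """\
config:

NAME STATE READ WRITE CKSUM
Backup DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/aaaa ONLINE 0 0 0
gptid/bbbb REMOVED 0 0 0

errors: No known data errors
"""

print(unhealthy_devices(sample))
```

On a live system the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`, run on a schedule and logged with a timestamp.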


Any help would be greatly appreciated, thanks!
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Have a look in /var/log/messages to see which drive was dropping out.
If it's consistent with the previous issues, then given that you've already tried swapping the location and the cable, it may be time to replace the drive with a fresh one.
If it has moved with the cable, well, that's an easy fix.
Take a look at the SMART output ('smartctl -a /dev/daXXX') and see whether the drive admits to failing. UDMA CRC errors will likely be above zero if it has.
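
To make that check repeatable, here is a minimal Python sketch, assuming the usual attribute table that `smartctl -a` prints for ATA drives. It pulls out the raw values so that UDMA_CRC_Error_Count and similar counters can be compared over time. The function name `smart_raw_values` and the sample rows are hypothetical; they are not real output from this server.

```python
def smart_raw_values(smartctl_output):
    """Map SMART attribute names to their integer raw values."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values


# Made-up sample rows in the smartctl attribute-table layout.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
"""

attrs = smart_raw_values(sample)
print(attrs.get("UDMA_CRC_Error_Count"))
```

If the drive does not answer at all, the table is missing and the function simply returns an empty dict.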
 

toc

Thanks for the reply!

The volume status has been 'healthy' for the last few days, but now is 'degraded.'

The drive missing is a brand new one, different from the last time this happened.

Here's what I got from 'smartctl -a /dev/ada4'

/dev/ada4: Unable to detect device type

Here's what I saw specific to that drive around the time it disappeared, in the messages log:

Jun 30 23:25:37 freenas ada4 at ahcich7 bus 0 scbus7 target 0 lun 0
Jun 30 23:25:37 freenas ada4: <WDC WD30EFRX-68EUZN0 82.00A82> s/n WD-WCC4N3TR7T$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): SETFEATURES ENABLE RCACHE. ACB: e$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): ATA status: 61 (DRDY DF ERR), err$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): RES: 61 04 00 00 00 40 00 00 00 0$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Error 5, Periph was invalidated
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): SETFEATURES ENABLE WCACHE. ACB: e$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): ATA status: 61 (DRDY DF ERR), err$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): RES: 61 04 00 00 00 40 00 00 00 0$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Error 5, Periph was invalidated
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): ATA status: 61 (DRDY DF ERR), err$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): RES: 61 04 00 00 00 40 00 00 00 0$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Error 5, Periph was invalidated
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): ATA status: 61 (DRDY DF ERR), err$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): RES: 61 04 00 00 00 40 00 00 00 0$
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Error 5, Periph was invalidated
Jun 30 23:25:37 freenas ZFS: vdev state changed, pool_guid=4717793010821884791 $
Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Periph destroyed
Jun 30 23:25:38 freenas ZFS: vdev state changed, pool_guid=4717793010821884791 $
Jul 1 02:00:00 freenas syslog-ng[6489]: Configuration reload request received,$
Jul 1 02:00:00 freenas syslog-ng[6489]: Configuration reload finished;

I am going to swap out the cable, but is there anything else it could be?
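
For reference, the log excerpt above can be boiled down with a short script. This is a minimal Python sketch, assuming the `(device:channel:bus:target:lun): message` layout seen in those lines; it counts the "CAM status" errors per device and channel. The names `CAM_LINE` and `count_cam_errors` are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "(ada4:ahcich7:0:0:0): CAM status: ATA Status Error"
CAM_LINE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<chan>\w+):\d+:\d+:\d+\): (?P<msg>.*)"
)


def count_cam_errors(log_lines):
    """Count 'CAM status' lines per (device, channel) pair."""
    counts = Counter()
    for line in log_lines:
        m = CAM_LINE.search(line)
        if m and m.group("msg").startswith("CAM status:"):
            counts[(m.group("dev"), m.group("chan"))] += 1
    return counts


# Three lines copied from the excerpt above.
sample = [
    "Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error",
    "Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): Error 5, Periph was invalidated",
    "Jun 30 23:25:37 freenas (ada4:ahcich7:0:0:0): CAM status: ATA Status Error",
]

print(count_cam_errors(sample))
```

Running it over the whole of /var/log/messages would show whether the errors stay on one channel (pointing at a port or cable) or follow one drive.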
 

toc

Upon cable swap and reboot:

July 1, 2019, 1:08 p.m. - Pool Backup state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
 

JaimieV

I'll be interested in any follow-up. I had a drive do this to me a week or two ago, with the same logs. A reboot brought it back online, and there have been no troubles since...
 

toc

A couple of hours later, these alerts popped up:

* Pool Backup state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
* Device: /dev/ada4, not capable of SMART self-check
* Device: /dev/ada4, ATA error count increased from 10 to 11

I'd be surprised if this drive were the problem, since it's brand new, but do these alerts point to a drive issue?
 

toc

I was reading on a different iX thread with a somewhat similar issue, and they were wondering about motherboard capacity. Here is what I have:

ASRock C226 WS

  • ATX 12" x 9.6"
  • Single socket H3 (LGA 1150)
  • Supports Intel Xeon processor E3-1200 v3 series & Core i3/i5/i7, Pentium, Celeron processors
  • 4-DIMM DDR3 1600/1333 ECC & UDIMM, max. 32GB
  • 2 x PCIe 3.0 x16, 1 x PCIe 2.0 x4, 3 x PCIe 2.0 x1, 1 x PCI
  • 6 x SATA3 by Intel C226 support RAID 0, 1, 5, 10
  • 4 x SATA3 by Marvell 9172 support RAID 0, 1
  • 2 x Intel i210, support Dual GLAN
  • 6 x USB 3.0 ports (4 x rear + 2 via header)
  • 10 x USB 2.0 ports (4 x rear + 6 via header)

I have six 3TB hard drives connected to this, along with one 500GB SSD. Boot is on a USB stick. The power supply is ATX12V 2.3.

Any insight into this would be a tremendous help, thanks!
 
Joined
May 10, 2017
Messages
838
Make sure you're using the Intel SATA ports; some Marvell controllers are known to drop disks without a reason.
 

toc

"Make sure you're using the Intel SATA ports; some Marvell controllers are known to drop disks without a reason."

Thanks. This is probably a noob question, but how can I determine whether I'm using the Intel or the Marvell ports?
 
Joined
May 10, 2017
Messages
838
Don't use the first four ports:

[Attached image: 1562076375285.png]
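
Besides checking the board layout, the same question can usually be answered from the boot messages. Here is a minimal Python sketch, assuming FreeBSD-style `dmesg` lines in which each disk names its AHCI channel and each channel names its controller. The function name `disks_by_controller`, the controller descriptions, and all sample lines are hypothetical; the real strings on this board may differ.

```python
import re

CTRL = re.compile(r"\b(ahci\d+): <([^>]+)>")
CHAN = re.compile(r"\b(ahcich\d+): <AHCI channel> at channel \d+ on (ahci\d+)")
DISK = re.compile(r"\b(ada\d+) at (ahcich\d+)\b")


def disks_by_controller(dmesg_text):
    """Map each adaN disk to the description of the controller it hangs off."""
    controllers, channels, disks = {}, {}, {}
    for line in dmesg_text.splitlines():
        m = CTRL.search(line)
        if m:
            controllers[m.group(1)] = m.group(2)
        m = CHAN.search(line)
        if m:
            channels[m.group(1)] = m.group(2)
        m = DISK.search(line)
        if m:
            disks[m.group(1)] = m.group(2)
    # Resolve disk -> channel -> controller -> description.
    return {
        disk: controllers.get(channels.get(chan), "unknown")
        for disk, chan in disks.items()
    }


# Made-up boot messages for illustration only.
sample = """\
ahci0: <Marvell 88SE9172 AHCI SATA controller> irq 16 at device 0.0 on pci2
ahcich0: <AHCI channel> at channel 0 on ahci0
ahci1: <Intel Lynx Point AHCI SATA controller> irq 19 at device 31.2 on pci0
ahcich2: <AHCI channel> at channel 0 on ahci1
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
"""

print(disks_by_controller(sample))
```

Any disk that resolves to a Marvell description would be a candidate for moving to an Intel port.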
 

toc

I've had time to dig back into this, and here's where I'm at:

-I now have two hard drives (one brand new) with the status 'Unavailable.'
-I switched them both to 'Offline,' then powered down and replaced them with brand new drives.
-Upon rebooting, I went to 'Disks,' and the drives are not there.
-When I go to 'Status' for the pool I have, I can see them. Each is listed similarly to this:
/dev/gptid/3c8955c4-99b6-11e9-b1f7-d05099c05094
-They are listed as 'Offline'
-When I select 'Replace' for one of them, I can't, because there is nothing listed to replace under 'Member Disk'

I believe I followed the process correctly for HD replacement, and I suspect something else is up.

Could my controller be the issue? I'm avoiding the Marvell controller (as recommended). I was reading about RAID controllers being an issue, but I'm not sure whether mine qualifies as one.

This backup server only ever had four hard drives in it until I did a fresh FreeNAS install and added two more drives. That's when the issues began.
 
Joined
May 10, 2017
Messages
838
Intel SATA ports are reliable and can be used without any issues. Of course the board might have a problem, but most likely there is a different issue, such as a power or connection problem. Do you have a different PSU you could use?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
"4 x SATA3 by Marvell 9172 support RAID 0, 1"

Don't use the Marvell ports. Make sure your power supply is adequate, and swap in known-good SATA cables. Burn in all your new disks and run SMART tests.
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
Disks not being detected is still a problem.

I would recommend switching out one disk at a time next time. With raidz2, taking out two disks at once leaves no redundancy during the resilver, which is risky.
To bypass any SATA port issues, you could connect the problematic disk to an HBA.
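
The arithmetic behind that advice, as a minimal Python sketch: a raidzN vdev tolerates N missing disks, so the margin left is N minus the number already out. The names `RAIDZ_PARITY` and `remaining_redundancy` are hypothetical.

```python
# Number of disks each raidz level can lose without losing the vdev.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def remaining_redundancy(vdev_type, missing_disks):
    """Return how many more disks may fail safely; a negative result means the vdev is lost."""
    return RAIDZ_PARITY[vdev_type] - missing_disks


print(remaining_redundancy("raidz2", 1))  # one disk swapped at a time
print(remaining_redundancy("raidz2", 2))  # two disks swapped at once
```

With one disk out, a raidz2 vdev can still survive another failure during the resilver; with two out, any further failure takes the pool down.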
 

toc

Thanks for all the replies.

I will try swapping out the power supply and see what happens.

At this point, I might also buy an internal HBA to bypass the SATA ports and see if that solves the problem. I'm looking at this one:

LSI LSI00301 (9207-8i), along with a couple of breakout adapters.
 

toc

I ordered those items and got them installed. I've got a replication task running, no issues yet! I will check back in, but I think bypassing the controller did the trick.

Thanks for all the insight, it really helped!
 

toc

Update:

I was able to successfully do a replication task to this machine for a backup.

One hard drive (da4p2) became problematic, and the pool switched to 'degraded.'

So I attempted to put in a fresh drive:

-I switched da4p2 to 'offline'
-shut down the computer and replaced the drive
-powered on and went to 'pool status'

The new drive is listed as '/dev/gptid/3b59ae9b-a7ef-11e9-bfa2-d05099c05094' and a different drive has become da4p2. When I select the new drive to replace, the 'Member Disk' area has no options, and I can't replace.

I went to 'Disks' and the new disk isn't even listed.

All disks are on the new controller, new breakout cables.
 