Unable to find info/log that caused HDD to be removed from pool.

Zbass5

Cadet
Joined
Jun 30, 2020
Messages
4
Hi, I'm new to this forum and couldn't find the info I was looking for, so I decided to post here. (Sorry in advance if this is not the right place, or if this has already been covered.)

I've been using FreeNAS for the last 6 months, then decided to build a FreeNAS 11.3-U2 system as an NFS datastore for my home lab, connected to an ESXi 7.0 PC via a 10Gb NIC. I built the PC and tested it for 2 weeks, all good. Everything ran really well for 8 weeks, then I got " ...Pool Barrel state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state..."

So I removed the drive and tested it in my Windows PC with the Western Digital Data Lifeguard utility, and the drive passed all tests: short, long, surface etc. I put it back into the datastore with a new SATA cable and resilvered; all went fine. The next day, the same error appeared and the HDD was removed from the pool again. No errors had previously been reported for this drive by the scheduled SMART tests.

At this point, I am unable to determine what caused the HDD failure. Is there a way to find out? Or is it blindingly obvious? The drive is currently in the "removed" state, so I'm unable to read its SMART info.

The drive is still under warranty and I can return it; I just need to establish why it failed. One of the other HDDs is failing, and that is reported via SMART, but this one is a mystery. The drives are brand new. It may be a motherboard issue or something else. I stress-tested all hardware prior to the build and it all worked fine.

Can anyone shed some light on this? Thanks


FreeNAS datastore:
Intel G620 CPU
Gigabyte H67MA (consumer MB)
24GB DDR3 non-ECC RAM (Kingston)
1x Liteon SSD boot drive
3x WD Red 3TB drives WD30EFRX (CMR) in Raidz1
2x GIGABYTE GP-GSTFS31120 (120GB SSD) for Cache
1x Samsung SSD 970 EVO Plus 250GB - LOG Drive
Intel x520-DA1 NIC
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You may find some camcontrol or other errors shown by dmesg

It would also be interesting to look at zpool status -v
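
A minimal sketch of gathering both outputs in one go, assuming Python 3 is available on the FreeNAS box and the script is run as root. The helper names (`capture`, `disk_lines`, `report`) and the filter pattern are illustrative assumptions, not part of FreeNAS; `dmesg` and `zpool status -v` are the commands mentioned above.

```python
import re
import shutil
import subprocess

# Assumed pattern: FreeBSD names SATA disks adaN and AHCI channels ahcichN;
# aprobeN shows up while the kernel re-probes a channel.
DISK_PATTERN = re.compile(r"\b(ada\d+|ahcich\d+|aprobe\d+)\b")


def capture(cmd):
    """Return the stdout of cmd as text, or None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def disk_lines(text):
    """Keep only the lines that mention a disk or an AHCI channel."""
    return [line for line in text.splitlines() if DISK_PATTERN.search(line)]


def report():
    """Print disk-related kernel messages, then the verbose pool status."""
    kernel_log = capture(["dmesg"])
    if kernel_log is not None:
        print("\n".join(disk_lines(kernel_log)))
    pool_status = capture(["zpool", "status", "-v"])
    if pool_status is not None:
        print(pool_status)
```

Calling `report()` prints only the kernel lines that name a disk or channel, which makes timeouts and detach events easier to spot than in the full `dmesg` output.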
 

Zbass5

Hi, thanks for your reply.

This is the readout from dmesg, shown below. I'm not sure how to read it, but I guess the hard disk is timing out and being dropped from the pool. So is the HDD faulty, or is port 3 on the motherboard suspect? It looks like there are issues with port 2 as well? Any thoughts welcome :)


ahcich3: Timeout on slot 26 port 0
ahcich3: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
(ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command
ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD30EFRX-68EUZN0 80.00A80> s/n WD WCC2T3JG6557 detached
(ada3:ahcich3:0:0:0): Periph destroyed
(aprobe0:ahcich3:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich3:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich3:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich3:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
(aprobe1:ahcich3:0:15:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich3:0:15:0): CAM status: ATA Status Error
(aprobe1:ahcich3:0:15:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
(aprobe1:ahcich3:0:15:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe1:ahcich3:0:15:0): Error 5, Retries exhausted
(aprobe0:ahcich3:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich3:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich3:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich3:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c8 da fb 66 40 02 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c2 a2 fc 66 40 02 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich2:0:0:0): Retrying command
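
One way to read a dump like this is to tally the `CAM status` lines per device and channel. The sketch below does that in Python; the regular expression and the function name `summarize_cam_errors` are assumptions based only on the line format shown above, not an official parser. Applied to this log it would report a command timeout on ada3 (ahcich3) and repeated CRC errors on ada2 (ahcich2). CRC errors are generally associated with the link (cable, port or controller) rather than the disk surface, though that is a rule of thumb, not a diagnosis.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log above:
# (ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
CAM_LINE = re.compile(
    r"^\((?P<dev>[a-z]+\d+):(?P<chan>ahcich\d+):[^)]*\): "
    r"CAM status: (?P<status>.+)$"
)


def summarize_cam_errors(log_text):
    """Count CAM status messages per (device, channel, status)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_LINE.match(line.strip())
        if match:
            key = (match["dev"], match["chan"], match["status"])
            counts[key] += 1
    return counts


def print_summary(log_text):
    """Print one line per distinct error, most frequent first."""
    for (dev, chan, status), n in summarize_cam_errors(log_text).most_common():
        print(f"{dev} on {chan}: {status} x{n}")
```

Feeding it a saved copy of the `dmesg` output groups the noise into a few lines, so it is easier to see whether one channel or several are affected.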
 

sretalla

If your SMART data is telling you the disk is OK, you need to focus on the connection (or power).

Try another port on the mobo if that's an option.

Make sure you're reading the SMART data correctly and that the long test has run successfully.

That's really not a good board for FreeNAS.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Since this is a consumer board, and not a server board, check if all the overclocking options in the BIOS have been disabled. Typically these gaming-oriented boards run a standard minimal overclock on first boot. The overclock may be interfering with the SATA ports.
 

Zbass5

Yes, I've kept everything bog standard in the MB BIOS and disabled onboard devices etc., all in the interest of reliability. It's an old board, but it has had very little use over the years.

As the drive has been "removed" from the pool, is there any way to get the drive online again (instead of resilvering) so I can run short and long SMART tests?

The plan is going to be a process of elimination:
i) Check the state of the hard drive (SMART)
ii) Change SATA port and cable
iii) New PSU
iv) If all this fails, new motherboard etc.

I tested the RAM, CPU etc. prior to building, all OK. I'm leaning towards a dying SATA port, as the chipset runs really hot. I think 109 Celsius?

Is there anything I should do before buying a new MB?

btw - like your quote "...never underestimate your own stupidity...." hummmmm :(
 

Samuel Tai

No, your plan is sound, and you've likely already found the cause of the issue: the chipset overheating.
 

sretalla

Zbass5 said: As the drive has been "removed" from the pool, is there any way to get the drive online again (instead of resilvering) so I can run short and long SMART tests?

The disk is still there even when removed from the pool... you can try smartctl -x /dev/ada3
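
A minimal sketch of the SMART steps, assuming the disk still shows up as `/dev/ada3` and that `smartctl` (smartmontools) is on the path, as it is on FreeNAS. The function names and the dictionary keys are hypothetical; the `smartctl` options used (`-t short`, `-t long`, `-l selftest`, `-x`) are standard smartmontools options.

```python
import shutil
import subprocess


def smart_commands(device="/dev/ada3"):
    """Build the smartctl invocations for a basic health check of one disk."""
    return {
        "short_test": ["smartctl", "-t", "short", device],
        "long_test": ["smartctl", "-t", "long", device],
        "selftest_log": ["smartctl", "-l", "selftest", device],
        "full_report": ["smartctl", "-x", device],
    }


def run_step(step, device="/dev/ada3"):
    """Run one step and return its output, or None if smartctl is missing."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smart_commands(device)[step],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

The self-tests run inside the drive in the background, so the usual order is to start `short_test` or `long_test`, wait for it to finish, and then read `selftest_log`; a long test on a 3TB drive can take several hours.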
 

Zbass5

Update.... Gave up trying to sort this out and bought a new motherboard. The HDD I was having issues with has been working fine for the last week. I'm guessing the 8-year-old MB doesn't want to work properly anymore (huh!)

Cheers
 