Problem on drives detached/destroyed FreeNAS 11

Status
Not open for further replies.

vatastala

Dabbler
Joined
Oct 9, 2014
Messages
20
Hi all, for a few days I've been experiencing a problem with my home FreeNAS 11. It's a simple system with an Asus mobo + 6x3TB disks attached directly to the 4 SATA ports + 2 PCI-E external ports. I have the FreeNAS OS booting from an SSD and a ZFS pool on my 6 disks with RAIDZ1.

A few days ago my NAS started resilvering without ever finishing. It starts and then, after some minutes or hours, restarts the resilver after messages like this:
Code:
Aug  7 22:55:01 freenas ada5 at ahcich9 bus 0 scbus8 target 0 lun 0
Aug  7 22:55:01 freenas ada5: <ST3000DM008-2DM166 CC26> s/n Z504HT97 detached
Aug  7 22:55:01 freenas GEOM_ELI: Device ada5p1.eli destroyed.
Aug  7 22:55:01 freenas GEOM_ELI: Detached ada5p1.eli on last close.
Aug  7 22:55:01 freenas GEOM_ELI: Device gptid/85cd4d9b-786d-11e7-8c12-0015177adaa2.eli destroyed.
Aug  7 22:55:01 freenas GEOM_ELI: Detached gptid/85cd4d9b-786d-11e7-8c12-0015177adaa2.eli on last close.
Aug  7 22:55:01 freenas (ada5:ahcich9:0:0:0): Periph destroyed
Aug  7 22:55:02 freenas ZFS: vdev state changed, pool_guid=1878614223730411841 vdev_guid=17205288451372208862
Aug  7 22:55:06 freenas ada5 at ahcich9 bus 0 scbus8 target 0 lun 0
Aug  7 22:55:06 freenas ada5: <ST3000DM008-2DM166 CC26> ACS-2 ATA SATA 3.x device
Aug  7 22:55:06 freenas ada5: Serial Number Z504HT97
Aug  7 22:55:06 freenas ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  7 22:55:06 freenas ada5: Command Queueing enabled
Aug  7 22:55:06 freenas ada5: 2861588MB (5860533168 512 byte sectors)
Aug  7 22:55:06 freenas ada5: quirks=0x1<4K>

It's strange because it's the first time this has happened, and the problem seems to be spreading to other disks: since yesterday I have the same message for ada6. The disks are brand new and smartctl shows no errors. Every time the resilver restarts I receive a message that the pool is DEGRADED and, after a few seconds, ONLINE with resilvering in progress.
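
One way to check whether the same physical disks keep dropping off the bus is to count the detach events per serial number. This is only a minimal sketch, assuming log lines in the format shown above; detach_counts is a hypothetical helper name, and /var/log/messages is an assumption about where these messages end up.

```shell
# Count "detached" events per drive serial number.
# Reads log lines from the files given as arguments, or from stdin.
detach_counts() {
    awk '/ detached$/ {
        # The serial number is the field that follows "s/n".
        for (i = 1; i < NF; i++)
            if ($i == "s/n") count[$(i + 1)]++
    }
    END { for (serial in count) print serial, count[serial] }' "$@" | sort
}

# Example (path is an assumption): detach_counts /var/log/messages
```

Counting by serial rather than by adaN name keeps the result meaningful even if device numbering changes between boots.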

I was able to back up all data to another PC, so now I can investigate the issue. I suspect a hardware fault.

Could you please help me investigate it?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It's a good thing you backed up your data, some folks think about that too late.

Let's start with a few things, such as what specific version of FreeNAS you are running, how much RAM you have, and just about all your hardware and configuration information. You also stated that the hard drives were new, so how about the make/model of those as well? And you should make a backup of your configuration file if you haven't already done so.

The very first thing I need is for you to identify the drives: are ada5 and ada6 the external drives? If yes, then you may have a cable or power issue with the external enclosure. Identify how these are connected, and be exact here: include the make/model of any enclosures, how the drives are physically connected, etc. Remember, we are not there to examine your system, so we don't have any clue what you really have, and we try to provide the best answers we can.

If these are not external drives then continue reading:

Have you run the following on your new hard drives: SMART Long Test and badblocks? I would recommend these tests. And to ensure you don't have a computer issue, run MemTest86 and a quick CPU stress test. I'd do the RAM and CPU tests first, they are the foundation that the other tests need.

If you have any hard drive failures then report those. These are the first steps I'd take. If you have anything else to add then please do.
 

vatastala

Dabbler
Joined
Oct 9, 2014
Messages
20
Hi, thank you for your answer. I reply in detail:

  • FreeNAS 11.1-U5
  • 16GB non-ECC
  • Motherboard Mini-iTX Asus
  • CPU Core i5 3570s
  • External PCI-E 4xSata Syba ( Marvell 92xx Sata 6G )
  • 2x SSD ( 1 for OS, 1 for L2ARC )
  • 4x Seagate Skyhawk ST3000VX010, Hard disk 3 TB
  • 2x WD Caviar Green 3TB
In my configuration the 2 SSDs and 2 of the Seagates are attached to the Syba PCI-E card; the other disks are connected to the mobo. Forget the data about ada5: I took it as an example from the Internet because at that moment I was not able to get it from my NAS... The problems are on 3 disks now. The first is ada0, where the resilver is writing data:

upload_2018-7-7_11-53-38.png


But my SSH screen also shows (resilvering) on ada6 and ada7:
upload_2018-7-7_11-54-35.png

upload_2018-7-7_11-55-25.png


There I see only reads, so maybe it is resilvering ada0 from these two disks?!?

So it starts the resilver (as you can see now, I changed the SATA cables of ada6 and ada7) and after a while it shows detached and destroyed as in the example before, BUT ONLY for ADA6 and ADA7, NOT ADA0. In effect ada0 is attached on the external PCI_E card, so maybe it would be useful to change this SATA cable too. I will wait a while to see if the error occurs and, if so, I will change it.

I will proceed this way and if the errors persist I will evaluate your suggestions on CPU/RAM.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
First of all I think you need to take a step back, you are confusing me. Are you stating that the error message you posted was not actually yours? Not sure how you expect to get accurate help if you cannot post your actual problems.

When you troubleshoot hard drive issues then you must relate the problem to the physical hard drive serial number, not just ada0, ada1, etc... as ada0 can change, it is not always fixed to the same hard drive, especially when you move the drives around.
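
A minimal sketch of that mapping, assuming smartmontools is installed (it ships with FreeNAS); serial_of and list_serials are hypothetical helper names, not part of any tool:

```shell
# Pull the serial number out of "smartctl -i" output read from stdin.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Print "device serial" pairs for the devices named as arguments.
list_serials() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(smartctl -i "/dev/$dev" | serial_of)"
    done
}

# Example: list_serials ada0 ada1 ada6
```

Running it before and after moving a drive shows whether a given adaN name still points at the same serial number.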

@vatastala wrote:
In effect ada0 is attached on the external PCI_E card,
In effect? Is it physically attached there or not? It's difficult to work from guesswork.

Also, the hard drives you have installed are surveillance hard drives, meaning that they are not really good for anything other than video storage, meaning that when they read data back (If I understand this correctly), they will try to read a piece of data but if after a few retries then it just skips the data and continues on because during video playback a little bit of data loss is considered acceptable. I have a WD Purple drive in my home DVR but I won't place one in my NAS, no way. I'm not saying that this is your problem, however it could be; it's too early to tell, to be honest.

So, was this setup working prior to replacing the hard drives or is this a new setup?

1. Specifically which ASUS motherboard and which SATA add on board (need the add-on board info so I can look up hardware configuration info, you could have a jumper issue)?
2. What are the physical connections to each hard drive and include the hard drive serial numbers?
3. Post the output of the following command smartctl -a /dev/ada0 for each drive.
4. Tell me which hard drive is connected to which SATA port location, be specific.
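
For item 3, the reports could be gathered in one go with something like the sketch below. It assumes smartctl is on the PATH; save_smart_reports is a hypothetical helper name and the output directory is a placeholder.

```shell
# Save one "smartctl -a" report per drive into a directory.
# Usage: save_smart_reports OUTDIR DEVICE...
save_smart_reports() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        smartctl -a "/dev/$dev" > "$outdir/$dev.txt"
    done
}

# Example: save_smart_reports /tmp/smart ada0 ada1 ada4 ada5 ada6 ada7
```

One text file per drive is easy to attach to a post, and each report includes the serial number asked for in item 2.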

Now a few questions:
1. Did you do anything recently to your system, move it, upgrade the hard drives, move a cable, upgrade the FreeNAS software? Think about it, it could be anything and it could have happened a week before you started noticing problems.

2. Have you replaced any hardware since the problem occurred, such as the eSATA cables? External power supply, etc...

My current advice is to double check the eSATA cables, make sure they are connected fully. Reseat the PCIE card and test again.

Realize one thing, your drives will need to finish resilvering even if you figure out what is going on. If you never tested your drives before you installed them then I would recommend that you test them properly as they could be part of the issue. If this problem only occurs on the external drives then you need to troubleshoot the physical connections, and it could be the PCIE card at fault, however likely not if you are also using some of the internal connections.

Lastly, I suspect that your data is probably lost, you have a RAIDZ1 setup with three of the drives resilvering.

And I know I'm asking a lot of questions; it's just because I'm not there and you are my eyes, so I need the help. Off the cuff I'd say you have a problem with the external interface, bad SATA cables, poor power, or a misconfigured SATA card, if the only problems are with the hard drives connected to the PCIE card. If this is the first time using this card then it could be a compatibility issue with FreeNAS/FreeBSD. And if nothing else, all the questions I've asked may help you locate the problem without further assistance. But right now I'd consider all your data gone and I'd just destroy the pool and run badblocks on all of the hard drives. If that passes then rebuild your pool and monitor your system closely. I'd also only use FreeNAS 11.0-U4, I personally don't trust U5 but that shouldn't be the source of your problems.

So why do you have a cache? Just asking. Maybe you do have a need for it but if not then I'd remove it.

Good luck.
 

vatastala

Dabbler
Joined
Oct 9, 2014
Messages
20
@joeschmuck wrote:
First of all I think you need to take a step back, you are confusing me. Are you stating that the error message you posted was not actually yours? Not sure how you expect to get accurate help if you cannot post your actual problems.

Don't worry about this: I lost the screenshot of my errors, but I found exactly the same messages on the web.

@joeschmuck wrote:
When you troubleshoot hard drive issues then you must relate the problem to the physical hard drive serial number, not just ada0, ada1, etc... as ada0 can change, it is not always fixed to the same hard drive, especially when you move the drives around.

I didn't move any drives, so they are still the same drives in the same positions.

@joeschmuck wrote:
In effect? Is it physically attached there or not? It's difficult to use guess work.

Yes, it's attached to it.

@joeschmuck wrote:
Also, the hard drives you have installed are surveillance hard drives, meaning that they are not really good for anything other than video storage, meaning that when they read data back (If I understand this correctly), they will try to read a piece of data but if after a few retries then it just skips the data and continues on because during video playback a little bit of data loss is considered acceptable. I have a WD Purple drive in my home DVR but I won't place one in my NAS, no way. I'm not saying that this is your problem however it could be, too early to tell to be honest.

In fact they contain many, many video files for PLEX

@joeschmuck wrote:
So, was this setup working prior to replacing the hard drives or is this a new setup?

This was set up with all WD Caviar Green drives, but over the last few months I replaced 4 of them with Seagates because they were old and had many errors.

@joeschmuck wrote:
1. Specifically which ASUS motherboard and which SATA add on board (need the add-on board info so I can look up hardware configuration info, you could have a jumper issue)?

Asus 90-MIBJH0-G0EAY0KZ P8H61-I LX R2.0, the sata card I said before Syba with Marvell chip

@joeschmuck wrote:
2. What are the physical connections to each hard drive and include the hard drive serial numbers?
3. Post the output of the following command smartctl -a /dev/ada0 for each drive.

You can find six files attached. Do you see any errors?

@joeschmuck wrote:
4. Tell me which hard drive is connected to which SATA port location, be specific.

upload_2018-7-7_20-57-8.png


upload_2018-7-7_20-57-41.png


Ignore the SSDs; the order is exactly top-down.

@joeschmuck wrote:
Now a few questions:
1. Did you do anything recently to your system, move it, upgrade the hard drives, move a cable, upgrade the FreeNAS software? Think about it, it could be anything and it could have happened a week before you started noticing problems.

Absolutely nothing since two months ago, when I replaced the last disk.

@joeschmuck wrote:
2. Have you replaced any hardware since the problem occurred such as the eSATA cables? External power supply, etc...

No.

@joeschmuck wrote:
My current advice is to double check the eSATA cables, make sure they are connected fully. Reseat the PCIE card and test again.

Realize one thing, your drives will need to finish resilvering even if you figure out what is going on. If you never tested your drives before you installed them then I would recommend that you test them properly as they could be part of the issue. If this problem only occurs on the external drives then you need to troubleshoot the physical connections, and it could be the PCIE card at fault, however likely not if you are also using some of the internal connections.

In fact, today it finished resilvering after I changed the 2 SATA cables as I said, and this is the situation: I have varying checksum values, and if I reboot the NAS it starts resilvering again, writing to ada0 and reading from the other disks. I repeat: on ada6 and ada7 it only READS, it doesn't write to fix anything.

upload_2018-7-7_21-0-38.png


@joeschmuck wrote:
Lastly, I suspect that your data is probably lost, you have a RAIDZ1 setup with three of the drives resilvering.

Data is backed up on the other PC, so no problem. One question though: when it finished resilvering there was only 1 permanent error, on a jpg, so maybe the other data is good?

@joeschmuck wrote:
And I know I'm asking a lot of questions, it's just because I'm not there and you are my eyes so I need the help. Off the cuff I'd say you have a problem with the external interface, bad SATA cables, poor power, or misconfigured SATA card, if the only problems are with the hard drives connected to the PCIE card. If this is the first time using this card then it could be a compatibility issue with FreeNAS/FreeBSD. And if nothing else, all the questions I've asked may help you locate the problem without further assistance. But right now I'd consider all your data gone and I'd just destroy the pool and run badblocks on all of the hard drives. If that passes then rebuild your pool and monitor your system closely. I'd also only use FreeNAS 11.0-U4, I personally don't trust U5 but that shouldn't be the source of your problems.

Indeed, thank you very much for your help, you are very kind :)

@joeschmuck wrote:
So why do you have a cache? Just asking. Maybe you do have a need for it but if not then I'd remove it.

It speeds up reads and writes. I experimented with and without it, and I see improvements with it.

@joeschmuck wrote:
Good luck.
 

Attachments

  • ada0.txt
    5.5 KB · Views: 354
  • ada5.txt
    9.9 KB · Views: 286
  • ada4.txt
    10.1 KB · Views: 261
  • ada1.txt
    5.8 KB · Views: 261
  • ada6.txt
    5.5 KB · Views: 250
  • ada7.txt
    6.9 KB · Views: 265

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
@joeschmuck wrote:
(If I understand this correctly), they will try to read a piece of data but if after a few retries then it just skips the data
If I can remember I will STFW since I have one, too.
@joeschmuck wrote:
they will try to read a piece of data but if after a few retries then it just skips the data
... returning zeroes instead of real data? Or giving some specific error code? If the former, then ZFS would help here by means of checksum errors. Unless a checksum collision occurs... :confused:
And hopefully the SMART error count gets increased so one can RMA such a drive... :p

EDIT:
@vatastala wrote:
In fact they contain many, many video files for PLEX
If the disks work the way @joeschmuck wrote and a read error hits a metadata/parity block (raidz2 has them spread everywhere IIRC) then...
@joeschmuck wrote:
I would recommend that you test them properly
I guess that this should be done after the resilvering finishes? I've read in our forums that scrubs and long SMART self-tests cannot run in parallel, otherwise they never finish. But of course this is something else, so the rule may be different...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@pro lamer
I can't say for certain how the drive reports an error to the host machine, or how ZFS would handle it; I haven't investigated it that far. I know there are people out there using hard drives designed for video applications, and all I recall from when I did my research on them a few years ago was that they are tuned to favor video performance, not digital data integrity. Meaning that when a drive encounters a sector it can't read, it will try a few times, but in order to allow video playback to continue smoothly it will just skip the data and move on to the next sector that needs to be read. In the video world the loss of data is typically completely unnoticed, while with true digital data, it matters. But again, using these drives in a ZFS system might be fine since ZFS will repair data issues.

The main problem I see with the OP here is three drives that are resilvering in a RAIDZ1 pool.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@vatastala
ada0, ada1, and ada6 need a long SMART test run on them. They will likely pass just fine, however it is something you should conduct for peace of mind.
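
As a rough sketch of how those tests could be started and checked from the shell (start_long_tests and show_test_logs are hypothetical helper names; the smartctl options used are -t long and -l selftest):

```shell
# Start an extended (long) SMART self-test on each named device.
start_long_tests() {
    for dev in "$@"; do
        smartctl -t long "/dev/$dev"
    done
}

# Show the self-test log once the tests have had time to finish.
show_test_logs() {
    for dev in "$@"; do
        echo "== $dev =="
        smartctl -l selftest "/dev/$dev"
    done
}

# Example: start_long_tests ada0 ada1 ada6
#          (hours later) show_test_logs ada0 ada1 ada6
```

The test runs inside the drive itself, so the first command returns right away and the result only shows up in the self-test log later.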

Looking at the drive data it looks like drives ada0 (S/N: Z73086XN) and ada6 (S/N: Z730H0MJ) have a considerable amount of read errors (ID 1), where the other two drives of the same make/model do not have any errors reported. To me this is odd and worth investigation. Some drive models will report read errors but that is typically because they look ahead and if the data is not needed then the drive can call it a read error or just forget about the data, the latter is my preference but it is up to the manufacturer. Can you tell me where these two drives are physically connected?
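
To compare attribute 1 across drives of the same model, a small filter like the following could help. It is a sketch that assumes the usual smartctl -A table layout, where the raw value is the tenth column; read_error_attr is a hypothetical name. How the raw number should be interpreted varies by manufacturer, so the normalized VALUE against THRESH is usually the safer comparison.

```shell
# Print VALUE, WORST, THRESH and RAW_VALUE for attribute 1
# from "smartctl -A" output read on stdin.
read_error_attr() {
    awk '$1 == 1 && $2 == "Raw_Read_Error_Rate" { print $4, $5, $6, $10 }'
}

# Example: smartctl -A /dev/ada0 | read_error_attr
```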

@vatastala wrote:
In fact they contain many, many video files for PLEX
That unfortunately doesn't mean you can use video hard drives in a ZFS formatted machine. My DVR couldn't care less about a lost piece of data, but my computer cares. The computer has no idea it's video content either.

@vatastala wrote:
the sata card I said before Syba with Marvell chip
I need an actual model number of the PCIE card, trust me, it matters. Some are not compatible with the newer versions of FreeNAS. I use a PCIE card as well but I had to ensure compatibility and while it was good for FreeNAS 8.x, I have been hoping it would remain good for as long as I have my machine. One day I will not be able to upgrade to a newer version of FreeNAS due to hardware limitations but that is okay and I expect it.

My advice is to see if ada0 and ada6 are on the PCIE card and exactly which ports they are connected to. Maybe it would be best to tell me what ports all your hard drives are connected to, and post a copy of the dmesg output (also found in the logs: cd /var/log and look for the files called dmesg.today and dmesg.yesterday).
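
A minimal sketch for pulling the device-to-channel pairs out of that dmesg output, assuming attach lines of the form "ada5 at ahcich9 bus 0 ..." as in the log in the first post; port_map is a hypothetical helper name. The channel name still has to be matched to a physical port or card by hand.

```shell
# Print "device controller-channel" pairs from kernel attach messages,
# e.g. "ada5 at ahcich9 bus 0 ..." gives "ada5 ahcich9".
# Works with or without a syslog timestamp prefix on each line.
port_map() {
    awk '{
        for (i = 1; i < NF - 1; i++)
            if ($i ~ /^ada[0-9]+$/ && $(i + 1) == "at") {
                print $i, $(i + 2)
                break
            }
    }' "$@" | sort -u
}

# Example: dmesg | port_map
```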
 

vatastala

Dabbler
Joined
Oct 9, 2014
Messages
20
@joeschmuck wrote:
@vatastala
ada0, ada1, and ada6 need a long SMART test run on them. They will likely pass just fine, however it is something you should conduct for peace of mind.

OK, I'm going to start the tests.

@joeschmuck wrote:
Looking at the drive data it looks like drives ada0 (S/N: Z73086XN) and ada6 (S/N: Z730H0MJ) have a considerable amount of read errors (ID 1), where the other two drives of the same make/model do not have any errors reported. To me this is odd and worth investigation. Some drive models will report read errors but that is typically because they look ahead and if the data is not needed then the drive can call it a read error or just forget about the data, the latter is my preference but it is up to the manufacturer. Can you tell me where these two drives are physically connected?

Z73086XN (ada0) is connected to the PCI-E card, ada1 too, and Z730H0MJ (ada6) to the motherboard.

@joeschmuck wrote:
That unfortunately doesn't mean you can use video hard drives in a ZFS formatted machine. My DVR couldn't care less about a lost piece of data, but my computer cares. The computer has no idea it's video content either.

I need an actual model number of the PCIE card, trust me, it matters. Some are not compatible with the newer versions of FreeNAS. I use a PCIE card as well but I had to ensure compatibility and while it was good for FreeNAS 8.x, I have been hoping it would remain good for as long as I have my machine. One day I will not be able to upgrade to a newer version of FreeNAS due to hardware limitations but that is okay and I expect it.

I have exactly the same card in my other PC:
upload_2018-7-9_13-38-5.png


@joeschmuck wrote:
My advice is to see if ada0 and ada6 are on the PCIE card and exactly which ports they are connected to. Maybe it would be best to tell me what ports all your hard drives are connected to, and post a copy of the dmesg output (also found in the logs: cd /var/log and look for the files called dmesg.today and dmesg.yesterday).

OK, I've attached dmesg.today; it's the only one there.
 

Attachments

  • dmesg.txt
    36.6 KB · Views: 298

vatastala

Dabbler
Joined
Oct 9, 2014
Messages
20
@joeschmuck

You can find attached the results of the long smart test.

The strangest thing is that now I'm in a state where the resilvering is complete:

upload_2018-7-10_8-18-30.png


Everything is OK, except that if I reboot, the resilvering restarts. And when I copy some movies from the NAS to the PC as a test, FreeNAS reads only from the other disks (for example ada1) and not from ada0:

upload_2018-7-10_8-19-58.png


So I think I'm in a state where ada0 is completely lost, which is why I have a non-zero checksum count after resilvering, but the data is not lost because it is on the other 5 disks. I checked many, many files and they are good. Anyway, since I have a backup, I can recreate my pool from scratch.

Do you agree with this?
 

Attachments

  • smartctl_ada0.txt
    14.5 KB · Views: 291
  • smartctl_ada6.txt
    15 KB · Views: 256

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Sorry for not getting back to you sooner, I just bought me a 2018 Miata RF GT, been quite involved with it.


To answer your question, because you do have a backup then I would do the following, again this is what I would do, it's not just a recommendation...

1. Power down.
2. Run badblocks on all of the drives simultaneously. You could run it on just ada0, however it only takes a little longer to do them all at once, and then you have that concern off the table if they all pass.
3. If badblocks passes then I'd swap drive ada0 with one of the WD drives and see if the problem follows the hard drive or sticks with the port. Also, leave the SATA cables connected to the ports; you only want to move the hard drives and make as few changes as possible while troubleshooting a problem like this.
4. Run a scrub to check how the drives are performing.
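
The badblocks and scrub steps could look roughly like the sketch below. It only prints the commands, because badblocks -w overwrites the whole drive and should only be run once the backup is verified and the drives are out of the pool. burn_in_plan is a hypothetical helper name, "tank" is a placeholder pool name, and the 4096-byte block size is an assumption chosen for these 3 TB drives.

```shell
# Print (do not run) the commands for a destructive badblocks pass on
# each named drive, followed by a scrub of the rebuilt pool.
# badblocks -w ERASES the drive, so this sketch only echoes the commands.
burn_in_plan() {
    pool=$1
    shift
    for dev in "$@"; do
        echo "badblocks -ws -b 4096 /dev/$dev"
    done
    echo "zpool scrub $pool"
    echo "zpool status -v $pool"
}

# Example ("tank" is a placeholder pool name): burn_in_plan tank ada0 ada1
```

To test the drives simultaneously, each printed badblocks command would be started in its own terminal session; the scrub commands only make sense after the pool has been recreated.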

Unfortunately I can't tell you that it's a SATA cable failure, you don't have the typical failure indication for that.

Post your results.
 