FreeNAS 9.10.1 Continuous Reboot, "Fatal trap 12" RAIDZ2 Volume

Status
Not open for further replies.

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
A few things. Your last two drives had a SMART short test run, probably right after they were installed, and that was it; none of the other drives have been tested. As previously indicated, run a SMART short test and review the results. Read the troubleshooting guides for the hard drive ID info. Run a SMART long test after that.

Second, SMART testing is not set up by default. This is not Windoze, and while it does look like a nicely finished program, there is a certain level of setup required. For instance, do you have your automatic emails set up? I suspect not, since you don't have SMART testing set up. Make sure you set up the emails because they are crucial when a failure occurs.

Third, you should not be using the "-f" parameter unless you are just asking to accidentally corrupt your pool.

Thank you for the info. I did set up the automated emails, but was not alerted to anything prior to the pool malfunction. If the SMART tests were not being run, I suppose I wouldn't have been alerted. I still feel that it likely has something to do with the multiple, rapid reboots due to the watchdog feature of the Supermicro board.

Also, I only attempted the -f option because I saw it suggested in another forum post and FreeNAS specifically said that the pool was owned by another machine or something similar and the only way to proceed with the import was to force it. I will definitely keep this caution in mind.
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
As suggested earlier in the thread, I have been running memtest for about 50 hours now with 0 errors, so I do not believe that it is a memory fault.
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
So after manually importing the pool through the command line, I get the following output from zpool status, which looks very promising. However, the pool is not showing up in the GUI. Is it just not mounted? I am not very familiar with mounting ZFS pools from the command line. Any help would be greatly appreciated.

Code:
[root@freenas ~]# zpool status                                                                                                      
  pool: fNASRaidZ2Vol                                                                                                              
 state: ONLINE                                                                                                                      
  scan: scrub repaired 0 in 22h43m with 0 errors on Fri Aug 12 04:43:16 2016                                                        
config:                                                                                                                            
                                                                                                                                   
        NAME                                            STATE     READ WRITE CKSUM                                                  
        fNASRaidZ2Vol                                   ONLINE       0     0     0                                                  
          raidz2-0                                      ONLINE       0     0     0                                                  
            gptid/28b545ba-c5c9-11e4-80b7-0015175bfcfe  ONLINE       0     0     0                                                  
            gptid/97d26f9e-c647-11e4-8c87-0015175bfcfe  ONLINE       0     0     0                                                  
            gptid/e117af57-c6b4-11e4-8c87-0015175bfcfe  ONLINE       0     0     0                                                  
            gptid/4cb12cc0-c923-11e4-bce5-0015175bfcfe  ONLINE       0     0     0                                                  
            gptid/06f2f984-c299-11e4-a2a6-0015175bfcfe  ONLINE       0     0     0                                                  
            gptid/78533bce-c526-11e4-b05b-0015175bfcfe  ONLINE       0     0     0
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
What about trying to scrub this pool? I'm hesitant to do anything that will change the data on these drives until I get it figured out. From what I've read, it seems that the volume should be mounted when imported, but it is not showing up in /mnt. Would a scrub help at all, or am I just grasping at straws? I did try an export and then import again with no change.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
You are terrible at following directions! Also, Windows never runs a single SMART test, ever. You have to install a program and run them manually. Lastly, follow these steps.

1. Run SMART tests on all your drives manually. Short, then long if possible. Provide the results from each.
2. Get a new SanDisk Cruzer Fit USB device for your OS.
3. Export your pool and use the GUI to import it. If you don't use the GUI, it doesn't show up in the GUI. Pretty simple and straightforward.
4. Don't run a scrub; it doesn't do anything to help import your pool or make it appear in the GUI.
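
For step 1, a minimal sketch of how the tests could be started from the shell. The device names ada0 through ada5 are an assumption (only ada0 to ada4 appear in output posted in this thread), and the helper names are made up for illustration. The functions only print the smartctl commands so they can be checked before anything is run.

```shell
#!/bin/sh
# ASSUMPTION: six pool disks named ada0..ada5; check your own device list.
DRIVES="ada0 ada1 ada2 ada3 ada4 ada5"

# Print (not run) the smartctl command that starts a self-test on each drive.
# $1 is the test type: "short" or "long".
smart_test_cmds() {
    for d in $DRIVES; do
        echo "smartctl -t $1 /dev/$d"
    done
}

# Print (not run) the command that dumps the full SMART report per drive.
smart_report_cmds() {
    for d in $DRIVES; do
        echo "smartctl -a /dev/$d"
    done
}

smart_test_cmds short
```

Once the printed commands look right, they can be piped to sh (for example `smart_test_cmds long | sh`), and `smart_report_cmds | sh` collects the results after the tests have had time to finish.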

Sent from my Nexus 5X using Tapatalk
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
1. I have run short tests on all drives. I will post results shortly, but no errors were reported. Long tests will be next if needed.
2. I have not replaced the Cruzer, as it seems to be working just fine, but I will make sure to do that to cross it off the list as a possibility.
3. I was not aware of this, but it makes sense. Thanks for the clarification. Problem is, when I import through the GUI, it gets stuck on "step 2" and the server reboots within seconds. It doesn't appear to accomplish anything. Importing through the command line at least allows me to see the pool in "zpool status" and it appears to be healthy there.
4. Understood

Appreciate the advice very much, but there is no need to be rude.

Thanks again
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
Code:
[root@freenas ~]# smartctl -l selftest /dev/ada0                                                                                    
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)                                                            
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     12654         -                                                      
                                                                                                                                   
[root@freenas ~]# smartctl -l selftest /dev/ada1                                                                                    
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)                                                            
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     12639         -                                                      
# 2  Short offline       Completed without error       00%        54         -                                                      
# 3  Short offline       Completed without error       00%        12         -                                                      
                                                                                                                                   
[root@freenas ~]# smartctl -l selftest /dev/ada2                                                                                    
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)                                                            
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     12626         -                                                      
                                                                                                                                   
[root@freenas ~]# smartctl -l selftest /dev/ada3                                                                                    
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)                                                            
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     12673         -                                                      
                                                                                                                                   
[root@freenas ~]# smartctl -l selftest /dev/ada4                                                                                    
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)                                                            
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                            
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%     12749         -                                                      
# 2  Short offline       Completed without error       00%         3         -
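
The logs above only contain short tests. A small sketch for checking that quickly on a saved log: it counts the cleanly completed tests by type. The function name is made up for illustration, and the sample input is the ada1 log from this post.

```shell
#!/bin/sh
# Count self-tests that finished cleanly, split by type.
# Reads the text printed by `smartctl -l selftest` on stdin.
count_selftests() {
    awk '
        /^# *[0-9]+ / && /Completed without error/ {
            if ($0 ~ /Short offline/) s++
            else if ($0 ~ /Extended offline/) e++
        }
        END { printf "short=%d extended=%d\n", s, e }
    '
}

# Sample: the ada1 log lines posted above.
count_selftests <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12639         -
# 2  Short offline       Completed without error       00%        54         -
# 3  Short offline       Completed without error       00%        12         -
EOF
```

On a live system the same function could be fed directly, for example `smartctl -l selftest /dev/ada1 | count_selftests`.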
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
I've attached a short video of the console output while attempting to import through the GUI. This occurs after initiating the import during "step 2" of the process. The machine reboots and the GUI reloads as if nothing was done. It is obviously reporting a CRC error with ada0, but I'm not sure exactly what to make of it. What are the main differences between the GUI import and command line import? Are there additional options and/or switches that could be used from the command line? I do not have another USB drive handy at the moment, but that is the next step. Does this output shed any light on the situation? I wouldn't normally be so persistent. There are just some files on this pool that I need for work that would be much more convenient to get if I could get it working. I just feel like I am missing something very simple.
 

Attachments

  • zpoolimportcapture.zip
    7.8 MB · Views: 329

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
smartctl -a /dev/ada0 please

Persistence is good!


Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
Here is the latest for ada0.

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4EDUYFEZ9
LU WWN Device Id: 5 0014ee 2602992e3
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Aug 30 09:20:00 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (53460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 534) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   203   175   021    Pre-fail  Always       -       6833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       176
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12659
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       176
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       127
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       358
194 Temperature_Celsius     0x0022   118   102   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12657         -
# 2  Extended offline    Interrupted (host reset)      90%     12656         -
# 3  Extended offline    Interrupted (host reset)      90%     12656         -
# 4  Short offline       Completed without error       00%     12654         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
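
A report this long is easy to misread, so here is a small sketch that pulls out a few attributes and prints them only when the raw value is above zero. The choice of attributes 5, 197, 198 and 199 is an assumption about which ones matter most here, and the function name is made up for illustration. Attribute 199 (UDMA_CRC_Error_Count) counts transfer errors on the link between drive and controller, so a non-zero value commonly points at cabling or connectors rather than the platters.

```shell
#!/bin/sh
# Print attributes 5, 197, 198 and 199 when their raw value is above zero.
# Reads `smartctl -a` (or `smartctl -A`) output on stdin.
flag_smart_attrs() {
    awk '
        $1 ~ /^(5|197|198|199)$/ && $3 ~ /^0x/ && $10 + 0 > 0 {
            printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}

# Sample: four rows copied from the ada0 report above.
flag_smart_attrs <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
EOF
```

For the ada0 rows this prints only the UDMA_CRC_Error_Count line, since the other three raw values are zero.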
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
Could it possibly be as simple as just pulling that drive? Would it import in a degraded state?

I did notice that if I went in through WinSCP after manually importing from the command line, I could see that the volume had mounted in the root at "/fNASRaidZ2Vol" but appeared to be completely empty.

Furthermore, I was able to find the pool in /dev and found what looked to be a bunch of zvols, with 0 bytes and no apparent structure.

I was also able to view the preexisting directory structure of my pool with "zfs list".

What about trying to import it through the command line and roll back to a recent snapshot?

Let me know what you guys think.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Please stop being a minimalist and provide all the data. You are driving me nuts by not providing all the data requested. If you are not going to take this seriously then why should anyone else.

Post the SMART output for all of your drives, not just ada0 (you already posted that one, so you don't need to do it again). It appears you are done running Memtest on your system, so now kick off a SMART Long Test on all the drives and wait the required time period, then post a SMART output for all those drives.

Why are we asking you to go through these motions, you ask? It is to rule out any of the hard drives as the cause of your issues.

If you want to keep forcing the import of your pool, have at it, it's your data at risk.

I don't normally get upset but honestly these folks are trying to help you and I just don't feel like you care a whole lot and just want to do your own thing. That is fine, just don't ask for help if you are not going to take our advice.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I stopped responding because of this exact reason. I don't think we can help.

 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
Look guys, I'm just here to get some advice from folks who know FreeNAS, FreeBSD, and ZFS better than the vast majority. I know my way around information technologies, but FreeBSD and ZFS are still a bit fresh for me, so why not ask the experienced guys? I have enjoyed being able to converse with like-minded individuals who I can learn something from. Honestly, until just a few days ago, I was just a forum lurker. I very rarely make any posts at all.

Obviously I have hit a nerve with at least a couple of you, and that was never my intention, of course. I've been up for almost three days straight because I'm about to lose 24TB worth of data that is just within arm's reach (and no, like I said before, I do have backups). The point is, I started this NAS project years ago as a hobby, and it's always been such a rock solid system. I know how robust ZFS is. I know that my data is still on those disks. It's just a matter of understanding how it all works together to snap back into place. I never put all of my eggs in one basket, but it's still a bit unsettling to think that that much information could just go "poof", so I may have been a little more on edge than usual. I just spent several thousand dollars upgrading my hardware over the last year, and now that I'm not using obsolete, consumer-grade, hand-me-down hardware, this happens. Not trying to whine. I know it happens to everyone. Just a bit frustrated.

With that said, if there's information that has been requested and I have not provided it, sometimes twice, that's news to me. Post #14 on page 1 has the smartctl output for every one of my physical disks in the pool. Shortly thereafter, I took a 48 hour break to allow memtest to run, just as was suggested to me. I have tried to run long SMART tests, but they keep failing. It's like Schrödinger's cat: if I try to observe the process, it kills itself, apparently. Every long test result looks exactly like this. I'm not resetting the host every time one of these tests gets 10% complete, I can assure you.

Code:
# 3  Extended offline    Interrupted (host reset)      90%     12641         -
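
For anyone reading a line like that one: the percentage is the Remaining column, so 90% means the test was interrupted early, not close to the end. A tiny sketch that spells this out (the function name is made up for illustration):

```shell
#!/bin/sh
# Convert the "Remaining" column of a self-test log line into progress made.
# Reads `smartctl -l selftest` output on stdin.
selftest_progress() {
    awk '
        /^# *[0-9]+ / {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    rem = $i + 0          # "90%" becomes 90
                    printf "%d%% remaining, so roughly %d%% completed\n", rem, 100 - rem
                }
            }
        }
    '
}

selftest_progress <<'EOF'
# 3  Extended offline    Interrupted (host reset)      90%     12641         -
EOF
```

The "roughly" is deliberate: the figure is only as precise as what the drive reports in that column.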


I've thrown out several outside-the-box suggestions to get opinions, and I've learned what not to do from a lot of those suggestions. Isn't that why we are all here? I imagine I'll stick around this board for some time, and I'd be happy to pass along any knowledge that I'm gaining if I saw a question that I could answer.

I don't take you guys for the anonymous morons that like to be jerks to people for no reason, like a lot of message boards and other social media around the internet. I was just a little caught off guard by the hostility. I guess I should go back and reread my posts. Maybe it will make more sense to me.

Anyway, might as well finish this novel by talking about the subject at hand. I've been working on it all evening and I believe I'm making some progress and narrowing down the issues. I exported the volume and took the drives to another machine. The CRC/Parity errors went completely away and the pool started resilvering. I still can't get it to import through the GUI, though. It just reboots with that very fast scrolling text from the video I posted. When I do a manual import, it looks like everything is there and correct as far as the directory structure, but there are just no files and nothing that reports any data. I haven't had any luck mounting a snapshot either, but I need to do a little more reading into that. I bought some brand new round SATA cables a few months ago and I'm beginning to wonder if they were just no good. I'm trying to get the pool as healthy as I can in this other box. In the meantime, I've got a couple more 4TB Reds on the way, as well as an IBM M1015 SAS/SATA controller for additional off-board ports and, of course, some SATA cables.

I do appreciate all of the advice. If there is any further information that I could provide that I missed, I would be happy to do so. You guys are doing me the favor here and have been helpful.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
If you want us to help, we will help, but just follow our instructions. You are poking the bear and are apt to lose your 24TB of data because you are doing things based on FreeBSD forums, and you are putting your data at risk. You apparently speak English and you write in English well, so I don't see a language barrier issue.

Please read this thread from beginning to end again and maybe you will see that there is some data which you have not provided yet. I for one would like to see the results of the SMART long test on all six of your drives, the complete output of "smartctl -a /dev/adaX" (where X = the drive number).

So how do you run a SMART test when FreeNAS keeps causing things to stop? Well you have a few ways to do this:
1) If you have the drives set up to sleep then disable the sleeping, although I wouldn't expect the drives to sleep during a SMART test but weirder things have happened.
2) Boot something like the UBCD and run the SMART tests.
3) Boot Ubuntu Live and run the SMART tests.

If you really cannot get a hard drive to complete a SMART test then something is wrong. You need to explain why you think the test is being aborted. Normally it's aborted due to power being shut down. Maybe you have a bad power supply? It's difficult to troubleshoot a problem remotely.
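
It helps to know how long the machine has to stay up for a long test to finish. The ada0 report earlier in the thread lists a recommended polling time of 534 minutes for the extended test. A minimal sketch that extracts that figure from a saved report and converts it to hours and minutes (the function name is made up for illustration):

```shell
#!/bin/sh
# Pull the drive's own estimate for an extended (long) self-test out of
# `smartctl -a` output and print it in hours and minutes.
long_test_duration() {
    awk '
        /^Extended self-test routine/ { want = 1; next }
        want {
            gsub(/[^0-9]/, "")            # keep only the digits, e.g. "534"
            m = $0 + 0
            printf "about %d minutes (%dh%02dm)\n", m, int(m / 60), m % 60
            want = 0
        }
    '
}

# Sample: lines copied from the ada0 report posted earlier.
long_test_duration <<'EOF'
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 534) minutes.
EOF
```

For these drives that works out to just under nine hours, so any reboot or power interruption inside that window would abort the test.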
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
These panics usually mean something has been corrupted that must not be corrupted.

If you are lucky, it is happening in real-time, and moving the pool to other hardware will make it disappear. It is not the drives; problems there would show up as expected and not panic the system.

If you are unlucky, the corruption happened in the past (in RAM, for instance), and was then written into the pool metadata as if it were valid. Enough time goes by, the corruption surfaces, and here you are. A scrub likely does not detect this kind of corruption, although you would expect to see checksum errors on scrubs over time in a system that is doing this.

In many cases, the pool can be mounted readonly=on, and data off-loaded. That usually works for spacemap corruption. Although you didn't list a full backtrace, it looks like your panic is in ZAP, which could fail in readonly mode as well.
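
A minimal sketch of what a read-only import could look like from the shell. The pool name comes from the zpool status output earlier in the thread; using /mnt as the alternate root is an assumption based on where FreeNAS normally mounts pools, and the function name is made up for illustration. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# ASSUMPTIONS: pool name from the zpool status output earlier in the thread;
# /mnt as the alternate root, which is where FreeNAS normally mounts pools.
POOL="fNASRaidZ2Vol"
ALTROOT="/mnt"

# Print (not run) the commands for a read-only import and a mount check.
readonly_import_cmds() {
    echo "zpool import -o readonly=on -R $ALTROOT $POOL"
    echo "zfs list -r -o name,used,mounted,mountpoint $POOL"
}

readonly_import_cmds
```

Without an alternate root, a plain `zpool import` would be expected to mount the pool at /fNASRaidZ2Vol, which may be why a manual import does not show up under /mnt. If the import still complains that the pool was last used by another system, -f would be needed as well, with the caution already given in this thread.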
 

Joseph Sharbutt

Dabbler
Joined
Apr 12, 2014
Messages
28
Ok, there is finally some light at the end of this tunnel. I noticed last night when I had the pool in a different machine, it resilvered and then began to do a scrub. I noticed while browsing around the console and winscp, some of my files just started to appear, where there was nothing but empty directories before. I was fairly certain that they were still there in snapshots and such, due to the way ZFS works. I was a little worried that the scrub might get confused and wipe out any structure that may have helped me recover the files, but I just let it go for a while when I started seeing positive results. Also, as I mentioned before, I was no longer seeing the CRC/Parity errors. I let that run most of the night and changed out all of my SATA cables in my server with brand new ones.

I got the server put back together this morning, and after some really strange issues trying to do a fresh install, and several attempts playing musical chairs with half a dozen USB drives, I got FreeNAS installed. This wasn't because my USB drives were bad. I did thorough tests on all of them. Something was happening mid install that was corrupting the partitions and if I tried again, it would refuse to work with that drive until I formatted it. It was even referring to the devices differently on the next boot, from Cruzer, PNY, etc. to something like "USB Block Device". Probably had to do with the partition scheme or something; I'd just never seen it.

There also seemed to be some DNS information stuck somewhere as well. Even after several half installs, when I finally got it working, it was throwing a bunch of DNS errors for all six NICs that I have in the machine, saying that they were already in use and refusing to take an IP, even when entered statically. Finally after fiddling around with ifconfig and dhclient for awhile, I was able to clear the errors and get the GUI launched to try to import the pool again. Still no luck there. Same thing as before. It would reboot immediately after the import began and scroll a ton of information extremely fast across the console just before it restarted.

I know I'm getting a bit long winded, but I've also started noticing some "GEOM_MIRROR: Device Swap" errors this morning, in both machines. I assume it's because the swap drives were being recreated in an attempt to rebuild the cache data that likely caused me all the problems in the first place.


Anyway, all I really got on here to say was thanks for the suggestion that I try to mount the pool as read only. Once I did that, although I still cannot access it from the GUI, I have been able to sftp in. I am currently downloading any file that it will let me have.

If I'm able to get the important stuff, I'm sure I'll just blow the pool away and start over. I do really like the six disk RaidZ2 setup, but that is a lot of eggs in one basket. Guess I'll just have to build another six disk box so I can have a replication partner.



Again, thanks for the advice. It's been a great learning experience.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
How are you creating/installing FreeNAS on your USB Flash Drives? Are you booting from the ISO image or trying to manually install an image to the USB directly? If you are not using the CD to boot from and install the software, might I suggest you try that. Also, you can use any computer to do this but you need to ensure you choose the correct drive to install to. There are horror stories of people installing to a windoze disk and they end up very unhappy.
 