SOLVED Critical Alert System: how to cure an error in zpool

Status
Not open for further replies.

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
First my hardware list:

Mobo = Asus Sabertooth Z77
CPU = Intel i7-3770K
Ram = 32GB DDR3 non-ECC (mobo does not support ECC memory)
Hdd = 3x2x8TB Seagate IronWolf (3 mirrors in one pool)
Hdd controller = Internal Intel Z77 + 2x ASMedia 1061 SATA controller
Nic = Intel 82579V Gigabit LAN controller
FreeNAS = 9.10.2-U1

In my FreeNAS box the Critical button suddenly blinks red with a Warning:
upload_2017-7-3_15-58-16.png

I checked a few Forum posts and followed the general advice to use
Code:
zpool status -xv


That indicated that only one of my files was corrupted, so I deleted it and copied a good replacement from my backup.

The Critical button still blinks red. I ran "zpool status -xv" again and got a different message:
Code:
[root@Edgehill ~]# zpool status -xv
  pool: i7-3770K
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 8h26m with 0 errors on Sun Jun  4 08:27:00 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        i7-3770K                                        ONLINE       0     0     1
          mirror-0                                      ONLINE       0     0     2
            gptid/0866c0a3-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     2
            gptid/092c3eef-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     2
          mirror-1                                      ONLINE       0     0     0
            gptid/09f3c8b7-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     0
            gptid/0ab8969a-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/0b8834e8-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     1
            gptid/0ce30578-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        i7-3770K:<0x13572>


How can I identify this file in the zpool? And is such a file repairable?

It seems my mirror-0 is in trouble with CKSUM=2. The hard drives are only 4 to 5 months old. Under such circumstances, would a reboot be recommended?
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
saikee said: "The hard drives are only 4 to 5 months old."

Did you burn in those disks? I'd start by checking the SMART information for each of them and, if any looks bad, initiate a replacement since they are new.
 

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
My zpool is rarely used. If the disks were busy, it was because I had to rebuild the PlexMediaServer plugin several times. However, the 14TB of source data is hardly ever touched or written to. The jail with PlexMediaServer is over 10GB because of thousands of photos.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Well, unfortunately you have a system without ECC RAM; you have taken a risk that you could be paying for now.

Your last scrub was on 4 June. That was a month ago and you are only just noticing? Do you have email notifications set up?

Have you had any power issues, reboots or other problems? Hopefully you have a UPS at least to protect against power issues. I'm looking for why you have corrupt files in the first place. It could be a RAM issue but it's difficult to prove it.

0) If you haven't already, back up your data.
1) Run MemTest86 on your RAM and ensure you get at least 1 clean pass. If it fails then stop and troubleshoot.
2) As @melloa stated, check your SMART data.
3) I'd run a scrub next to see how much of your data is intact.
4) Last, if you had no errors in your SMART data, run a SMART Long Test on all your drives to ensure they are okay and not throwing errors.
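
For steps 2 to 4, something along these lines could help. Treat it as a rough sketch rather than a tested procedure. The `cksum_errors` helper is a made-up name that simply filters `zpool status` output for rows with a non-zero CKSUM count, and the device name `/dev/ada0` in the comments is an assumption (on FreeNAS, `glabel status` should show which device each gptid label belongs to).

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): reads `zpool status` output on stdin
# and prints the name and CKSUM count of every row whose CKSUM column is non-zero.
cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM"   { in_table = 1; next }  # header row of the config table
        in_table && NF == 0              { in_table = 0 }        # a blank line ends the table
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Intended use on the NAS itself (pool name as posted earlier in this thread):
#   zpool status -v i7-3770K | cksum_errors
#
# SMART checks; /dev/ada0 is an assumed device name, repeat for each disk:
#   smartctl -a /dev/ada0         # attributes and error log
#   smartctl -t long /dev/ada0    # start a long self-test, read the result later with -a
#
# Start a scrub and watch its progress:
#   zpool scrub i7-3770K
#   zpool status -v i7-3770K

# Quick self-check using rows copied from the status output posted in this thread:
cksum_errors <<'EOF'
        NAME                                            STATE     READ WRITE CKSUM
        i7-3770K                                        ONLINE       0     0     1
          mirror-1                                      ONLINE       0     0     0
            gptid/0b8834e8-e89d-11e6-84a6-3085a995a374  ONLINE       0     0     1
EOF
```

The self-check should list the pool row and the one gptid row, since those are the only rows with a non-zero CKSUM value.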
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Since you've had a lot of weird issues with your pool, culminating with the current metadata corruption, I highly recommend that you:
  1. Destroy the pool
  2. Recreate it on a system with ECC RAM (at the very least, aggressively burn in your system)
 

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
I followed joeschmuck's suggestion and ran Memtest on the memory. After 30 hours and 8 passes, 5 errors were reported at two addresses. I need to take the memory out and test one stick at a time to find which one is faulty, or whether the slot is bad too.

I am new to ZFS and wonder if it has a limit on the length of a filename. I have been cutting the lengths down, as I found that several thousand filenames are longer than 400 characters due to the growth of the directory tree.

I am aware that I am using normal PC components whereas the forum recommends server-grade hardware. Changing the RAM, then the motherboard and possibly the CPU too, is a substantial investment. My FreeNAS is not mission critical as it serves only a small number of friends and family members. Nevertheless it is good to know how the equipment is specified and for what reason.

My zpool was assembled with offline HDDs. I was hoping ZFS would be reliable enough for me to use as the central permanent storage for all my data.

The file singled out by FreeNAS as having an error is a video file, which seems to work whenever I play it.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
There is nothing wrong with FreeNAS or ZFS. In fact ZFS did its job exactly as it is supposed to and warned you that you had corrupt data.

Unless you enjoy chasing your tail you should replace your dubious hardware with proper hardware. There's a hardware recommendation guide in the resources section at the top of the page to get you going in the right direction.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
As for the RAM test failure, the failure may not even be in the RAM itself so my advice is this:

1) Ensure you have compatible RAM for your motherboard.
2) Ensure the RAM speed and voltage are set properly on the motherboard for the RAM you have.
3) If you have high speed RAM, try to run it at a normal clock speed and retest.
4) Try one stick at a time and retest.
If you cannot isolate the failure to a single stick of RAM, then look at the power supply, motherboard, and CPU, in that order. At least, that is the order I'd do it in.
 

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
Jailer said:
"There is nothing wrong with FreeNAS or ZFS. In fact ZFS did its job exactly as it is supposed to and warned you that you had corrupt data.

Unless you enjoy chasing your tail you should replace your dubious hardware with proper hardware. There's a hardware recommendation guide in the resources section at the top of the page to get you going in the right direction."

I didn't purchase components to assemble my FreeNAS. I simply took one of my spare PCs and loaded FreeNAS onto it as a pilot scheme. Having found FreeNAS working so well, I installed it on my second existing PC, the one with the best specification. Both have been working perfectly for months. The one giving me the alert is the high-spec PC.

One of the best ways to learn is from one's own mistakes. My long-term plan is to build a FreeNAS box with the recommended hardware. I would still like to get to the bottom of the shortcomings of my existing PC hardware so as to avoid them in my next move.
 

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
joeschmuck said:
"As for the RAM test failure, the failure may not even be in the RAM itself so my advice is this:

1) Ensure you have compatible RAM for your motherboard.
2) Ensure the RAM speed is set proper and the voltage is proper on the motherboard for the RAM you have.
3) If you have high speed RAM, try to run it at a normal clock speed and retest.
4) Try one stick at a time and retest.
If you cannot isolate the failure to a single stick of RAM then look at the power supply, motherboard, and CPU, in that order, well that is the order I'd do it."

I never overclock, and I always buy RAM at the highest normal (non-OC) speed recommended by the motherboard vendor. This is the first time I have had a RAM error, but I never had a need to test it in the past.

The RAM in question is powered by a Corsair TM850M PSU and the box itself is connected to a 1200 Watt UPS. I would have gone for ECC RAM if it were widely available and not ruled out by the motherboard. Mind you, the PC is a few years old and wasn't originally assembled with FreeNAS in mind.
 

saikee

Explorer
Joined
Feb 7, 2017
Messages
77
Latest situation

30 hours Memtest result:
upload_2017-7-5_11-31-51.png


9 hour scrub
upload_2017-7-5_11-32-56.png


Twenty-six minutes before the scrub finished, the alert button was still blinking red with the error showing. After the scrub finished with no errors reported, the alert button went green, showing "The system has no alerts". Does this mean my zpool is all right now? I did replace the only erroneous file reported by "zpool status -v" before scrubbing the pool.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, bad RAM begins to explain your recent issues.

I wouldn't trust that pool, though, not after metadata corruption.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
My advice is to drop your RAM speed down to DDR3-1600 and retest. Since it appears only Test 7 is failing, and only at a specific memory location, you could run only Test 7 on that specific memory range if you want to stress that area. I would still suggest running MemTest86 on all your RAM since you changed the RAM speed. If your RAM fails then you will need to isolate which stick is failing and replace it.

I can't say your data is intact; you should back up everything and manually check the files. I doubt you would check each photo, but a random sampling wouldn't be a bad idea.
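
One rough way to do the manual check against a backup is a byte-for-byte comparison. The sketch below assumes the backup is available as a plain directory tree that mirrors the pool. The function name `verify_against_backup` and both paths in the usage comment are placeholders I made up. Note that a mismatch only says the two copies differ, not which one is good, and anything run on a machine with failing RAM is worth rechecking once the memory is sorted out.

```shell
#!/bin/sh
# Hypothetical helper: report files under SRC that are missing from, or differ
# from, the copy under BACKUP. Filenames containing newlines are not handled.
verify_against_backup() {
    src=$1
    backup=$2
    # List files relative to SRC so the same relative path can be looked up in BACKUP.
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        if [ ! -f "$backup/$rel" ]; then
            echo "MISSING $rel"
        elif ! cmp -s "$src/$rel" "$backup/$rel"; then
            echo "DIFFERS $rel"
        fi
    done
}

# Example call; both paths are assumptions, adjust them to your own layout:
#   verify_against_backup /mnt/i7-3770K/photos /mnt/backup/photos
```

To sample rather than check everything, the same function can be pointed at a handful of subdirectories instead of the whole dataset.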
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
saikee said:
"Latest situation

30 hours Memtest result:
View attachment 19299

9 hour scrub
View attachment 19300

Twenty-six minutes before the scrub finished, the alert button was still blinking red with the error showing. After the scrub finished with no errors reported, the alert button went green, showing 'The system has no alerts'. Does this mean my zpool is all right now? I did replace the only erroneous file reported by 'zpool status -v' before scrubbing the pool."

If the scrub says the pool is good, then it probably is.

Fix your broken hardware and be happy that ZFS worked like a trooper.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Stux said: "If the scrub says the pool is good, then it probably is."

But it didn't:
Code:
errors: Permanent errors have been detected in the following files:

        i7-3770K:<0x13572>
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Jailer said:
"But it didn't: errors: Permanent errors have been detected in the following files: i7-3770K:<0x13572>"

The latest scrub says "no known data errors". The inconsistency is explained by the faulty RAM.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
"That permanent error occurred before I scrubbed the pool."

And if it's no longer reporting as an error after the last scrub, the 'permanent' error is gone.
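
For anyone who wants to script that check, a minimal sketch follows. It assumes the usual `errors: No known data errors` line that `zpool status` prints for a clean pool. The function name `no_data_errors` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the `zpool status` text on stdin
# contains the "No known data errors" line.
no_data_errors() {
    grep -q '^errors: No known data errors'
}

# Intended use on the NAS itself (pool name as posted in this thread):
#   zpool status -v i7-3770K | no_data_errors && echo "clean"
#
# `zpool status -x` is a shorter alternative: it prints "all pools are healthy"
# when there is nothing to report.

# Quick self-check with a canned line:
if printf '%s\n' 'errors: No known data errors' | no_data_errors; then
    echo "clean"
else
    echo "errors still listed"
fi
```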
 