SOLVED Resilver every reboot

ajschot · May 23, 2017

Guys, I need help; I am getting sick of this FN11 problem.
I was on Corral and FN10 for a long time without any problems.
Because everybody was saying a switch was needed (I think FN9 fanboys), I went to FN11.
So far so good. I knew there was one drive with SMART problems, so I needed to change that; because of all the bad things about Corral I did this after the switch (the SMART errors were not data-error related, only seek time).
So I took da6 offline, put the new disk in and started the resilver.
After the resilver I got 13 data errors, and it kept giving me this same error, so I thought that was a problem and just rebooted FN11, because I had also added a tunable. And it started to resilver again, but was writing to another hard drive (da7).
After that resilver, the same 13 data errors, and now after an update FN has started to resilver again!!!! This time da7 again.
Could it be this disk is also dying? They are Seagate disks; I bought them both 5 months ago (always having problems with Seagate... in the past, but I read good things about these, so that is why I bought them...)

But what is the best and safest way, because a resilver is really intensive, right?
I need to stop this happening after every boot. I don't have a spare disk right now; I am waiting on the replacement from Seagate.
What is the best way? Take da7 offline?

SweetAndLow · May 23, 2017

Detailed hardware specs? Pool layout? Smart output?

You did nothing in your post but tell a story. You need to give data so people can help you.


ajschot · May 24, 2017

SweetAndLow said:
Detailed hardware specs? Pool layout? Smart output?

You did nothing in your post but tell a story. You need to give data so people can help you.


Like in my avatar:
Xeon E5-2658v4 with 96 GB of ECC Reg. RAM in an ESXi 6.5 host.
FreeNAS 11-RC3 runs with 20 cores and 64 GB RAM.
I use 8x 4 TB hard disks from different brands (2x WD Red, 3x Seagate Barracuda, 3x Toshiba MD04) on a Dell 6 Gbps 9211-8i IT SAS controller (firmware P20, device passthrough).
Zpool of RAIDZ2.

At this moment I am running a long SMART test on disk 8 (da7), which has been resilvering on every boot since I replaced disk 7 (da6).

I checked the cables and they seem to be fine.
I don't understand what is happening; after the resilver I get this message in my alerts:

Code:

CRITICAL: May 24, 2017 11:15 - The status of volume Data is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

When checking the status it says: resilver completed, 13 errors.

And at this moment I am running a long SMART test (smartctl -t long /dev/da7).
So I have to wait and see; here is my report so far (30% of the SMART self-test completed):

Code:


=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA400
Serial Number:    Y6J7KYDWFSAA
LU WWN Device Id: 5 000039 76ba0323b
Firmware Version: FP2A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 24 12:17:54 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 475) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       6725
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       42 (Min/Max 26/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       23
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       637
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I know the temperature is high, but that is always the case with these Toshiba disks; in my old NAS I also had 2 of these and they always ran at 45-47 degrees C. I never had any problems with those. The ones in my FreeNAS are newer.
Maybe something else went wrong during the first resilver of da6?

I read all about problems with resilvering in older FN versions. Would it help if I put da7 offline, then scrub my pool, and when that is finished bring it back online and let it resilver?
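
For readers who want to check a report like the one above without reading every row, here is a minimal sketch that scans the `smartctl` attribute table for non-zero raw values on the attributes that usually point at a failing disk or cable. The watch list and the `flag_smart_attributes` helper are assumptions made for illustration; they are not part of smartmontools.

```python
import re

# Attributes whose raw value should normally stay at 0 on a healthy disk.
# This watch list is an assumption for illustration, not an official one.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the first token of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def flag_smart_attributes(smart_text):
    """Return {name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, blank lines, other sections
        name, raw = match.group(2), int(match.group(9))
        if name in WATCHED and raw != 0:
            flagged[name] = raw
    return flagged


# A few rows copied from the report above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       42 (Min/Max 26/43)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
"""
print(flag_smart_attributes(sample))
```

On the rows above this finds nothing to flag, which matches the clean report. A clean SMART table only says the drive has not noticed a problem itself; it does not rule out checksum errors detected by ZFS.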

SweetAndLow · May 24, 2017

So you still need to provide the info I asked for, but I can make a guess and say you have multiple drive failures. This is why you have errors and corruption. Hope you have a backup, because you will most likely be rebuilding your pool.

Things like zpool status and the whole error message are good to provide.


zoomzoom · May 24, 2017

It could be a bad SATA cable as well, and as @SweetAndLow stated, zpool status is needed since it will state how many r/w/chksum errors there are.

Once you narrow down the problem drive(s), test each with a known good SATA cable on a separate system with stress/benchmark tests, paying attention to seek times and r/w errors. Ensure all SATA cables are not bent or placed in a curve smaller than 2" (~2x thumb width).

rs225 · May 24, 2017

It should probably be zpool status -v Data

Also, 20 cores for a VM is excessive. 8 is probably a better number.

ajschot · May 24, 2017

Well... I don't understand. Do I have to send all the disks back? I have no SMART errors. Like I wrote, this is 1 zpool of 1 vdev, 8x 4 TB RAIDZ2.

It keeps resilvering disk 8 while I have no errors on disk 8. I still keep having doubts about ZFS; it does not look good.

Code:

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA400
Serial Number:    Y6J7KYDWFSAA
LU WWN Device Id: 5 000039 76ba0323b
Firmware Version: FP2A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 24 22:45:28 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 475) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       6725
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       36
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       39 (Min/Max 26/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       31
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       642
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        32         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART output after a long test (smartctl -t long /dev/da7).
I will now do all the disks. The thing is, it keeps trying to resilver disk 8, but I never replaced disk 8; I replaced disk 7, and that resilver made disk 8 resilver on every boot.

UPDATE: the alert about errors in my zpool is gone! Disappeared. Maybe there was an error in the SMART reading of disk 8? This is really strange; I only did a SMART test... and nothing else...

SweetAndLow · May 24, 2017

Did you reboot your system? Try running a scrub on the pool.

ajschot · May 25, 2017

SweetAndLow said:
Did you reboot your system? Try running a scrub on the pool.

I needed to reboot now... because I needed to paint the room where the FreeNAS machine is.
So when I switched it on... again... the alert message, and again... it started to resilver... I think there is really nothing wrong; this is just a stupid FreeNAS problem. And again it is only writing to da7, which is not the disk that I changed.
Nothing changed in the SMART report... still no errors...

UPDATE:
Strange, I see it could be the new disk. When I look at the disks it looks like their order has changed; however, I only swapped drives, so also just one cable... strange...
What could be the problem with a new disk? Or a bug in FN11??
When it is done I will make an FN9 VM, just import the pool and see what happens in that... or I will have to put the disk offline, scrub the (degraded) pool and then see what happens.

zoomzoom · May 25, 2017

@ajschot As has been repeatedly asked, what is the zpool output? Have you tried a different SATA cable?

Unless other users are reporting the same exact issue, this is not a problem with FreeNAS, but with something in your environment.

ajschot · May 25, 2017

zoomzoom said:
@ajschot As has been repeatedly asked, what is the zpool output? Have you tried a different SATA cable?

Unless other users are reporting the same exact issue, this is not a problem with FreeNAS, but with something in your environment.

Yes, I changed the cables just after the resilver earlier today, but it starts to resilver again...

Oh, and about my environment: I did a lot of work on it, because a year ago I had problems with the FN10 beta and bhyve, and the developers said the problem was on my side, but I was right then too: it was the fault of a part of FreeBSD that had not been updated to the latest version.
So again, people are not testing resilver on an RC version... even though it is just 9.10.

Here is the zpool output. Sorry I did not post it before, but I see no useful information in it.

Code:

  pool: Data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 25 22:58:40 2017
        2.40T scanned out of 7.49T at 706M/s, 2h5m to go
        301G resilvered, 32.10% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     6
          raidz2-0                                      ONLINE       0     0    12
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/a6c19d32-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/a7937aca-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0  (resilvering)
            gptid/aaff7d0f-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/ad0ecf4a-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/ac78268c-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/acf30023-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0

It all started when I went back from Corral; I knew I should have stayed...

zoomzoom · May 25, 2017

Something is causing cksum errors, and while it could be the OS, it's far more likely it's hardware related (Occam's Razor).

Once it gets done resilvering, shut it down, remove/disconnect da7, then boot it back up.
If it doesn't show checksum errors for any other disk and does not begin to auto-resilver, shut it down, swap any drive other than da6 (say da1) with da7, and put da7 where da1 was.
- If cksum errors occur on da1 (prev da7), but not on da7 (prev da1), it's a bad drive.
- If cksum errors occur on da7 (prev da1), but not on da1 (prev da7), it's the da7 SATA cable or SATA port.
- If both da1 (prev da7) & da7 (prev da1) show errors, it could just be the OS.
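
The three outcomes of the swap test can be written down as a small decision function. This is only a sketch of the reasoning in the list above; the `diagnose_swap` name, its two boolean inputs and the wording of the results are made up for illustration.

```python
def diagnose_swap(errors_follow_drive, errors_stay_in_slot):
    """Interpret the drive-swap test.

    errors_follow_drive: checksum errors appear on the suspect drive after
        it has been moved to a different slot and cable.
    errors_stay_in_slot: checksum errors appear on the known-good drive that
        now sits in the suspect drive's old slot.
    """
    if errors_follow_drive and errors_stay_in_slot:
        # Both positions misbehave, so neither the drive nor the cable
        # alone explains it.
        return "inconclusive: possibly the OS"
    if errors_follow_drive:
        return "bad drive"
    if errors_stay_in_slot:
        return "bad cable or port"
    return "fault not reproduced"


# Errors moved along with the suspect drive:
print(diagnose_swap(errors_follow_drive=True, errors_stay_in_slot=False))
```

The last branch (no errors anywhere after the swap) is not covered by the list above; it is included so the function always returns something.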

rs225 · May 25, 2017

You need to do zpool status -v Data

There are checksum errors at the top of the pool. In other words, the pool is corrupted. You might be able to cure it by looking at the -v output, or the pool may have to be abandoned.
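
To see at a glance which rows of the config table carry errors, the table can be parsed mechanically. The sketch below works on the text of `zpool status` as posted in this thread; `nonzero_error_rows` is a hypothetical helper, and it assumes plain integer counters (abbreviated counts such as `1.2K` are skipped rather than parsed).

```python
def nonzero_error_rows(zpool_status_text):
    """Return (name, read, write, cksum) for every config row with a non-zero counter."""
    rows = []
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # rows after this header belong to the table
            continue
        if not in_config:
            continue
        if fields[0] == "errors:":
            break  # end of the config table
        if len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        try:
            counts = (int(read), int(write), int(cksum))
        except ValueError:
            continue  # abbreviated or unexpected counter format
        if any(counts):
            rows.append((name,) + counts)
    return rows


# Shortened copy of the table posted earlier in the thread.
sample = """\
config:

    NAME                                            STATE     READ WRITE CKSUM
    Data                                            ONLINE       0     0     6
      raidz2-0                                      ONLINE       0     0    12
        gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
        gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0  (resilvering)
"""
for row in nonzero_error_rows(sample):
    print(row)
```

For the output posted above, only the `Data` and `raidz2-0` rows come back; every individual disk shows zero.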

zoomzoom · May 25, 2017

@rs225 I was wondering about that 6 and forgot you had mentioned before to run it with the -v Data parameter. How does a pool itself get corrupted, and why would a pool with redundancy not be able to recover itself (I didn't realize the pool itself could become corrupted)?

ajschot · May 26, 2017

Code:

  pool: Data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 937G in 6h25m with 13 errors on Fri May 26 05:24:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0    14
          raidz2-0                                      ONLINE       0     0    28
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/a6c19d32-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/a7937aca-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0
            gptid/aaff7d0f-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/ad0ecf4a-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/ac78268c-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0
            gptid/acf30023-ba6f-11e6-a6a7-6805ca0cfed6  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Data/Videos/Series/MacGyver/Macgyver 7x07.avi
        Data/Videos@Data-auto-20170424.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Videos@Data-auto-20170424.2030:/KinderFilms/Suske en Wiske & De Texas Rakkers (2009).mkv
        Data/Videos@Data-auto-20170424.2030:/Series/MacGyver/Macgyver 7x07.avi
        Data/Videos@Data-auto-20170424.2030:/Films/Once Upon a Time in the West (1968).mkv
        Data/Videos@Data-auto-20170424.2030:/KinderFilms/Mary Poppins (1964) 1080p.mkv
        Data/Videos@Data-auto-20170424.2030:/KinderFilms/Pinocchio (1940).mkv
        Data/Videos@Data-auto-20170424.2030:/KinderFilms/Planes (2013).mkv
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/Romeo + Juliet/Romeo + Juliet (HD).m4v
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Music/Stotijn/33613/05_Rota-Divertimetno Concertanto-Allegro.flac
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/Horrible Bosses/Horrible Bosses (1080p HD).m4v
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/Ice Age_ het mysterie van de eieren/Ice Age_ het mysterie van de eieren (1080p HD).m4v
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/La crème de la crème/La crème de la crème (1080p HD).m4v

OK, so it looks like I have some data loss. Strange, so a resilver is not safe even if you run RAIDZ2?!
So probably when I remove these files the alerts will be gone?

Funny that these files are almost all movie files... (one is an audio file)

I tried to play the movies that are on the list in Plex, but they all seem to be fine... I just pressed play, scrolled through the movies and played small parts at different places, and they seem to be fine... What could be wrong?
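
One thing that helps with a list like this is separating entries that are live files (absolute paths) from entries that live inside a snapshot (`dataset@snapshot:/path`). A minimal sketch, assuming the entry formats seen in the output above; `group_permanent_errors` is a hypothetical helper, and snapshot names containing a colon are not handled.

```python
import re
from collections import defaultdict

# "dataset@snapshot:/path/inside/snapshot"
SNAP_ENTRY = re.compile(r"^(?P<snapshot>[^@]+@[^:]+):(?P<path>/.*)$")


def group_permanent_errors(zpool_status_text):
    """Group the 'Permanent errors' entries by snapshot; live files go under 'live'."""
    groups = defaultdict(list)
    in_errors = False
    for line in zpool_status_text.splitlines():
        entry = line.strip()
        if entry.startswith("errors: Permanent errors"):
            in_errors = True
            continue
        if not in_errors or not entry:
            continue
        # Absolute paths are files in the live filesystem.
        match = None if entry.startswith("/") else SNAP_ENTRY.match(entry)
        if match:
            groups[match.group("snapshot")].append(match.group("path"))
        else:
            groups["live"].append(entry)
    return dict(groups)


# Three entries copied from the list above.
sample = """\
errors: Permanent errors have been detected in the following files:

        /mnt/Data/Videos/Series/MacGyver/Macgyver 7x07.avi
        Data/Videos@Data-auto-20170424.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Music@Data-auto-20170424.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
"""
for where, paths in group_permanent_errors(sample).items():
    print(where, len(paths))
```

In the full list above, only the first entry is a live file; all the others sit in the `Data-auto-20170424.2030` snapshots of `Data/Videos` and `Data/Music`.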

ajschot · May 26, 2017

rs225 said:
It should probably be zpool status -v Data

Also, 20 cores for a VM is excessive. 8 is probably a better number.

I run 20 because I was using Corral before with a lot of VMs and Docker containers. Plex was using the most for transcoding, and the E5-2658v4 is only 2.3 GHz. But now that I use more apps outside FreeNAS I can lower it a lot, that is true. Maybe I will put in my old E5-2630v4... it is less powerful but maybe enough (10 cores, 20 threads, 2.2 GHz, turbo 3.1) and uses a lot less energy.

ajschot · May 26, 2017

zoomzoom said:
@rs225 I was wondering about that 6 and forgot you had mentioned before to run it with the -v Data parameter. How does a pool itself get corrupted, and why would a pool with redundancy not be able to recover itself (I didn't realize the pool itself could become corrupted)?

I have no idea how this happened. I had no problems before replacing disk da6 (now, for an unknown reason, da7). I also did not have problems with these files, and I use ECC RAM, which I ran a 48-hour memtest on before using it in my FreeNAS setup. I really have no idea; there was no power problem here and also no errors on the other disks. I checked all disks for SMART errors and there are none. All are OK and passed all tests without errors.
The only thing that happened is that my wife let my 1.5-year-old son into the room where the server is, and he pressed that beautiful blue lighted button (power). I don't know if he pressed it just once or kept it pressed. Maybe that could be it, but I know for sure that it was not writing any data to the disks at that moment; maybe it was reading, but not writing. But that was when the old disk was still in (and had already given the SMART error), because I wanted to let it back up everything to CrashPlan first before changing the disk (I had only a seek time error in SMART). Since I replaced it I have these 13 errors and a resilver on every reboot.
I cannot imagine that that was the cause of this data loss; there must be something else.

Also, the power button is not working anymore; I disconnected the cable ;-)

To replace the disk I did the following on FN11:
1. In FN11 select the pool, then da6, and offline it
2. Replaced the disk with a new disk
3. In FN11 now select replace, and it started to resilver

Also, I have never ever had data loss before with these kinds of things. I have no idea how this happened...

EDIT: Oh, it looks like most of these files are in a snapshot, right? I will delete this snapshot and see if all the problems are gone then.
I can play the movies in Plex. I am wondering what happened there... maybe that was the day I got the first SMART error and something went wrong with the snapshot. However, the snapshot of the 25th of April seems to be fine.

SweetAndLow · May 26, 2017

ajschot said:
I have no idea how this happened. I had no problems before replacing disk da6 (now, for an unknown reason, da7). I also did not have problems with these files, and I use ECC RAM, which I ran a 48-hour memtest on before using it in my FreeNAS setup. I really have no idea; there was no power problem here and also no errors on the other disks. I checked all disks for SMART errors and there are none. All are OK and passed all tests without errors.
The only thing that happened is that my wife let my 1.5-year-old son into the room where the server is, and he pressed that beautiful blue lighted button (power). I don't know if he pressed it just once or kept it pressed. Maybe that could be it, but I know for sure that it was not writing any data to the disks at that moment; maybe it was reading, but not writing. But that was when the old disk was still in (and had already given the SMART error), because I wanted to let it back up everything to CrashPlan first before changing the disk (I had only a seek time error in SMART). Since I replaced it I have these 13 errors and a resilver on every reboot.
I cannot imagine that that was the cause of this data loss; there must be something else.

Also, the power button is not working anymore; I disconnected the cable ;-)

To replace the disk I did the following on FN11:
1. In FN11 select the pool, then da6, and offline it
2. Replaced the disk with a new disk
3. In FN11 now select replace, and it started to resilver

Also, I have never ever had data loss before with these kinds of things. I have no idea how this happened...

EDIT: Oh, it looks like most of these files are in a snapshot, right? I will delete this snapshot and see if all the problems are gone then.
I can play the movies in Plex. I am wondering what happened there... maybe that was the day I got the first SMART error and something went wrong with the snapshot. However, the snapshot of the 25th of April seems to be fine.

Delete the snapshot and things should be back to normal. I would seriously inspect your system because this problem looks like multiple drive failures to me.

Verify you have auto SMART tests scheduled, scrubs scheduled, and email notifications set up.


blaco · May 26, 2017

I would recommend trying to resilver the disk on a bare-metal FreeNAS system. Maybe you have another box or can shut down ESXi for a few hours?
If this works, you will know whether it's a problem with virtualization or with your disks.

I had strange problems when I virtualized FreeNAS (OK, it was on KVM with VT-d, but maybe...)
I got (random) checksum errors on every disk (under heavy load) and some large files got corrupted. After that I installed FreeNAS (it was Corral, but I think it doesn't matter) bare-metal and everything was fine again: zero checksum errors. So I decided to run my VMs in FreeNAS with bhyve (now 11-RC3).

EDIT: I suspect that deleting the snapshot will not solve the problems with the data errors. I had the same problem, and after deleting the snapshot, another prior snapshot appeared with (the same) corrupted files...

rs225 · May 26, 2017

The power failure did not cause this, nor the SMART errors.

Deleting the snapshots may eliminate the problem. For the first errored file, that file must be copied if possible, then removed. If copy fails, it may be possible to rescue most minus 128KB.

This is not double disk failure. This is more likely CPU, RAM, or mobo glitch, which resulted in either bad metadata, or an incorrect checksum calculation. Since the original written data was wrong, the error is not correctable.

Since it isn't known exactly what happened, it may not be possible to fix this.
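
As an illustration of the "rescue most minus 128KB" idea: a damaged file can be copied record by record, with any record that fails to read replaced by zeros. This is a rough sketch, not a tested recovery tool. `salvage_copy` is a made-up helper, and the 128 KiB record size is an assumption (it is the ZFS default `recordsize`; the dataset here may be set differently).

```python
import os

RECORD_SIZE = 128 * 1024  # assumed recordsize; check the dataset's actual value


def salvage_copy(src_path, dst_path, record_size=RECORD_SIZE):
    """Copy src to dst, zero-filling any record that raises an I/O error.

    Returns the byte offsets where a read failed.
    """
    bad_offsets = []
    size = os.path.getsize(src_path)
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        offset = 0
        while offset < size:
            # Never read across a record boundary, so one bad record
            # cannot take its neighbours down with it.
            want = min(record_size - offset % record_size, size - offset)
            try:
                src.seek(offset)
                chunk = src.read(want)
            except OSError:
                chunk = b"\0" * want  # placeholder for the unreadable data
                bad_offsets.append(offset)
            if not chunk:
                break  # file ended earlier than its reported size
            dst.write(chunk)
            offset += len(chunk)
    return bad_offsets
```

Usage would look like `bad = salvage_copy("/mnt/Data/Videos/file.avi", "/mnt/other/file.avi")` (both paths are placeholders). An empty list means every record was readable; otherwise the listed offsets mark the zero-filled gaps in the copy.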
