raidz2 keeps crashing during resilver process

Status
Not open for further replies.

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
I have a raidz2 pool in FreeNAS 11.1 that has been running great for quite a while, but the other day I noticed one of my drives was beginning to fail: its Offline_Uncorrectable value had been climbing steadily over the past couple of days and had reached about 9k by the time my new drive arrived. Mind you, FreeNAS hadn't even flagged the drive as bad at this point; I was just being proactive, considering the drive in question had well over 47k power-on hours.

So I ordered a new drive, popped it in, offlined the bad drive, and ran a replace. Within about 2 minutes the console had locked up, and about 3 minutes later the machine rebooted. Upon reboot, the resilver was still going at a fast clip, but then the console locked up again and the server rebooted. I disconnected the old drive, thinking maybe that was the problem, but the server would still lock up and reboot shortly after restarting the resilver process.

With the new drive disconnected, the server boots up fine and the pool shows as fine, just in a degraded state, which is to be expected.

The added redundancy of z2 is great, but this crashing when replacing a disk is making me nervous and putting undue stress on the other drives, so I'd like to get this fixed ASAP. Any help is appreciated!
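
For anyone who wants to watch a drive the same way, the proactive check described in this post can be sketched in Python. This is only a minimal sketch: it assumes the usual ten-column attribute table that smartctl -A prints, the sample text and its values are illustrative (not the poster's actual output), and the function names are hypothetical.

```python
# Illustrative `smartctl -A` rows (made-up values, not the poster's real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       47500
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       9000
"""


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; ignore this row
    return attrs


def is_climbing(samples):
    """True if the newest reading is higher than the oldest one."""
    return len(samples) >= 2 and samples[-1] > samples[0]
```

On the NAS itself the text would come from something like smartctl -A /dev/da0 (the device name is only an example), sampled once a day so that is_climbing has a series of readings to compare.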




root@freenas:/dev/gptid # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:57 with 0 errors on Tue Nov 20 03:45:57 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da5p2       ONLINE       0     0     0

errors: No known data errors

  pool: vol01
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 21 14:18:40 2018
        1.37T scanned at 2.84G/s, 150G issued at 750M/s, 9.62T total
        25.0G resilvered, 1.52% done, 0 days 03:40:42 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol01                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/389bc19c-edc2-11e8-83bf-003048c79950  ONLINE       0     0     0  (resilvering)
            gptid/fd06a36e-1b8f-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/ff976983-1b8f-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/02a7a455-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/05bc51dc-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/08d5c475-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0

errors: No known data errors
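
The "to go" figure in the resilver status is consistent with a simple estimate: the data not yet issued, divided by the issue rate. The rough Python check below rests on two assumptions that the thread does not confirm: that zpool's size suffixes are binary (1G = 1024^3 bytes), and that the estimate really is derived this way. The helper names are hypothetical.

```python
# Binary multipliers for zpool-style size suffixes (assumed, see note above).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a size such as '9.62T' or '150G' to a byte count."""
    return float(size[:-1]) * UNITS[size[-1]]


def eta_seconds(total, issued, rate_per_second):
    """Remaining (not yet issued) data divided by the issue rate."""
    return (to_bytes(total) - to_bytes(issued)) / to_bytes(rate_per_second)


def hms(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Figures from the status output: 9.62T total, 150G issued at 750M/s.
estimate = eta_seconds("9.62T", "150G", "750M")
print(hms(estimate))  # lands within a few seconds of the reported 03:40:42
```

The small difference from the reported value is expected, since the status output shows rounded sizes and rates.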
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Please post the rest of your system specs - CPU, RAM quantity and type, motherboard, drive model and method of connection (SATA on motherboard, via SAS HBA), make/model of HBA if present, power supply make/model.

Have you verified that all necessary cooling fans are running? A hard lockup under load could be the result of insufficient cooling on hardware that isn't capable of thermal throttling (e.g., motherboard, power supply).
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
CPU - Dual Xeon E5335
Board - SuperMicro X7DWE
RAM - 16GB Crucial ECC DDR2
Power Supply - 1000W Cooler Master
Drive Model(s) - 5 Hitachis (1 of which is the new drive I just ordered), 1 Western Digital Caviar Blue, 1 Seagate Barracuda (the one I'm replacing)
All drives are SATA, connected to 1 LSI 9211-8i HBA (I have a second 9211-8i HBA installed, but it's been unused). The two HBAs are from different manufacturers, though.

I also have a 4 Gbps QLogic Fibre Channel card, which is what interfaces to my ESXi server, but I've shut down all the VMs that use this storage for now, so there's no I/O happening on this array.

All cooling fans are working properly: the CPUs have massive Cooler Master heatsinks/fans, and the drive bay fans are all spinning without issue.
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
I got it to stop bootlooping by re-replacing the new drive with the old drive while the new drive was disconnected...

It's now resilvering the old drive, but that drive is now up to 10128 Offline_Uncorrectable sectors, so failure is imminent.

And surprisingly, it's not bootlooping during this resilver process.

So why does putting this new drive into the array cause things to freak out??

I just now noticed that all my drives EXCEPT this new one are at least SATA version 3, while the new one is SATA 2.6... Could that be cause for concern?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
And surprisingly, it's not bootlooping during this resilver process.

So why does putting this new drive into the array cause things to freak out??
I would say this is an indication that the "new" drive is not as new as you might have liked it to be.
I just now noticed that all my drives EXCEPT this new one are at least SATA version 3, while the new one is SATA 2.6... Could that be cause for concern?
This is another indication that the drive is not new. It could even be the entire cause of the problem with rebooting. What size (capacity) drive are we discussing?

Before a drive is added into your NAS, it should be run through burn-in testing to ensure the health of the drive.
Here are some scripts that you might want to look at and potentially use in your NAS:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
Thanks Chris. The drive in question is a whitelabel drive, so the fact that it's not "new" is known. It's "new" to me though :P

Anyway, HoneyBadger may have been on to something with this being a thermal issue. I noticed that the crash actually happens when disk I/O ramps way, way up, and I believe that heavy I/O may be overheating my northbridge chip and causing the crash. I don't have a temp sensor on it, though, so I can't really verify a fluctuation during I/O other than by placing my finger on it and saying ow, lol.

I'm temporarily adjusting some tunables to slow down resilvering and I put a box fan on top of my server and it seems to be holding!
Code:
vfs.zfs.top_maxinflight=1
vfs.zfs.resilver_min_time_ms=500
vfs.zfs.resilver_delay=15
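
Since these are temporary settings, it can help to record the current values before changing them so they can be put back afterwards. The Python sketch below is a hedged illustration only: it assumes a FreeBSD-style sysctl command (sysctl -n NAME to read, sysctl NAME=VALUE to write), it assumes these three tunables exist on the running version, and all function names are hypothetical.

```python
import subprocess

# Values copied from the post above; they throttle resilver I/O.
THROTTLED = {
    "vfs.zfs.top_maxinflight": "1",
    "vfs.zfs.resilver_min_time_ms": "500",
    "vfs.zfs.resilver_delay": "15",
}


def read_cmd(name):
    """Command line that prints only the value of one tunable."""
    return ["sysctl", "-n", name]


def write_cmd(name, value):
    """Command line that sets one tunable (needs root on a real system)."""
    return ["sysctl", f"{name}={value}"]


def snapshot(names, run=subprocess.check_output):
    """Read the current value of each tunable so it can be restored later."""
    return {name: run(read_cmd(name), text=True).strip() for name in names}


def apply_tunables(values, run=subprocess.check_call):
    """Set every tunable in `values`."""
    for name, value in values.items():
        run(write_cmd(name, value))


# Intended use on the NAS (not executed here):
#   saved = snapshot(THROTTLED)   # remember the current values
#   apply_tunables(THROTTLED)     # slow the resilver down
#   apply_tunables(saved)         # put the old values back afterwards
```

The run parameter exists only so the command construction can be exercised without touching a real system.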



root@freenas:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: vol01
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 21 21:00:20 2018
        1.63T scanned at 1.34G/s, 505G issued at 417M/s, 9.62T total
        84.6G resilvered, 5.12% done, 0 days 06:22:50 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol01                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/9bc953eb-edf5-11e8-9ce2-003048c79950  ONLINE       0     0     0  (resilvering)
            gptid/fd06a36e-1b8f-11e7-9f56-003048c79950  ONLINE       0     0     0  (resilvering)
            gptid/ff976983-1b8f-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/02a7a455-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/05bc51dc-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0
            gptid/08d5c475-1b90-11e7-9f56-003048c79950  ONLINE       0     0     0

errors: No known data errors
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
I'm leery of the white label drives. Some of our users have had bad experiences with them. Here's a reply from @danb35.

upload_2018-11-22_10-16-9.png


the drive in question is a whitelabel drive,
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I'm temporarily adjusting some tunables to slow down resilvering and I put a box fan on top of my server and it seems to be holding!
You should not need to make such adjustments.

Also, if you are going to post a long list of text, please use CODE tags, not CMD tags. The CMD tag is for short commands like zpool list or zpool status, to differentiate them from the rest of the text. Long lists are best posted in CODE tags so they are easier to read, like this:
Code:
root@Emily-NAS:~ # zpool status
  pool: Backup
 state: ONLINE
  scan: scrub repaired 0 in 0 days 06:36:58 with 0 errors on Sat Oct 13 17:37:55 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		Backup										  ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/181101e2-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/18e924eb-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/19a7111b-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/1a9e1915-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE	   0	 0	 0

errors: No known data errors

  pool: Emily
 state: ONLINE
  scan: scrub repaired 0 in 0 days 05:23:05 with 0 errors on Tue Nov 13 05:23:07 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		Emily										   ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/af7c42c6-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b07bc723-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b1893397-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b2bfc678-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b3c1849e-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b4d16ad2-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
		  raidz2-1									  ONLINE	   0	 0	 0
			gptid/bc1e50e5-c1fa-11e8-87f0-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/a03dd690-c1fb-11e8-87f0-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/a6ed2ed5-c240-11e8-87f0-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/b9de3232-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/baf4aba8-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/bbf26621-bf05-11e8-b5f3-0cc47a9cd5a4  ONLINE	   0	 0	 0
		logs
		  gptid/ae487c50-bec3-11e8-b1c8-0cc47a9cd5a4	ONLINE	   0	 0	 0
		cache
		  gptid/ae52d59d-bec3-11e8-b1c8-0cc47a9cd5a4	ONLINE	   0	 0	 0

errors: No known data errors

  pool: Irene
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:56:12 with 0 errors on Sat Oct 13 18:13:20 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		Irene										   ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/8710385b-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/87e94156-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/88db19ad-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/89addd3b-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8a865453-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8b66b1ef-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
		  raidz2-1									  ONLINE	   0	 0	 0
			gptid/8c69bc72-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8d48655d-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8e2b6d1f-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8efea929-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/8fd4d25c-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0
			gptid/90c2759a-becf-11e8-b1c8-0cc47a9cd5a4  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:34 with 0 errors on Sun Nov 11 03:53:35 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		freenas-boot									ONLINE	   0	 0	 0
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/f659fd6d-4b12-11e6-a97c-002590aecc79  ONLINE	   0	 0	 0
			gptid/f6a61d33-4b12-11e6-a97c-002590aecc79  ONLINE	   0	 0	 0

errors: No known data errors
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
This is not an enterprise-grade array, just an 18TB media storage server with 400+ (and growing) Blu-ray movies, 4K movies, home videos, and DVR space for OTA antenna shows. My wife, 4 kids, and I consume the crap out of it on a daily basis, though. All the movies and important stuff have been backed up on 2x 5TB USB drives, so in the end I think I would've lost 5 or 6 movies that hadn't been backed up if I'd had to rebuild the entire array. The recorded shows I could've watched from their respective network apps, just without the ability to skip commercials. A rebuild would've been much more time-consuming, though, because I also would have lost the OS VMDK containing the media server applications, which I haven't backed up in a while.

You can't beat $0.016/GB though. (~$50 per 3 TB drive)

Besides, the reduced reliability of these white label drives is exactly why I chose to go with raidz2. Great performance, cheap storage, and a bit more reliability than an average RAID array. I think I'm doing pretty good on the Performance -> Cost -> Reliability triangle IMHO.

Got an update on this? Is the scrub still progressing with the assistance of the box fan?

All of the settings I changed in the beginning to slow down resilver performance I slowly inched back up while it was running with the box fan on top of it. I finally got back to default settings, and it was resilvering at about 675MB/s and holding strong. Since the media server in front of the storage array serves this data over 3 load-balanced 1-gig network ports, I've never exceeded 300MB/s disk throughput, so this resilver opened my eyes to my thermal problems. I'll probably invest in some more fans in the future, though.
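
The arithmetic behind that observation can be sketched in a few lines of Python. This is a back-of-the-envelope illustration that assumes decimal units (1 Gb/s = 1000 Mb/s, 8 bits per byte) and ignores protocol overhead; the function name is hypothetical.

```python
def link_ceiling_mb_s(ports, gbit_per_port):
    """Theoretical aggregate throughput in MB/s, ignoring protocol overhead."""
    return ports * gbit_per_port * 1000 / 8


ceiling = link_ceiling_mb_s(3, 1)  # three 1-gigabit ports
resilver_rate = 675                # MB/s, as reported in the post

print(f"network ceiling: {ceiling:.0f} MB/s")
print(f"resilver vs ceiling: {resilver_rate / ceiling:.1f}x")
```

Under these assumptions the network side tops out well below the resilver rate, which fits the point that normal media serving never loaded the disks and chipset the way the resilver did.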

But YES! The new drive and one of the other existing drives in the pool successfully resilvered within about 5 hours, and the array is now fully healthy. I rebooted and then ran long SMART tests overnight to check the health of the other drives. I'll probably run a scrub this weekend after my family has caught up on their TV consumption, but everything looks great.

Thanks everyone for their help!
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
Hindsight is always 20/20

3TB seemed to be the sweet spot for me when I first built it. Drive prices have continued to plummet, though. Maybe I can replace my 15k SAS drives with SSDs soon :p
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
All of the settings I changed in the beginning to slow down resilver performance I slowly inched back up while it was running with the box fan on top of it. I finally got back to default settings, and it was resilvering at about 675MB/s and holding strong. Since the media server in front of the storage array serves this data over 3 load-balanced 1-gig network ports, I've never exceeded 300MB/s disk throughput, so this resilver opened my eyes to my thermal problems. I'll probably invest in some more fans in the future, though.

If you're willing, can I see a photo of the case internals (with the box fan moved for a moment, of course)?

More fans might not be necessary; some airflow direction and targeted cooling (those toasty FB-DIMMs and the chipset) is likely all that's needed. That being said, check your IPMI and SMART logs to make sure your drive and CPU temperatures aren't getting too high.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Maybe I can replace my 15k SAS drives with SSDs soon
Unless you are hosting virtual machines, there is no need for either of those. I use 5900RPM desktop drives in both of my home NAS systems to serve media for Plex, and they are plenty fast enough for me.
FreeNAS to RAM Disk.PNG
 

andstein85

Dabbler
Joined
Apr 21, 2017
Messages
14
Unless you are hosting virtual machines, there is no need for either of those. I use 5900RPM desktop drives in both of my home NAS systems to serve media for Plex, and they are plenty fast enough for me.
We're talking about 2 different boxes here... One is an IBM x3650m3 (2U), which runs all of my virtual machines, including my Cisco vWLC for my wifi access points and my firewall. That is the server that has the 15k SAS drives, simply because it came with them (free from work). They're tiny 146GB drives too, so I only keep the vWLC, the firewall, and some other OS stuff on them. No data. The other box is a custom-built Supermicro X7DWE, which is my storage server and is dedicated to FreeNAS. The media server's OS (Windows Server 2016 running Emby) runs on a separate LUN on the FreeNAS box because, with the Fibre Channel interface, it's far faster than the SAS drives on the local VM host.

The only reason I'd want to replace the VM host's SAS drives with SSDs (actually, probably replace all 6 SAS drives with 1 SSD) is for power savings, not for speed. My wifi lasts about 1.5 hours on battery backup if the power goes out, and I could probably get that up to 2.5 hours or more by replacing the disks with 1 SSD. As far as my NAS goes, I'm perfectly happy with the 700MB/s I get from my white label 7200s (cooled with a box fan), and that beast gets shut down fast if the power goes out (about 45 seconds).


If you're willing, can I see a photo of the case internals (with the box fan moved for a moment, of course)?

More fans might not be necessary; some airflow direction and targeted cooling (those toasty FB-DIMMs and the chipset) is likely all that's needed. That being said, check your IPMI and SMART logs to make sure your drive and CPU temperatures aren't getting too high.

The drives stay at about 30-34C under load, and the CPUs don't get above 38C.

It's basically everything else that's passively cooled... RAM, NB/SB chips. This Supermicro board was never designed to sit inside a PC-style case that doesn't forcefully move air from front to back like my IBM x3650 does... This case is pretty terrible for airflow, honestly, but it was cheap and still has tons of room for drive expansion.

Please keep in mind most of what you see are extra parts I had lying around.
20181204_094236.jpg
20181204_094136.jpg
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Supermicro X7DWE
I had one of those. It was very power hungry if I recall correctly. You might save a bit on electricity by going to a newer board.
The media server's OS (Windows Server 2016 running Emby)
You might be able to run Plex instead if you moved to slightly newer hardware, for the CPU transcoding.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I had one of those. It was very power hungry if I recall correctly. You might save a bit on electricity by going to a newer board.

I tried running something very similar on an old Dell SC1430 with 2x X5355s (120W TDP each vs 80W for the E5335). Between the fan noise and the heat, it made my home office pretty unpleasant. But the Dell could run with only one CPU socket occupied, which helped a bit. I'm not sure the Supermicro board would run correctly with only a single socket occupied.

An old X9Sxx board with a Xeon E3 would cut the TDP in half, and offer almost double the performance.

https://www.cpubenchmark.net/compar...355-vs-Intel-Xeon-E3-1270-V2/1229vs1294vs1192
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I had one of those. It was very power hungry if I recall correctly.
It would've been. Core 2, which isn't that much worse; and FB-DIMMs, which burn power like there's no tomorrow.
 