SATA best practices?

Status
Not open for further replies.

rvassar

Guru
Joined
May 2, 2018
Messages
972
So I got to experience my first FreeNAS unplanned disk failure / replacement event over the weekend. I lost a nearly 6 year old 3TB HGST drive on my 2-vdev mirror pool in the early hours of Friday morning. That left the vdev relying on a months-old Toshiba drive that I have little operational experience with or trust in, and no replacement sitting on the shelf. I had been performing science experiments on a 3-disk raidz pool made up of odd 2TB drives I had lying around, so I wiped the raidz pool, took a snapshot of the mirror pool, set up a "zfs send | zfs receive" job, and left for work. At lunch, I stopped by Fry's and grabbed the only 3TB disk they had on the shelf, a WD "Red" NAS disk (interestingly, the 4TB drives were all sold out). Once back home, I placed it in a USB3 external case, attached it to a Linux host, and gave it an overnight badblocks sweep, followed by a long SMART self-test. While this was going on, FreeNAS was up and available, and had completed my hasty snapshot replication.
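In case it's useful to anyone, the burn-in and the one-off replication were roughly the following. "tank", "scratch", and the device name are stand-ins, not my actual pool names:

Code:
# burn-in on the Linux host (destructive write test; the new drive is blank anyway)
badblocks -wsv /dev/sdX
smartctl -t long /dev/sdX    # check the result later with: smartctl -a /dev/sdX

# hasty one-off replication of the mirror pool onto the scratch raidz pool
zfs snapshot -r tank@evac
zfs send -R tank@evac | zfs receive -F scratch/tank-backup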


I have a pair of SATA hot-swap trays installed in the front of my case. One contained a drive participating in the raidz pool, and the other was not connected for want of a SATA power plug. I had picked up a Y-cable to remedy this, but hadn't taken the NAS down to install it. I remedied that, installed the new drive in its permanent slot, and moved the failing drive to the hot-swap tray. Unfortunately, power cycling the failing drive made its condition worse, and it started throwing access timeouts that almost appeared to lock up the SATA controller. I had intended to let the new disk resilver from both disks and then pull the failing one from the hot-swap tray, but in the end I had to pull the failing drive to keep a reasonable I/O rate. The resilver completed, I ran a scrub, and headed to bed. Here's where things got a little weird. In the early hours of Sunday morning, a disk in the raidz pool set an error and detached and reattached repeatedly every few seconds for nearly 2 hours. The error was: "g_access(918): provider ada4 has error 6 set". It ended as abruptly as it started, but it did trigger a pool scrub. The disk passed a long self-test later that day.
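For the record, the replacement itself was just the usual ZFS dance; a rough sketch below, with "tank" and the device names as stand-ins (on FreeNAS the GUI's Replace button issues the equivalent):

Code:
zpool replace tank <old-disk-or-gptid> ada5   # start resilvering onto the new drive
zpool status -v tank                          # watch resilver progress
zpool scrub tank                              # once it finishes, scrub to verify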

So… It was an interesting weekend. I learned a bit about FreeNAS under duress, made a couple of observations, and have a couple of questions:

I'm rather middle-aged; I remember back in the days of PATA & SCSI-I/II having trouble with failing drives holding the bus and hanging the I/O chain. But I have no experience with this on SATA. Since I have two controllers, the Supermicro onboard as well as the LSI PCIe card, should I take steps to separate boot devices from data devices? I had been running the mirror pool on the LSI card, and left the boot devices & rag-tag raidz pool on the onboard ports. I have a second fanout cable for the LSI board. The other configuration would be to place a mirror half from each vdev on a different SATA controller, so there's no single point of failure. Is there any guidance here? Thoughts?
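To make that second layout concrete, it would look something like this at pool-creation time (hypothetical device names, with ada* on the onboard Supermicro ports and da* on the LSI card):

Code:
# each mirror vdev gets one half on the onboard controller and one on the LSI card
zpool create tank mirror ada1 da1 mirror ada2 da2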

I've also come to the conclusion that keeping the hot swap trays open & available is rather handy. I may have to look into the 4 in 3 docks.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I lost a nearly 6 year old 3TB HGST drive
I suppose that they don't last forever. I usually start planning to replace drives around the 5 year mark. If you have more that are this age, you might want to have some spares ready.
Unfortunately, power cycling the failing drive made its condition worse, and it started throwing access timeouts that almost appeared to lock up the SATA controller.
That is one of the reasons why I prefer a SAS controller. I have not had a single disk failure interfere with the other disks on the controller.
I remember back in the days of PATA & SCSI-I/II having trouble with failing drives holding the bus and hanging the I/O chain. But I have no experience with this on SATA.
I have seen it with some SATA controllers, but not all. I have not seen it on SAS, but if you have the defective drive in the pool while you are trying to resilver, the system will continue to try to read from it and it will slow the pool down. I always remove a defective drive before starting to rebuild on the new drive.
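In practice that's just one command before you start (pool and disk names are examples; the GUI's Offline button does the same thing):

Code:
zpool offline tank ada1   # stop the pool from issuing I/O to the dying drive before the rebuild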
Since I have two controllers, the Supermicro onboard, as well as the LSI PCIe card, should I take steps to separate boot devices from data devices?
I do. I have my boot pool on SATA and all my data drives are on the SAS controller. I just think it works better. Only an opinion.
The other configuration would be to place a mirror half from each vdev on a different SATA controller, so there's no single point of failure.
My previous build had half the storage drives on each of two SAS controllers. I can't say why, but I found the pool performed better when all drives were on a single controller by means of a SAS expander.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I suppose that they don't last forever. I usually start planning to replace drives around the 5 year mark. If you have more that are this age, you might want to have some spares ready.

That was the oldest drive in the pool. I am kind of kicking myself for not having a spare handy. I had wanted to bump everything up to 4TB 7200 rpm, but the shelf was bare. I am still planning on moving up in density, and I think I'll probably skip up to 8TB drives at this point. However, now that this old 3TB is gone, everything else is < 9k hrs. Oddly enough, this failed drive had less than 4 years of accumulated hours. I suspect I may have had it in an airflow-challenged enclosure at some point in the past. 7200 rpm drives need more airflow, and I'm noticing the HGST drives run 5 - 7 °C hotter than the 5400 & 5900 rpm drives. The single Toshiba 7200 rpm drive is a bit cooler, maybe 3 - 5 °C hotter than the slower drives.
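For reference, the hour and temperature numbers come straight out of SMART; a quick loop like this (device names are just examples) pulls them per drive:

Code:
for d in ada0 ada1 ada2 ada3; do
    echo "== $d =="
    smartctl -A /dev/$d | grep -E 'Power_On_Hours|Temperature_Celsius'
done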

The other pool I'm running is made up of old 2TB scrap pulls. One of them is past the 52k hour mark, and another has recorded a peak temp of 76 °C, which has me wondering where it came from, and why it even still functions. But that pool is strictly for science experiments. Build scratch pad / transient artifacts, etc...

That is one of the reasons why I prefer a SAS controller. I have not had a single disk failure interfere with the other disks on the controller.

I have seen it with some SATA controllers, but not all. I have not seen it on SAS, but if you have the defective drive in the pool while you are trying to resilver, the system will continue to try to read from it and it will slow the pool down. I always remove a defective drive before starting to rebuild on the new drive.

That's an interesting observation, thank you. I may re-cable my storage next time I take the NAS down, and move all the storage devices to the PCIe SAS card. I think I may have to keep one of the hot swap trays on the motherboard controller, because it has an integral SATA cable, but that's for transient use anyway. It's there so I can toss a drive in and burn it in, or replace/resilver without taking the NAS down, "shelf backup", etc...


I do have one more question: I'm getting a daily email about the swap space:

Code:
Checking status of gmirror(8) devices:
        Name    Status  Components
mirror/swap0  DEGRADED  ada3p1 (ACTIVE)
mirror/swap1  COMPLETE  ada2p1 (ACTIVE)
                        da3p1 (ACTIVE)
mirror/swap2  COMPLETE  da2p1 (ACTIVE)
                        da1p1 (ACTIVE)

-- End of daily output --


I'm guessing there was a mirrored swap partition on the failed drive, which could lead to a swap-related panic if it stays degraded. Does this require manual intervention? Or will it reconfigure itself on reboot now that the pool is repaired?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I'm noticing the HGST drives run 5 - 7 °C hotter than the 5400 & 5900 rpm drives. The single Toshiba 7200 rpm drive is a bit cooler, maybe 3 - 5 °C hotter than the slower drives.
When I had a variety of drives from all the vendors, I noticed differences in average temperature along brand and model lines, even when the drives were in the same environment. It is one of the reasons I standardized on the Seagate drives I am using in my main pool; they ran around ten degrees cooler on average.
Does this require manual intervention? Or will it reconfigure itself on reboot now that the pool is repaired?
Short answer: a reboot will recreate the swap mirrors from the disks that are available when the system boots. There are some caveats involved, but a reboot should clear that up.
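If you'd rather not wait for a reboot, the stock gmirror(8) commands can also repair it by hand. A sketch, using the mirror name from your daily output and a stand-in partition name for the new drive, so double-check both before running anything:

Code:
gmirror forget swap0          # drop the missing component that lived on the failed disk
gmirror insert swap0 ada5p1   # optionally add the new drive's swap partition back in
gmirror status                # swap0 should show COMPLETE again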
 