rvassar · Guru · Joined May 2, 2018 · Messages: 972
So I got to experience my first FreeNAS unplanned disk failure / replacement event over the weekend. I lost a nearly 6-year-old 3TB HGST drive on my 2-vdev mirror pool in the early hours of Friday morning. This left the vdev relying on a months-old Toshiba drive that I have little operational experience with or trust in, and no replacement sitting on the shelf. I had been performing science experiments on a 3-disk raidz pool made up of odd 2TB drives I had lying around, so I wiped the raidz pool, took a snapshot of the mirror pool, set up a "zfs send | zfs receive" job, and left for work. At lunch, I stopped by Fry's and grabbed the only 3TB disk they had on the shelf, a WD "Red" NAS disk (interestingly, the 4TB drives were all sold out). Once back home, I placed it in a USB3 external case, attached it to a Linux host, and gave it an overnight badblocks sweep, followed by a long SMART self-test. While this was going on, FreeNAS was up and available, and had completed my hasty snapshot replication.
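For anyone wanting to do the same, the replication and burn-in amounted to roughly the following. This is a sketch from memory, not an exact transcript: the pool names "tank" and "scratch" and the device node /dev/sdb are placeholders, not my actual names.

    # On FreeNAS: snapshot the mirror pool and replicate it to the scratch raidz pool
    zfs snapshot -r tank@evac
    zfs send -R tank@evac | zfs receive -F scratch/backup

    # On the Linux host: destructive badblocks sweep of the new drive
    # (-w = write-mode test, -s = show progress, -v = verbose; this erases the disk)
    badblocks -wsv /dev/sdb

    # Then kick off a long SMART self-test, and review the results when it finishes
    smartctl -t long /dev/sdb
    smartctl -a /dev/sdb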
I have a pair of SATA hot-swap trays installed in the front of my case. One contained a drive participating in the raidz pool, and the other was not connected for want of a SATA power plug. I had picked up a Y-cable to remedy this, but hadn't taken the NAS down to install it. I fixed that, installed the new drive in its permanent slot, and moved the failing drive to the hot-swap tray. Unfortunately, power-cycling the failing drive made its condition worse: it started throwing access timeouts that almost appeared to lock up the SATA controller. I had intended to let the new disk resilver from both disks and then pull the failing one from the hot-swap tray, but in the end I had to pull the failing drive to keep a reasonable I/O rate. The resilver completed, I ran a scrub, and headed to bed.

Here's where things got a little weird. In the early hours of Sunday morning, a disk on the raidz pool set an error and detached and reattached repeatedly every few seconds for nearly 2 hours. The error was: "g_access(918): provider ada4 has error 6 set". It ended as abruptly as it started, but it did trigger a pool scrub. The disk passed a long SMART self-test later that day.
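The replacement itself was the standard ZFS dance, something like the below. Device names here are placeholders; on FreeNAS you'd normally do this through the GUI, or use the gptid labels shown by "zpool status" rather than raw adaX nodes.

    # Watch the pool and identify the failing member
    zpool status -v tank

    # Swap the new disk in for the failing one; the resilver starts automatically
    zpool replace tank ada3 ada5

    # After the resilver completes, verify the pool with a scrub
    zpool scrub tank

    # The Sunday-morning detach/reattach storm is the sort of thing that
    # shows up in the system log:
    grep g_access /var/log/messages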
So… It was an interesting weekend. I learned a bit about FreeNAS under duress, made a couple of observations, and came away with a couple of questions:
I'm rather middle-aged; I remember having trouble back in the PATA and SCSI-I/II days with failing drives holding the bus and hanging the I/O chain, but I have no experience with this on SATA. Since I have two controllers, the Supermicro onboard as well as the LSI PCIe card, should I take steps to separate boot devices from data devices? I had been running the mirror pool on the LSI card, and left the boot devices and rag-tag raidz pool on the onboard ports. I have a second fanout cable for the LSI board. The other configuration would be to place one half of each mirror vdev on a different SATA controller, so there's no single point of failure. Is there any guidance here? Thoughts?
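If anyone else is weighing the same split, FreeBSD makes it fairly easy to see which controller each disk hangs off of. A quick sketch (the driver names are the usual ones, but check your own output):

    # -v lists each CAM bus with its driver: ahcich* = onboard AHCI SATA,
    # mps*/mpr* = LSI HBA
    camcontrol devlist -v

    # Then match those adaX/daX devices against the pool layout
    zpool status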
I've also come to the conclusion that keeping the hot-swap trays open and available is rather handy. I may have to look into the 4-in-3 docks.