Testing disk failures - can't wipe or replace disk

ayc · Jan 5, 2014

I have a new FreeNAS 9.2 install. LSI 9211 HBA, IT mode, WD RE4 disks. I'm testing failure cases before putting the system into production.

I pulled two drives, saw the system was degraded but still accessible, received email. All good.

Put the two drives back. "View Disks" shows the drives, no serial numbers. Hmmm. "Replace Disk" doesn't let me choose these replacement drives. Ok, maybe because they're still labelled?

Tried to wipe the disks. I get:

"Error: Failed to wipe da5p1: dd /dev/da4p1: Operation not supported".

Back to the shell, dd on that partition truly does give an error. camcontrol can see the drives.

I'm new to FreeBSD, so I'm not sure how to debug. I'm thinking maybe /dev/da5p1 is stale and not being rebuilt on hotplug? Pulled the drives again and ran camcontrol rescan. /dev/da5* is gone, but "View Disks" still shows it. Looks like the disk list is cached. Not good, but a different problem.

Put the drives back. Still no go. kernel says:

mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x000d, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
(da5:mps0:0:5:0): lost device - 0 outstanding, 3 refs
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x0010, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
(da6:mps0:0:6:0): lost device - 0 outstanding, 3 refs
mps0: mpssas_action faking success for abort or reset
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x000d, IOCStatus= 0x0
mps0: mpssas_remove_complete on handle 0x0010, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
cam_periph_alloc: attempt to re-allocate valid device da6 rejected flags 0x18 refcount 2
daasync: Unable to attach to new device due to status 0x6
cam_periph_alloc: attempt to re-allocate valid device da5 rejected flags 0x18 refcount 2
daasync: Unable to attach to new device due to status 0x6

So, what are the reference counts for the device? That feels like the root of the issue. Can't realloc /dev/da* because something is already holding it?
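One way to keep an eye on this while testing is to pull just the rejected devices and their reported refcounts out of the kernel messages. The following is a minimal sketch, not a diagnosis: the function name stale_periphs is made up for illustration, and it only matches the cam_periph_alloc message format shown in the log above.

```shell
#!/bin/sh
# stale_periphs (hypothetical helper name): read kernel messages on stdin and
# print one line per device that CAM refused to re-attach, together with the
# flags and refcount from the message.
stale_periphs() {
    sed -n 's/.*cam_periph_alloc: attempt to re-allocate valid device \([a-z]*[0-9]*\) rejected flags \(0x[0-9a-f]*\) refcount \([0-9]*\).*/\1 flags=\2 refcount=\3/p'
}
```

Typical use would be `dmesg | stale_periphs`. Fed the messages above, it should list da6 and da5, each with flags=0x18 and refcount=2. It says nothing about what is holding the references, only which devices are affected.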

Reboot fixes it all.

Three more bits here:
1) Flashed the 9211 at P17, that's what was available. Driver is at P14, and looks to be built into the kernel, so can't update the module. Could be an issue there...
2) There are a few Google hits, some going back to 2002, but no real resolutions that I've seen.
3) Full disclosure. This *is* a VM. HW pass-through of NICs and HBA. Supermicro Server. It's not a hobby build.

I'm happy to help debug if someone wants more info...

Thanks,

...alan

cyberjock · Jan 5, 2014

Some controllers do weird things if you disconnect and reconnect the same disk. Things also don't necessarily work properly with a passed-through controller. I'm running FreeNAS in a VM myself with an M1015 reflashed to IT mode, and I make it a priority to NEVER do disk replacements with the system on, for the reasons you just mentioned.

ayc · Jan 6, 2014

What's the best way to fail a disk for testing? I'll dd over parts of the drive, but I'd also like to inject errors at other levels?

...alan

Dusan · Jan 8, 2014

ayc said:
What's the best way to fail a disk for testing? I'll dd over parts of the drive, but I'd also like to inject errors at other levels?

That may not work as the system will prevent you from writing (dd) directly to the drives. You can run "sysctl kern.geom.debugflags=0x10" to override some of the checks. Then there are two more complex ways to inject errors that I know of:

If you know how to create pools via CLI, you can use gnop to simulate read failures and device removals.
If you are able to recompile FreeNAS you can build the kernel with the ADA_TEST_FAILURE option. This gives you three sysctls you can use to inject errors: kern.cam.ada.X.periodic_read_error, kern.cam.ada.X.force_read_error, kern.cam.ada.X.force_write_error. (This is on my list of things to try :)).
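The gnop route could look roughly like the sketch below. It is an illustration under assumptions, not a tested FreeNAS procedure: the disk names (da5, da6), the pool name (scratch), and the helper function names are placeholders, and DRY_RUN defaults to 1 so the script only prints the commands it would run. It should only ever be pointed at scratch disks.

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from the thread): two scratch disks
# and a throwaway pool name. DRY_RUN=1 (the default) prints commands only.
DISK_A=${DISK_A:-da5}
DISK_B=${DISK_B:-da6}
POOL=${POOL:-scratch}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Allow writes to providers that are in use (the override mentioned above).
allow_raw_writes() {
    run sysctl kern.geom.debugflags=0x10
}

# Build a mirror on top of gnop providers, then make reads on one side fail.
inject_read_errors() {
    # Creates pass-through providers da5.nop and da6.nop.
    run gnop create "$DISK_A" "$DISK_B"
    run zpool create "$POOL" mirror "${DISK_A}.nop" "${DISK_B}.nop"
    # -r is the read failure probability in percent.
    run gnop configure -r 100 "${DISK_A}.nop"
}

# Simulate a device removal by force-destroying one gnop provider.
simulate_removal() {
    run gnop destroy -f "${DISK_A}.nop"
}
```

Calling inject_read_errors and then running a scrub or reading data back should make ZFS report errors on one side of the mirror, and simulate_removal should leave the pool degraded. Setting DRY_RUN=0 makes the functions execute the commands instead of printing them.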

cyberjock · Jan 8, 2014

You can also pull a disk from a working system. Keep in mind you should NOT do this on a pool/server with important data. For experimenting and learning it's great. For production, NEVER pull a disk without offlining it first!

Or, shut down the server and remove a disk.

ayc · Jan 8, 2014

cyberjock said:
You can also pull a disk from a working system. Keep in mind you should NOT do this on a pool/server with important data. For experimenting and learning it's great. For production, NEVER pull a disk without offlining it first!

Or, shut down the server and remove a disk.

Right, the point is to understand the failure cases and how to recover from them *before* the system goes into production. Disks will fail, HBAs will get flaky, the UPS will go belly up taking out the power during a write, etc. I want to simulate this as much as possible up front and know what it looks like.

@Dusan, thanks for the pointers. I'm new to both FreeBSD and FreeNAS, hence the simple questions. I'll grab the source and rebuild to get the error injection.

Thanks,
...alan
