Testing disk failures - can't wipe or replace disk

Status
Not open for further replies.

ayc

Cadet
Joined
Jan 5, 2014
Messages
4
I have a new 9.2 install. LSI 9211 HBA, IT mode, WD RE4 disks. I'm testing failure cases before putting the system into production.

I pulled two drives, saw the system was degraded but still accessible, received email. All good.

Put the two drives back. "View Disks" shows the drives, no serial numbers. Hmmm. "Replace Disk" doesn't let me choose these replacement drives. Ok, maybe because they're still labelled?

Tried to wipe the disks. I get:

"Error: Failed to wipe da5p1: dd /dev/da4p1: Operation not supported".

Back to the shell, dd on that partition truly does give an error. camcontrol can see the drives.

I'm new to FreeBSD, so I'm not sure how to debug. I'm thinking maybe the /dev/da5p1 is stale and not being rebuilt on hotplug? Pull the drives again, camcontrol rescan. /dev/da5* is gone, but "View Disks" still shows it. Looks like the disk list is cached. Not good, but a different problem.

Put the drives back. Still no go. kernel says:
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x000d, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
(da5:mps0:0:5:0): lost device - 0 outstanding, 3 refs
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x0010, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
(da6:mps0:0:6:0): lost device - 0 outstanding, 3 refs
mps0: mpssas_action faking success for abort or reset
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x000d, IOCStatus= 0x0
mps0: mpssas_remove_complete on handle 0x0010, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
cam_periph_alloc: attempt to re-allocate valid device da6 rejected flags 0x18 refcount 2
daasync: Unable to attach to new device due to status 0x6
cam_periph_alloc: attempt to re-allocate valid device da5 rejected flags 0x18 refcount 2
daasync: Unable to attach to new device due to status 0x6
So, what are the reference counts for the device? That feels like it's the root of the issue: the kernel can't re-allocate /dev/da* because something is still holding the old device?
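To make the rejected re-attach easier to spot in a long kernel log, a small filter along these lines may help. It is only a sketch: the function name `stale_periphs` is made up for illustration, and the pattern is taken solely from the `cam_periph_alloc` lines quoted above, so other kernel versions may word the message differently.

```shell
#!/bin/sh
# Hypothetical helper: list CAM peripherals that the kernel refused to
# re-attach, based on messages of the form quoted in this post:
#   cam_periph_alloc: attempt to re-allocate valid device da6 rejected flags 0x18 refcount 2
# Reads log text on stdin and prints "<device> <refcount>", one line per device.
stale_periphs() {
    sed -n 's/^.*cam_periph_alloc: attempt to re-allocate valid device \([a-z]*[0-9]*\) rejected flags [^ ]* refcount \([0-9]*\).*$/\1 \2/p' |
        sort -u
}

# Possible use on the affected host:
#   dmesg | stale_periphs
```

Any device it lists is one whose old instance still had references when the disk came back. Comparing that list against `camcontrol devlist` and `ls /dev/da*` shows whether the device nodes were ever recreated.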

Reboot fixes it all.

Three more bits here:
1) Flashed the 9211 to P17 firmware, that's what was available. The driver is at P14, and looks to be built into the kernel, so I can't update the module. Could be an issue there...
2) There are a few Google hits, some going back to 2002, but no real resolutions that I've seen.
3) Full disclosure. This *is* a VM. HW pass-through of NICs and HBA. Supermicro Server. It's not a hobby build.

I'm happy to help debug if someone wants more info...

Thanks,

...alan
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Some controllers do weird things if you disconnect and reconnect the same disk. Things also don't necessarily work properly with a passed-through controller. I'm running FreeNAS in a VM myself with an M1015 reflashed to IT mode, and I make it a priority to NEVER do disk replacements with the system on, for the reasons you just mentioned.
 

ayc

Cadet
Joined
Jan 5, 2014
Messages
4
What's the best way to fail a disk for testing? I'll dd over parts of the drive, but I'd also like to inject errors at other levels?

...alan
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
ayc said: "What's the best way to fail a disk for testing? I'll dd over parts of the drive, but I'd also like to inject errors at other levels?"
That may not work as the system will prevent you from writing (dd) directly to the drives. You can run "sysctl kern.geom.debugflags=0x10" to override some of the checks. Then there are two more complex ways to inject errors that I know of:
  1. If you know how to create pools via CLI, you can use gnop to simulate read failures and device removals.
  2. If you are able to recompile FreeNAS you can build the kernel with the ADA_TEST_FAILURE option. This gives you three sysctls you can use to inject errors: kern.cam.ada.X.periodic_read_error, kern.cam.ada.X.force_read_error, kern.cam.ada.X.force_write_error. (This is on my list of things to try :)).
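As a rough illustration of the gnop route in option 1, the sketch below prints a command sequence for review instead of running anything. Everything specific in it is an assumption: the function name `gnop_plan`, the image files under /tmp, the md unit numbers 10 and 11 (assumed to be free), and the scratch pool name. It uses file-backed md devices so no real disk is touched, and it assumes the geom_nop class is available on the system. It has not been checked against FreeNAS itself, so a plain FreeBSD test VM may be the safer place to try it.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT execute) a gnop fault-injection session
# on a scratch mirror built from two file-backed md devices.
#   $1 = scratch pool name (must not be an existing pool)
#   $2 = read failure probability in percent, passed to "gnop configure -r"
gnop_plan() {
    pool=$1
    rfail=$2
    cat <<EOF
truncate -s 128m /tmp/${pool}-a.img /tmp/${pool}-b.img
mdconfig -a -t vnode -f /tmp/${pool}-a.img -u 10
mdconfig -a -t vnode -f /tmp/${pool}-b.img -u 11
gnop create /dev/md10 /dev/md11
zpool create ${pool} mirror /dev/md10.nop /dev/md11.nop
gnop configure -r ${rfail} /dev/md10.nop
zpool scrub ${pool}
zpool status -v ${pool}
gnop destroy -f /dev/md10.nop
zpool status -v ${pool}
zpool destroy ${pool}
gnop destroy /dev/md11.nop
mdconfig -d -u 10
mdconfig -d -u 11
EOF
}

# Example: show the plan for a pool called "testpool" with 50% read failures.
gnop_plan testpool 50
```

The `gnop configure -r` step simulates read failures on one side of the mirror, and the forced `gnop destroy -f` simulates the device disappearing. The two `zpool status -v` calls are where you would look at how ZFS reports each case.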
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You can also pull a disk from a working system. Keep in mind you should NOT do this on a pool/server with important data. For experimenting and learning it's great. For production, NEVER pull a disk without offlining it first!

Or, shut down the server and remove a disk.
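For reference, the "offline it first" step can be sketched from the command line as below. This is an assumption-laden illustration rather than the FreeNAS procedure: the function name `offline_plan` and the pool and device names are placeholders, and on FreeNAS the GUI is normally the place to do this. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT execute) the zpool commands for taking a
# pool member offline before pulling it, and bringing it back afterwards.
#   $1 = pool name, $2 = device name exactly as shown by "zpool status"
offline_plan() {
    pool=$1
    dev=$2
    cat <<EOF
zpool status ${pool}
zpool offline ${pool} ${dev}
zpool status ${pool}
zpool online ${pool} ${dev}
zpool status ${pool}
EOF
}

# Example with placeholder names; the disk would be pulled and reinserted
# between the "offline" and "online" steps.
offline_plan tank da5p2
```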
 

ayc

Cadet
Joined
Jan 5, 2014
Messages
4
cyberjock said: "You can also pull a disk from a working system. Keep in mind you should NOT do this on a pool/server with important data. For experimenting and learning it's great. For production, NEVER pull a disk without offlining it first!

Or, shut down the server and remove a disk."


Right, the point is to understand the failure cases and how to recover from them *before* the system goes into production. Disks will fail, HBAs will get flaky, the UPS will go belly up taking out the power during a write, etc. I want to simulate this as much as possible up front and know what it looks like.

@Dusan, thanks for the pointers. I'm new to both FreeBSD and FreeNAS, hence the simple questions. I'll grab the source and rebuild to get the error injection.

Thanks,
...alan
 