Testing external enclosure failure

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
I've just upgraded a test system I'm running with an external enclosure (MD1400). I wanted to test a scenario that bothers me: how to bring back a pool if the enclosure disconnects (operator error, the enclosure losing power before TrueNAS does during an outage, etc.).

So I did: I disconnected the enclosure (with test data on it) by pulling the SAS cable. As expected, the pool is UNAVAIL, the mirror vdev is UNAVAIL, and the individual disks under that mirror are REMOVED. Fair enough.

Then I did what I would do in a real-life scenario: I simply reconnected the enclosure (SAS3 cable). TrueNAS sees the disks again, and the Disks screen even shows them as part of the pool.

But the pool remains UNAVAIL and the disks stay REMOVED. What's the process to just tell TrueNAS "hey, the disks are back, import that pool"?

If I were to connect 6 disks in RAIDZ1 from another system, my understanding is I could get that data back. This is sort of the same situation, but I'm wondering how to proceed at this point.
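(For what it's worth, here's my understanding of the shell side of importing a pool from another system; "tank" is just an example pool name:)

# With no arguments, zpool lists pools that are visible and available for import
zpool import

# Import a specific pool by name
zpool import tank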

My goal for this "test" is to bring back the pool, with the 3 files I had on it, as if nothing had happened.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
I may have the beginning of an answer. I tried to "online" the disks, but it said "I/O operations suspended".

I googled around, figured it was only test data, so I ran "zpool clear poolname" and then imported the pool again.
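(For the record, a rough sketch of what I ran, with "poolname" and the disk name as placeholders:)

# What I tried first: bringing a REMOVED disk back online - this
# failed with the "I/O operations suspended" message
zpool online poolname da1

# What actually worked: clear the suspended/error state on the pool
zpool clear poolname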

It came back as UNHEALTHY (even though both mirrored drives were ONLINE), but after a SMART test and a scrub it seems good. The data is there.

It might have to do with the fact that a SLOG disk stayed online (it was in the TrueNAS server, not the enclosure), so maybe things didn't come back up as easily.

I'll run more tests tomorrow, so if you're interested in this, stay tuned to this thread!
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
So, I went with the most basic test: no SLOG or other special vdev, just two mirrored disks in the external enclosure connected with a SAS cable (eventually two SAS cables for redundancy).

I powered off the enclosure, simulating a pulled SAS cable or a loss of power. The pool went UNAVAIL and the relevant disks became REMOVED, all as expected. Getting the pool back, though, wasn't as easy as expected. Here are the steps I figured out; please feel free to comment on how I could have done better, or whether this is the wrong way to do it.

1) I had to go into the command shell and run "zpool clear [poolname]". This removed the UNAVAIL pool from the list of pools; it simply vanished.
2) Then I re-imported the pool (Pool - Add - Import existing pool).
3) The pool came back with status UNHEALTHY. A simple scrub fixed that, and the data was there as before. (The shell equivalent of these steps is sketched below.)
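(Roughly, the shell equivalent of those three steps, with "poolname" as a placeholder:)

# 1) Clear the suspended state; the UNAVAIL pool drops off the pool list
zpool clear poolname

# 2) Re-import the pool (or use the UI import)
zpool import poolname

# 3) Scrub to clear the UNHEALTHY status and verify the data
zpool scrub poolname

# Check progress and confirm the pool returns to ONLINE
zpool status poolname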

Things I couldn't do, because I/O was suspended on the pool (permanently, it seems, until I cleared it):
- Export pool
- Bring vdevs back online

Things that "worry me"
- If I add a SLOG in this scenario (even better - a SLOG that never disconnected as it's in the server, not the enclosure) I saw the same thing. But will the data in the SLOG be flushed to the disks when the lost disks reappear? I'm not sure how to test that if I wanted to.

Isn't there a UI-friendly way of saying "yes, I know a cable was pulled, please mark the pool as good and bring it back up"? While I don't mind these steps all that much, who's to say some other IT person less familiar with all this won't be the one handling it when it happens? It is, after all, a pretty "typical" scenario when using an enclosure, specifically if it loses power before the server does in a power outage. I can't believe TrueNAS, which prides itself on filesystem reliability, doesn't handle something like that. I must be the one who can't find the right place to do it.

Comments welcome - that's my only real worry left before I put this thing in production.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
- If I add a SLOG in this scenario (even better - a SLOG that never disconnected as it's in the server, not the enclosure) I saw the same thing. But will the data in the SLOG be flushed to the disks when the lost disks reappear? I'm not sure how to test that if I wanted to.

The SLOG should replay its transactions to the data vdev(s) as part of importing the pool (your step 2), regardless of its location (in the head or in the shelf). This can be tested in a simple manner: write a number of sequentially-named files to a dataset with sync=always and make a note of "what was the last file copied before I yanked the power?" The file that was in the process of being copied will of course be partially complete/failed, but that is expected.
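A minimal sketch of that test, assuming a dataset named tank/synctest (the names and paths are placeholders):

# Force all writes on the dataset through the SLOG
zfs set sync=always tank/synctest

# Write sequentially-numbered 1 MiB files until the power is pulled
i=0
while true; do
    i=$((i+1))
    dd if=/dev/urandom of=/mnt/tank/synctest/file_$(printf '%06d' "$i") bs=1M count=1
done

# After reconnecting and re-importing the pool, see which file survived last
ls /mnt/tank/synctest | sort | tail -n 1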
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
The SLOG should replay its transactions to the data vdev(s) as part of importing the pool (your step 2), regardless of its location (in the head or in the shelf). This can be tested in a simple manner: write a number of sequentially-named files to a dataset with sync=always and make a note of "what was the last file copied before I yanked the power?" The file that was in the process of being copied will of course be partially complete/failed, but that is expected.

Great idea, will test that.
 