Testing external enclosure failure

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
I've just upgraded a test system I'm running with an external enclosure (MD1400). I wanted to test a scenario that bothers me: how to bring back a pool if the enclosure disconnects (operator error, the enclosure losing power before TrueNAS does during an outage, etc.).

So I did: I disconnected the enclosure (with test data on it) by pulling the SAS cable. As expected, the pool is UNAVAIL, the mirror vdev is UNAVAIL, and the individual disks under that mirror are REMOVED. Fair enough.

Then I did what I would do in a real-life scenario: I simply reconnected the enclosure (SAS3 cable). TrueNAS sees the disks again, and the Disks screen even shows them as part of the pool.

But the pool remains UNAVAIL and the disks stay REMOVED. What's the process to just tell TrueNAS "hey, the disks are back, import that pool"?

If I were to connect 6 disks in RAIDZ1 from another system, my understanding is I could get that data back. This is sort of the same situation, but I'm wondering how to proceed at this point.
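(For what it's worth, here's my understanding of the shell side of importing a pool from another system; "tank" is just an example pool name:)

# With no arguments, zpool lists pools that are visible and available for import
zpool import

# Import a specific pool by name
zpool import tank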

My goal for this "test" is to bring back the pool, with the 3 files I had on it, as if nothing had happened.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
I may have the beginning of an answer. I tried to "online" the disks, but it said "I/O operations suspended".

I googled around, figured it was only test data, so I ran "zpool clear poolname" and then imported the pool again.
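(For the record, a rough sketch of what I ran, with "poolname" and the disk name as placeholders:)

# What I tried first: bringing a REMOVED disk back online - this
# failed with the "I/O operations suspended" message
zpool online poolname da1

# What actually worked: clear the suspended/error state on the pool
zpool clear poolname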

It came back as UNHEALTHY (even though both mirrored drives were ONLINE), but after a SMART test and a scrub it seems good. The data is there.

It might have to do with the fact that a SLOG disk stayed online (it was in the TrueNAS server, not the enclosure), so maybe things didn't come back up as easily.

I'll run more tests tomorrow, so if you're interested in this, stay tuned to this thread!
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
So, I went with the most basic test: no SLOG or other special vdev, just two mirrored disks in the external enclosure connected with a SAS cable (eventually two SAS cables for redundancy).

I powered off the enclosure, simulating a pulled SAS cable or a loss of power. The pool went UNAVAIL and the relevant disks became REMOVED, all as expected. Getting the pool back, though, wasn't as easy as expected. Here are the steps I figured out; please feel free to comment on how I could have done better, or whether this is the wrong way to do it.

1) I had to go into the command shell and run "zpool clear [poolname]". This removed the UNAVAIL pool from the list of pools; it simply vanished.
2) Then I re-imported the pool (Pool - Add - Import existing pool).
3) The pool came back with status UNHEALTHY. A simple scrub fixed that, and the data was there as before. (The shell equivalent of these steps is sketched below.)
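(Roughly, the shell equivalent of those three steps, with "poolname" as a placeholder:)

# 1) Clear the suspended state; the UNAVAIL pool drops off the pool list
zpool clear poolname

# 2) Re-import the pool (or use the UI import)
zpool import poolname

# 3) Scrub to clear the UNHEALTHY status and verify the data
zpool scrub poolname

# Check progress and confirm the pool returns to ONLINE
zpool status poolname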

Things I couldn't do, because I/O was suspended on the pool (permanently, it seems, until I cleared it):
- Export pool
- Bring vdevs back online

Things that "worry me"
- If I add a SLOG in this scenario (even better - a SLOG that never disconnected as it's in the server, not the enclosure) I saw the same thing. But will the data in the SLOG be flushed to the disks when the lost disks reappear? I'm not sure how to test that if I wanted to.

Isn't there a UI-friendly way of saying "yes, I know a cable was pulled, please mark the pool as good and bring it back up"? While I don't mind these steps all that much, who's to say some other IT person less familiar with all this won't be the one handling it when it happens? It is, after all, a pretty "typical" scenario when using an enclosure, specifically if it loses power before the server does in a power outage. I can't believe TrueNAS, which prides itself on filesystem reliability, doesn't handle something like that. I must be the one who can't find the right place to do it.

Comments welcome - that's my only real worry left before I put this thing in production.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
- If I add a SLOG in this scenario (even better - a SLOG that never disconnected as it's in the server, not the enclosure) I saw the same thing. But will the data in the SLOG be flushed to the disks when the lost disks reappear? I'm not sure how to test that if I wanted to.

The SLOG should replay its transactions to the data vdev(s) as part of importing the pool (your step 2), regardless of its location (in the head or in the shelf). This can be tested in a simple manner: write a number of sequentially-named files to a dataset with sync=always and make a note of "what was the last file copied before I yanked the power?" The file that was in the process of being copied will of course be partially complete/failed, but that is expected.
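A minimal sketch of that test, assuming a dataset named tank/synctest (the names and paths are placeholders):

# Force all writes on the dataset through the SLOG
zfs set sync=always tank/synctest

# Write sequentially-numbered 1 MiB files until the power is pulled
i=0
while true; do
    i=$((i+1))
    dd if=/dev/urandom of=/mnt/tank/synctest/file_$(printf '%06d' "$i") bs=1M count=1
done

# After reconnecting and re-importing the pool, see which file survived last
ls /mnt/tank/synctest | sort | tail -n 1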
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
The SLOG should replay its transactions to the data vdev(s) as part of importing the pool (your step 2), regardless of its location (in the head or in the shelf). This can be tested in a simple manner: write a number of sequentially-named files to a dataset with sync=always and make a note of "what was the last file copied before I yanked the power?" The file that was in the process of being copied will of course be partially complete/failed, but that is expected.

Great idea, will test that.
 