SOLVED Lost Pool due to hardware failure

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
All,
I was hoping that someone might have an idea here. I suspect that the situation is unrecoverable - but you never know.
Yesterday I had three pools: BigPool, LittlePool & SSDPool. Following a hardware failure of a power cable to the backplane, and a load of WTF-type debugging, I located the failed cable and replaced it, and the backplane is now recognised again. In the process a few cables may have moved around between the LSI card and the backplane.
I now have 2 pools, LittlePool & SSDPool. BigPool has vanished completely.
BigPool had about 10TB of data on it, which is backed up off-site, and some (most) backed up locally - the first backup hadn't finished for some of the data.
BigPool was 6 drives in 3 mirrored vdevs. The six drives are recognised, but are now all marked as unused.

Hopefully the system details are in my sig, but if not they will follow shortly.

Anyone got any ideas, or should I just give up, recreate and restore?

I do have a video of the boot process which may contain clues
I have had to remove the slog drive from bigpool which was indirectly the cause of the power cable issue

Video at: https://www.dropbox.com/sh/eqo6asf5n4hlixd/AAC3-RvVxbjXNTvd6M5LO9dQa?dl=0
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
  pool: LittlePool
 state: ONLINE
  scan: resilvered 64K in 0 days 00:00:01 with 0 errors on Sun Jun 7 22:16:11 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    LittlePool                                      ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/a172f461-9235-11ea-8e93-3cecef01f45e  ONLINE       0     0     0
        gptid/a6b9f712-9d2e-11ea-9fe7-000743315260  ONLINE       0     0     0
        gptid/a2581b87-9235-11ea-8e93-3cecef01f45e  ONLINE       0     0     0
        gptid/a271eaf3-9235-11ea-8e93-3cecef01f45e  ONLINE       0     0     0

errors: No known data errors

  pool: SSDPool
 state: ONLINE
  scan: resilvered 384K in 0 days 00:00:00 with 0 errors on Sun Jun 7 19:14:35 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    SSDPool                                         ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/b44c49aa-9c42-11ea-9fe7-000743315260  ONLINE       0     0     0
        gptid/b46ba216-9c42-11ea-9fe7-000743315260  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/0840539d-9d2c-11ea-9fe7-000743315260  ONLINE       0     0     0
        gptid/08602324-9d2c-11ea-9fe7-000743315260  ONLINE       0     0     0
      mirror-3                                      ONLINE       0     0     0
        gptid/179ee884-9d2c-11ea-9fe7-000743315260  ONLINE       0     0     0
        gptid/17be0f7a-9d2c-11ea-9fe7-000743315260  ONLINE       0     0     0
    logs
      nvd0p2                                        ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:26 with 0 errors on Tue Jun 2 03:45:26 2020
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      ada6p2      ONLINE       0     0     0

errors: No known data errors
root@freenas[/tmp]#
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
No mention of BigPool, although it still appears in the GUI - but as "? Unknown"
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
If I look at /dev I see (for example) ada3 and ada3p1 and ada3p2. This seems to be the same across all 6 drives (different device names obviously)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Oooo

root@freenas[/tmp]# zpool import
   pool: BigPool
     id: 5156537368234474465
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
         devices and try again.
    see: http://illumos.org/msg/ZFS-8000-6X
 config:

    BigPool                                         UNAVAIL  missing device
      mirror-0                                      ONLINE
        gptid/c0cd4836-9ab4-11ea-9043-000743315260  ONLINE
        gptid/c0ebe5d9-9ab4-11ea-9043-000743315260  ONLINE
      mirror-1                                      ONLINE
        gptid/d2b4b0f4-9ab4-11ea-9043-000743315260  ONLINE
        gptid/d2f05784-9ab4-11ea-9043-000743315260  ONLINE
      mirror-2                                      ONLINE
        gptid/e68cc6d8-9ab4-11ea-9043-000743315260  ONLINE
        gptid/e69cb1c1-9ab4-11ea-9043-000743315260  ONLINE
    logs
      8442316613177910272                           FAULTED  corrupted data

    Additional devices are known to be part of this pool, though their
    exact configuration cannot be determined.
root@freenas[/tmp]#
The way I interpret that: it looks like I may have to put the SLOG back.
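(For readers hitting the same output: ZFS can usually import a pool whose separate log device is gone, at the cost of any synchronous writes that only ever reached the SLOG. A minimal sketch, assuming the pool name from the output above; the command is guarded so the script is a harmless no-op on machines without ZFS:)

```shell
# Sketch: import a pool whose SLOG device is missing or faulted.
# Pool name "BigPool" is taken from the zpool import output above.
# Guarded so this is a no-op on systems without ZFS installed.
if command -v zpool >/dev/null 2>&1; then
    # -m: proceed without the missing log device; any synchronous
    # writes that lived only on the SLOG are discarded.
    zpool import -m BigPool || true   # non-fatal if the pool is absent
fi
STATUS="sketch-complete"
echo "$STATUS"
```

Putting the physical SLOG back, as was done later in this thread, is the safer option when the device still works; `-m` is the fallback when it does not.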
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
If I look at /dev I see (for example) ada3 and ada3p1 and ada3p2.

Hey @NugentS,

Yes, these partitions were created by FreeNAS. The p1 partition is small (2 GiB by default) and is used for swap. The second one, p2, is the one containing the data.

You did not say whether your pool was encrypted. I hope it was not, as that would put you further away from your own data and would require you to be extra cautious...
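(To see those partitions, and to map the gptid/... labels in the zpool output back to device names, FreeBSD's geom tools can help. A sketch; "ada3" is only an example device name, and each command is skipped where the tool is absent:)

```shell
# Sketch: inspect FreeNAS disk partitioning on FreeBSD.
# "ada3" is an example device name from the thread; the guards make
# this a no-op on systems without the FreeBSD geom tools.
if command -v gpart >/dev/null 2>&1; then
    gpart show ada3 || true    # shows p1 (swap) and p2 (ZFS data)
fi
if command -v glabel >/dev/null 2>&1; then
    glabel status || true      # maps gptid/... labels to ada*/da* parts
fi
STATUS="sketch-complete"
echo "$STATUS"
```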
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Pool is not encrypted - that's a world of hurt I don't need
:)
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Pool is not encrypted - thats a world of hurt I don't need

Good for you! I don't need that one either.

So, as zpool import told you, you are missing devices: either they are not connected at all, or not connected properly. Double-check your hardware and make sure everything is plugged in correctly.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
The missing device is an Intel Optane 900P or summat like that. It's on my desk, as its power connection to the server is what caused the issue in the first place.

Crappy power lead
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Might put the fans back in as well (that I seem to have left out of the fanwall)

LOL
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
The missing device is an Intel Optane 900P or summat like that. It's on my desk, as its power connection to the server is what caused the issue in the first place.

Crappy power lead

You just got a first-class demo of how a single server, no matter whether FreeNAS or anything else, is still a single point of failure. You should be able to recover from this one, but you are now very close to the point where you would have been forced to rely on your backups.

So once this is over, be sure not only to complete your backups, but also to verify that they can actually be restored. Once your restore test is successful, make sure the backup procedure you designed runs at a frequency that keeps up with the rate your data changes.

Don't do as most and start your backups after losing it all. Way better to do it before :)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
All the lost data is off-site (but I do not fancy copying 10TB across my internet connection).
Backups - when I tested initially, they did restore.

When I started looking at FreeNAS, I built it on a VMware setup I had. I wasn't going to pull the trigger on a FreeNAS box without knowing how I was going to do backups, both of VMs and data.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Woohoo. BigPool is back. And I have a higher-quality (I hope) splitter cable on its way from Amazon. Blasted Molex connectors are horrid.

@garm - thank you for the pointers. I had exhausted my knowledge and didn't want to flail around and completely break things. If it wasn't for you I would have given up and reformatted the pool.

@Heracles - for keeping me hopeful and joining in

@ixsystems - for FreeNAS, as I don't think much else would have survived my messing around. I am impressed how survivable ZFS is over "normal" RAID.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@NugentS Did you just have to plug the optane drive back in to import the pool again? I'm curious because I thought that the loss of a log drive wouldn't affect access to the pool or ability to import it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Yes I did. Without the optane in place I could not import the pool.
I thought I was safe without the optane - but apparently not.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Thanks. @Heracles or @garm I wonder if you could clarify for me: If my Nas/pool with a SLOG device attached to it suffers a power loss, which also knocks the SLOG disk out of commission, does that mean on bootup I wouldn't be able to import the pool because it's missing?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
If my Nas/pool with a SLOG device attached to it suffers a power loss, which also knocks the SLOG disk out of commission, does that mean on bootup I wouldn't be able to import the pool because it's missing?

It is possible... depending on what is written in the log and still to be transferred to the pool. In that scenario, the content of the SLOG is lost. So how critical was that content? It could be anything from plain nothing, if the SLOG happened to be empty, up to the most critical part of the ZFS structure. That is why it is always recommended to have a power-protected device as a SLOG. Or, even better, power-protect your entire server with a UPS. In case of a power outage, the UPS will keep the server up and give you enough time for a graceful shutdown. Should there be a big spike, high enough to damage things, the UPS will sacrifice itself and, again, run the load from its battery.
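(Related to this: if a SLOG device has to be pulled while the pool is still healthy, it can be removed from the pool first, so the pool never has to be imported with a missing log vdev at all. A sketch; the pool and device names below are examples only, and the command is skipped where ZFS is absent:)

```shell
# Sketch: detach a SLOG cleanly before physically removing the drive.
# Pool/device names are examples only - "zpool status" shows the
# actual log vdev name. Guarded no-op on systems without ZFS.
if command -v zpool >/dev/null 2>&1; then
    # Removing the log vdev flushes pending entries to the main pool
    # and drops the device from the pool configuration.
    zpool remove BigPool nvd0p2 || true   # non-fatal in this sketch
fi
STATUS="sketch-complete"
echo "$STATUS"
```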

You can read this post by @jgreco for more info.
 