Lost my zpool

Status
Not open for further replies.

St1x

Cadet
Joined
Nov 18, 2014
Messages
3
Hello, I'm looking for some help in recovering my zpool please.

Spec:

HP ProLiant N40L
8GB RAM
3 x 2TB Seagate drives
1 zpool volume
FreeNAS 9.2.1.7

Situation:

One of the Seagates was reporting checksum errors:

pool: Freenas
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: resilvered 25.1G in 0h13m with 0 errors on Wed Nov 5 21:23:30 2014
config:

NAME          STATE     READ WRITE CKSUM
Freenas       ONLINE       0     0     0
  raidz1-0    ONLINE       0     0     0
    ada0p2    ONLINE       0     0     0
    ada1p2    ONLINE       0     0     0
    ada2p2    ONLINE       0     0    22

errors: No known data errors
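
With more disks the non-zero counters are easy to miss in this output. Below is a minimal sketch, assuming the usual five-column device table, that prints only the rows with a non-zero READ, WRITE or CKSUM value. The function name `flag_errors` is made up for illustration, and the sample input is the table from this post.

```shell
#!/bin/sh
# Print every device row of `zpool status` output whose READ, WRITE or
# CKSUM counter is non-zero. Reads the status text on stdin.
# flag_errors is a hypothetical helper name, not part of ZFS.
flag_errors() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the "errors:" line ends the table.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Device rows carry at least NAME STATE READ WRITE CKSUM.
        in_table && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Sample input: the device table quoted in this post.
flag_errors <<'EOF'
NAME STATE READ WRITE CKSUM
Freenas ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada0p2 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada2p2 ONLINE 0 0 22

errors: No known data errors
EOF
# prints: ada2p2 read=0 write=0 cksum=22
```

On a live system the same function could be fed directly, e.g. `zpool status Freenas | flag_errors`.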

It recommended replacing the drive, so I bought another (this time a Samsung) and started the replacement process. I had quite a few issues with this: the new drive wouldn't resilver, and the system kept crashing at 93.8% with this:

> GEOM_ELI: Device ada3p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI: Crypto: software
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 3c 40 40 e0 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 08 3c 40 40 e0 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 3c 40 40 e0 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 08 3c 40 40 e0 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 3c 40 40 e0 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 08 3c 40 40 e0 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 3c 40 40 e0 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 08 3c 40 40 e0 00 00 00 00
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 08 3c 40 40 e0 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> (ada3:ahcich3:0:0:0): RES: 41 40 08 3c 40 40 e0 00 00 00 00
> (ada3:ahcich3:0:0:0): Error 5, Retries exhausted
> ahcich3: Timeout on slot 5 port 0
> ahcich3: is 00000000 cs 00000060 ss 00000000 rs 00000060 tfd c0 serr 00000000 cmd 0000e517
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 23 port 0
> ahcich3: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd c0 serr 00000000 cmd 0000f717
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 13 port 0
> ahcich3: is 00000000 cs 00006000 ss 00000000 rs 00006000 tfd c0 serr 00000000 cmd 0000ed17
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 25 port 0
> ahcich3: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd c0 serr 00000000 cmd 0000f917
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 31 port 0
> ahcich3: is 00000000 cs 80000001 ss 00000000 rs 80000001 tfd c0 serr 00000000 cmd 0000ff17
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 28 port 0
> ahcich3: is 00000000 cs 30000000 ss 00000000 rs 30000000 tfd c0 serr 00000000 cmd 0000fc17
> (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
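
For anyone reading the log above: the `ATA status` and `error` values are hex bytes whose bits can be decoded by hand. Here is a minimal sketch, assuming the classic ATA register bit layout and listing only the common bits. The helper names `decode_bits`, `decode_status` and `decode_error` are made up for illustration.

```shell
#!/bin/sh
# Decode the ATA status and error register bytes that the kernel prints
# in hex. Helper names are hypothetical; only common bits are listed.
decode_bits() {
    value=$(( 0x$1 ))   # hex byte as printed in the log, e.g. 41
    shift
    out=""
    # Remaining arguments are MASK:NAME pairs.
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( $value & $mask )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_status() { decode_bits "$1" 0x80:BSY 0x40:DRDY 0x20:DF 0x08:DRQ 0x01:ERR; }
decode_error()  { decode_bits "$1" 0x80:ICRC 0x40:UNC 0x10:IDNF 0x04:ABRT; }

decode_status 41   # prints: DRDY ERR
decode_error 40    # prints: UNC
```

This matches what the kernel already prints in parentheses: the drive on ada3 was ready, flagged an error, and reported the requested data as uncorrectable (UNC) on every retry of the same read.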

I had to restart the machine, and it wouldn't even boot through to the web GUI. It crashed, with the last four lines showing this:

KBD: enter: panic
[ thread pid 661 tid 100075 ]
Stopped at kbd_enter+0x3b: movq $0,0xaed112(%rip)
db>

So this time I downloaded 9.2.1.8 onto a fresh USB stick, put the 3 original disks back in, and it booted through to the web GUI. I had recently exported my config for the box, so I restored that to FreeNAS and it went back to the previous state: it would crash before it could load the web GUI, with the above KBD: enter: panic error.

I have wiped the USB stick and started again with 9.2.1.8, and it now boots through to the web GUI. I can run the following commands:

[root@freenas] ~# camcontrol devlist
<ST2000DL003-9VT166 CC3C> at scbus0 target 0 lun 0 (pass0,ada0)
<ST2000DL003-9VT166 CC3C> at scbus1 target 0 lun 0 (pass1,ada1)
<ST2000DL003-9VT166 CC3C> at scbus2 target 0 lun 0 (pass2,ada2)
<Kingston DataTraveler 2.0 PMAP> at scbus7 target 0 lun 0 (da0,pass3)

[root@freenas] ~# gpart show
=>        34  3907029101  ada0  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834703     2  freebsd-zfs  (1.8T)

=>        34  3907029101  ada1  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834703     2  freebsd-zfs  (1.8T)

=>        34  3907029101  ada2  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834703     2  freebsd-zfs  (1.8T)

=>      63  30489345  da0  MBR  (14G)
        63   1930257    1  freebsd  [active]  (942M)
   1930320        63       - free -  (31k)
   1930383   1930257    2  freebsd  (942M)
   3860640      3024    3  freebsd  (1.5M)
   3863664     41328    4  freebsd  (20M)
   3904992  26584416       - free -  (12G)

=>      0  1930257  da0s1  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)

[root@freenas] ~# zpool status
no pools available

When I try a zpool import, it starts streaming text so fast I can't read it, then it reboots the system and I'm back to a completely fresh install of 9.2.1.8 with no data.

My main aim is to get the pool back up in a state where I can recover the data; after that it can be destroyed. I'm not sure what went wrong.

I'm not very experienced with FreeNAS, so if anybody could point me in the right direction it would be hugely appreciated. I managed to lose all my data.

Thanks!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That's what we call "a corrupted pool". Your data is basically gone forever. This is yet another example of why the "RAIDZ1 is dead" post is in my signature. ;)

You almost certainly have corruption of the disk that's bad and another disk. Without enough redundancy ZFS doesn't know what to do, so it throws up all over your screen.

You really don't have much in terms of options. We keep telling people not to use RAIDZ1. Sadly, people keep using it anyway.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
On the theory that you've got nothing left to lose, what happens if you try importing the pool with only the two good drives (ada0 ada1) connected?

Were ada2 and ada3 connected at the same time during the first resilver attempt? (The failing drive and its replacement?)

Had scrubs ever been run on this pool in the past?
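
For the import attempt suggested above, a read-only import is the least invasive variant to try. This is a rough sketch, not a known fix: a plain `zpool import` already panics this box, so these may well do the same. The pool name comes from this thread, `/mnt` is assumed as the usual FreeNAS mount root, and `import_cmds` is a made-up helper that only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Dry-run helper: prints the import commands instead of running them.
# POOL is taken from this thread; ALTROOT is an assumption (the usual
# FreeNAS mount root). import_cmds is a hypothetical name.
POOL="Freenas"
ALTROOT="/mnt"

import_cmds() {
    # 1. List importable pools without importing anything.
    echo "zpool import"
    # 2. Check whether rewind recovery could work; -n with -F makes no changes.
    echo "zpool import -F -n $POOL"
    # 3. Import read-only, without mounting datasets (-N), under an alternate root.
    echo "zpool import -o readonly=on -N -f -R $ALTROOT $POOL"
}

import_cmds
```

If the read-only import survives, the datasets can be mounted afterwards and the data copied off before anything else is attempted.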
 

St1x

Cadet
Joined
Nov 18, 2014
Messages
3
Just gave it a go with only the ada0 and ada1 disks connected; it streams text and then restarts the box when trying the zpool import.

At one point ada2 and ada3 were connected at the same time during a resilvering attempt.

No scrubs had been run on the pool; it had been up for around 2 years.

I'm going through your guides now, cyberjock ;)

Does anyone have an idea of why it failed? I have a feeling the SATA controller may be at fault. Not sure if I can trust this hardware anymore.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I would be suspicious about either the port for ada3, the cable, or the new drive. Same with ada2, because you had checksum errors, not read errors.

In fact, you should run a memory test on the machine just as a matter of course.

In theory, I don't see any reason for the failure (yet).
 

St1x

Cadet
Joined
Nov 18, 2014
Messages
3
I think I'm going to test the drives for faults to see which was causing issues. Does anyone have any recommended programs for this?
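
One option that doesn't need extra software is `smartctl`, which ships with FreeNAS. Below is a minimal sketch, assuming the disks still show up as ada0 to ada2 as in the `camcontrol devlist` output earlier in the thread. `smart_cmds` is a made-up helper that only prints the commands for review.

```shell
#!/bin/sh
# Dry-run helper: prints smartctl commands for each disk instead of
# running them. Device names are assumed from the camcontrol output in
# this thread; smart_cmds is a hypothetical name.
DISKS="ada0 ada1 ada2"

smart_cmds() {
    for d in $DISKS; do
        # Start an extended (long) self-test; it runs inside the drive.
        echo "smartctl -t long /dev/$d"
    done
    for d in $DISKS; do
        # Once the tests finish: full report with attributes and self-test log.
        echo "smartctl -a /dev/$d"
    done
}

smart_cmds
```

A long test on a 2TB drive typically takes several hours. In the report, the reallocated and pending sector counts and the self-test log are the first things to look at.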

Then a complete rethink on my setup. Can't trust the HP N40L anymore so it's time for new hardware. I'm thinking the TS-470 Pro.
 