ZFS Issues after power outage

weathermon · Cadet · Joined: Dec 15, 2019 · Messages: 2
Hi guys,

We have an offsite server running FreeNAS, and over the weekend a storm caused our UPS to fail. The server is an IBM Xyratex HS-1235T with 12x 4TB SAS drives set up in a single ZFS pool. When I came in this morning I had emails from FreeNAS reporting issues with the pool:

Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
SAN1-DR       43.5T  23.3T  20.2T        -         -     7%    53%  1.00x  DEGRADED  /mnt
freenas-boot   111G  2.04G   109G        -         -      -     1%  1.00x  ONLINE  -

  pool: SAN1-DR
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
                Sufficient replicas exist for the pool to continue functioning in a
                degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
                repaired.
  scan: scrub repaired 84K in 0 days 08:35:47 with 0 errors on Sun Dec 15 08:35:51 2019
config:

                NAME                                            STATE     READ WRITE CKSUM
                SAN1-DR                                         DEGRADED     0     0     0
                  raidz1-0                                      DEGRADED   242     0     0
                    gptid/07b98107-90ba-11e9-8396-001b21bc114c  DEGRADED   165     0     0  too many errors
                    gptid/0dca0fc6-90ba-11e9-8396-001b21bc114c  DEGRADED   163     0     0  too many errors
                    gptid/13e2d0e1-90ba-11e9-8396-001b21bc114c  DEGRADED   165     0     0  too many errors
                    gptid/19ff7c11-90ba-11e9-8396-001b21bc114c  ONLINE     165     0     0
                    gptid/203e98e6-90ba-11e9-8396-001b21bc114c  DEGRADED   164     0    56  too many errors
                    gptid/26678797-90ba-11e9-8396-001b21bc114c  DEGRADED   166     0     0  too many errors
                    gptid/2c953ee1-90ba-11e9-8396-001b21bc114c  DEGRADED   162     0     0  too many errors
                    gptid/32ba96b2-90ba-11e9-8396-001b21bc114c  DEGRADED   165     0     0  too many errors
                    gptid/38e987f0-90ba-11e9-8396-001b21bc114c  FAULTED    165     0     0  too many errors
                    gptid/3f0eed31-90ba-11e9-8396-001b21bc114c  ONLINE     164     0     0
                    gptid/453b1bd1-90ba-11e9-8396-001b21bc114c  DEGRADED   166     0     0  too many errors
                    gptid/1ce329d6-fc61-11e9-bc87-001b21bc114c  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --


All of the drives were working fine beforehand. Is this a software issue caused by the power outage, or are the drives actually failing because of it? Here are the kernel errors that were also emailed to me:

Code:
freenas.san1-dr kernel log messages:
> (da3:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7e 38 00 00 01 00 00 00
> (da3:mps0:0:11:0): CAM status: SCSI Status Error
> (da3:mps0:0:11:0): SCSI status: Check Condition
> (da3:mps0:0:11:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da3:mps0:0:11:0): Field Replaceable Unit: 0
> (da3:mps0:0:11:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da3:mps0:0:11:0): Error 22, Unretryable error
> (da6:mps0:0:14:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7e 70 00 00 01 00 00 00
> (da6:mps0:0:14:0): CAM status: SCSI Status Error
> (da6:mps0:0:14:0): SCSI status: Check Condition
> (da6:mps0:0:14:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da6:mps0:0:14:0): Field Replaceable Unit: 0
> (da6:mps0:0:14:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da6:mps0:0:14:0): Error 22, Unretryable error
> (da0:mps0:0:8:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7e 28 00 00 01 00 00 00
> (da0:mps0:0:8:0): CAM status: SCSI Status Error
> (da0:mps0:0:8:0): SCSI status: Check Condition
> (da0:mps0:0:8:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da0:mps0:0:8:0): Field Replaceable Unit: 0
> (da0:mps0:0:8:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da0:mps0:0:8:0): Error 22, Unretryable error
> (da4:mps0:0:12:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d b8 00 00 01 00 00 00
> (da4:mps0:0:12:0): CAM status: SCSI Status Error
> (da4:mps0:0:12:0): SCSI status: Check Condition
> (da4:mps0:0:12:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da4:mps0:0:12:0): Field Replaceable Unit: 0
> (da4:mps0:0:12:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da4:mps0:0:12:0): Error 22, Unretryable error
> (da10:mps0:0:18:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7e 30 00 00 01 00 00 00
> (da10:mps0:0:18:0): CAM status: SCSI Status Error
> (da10:mps0:0:18:0): SCSI status: Check Condition
> (da10:mps0:0:18:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da10:mps0:0:18:0): Field Replaceable Unit: 0
> (da10:mps0:0:18:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da10:mps0:0:18:0): Error 22, Unretryable error
> (da1:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7e 40 00 00 01 00 00 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da1:mps0:0:9:0): Field Replaceable Unit: 0
> (da1:mps0:0:9:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da1:mps0:0:9:0): Error 22, Unretryable error
> (da9:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d e8 00 00 01 00 00 00
> (da9:mps0:0:17:0): CAM status: SCSI Status Error
> (da9:mps0:0:17:0): SCSI status: Check Condition
> (da9:mps0:0:17:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da9:mps0:0:17:0): Field Replaceable Unit: 0
> (da9:mps0:0:17:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da9:mps0:0:17:0): Error 22, Unretryable error
> (da5:mps0:0:13:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d d0 00 00 01 00 00 00
> (da5:mps0:0:13:0): CAM status: SCSI Status Error
> (da5:mps0:0:13:0): SCSI status: Check Condition
> (da5:mps0:0:13:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da5:mps0:0:13:0): Field Replaceable Unit: 0
> (da5:mps0:0:13:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da5:mps0:0:13:0): Error 22, Unretryable error
> (da2:mps0:0:10:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d d0 00 00 01 00 00 00
> (da2:mps0:0:10:0): CAM status: SCSI Status Error
> (da2:mps0:0:10:0): SCSI status: Check Condition
> (da2:mps0:0:10:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da2:mps0:0:10:0): Field Replaceable Unit: 0
> (da2:mps0:0:10:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da2:mps0:0:10:0): Error 22, Unretryable error
> (da7:mps0:0:15:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d c8 00 00 01 00 00 00
> (da7:mps0:0:15:0): CAM status: SCSI Status Error
> (da7:mps0:0:15:0): SCSI status: Check Condition
> (da7:mps0:0:15:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da7:mps0:0:15:0): Field Replaceable Unit: 0
> (da7:mps0:0:15:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da7:mps0:0:15:0): Error 22, Unretryable error
> (da8:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 ce b3 7d c8 00 00 01 00 00 00
> (da8:mps0:0:16:0): CAM status: SCSI Status Error
> (da8:mps0:0:16:0): SCSI status: Check Condition
> (da8:mps0:0:16:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
> (da8:mps0:0:16:0): Field Replaceable Unit: 0
> (da8:mps0:0:16:0): Descriptor 0x80: 00 00 05 24 00 00 ff ff ff ff ff ff 00 00
> (da8:mps0:0:16:0): Error 22, Unretryable error

-- End of security output --


Cheers, Mike
 

Chris Moore · Hall of Famer · Joined: May 2, 2015 · Messages: 10,080
We generally don't suggest using that many drives in a single RAIDZ1 vdev because of the risk of multi-disk failure.

In this particular instance, given the power outage, I think you are actually alright. The status says, "errors: No known data errors".
I think the read faults are probably due to the drives going offline before the server did. You could probably do a zpool clear from the command line: https://docs.oracle.com/cd/E19253-01/819-5461/gazge/index.html

Then run a scrub on the pool.
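
Roughly, something like this from the shell (a sketch, using the pool name SAN1-DR from your status output; run it as root):

Code:
# clear the logged read/checksum errors and the DEGRADED/FAULTED states
zpool clear SAN1-DR

# then kick off a full scrub of the pool
zpool scrub SAN1-DR

# re-check the pool; the "scan:" line shows scrub progress and results
zpool status -v SAN1-DR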
 

Chris Moore · Hall of Famer · Joined: May 2, 2015 · Messages: 10,080
There are some checksum errors on one of the disks, but those should be cleared up by the scrub, and the scrub will also exercise the disks. If you have bad disks, that will likely show up during the scrub.
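
If you want to keep an eye on it while the scrub runs, re-running zpool status periodically will show the scrub progress and whether any per-disk error counters start climbing again. A simple sketch, using the same pool name, in a Bourne-style shell:

Code:
# print pool status every 5 minutes while the scrub runs (Ctrl-C to stop)
while true; do
    zpool status -v SAN1-DR
    sleep 300
done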
 

weathermon · Cadet · Joined: Dec 15, 2019 · Messages: 2
Hi Chris,

Thank you very much. I'll try what you suggest once we get our UPS replaced.

Cheers, Mike
 