Sudden SCSI errors and pool unavailable

Lebesgue

Hi everyone,
I am running FreeNAS 11.1-U6 on an HPE ProLiant ML350e Gen8 v2 with dual E5-2430L CPUs, 192 GB of HPE RAM, two HP H220 HBAs, one HP H221 and one HP H222.
I have two pools:
- one that has been rock solid for years, using 12 WD Red SATA HDDs in 2 x RAIDZ2, connected to the H221 and H222 adapters
- one using 16 SmrtStor SSDs formatted with 512-byte sectors, configured in 4 x RAIDZ1, connected to the two H220 adapters
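Since the error messages below reference mps driver instances, it may help to note which mpsN belongs to which physical HBA. A quick way to check this on FreeBSD (stock tools only; the exact boot-log wording can vary with driver and firmware version) is:

# List the LSI-based HBAs and the driver instance (mpsN) bound to each PCI device
pciconf -lv | grep -A 3 '^mps'
# Firmware and driver versions as reported at boot
dmesg | grep -E '^mps[0-9]+: Firmware'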

The problems are with the latter pool: after a period without any errors or warnings, I suddenly lose it altogether.

The output from dmesg looks like this:
(da6:mps0:0:13:0): Actual Retry Count: 0
(da6:mps0:0:13:0): Retrying command (per sense data)
(da6:mps0:0:13:0): READ(10). CDB: 28 00 37 fe e8 80 00 01 00 00
(da6:mps0:0:13:0): CAM status: SCSI Status Error
(da6:mps0:0:13:0): SCSI status: Check Condition
(da6:mps0:0:13:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
(da6:mps0:0:13:0): Field Replaceable Unit: 24
(da6:mps0:0:13:0): Actual Retry Count: 0
(da6:mps0:0:13:0): Error 5, Retries exhausted
(da7:mps0:0:14:0): READ(10). CDB: 28 00 00 00 00 80 00 01 00 00
(da7:mps0:0:14:0): CAM status: SCSI Status Error
(da7:mps0:0:14:0): SCSI status: Check Condition
(da7:mps0:0:14:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
(da7:mps0:0:14:0): Field Replaceable Unit: 24
(da7:mps0:0:14:0): Actual Retry Count: 0
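For completeness, counting how many of these medium errors each device has logged gives a feel for whether it is one disk or many (assuming the messages are still in /var/log/messages; adjust the path if the logs have rotated):

# Tally "Medium format corrupted" occurrences per daX device
grep 'Medium format corrupted' /var/log/messages | grep -oE 'da[0-9]+' | sort | uniq -c | sort -rn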

The status after trying a "zpool clear":
pool: SSDVOL
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-JQ
scan: scrub in progress since Sun Dec 16 00:17:42 2018
5.64G scanned at 119K/s, 0 issued at 0/s, 381G total
0 repaired, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
SSDVOL UNAVAIL 0 0 0
  raidz1-0 UNAVAIL 0 0 0
    1393438304880803758 UNAVAIL 0 0 0  was /dev/gptid/6352e0be-dfb3-11e8-b4db-5065f366e21a
    5695038371597712408 UNAVAIL 0 0 0  was /dev/gptid/63d7f0af-dfb3-11e8-b4db-5065f366e21a
    gptid/645ecfac-dfb3-11e8-b4db-5065f366e21a ONLINE 0 0 0
    12334108673341178812 UNAVAIL 0 0 0  was /dev/gptid/64e74eaf-dfb3-11e8-b4db-5065f366e21a
  raidz1-1 UNAVAIL 0 0 0
    gptid/0f37a250-dfb4-11e8-b4db-5065f366e21a ONLINE 0 0 0
    gptid/0fc063f5-dfb4-11e8-b4db-5065f366e21a ONLINE 0 0 0
    5413651162169996821 UNAVAIL 0 0 0  was /dev/gptid/10677993-dfb4-11e8-b4db-5065f366e21a
    12121637861202479481 UNAVAIL 0 0 0  was /dev/gptid/10fb582c-dfb4-11e8-b4db-5065f366e21a
  raidz1-2 UNAVAIL 0 0 0
    16842034515400028476 UNAVAIL 0 0 0  was /dev/gptid/d57ce93f-dfb4-11e8-b4db-5065f366e21a
    1367789756877623967 UNAVAIL 0 0 0  was /dev/gptid/d61893d5-dfb4-11e8-b4db-5065f366e21a
    14228190446825773043 UNAVAIL 0 0 0  was /dev/gptid/d6ba56de-dfb4-11e8-b4db-5065f366e21a
    895488834081205217 UNAVAIL 0 0 0  was /dev/gptid/d780279b-dfb4-11e8-b4db-5065f366e21a
  raidz1-3 DEGRADED 0 0 0
    16763065356661527645 UNAVAIL 0 0 0  was /dev/gptid/663b53ce-dfb5-11e8-b4db-5065f366e21a
    gptid/66e61f6a-dfb5-11e8-b4db-5065f366e21a ONLINE 0 0 0
    gptid/67937323-dfb5-11e8-b4db-5065f366e21a ONLINE 0 0 0
    gptid/686139c3-dfb5-11e8-b4db-5065f366e21a ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
        <metadata>:<0x1d6>
        <metadata>:<0x1e1>
        <metadata>:<0x1e3>
        SSDVOL/esxi-ha-dataset/esxi-ha-zvol:<0x0>
        SSDVOL/esxi-ha-dataset/esxi-ha-zvol:<0x1>
        SSDVOL/esxi-std-dataset/esxi-std-zvol:<0x1>
        SSDVOL/esxi-ha-dataset/esxi-ha-zvol@auto-20181110.1620-5w:<0x1>
        SSDVOL/esxi-ha-dataset/esxi-hqt-zvol@auto-20181110.1620-5w:<0x1>
        SSDVOL/frenas-vm-dataset/mapr/nas-vm-mapr:<0x0>
        SSDVOL/frenas-vm-dataset/mapr/nas-vm-mapr:<0x1>
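As a side note for anyone reading along: the long numeric IDs above are ZFS GUIDs shown where the gptid labels can no longer be resolved. The labels that still resolve can be mapped back to daX device names with glabel (standard FreeBSD; the UNAVAIL members simply do not appear in the list):

# Show which daXpY partition each gptid label belongs to
glabel status | grep gptid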

SMART reports a non-zero "Non-medium error count", whereas the drives otherwise look fine:
Vendor: SanDisk
Product: DOPE0480S5xnNMRI
Revision: 3P03
Compliance: SPC-4
User Capacity: 480,999,997,440 bytes [480 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50011731002bc374
Serial number: FG00CW8N
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Dec 16 14:04:13 2018 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current Drive Temperature: 28 C
Drive Trip Temperature: 63 C
Manufactured in week 11 of year 2014
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 135
defect list format 6 unknown
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      2054.826           0
write:         0        0         0         0          0      5875.177           0

Non-medium error count: 519
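To compare all sixteen SSDs at a glance, a rough Bourne-shell loop like the one below pulls the same fields from each drive (assuming the SSDs enumerate as da0 through da15; adjust the range and device names to your system):

# Summarize health status, grown defects and non-medium errors for every SSD
for i in $(seq 0 15); do
  echo "=== da$i ==="
  smartctl -x /dev/da$i | grep -E 'Health Status|grown defect|Non-medium error'
done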

I have upgraded the BIOS and firmware, replaced the two H220 adapters, re-used mini-SAS cables I had from elsewhere, and have now ordered new mini-SAS cables to rule these out entirely. No errors are reported in iLO, but I do see a "Server power restored" event, although I cannot tell whether it coincides with the SCSI errors and the loss of the pool.
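Two more things I can check from the shell, in case they are relevant (as far as I know sas2flash is bundled with FreeNAS for these LSI-based HBAs, and the grep pattern is just a guess at the likely keywords):

# Confirm the firmware level actually running on each SAS2 HBA
sas2flash -listall
# Look for controller resets or faults around the time of the "Server power restored" event
grep -iE 'mps[0-9].*(reset|fault)' /var/log/messages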
Despite intensive Googling I still do not know:
- what causes the "SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)" error
- what causes the "Non-medium error count"
- whether the two are most likely related

Thanks for taking the time to read this post. I appreciate any feedback and hope for some help.

Rgds. Lebesgue
 

Chris Moore

For the purpose of this discussion, the fact that we are discussing SSDs is not pertinent, so I will just say disk or disks.
This looks to have been a hardware fault in the SAS controller that was running the ten disks that are now listed as UNAVAIL. I was fortunate that once, when I had a controller fail, replacing the controller brought the pool back. Unfortunately, there is this:
Medium format corrupted
That error tends to indicate (to me) that the data on disk was corrupted by the drive controller, because "Medium" refers to the recording medium. I know it is a bit confusing when a word has multiple meanings: "medium" can mean "something in a middle position", which is the first definition in the dictionary, but if you keep reading it also lists "something (such as a magnetic disk) on which information may be stored", and that is how it is used here. Keep in mind that these errors were intended to apply equally to disk and tape, because SAS controllers can also operate tape drives. It may be that the failure has actually damaged the data on disk for all of the disks connected to that controller.
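One thing that may be worth checking before writing the data off entirely, as a sketch only and assuming the disks are still visible to the OS and use the usual FreeNAS partition layout (swap on p1, ZFS data on p2), is whether the ZFS labels on an affected member are still readable:

# Dump the four ZFS labels from one of the UNAVAIL members
zdb -l /dev/da6p2

If the labels are intact, the damage may be limited and the pool might import again once the hardware problem is resolved; if zdb cannot read them, that would support the idea that the controller corrupted what is on the disks.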

There may be other input from more experienced members, but this appears to be one of the types of catastrophic failure that backups are intended to guard against.
 

Lebesgue

The failed disks are connected to two distinct HBAs, both of which have been replaced recently, so although a controller fault is possible, I do not consider it the likely scenario. As stated, I have ordered new mini-SAS cables to rule these out.

I am not concerned about data loss and may rebuild the pool if required: the permanent VMs reside in a datastore that is snapshotted and replicated to the HDD pool, and the relevant files are furthermore synchronized to the cloud using the native FreeNAS feature.
 

Chris Moore

The failed disks are connected to two distinct HBAs, both of which have been replaced recently,
I imagine that if it were investigated closely enough, you might find that the offline disks are the ones that were all connected to one of the two SAS controllers and the ones that are still working were all connected to the other SAS controller. There is no way for me to investigate this remotely. Good luck.
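If you do want to check it, one way (using only stock FreeBSD tools; I am going from memory on the exact output format) is to list which CAM bus, and therefore which mps instance, each da device sits on, and compare that against the daX numbers in your dmesg errors:

# Show every da device together with its scbus, and which mpsN owns that scbus
camcontrol devlist -v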
 

Lebesgue

I am positive it is not a single controller. It is both controllers and/or all the cables, unless it is the individual disks, of course.
Each controller drives 2 x 4 disks, and as 10 disks are affected, this rules out a single controller.
Furthermore, in order to map disks to RAIDZ1 vdevs, I inserted one disk at a time and created one vdev at a time, so raidz1-0 and raidz1-1 are on controller 1, and raidz1-2 and raidz1-3 are on controller 2.
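To double-check the mapping, each controller can also be asked directly what it sees and the reported serial numbers matched against smartctl -i for the disks in each vdev. This is only a sketch, assuming sas2ircu is available (I believe it is bundled with FreeNAS; otherwise it can be fetched from Broadcom) and that the controller indexes start at 0:

# List the SAS2 controllers and their index numbers
sas2ircu LIST
# Show slot and serial number for every drive behind controller 0
sas2ircu 0 DISPLAY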
 