Hi everyone,
running FreeNAS 11.1-U6 on an HPE ProLiant ML350e Gen8 v2: dual E5-2430L, 192 GB HPE RAM, two HP HBA 220 adapters, one HP HBA 221 and one HP HBA 222.
I have two pools:
- one that has been rock solid for years: 12 WD Red SATA HDDs in 2 x RaidZ2, connected to the HBA 221 and HBA 222 adapters
- one with 16 SmrtStor SSDs formatted with 512-byte sectors, configured as 4 x RaidZ1 and connected to the two HBA 220 adapters
The problems are with the latter pool: after a period without any errors or warnings, I suddenly lose it altogether.
The output from dmesg looks like this:
(da6:mps0:0:13:0): Actual Retry Count: 0
(da6:mps0:0:13:0): Retrying command (per sense data)
(da6:mps0:0:13:0): READ(10). CDB: 28 00 37 fe e8 80 00 01 00 00
(da6:mps0:0:13:0): CAM status: SCSI Status Error
(da6:mps0:0:13:0): SCSI status: Check Condition
(da6:mps0:0:13:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
(da6:mps0:0:13:0): Field Replaceable Unit: 24
(da6:mps0:0:13:0): Actual Retry Count: 0
(da6:mps0:0:13:0): Error 5, Retries exhausted
(da7:mps0:0:14:0): READ(10). CDB: 28 00 00 00 00 80 00 01 00 00
(da7:mps0:0:14:0): CAM status: SCSI Status Error
(da7:mps0:0:14:0): SCSI status: Check Condition
(da7:mps0:0:14:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
(da7:mps0:0:14:0): Field Replaceable Unit: 24
(da7:mps0:0:14:0): Actual Retry Count: 0
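In case it helps anyone reading the dmesg lines above: the failing READ(10) CDBs can be decoded by hand to see which LBAs the reads targeted. This is just my own sketch of the standard SBC CDB layout, not anything FreeNAS-specific:

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB string as logged by CAM, e.g.
    '28 00 37 fe e8 80 00 01 00 00' -> (lba, transfer_length_in_blocks)."""
    cdb = [int(b, 16) for b in cdb_hex.split()]
    assert cdb[0] == 0x28, "not a READ(10) opcode"
    lba = (cdb[2] << 24) | (cdb[3] << 16) | (cdb[4] << 8) | cdb[5]  # bytes 2-5: LBA
    length = (cdb[7] << 8) | cdb[8]                                 # bytes 7-8: blocks
    return lba, length

# The two failing commands from the dmesg output above:
print(decode_read10("28 00 37 fe e8 80 00 01 00 00"))  # da6: (939452544, 256)
print(decode_read10("28 00 00 00 00 80 00 01 00 00"))  # da7: (128, 256)
```

Notably, the da7 read fails at LBA 128 with these 512-byte-formatted drives, i.e. only 64 KiB into the disk, so the errors are not clustered in one region.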
The status after trying a "zpool clear":
pool: SSDVOL
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-JQ
scan: scrub in progress since Sun Dec 16 00:17:42 2018
5.64G scanned at 119K/s, 0 issued at 0/s, 381G total
0 repaired, 0.00% done, no estimated completion time
config:
	NAME                                            STATE     READ WRITE CKSUM
	SSDVOL                                          UNAVAIL      0     0     0
	  raidz1-0                                      UNAVAIL      0     0     0
	    1393438304880803758                         UNAVAIL      0     0     0  was /dev/gptid/6352e0be-dfb3-11e8-b4db-5065f366e21a
	    5695038371597712408                         UNAVAIL      0     0     0  was /dev/gptid/63d7f0af-dfb3-11e8-b4db-5065f366e21a
	    gptid/645ecfac-dfb3-11e8-b4db-5065f366e21a  ONLINE       0     0     0
	    12334108673341178812                        UNAVAIL      0     0     0  was /dev/gptid/64e74eaf-dfb3-11e8-b4db-5065f366e21a
	  raidz1-1                                      UNAVAIL      0     0     0
	    gptid/0f37a250-dfb4-11e8-b4db-5065f366e21a  ONLINE       0     0     0
	    gptid/0fc063f5-dfb4-11e8-b4db-5065f366e21a  ONLINE       0     0     0
	    5413651162169996821                         UNAVAIL      0     0     0  was /dev/gptid/10677993-dfb4-11e8-b4db-5065f366e21a
	    12121637861202479481                        UNAVAIL      0     0     0  was /dev/gptid/10fb582c-dfb4-11e8-b4db-5065f366e21a
	  raidz1-2                                      UNAVAIL      0     0     0
	    16842034515400028476                        UNAVAIL      0     0     0  was /dev/gptid/d57ce93f-dfb4-11e8-b4db-5065f366e21a
	    1367789756877623967                         UNAVAIL      0     0     0  was /dev/gptid/d61893d5-dfb4-11e8-b4db-5065f366e21a
	    14228190446825773043                        UNAVAIL      0     0     0  was /dev/gptid/d6ba56de-dfb4-11e8-b4db-5065f366e21a
	    895488834081205217                          UNAVAIL      0     0     0  was /dev/gptid/d780279b-dfb4-11e8-b4db-5065f366e21a
	  raidz1-3                                      DEGRADED     0     0     0
	    16763065356661527645                        UNAVAIL      0     0     0  was /dev/gptid/663b53ce-dfb5-11e8-b4db-5065f366e21a
	    gptid/66e61f6a-dfb5-11e8-b4db-5065f366e21a  ONLINE       0     0     0
	    gptid/67937323-dfb5-11e8-b4db-5065f366e21a  ONLINE       0     0     0
	    gptid/686139c3-dfb5-11e8-b4db-5065f366e21a  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
<metadata>:<0x1d6>
<metadata>:<0x1e1>
<metadata>:<0x1e3>
SSDVOL/esxi-ha-dataset/esxi-ha-zvol:<0x0>
SSDVOL/esxi-ha-dataset/esxi-ha-zvol:<0x1>
SSDVOL/esxi-std-dataset/esxi-std-zvol:<0x1>
SSDVOL/esxi-ha-dataset/esxi-ha-zvol@auto-20181110.1620-5w:<0x1>
SSDVOL/esxi-ha-dataset/esxi-hqt-zvol@auto-20181110.1620-5w:<0x1>
SSDVOL/frenas-vm-dataset/mapr/nas-vm-mapr:<0x0>
SSDVOL/frenas-vm-dataset/mapr/nas-vm-mapr:<0x1>
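As a sanity check on why the pool shows UNAVAIL rather than DEGRADED, I tallied the missing members per vdev (counts transcribed from the status output above); a raidz1 vdev tolerates the loss of exactly one member, and losing any single vdev takes the whole pool down:

```python
# UNAVAIL member counts per raidz1 vdev, transcribed from the zpool status above
unavail = {"raidz1-0": 3, "raidz1-1": 2, "raidz1-2": 4, "raidz1-3": 1}
PARITY = 1  # a raidz1 vdev survives at most one missing member

states = {v: ("ONLINE" if n == 0 else "DEGRADED" if n <= PARITY else "UNAVAIL")
          for v, n in unavail.items()}
print(states)
# Only raidz1-3 is merely DEGRADED; the other three vdevs are UNAVAIL,
# which is why the pool as a whole is UNAVAIL.
```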
SMART reports a non-zero "Non-medium error count", whereas the drives themselves otherwise look fine:
Vendor: SanDisk
Product: DOPE0480S5xnNMRI
Revision: 3P03
Compliance: SPC-4
User Capacity: 480,999,997,440 bytes [480 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50011731002bc374
Serial number: FG00CW8N
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Dec 16 14:04:13 2018 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current Drive Temperature: 28 C
Drive Trip Temperature: 63 C
Manufactured in week 11 of year 2014
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 135
defect list format 6 unknown
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0       2054.826          0
write:         0        0         0         0          0       5875.177          0
Non-medium error count: 519
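Since I want to see whether that counter keeps climbing, I threw together a small parser for the smartctl output (the regex is just my guess at a stable pattern; in practice the text would come from running smartctl against each da device):

```python
import re

def non_medium_errors(smartctl_text):
    """Extract the 'Non-medium error count' value from `smartctl -a` output."""
    m = re.search(r"Non-medium error count:\s*(\d+)", smartctl_text)
    return int(m.group(1)) if m else None

# Example using the line shown above; a real poll would feed in the stdout of
# something like `smartctl -a /dev/da6` via subprocess and log the value.
sample = "Non-medium error count: 519"
print(non_medium_errors(sample))  # 519
```

Logging this per drive every few hours should show whether the count correlates with the pool dropping out.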
I have upgraded the BIOS and firmware, replaced the two HBA 220 adapters, swapped in mini-SAS cables I had from elsewhere, and have now ordered new mini-SAS cables to rule those out entirely. No errors are reported in iLO, but I do see a "Server power restored" event, although I cannot tell whether it coincides with the SCSI errors and the loss of the pool.
Despite intensive Googling I still do not know:
- what causes the "SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)" error
- what causes the "Non-medium error count"
- whether the two are most likely related
Thanks for taking the time to read this post. I appreciate any feedback and hope for some help.
Rgds. Lebesgue