I built a new FreeNAS array this week, and all seemed well until I started a write-heavy operation: a VMware Converter P2V migration from physical hardware into vSphere. The iSCSI datastore is on this new FreeNAS array ("storage3").
System: Dell PowerEdge r310
Intel(R) Xeon(R) CPU X3450 @ 2.67GHz
24GB ECC Registered PC3-8500 DDR3 (Soon to be 56GB)
IBM 46M0907 PCI Express 2.0 x8 SAS Host Bus Adapter for System X - Flashed to IT mode p16
Habey DS-1280 SAS Enclosure 12 bay
8x WD RED 3TB (2 in PowerEdge bays, 6 in Habey Enclosure)
6x Barracuda XT 3TB (All in Habey Enclosure)
2x Barracuda 7200.11 3TB (in PowerEdge Bays)
2 Port Intel NIC for MPIO iSCSI on separate subnets
2 Port onboard NIC, server stack LAGG0
FreeNAS-9.3-STABLE-201509282017
Now, down to the problem
This morning (I started the P2V operation last night) I received a series of email alarms from storage3:
Code:
8:09PM The volume Storage3-A (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
8:10PM The volume Storage3-A (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
9:29PM The volume Storage3-A (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
9:30PM The volume Storage3-A (ZFS) state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
1:08AM The volume Storage3-A (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
1:08AM The volume Storage3-A (ZFS) state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
Upon further investigation I found more information.
The first thing I noticed was some SCSI sense errors in the logs, which repeat quite often. I have read about 10 other threads on this, but my situation seems to differ slightly, so I am documenting it separately. The errors appear only on the (brand new) WD 3TB disks in the Habey enclosure; the two other WD 3TB disks in the Dell r310 chassis are working fine.
Code:
(da7:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
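Since these messages repeat often, it may help to tally them per device to see which disks are resetting. The following is only a rough sketch in Python: the regular expression is an assumption based on the single log line quoted above, the helper name `count_sense_events` is made up for illustration, and the second sample line is simply a repeat of the first.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this post (an assumption).
SENSE_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctl>\w+):\d+:(?P<target>\d+):\d+\): "
    r"SCSI sense: (?P<key>[A-Z ]+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
)


def count_sense_events(lines):
    """Count sense events keyed by (device, sense key, asc, ascq)."""
    counts = Counter()
    for line in lines:
        m = SENSE_RE.search(line)
        if m:
            counts[(m["dev"], m["key"], m["asc"], m["ascq"])] += 1
    return counts


# Illustrative input: the quoted line twice, plus one unrelated line.
SAMPLE_LOG = [
    "(da7:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "(da7:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "some unrelated log line",
]

SENSE_COUNTS = count_sense_events(SAMPLE_LOG)
for (dev, key, asc, ascq), n in SENSE_COUNTS.items():
    print(f"{dev}: {n} x {key} (asc {asc}, ascq {ascq})")
```

Feeding it the real log instead of the sample would show whether the resets are spread evenly across da2 through da7 or concentrated on a few bays.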
This is the output of zpool status. Somehow the array is still functioning, though!
Code:
/var/log# zpool status Storage3-A
  pool: Storage3-A
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 4.46M in 0h0m with 0 errors on Wed Oct 14 13:09:57 2015
config:
        NAME                                            STATE     READ WRITE CKSUM
        Storage3-A                                      DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0    25     0
            gptid/8467b285-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/85d6b049-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/86a1e71f-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/87b75965-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/887a8ed8-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/894a3304-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/8a5b2e59-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   305    15  too many errors
            gptid/8b689e60-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   518     1  too many errors
            gptid/8c7d5e87-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   414    31  too many errors
            gptid/8d8c9b8d-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   396    15  too many errors
            gptid/8e9dfc25-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   378    14  too many errors
            gptid/8fad7082-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   321    15  too many errors
            gptid/90906ac9-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/91604b28-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/92384e9b-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/93069446-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
errors: No known data errors
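To pull out just the members with nonzero counters from output like the above, something along these lines might work. It is a minimal sketch, not a robust parser: `devices_with_errors` is a hypothetical helper, the sample text is an abridged copy of the status output in this post, and rows whose counters are abbreviated (for example with a K suffix) are simply skipped.

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any nonzero counter."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # device rows follow this header
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state, *counters = fields[:5]
        try:
            read, write, cksum = (int(c) for c in counters)
        except ValueError:
            continue  # not a device row, or counters are abbreviated
        if read or write or cksum:
            rows.append((name, state, read, write, cksum))
    return rows


# Abridged copy of the zpool status output quoted in this post.
SAMPLE_STATUS = """\
  pool: Storage3-A
 state: DEGRADED
config:
        NAME                                            STATE     READ WRITE CKSUM
        Storage3-A                                      DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0    25     0
            gptid/894a3304-71f5-11e5-8731-001517fd0d2a  ONLINE       0     0     0
            gptid/8a5b2e59-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   305    15  too many errors
            gptid/8b689e60-71f5-11e5-8731-001517fd0d2a  DEGRADED     0   518     1  too many errors
errors: No known data errors
"""

ERROR_ROWS = devices_with_errors(SAMPLE_STATUS)
for name, state, read, write, cksum in ERROR_ROWS:
    print(f"{name}: {state} read={read} write={write} cksum={cksum}")
```

Note that the errors here are almost entirely write errors plus some checksum errors, with no read errors on any member.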
I checked these disk IDs, and the 6 disks shown as degraded in the GUI are all WD RED 3TB disks (brand new) in the Habey enclosure. The two other WD RED 3TB disks, in the Dell r310 chassis and connected to the Intel 3400 controller, are not showing any errors. After seeing this I began checking the SMART status of all the drives. Two of the drives report marginal SMART attributes (overall health still PASSED), but they are not the ones showing degraded in zpool status:
Code:
smartctl -H /dev/da9
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p26 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   045   045    Old_age   Always   In_the_past 39 (Min/Max 19/40)
Code:
smartctl -H /dev/da11
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p26 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   062   045   045    Old_age   Always   In_the_past 38 (Min/Max 19/39)
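For reading these marginal-attribute rows, a small parser can make it clearer what is being flagged. This is a sketch only: `marginal_attributes` and `SMART_COLUMNS` are names invented for the example, the column order is taken from the header line in the output above, and the sample is the da9 row from this post. WHEN_FAILED reading In_the_past with WORST equal to THRESH indicates the attribute touched its threshold at some earlier point, not that it is failing now.

```python
SMART_COLUMNS = ["id", "name", "flag", "value", "worst", "thresh",
                 "type", "updated", "when_failed", "raw_value"]


def marginal_attributes(smartctl_output):
    """Parse the attribute rows that follow the 'ID#' header line."""
    rows = []
    in_table = False
    for line in smartctl_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("ID#"):
            in_table = True
            continue
        if not in_table:
            continue
        # The raw value may contain spaces, so cap the number of splits.
        fields = stripped.split(None, len(SMART_COLUMNS) - 1)
        if len(fields) < len(SMART_COLUMNS) or not fields[0].isdigit():
            continue
        row = dict(zip(SMART_COLUMNS, fields))
        for key in ("id", "value", "worst", "thresh"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# The da9 output quoted in this post, abridged to the relevant lines.
SAMPLE_SMART = """\
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   045   045    Old_age   Always   In_the_past 39 (Min/Max 19/40)
"""

MARGINAL = marginal_attributes(SAMPLE_SMART)
for attr in MARGINAL:
    print(f"{attr['name']}: value={attr['value']} worst={attr['worst']} "
          f"thresh={attr['thresh']} when_failed={attr['when_failed']}")
```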
I am running IT mode p16 firmware on the 9211-8i-based IBM 46M0907, although I have gotten this alert from FreeNAS on this and other arrays:
Code:
Firmware version 16 does not match driver version 20 for /dev/mps0
Code:
/var/log# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved
Adapter Selected is a LSI SAS: SAS2008(B2)
Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------
0     SAS2008(B2)     16.00.00.00   10.00.00.06   07.31.00.00      00:05:00:00
Finished Processing Commands Successfully.
Exiting SAS2Flash.
Other diagnostic information:
Code:
ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW2 (1/0/16)
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2a
        inet 10.0.3.29 netmask 0xffffff00 broadcast 10.0.3.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        description: connected to ISCSI-SW1 (1/0/16)
        options=4019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 00:15:17:fd:0d:2b
        inet 10.0.4.29 netmask 0xffffff00 broadcast 10.0.4.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (1/0/8)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: connected to ServerStack (2/0/40)
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
ipfw0: flags=8801<UP,SIMPLEX,MULTICAST> metric 0 mtu 65536
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:0a:4f:9a
        inet 10.0.50.45 netmask 0xfffffe00 broadcast 10.0.51.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: bce1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: bce0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
Code:
pciconf -lv
hostb0@pci0:0:0:0: class=0x060000 card=0x02a31028 chip=0xd1308086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor DMI'
    class      = bridge
    subclass   = HOST-PCI
pcib1@pci0:0:3:0: class=0x060400 card=0x02a31028 chip=0xd1388086 rev=0x11 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = 'Core Processor PCI Express Root Port 1'
    class      = bridge
    subclass   = PCI-PCI
pcib2@pci0:0:5:0: class=0x060400 card=0x02a31028 chip=0xd13a8086 rev=0x11 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = 'Core Processor PCI Express Root Port 3'
    class      = bridge
    subclass   = PCI-PCI
none0@pci0:0:8:0: class=0x088000 card=0x00000000 chip=0xd1558086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor System Management Registers'
    class      = base peripheral
none1@pci0:0:8:1: class=0x088000 card=0x00000000 chip=0xd1568086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor Semaphore and Scratchpad Registers'
    class      = base peripheral
none2@pci0:0:8:2: class=0x088000 card=0x00000000 chip=0xd1578086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor System Control and Status Registers'
    class      = base peripheral
none3@pci0:0:8:3: class=0x088000 card=0x00000000 chip=0xd1588086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor Miscellaneous Registers'
    class      = base peripheral
none4@pci0:0:16:0: class=0x088000 card=0x00000000 chip=0xd1508086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor QPI Link'
    class      = base peripheral
none5@pci0:0:16:1: class=0x088000 card=0x00000000 chip=0xd1518086 rev=0x11 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Core Processor QPI Routing and Protocol Registers'
    class      = base peripheral
ehci0@pci0:0:26:0: class=0x0c0320 card=0x02a31028 chip=0x3b3c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset USB2 Enhanced Host Controller'
    class      = serial bus
    subclass   = USB
pcib3@pci0:0:28:0: class=0x060400 card=0x02a31028 chip=0x3b428086 rev=0x05 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset PCI Express Root Port 1'
    class      = bridge
    subclass   = PCI-PCI
pcib4@pci0:0:28:4: class=0x060400 card=0x02a31028 chip=0x3b4a8086 rev=0x05 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset PCI Express Root Port 5'
    class      = bridge
    subclass   = PCI-PCI
ehci1@pci0:0:29:0: class=0x0c0320 card=0x02a31028 chip=0x3b348086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset USB2 Enhanced Host Controller'
    class      = serial bus
    subclass   = USB
pcib5@pci0:0:30:0: class=0x060401 card=0x02a31028 chip=0x244e8086 rev=0xa5 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = '82801 PCI Bridge'
    class      = bridge
    subclass   = PCI-PCI
isab0@pci0:0:31:0: class=0x060100 card=0x02a31028 chip=0x3b148086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '3400 Series Chipset LPC Interface Controller'
    class      = bridge
    subclass   = PCI-ISA
atapci0@pci0:0:31:2: class=0x01018f card=0x02a31028 chip=0x3b208086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset 4 port SATA IDE Controller'
    class      = mass storage
    subclass   = ATA
atapci1@pci0:0:31:5: class=0x010185 card=0x02a31028 chip=0x3b268086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '5 Series/3400 Series Chipset 2 port SATA IDE Controller'
    class      = mass storage
    subclass   = ATA
em0@pci0:4:0:0: class=0x020000 card=0x125e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82571EB Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
em1@pci0:4:0:1: class=0x020000 card=0x125e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82571EB Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
mps0@pci0:5:0:0: class=0x010700 card=0x30201000 chip=0x00721000 rev=0x03 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
    class      = mass storage
    subclass   = SAS
bce0@pci0:2:0:0: class=0x020000 card=0x02a31028 chip=0x163b14e4 rev=0x20 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme II BCM5716 Gigabit Ethernet'
    class      = network
    subclass   = ethernet
bce1@pci0:2:0:1: class=0x020000 card=0x02a31028 chip=0x163b14e4 rev=0x20 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme II BCM5716 Gigabit Ethernet'
    class      = network
    subclass   = ethernet
vgapci0@pci0:1:3:0: class=0x030000 card=0x02a31028 chip=0x0532102b rev=0x0a hdr=0x00
    vendor     = 'Matrox Graphics, Inc.'
    device     = 'MGA G200eW WPCM450'
    class      = display
    subclass   = VGA
storage3 dmesg: http://pastebin.com/TBcnh9uz
Code:
camcontrol devlist
<ATA ST33000651AS CC44>            at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST33000651AS CC44>            at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 8 lun 0 (pass2,da2)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 9 lun 0 (pass3,da3)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 10 lun 0 (pass4,da4)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 11 lun 0 (pass5,da5)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 12 lun 0 (pass6,da6)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 13 lun 0 (pass7,da7)
<ATA ST33000651AS CC44>            at scbus0 target 14 lun 0 (pass8,da8)
<ATA ST33000651AS CC44>            at scbus0 target 15 lun 0 (pass9,da9)
<ATA ST33000651AS CC44>            at scbus0 target 16 lun 0 (pass10,da10)
<ATA ST33000651AS CC44>            at scbus0 target 17 lun 0 (pass11,da11)
<ST3000DM001-1CH166 CC29>          at scbus1 target 0 lun 0 (pass12,ada0)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus1 target 1 lun 0 (pass13,ada1)
<ST3000DM001-1ER166 CC25>          at scbus2 target 0 lun 0 (pass14,ada2)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus2 target 1 lun 0 (pass15,ada3)
<TEAC DVD-ROM DV-28SW R.2A>        at scbus3 target 0 lun 0 (pass16,cd0)
<Kingston DataTraveler 2.0 PMAP>   at scbus6 target 0 lun 0 (pass17,da12)
<Kingston DataTraveler 2.0 PMAP>   at scbus7 target 0 lun 0 (pass18,da13)
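To cross-reference a device named in a log message (such as da7 at target 13) against the drive model, the devlist output can be turned into a lookup table. The sketch below is illustrative only: `parse_devlist` and `models_for` are hypothetical helpers, the regular expression is an assumption based on the line layout in this post, and the sample contains three of the lines from the listing above.

```python
import re

# Layout assumed from the devlist lines in this post.
DEVLIST_RE = re.compile(
    r"<(?P<model>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<names>[^)]+)\)"
)


def parse_devlist(text):
    """Map each disk device name (e.g. 'da7') to its model, bus, and target."""
    devices = {}
    for line in text.splitlines():
        m = DEVLIST_RE.search(line)
        if not m:
            continue
        names = m["names"].split(",")
        # Prefer the peripheral name that is not the passthrough device.
        disk = next((n for n in names if not n.startswith("pass")), names[0])
        devices[disk] = {
            "model": " ".join(m["model"].split()),
            "bus": int(m["bus"]),
            "target": int(m["target"]),
        }
    return devices


def models_for(devices, names):
    """Return {device name: model} for the names present in the table."""
    return {n: devices[n]["model"] for n in names if n in devices}


# Three lines taken from the devlist output in this post.
SAMPLE_DEVLIST = """\
<ATA ST33000651AS CC44>            at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 13 lun 0 (pass7,da7)
<ATA ST33000651AS CC44>            at scbus0 target 14 lun 0 (pass8,da8)
"""

DEVICES = parse_devlist(SAMPLE_DEVLIST)
print(models_for(DEVICES, ["da7"]))
```

Run against the full listing, this would confirm that da2 through da7 (targets 8 to 13) are the WD30EFRX drives, and that da9 and da11, the two with marginal SMART attributes, are ST33000651AS drives.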
Bottom line: any theories about what is going on, or suggestions for what to try first? This enclosure was in service for years attached to a RAID HBA on Linux and worked fine, but then again, the array appears to be working even in its current state. The most interesting detail to me is that only the WD RED 3TB disks are showing the SCSI sense error, even though the Barracuda XT disks are in the same enclosure! I am not against buying a new enclosure, since this one was already used, but I need to exhaust other troubleshooting options before spending that kind of money.
These are the steps that I can think of to start with:
- Replace SFF-8087 cable
- Flash controller to p20 IT firmware
- Reseat drives
- Remove drives that might be problematic and resilver, then see if the issues go away