SOLVED Serious problem with Western Digital Red 3TB WD30EFRX

Status
Not open for further replies.

Satam

Dabbler
Joined
Jan 23, 2014
Messages
40
Hello,

I finally set up my FreeNAS but I ran straight into a difficult hardware error. It started when one WD Red fell flat during writing to the pool. FreeBSD put it offline so I removed it while the server was still running. I checked its SMART values on another computer and they PASSED. Then I reconnected it to the server using a different SATA cable, because I suspected a bad cable, but poof! all WD Reds were sent to sleep as soon as I connected the SATA cable. All drives are brand new. I haven't checked their SMART values yet, but I bet they're all fine.

My hardware is as the following:

Supermicro X10SLH-F (C226, Haswell)
Intel Xeon E3 1230v3
bequiet! Straight Power E9 580W
4x Western Digital Red 3TB WD30EFRX
1x Seagate Barracuda 3TB ST3000DM001
1x Western Digital Black 1TB WD1001FALS (it's going to be replaced with a 3TB drive)
The cables I used where the ones that came with the Supermicro board.

The drives are connected to the 6 SATA ports in that order as seen in the list. It was the second drive which fell out of the RAID at first. I checked the message log and this is the error:

Code:
Feb 13 23:50:43 freenas kernel: ahcich1: Timeout on slot 3 port 0
Feb 13 23:50:43 freenas kernel: ahcich1: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd c0 serr 00000000 cmd 0004c317
Feb 13 23:50:43 freenas kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:50:43 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:50:43 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:51:14 freenas kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 13 23:51:44 freenas kernel: ahcich1: Timeout on slot 3 port 0
Feb 13 23:51:44 freenas kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00000008 tfd 150 serr 00000000 cmd 0004c317
Feb 13 23:51:44 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:51:44 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:51:44 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
Feb 13 23:54:47 freenas kernel: ahcich1: Timeout on slot 29 port 0
Feb 13 23:54:47 freenas kernel: ahcich1: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17
Feb 13 23:54:47 freenas kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:54:47 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:54:47 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:55:19 freenas kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 13 23:55:49 freenas kernel: ahcich1: Timeout on slot 29 port 0
Feb 13 23:55:49 freenas kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 20000000 tfd 150 serr 00000000 cmd 0004dd17
Feb 13 23:55:49 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:55:49 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:55:49 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
Feb 13 23:58:34 freenas kernel: ahcich1: Timeout on slot 13 port 0
Feb 13 23:58:34 freenas kernel: ahcich1: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd c0 serr 00000000 cmd 0004cd17
Feb 13 23:58:34 freenas kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:58:34 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:58:34 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:06 freenas kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 13 23:59:36 freenas kernel: ahcich1: Timeout on slot 13 port 0
Feb 13 23:59:36 freenas kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00002000 tfd 150 serr 00000000 cmd 0004cd17
Feb 13 23:59:36 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Feb 13 23:59:36 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Feb 13 23:59:36 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
Feb 13 23:59:45 freenas kernel: ahcich1: Timeout on slot 0 port 0
Feb 13 23:59:45 freenas kernel: ahcich1: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd 10c1 serr 00000000 cmd 0004c017
Feb 13 23:59:45 freenas kernel: ahcich1: Error while READ LOG EXT
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 10 53 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 50 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 58 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 55 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 51 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 10 54 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 57 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 d0 55 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 50 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 d0 52 6f 40 0c 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 13 23:59:45 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 13 23:59:57 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 bb 6f 40 0c 00 00 00 00 00
Feb 13 23:59:57 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 13 23:59:57 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
Feb 13 23:59:57 freenas kernel: (ada1:ahcich1:0:0:0): RES: 41 10 90 bb 6f 40 0c 00 00 00 00
Feb 13 23:59:57 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: ahcich1: Timeout on slot 28 port 0
Feb 14 00:00:07 freenas kernel: ahcich1: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 10c1 serr 00000000 cmd 0004dc17
Feb 14 00:00:07 freenas kernel: ahcich1: Error while READ LOG EXT
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 10 1a 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 1b 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 19 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 20 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 1d 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 d0 1d 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 10 1c 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 d0 1e 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 90 21 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 1f 70 40 0c 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:00:07 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:02:29 freenas kernel: ahcich1: Timeout on slot 21 port 0
Feb 14 00:02:29 freenas kernel: ahcich1: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 10c1 serr 00000000 cmd 0004d517
Feb 14 00:02:29 freenas kernel: ahcich1: Error while READ LOG EXT
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 50 1c c2 40 0c 00 00 00 00 00
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 a0 1c c2 40 0c 00 00 00 00 00
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:02:29 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:04:02 freenas kernel: ahcich1: Timeout on slot 3 port 0
Feb 14 00:04:02 freenas kernel: ahcich1: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd c0 serr 00000000 cmd 0004c317
Feb 14 00:04:02 freenas kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 14 00:04:02 freenas kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 14 00:04:02 freenas kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 14 00:04:34 freenas kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 14 00:04:58 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Feb 14 00:04:58 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: Unconditionally Re-queue Request
Feb 14 00:04:58 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
Feb 14 00:04:58 freenas kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb 14 00:04:58 freenas kernel: ada1: <WDC WD30EFRX-68EUZN0 80.00A80> s/n WD-WCC4N0581357 detached
Feb 14 00:04:58 freenas kernel: GEOM_ELI: Crypto WRITE request failed (error=6). gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[WRITE(offset=103467245568, length=4096)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[READ(offset=99794751488, length=12288)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[READ(offset=270336, length=8192)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[READ(offset=2998444761088, length=8192)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[READ(offset=2998445023232, length=8192)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: Crypto WRITE request failed (error=6). gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli[WRITE(offset=286384291840, length=4096)]
Feb 14 00:04:58 freenas kernel: GEOM_ELI: Device gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli destroyed.
Feb 14 00:04:58 freenas kernel: GEOM_ELI: Detached gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli on last close.
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Feb 14 00:05:03 freenas kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted


I have no idea what those errors mean. I don't think it is reallocating sectors, because those didn't show in the smart log.

Actually I suspect the WD Black to be the culprit... I was hesitant to use this drive because it gave me some trouble on my desktop before. When this drive was connected to the onboard controller, hotswapping of any drive resulted in drive resets, it didn't even have to be the WD Black being hotswapped. Just like the thing that happened when I reconnected drive #2 in my server. The WD Black didn't even reset, it was all the WD Reds that went offline. It was a daring move to use the WD Black to begin with but I hoped for the problem not to occur with the C226 chipset. Yet this is just a wild guess. The reason for these errors might be something different completely. Maybe the SATA cables Supermicro gave me aren't so super after all.

Well, this is an ongoing "investigation", after all I want to get this NAS running. But if anyone has some ideas in the meantime, I would appreciate it. Maybe somone had some trouble with the WD Black WD1001FALS? Or the WD Red WD30EFRX? Or the Supermicro X10SLH-F? Or even the bequiet! Straight Power E9 580W? Every single bit except the WD Black is brand-new, and even that drive gave me some trouble. So actually it could be anything. Maybe even that flimsy Fractal Design case I shouldn't have bought.

Anyway here's a smartctl output of the first drive that went offline, drive #2. Yeah I tried short self-tests, but smartctl left me with a prompt and I wondered how the heck is it going to notify me that the self-test finished, so I aborted it. The second time it seems to have been aborted by issuing the smartctl -a command. Weird stuff.

Code:
smartctl 6.2 2013-07-26 r3841 [x86_64-w64-mingw32-win7-sp1] (sf-6.2-1)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Red (AF)
Device Model:    WDC WD30EFRX-68EUZN0
Firmware Version: 80.00A80
User Capacity:    3.000.592.982.016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Feb 14 02:45:20 2014 WEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (  41)    The self-test routine was interrupted
                    by the host with a hard or soft reset.
Total time to complete Offline
data collection:        (39360) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 395) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  100  253  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  213  213  021    Pre-fail  Always      -      4316
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      6
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      5
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      5
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      4
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      6
194 Temperature_Celsius    0x0022  118  115  000    Old_age  Always      -      32
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive      Interrupted (host reset)      90%        4        -
# 2  Short offline      Aborted by host              40%        4        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


PS: Oh, btw. The pool is still saveable, right? I mean 4 drives were "destroyed", but the data on them should be fine. I mean it would be quite premature for ZFS to give up on a pool just because the drives dropped out and nothing else. The actual data on the drives is most probably fine, the drives are most probably fine (by themselves but not all together).
 

Satam

Dabbler
Joined
Jan 23, 2014
Messages
40
After a reboot the pool and all disks went online just fine. I copied some more data into one of its datasets and luckily no more of those syslog kernel messages appeared.

Then I did a scrub run because for disk #2 it showed 7 checksum errors. Right after the scrub started that number went to "30 (repairing)"... ok, fine when that drive first went offline I didn't stop the copying process right away because I didn't notice it. The pool was still writeable...

Now after a while into the scrub process that number however went up to over 1 million errors! That's ~1 million blocks, let's assume the biggest record size since the data was large files, that's about 128 GB of data, maybe less, is it? That's not unrealistic in my scenario as I didn't notice the initial errors and the drive going offline until the copy job was done.

Here's the zpool status:

Code:
  pool: tank0
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 14 14:06:51 2014
        955G scanned out of 4.35T at 355M/s, 2h48m to go
        33.3G repaired, 21.42% done
config:
 
    NAME                                                STATE    READ WRITE CKSUM
    tank0                                              ONLINE      0    0    0
      raidz2-0                                          ONLINE      0    0    0
        gptid/78bbe0c3-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0    0
        gptid/7978e1a0-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0 1.07M  (repairing)
        gptid/7a3549f8-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0    0
        gptid/7af42e1a-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0    0
        gptid/7b5d5c12-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0    0
        gptid/7bbe8a3c-94e3-11e3-9d3d-002590dc48b0.eli  ONLINE      0    0    0
 
errors: No known data errors


I hope that's nothing serious... ;) anyway the drive seems to be fine still. Smart reports no error, syslog is clear. Here's the smartctl output anyway:
Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Red (AF)
Device Model:    WDC WD30EFRX-68EUZN0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Feb 14 14:52:12 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (39360) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 395) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      1
  3 Spin_Up_Time            0x0027  199  199  021    Pre-fail  Always      -      5008
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      8
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  196  191  000    Old_age  Always      -      67
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      17
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      7
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      6
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      11
194 Temperature_Celsius    0x0022  114  114  000    Old_age  Always      -      36
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive      Interrupted (host reset)      90%        4        -
# 2  Short offline      Aborted by host              40%        4        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 


Could it be that the checksum errors are still a remnant of the desaster from earlier and will be recovered once the scrub run is over? (It's still running as I type...) Or is it an ongoing problem still at work? I suspect that if there's still something wrong it might be the Supermicro board or the RAM. But I did a memtest86+ run before setting up FreeNAS and the RAM was fine.

PS: Oh, how does FreeBSD inform the user when the RAM's ECC catches an (correctable/uncorrectable) error? I guess I'll get notified by the BMC or UEFI at the next reboot, but that's quite useless when the errors occur during runtime and FreeNAS does nothing about it as a consequence.
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
There is no complete smartest result on your smartctl outputs. Run a long smart test, and let it finish (several hours, something like 6hours with 3TB drive). Then post the result.
 

Satam

Dabbler
Joined
Jan 23, 2014
Messages
40
Quick update: The problem seems to be gone and no data was lost in the process, what was lost was a lot of time I spent worrying. I guess in the end it really was a faulty cable. Now it's really a shame that Supermicro packages their motherboards with such cheap ass unreliable cables. It would have been more responsible to just omit the cables.

Lesson learned: For serious stuff use proven cables.
Another lessen learned: Without ZFS redundancy the data I had copied would have been gone, not that it would have mattered much, but in principle it was a nice thing to be protected from data loss like that. Actually I'm thankful for this experience. If you don't do stupid things hardware failures are a lot less painful with ZFS. With my first experience with ZFS it had already done what it's supposed to. ;)
 
Status
Not open for further replies.
Top