Resilver issues

joriz · Apr 8, 2020

Hi All,

I'm using 3 fileservers running FreeNAS. Some time ago one of the servers was showing a degraded pool, so I replaced the faulty disk and started the resilver process.
During the resilver process the status of the new disk changed to FAULTED and the pool was still showing DEGRADED. In the pool status the disk showed 55 write errors; read and checksum errors were both 0.

Things I've tested:

Connected the new disk to a Windows machine and ran a surface scan --> all OK
Tried to resilver it again after the surface scan --> failed
Inspected the disk port and the port of the tray --> looks ok

I have now moved the disk to a different tray in the server and want to try the resilver again, but I don't know which line I have to pick to start the replace/resilver.
Because I have tried to resilver it several times, the pool status now shows multiple lines for it.

System specs:
FreeNAS-11.2-U7
2 x SAS2008 HBA
8 x 6TB WD RED (2 x Raidz2 pools)
Chassis with 16 port backplane
Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (8 cores)
64GB memory
Supermicro mainboard.

The faulty device name is da7.

Errors I see in the logs during the resilver process:

Code:

Apr  6 20:53:50  ZFS: vdev state changed, pool_guid=1085976887892895734 vdev_guid=16393062259941390275
Apr  6 20:54:04  GEOM_MIRROR: Cannot open consumer da6p1 (error=1).
Apr  6 20:54:04  GEOM_MIRROR: Device swap0 destroyed.
Apr  6 22:51:01  (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 50 52 da 50 00 00 00 20 00 00
Apr  6 22:51:01  (da7:mps1:0:17:0): CAM status: SCSI Status Error
Apr  6 22:51:01  (da7:mps1:0:17:0): SCSI status: Check Condition
Apr  6 22:51:01  (da7:mps1:0:17:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Apr  6 22:51:01  (da7:mps1:0:17:0): Info: 0x25052da50
Apr  6 22:51:01  (da7:mps1:0:17:0): Error 22, Unretryable error
Apr  6 22:51:10      (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 1038 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
Apr  6 22:51:10      (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 186 terminated ioc 804b l(da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00
Apr  6 22:51:10  oginfo 31080000 scsi 0 state 0 xfer 0
Apr  6 22:51:10  (da7:mps1:0:17:0): CAM status: CCB request completed with an error
Apr  6 22:51:10  (da7:mps1:0:17:0): Retrying command
Apr  6 22:51:10  (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00
Apr  6 22:51:10  (da7:mps1:0:17:0): CAM status: CCB request completed with an error
Apr  6 22:51:10  (da7:mps1:0:17:0): Retrying command
Apr  6 22:51:10  (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00
Apr  6 22:51:10  (da7:mps1:0:17:0): CAM status: SCSI Status Error
Apr  6 22:51:10  (da7:mps1:0:17:0): SCSI status: Check Condition
Apr  6 22:51:10  (da7:mps1:0:17:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Apr  6 22:51:10  (da7:mps1:0:17:0): Info: 0x400290
Apr  6 22:51:10  (da7:mps1:0:17:0): Error 22, Unretryable error
Apr  6 22:51:18  (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00
Apr  6 22:51:18  (da7:mps1:0:17:0): CAM status: SCSI Status Error
Apr  6 22:51:18  (da7:mps1:0:17:0): SCSI status: Check Condition
Apr  6 22:51:18  (da7:mps1:0:17:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Apr  6 22:51:18  (da7:mps1:0:17:0): Info: 0x2baa0f290
Apr  6 22:51:18  (da7:mps1:0:17:0): Error 22, Unretryable error
Apr  6 22:51:26  (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 01 34 43 d8 c8 00 00 00 58 00 00
Apr  6 22:51:26  (da7:mps1:0:17:0): CAM status: SCSI Status Error
Apr  6 22:51:26  (da7:mps1:0:17:0): SCSI status: Check Condition
Apr  6 22:51:26  (da7:mps1:0:17:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Apr  6 22:51:26  (da7:mps1:0:17:0): Info: 0x13443d8c8
Apr  6 22:51:26  (da7:mps1:0:17:0): Error 22, Unretryable error
Apr  6 22:51:29  ZFS: vdev state changed, pool_guid=1085976887892895734 vdev_guid=16393062259941390275
Apr  6 23:05:46  (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 45 df 61 c8 00 00 00 50 00 00
Apr  6 23:05:46  (da7:mps1:0:17:0): CAM status: SCSI Status Error
Apr  6 23:05:46  (da7:mps1:0:17:0): SCSI status: Check Condition
Apr  6 23:05:46  (da7:mps1:0:17:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Apr  6 23:05:46  (da7:mps1:0:17:0): Info: 0x245df61c8
Apr  6 23:05:46  (da7:mps1:0:17:0): Error 22, Unretryable error
Apr  6 23:05:55      (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00 length 8192 SMID 340 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
Apr  6 23:05:55      (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 887 terminated ioc 804b l(da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00
Apr  6 23:05:55  oginfo 31080000 scsi 0 state 0 xfer 0
Apr  6 23:05:55  (da7:mps1:0:17:0): CAM status: CCB request completed with an error
Apr  6 23:05:55  (da7:mps1:0:17:0): Retrying command
Apr  6 23:05:55      (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 668 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
Apr  6 23:05:55      (da7:mps1:0:17:0): WRITE(16). CDB: 8a 00 00 00 00 02 45 df 63 20 00 00 00 f8 00 00 length 126976 SMID 514 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
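To read the log above it helps to decode the CDB bytes. This is a minimal Python sketch, assuming the standard SCSI WRITE(10) (opcode 0x2a) and WRITE(16) (opcode 0x8a) layouts, that extracts the starting LBA and transfer length; `decode_write_cdb` is a hypothetical helper, not a FreeNAS tool. For the entries in this log, the decoded LBA equals the `Info:` value the drive sent back, so the drive is rejecting the exact block ZFS asked it to write. The sketch also assumes 512-byte logical sectors, which is consistent with the `length 8192` (16 blocks) and `length 126976` (248 blocks) values in the mps messages. It only decodes the request; it says nothing about why the drive rejected it.

```python
from typing import Tuple

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_write_cdb(cdb_hex: str) -> Tuple[str, int, int]:
    """Return (command name, starting LBA, transfer length in blocks) for a WRITE(10)/WRITE(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are ignored
    opcode = cdb[0]
    if opcode == 0x8A and len(cdb) == 16:
        # WRITE(16): bytes 2-9 = LBA, bytes 10-13 = transfer length (big-endian)
        return "WRITE(16)", int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    if opcode == 0x2A and len(cdb) == 10:
        # WRITE(10): bytes 2-5 = LBA, bytes 7-8 = transfer length (big-endian)
        return "WRITE(10)", int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    raise ValueError(f"unsupported CDB: opcode 0x{opcode:02x}, {len(cdb)} bytes")


# CDB / "Info:" pairs copied from the log
samples = [
    ("8a 00 00 00 00 02 50 52 da 50 00 00 00 20 00 00", 0x25052DA50),
    ("2a 00 00 40 02 90 00 00 10 00", 0x400290),
    ("8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00", 0x2BAA0F290),
    ("8a 00 00 00 00 01 34 43 d8 c8 00 00 00 58 00 00", 0x13443D8C8),
]

for cdb_hex, info in samples:
    name, lba, blocks = decode_write_cdb(cdb_hex)
    print(f"{name}: LBA=0x{lba:x}, {blocks} blocks ({blocks * SECTOR_SIZE} bytes), matches Info: {lba == info}")
```

The same decoding applied to the WRITE(16) with `length 126976` gives 0xf8 = 248 blocks, i.e. 248 x 512 bytes.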

Ddog 800 · Apr 12, 2020

OK, so it's not just me! I also just had a drive failure and also just replaced it. During resilver, I keep getting this exact same error and the resilver restarts. Over and over again. Each time, it gets to around 30-35% before failing. I'm quite at a loss. I actually ordered two drives and they both do the exact same thing. I've tried swapping cables, etc. Still the same thing.

I won't even post my dmesg output or anything because it's the exact same thing you've got right here, as well as the write errors in the GUI/zpool status.

I'm also running a SAS2008 HBA (LSI 9211-8i in my case) and 8x 3TB WD Red drives.

Not sure if it matters, but all of the old drives are WD30EFRX-68EUZN0 while both of the new drives are WD30EFAX-68JH4N0. I'm really starting to run out of ideas here.

Ddog 800 · Apr 12, 2020

Just to add some extra detail, here are my full specs:

FreeNAS 11.3 U1
1x LSI 9211-8i (SAS2008 chipset)
6x WD Red 3TB
2x Cable Matters SFF-8087 -> 4x SATA breakout cable
Intel Core i7-4790 @ 3.60GHz
Asus Z97-C
32GB memory

radovan · Apr 13, 2020

Hello,
you've just run into crappy hard drive trouble. Welcome to the club. I have 5 WD Red 4TB EFAX drives with the same problem.

More is documented here:

WD RED EFRX / EFAX (www.ixsystems.com)

I have a out of warranty 6tb WD red EFRX that I need to replace and I'm having trouble finding the same model. Does anyone has experience with the new WD red EFAX? do they have the same performance ? (It looks like they might be SMR drives) What would you recommend as a replacement disk? I could...

WD40EFAX drives - IDNF when resilvering ZFS array (community.wd.com)

I’ve just purchased 3 WD REDs to replace aging drives in a ZFS array ALL THREE are failing during resilvering with IDNF (sector ID not found) errors: Here’s a typical example After command completion occurred, registers were: ER – ST COUNT LBA_48 LH LM LL DV DC – – – == – == == == – – – –...

It would perhaps be a good idea if FreeNAS stopped recommending WD Reds on their website (https://www.freenas.org/hardware-requirements/).

nokia88 · Apr 13, 2020

Thank you radovan.
Just checked the part number with smartctl: WD60EFAX
Shame on you, Western Digital, for selling a NAS drive like this...
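The part-number check can be scripted. This is a minimal Python sketch, assuming you have saved the text output of `smartctl -i` for the disk (for example `smartctl -i /dev/da7`); it reads the `Device Model:` line and flags the model numbers named in this thread. `device_model`, `is_flagged_model` and `FLAGGED_MODELS` are hypothetical names, the sample text is a hypothetical excerpt, and the list only contains drives mentioned in this thread, so it is not a complete list of SMR drives.

```python
import re
from typing import Optional

# Hypothetical list: only the EFAX models reported in this thread as failing to resilver.
FLAGGED_MODELS = ("WD30EFAX", "WD40EFAX", "WD60EFAX")


def device_model(smartctl_info: str) -> Optional[str]:
    """Return the value of the 'Device Model:' line in `smartctl -i` text, or None if absent."""
    match = re.search(r"^Device Model:\s*(.+?)\s*$", smartctl_info, re.MULTILINE)
    return match.group(1) if match else None


def is_flagged_model(smartctl_info: str) -> bool:
    """True if the reported model contains one of the flagged part numbers."""
    model = device_model(smartctl_info)
    return model is not None and any(part in model for part in FLAGGED_MODELS)


# Hypothetical excerpts using the part numbers quoted in this thread
new_drive = "Device Model:     WDC WD30EFAX-68JH4N0\n"
old_drive = "Device Model:     WDC WD30EFRX-68EUZN0\n"

for text in (new_drive, old_drive):
    print(device_model(text), "flagged:", is_flagged_model(text))
```

Matching on a substring of the model keeps the check independent of the suffix after the dash and of any vendor prefix smartctl prints.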

JaimieV · Apr 13, 2020

I've had more WD Reds fail than any other flavour of drive, over the last five years. I don't know what sort of NAS/longevity they were aiming for, but they missed.

Ddog 800 · Apr 13, 2020

OK, thanks for the confirmation! I found a couple of similar threads last night after I posted this. It's definitely looking like these newer WD Red drives are crap. The WD Gold drives seem to be pretty close in price to the Red Pro drives, so I'm going to order a couple of those.

I'll post my results to confirm success once they arrive and I get one installed and resilvered.

radovan · Apr 14, 2020

At least WD is not denying the lie anymore...

Western Digital Fesses Up: Some Red HDDs Use Slow SMR Tech Without Disclosure (www.tomshardware.com)

The company failed to disclose the use of SMR in advertising or spec sheets.

Ddog 800 · Apr 18, 2020

So just to provide a confirmation to my situation, I received a replacement WD Gold drive and installed it. The drive was resilvered in about 4 hours with no errors or issues at all.

As if we need further confirmation at this point now that this story has been making the rounds in the news. :D

If anyone does need to stick with regular WD Reds, you're better off making sure you buy one of the EFRX models, if possible, and avoiding EFAX like the plague! There's probably still some old stock out there. I imagine WD will do what they can to fix the compatibility problems the EFAX drives are currently causing on various RAID platforms, but it seems like they (and other manufacturers) are still going to push forward with the much slower SMR technology. I guess we'll have to wait and see how it goes.

radovan · Apr 29, 2020

I've opened a support case with WD, and they offered an RMA and an exchange for the EFRX version. So that's probably the way to go.
