Hi everyone
I just bought an HP DL380 G8 with 8 x 2TB SAS drives.
I was able to install and configure FreeNAS-11.3-U2 and create one RAIDZ2 pool using all 8 drives.
In addition, I added one M.2 NVMe drive for caching and one SSD for logging.
I wanted to test the server's hot-swap capability and the drive replacement procedure, so I removed one of the drives, booted another OS, and formatted that drive. I then booted back into FreeNAS and re-added the disk. The system started a replacement, and a resilvering process began.
Code:
  pool: FR4G
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 09:33:29 2020
        7.07T scanned at 3.22G/s, 256G issued at 261M/s, 7.07T total
        8.59G resilvered, 3.53% done, 0 days 07:36:58 to go
config:
        NAME                                              STATE     READ WRITE CKSUM
        FR4G                                              DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/1f782fce-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20024e2c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/20c6721c-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            replacing-3                                   DEGRADED     0     0     0
              5134065479397326922                         UNAVAIL      0     0     0  was /dev/gptid/20e82911-8198-11ea-b536-2c44fd830388
              gptid/bebdfb0e-8266-11ea-899c-2c44fd830388  ONLINE       0     0     0
            gptid/210337fe-8198-11ea-b536-2c44fd830388    ONLINE       0     0    12
            gptid/212dd557-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/21487bc4-8198-11ea-b536-2c44fd830388    ONLINE       0     0     0
            gptid/2115e086-8198-11ea-b536-2c44fd830388    ONLINE       0     0    13
        logs
          gptid/685d7224-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0
        cache
          gptid/68c9e10b-8345-11ea-9ec5-2c44fd830388      ONLINE       0     0     0
errors: No known data errors
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2       ONLINE       0     0     0
errors: No known data errors
I know that this should have taken a few hours, but I noticed that it was taking too long. Looking at the server log, I saw that the resilvering process was restarting after it got to ~4%.
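A minimal sketch of how the restarts could be confirmed without reading the log by hand, assuming the `zpool status` text is captured periodically and fed to a small Python script. The function names `parse_resilver` and `restarted` are hypothetical helpers, not part of FreeNAS or ZFS; the idea is simply that a restart shows up as a changed "since" timestamp or as progress going backwards.

```python
import re

# The two parts of the `zpool status` scan section that matter here.
SINCE_RE = re.compile(r"resilver in progress since (.+?\d{4})")
DONE_RE = re.compile(r"([\d.]+)% done")


def parse_resilver(status_text):
    """Return (start_time_string, percent_done), or None if no resilver is running."""
    since = SINCE_RE.search(status_text)
    done = DONE_RE.search(status_text)
    if not since or not done:
        return None
    return since.group(1), float(done.group(1))


def restarted(previous, current):
    """Treat a changed start time or a drop in progress as a restart."""
    if previous is None or current is None:
        return False
    return current[0] != previous[0] or current[1] < previous[1]


# Scan section copied from the pool status in this post.
sample = (
    "scan: resilver in progress since Tue Apr 21 09:33:29 2020\n"
    "7.07T scanned at 3.22G/s, 256G issued at 261M/s, 7.07T total\n"
    "8.59G resilvered, 3.53% done, 0 days 07:36:58 to go\n"
)
first = parse_resilver(sample)
print(first)
```

Calling `restarted(previous, current)` on two consecutive samples would then flag each time the resilver starts over.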
Looking at the server output, I see the following errors:
Code:
Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11181200
Apr 21 09:36:27 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 4
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): READ(10). CDB: 28 00 02 5c a0 60 00 01 00 00
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR asc:18,5 (Recovered data - recommend reassignment)
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Info: 0x25ca090
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Field Replaceable Unit: 1
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Command Specific Info: 0x11040400
Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): Actual Retry Count: 7
The interesting part here is that da5 is not even the drive I removed (that was da7).
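To check which devices the errors actually belong to across a longer log, the sense lines can be tallied per device. This is only a sketch: `tally_sense_errors` is a hypothetical helper, and it assumes log lines in the same `(daN:ciss0:...)` format as the excerpt in this post.

```python
import re
from collections import Counter

# Device name inside the CAM path, e.g. "(da5:ciss0:32:5:0)".
DEVICE_RE = re.compile(r"\((da\d+):[^)]*\)")


def tally_sense_errors(log_lines):
    """Count 'SCSI sense:' lines per device name."""
    counts = Counter()
    for line in log_lines:
        if "SCSI sense:" not in line:
            continue
        device = DEVICE_RE.search(line)
        if device:
            counts[device.group(1)] += 1
    return counts


# Three lines copied from the log excerpt in this post.
sample = [
    "Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): CAM status: SCSI Status Error",
    "Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI status: Check Condition",
    "Apr 21 09:36:28 fr4g (da5:ciss0:32:5:0): SCSI sense: RECOVERED ERROR "
    "asc:18,5 (Recovered data - recommend reassignment)",
]
print(tally_sense_errors(sample))
```

Run over the full log, a tally dominated by a single device would point at that disk rather than at the one that was swapped.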
Here is the smartctl output for da5:
Code:
smartctl -a /dev/da5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0001
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5004129ea67
Serial number:        Z1P1KMH500009232N4PY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:51:09 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     32 C
Drive Trip Temperature:        68 C
Manufactured in week 09 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  49
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  49
Elements in grown defect list: 3539
Vendor (Seagate Cache) information
  Blocks sent to initiator = 873378287
  Blocks received from initiator = 638587946
  Blocks read from cache and sent to initiator = 37263257
  Number of read and write commands whose size <= segment size = 633436
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64723.22
  number of minutes until next internal SMART test = 41
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1542301948      605         0  1542302553        711        447.178         106
write:           0        0         0           0          0        328.511           0
verify:     554223      145         0      554368        190          0.000          45
Non-medium error count:        1
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64713                 - [-   -    -]
# 2  Background short  Completed                   -   64701                 - [-   -    -]
Here is the same output for da7:
Code:
smartctl -a /dev/da7
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0001
Revision:             0002
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500573e7883
Serial number:        Z1P66CMH0000940661R9
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 21 09:53:32 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     30 C
Drive Trip Temperature:        68 C
Manufactured in week 35 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  54
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  54
Elements in grown defect list: 0
Vendor (Seagate Cache) information
  Blocks sent to initiator = 462698561
  Blocks received from initiator = 1051950447
  Blocks read from cache and sent to initiator = 3971534
  Number of read and write commands whose size <= segment size = 1939050
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 50599.40
  number of minutes until next internal SMART test = 33
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    860180076        0         0   860180076          0        236.902           0
write:           0        0         0           0          0        538.893           0
Non-medium error count:        2
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   50589                 - [-   -    -]
# 2  Background short  Completed                   -   50577                 - [-   -    -]
# 3  Background short  Aborted (by user command)   -   50577                 - [-   -    -]
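For comparing the two drives side by side, the figures that differ most (grown defect list size and the uncorrected read errors in the error counter log) can be pulled out with a short script. This is a sketch under the assumption that the `smartctl -a` output for a SAS drive keeps the layout shown in this post; `smart_summary` is a hypothetical helper, not a smartmontools API.

```python
import re

DEFECTS_RE = re.compile(r"Elements in grown defect list:\s*(\d+)")
# "read:" row of the error counter log: seven numeric columns, the last
# one being total uncorrected errors.
READ_ROW_RE = re.compile(
    r"read:\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)"
)


def smart_summary(smartctl_text):
    """Extract grown defects and uncorrected read errors; None where absent."""
    defects = DEFECTS_RE.search(smartctl_text)
    read_row = READ_ROW_RE.search(smartctl_text)
    return {
        "grown_defects": int(defects.group(1)) if defects else None,
        "uncorrected_read_errors": int(read_row.group(7)) if read_row else None,
    }


# Excerpts copied from the two smartctl outputs in this post.
da5 = (
    "Elements in grown defect list: 3539\n"
    "read: 1542301948 605 0 1542302553 711 447.178 106\n"
)
da7 = (
    "Elements in grown defect list: 0\n"
    "read: 860180076 0 0 860180076 0 236.902 0\n"
)
for name, text in (("da5", da5), ("da7", da7)):
    print(name, smart_summary(text))
```

The same function could be run over the output for all eight drives to see whether da5 is the only one reporting grown defects.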
I tried swapping the bays to make sure it is not a cable problem, but I wonder if there is a disk problem here that I need to consider.
Here are more details on the system:
- HP Proliant 665553-B21 DL380p Gen8
- 2 x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
- 128GB RAM
- 8 x Seagate ST2000NM0001 2TB SAS 3.5-inch drives
- 1 x Samsung SSD 850 EVO 500GB for log drive
- 1 x PC401 NVMe SK hynix 512GB for cache
Itamar