Disks statuses Faulted and Degraded

m0t0rh3ad · Jan 16, 2021

Hello, I'm using TrueNAS-SCALE-20.12-ALPHA

During a scrub, one of the disks became FAULTED and another DEGRADED:

zpool status -x
pool: storage
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub in progress since Sat Jan 16 10:26:48 2021
18.1T scanned at 671M/s, 15.8T issued at 587M/s, 52.7T total
396M repaired, 30.07% done, 18:16:00 to go
config:

NAME                                      STATE     READ WRITE CKSUM
storage                                   DEGRADED     0     0     0
  raidz2-0                                DEGRADED     0     0     0
    25f5c8e3-dad5-4317-9cd0-541eb34b02c1  ONLINE       0     0     0
    f7166c07-fa60-4b8f-be4c-6bafca895a6f  ONLINE       0     0     0
    b98c2d82-2be3-4713-9cfd-df3366bf6056  ONLINE       0     0     0
    a0a8f7b7-73a8-4107-984f-6107ac9bf869  ONLINE       0     0     0
    9edf71ba-b7b6-4d17-a8fa-afd2c6f25598  FAULTED     36     1     0  too many errors (repairing)
    c5cd41e3-d069-4553-8702-514256a365bc  ONLINE       0     0     0
  raidz2-1                                DEGRADED     0     0     0
    b5efefcc-8e65-4c45-a813-9f4d60f891ee  ONLINE       0     0     0
    fec84c1d-84a1-49c8-8f15-566e6185e4fe  ONLINE       0     0     0
    f3a37a11-f0bc-4e58-8f2b-b586f8f720dd  ONLINE       0     0     0
    043a1431-0bbd-4b6a-913f-ca0c45b6c1f3  ONLINE       0     0     0
    c1c71649-68f0-4da9-9d30-bd81b32e5ef9  DEGRADED     0     0 12.6K  too many errors (repairing)
    ead04bff-68f9-4a59-90f6-bccb61a78f99  ONLINE       0     0     3  (repairing)

errors: No known data errors
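
To pick the affected rows out of a long `zpool status` listing, a small filter can help. This is only a sketch: the function name `devices_with_errors` is made up for illustration, and it assumes the config table keeps the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above. It reports every row whose counters are not all zero, which also catches a device that is still ONLINE but has checksum errors.

```python
def devices_with_errors(status_text):
    """Return config rows whose READ, WRITE or CKSUM counter is not "0"."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # header row of the config table
            continue
        if fields[:1] == ["errors:"]:
            break                     # trailer line ends the table
        if not in_config or len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, state, read, write, cksum))
    return rows


# A few rows copied from the output above.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
storage DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
25f5c8e3-dad5-4317-9cd0-541eb34b02c1 ONLINE 0 0 0
9edf71ba-b7b6-4d17-a8fa-afd2c6f25598 FAULTED 36 1 0 too many errors (repairing)
c1c71649-68f0-4da9-9d30-bd81b32e5ef9 DEGRADED 0 0 12.6K too many errors (repairing)
ead04bff-68f9-4a59-90f6-bccb61a78f99 ONLINE 0 0 3 (repairing)

errors: No known data errors
"""

for row in devices_with_errors(SAMPLE):
    print(*row)
```

Because the fields are split on whitespace, the filter works on both the aligned and the flattened form of the table.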

When I checked the disks' SMART data, I couldn't find anything "bad":

smartctl -a /dev/sdg
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.0-1-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721010AL5204
Revision: C384
Compliance: SPC-4
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca26b2b1ff8
Serial number: 1SGSR76Z
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Jan 16 18:22:12 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 307792
Current Drive Temperature: 42 C
Drive Trip Temperature: 85 C

Manufactured in week 41 of year 2017
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 29
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 201
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 11330039840768000

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  1233620         0   1233620   18437875      68744.629           0
write:         0        0         0         0    2345199      29193.131           0
verify:        0        0         0         0    1788847          0.000           0

Non-medium error count: 0

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Self test in progress ... - NOW - [- - -]
# 2 Foreground short Completed - 5088 - [- - -]
# 3 Foreground long Aborted (device reset ?) - 5087 - [- - -]
# 4 Foreground short Completed - 5082 - [- - -]
# 5 Background short Completed - 5082 - [- - -]
# 6 Background short Completed - 5082 - [- - -]
# 7 Background short Completed - 5082 - [- - -]
# 8 Background short Completed - 752 - [- - -]
# 9 Background short Completed - 1 - [- - -]

Long (extended) Self-test duration: 64492 seconds [1074.9 minutes]

smartctl -a /dev/sdn
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.0-1-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721010AL5204
Revision: C384
Compliance: SPC-4
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca266d3b32c
Serial number: 7JKSE7UC
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Jan 16 18:22:51 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 318948
Current Drive Temperature: 38 C
Drive Trip Temperature: 85 C

Manufactured in week 41 of year 2017
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 148
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 371
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 10750650816135168

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      187         0       187    4496513      69141.144           0
write:         0        0         0         0    2410094      20699.158           0
verify:        0        0         0         0      28839          0.000           0

Non-medium error count: 0

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 6276 - [- - -]
# 2 Background short Completed - 6253 - [- - -]
# 3 Background short Completed - 6231 - [- - -]
# 4 Background short Completed - 6207 - [- - -]
# 5 Background short Completed - 6183 - [- - -]
# 6 Background short Completed - 979 - [- - -]

Long (extended) Self-test duration: 64549 seconds [1075.8 minutes]
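
Both drives report "SMART Health Status: OK" and zero uncorrected errors, but the corrected-error counters in the two error counter logs differ a lot for a similar amount of data read. One way to put them side by side is to normalise the "total errors corrected" column by the gigabytes processed. The sketch below assumes the eight-field row layout of the SAS error counter log shown above; the helper name `corrected_per_tb` is hypothetical. A high rate of corrected errors is not a diagnosis by itself, since these reads were recovered by the drive.

```python
def corrected_per_tb(smart_text, op="read"):
    """Corrected errors per 10^12 bytes processed for one error-counter row.

    Returns None if the row is missing or no data has been processed.
    """
    for line in smart_text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] == op + ":":
            corrected = int(fields[4])     # "Total errors corrected"
            gigabytes = float(fields[6])   # "Gigabytes processed [10^9 bytes]"
            if gigabytes == 0:
                return None
            return corrected / (gigabytes / 1000.0)
    return None


# The "read:" rows from the two smartctl outputs above.
SDG = "read: 0 1233620 0 1233620 18437875 68744.629 0"
SDN = "read: 0 187 0 187 4496513 69141.144 0"

print("sdg:", corrected_per_tb(SDG))
print("sdn:", corrected_per_tb(SDN))
```

For these two rows the first drive comes out several thousand times higher than the second.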

Can you suggest anything? Should I be worried about these disks and my data?

Kris Moore · Jan 16, 2021

Yeah, I'd be concerned about this one disk:

> 9edf71ba-b7b6-4d17-a8fa-afd2c6f25598 FAULTED 36 1 0

A bunch of read errors plus one write error does look like a drive on the edge. Maybe check the cable first, though?

m0t0rh3ad · Jan 21, 2021

Very strange behaviour. I checked the cables and cleared the errors. Today I got the same situation with another disk: 36 read errors. Could this be a bug in the alpha version of TrueNAS SCALE?

Kris Moore · Jan 21, 2021

m0t0rh3ad said:
Very strange behaviour. I checked the cables and cleared the errors. Today I got the same situation with another disk: 36 read errors. Could this be a bug in the alpha version of TrueNAS SCALE?

I think this type of error is very unlikely to be a bug in SCALE itself. Can you check the 'dmesg' output? Are you seeing any other kernel or hardware errors reported?

m0t0rh3ad · Jan 21, 2021

Kris Moore said:
I think this type of error is very unlikely to be a bug in SCALE itself. Can you check the 'dmesg' output? Are you seeing any other kernel or hardware errors reported?

sirius# dmesg |grep error
[ 79.248606] blk_update_request: I/O error, dev sda, sector 1024 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.278087] blk_update_request: I/O error, dev sda, sector 1025 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.431662] blk_update_request: I/O error, dev sda, sector 1032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.453183] blk_update_request: I/O error, dev sda, sector 1033 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.513873] blk_update_request: I/O error, dev sda, sector 1048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.540053] blk_update_request: I/O error, dev sdb, sector 1048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.554243] blk_update_request: I/O error, dev sda, sector 1049 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.566262] blk_update_request: I/O error, dev sdk, sector 1048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.566276] md/raid1:md127: sdk1: unrecoverable I/O read error for block 24
[ 79.566282] Buffer I/O error on dev md127, logical block 24, async page read
[ 79.656718] blk_update_request: I/O error, dev sdb, sector 1049 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.688840] blk_update_request: I/O error, dev sdk, sector 1049 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.701185] md/raid1:md127: sdk1: unrecoverable I/O read error for block 25
[ 79.710381] Buffer I/O error on dev md127, logical block 25, async page read
[ 79.719675] md/raid1:md127: sdk1: unrecoverable I/O read error for block 26
[ 79.728847] Buffer I/O error on dev md127, logical block 26, async page read
[ 79.738476] md/raid1:md127: sdk1: unrecoverable I/O read error for block 27
[ 79.747770] Buffer I/O error on dev md127, logical block 27, async page read
[ 79.756247] md/raid1:md127: sdk1: unrecoverable I/O read error for block 28
[ 79.764456] Buffer I/O error on dev md127, logical block 28, async page read
[ 79.772892] md/raid1:md127: sdk1: unrecoverable I/O read error for block 29
[ 79.781076] Buffer I/O error on dev md127, logical block 29, async page read
[ 79.789425] md/raid1:md127: sdk1: unrecoverable I/O read error for block 30
[ 79.797625] Buffer I/O error on dev md127, logical block 30, async page read
[ 79.805895] md/raid1:md127: sdk1: unrecoverable I/O read error for block 31
[ 79.814193] Buffer I/O error on dev md127, logical block 31, async page read
[ 80.246155] md/raid1:md127: sdk1: unrecoverable I/O read error for block 24
[ 80.255719] Buffer I/O error on dev md127, logical block 24, async page read
[ 80.275896] md/raid1:md127: sdk1: unrecoverable I/O read error for block 25
[ 80.284361] Buffer I/O error on dev md127, logical block 25, async page read
[ 3705.289816] transmission-da[26942]: segfault at 0 ip 0000000000000000 sp 00007f47b62aca48 error 14 in transmission-daemon[56080af56000+18000]
[178367.382304] print_req_error: 200 callbacks suppressed
[178367.382307] blk_update_request: I/O error, dev sdm, sector 136190240 op 0x1:(WRITE) flags 0x700 phys_seg 23 prio class 0
[178367.399984] zio pool=boot-pool vdev=/dev/sdm3 error=5 type=2 offset=52011593728 size=122880 flags=40080c80
[178367.445069] blk_update_request: I/O error, dev sdm, sector 136609008 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 0
[178367.456842] zio pool=boot-pool vdev=/dev/sdm3 error=5 type=2 offset=52226002944 size=12288 flags=40080c80
[178367.502188] blk_update_request: I/O error, dev sdm, sector 137408624 op 0x1:(WRITE) flags 0x700 phys_seg 11 prio class 0
[178367.514145] zio pool=boot-pool vdev=/dev/sdm3 error=5 type=2 offset=52635406336 size=98304 flags=40080c80
[273329.020370] blk_update_request: I/O error, dev sde, sector 18639025288 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0
[273329.032502] zio pool=storage vdev=/dev/disk/by-partuuid/9edf71ba-b7b6-4d17-a8fa-afd2c6f25598 error=5 type=1 offset=9541032939520 size=1048576 flags=40080c80
[273329.129433] blk_update_request: I/O error, dev sde, sector 18639023304 op 0x0:(READ) flags 0x700 phys_seg 31 prio class 0
[273329.142023] zio pool=storage vdev=/dev/disk/by-partuuid/9edf71ba-b7b6-4d17-a8fa-afd2c6f25598 error=5 type=1 offset=9541031923712 size=1015808 flags=40080c80

Kris Moore · Jan 22, 2021

Yeah, this really does seem to be hardware related. All those blk_update_request errors are no good, and you're getting them on several devices, which might point to the controller...
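
To see how the errors spread across devices, the `blk_update_request` lines can be tallied per device and operation. The following is a rough sketch, assuming the kernel message format shown in the dmesg output above; the regular expression and the name `tally_io_errors` are illustrative, not part of any tool. In the listing above, five devices appear in such lines (sda, sdb, sdk, sdm and sde).

```python
import re
from collections import Counter

# Matches lines such as:
#   blk_update_request: I/O error, dev sda, sector 1024 op 0x0:(READ) ...
# "Buffer I/O error on dev md127" lines do not match and are ignored.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector \d+ op 0x[0-9a-f]+:\((\w+)\)")


def tally_io_errors(dmesg_text):
    """Count block-layer I/O errors per (device, operation)."""
    counts = Counter()
    for match in IO_ERROR.finditer(dmesg_text):
        device, operation = match.groups()
        counts[(device, operation)] += 1
    return counts


# A few lines copied from the dmesg output above.
SAMPLE = """\
[ 79.248606] blk_update_request: I/O error, dev sda, sector 1024 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.278087] blk_update_request: I/O error, dev sda, sector 1025 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 79.540053] blk_update_request: I/O error, dev sdb, sector 1048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 79.566282] Buffer I/O error on dev md127, logical block 24, async page read
[178367.382307] blk_update_request: I/O error, dev sdm, sector 136190240 op 0x1:(WRITE) flags 0x700 phys_seg 23 prio class 0
[273329.020370] blk_update_request: I/O error, dev sde, sector 18639025288 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0
"""

for (device, operation), count in sorted(tally_io_errors(SAMPLE).items()):
    print(device, operation, count)
```

Feeding it the full output of `dmesg` instead of the sample gives the complete picture; note that the kernel may suppress repeated messages, so the counts are a lower bound.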

m0t0rh3ad · Jan 22, 2021

Kris Moore said:
Yeah, this really does seem to be hardware related. All those blk_update_request errors are no good, and you're getting them on several devices, which might point to the controller...

I have a controller integrated on the motherboard, with 12 disks connected to it. The server case has a backplane, and the controller is connected to the backplane via SAS HD cables. So it could also be a backplane or controller problem, or maybe both...

m0t0rh3ad · Jan 22, 2021

I started a scrub manually and got errors on disk sde:
dmesg | grep sde
[ 8.428189] sd 0:0:4:0: [sde] 2441609216 4096-byte logical blocks: (10.0 TB/9.10 TiB)
[ 8.428524] sd 0:0:4:0: [sde] Write Protect is off
[ 8.428525] sd 0:0:4:0: [sde] Mode Sense: f7 00 10 08
[ 8.429042] sd 0:0:4:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 8.517962] sde: sde1 sde2
[ 8.568722] sd 0:0:4:0: [sde] Attached SCSI disk
[ 81.408265] sd 0:0:4:0: [sde] tag#1053 request not aligned to the logical block size
[ 81.416876] sd 0:0:4:0: [sde] tag#1054 request not aligned to the logical block size
[ 81.461420] sd 0:0:4:0: [sde] tag#1064 request not aligned to the logical block size
[ 81.470332] sd 0:0:4:0: [sde] tag#1065 request not aligned to the logical block size
[ 81.514656] sd 0:0:4:0: [sde] tag#1073 request not aligned to the logical block size
[ 81.523119] sd 0:0:4:0: [sde] tag#1075 request not aligned to the logical block size
[ 81.565566] sd 0:0:4:0: [sde] tag#585 request not aligned to the logical block size
[ 81.574784] sd 0:0:4:0: [sde] tag#586 request not aligned to the logical block size
[ 81.624257] sd 0:0:4:0: [sde] tag#1025 request not aligned to the logical block size
[ 81.632699] sd 0:0:4:0: [sde] tag#1026 request not aligned to the logical block size
[ 81.675911] sd 0:0:4:0: [sde] tag#1035 request not aligned to the logical block size
[ 81.684647] sd 0:0:4:0: [sde] tag#1037 request not aligned to the logical block size
[ 81.728263] sd 0:0:4:0: [sde] tag#1038 request not aligned to the logical block size
[ 81.739992] sd 0:0:4:0: [sde] tag#1039 request not aligned to the logical block size
[ 81.787515] sd 0:0:4:0: [sde] tag#1045 request not aligned to the logical block size
[ 81.796475] sd 0:0:4:0: [sde] tag#1047 request not aligned to the logical block size
[ 81.844717] sd 0:0:4:0: [sde] tag#1055 request not aligned to the logical block size
[ 81.853732] sd 0:0:4:0: [sde] tag#1056 request not aligned to the logical block size
[ 81.897058] sd 0:0:4:0: [sde] tag#1064 request not aligned to the logical block size
[ 81.905451] sd 0:0:4:0: [sde] tag#1066 request not aligned to the logical block size
[ 81.948459] sd 0:0:4:0: [sde] tag#1073 request not aligned to the logical block size
[ 81.956848] sd 0:0:4:0: [sde] tag#1074 request not aligned to the logical block size
[ 82.067407] sd 0:0:4:0: [sde] tag#1029 request not aligned to the logical block size
[ 82.075594] sd 0:0:4:0: [sde] tag#1031 request not aligned to the logical block size
[273328.930941] sd 0:0:4:0: [sde] tag#234 CDB: Read(10) 28 00 8a df 1e 91 00 01 00 00
[273328.999843] sd 0:0:4:0: [sde] tag#234 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
[273329.011597] sd 0:0:4:0: [sde] tag#234 CDB: Read(10) 28 00 8a df 1e 91 00 01 00 00
[273329.020370] blk_update_request: I/O error, dev sde, sector 18639025288 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0
[273329.065315] sd 0:0:4:0: [sde] tag#228 CDB: Read(10) 28 00 8a df 1d 99 00 00 f8 00
[273329.107822] sd 0:0:4:0: [sde] tag#228 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
[273329.120374] sd 0:0:4:0: [sde] tag#228 CDB: Read(10) 28 00 8a df 1d 99 00 00 f8 00
[273329.129433] blk_update_request: I/O error, dev sde, sector 18639023304 op 0x0:(READ) flags 0x700 phys_seg 31 prio class 0
[420190.387056] sd 0:0:4:0: [sde] tag#146 CDB: Read(10) 28 00 06 35 e1 8c 00 00 10 00
[420190.445643] sd 0:0:4:0: [sde] tag#146 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[420190.458433] sd 0:0:4:0: [sde] tag#146 CDB: Read(10) 28 00 06 35 e1 8c 00 00 10 00
[420190.469985] blk_update_request: I/O error, dev sde, sector 833555552 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
[420190.525560] sd 0:0:4:0: [sde] tag#145 CDB: Read(10) 28 00 06 35 91 37 00 00 08 00
[420190.597611] sd 0:0:4:0: [sde] tag#145 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[420190.608761] sd 0:0:4:0: [sde] tag#145 CDB: Read(10) 28 00 06 35 91 37 00 00 08 00
[420190.617657] blk_update_request: I/O error, dev sde, sector 833391032 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[420190.667609] sd 0:0:4:0: [sde] tag#144 CDB: Read(10) 28 00 06 33 e0 73 00 00 20 00
[420190.820586] sd 0:0:4:0: [sde] tag#144 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[420190.831892] sd 0:0:4:0: [sde] tag#144 CDB: Read(10) 28 00 06 33 e0 73 00 00 20 00
[420190.841694] blk_update_request: I/O error, dev sde, sector 832504728 op 0x0:(READ) flags 0x700 phys_seg 4 prio class 0
[420190.886378] sd 0:0:4:0: [sde] tag#142 CDB: Read(10) 28 00 33 00 2a 70 00 00 08 00
[420190.974481] sd 0:0:4:0: [sde] tag#142 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[420190.987829] sd 0:0:4:0: [sde] tag#142 CDB: Read(10) 28 00 33 00 2a 70 00 00 08 00
[420191.001014] blk_update_request: I/O error, dev sde, sector 6845191040 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[420191.050317] sd 0:0:4:0: [sde] tag#134 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[420191.160175] sd 0:0:4:0: [sde] tag#176 CDB: Read(10) 28 00 3f eb a6 13 00 00 fd 00
[420191.207006] sd 0:0:4:0: [sde] tag#176 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=32s
[420191.218731] sd 0:0:4:0: [sde] tag#176 CDB: Read(10) 28 00 3f eb a6 13 00 00 fd 00
[420191.228740] blk_update_request: I/O error, dev sde, sector 8579264664 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0

Can you please explain whether this is a problem with the disk or with other hardware?
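
The timed-out commands in this log can be matched to the `blk_update_request` lines by decoding the CDB bytes. A Read(10) CDB carries a 4-byte big-endian LBA in bytes 2 to 5 and a 2-byte transfer length in bytes 7 to 8, both counted in the drive's logical blocks (4096 bytes here, per the `[sde]` probe line), while the kernel prints sectors in 512-byte units. The helper below is a minimal sketch of that arithmetic; the name `decode_read10` and the returned dictionary keys are made up for illustration.

```python
def decode_read10(cdb_hex, logical_block_size=4096):
    """Decode a Read(10) CDB given as a hex string such as '28 00 8a df ...'."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # transfer length in blocks
    return {
        "lba": lba,
        "blocks": blocks,
        # the kernel reports positions in 512-byte sectors
        "kernel_sector": lba * logical_block_size // 512,
        "bytes": blocks * logical_block_size,
    }


# First failed command in the log above (tag#234).
info = decode_read10("28 00 8a df 1e 91 00 01 00 00")
print(info["kernel_sector"], info["bytes"])
```

For tag#234 this gives sector 18639025288 and 1048576 bytes, which agrees with the `blk_update_request` sector and with `size=1048576` in the zio line for the same partition in the earlier `dmesg | grep error` output.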

m0t0rh3ad · Jan 22, 2021

SMART data for this disk:

smartctl -a /dev/sde
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.0-1-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721010AL5204
Revision: C384
Compliance: SPC-4
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca26b2b1ff8
Serial number: 1SGSR76Z
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri Jan 22 22:52:42 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 316703
Current Drive Temperature: 41 C
Drive Trip Temperature: 85 C

Manufactured in week 41 of year 2017
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 29
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 207
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 11427655656144896

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  1254048         0   1254048   18769528      72488.247           0
write:         0        0         0         0    2345949      29270.337           0
verify:        0        0         0         0    2095794          0.000           0

Non-medium error count: 0

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Self test in progress ... - NOW - [- - -]
# 2 Foreground short Completed - 5088 - [- - -]
# 3 Foreground long Aborted (device reset ?) - 5087 - [- - -]
# 4 Foreground short Completed - 5082 - [- - -]
# 5 Background short Completed - 5082 - [- - -]
# 6 Background short Completed - 5082 - [- - -]
# 7 Background short Completed - 5082 - [- - -]
# 8 Background short Completed - 752 - [- - -]
# 9 Background short Completed - 1 - [- - -]

Long (extended) Self-test duration: 64492 seconds [1074.9 minutes]
