I/O errors with S3008L-L8E and LRSACX36-24I

willll

Cadet
Joined
May 21, 2023
Messages
5
Hi,

While checking my kernel logs, I found a lot of these error messages in /var/log/syslog:

Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2990830026752 size=1019904 flags=40080c90
Jun 18 10:15:24 truenas kernel: mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
Jun 18 10:15:24 truenas kernel: mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
Jun 18 10:15:24 truenas kernel: mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
Jun 18 10:15:24 truenas kernel: mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2996811702272 size=4096 flags=180990
Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2996811829248 size=4096 flags=180990
Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2990831046656 size=1015808 flags=40080c90
Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2996811546624 size=4096 flags=180990
Jun 18 10:15:24 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/f00b9403-bf92-11ed-a3c9-6c626d38329f error=5 type=1 offset=2990832062464 size=1019904 flags=40080c90
Jun 18 10:15:24 truenas kernel: sd 1:0:2:0: Power-on or device reset occurred

Or

Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] Unaligned partial completion (resid=177148, sector_sz=512)
Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] tag#1353 CDB: Read(16) 88 00 00 00 00 01 45 21 88 a8 00 00 06 58 00 00
Jun 18 09:53:49 truenas kernel: scsi_io_completion_action: 2 callbacks suppressed
Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] tag#1353 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] tag#1353 Sense Key : Aborted Command [current]
Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] tag#1353 Add. Sense: Information unit iuCRC error detected
Jun 18 09:53:49 truenas kernel: sd 1:0:2:0: [sdd] tag#1353 CDB: Read(16) 88 00 00 00 00 01 45 21 88 a8 00 00 06 58 00 00
Jun 18 09:53:49 truenas kernel: print_req_error: 2 callbacks suppressed

Or

Jun 18 10:14:36 truenas kernel: sd 1:0:13:0: [sdo] tag#516 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Jun 18 10:14:36 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/ad1aefb6-c915-11ed-9838-001e4fc1342e error=5 type=1 offset=2975952191488 size=618496 flags=40080c90
Jun 18 10:14:36 truenas kernel: sd 1:0:13:0: [sdo] tag#516 CDB: Read(16) 88 00 00 00 00 01 5e 32 4e 98 00 00 00 58 00 00
Jun 18 10:14:36 truenas kernel: sd 1:0:13:0: [sdo] tag#485 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Jun 18 10:14:36 truenas kernel: blk_update_request: I/O error, dev sdo, sector 5875322520 op 0x0:(READ) flags 0x700 phys_seg 11 prio class 0
Jun 18 10:14:36 truenas kernel: sd 1:0:13:0: [sdo] tag#485 CDB: Read(16) 88 00 00 00 00 01 5e 32 44 d0 00 00 05 10 00 00
Jun 18 10:14:36 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/ad1aefb6-c915-11ed-9838-001e4fc1342e error=5 type=1 offset=2975952809984 size=45056 flags=180990
Jun 18 10:14:36 truenas kernel: blk_update_request: I/O error, dev sdo, sector 5875320016 op 0x0:(READ) flags 0x700 phys_seg 113 prio class 0
Jun 18 10:14:36 truenas kernel: zio pool=main vdev=/dev/disk/by-partuuid/ad1aefb6-c915-11ed-9838-001e4fc1342e error=5 type=1 offset=2975951527936 size=663552 flags=40080c90
Jun 18 10:14:37 truenas kernel: sd 1:0:13:0: Power-on or device reset occurred

My understanding is that, at some point, the LSI card cannot read the HDDs and resets them. That even prevents smartctl from finishing its tests.
All the hard drives are impacted, and 5 of them are less than 4 months old, so I would rule out the hard drives for now.
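
For anyone following along, here is a minimal sketch of how to cross-check which physical disk a given zio error points at (assuming the pool is named main, as in the logs above, and using one of the PARTUUIDs from those lines):

# Pool health plus per-vdev read/write/checksum error counters
zpool status -v main

# Map the PARTUUID from a zio error line back to a /dev/sdX device and its serial number
lsblk -o NAME,PARTUUID,SERIAL | grep -i f00b9403-bf92-11ed-a3c9-6c626d38329f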

>uname -a
Linux truenas 5.15.79+truenas #1 SMP Mon Apr 10 14:00:27 UTC 2023 x86_64 GNU/Linux

I already spent some time looking into my card's firmware:

>sas3flash -c 0 -list

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

Adapter Selected is a Avago SAS: SAS3008(C0)

Controller Number : 0
Controller : SAS3008(C0)
PCI Address : 00:44:00:00
SAS Address : 5003048-0-18d8-d402
NVDATA Version (Default) : 0e.01.00.08
NVDATA Version (Persistent) : 0e.01.30.28
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9300-8e
BIOS Version : 08.37.00.00
UEFI BSD Version : 18.00.00.00
FCODE Version : N/A
Board Name : LSI3008-IT
Board Assembly : N/A
Board Tracer Number : N/A

Finished Processing Commands Successfully.
Exiting SAS3Flash.

I also tried swapping the computer for a PowerEdge R720xd.

I have the following setup:
* Supermicro S3008L-L8E
* SFF-8644 to SFF-8643 card adapter,
* LRSACX36-24I SAS expander card
* 20 various Western Digital SATA HDDs

The HDDs are powered through their own PSUs, with at most 3 HDDs per SATA power line.

I am not sure where to look now.
 

Valor

Cadet
Joined
Jun 21, 2023
Messages
3
I have the exact same log records as you.
On 5/24, I purchased three 18TB Toshiba hard drives and an LSI 9211-8i HBA card, configured to use IT mode. However, starting from 5/27, I have been experiencing the same log messages as you.

The hard drives are randomly and continuously being kicked out of the pool and then restored. Subsequently, a resilver process starts. I have tried replacing the hard drive cables, but it did not help. As my next step, I plan to switch to an LSI 2308 controller.

I was initially on TrueNAS-SCALE-22.12.2, and even after upgrading to 22.12.3 the same error persisted. I just upgraded to 22.12.3.1 a moment ago; if there are any further developments, I will provide updates.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
If it is happening to a specific model of hard drive, it is possible that the hard drive is set to auto-park the heads. This can be seen in SMART output from smartctl -x DEV on the disks in question.

Here are the two items I would look at, power-on hours and load cycle count:
  9 Power_On_Hours    -O--CK   100   100   000    -    66543
225 Load_Cycle_Count  -O--CK   096   096   000    -    48512
This is from a 2TB laptop drive in my media server. Most of the 48K load cycles were from the first few months, until I realized the problem and disabled that function. In a server, it makes little sense to park the heads aggressively.

How this "may" affect the SAS controller is that the time needed to "un-park" the heads is seen as taking too long. Thus, an error.


Next, it could be the TLER value (Time Limited Error Recovery; some vendors use a different name, but it is the same thing). Normally, on NAS drives with redundancy at a higher level (the OS in our case, not a RAID controller), a simple 7 seconds is all that is needed. For desktop, non-redundant use, going to extremes to recover a failing disk sector is useful; it can take more than a minute to work through all the recovery attempts before actually succeeding or failing.

If a disk is taking too long on a read, this ends up causing the SAS driver or ZFS to give up on the disk. If the disk comes back, then ZFS attempts recovery, and restores the failed block from redundancy.

If TLER is set to 7 seconds, then after that time the disk returns a bad-block error. ZFS detects this and then simply, AND AUTOMATICALLY, recovers the bad block from redundancy and writes it back to the failing disk, which remaps it to a spare sector. Of course it will log a read error, but it won't drop the disk from the pool.
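
As a rough sketch, TLER (SCT Error Recovery Control) can be read and set with smartctl on drives that support it. The values are in units of 100 ms and usually do not persist across a power cycle; /dev/sdd is again just a stand-in for one of the disks in question:

# Show the current SCT ERC read/write timeouts (drives without TLER support will report that)
smartctl -l scterc /dev/sdd

# Set both the read and write recovery timeouts to 7.0 seconds (70 x 100 ms)
smartctl -l scterc,70,70 /dev/sdd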



As an example, years ago a friend kept having a hard drive drop out of his NAS RAID array (not TrueNAS and not ZFS). I suggested NAS-specific drives because of the head parking & TLER issues. That solved his problem, though we never determined which of the two it was.
 

willll

Cadet
Joined
May 21, 2023
Messages
5

I started having issues with CORE and migrated to TrueNAS-SCALE-22.12.2; it did not change anything. I noticed that playing around with the HBA card firmware had some impact: for example, with firmware version 16.00.00.00 the disks were bumped out of the pool.

While reading this: https://forums.servethehome.com/index.php?threads/misadventures-with-lsi-sas3008-cards.33761/
I realized that there are some bad batches of those cards across a lot of models. I ordered a SAS 9300-8e, which is also based on the SAS3008, to rule out the card. I will post my results here.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I realized that there are some bad batches of those cards across a lot of models.

More likely fake cards. "Bad batches" aren't really a thing; these machines go to data centers straight from the manufacturer and typically have on site service in the unusual case that something goes wrong. Cards replaced under warranty are generally trashed; no one in the professional world spends time trying to "debug" them. This means that the cards that reach the end of lease at 3 or 5 years have been working fine for that time period, so when the e-cyclers come to tear down the rack and parts up the servers, those should all be good cards. See


It might be interesting to know where you acquired the card. There are a bunch of bad sellers out there, and I like to do what I can to help people avoid getting ripped off, so I'm always interested in hearing about where you got something that turned out bad.
 

willll

Cadet
Joined
May 21, 2023
Messages
5

Thank you,
It is a Supermicro card, hence I was not expecting any issues, but you are right, it may be a counterfeit.
To figure this out, I bought another card, a SAS 9300-8e (https://www.amazon.ca/dp/B07VV976RB?psc=1&ref=ppx_yo2ov_dt_b_product_details).

I swapped the two LSI cards, and I still have the same I/O errors.

To rule out the LRSACX36-24I (http://www.linkreal.com.cn/en/products/LRSACX3624I.html), I now have both cards connected, to see if the one connected to the LRSACX36-24I behaves differently from the other.
I will also check the head-park time, but it looks like I need to dig into the HDD documentation to make sense of the openSeaChest output.
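
In the meantime, a minimal sketch of checking and adjusting the APM level (which drives the head parking on many disks) with plain smartctl, assuming /dev/sdd stands in for one of the affected disks and that the drive actually honours APM (some WD models use a vendor-specific idle3 timer instead, which this will not touch):

# Show the current Advanced Power Management level; low values mean aggressive power saving and head parking
smartctl -g apm /dev/sdd

# Raise it to 254 (maximum performance without disabling APM); use "off" to disable APM entirely
smartctl -s apm,254 /dev/sdd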
 

Valor

Cadet
Joined
Jun 21, 2023
Messages
3
I'm sorry for the late update. Since the last time, I have made several attempts.

Here are the attempts I have made:
1. Replacing the hard drive data cable.
2. Replacing the HBA card with a different model.
3. Changing the hard drive connection from the HBA card to the motherboard's SATA interface.

Next, I tried some system adjustments:
1. Setting TLER to 7.0 seconds.
2. Disabling NCQ.

I conducted cross-testing between hardware and system adjustments. For example, after changing the data cable, I tested for any changes between default settings and enabling TLER or disabling NCQ.
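
For reference, a minimal sketch of the NCQ side of that adjustment (the TLER side is the same smartctl -l scterc,70,70 command shown earlier in the thread). /dev/sdd is just a stand-in for one of the affected disks, and the change does not survive a reboot:

# Current NCQ queue depth (typically 31 on SATA drives)
cat /sys/block/sdd/device/queue_depth

# Setting it to 1 effectively disables NCQ for that disk
echo 1 > /sys/block/sdd/device/queue_depth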

Afterwards, I researched many other cases and discovered from the SMART data that the Start_Stop_Count and Power-Off_Retract_Count values increase every time the hard drive is disconnected. However, the UDMA_CRC_Error_Count remains at 0. Based on this, I concluded that the reason for the drive disconnection is not data read errors but rather power abnormalities. There is a high possibility of a faulty power supply since the current one was taken from an old system and has been in use for over 10 years (since 2012). To address this, I replaced the power supply with a brand new one on 7/5, hoping to prove my hypothesis.
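
A quick sketch of that check, looping over the disks (assuming they show up as /dev/sda, /dev/sdb, ...; adjust the glob if there are more than nine):

# Compare the attributes that move on a power loss against the one that moves on cable/link corruption
for d in /dev/sd?; do
  echo "== $d"
  smartctl -A "$d" | grep -E 'Start_Stop_Count|Power-Off_Retract_Count|UDMA_CRC_Error_Count'
done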

Lastly, I want to mention that since this issue occurs intermittently, even if everything seems fine for half a day, it doesn't mean the problem is resolved. There were instances where I thought I had identified the problem, only to receive a mail notification on my phone the next day with a subject line saying "ZFS device fault..." It can be quite disheartening. The issue first occurred on 5/27, then after adjusting the cable positioning and other operations, it seemed to be resolved until the hard drive disconnection issue reoccurred on 6/17. This gap was nearly a month, and the problem shifted from random drive disconnections to a specific drive. However, since the hard drives are relatively new and the issue initially happened randomly, I still believe the likelihood of the drives being faulty is low.


Here are the specifications of my NAS system:
Intel Xeon E3-1230v2
Kingston 8G RAM*4
ASUS P8H77-M PRO
Samsung 870 evo 256G SSD*2
Supermicro LSI2308-IT Firmware Version 20.00.07.00
Toshiba mg09aca18te 18T*3
FSP AURUM 550W
TrueNAS Scale 22.12.3.1
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One of the designers of ZFS had intermittent problems, which he thought odd, until he realized he had a hardware fault and ZFS was correctly identifying it. The old UFS (under Solaris 10) would happily supply back the bad data blocks. But ZFS called out the lying hardware!

So, everyone remember: most of the time, when ZFS calls out a problem, it is real. (The most recent exception I have seen is WD Gold disks with SATA NCQ, Native Command Queuing, enabled; ZFS had problems with them, but other RAID schemes did not seem to.)
 

willll

Cadet
Joined
May 21, 2023
Messages
5
Hi,

While testing, I found a strange behavior:

root@truenas[~]# for i in {1..10}; do sas3ircu 0 DISPLAY |grep SATA_HDD | wc -l; done
8
9
10
11
11
10
7
10
12
9

The number of detected SATA HDDs keeps changing from one call to another. If I try another controller, I get the expected behavior:

root@truenas[~]# for i in {1..10}; do sas3ircu 1 DISPLAY |grep SATA_HDD | wc -l; done
7
7
[SNIP...]
7
7

I figured this out while trying to automatically find the faulty adapter:

join -j 1 <( \
join -j 1 <( \
ctrl_ids=$(storcli show ctrlcount | grep 'Controller Count = ' | grep -o '=[^,]*' | cut -d' ' -f2| awk '$1=$1-1');\
for ctrl_id in {0..$ctrl_ids}; \
do \
ctrl_name=$(sas3flash -c $ctrl_id -list | grep -e 'Board Name ' -e 'SAS Address ' | cut -d':' -f2 | xargs | awk '{printf ("%s(%s)\n", $2, $1)}'); \
sas3ircu $ctrl_id DISPLAY | grep 'Serial No ' | grep -o '[^ ]*$' | \
awk -v ctrl_name=$ctrl_name \
'{ printf ("%s\t%s\n", $1, ctrl_name) }'; \
unset ctrl_name; \
done \
| sort -k1,1; \
unset ctrl_ids \
) <( \
lsblk --nodeps -o serial,name | sort -k1,1; \
) \
| awk '{ print $3 " " $1 " " $2 }' | sort -k1,1 \
) <( \
dmesg | grep 'I/O error, dev' | grep -o 'dev [^,]*' | cut -d' ' -f2 | sort | uniq \
)

It maps each /dev/sdX device to the controller it sits behind, and keeps only the devices that generated "I/O error" messages in the kernel log.

sdb ********** SAS9300-8e(500605b-0-0a07-1fb3)
sdc ********** SAS9300-8e(500605b-0-0a07-1fb3)
sde ********** SAS9300-8e(500605b-0-0a07-1fb3)
sdf ********** SAS9300-8e(500605b-0-0a07-1fb3)
sdg ********** SAS9300-8e(500605b-0-0a07-1fb3)
sdl ********** SAS9300-8e(500605b-0-0a07-1fb3)
sdn ********** SAS9300-8e(500605b-0-0a07-1fb3)

The SAS9300-8e is connected to the LRSACX36-24I SAS expander card and was working fine standalone. I am trying to get debug information from LinkReal support, but without much success.
 

willll

Cadet
Joined
May 21, 2023
Messages
5
Hi,

End of the story for me: I returned the faulty LRSACX36-24I and received a new one; now everything looks much better.

Thank you very much for your support !
 

Valor

Cadet
Joined
Jun 21, 2023
Messages
3
After replacing the power supply, there have been no problems ever since.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
After replacing the power supply, there have been no problems ever since

That's a common problem. For anyone else who runs across this thread, please also check out

 