Long-time user of FreeNAS / TrueNAS, but first-time poster. Apologies in advance if I've missed anything obvious in the post.
I've got a bit of a conundrum that neither googling nor poking around the system has been able to shed light on so far. The system in question is a pretty basic, albeit mildly overpowered, bare-metal home system (SuperMicro X10DRi, dual 2683v4, 128GiB RAM, dual 25Gb NICs) running the latest Bluefin version (TrueNAS-SCALE-22.12.4.2). I have two pools in the system. One is a 4-disk RAID10 of 3.84TiB U.2 NVMe disks on a PCIe v3 x16 to 4xU.2 card, used for VM disks and similar high-IOPS needs. The other is for bulk storage and backups: a RAID10 of currently 4 14TB SAS HDDs, being expanded with a third mirror vdev later today once the currently running resilvering completes, with a mirrored pair of small Optanes for metadata (which will become a RAID10 of Optanes later today). The HDDs are managed by the on-board SAS controller (lsi3008 in IT mode).
The conundrum is about disk temperature reporting. For the HDD pool, none of the spinning disks show up in the reporting data for temperature, so at least the situation is consistent in that sense. The metadata NVMes do report temperature though. Since the HDDs are SAS, that may not be entirely surprising, but the disks do all properly report drive temperature using smartctl so it's not that the data doesn't exist.
For the U.2 pool, it's slightly more odd. Again, all four of the disks report drive temperature when checked via smartctl, but for this pool only one (1) out of four disks actually gets reported in the UI, whereas the remaining three do not. More specifically, nvme6n1 shows up, but nvme[457]n1 do not, and I can't come up with any reason why one would and the others would not.
Is this behavior expected, and if so, why? If not, what can be done to get the SAS disks' temperature data to show up, and for the three AWOL NVMe temperature datasets to do the same?
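In case it helps anyone reproduce the comparison, here is a rough sketch of how the temperature can be pulled out of the smartctl text output for both disk types. It only relies on the two line formats visible in the outputs further down ("Current Drive Temperature: 32 C" for SAS, "Temperature: 36 Celsius" for NVMe). The function names and the use of `smartctl -a` are my own choices for illustration, and this is not a claim about how the TrueNAS reporting backend collects its data.

```python
import re
import subprocess

# Matches the two temperature line formats seen in the smartctl outputs in this post:
#   SAS:  "Current Drive Temperature:     32 C"
#   NVMe: "Temperature:                   36 Celsius"
# Anchoring at line start skips lines such as "Drive Trip Temperature: 85 C"
# and "Temperature Warning: Enabled".
TEMP_RE = re.compile(
    r"^(?:Current Drive Temperature|Temperature):\s+(\d+)\s+C(?:elsius)?\s*$",
    re.MULTILINE,
)


def parse_temperature(smartctl_output):
    """Return the drive temperature in Celsius, or None if no such line exists."""
    match = TEMP_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


def read_temperature(device):
    """Hypothetical helper: run 'smartctl -a <device>' and parse its output."""
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
    )
    return parse_temperature(result.stdout)
```

Calling `read_temperature("/dev/nvme6n1")` as root and comparing the result per device against what the UI shows would at least confirm which disks have data available at the smartctl level but not in the reporting graphs.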
pool: hdd
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Oct 17 10:02:31 2023
11.8T scanned at 619M/s, 9.80T issued at 514M/s, 12.7T total
3.89T resilvered, 77.10% done, 01:38:59 to go
config:
NAME STATE READ WRITE CKSUM
hdd ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
48e25fc3-bdc1-4e33-b801-4a5519ef8c2f ONLINE 0 0 0
c374fe8e-efe3-4dc6-818b-ce520ef7805c ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
replacing-0 ONLINE 0 0 0
7ed29237-aae1-48f4-9770-9859bd61b39d ONLINE 0 0 0
c1b889ea-2a6e-4d8d-b15e-1386878fbc36 ONLINE 0 0 0 (resilvering)
2dc86382-f20a-4e91-b08d-8020e38a209a ONLINE 0 0 0
special
mirror-2 ONLINE 0 0 0
1665d28f-9513-4d39-882a-a29d03c19056 ONLINE 0 0 0
40a98559-fc53-4caf-b4e8-a9d72eacec90 ONLINE 0 0 0
errors: No known data errors
pool: nvme
state: ONLINE
scan: scrub repaired 0B in 00:35:18 with 0 errors on Sun Sep 24 00:35:19 2023
config:
NAME STATE READ WRITE CKSUM
nvme ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
2b93aa42-2680-403d-a8d2-3256ebf2e619 ONLINE 0 0 0
ecdbf9c1-ba12-4659-82f2-75b6424d655d ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
de2be2df-bca2-4999-b6a1-e97092d4e931 ONLINE 0 0 0
50be280b-1ca7-4ceb-bd5f-bc3f57baa828 ONLINE 0 0 0
errors: No known data errors
root@truenas[~]#
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WDC
Product: WLEB14T0S5xeF7.2
Revision: 3P00
Compliance: SPC-4
User Capacity: 14,000,519,643,136 bytes [14.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca2647dcaac
Serial number: 9RJ75LYC
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Oct 17 17:06:18 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 80493
Current Drive Temperature: 32 C
Drive Trip Temperature: 85 C
Accumulated power on time, hours:minutes 1368:53
Manufactured in week 45 of year 2019
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 25
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 79
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0       5489      13456.050           0
write:         0        0         0         0       5192       7443.623           0
verify:        0        0         0         0       4792         32.818           0
Non-medium error count: 0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPEK1A058GA
Serial Number: BTOC12850HQG058A
Firmware Version: U5110550
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 0
NVMe Version: 1.1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 58,977,157,120 [58.9 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 5cd2e4 2fff840100
Local Time is: Tue Oct 17 17:08:42 2023 PDT
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0016): Format Frmw_DL Self_Test
Optional NVM Commands (0x0056): Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 78 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 4.70W - - 0 0 0 0 1000 4000
1 + 3.90W - - 0 1 0 1 1000 4000
2 + 2.80W - - 0 2 0 2 1000 4000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 36 Celsius
Available Spare: 100%
Available Spare Threshold: 0%
Percentage Used: 0%
Data Units Read: 158,503 [81.1 GB]
Data Units Written: 5,896,357 [3.01 TB]
Host Read Commands: 5,342,764
Host Write Commands: 102,024,781
Controller Busy Time: 52
Power Cycles: 27
Power On Hours: 5,918
Unsafe Shutdowns: 2
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: HUSPR3238ADP301
Serial Number: CJH0010094C0
Firmware Version: KMGNP131
PCI Vendor/Subsystem ID: 0x1c58
IEEE OUI Identifier: 0x000cca
Controller ID: 3
NVMe Version: <1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 3,820,752,101,376 [3.82 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 000cca 00615b2f01
Local Time is: Tue Oct 17 17:10:50 2023 PDT
Firmware Updates (0x09): 4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x01): S/H_per_NS
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 15000 15000
1 + 20.00W - - 1 1 1 1 15000 15000
2 + 15.00W - - 2 2 2 2 15000 15000
3 + 10.00W - - 3 3 3 3 15000 15000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 512 8 2
2 - 4096 0 0
3 - 4096 8 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 87,344,117,101 [44.7 PB]
Data Units Written: 351,739,898 [180 TB]
Host Read Commands: 82,146,566,593
Host Write Commands: 1,517,163,751
Controller Busy Time: 2,263,307
Power Cycles: 95
Power On Hours: 51,552
Unsafe Shutdowns: 70
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged