Some disks missing temperature data in reports/widgets

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
Long-time user of FreeNAS / TrueNAS, but first time poster. Apologies in advance for missing anything obvious in the post.

I've got a bit of a conundrum that googling or poking around the system hasn't been able to shed light on so far. The system in question is a pretty basic albeit mildly overpowered bare-metal home system (SuperMicro X10DRi, dual 2683v4, 128GiB RAM, dual 25Gb NICs, running the latest Bluefin version, TrueNAS-SCALE-22.12.4.2). I have two pools in the system. One is a 4-disk RAID10 of 3.84TiB U.2 NVMe disks on a PCIe v3 x16 to 4xU.2 card, for VM disks and similar high-IOPS needs. The other is for bulk storage and backups: a RAID10 of currently 4 14TB SAS HDDs, being expanded with a third mirror vdev later today once the currently running resilvering completes, with a mirrored pair of small Optanes for metadata (which will become a RAID10 of Optanes later today). The HDDs are managed by the on-board SAS controller (lsi3008 in IT mode).

The conundrum is about disk temperature reporting. For the HDD pool, none of the spinning disks show up in the reporting data for temperature, so at least the situation is consistent in that sense. The metadata NVMes do report temperature though. Since the HDDs are SAS, that may not be entirely surprising, but the disks do all properly report drive temperature using smartctl so it's not that the data doesn't exist.

For the U.2 pool, it's slightly more odd. Again, all four of the disks do report drive temperature when checking via smartctl, but for this pool, one (1) out of four disks actually gets reported in the UI as well, whereas the remaining three do not. More specifically, nvme6n1 shows up, but nvme[457]n1 do not, and I can't come up with any reason why one would and the others would not.

Is this behavior expected and if so why? If not, what can be done to get the SAS disks temp data to show up, and for the three AWOL NVMe temperature datasets to do the same?

  pool: hdd
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Oct 17 10:02:31 2023
        11.8T scanned at 619M/s, 9.80T issued at 514M/s, 12.7T total
        3.89T resilvered, 77.10% done, 01:38:59 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        hdd                                         ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            48e25fc3-bdc1-4e33-b801-4a5519ef8c2f    ONLINE       0     0     0
            c374fe8e-efe3-4dc6-818b-ce520ef7805c    ONLINE       0     0     0
          mirror-1                                  ONLINE       0     0     0
            replacing-0                             ONLINE       0     0     0
              7ed29237-aae1-48f4-9770-9859bd61b39d  ONLINE       0     0     0
              c1b889ea-2a6e-4d8d-b15e-1386878fbc36  ONLINE       0     0     0  (resilvering)
            2dc86382-f20a-4e91-b08d-8020e38a209a    ONLINE       0     0     0
        special
          mirror-2                                  ONLINE       0     0     0
            1665d28f-9513-4d39-882a-a29d03c19056    ONLINE       0     0     0
            40a98559-fc53-4caf-b4e8-a9d72eacec90    ONLINE       0     0     0

errors: No known data errors

  pool: nvme
 state: ONLINE
  scan: scrub repaired 0B in 00:35:18 with 0 errors on Sun Sep 24 00:35:19 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        nvme                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            2b93aa42-2680-403d-a8d2-3256ebf2e619  ONLINE       0     0     0
            ecdbf9c1-ba12-4659-82f2-75b6424d655d  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            de2be2df-bca2-4999-b6a1-e97092d4e931  ONLINE       0     0     0
            50be280b-1ca7-4ceb-bd5f-bc3f57baa828  ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WLEB14T0S5xeF7.2
Revision:             3P00
Compliance:           SPC-4
User Capacity:        14,000,519,643,136 bytes [14.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2647dcaac
Serial number:        9RJ75LYC
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Oct 17 17:06:18 2023 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 80493
Current Drive Temperature:     32 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 1368:53
Manufactured in week 45 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  25
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  79
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0       5489      13456.050           0
write:         0        0         0         0       5192       7443.623           0
verify:        0        0         0         0       4792         32.818           0

Non-medium error count:        0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEK1A058GA
Serial Number:                      BTOC12850HQG058A
Firmware Version:                   U5110550
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
NVMe Version:                       1.1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          58,977,157,120 [58.9 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            5cd2e4 2fff840100
Local Time is:                      Tue Oct 17 17:08:42 2023 PDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x0056):     Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning Comp. Temp. Threshold:      70 Celsius
Critical Comp. Temp. Threshold:     78 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.70W       -        -    0  0  0  0     1000    4000
 1 +     3.90W       -        -    0  1  0  1     1000    4000
 2 +     2.80W       -        -    0  2  0  2     1000    4000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    158,503 [81.1 GB]
Data Units Written:                 5,896,357 [3.01 TB]
Host Read Commands:                 5,342,764
Host Write Commands:                102,024,781
Controller Busy Time:               52
Power Cycles:                       27
Power On Hours:                     5,918
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       HUSPR3238ADP301
Serial Number:                      CJH0010094C0
Firmware Version:                   KMGNP131
PCI Vendor/Subsystem ID:            0x1c58
IEEE OUI Identifier:                0x000cca
Controller ID:                      3
NVMe Version:                       <1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,820,752,101,376 [3.82 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000cca 00615b2f01
Local Time is:                      Tue Oct 17 17:10:50 2023 PDT
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x01):         S/H_per_NS

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    87,344,117,101 [44.7 PB]
Data Units Written:                 351,739,898 [180 TB]
Host Read Commands:                 82,146,566,593
Host Write Commands:                1,517,163,751
Controller Busy Time:               2,263,307
Power Cycles:                       95
Power On Hours:                     51,552
Unsafe Shutdowns:                   70
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I've seen reporting issues due to system clock not being set correctly. Perhaps check that.

Was the reporting ever working? When did it break?
 

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
System clock is set using NTP and seems to be correct. And I don't believe it ever worked on this particular installation.
 

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
One correction to my initial message (in case it ends up mattering): the MB in this server isn't an X10DRi (I have that MB in two other systems) but rather an X10DRH-CLN4.
 

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
To add -- today, after adding the third mirror vdev, I realized that I can see temperature data from one out of six SAS HDDs (with all of them reporting temperature data as part of the S.M.A.R.T. data). Still no leads on my end why it would work for a single disk but not for other identical disks connected the same way to the same controller and so on.

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM004G
Revision:             ET03
Compliance:           SPC-5
User Capacity:        14,000,519,643,136 bytes [14.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d6e9e28b
Serial number:        ZL2BTTB40000C1380JFG
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct 18 10:10:58 2023 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     41 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 64:29
Manufactured in week 15 of year 2021
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  3
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  196
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 76496
  Blocks received from initiator = 5052512
  Blocks read from cache and sent to initiator = 111867
  Number of read and write commands whose size <= segment size = 642
  Number of read and write commands whose size > segment size = 3

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64.48
  number of minutes until next internal SMART test = 59

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.038           0
write:         0        0         0         0          0          2.651           0
verify:        0        0         0         0          0          0.001           0

Non-medium error count:        0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       0                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]
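For anyone wanting to compare drives side by side: the temperature lines in the smartctl -a text quoted in this thread come in two shapes, "Current Drive Temperature: NN C" for the SAS disks and "Temperature: NN Celsius" for the NVMe disks. A minimal Python sketch that pulls the value out of either form might look like the following. The function name and the regular expressions are my own illustration, not anything from TrueNAS, and the sketch only covers those two formats.

```python
import re
from typing import Optional

# Patterns modelled on the smartctl -a text quoted in this thread (assumption:
# other drive families may word the line differently).
_SAS_TEMP = re.compile(r"Current Drive Temperature:\s+(\d+)\s+C\b")
# Anchored to the start of a line so that "Warning Comp. Temperature Time"
# and similar counters are not picked up by mistake.
_NVME_TEMP = re.compile(r"^Temperature:\s+(\d+)\s+Celsius", re.MULTILINE)


def temperature_from_smartctl(text: str) -> Optional[int]:
    """Return the drive temperature in Celsius, or None if no known line matches."""
    for pattern in (_SAS_TEMP, _NVME_TEMP):
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None
```

Feeding it the captured smartctl -a output for each device gives one number per disk, which is a quick way to confirm the data exists even when the UI shows nothing.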
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Try downloading smart_report by @joeschmuck - it's in the resources section.
Configure and run it - does that show the temps?
 

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
I can't actually find a separate script or similar from @joeschmuck, but rather "only" a good hardware troubleshooting guide that has a section on outputting S.M.A.R.T. data. That's using smartctl -a, which is what I already did above. All disks, be that NVMe or SAS HDD or SATA SSD (boot disk), have temperature data in their smartctl -a output, and nothing there explains, in any way that is obvious to me, why a single disk per pool shows up while the others do not. So I'm still somewhat stumped. It's not like it's a critical problem, but it is a conundrum nonetheless. :)

If there's indeed a tool that I missed, I'd very much appreciate being ELI5 how to find it. :)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I don't actually find a separate script or similar from @joeschmuck
The link is in my signature below. It uses SMART data as well, not the TrueNAS API so I suspect it will report your temperatures correctly for all your drives.
I have two pools in the system, one being a 4-disk RAID10 of 3.84TiB U.2 NVMe disks on a PCIe v3 x16 to 4xU.2 card for VM disks and similar high-IOPS needs, and another for bulk storage and backups (RAID10 of currently 4 14TB SAS HDDs, being expanded with a third mirror vdev later today once the currently running resilvering completes, having a mirrored pair of small Optanes for metadata (will become a RAID10 of Optanes later today). The HDDs are managed by the on-board SAS controller (lsi3008 in IT mode).
You keep using the term RAID10. That is not the proper term for ZFS; typically it would be "Mirror", which is what you have. When I see RAID xyz, I start to think that a person has set up a RAID and then topped it off by putting a RAIDZ/Mirror on top of that. Yes, it has been done. I don't think that is happening here. Are you running TrueNAS on bare metal or in a VM? If a VM, how did you pass through the drives?

To be honest, based on the information you provided, I really suspect TrueNAS is at fault. And yes, there have been reports about this. I think it was the 22.12.4.1 version of SCALE that was not reporting temperatures or was reporting them incorrectly. I know 22.12.4.2 was released, but I do not know if that addressed the reporting issues. And if the cause was identified as the motherboard (BIOS) clock and the TrueNAS clock not matching, I have just never seen that message. I don't look at every post.

But to expand on the clock thing, let me ask it this way: when you boot your hardware into the BIOS, is that time the current local time for you? It is possible, as an example, that the BIOS time is set to UTC while TrueNAS sets the OS to your local time. I think the mismatch is the issue. But I'm focusing on the BIOS time vs. the TrueNAS time. You have to check each independently; you can't assume this is correct.
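One way to compare the two clocks without rebooting into the BIOS is to read the hardware clock from Linux and compare it with the OS clock. The following is only a sketch under stated assumptions: that the first RTC is exposed at /sys/class/rtc/rtc0/since_epoch, as on typical Linux systems, and that the kernel reports that value as if the RTC held UTC. The function names are made up for illustration. Whether a mismatch is actually what breaks the temperature reporting is not established by this.

```python
import time
from pathlib import Path

# Assumption: the first hardware clock is exposed at this sysfs path.
RTC_EPOCH_PATH = Path("/sys/class/rtc/rtc0/since_epoch")


def read_rtc_epoch(path: Path = RTC_EPOCH_PATH) -> int:
    """Read the hardware clock as seconds since the epoch (kernel treats it as UTC)."""
    return int(path.read_text().strip())


def rtc_offset_seconds(rtc_epoch: int, system_epoch: float) -> int:
    """Difference between the hardware clock and the OS clock, in whole seconds."""
    return round(rtc_epoch - system_epoch)


def describe_offset(offset_seconds: int, tolerance: int = 120) -> str:
    """Near zero means the RTC holds UTC; a whole-hour gap suggests local time."""
    if abs(offset_seconds) <= tolerance:
        return "RTC matches system UTC time"
    hours = offset_seconds / 3600
    return f"RTC differs from system UTC time by {hours:+.1f} h"


# Usage on the NAS itself (needs read access to sysfs):
#   print(describe_offset(rtc_offset_seconds(read_rtc_epoch(), time.time())))
```

A BIOS clock set to Pacific Daylight Time would show up here as roughly -7.0 h, while a BIOS clock set to UTC would report a match.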

Good Luck.
 

jimla1965

Cadet
Joined
Oct 18, 2023
Messages
4
Must be a common problem. It worked on build TrueNAS SCALE 22.12.3.3. I have been having disk temperature reporting problems on all drives starting with build TrueNAS-SCALE-22.12.4.0. Then I upgraded to TrueNAS-SCALE-22.12.4.2, and it works with a Seagate Ironwolf Pro 14TB (ST14000NE0008-2RX103) but does not work on the EXOS 18X 16TB (ST16000NM000J-2TW103) drives. smartctl works on all drives on all builds.

Not sure if it is a hardware detection or a temperature reading problem, because I have another system with only EXOS drives, and it does not give a "Disk Temperature" option under the "Metrics" drop-down at all.
 

sobaka

Cadet
Joined
Oct 17, 2023
Messages
6
That's a nifty script, and I appreciate the TRS-80 reference in the -config step. :)

I'll try to refrain from using the term RAID10 -- for me, that's essentially a short-hand for striped mirror vdevs, but I can appreciate that people come with all kinds of more or less odd setups. But yes, as mentioned above, it's a baremetal system, no old-school RAID involved in any shape or form.

The multi_report output does indeed contain temperature for all disks _except_ the bootdisk SSD. That in itself is mildly curious as a plain ol' smartctl -a does contain it (admittedly as SMART value 194 rather than some top-level attribute) and the TrueNAS report in the UI does pick up on that one. For the rest, the output from multi_report doesn't give me any new insights; temp data is indeed there, but that's the case in manual smartctl -a as well.
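For reference, on SATA disks the temperature sits in the attribute table rather than on a dedicated line. A rough Python sketch of reading attribute 194 from that table follows. It assumes the usual ten-column smartctl -A layout (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value); no such table is quoted in this thread, so the layout and the function name are assumptions for illustration.

```python
from typing import Optional


def ata_temperature_194(attribute_table: str) -> Optional[int]:
    """Return the raw value of SMART attribute 194, or None if the row is absent."""
    for line in attribute_table.splitlines():
        fields = line.split()
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        # The raw column may carry a suffix such as "(Min/Max 20/49)", so only
        # the first token of it is read.
        if len(fields) >= 10 and fields[0] == "194" and fields[9].isdigit():
            return int(fields[9])
    return None
```

Some drives report temperature under a different attribute ID, so a None result here does not by itself mean the drive has no sensor.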

Other than that, the script seems hugely useful and I'll certainly keep using it.

Version wise I'm already on TrueNAS-SCALE-22.12.4.2

For the clock topic, I can't say I remember off the top of my head. I'll confirm when I reboot next time to complete the disk shuffling. Am I correct in saying that it doesn't matter if the two clocks are set to UTC or local, but they do (at least with the issue) need to be set to the same? Any insight into why the symptom ends up being that one disk per pool reports temp but the others don't?
 

jimla1965

Cadet
Joined
Oct 18, 2023
Messages
4
Maybe the problem is in some Python code:

Temps are not reported if standby is enabled. I don't think "Always On" is equal to 'ALWAYS ON' in Python, which could explain the problem. Then again, I'm not a Python expert and don't know exactly how the code works.

/usr/lib/python3/dist-packages/middlewared/plugins/disk_/temperature.py
async def disks_for_temperature_monitoring(self):
    return [
        disk['name']
        for disk in await self.middleware.call(
            'disk.query',
            [
                ['name', '!=', None],
                ['togglesmart', '=', True],
                # Polling for disk temperature does not allow them to go to sleep automatically
                ['hddstandby', '=', 'ALWAYS ON'],
            ]
        )
    ]

/usr/lib/python3/dist-packages/middlewared/plugins/disk.py
disk_hddstandby = sa.Column(sa.String(120), default="Always On")
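
The string comparison part can be checked in plain Python: "Always On" and "ALWAYS ON" are indeed different strings, so a filter like the one above would only match a stored "Always On" if something upper-cases the value first. The sketch below uses made-up rows and a made-up helper name to show both cases. The two snippets quoted here do not show whether the middleware normalizes the value before the filter runs, so this illustrates the concern rather than confirming a bug.

```python
# Hypothetical rows standing in for what disk.query might return;
# this is not real middleware output.
disks = [
    {"name": "sda", "togglesmart": True, "hddstandby": "Always On"},
    {"name": "sdb", "togglesmart": True, "hddstandby": "ALWAYS ON"},
    {"name": "sdc", "togglesmart": False, "hddstandby": "ALWAYS ON"},
]


def monitored(rows, normalize=False):
    """Names of disks that pass the same three conditions as the filter above."""
    selected = []
    for row in rows:
        # Optionally upper-case the stored value before comparing it.
        standby = row["hddstandby"].upper() if normalize else row["hddstandby"]
        if row["name"] is not None and row["togglesmart"] is True and standby == "ALWAYS ON":
            selected.append(row["name"])
    return selected


print(monitored(disks))                  # ['sdb']
print(monitored(disks, normalize=True))  # ['sda', 'sdb']
```

If the value were compared without normalization, every disk left at the default would be skipped, which would not explain one disk per pool showing up while identical disks do not.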
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I appreciate the TRS-80 reference in the -config step. :)
I thought it would differentiate the youngsters from us seasoned folks. Glad you liked it.
The multi_report output does indeed contain temperature for all disks _except_ the bootdisk SSD.
I would not mind if you ran the script using the -dump email switch so it delivers to my email address the drive json data so I can figure out why it's not reporting and fix it. But, this means I would know your email address. I don't share that kind of stuff. But it's up to you, especially if you plan to create a CRON Job to run it periodically.
Am I correct in saying that it doesn't matter if the two clocks are set to UTC or local, but they do (at least with the issue) need to be set to the same?
That is my understanding. If you live in the Pacific time zone, then you set your BIOS to your local time and you set TrueNAS to your timezone. When all is done, they both should have the same time, if I understand it correctly. I know if I have this wrong, someone will correct me.

That is all the answers I have, everything else is pure speculation on my part.

If you can live with the temperature issue for a while (assuming the timezone thing does not fix it), then wait until the next SCALE version comes out (23.xx), I think in November? I don't recall exactly, but it is not too far away. Do not upgrade any ZFS features unless you KNOW you need one of them. If you do upgrade, it prevents you from rolling back to a previous version. There are no new features that I as a home user need, and I like the ability to roll back without issue.
 

jimla1965

Cadet
Joined
Oct 18, 2023
Messages
4
Fixed my "Some disks missing temperature data in reports/widgets" problem. All disks are now reporting correctly. It was related to system time. Why some drives worked and others didn't is a mystery.

Originally my BIOS was set to UTC time and the TrueNAS time matched my local time.
1) Restart the system and set the BIOS time to local time. Load TrueNAS and confirm the Dashboard time is no longer correct.
2) Restart the system and set the BIOS time back to UTC. Load TrueNAS and confirm the Dashboard time is now correct and matches the time on the System Settings->General page.

Drive temp works now
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It was related to system time.
Isn't that crazy. Why would these completely different things be connected by time. So long as the developers figure it out, all will be right in the world.
 

jimla1965

Cadet
Joined
Oct 18, 2023
Messages
4
Isn't that crazy. Why would these completely different things be connected by time. So long as the developers figure it out, all will be right in the world.
They are most likely connected.

What could be happening is that resetting the time re-runs the initial setup code for the new time and zone, which most likely also re-ran the init code for reports. This could fix many random report config problems.
 