/dev/nvme#, number of error log entries increased

firesyde424

I'm getting intermittent alerts that read "Device: /dev/nvme8, number of Error Log entries increased from 0 to 4". The server is a Dell PowerEdge R740xd running TrueNAS CORE 12.0-U8.1 with the following specs:
  • 2 x Xeon Silver 4216 CPUs
  • 386GB DDR4 Registered ECC RAM
  • 16 x 15.36TB Micron 9300 NVMe SSD
  • 8 x 15.36TB Micron 7400 NVMe SSD (when it was time to add capacity, the 9300 series wasn't available)
  • zpool config: 3 x 8-wide RAIDZ2 vdevs
  • Server uptime: 296 days
  • Server age: 3.5 years
Three different drives have shown this error, all within the last week or so. Running "smartctl -a /dev/nvme#" on one of them returns the following (most of the model and capability information trimmed):

=== START OF INFORMATION SECTION ===
Model Number: Micron_9300_MTFDHAL15T3TDP


=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 1,033,373,068 [529 TB]
Data Units Written: 1,369,891,288 [701 TB]
Host Read Commands: 8,844,889,058
Host Write Commands: 27,859,272,654
Controller Busy Time: 1,104,282
Power Cycles: 19
Power On Hours: 20,608
Unsafe Shutdowns: 11
Media and Data Integrity Errors: 0
Error Information Log Entries: 18
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 44 Celsius
Temperature Sensor 2: 42 Celsius
Temperature Sensor 3: 39 Celsius
Temperature Sensor 4: 39 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num  ErrCount  SQId  CmdId  Status  PELoc  LBA  NSID    VS
  0        18     -      -  0x0008  0x000    0     0  0xd2
  1        17     -      -  0x0008  0x000    0     0  0xd2
  2        16     -      -  0x0008  0x000    0     0  0xd2
  3        15     -      -  0x0008  0x000    0     0  0xd2
  4        14     -      -  0x0008  0x000    0     0  0xd2
  5        13     -      -  0x0008  0x000    0     0  0xd2
  6        12     -      -  0x0008  0x000    0     0  0xd2
  7        11     -      -  0x0008  0x000    0     0  0xd2
  8        10     -      -  0x0008  0x000    0     0  0xd2
  9         9     -      -  0x0008  0x000    0     0  0xd2
 10         8     -      -  0x0008  0x000    0     0  0xd2
 11         7     -      -  0x0008  0x000    0     0  0xd2
 12         6     -      -  0x0008  0x000    0     0  0xd2
 13         5     -      -  0x0008  0x000    0     0  0xd2
 14         4     -      -  0x0008  0x000    0     0  0xd2
 15         3     -      -  0x0008  0x000    0     0  0xd2
... (47 entries not read)


I've done some research on this already, and I don't think it's the known issue where a slight firmware incompatibility generates error log entries when the server is power cycled. This server has been up for 296 days, and the errors only started appearing recently. The other two drives show identical smartctl output.
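
For anyone who wants to check the same thing on their own drives without eyeballing the table, here is a rough Python sketch that groups the error log rows by their Status and VS values. It assumes the plain-text column layout smartctl printed above (Num, ErrCount, SQId, CmdId, Status, PELoc, LBA, NSID, VS). The function name `summarize_error_log` and the `sample` text are made up for illustration, and it does not try to interpret what the status code means.

```python
import re
from collections import Counter

# Matches one data row of the "Error Information (NVMe Log 0x01 ...)" table.
# Assumed column order: Num ErrCount SQId CmdId Status PELoc LBA NSID VS
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(0x[0-9a-fA-F]+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def summarize_error_log(text):
    """Count error log rows per (Status, VS) pair in smartctl text output."""
    counts = Counter()
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            status, vs = m.group(5), m.group(9)
            counts[(status, vs)] += 1
    return counts


# Hypothetical excerpt in the same shape as the output pasted above.
sample = """\
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 18 - - 0x0008 0x000 0 0 0xd2
1 17 - - 0x0008 0x000 0 0 0xd2
2 16 - - 0x0008 0x000 0 0 0xd2
... (47 entries not read)
"""

for (status, vs), n in summarize_error_log(sample).items():
    print(f"Status={status} VS={vs}: {n} entries")
```

A single line of output means every logged entry carries the same Status/VS pair, which is what the table above shows. To run it against a real drive, the text could come from saving the smartctl output to a file and reading it in.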

We have close to 1,000 mechanical and flash drives across several TrueNAS Enterprise and CORE servers, and I haven't seen this particular error in 7 years of working on them.

Is this something I should be worried about? Does anyone have an idea what's going on here?