Scheduled SMART tests not always done

Status
Not open for further replies.

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I have 7 drives and have them all scheduled for SMART tests in the GUI: short tests about every week and long tests every month.

I have noticed that some drives get tested and others don't. It doesn't seem to be the same drives each time. Last time, only 3 drives had the short test and 4 didn't.

Four of the drives are on an LSI SAS expander card flashed to IT mode, firmware v. 16. But there's no correlation with the controller: some of the missed tests are on drives connected to motherboard SATA ports.

I have verified that all drives are selected in the GUI dialog. This seems like a bug to me. Does no one else have this problem?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Do you have this bug?

In short, if you go to the web GUI --> Storage --> View Disks, do you see a serial number for each disk?
 

Glorious1

Yes, I do have the bug: one of my drives doesn't show a serial. But is there any reason to think that is related? Four drives missed the last test.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Glorious1 said:
4 of the drives are on an LSI SAS expander card flashed to IT mode, firmware v. 16. But there is no connection - some of the missed tests are on drives connected to motherboard SATA ports.

Glorious, I don't have an answer, but I'm curious as to why all your drives are not connected to your HBA?
What's the thinking behind dividing them over the two controllers?
 

Bidule0hm

The reason is explained in this post, and I wrote a guide a little further down the same thread to fix the SMART tests. It worked fine for a while, but my last two scheduled tests weren't executed on the disk with the missing serial. I'm trying to fix it again, and I started a conversation with cyberjock yesterday to ask for help, but no answer so far.

But since more of your disks missed the test than are missing a serial, there may be another problem. Can you post the contents of /etc/local/smartd.conf, please?
 

Glorious1

BigDave, that's because some of the drives are in a separate pool that I want to go into standby. For reasons I don't understand, FreeBSD is unable to put drives on the HBA into standby; they must run all the time. So the ones that need to go into standby have to be on the motherboard. The ones that spin 24/7 can be on the HBA.
 

Glorious1

Hmm, didn't know about this file. Guessing what it means, it looks like all the drives have both short and long tests configured, except da1 which has no tests. Somehow there is a disconnect between the GUI and this config file, but even that doesn't explain what is actually happening. Confusing.
Code:
################################################
# smartd.conf generated by /etc/ix.rc.d/ix-smartd
################################################
/dev/da0 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/da0 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/ada2 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/ada2 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/da3 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/da3 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/da2 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/da2 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/ada0 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/ada0 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/ada1 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/ada1 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/da1 -a -n standby -W 0,0,37 -m xxxxxxxx@yyy.com
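
A quick way to spot entries like da1 is to scan the file for devices that never get a -s schedule. This is only a minimal sketch: the helper name devices_without_schedule and the shortened sample lines are made up for illustration, and it assumes every entry starts with the device path followed by space-separated directives, as in the listing above.

```python
def devices_without_schedule(conf_text):
    """Return device paths that appear in smartd.conf text with no '-s' directive."""
    seen = []          # devices in the order they first appear
    scheduled = set()  # devices that have at least one '-s' schedule
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = line.split()
        device = tokens[0]
        if device not in seen:
            seen.append(device)
        if "-s" in tokens[1:]:
            scheduled.add(device)
    return [d for d in seen if d not in scheduled]


# Shortened, hypothetical sample in the same shape as the file above.
sample = """\
# smartd.conf generated by /etc/ix.rc.d/ix-smartd
/dev/da0 -a -n standby -W 0,0,37 -m user@example.com -s L/(01|02)/(01)/(1|2|3|4|5|6|7)/(09)
/dev/da0 -a -n standby -W 0,0,37 -m user@example.com -s S/(01|02)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)
/dev/da1 -a -n standby -W 0,0,37 -m user@example.com
"""

print(devices_without_schedule(sample))  # ['/dev/da1']
```

On a real system the text would come from reading /etc/local/smartd.conf instead of the sample string.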
 

BigDave

Under SERVICES>SMART>CONFIGURATION> POWER MODE: What is your power mode setting?
Could it be set to something besides NEVER?
The drives in standby would have to be awakened, no?
 

Glorious1

Good point. It is set to STANDBY. If it were set to NEVER, no drives would remain in standby because of the hourly check for temperature or whatever.
[Attachment: Screen Shot 2015-02-14 at 10.13.27 AM.png]

So, here's how I have things set:

There is a daily replication task to the pool that's in standby, at 0845. That does wake them up. Those drives are set for 20 minutes of idle before they go into standby. The SMART tests are all set for 0900. So, the drives should all be spinning when the SMART tests are fired. I check to confirm that the replication has succeeded almost every day and there has never been a problem in that regard. For a long time I also monitored the spin status of the drives and those drives were spinning at 0900 every day.

You don't think they would go into standby during a SMART test, do you? I don't think they do. Even if they did, it should show in smartctl -a as an uncompleted test.
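
For reference, the -s field is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour), as described in the smartd.conf man page. The sketch below approximates that check to see whether a schedule would fire at a given time. due_tests is a hypothetical helper, and Python's re module stands in for the POSIX extended regex matching that smartd uses, so treat it as an approximation rather than smartd's actual logic.

```python
import re
from datetime import datetime


def due_tests(schedule_regex, when):
    """Return the test types (L, S, C, O) whose T/MM/DD/d/HH string matches the regex."""
    # isoweekday() gives 1 = Monday .. 7 = Sunday, the same numbering smartd documents.
    stamp = when.strftime("%m/%d/") + str(when.isoweekday()) + when.strftime("/%H")
    return [t for t in "LSCO" if re.search(schedule_regex, t + "/" + stamp)]


# The short-test schedule from the smartd.conf posted earlier in the thread.
short = "S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|13|19|25)/(1|2|3|4|5|6|7)/(09)"

print(due_tests(short, datetime(2015, 2, 13, 9, 30)))  # ['S']  (the 13th, 09:xx)
print(due_tests(short, datetime(2015, 2, 14, 9, 30)))  # []     (the 14th is not listed)
```

With the schedule above, a short test is due only during the 09:00 hour on the 7th, 13th, 19th and 25th, which is consistent with the 0900 setting described in this post.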
 

philiplu

Explorer
Joined
Aug 10, 2014
Messages
58
Bidule0hm said:
The reason is explained in this post, and I wrote a guide a little further down the same thread to fix the SMART tests. It worked fine for a while, but my last two scheduled tests weren't executed on the disk with the missing serial. I'm trying to fix it again, and I started a conversation with cyberjock yesterday to ask for help, but no answer so far.

The problem where /etc/local/smartd.conf has the wrong info for a drive has the same root cause as the missing serial number in the "View Disks" screen. Something weird happens when you reboot the system. I've been trying to track it down since yesterday, digging through the code in /usr/local/www/freenasUI/middleware/notifier.py. Once you have rebooted, trigger a regen of smartd.conf, either by what I show in that post or by running "python /usr/local/www/freenasUI/middleware/notifier.py sync_disks" from the GUI shell (or over SSH, as root or via sudo). smartd.conf will then be fixed and keep the proper values until the next reboot.
 

Bidule0hm

Wow, I "watch" this thread and have email notifications enabled, but I didn't receive them. It seems something is also broken on the forum...

Yeah, the thing is I know a lot of programming languages but not Python... It's not that complicated, but the FreeNAS architecture is (or maybe it's just obscure if you're not one of the devs on the project). So for now I'm trying to fix the symptoms and not the cause.

Personally I don't care whether the serial shows in the GUI; what matters to me is the SMART tests. Today I just restarted smartd without touching anything else, to see whether the next scheduled test runs.

NB: as far as I've tested, running notifier.py doesn't fix smartd.conf; you also need to reselect the missing drive(s) in the SMART tests in the GUI. AFAIK notifier.py updates the storage_disk table in the config, and the SMART test GUI recreates smartd.conf (using the storage_disk table to do that; I don't know why, because smartd.conf has no need of the serial numbers).
 

philiplu

Bidule0hm said:
NB: as far as I've tested, running notifier.py doesn't fix smartd.conf; you also need to reselect the missing drive(s) in the SMART tests in the GUI. AFAIK notifier.py updates the storage_disk table in the config, and the SMART test GUI recreates smartd.conf (using the storage_disk table to do that; I don't know why, because smartd.conf has no need of the serial numbers).

Yeah, I'm not sure why I thought that - to fix smartd.conf, you need to rerun /etc/ix.rc.d/ix-smartd. I'm not sure how the ix-* stuff is plumbed into the normal rc.d handling, so no idea if service ix-smartd restart would work or not. Safer to use the GUI as you say.

But I managed to figure out exactly why smartd.conf regeneration is broken and the missing drive serial # won't go away. Bug #8034 entered. I don't know why the drive serial # went missing in the first place, but once it's messed up, the code in notifier.py makes sure it stays messed up.

Also, check out update #1 in my bug entry. There's a simple fix for the smartd.conf issue (at least the one related to the missing disk serial #) that doesn't require the deeper problem to be fixed. Just add a dependency on ix-syncdisks to the REQUIRE line of /conf/base/etc/ix.rc.d/ix-smartd and reboot, as described in that update.
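
The edit itself is a one-word change to the rcorder header, so doing it by hand is fine. Purely as an illustration of what the change looks like, here is a small sketch that adds a dependency to the REQUIRE line of a script's text. add_rc_dependency is a hypothetical helper, and the starting header (a REQUIRE line listing only FILESYSTEMS) is an assumption about the unmodified file, not something confirmed here.

```python
def add_rc_dependency(script_text, dependency):
    """Append `dependency` to the '# REQUIRE:' line of an rc.d script, if missing."""
    prefix = "# REQUIRE:"
    out = []
    for line in script_text.splitlines():
        if line.startswith(prefix):
            current = line[len(prefix):].split()
            if dependency not in current:  # keep the edit idempotent
                line = line.rstrip() + " " + dependency
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed shape of the original header; only the REQUIRE line matters here.
header = (
    "#!/bin/sh\n"
    "#\n"
    "# PROVIDE: ix-smartd\n"
    "# REQUIRE: FILESYSTEMS\n"
    "# BEFORE: smartd\n"
)

patched = add_rc_dependency(header, "ix-syncdisks")
print(patched)
```

Running the function a second time on the patched text leaves it unchanged, so it is safe to apply repeatedly.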
 

Glorious1

Thanks for wading into that deep stuff there and figuring out the problem. I don't know if this architecture is more complicated than it needs to be, but it's way over my head. With any luck they'll fix the serial number issue at the same time.
 

Bidule0hm

Many thanks for finding out how to resolve this issue :D

I'll test that. I have a scheduled SMART test in 2 days, so I'll see whether it works ;)

Edit: that's exactly what I was thinking about the storage_disk table. I wondered whether cleaning the table and re-importing the config would make the problem go away, but I didn't want to mess up the config (I don't know which tables use the storage_disk IDs).

Edit²: well, I made the change to ix-smartd, rebooted, and the smartd.conf is still broken.
 

philiplu

Bidule0hm said:
Edit²: well, I made the change to ix-smartd, rebooted, and the smartd.conf is still broken.

Hmm, I'd like to understand why. Can you post the output from the following:
  • head -20 /etc/ix.rc.d/ix-smartd
  • sqlite3 /data/freenas-v1.db 'SELECT * FROM storage_disk'
  • cat /etc/local/smartd.conf
You'll have to run the sqlite3 command as root. Any chance you made the ix-smartd edit to the one in /etc/ix.rc.d, and not the one in /conf/base/etc/ix.rc.d?
 

Glorious1

I made the change to /conf/base/etc/ix.rc.d/ix-smartd, then went into the GUI and looked at Tasks > SMART tests. da1 (which had been missing SMART tests in my /etc/local/smartd.conf) was now unselected for both tests. I selected it and hit OK for both tests, and now /etc/local/smartd.conf seems to look right.

Edit: I forgot to say that earlier this morning I had run the command from this bug report to get the serial to show:
Code:
python /usr/local/www/freenasUI/tools/sync_disks.py da1

Not sure if that had anything to do with it.
 

Bidule0hm

Yep:
Code:
[root@freenas] ~# head -20 /etc/ix.rc.d/ix-smartd
#!/bin/sh
#
# $FreeBSD$
#

# PROVIDE: ix-smartd
# REQUIRE: FILESYSTEMS ix-syncdisks
# BEFORE: smartd

. /etc/rc.subr

: ${smartd_config="/usr/local/etc/smartd.conf"}

print_devlist ()
{
        camcontrol devlist -v | while read LINE; do
                case $LINE in
                [^\<]*)
                        DRV=$(expr "$LINE" : '.*on \(.*\)[0-9][0-9]* bus')
                        CID=$(expr "$LINE" : '.*on .*\([0-9][0-9]*\) bus')
[root@freenas] ~#

Code:
[root@freenas] ~# sqlite3 /data/freenas-v1.db 'SELECT * FROM storage_disk'
Minimum|Always On|3000592982016|W300****||{serial}W300****|1|3|192|1|Auto|||da||1|da3
Minimum|Always On|3000592982016|W300****||{serial}W300****|1|2|192|2|Auto|||da||1|da2
Minimum|Always On|3000592982016|WD-WCC1T086****||{serial}WD-WCC1T086****|1|1|128|3|Auto|||da||1|da1
Minimum|Always On|3000592982016|||{devicename}da0|1|0|128|5|Auto|||da||1|da0
Minimum|Always On|3000592982016|WD-WCC4N0HT****||{serial}WD-WCC4N0HT****|1|4|128|6|Auto|||da||1|da4
Minimum|Always On|3000592982016|WD-WCC4NEFD****||{serial}WD-WCC4NEFD****|1|5|128|7|Auto|||da||1|da5
Minimum|Always On|3000592982016|W730****||{serial}W730****|1|6|192|8|Auto|||da||1|da6
Minimum|Always On|3000592982016|W730****||{serial}W730****|1|7|192|9|Auto|||da||1|da7
[root@freenas] ~#

Code:
[root@freenas] ~# cat /etc/local/smartd.conf
################################################
# smartd.conf generated by /etc/ix.rc.d/ix-smartd
################################################
/dev/da3 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da3 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da2 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da2 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da1 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da1 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da0 -a -n never -W 0,45,50 -m ********@gmail.com
/dev/da4 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da4 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da5 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da5 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da6 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da6 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
/dev/da7 -a -n never -W 0,45,50 -m ********@gmail.com -s L/(01|02|03|04|05|06|07|08|09|10|11|12)/(02|17)/(1|2|3|4|5|6|7)/(04)
/dev/da7 -a -n never -W 0,45,50 -m ********@gmail.com -s S/(01|02|03|04|05|06|07|08|09|10|11|12)/(07|12|22|27)/(1|2|3|4|5|6|7)/(06)
[root@freenas] ~#


I know that the files in /conf/base/etc/ replace those in /etc/ at startup; I modified both just to be sure.

@Glorious1: that's normal. If you go through the GUI, smartd.conf is regenerated. The goal here is to have an automatic fix.
 

philiplu

@Bidule0hm, those all look as expected. Can you post a couple more db dumps (run as root again):
  • sqlite3 /data/freenas-v1.db 'select * from tasks_smarttest_smarttest_disks'
  • sqlite3 /data/freenas-v1.db 'select * from tasks_smarttest'
Your 'da0' with the missing serial # is in the middle of smartd.conf, not the end, which implies the ix-smartd/ix-syncdisks dependency is working. My suspicion is that the dump of tasks_smarttest_smarttest_disks will show it missing lines with a third field of 5. That's the primary key for the da0 row in storage_disk, and would explain the missing tests on the da0 row in smartd.conf. If you then look at the GUI SMART Tests dialog, you'll probably see that da0 is no longer selected.
 

Bidule0hm

Code:
[root@freenas] ~# sqlite3 /data/freenas-v1.db 'select * from tasks_smarttest_smarttest_disks'
147|1|1
148|1|2
149|1|3
150|1|6
151|1|7
152|1|8
153|1|9
155|2|1
156|2|2
157|2|3
158|2|6
159|2|7
160|2|8
161|2|9
[root@freenas] ~#

Code:
[root@freenas] ~# sqlite3 /data/freenas-v1.db 'select * from tasks_smarttest'
1,2,3,4,5,6,7|02,17|1,2,3,4,5,6,7,8,9,10,11,12|L|1|04|da0, da1, da2, da3, da4, da5, da6, da7
1,2,3,4,5,6,7|07,12,22,27|1,2,3,4,5,6,7,8,9,10,11,12|S|2|06|da0, da1, da2, da3, da4, da5, da6, da7
[root@freenas] ~#


OK, so in theory I reselect it in the GUI and it stays selected even after a reboot? Having to reselect it at each boot is precisely what I want to avoid.
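
The cross-check between the dumps above can also be done mechanically. The sketch below rebuilds just the relevant parts in an in-memory SQLite database and asks which disks have no row in the link table. The data is taken from the dumps in this thread (the primary key is read as the tenth field of each storage_disk row), but the column names id, disk_name, smarttest_id and disk_id are assumptions for illustration; the real schema in /data/freenas-v1.db may name them differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names are hypothetical; only the table names come from the thread.
    CREATE TABLE storage_disk (id INTEGER PRIMARY KEY, disk_name TEXT);
    CREATE TABLE tasks_smarttest_smarttest_disks (
        id INTEGER PRIMARY KEY, smarttest_id INTEGER, disk_id INTEGER);
""")

# (primary key, device name) pairs as read from the storage_disk dump.
disks = [(1, "da3"), (2, "da2"), (3, "da1"), (5, "da0"),
         (6, "da4"), (7, "da5"), (8, "da6"), (9, "da7")]
conn.executemany("INSERT INTO storage_disk VALUES (?, ?)", disks)

# Rows from the tasks_smarttest_smarttest_disks dump (tests 1 and 2).
selected_ids = [1, 2, 3, 6, 7, 8, 9]
links = [(147 + i, 1, d) for i, d in enumerate(selected_ids)]
links += [(155 + i, 2, d) for i, d in enumerate(selected_ids)]
conn.executemany("INSERT INTO tasks_smarttest_smarttest_disks VALUES (?, ?, ?)", links)

# Disks that are not selected in any SMART test: no matching link row.
query = """
    SELECT d.disk_name
    FROM storage_disk AS d
    LEFT JOIN tasks_smarttest_smarttest_disks AS l ON l.disk_id = d.id
    WHERE l.id IS NULL
    ORDER BY d.disk_name
"""
unselected = [row[0] for row in conn.execute(query)]
print(unselected)  # ['da0']
```

This matches what the dumps show by eye: no link row has a third field of 5, so da0 is in neither test.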
 

philiplu

Yep, in theory, the new selection should persist over a reboot. As long as a drive keeps the same primary key in table storage_disk, it should stay selected. If the pk changes, then my guess is the selection gets broken, unless there's code somewhere that tries to maintain the join with the tasks_smarttest_smarttest_disks table. Maybe the pk changes if, say, a drive gets hot-swapped, or possibly if a drive temporarily drops out and then comes back.

It seems like the SMART tests setup could have a master control that just says "Automatically run SMART tests on any drive that supports it", rather than requiring drives to be selected individually into the tests, which can silently end up broken like we've seen. The existing individual selection mechanism seems too fragile, especially given that no one appears to understand why the missing serial number occurs in the first place. Something to do with inconsistent results from camcontrol or smartctl?
 