Can't view previous SMART test logs & is it in sequence?

Status
Not open for further replies.

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I've got a disk I suspect has an issue: my server is making a ticking noise, and it sounds like one of the actuators is rocking back and forth, as if it's seeking sector 1000, 2000, 1000, 2000, 1000, 2000.
It's very reminiscent of the WDIDLE3 issue on some of the earlier WD Greens with bad firmware. (I have a 500 GB drive here which is unusable; any time there's NOT system-requested disk activity, the disk will make ticking noises endlessly, despite still working.)

ANYHOW
I'd like to see the last SMART tests and what information came from them. I've found the area to configure the scheduled jobs, but no way to view the old reports; do they not get archived or anything? Do you only get an email if there's an error? (I'm unsure.)

Also, if I do kick off a SMART long test, surely it runs per disk in sequence, right, so the system is still usable? I believe a long test generally tests the entire surface of the disk?

Finally, how do I manually initiate a test? Do I need to manually set up a new scheduled job? I can't see a "run now" button on the configuration page for the schedules.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
OK,

I'll take this.

Your drives are almost certainly /dev/adaN where N is some number. You can get the full SMART readout with:
Code:
smartctl -x /dev/ada0
or whatever device you want.

Somewhere in there, you should see the results of the last several SMART tests. "Completed without error" is the key phrase you will see in the log. If you have email set up correctly, my understanding is that whenever your SMART test does not complete without error, you'll be emailed.

Using smartctl, you can initiate your own tests anytime you want. I believe the syntax is
Code:
smartctl -t long /dev/ada0
for example to kick off a long test, but you'll need to Google that to make sure. And so on.
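If you only want the self-test history rather than the full `-x` dump, smartctl also has `smartctl -l selftest /dev/ada0`. A sketch of scanning that log for failures, using a canned sample here since no drive is attached (the sample lines are illustrative, not from a real drive):

```shell
# Illustrative sample of a self-test log; on a live system this text
# would come from: smartctl -l selftest /dev/ada0
cat <<'EOF' > /tmp/selftest.log
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Extended offline    Completed without error       00%      21210
# 2  Short offline       Completed without error       00%      21100
EOF
# "Completed without error" is the phrase to look for; anything else
# in the Status column deserves attention.
grep -c 'Completed without error' /tmp/selftest.log
```

On this sample, the count is 2, i.e. both logged tests passed.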

This should be enough information to get you started up.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
So just to clarify, there's no way to kick it off manually through the GUI?
I'm happy to shell in and do it - thought it might be available there though.

Do you know if the scripted SMART long check (mine is bi-monthly) will sequentially do the disks (surely it would...?)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You don't really need to "view" the results of the tests. The SMART monitoring will check your logs and if you get anything except a "passed with no errors" then you get a nastygram email telling you that your server needs administration.

Long checks are totally dependent on the manufacturer and they are under no obligation to check the entire disk. However, I will tell you that every platter-based disk I have ever seen does a full platter check.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Do you know if the scripted SMART long check (mine is bi-monthly) will sequentially do the disks (surely it would...?)
No, it does not, and there's no need to.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ideally, a well-designed long test is not supposed to busy out the drive, so it shouldn't substantially affect the performance characteristics of the drive. What happens on a drive that is already 100% saturated and doesn't actually have free IOPS is one of the possible areas of concern; it's been quite some time since I played with failure modes here but I think the drive actually timed out the long test with some ambiguous-sounding error. The idea is that it should always be safe to do a long test without it slagging out your I/O.

So, no, the smartd test will just run them all at the same time and rely on the drives to behave properly.

Note also that this is similar to the ZFS scrub behaviour where it ensures that the drives are not really busy before slagging them out with scrub traffic (I think the current algorithm is actually something like wait-for-drive-having-been-in-idle-state-at-least-4-seconds).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So just to clarify, there's no way to kick it off manually through the GUI?

Not that I'm aware of. Both this and the ability to review current SMART output would be nice additions.

I know the prevailing school of thought on the bugtracker (primarily by Princess Toothy Guard Dog) seems to be "it'll email you" but the reality is that when gear is on the bench, the machine isn't in its final production network location and so may not be ABLE to e-mail (may fail DNS fwd/rev validation tests, etc). The current setup is fine for hobbyists but I would like to be able to actually request the system launch a conveyance test, which is basically something that only happens maybe once or twice in a drive's lifetime, or review current stats without logging in to the CLI. Well actually *I* don't care since I'm a CLI guy but I find it difficult to advise people to be doing these things that aren't available except through the CLI.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not that I'm aware of. Both this and the ability to review current SMART output would be nice additions.

I know the prevailing school of thought on the bugtracker (primarily by Princess Toothy Guard Dog) seems to be "it'll email you" but the reality is that when gear is on the bench, the machine isn't in its final production network location and so may not be ABLE to e-mail (may fail DNS fwd/rev validation tests, etc). The current setup is fine for hobbyists but I would like to be able to actually request the system launch a conveyance test, which is basically something that only happens maybe once or twice in a drive's lifetime, or review current stats without logging in to the CLI. Well actually *I* don't care since I'm a CLI guy but I find it difficult to advise people to be doing these things that aren't available except through the CLI.

While I am by no means anything more than an amateur programmer, I can't imagine something like pfSense's solution (GUI section to run SMART tests and view results, results displayed are simply the output of smartctl -a /dev/whatever piped into a fixed-spacing font HTML element) being particularly hard to implement.

It would certainly be one less reason to use the CLI, particularly during initial setup and validation - not a bad idea in an appliance.
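A rough sketch of that pfSense-style approach, assuming the smartctl output has already been captured to a file (the attribute lines below are made up for illustration):

```shell
# Made-up stand-in for captured `smartctl -a` output
cat <<'EOF' > /tmp/smart_raw.txt
ID# ATTRIBUTE_NAME          VALUE WORST THRESH RAW_VALUE
  5 Reallocated_Sector_Ct   100   100   036    0
197 Current_Pending_Sector  100   100   000    0
EOF
# HTML-escape the text (& first, so entities aren't double-escaped),
# then wrap it in <pre> so the fixed-width columns line up in a browser
{
  printf '<pre>\n'
  sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g' /tmp/smart_raw.txt
  printf '</pre>\n'
} > /tmp/smart_view.html
```

That really is the whole trick: no parsing, just escaping and a fixed-spacing element.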
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
While I am by no means anything more than an amateur programmer, I can't imagine something like pfSense's solution (GUI section to run SMART tests and view results, results displayed are simply the output of smartctl -a /dev/whatever piped into a fixed-spacing font HTML element) being particularly hard to implement.

It would certainly be one less reason to use the CLI, particularly during initial setup and validation - not a bad idea in an appliance.

That's exactly my thinking. It'd be cool to go further with the whole SMART thing... since we seem to lack the ability to detect problems in other ways.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I will tell you that querying lots of disks can take minutes to do. My 24 disk server takes more than 2 minutes to query all of the disks. And for bigger servers I can only shiver at how long *that* would take.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The problem with querying all the drives, or even a single drive, is knowing what will be returned from each one: you'd need to know what each drive model could possibly return and be able to handle the non-standard return values. And if you provide a simple GUI way to return, say, "smartctl -x /dev/adaX", someone will want it filtered to only show test results or possible error issues. There is no way to meet everyone's request, well, unless you make a user-customizable filter. Hmm... maybe that would work. I'm sure someone could write a simple (maybe not that simple) script to do all that.
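A sketch of that customizable-filter idea, run against a canned sample (the attribute values are invented); `FILTER` is a user-supplied pattern, here tuned to the attributes most people watch:

```shell
# Invented sample of `smartctl -x` attribute output
cat <<'EOF' > /tmp/smart_full.txt
  1 Raw_Read_Error_Rate     118   099   006    0
  5 Reallocated_Sector_Ct   100   100   036    0
194 Temperature_Celsius     032   045   000    32
197 Current_Pending_Sector  100   100   000    0
198 Offline_Uncorrectable   100   100   000    0
EOF
# The filter is just a user-editable extended regex; change it to taste
FILTER='Reallocated|Pending|Uncorrectable'
grep -E "$FILTER" /tmp/smart_full.txt
```

On the sample above this prints the three reallocation/pending/uncorrectable lines and drops the rest, which is roughly the per-user filtering being described.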

@OP
If your drive is still ticking, post the output of "smartctl -a /dev/adaX" but use www.pastebin.com or code brackets to retain the output format.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That is simply an argument against polling in realtime, Princess...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Drives are polled every 30 minutes anyway, right? Storing the output shouldn't be too hard...
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I will tell you that querying lots of disks can take minutes to do. My 24 disk server takes more than 2 minutes to query all of the disks. And for bigger servers I can only shiver at how long *that* would take.
No, it does not, and there's no need to.

I'm referring to a SMART long test here, which could take anywhere from 2 to 18 hours depending on the disk size, since it checks every single sector of the disk (AFAIK), hence making the disk EXTREMELY busy and difficult to use for regular access.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm referring to a SMART long test here, which could take anywhere from 2 to 18 hours depending on the disk size, since it checks every single sector of the disk (AFAIK), hence making the disk EXTREMELY busy and difficult to use for regular access.

No, it doesn't make the disk extremely busy and difficult to use for regular access. In fact, most companies have put the performance penalty at <10%. Many will stop a SMART test temporarily when disk activity is requested.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Yes, but if it's in a 6-disk array, a single byte written to or read from the array will delay the operation, right?
Am I right in thinking it does check the entire surface?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
On a potential disk write it's cached by ZFS unless it's a sync write. And if you have that many sync writes you probably have a ZIL. So no problem there.

On a potential disk read it's either cached by ZFS (so no problem) or it has to be obtained from the pool (which will take milliseconds to do). So are you going to argue that the additional time it takes to actually perform the read handicapped your pool?

If something as simple as a second read operation is so detrimental to a pool's performance you've got bigger problems. Because you probably have multiple read operations coming in all day long every day. ;)

This is such a small part of the bigger picture its almost laughable to be discussing it. It's like pissing in the ocean and arguing that the level went up. Sure, it went up. Is anyone going to actually give enough of a crap to even think about measuring it? Hell no.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
We're clearly not on the same page here.
If I drop 35 GB of data onto my system across the LAN at 100 MB/s, I'm going to assume that any single disk will be taking at least 5 GB of those writes (and likely more for redundancy, probably more like 8.7 GB).
How is a hard disk meant to sequentially read each sector of the entire disk while also writing 8.7 GB of data? It's going to _seriously_ impact performance, and considering a full surface read is a multi-hour event (it was 12 hours last time I did one on my 3 TB disks, let alone this batch of 5s I have), I figure it's going to be pretty nasty.

I'm under the impression that a long SMART check is fairly similar to a CHKDSK /R under Windows.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Right, and it's exactly like I said above. The SMART test is interrupted temporarily while the disk handles the writes. So aside from the fact that the disk will have to be interrupted a bunch of times, not much is lost. And since ZFS handles writes in large chunks every 4-6 seconds you won't actually have as many interruptions as you think.

Hence, the performance penalty is very small.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
OK, well, I've kicked off smartctl -t long /dev/ada5 and I'll work backwards; it's claiming 9 hours per disk.
Interestingly it says
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

So does this mean the disk does in fact go out of sync from the array for the test, before needing... umm, what's it called? The thing where the data is repaired?
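For what it's worth, a SMART self-test runs inside the drive's own firmware; the disk stays online in the pool, so no rebuild is involved, and the progress of a running long test shows up in the regular smartctl output. A sketch over a canned sample, since the exact wording and percentage vary by drive (this excerpt is illustrative, not from /dev/ada5):

```shell
# Illustrative excerpt of `smartctl -a /dev/ada5` while a long test runs
cat <<'EOF' > /tmp/smart_status.txt
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
EOF
# Pull out just the remaining percentage
grep -o '[0-9]*% of test remaining' /tmp/smart_status.txt
```

On this sample the grep prints "90% of test remaining", which you can poll occasionally to see how far along the test is.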
 