How FreeNAS responds to a failed disk

cyberjock · Jul 22, 2014

If you've properly set up your FreeNAS box, it should email you when things go bad, along with some informational emails. Depending on the FreeNAS version you are using, you can expect either nightly emails or emails only on the first night after a boot or reboot.

So my friend's server is set up with what I call a proper schedule of SMART testing, scrubs, etc. If you want to see what I use and recommend, check out http://forums.freenas.org/index.php?threads/scrub-and-smart-testing-schedules.20108/

So here's a real-world result from my friend's server. The server had been running 24x7 for about 5 months when all of a sudden a drive decided it didn't want to work. As far as the hardware was concerned, the hard drive had been "disconnected" from the system.

To explain a little, his server sends SMART emails to his cell phone as SMS texts (Google your cell phone provider for how to do this). For him, the SMART "email address" is xxxxxxxxxx@vtext.com (he uses Verizon), while the root user's address is his personal email. We did this because standard emails aren't exactly "high priority" but SMART emails are. They can range from a failed disk to a failing disk to a disk that is overheating because a fan in his server just failed. If a fan fails, you don't want the server running all weekend while the drive cooks, so the smart thing is to get an immediate message.
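
The split described above (urgent SMART alerts to the phone, everything else to the regular inbox) can be sketched as a simple routing rule. This is only an illustration in Python, not how FreeNAS does it internally; both addresses and the function name `pick_recipient` are made up for the example.

```python
# Hypothetical addresses, for illustration only.
SMS_GATEWAY = "5551234567@sms-gateway.example"  # carrier email-to-SMS address
ROOT_INBOX = "admin@example.com"                # ordinary mailbox for root


def pick_recipient(subject: str) -> str:
    """Send SMART alerts to the phone, everything else to the normal inbox."""
    if "smart error" in subject.lower():
        return SMS_GATEWAY
    return ROOT_INBOX


print(pick_recipient("SMART error (FailedOpenDevice) detected on host: freenas"))
print(pick_recipient("Critical Alerts"))
```

The point of the design is that only the messages that need action within hours make the phone buzz.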

So at 1:22AM this particular day he got an email sent to his "root" account that consisted of the following:

Subject: Critical Alerts

The volume tank (ZFS) status is DEGRADED

Of course, he wasn't checking his email at that exact minute. Since SMART still runs, just 3 minutes later he got an SMS on his phone...

Subject: Fwd: SMART error (FailedOpenDevice) detected on host: freenas

This message was generated by the smartd daemon running on:

host name: freenas
DNS domain: <removed to protect the innocent>

The following warning/error was logged by the smartd daemon:

Device: /dev/da9 [SAT], unable to open device

Device info:
ST3000DM001-9YN166, S/N:<redacted>, WWN:5-000c50-0536e2d5e, FW:CC4B, 3.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

Of course this SMS message got his attention. It just so happened he and I were awake and on Skype, and within 30 minutes he had the disk removed, a new disk installed, and resilvering already in progress. Before noon that day his server was back online with full redundancy restored. Knowing this kind of thing happens, and that it's a matter of when and not if, he had a spare disk already tested and ready to go. He just had to do the actual disk swap.

Anyone curious to know when he had last logged into the WebGUI? About 5 months earlier.

So what should you take away from this? If you know what you are doing and actually think about how you want the server to contact you when there is a problem, you can have almost instantaneous feedback when something goes wrong. This also means that if you've done your job and planned ahead, there is no reason to log into your server at any regular interval, because the server should be telling you if things aren't right.

Hope this shows the power that FreeNAS offers if you take the time to set it up properly.

Yatti420 · Jul 22, 2014

Using the same setup with UFS should report something similar when degraded. The problem is that UFS doesn't have self-healing, etc.

Hopefully he can add the info to the hard drive thread (http://forums.freenas.org/index.php?posts/132332/). I noticed they're Seagate desktop drives.

titan_rw · Jul 22, 2014

I had a real-world example of a bad disk too. It didn't completely fail, but it came up with ZFS read errors, and SMART showed uncorrectable sectors.

Here are the emails I got (at 3am), excluding the non-relevant drive model / serial number:

Email ONE:

Code:

The following warning/error was logged by the smartd daemon:

Device: /dev/da4 [SAT], 65512 Currently unreadable (pending) sectors

Email TWO:

Code:

The following warning/error was logged by the smartd daemon:

Device: /dev/da4 [SAT], 65512 Offline uncorrectable sectors

Email THREE:

Code:

The following warning/error was logged by the smartd daemon:

Device: /dev/da4 [SAT], SMART Usage Attribute: 187 Reported_Uncorrect changed from 100 [Raw 0] to 55 [Raw 45]

Email FOUR:

Code:

The following warning/error was logged by the smartd daemon:

Device: /dev/da4 [SAT], ATA error count increased from 0 to 45

That was enough to get me to log in to check what was happening.

zpool status showed (a few) READ errors on that device. I offlined the drive at that point for further testing / replacing.

Of course, as soon as I offlined it I got this:

Code:

The volume nas1pool (ZFS) status is DEGRADED

The drive wasn't under warranty, and my local computer store wouldn't price match newegg, so I ordered two. One to use right away, and one cold spare.

I have smartd set to email me daily, regardless of whether it's already emailed me or not. So I got these emails over the next few days waiting for the replacement: (these were all separate emails making my phone chirp each time).

Code:

Device: /dev/da4 [SAT], 1992 Currently unreadable (pending) sectors
Device: /dev/da4 [SAT], 1992 Offline uncorrectable sectors
Device: /dev/da4 [SAT], SMART Usage Attribute: 187 Reported_Uncorrect changed from 50 [Raw 50] to 1 [Raw 332]
Device: /dev/da4 [SAT], ATA error count increased from 50 to 332
Device: /dev/da4 [SAT], 320 Currently unreadable (pending) sectors
Device: /dev/da4 [SAT], 320 Offline uncorrectable sectors
Device: /dev/da4 [SAT], 232 Currently unreadable (pending) sectors
Device: /dev/da4 [SAT], not capable of SMART self-check
Device: /dev/da4 [SAT], failed to read SMART Attribute Data
Device: /dev/da4 [SAT], Read SMART Self-Test Log Failed
Device: /dev/da4 [SAT], Read SMART Error Log Failed
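
If you'd rather track the trend than read each email, the pending-sector lines are regular enough to parse. A minimal sketch in Python, assuming the messages look exactly like the ones quoted above; the function name `pending_counts` is made up for the example.

```python
import re

# Matches smartd lines such as
# "Device: /dev/da4 [SAT], 1992 Currently unreadable (pending) sectors".
PENDING = re.compile(
    r"^Device: (\S+) \[[^\]]+\], (\d+) Currently unreadable \(pending\) sectors$"
)


def pending_counts(lines):
    """Return the pending-sector counts per device, in message order."""
    counts = {}
    for line in lines:
        match = PENDING.match(line.strip())
        if match:
            counts.setdefault(match.group(1), []).append(int(match.group(2)))
    return counts


messages = [
    "Device: /dev/da4 [SAT], 1992 Currently unreadable (pending) sectors",
    "Device: /dev/da4 [SAT], 1992 Offline uncorrectable sectors",
    "Device: /dev/da4 [SAT], 320 Currently unreadable (pending) sectors",
    "Device: /dev/da4 [SAT], 232 Currently unreadable (pending) sectors",
]
print(pending_counts(messages))
```

Lines that are not pending-sector reports (the "Offline uncorrectable" one here) are simply skipped.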

A couple of notes. Attribute 187 going from a normalized value of 50 to a normalized value of 1 is one of those funny ones. SMART attributes will never go below 1, so any attribute at 1 is "as bad as it gets". And it's still a 'healthy' attribute as far as the thresholds go. That's why I still fail to understand people who run short SMART tests daily. Unless you shoot your drive with a real lead bullet, it's probably going to pass a short test. This drive would pass a short test. Would anyone actually continue using this drive? Methinks not.
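
To make the threshold point concrete: by SMART convention an attribute only counts as failed when its normalized value is less than or equal to the vendor threshold. A minimal sketch in Python using the attribute 187 values from the emails above. The threshold of 0 is an assumption (the emails don't show it), and the function name is made up for the example.

```python
def attribute_failed(normalized: int, threshold: int) -> bool:
    """SMART convention: an attribute has failed when value <= threshold."""
    return normalized <= threshold


ASSUMED_THRESHOLD = 0  # assumption: the threshold is not shown in the emails

# (normalized, raw) pairs for attribute 187 as reported in the emails.
for normalized, raw in [(100, 0), (55, 45), (50, 50), (1, 332)]:
    status = "FAILED" if attribute_failed(normalized, ASSUMED_THRESHOLD) else "ok"
    print(f"normalized={normalized:3d} raw={raw:3d} -> {status}")
```

Under that assumption a normalized value that bottoms out at 1 can never reach the threshold, so the attribute never reads as failed no matter how high the raw count climbs.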

Also, since the drive was out of warranty, I started playing with it once it was no longer part of the pool. Running badblocks, I was able to find the LBA range that had problems. It was about a gigabyte range somewhere in the middle of the disk. So I started a badblocks loop continually reading / writing this range. Errors were scrolling like crazy in the ssh window. After about 4 hours of that, which was hundreds of passes over the problem area, the drive decided it had had enough and disconnected from the PC. That is what triggered the "not capable of smart test" or the "read smart log failed" stuff. Anyway, it just goes to show you can take a drive with bad sectors and kill it totally by beating on the bad sectors. If this hadn't been in a redundant setup, I would have been able to recover most of the data BEFORE I started punishing the drive. Afterwards, no luck. The drive spins up, but is not detected AT ALL. No BIOS (HBA card) detection, no OS detection.

Received the new drives, tested them both first (badblocks / long test), replaced the one, and resilvered. That was almost a month ago. No trouble since.

This is an 11-drive Z3 pool if anyone's interested. So for the 5 days or so it took to get a drive, I still had two-drive redundancy. No sweat at all.

DrKK · Jul 22, 2014

A SMART "short" test also performs an electronic self-test of the controller. So it's not just a review of SMART parameters; my understanding is that it will perform a number of controller-level electrical tests as well.

And in any case, since a SMART "short" test is basically free, and transparent, there is no compelling reason *NOT* to do one.

But I agree with you, the proper regimen of tests includes LONG (surface) SMART tests, as well as ZFS scrubs.

titan_rw · Jul 22, 2014

I don't bother with shorts at all. I do longs twice a month, and scrubs twice a month. And emails notifying about changed smart attributes.

I know I've seen quite a few times people saying to do short tests hourly because they want to know if (their drive is dying / their drive is over temperature / the sky is falling). In all those cases, I've never seen a short test help. It won't complain about temperature until the drive exceeds the manufacturer's limits, which is usually over 60C. Smartd should have warned you about that long before it gets that bad. On drives I've known to be bad, but still being detected by the controller, I've tried short tests, and they've always passed. Even if the tests are 'free', I don't bother. I also like keeping the self-test log 'cleaner' for when I go to review tests.

cyberjock · Jul 22, 2014

So there are two sides to the "Short SMART Test" story. On one side, it does run some tests. On the other side, I've never seen a short test find a problem that you didn't already know about. The reality, as far as I know, is that there is nothing a short test tells you that you wouldn't find out for yourself if your hard drive were failing. I do it just because there is no reason not to, and it doesn't hurt to do them once a week or so. :P

Anyone doing hourly short tests is failing to distinguish between SMART monitoring and SMART testing. In fact, doing hourly short tests doesn't do you a damn bit of good with regards to hard drive temp. Unless you exceed the 65C that 99% of drives use, you'll be told nothing and everything will look okay. So anyone doing hourly short tests doesn't understand how this technology works and is doing themselves no benefit at all (and is actually hurting themselves).
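
The difference between monitoring and testing on temperature can be shown with a toy comparison. This is only a sketch in Python of the behavior described in this thread, not a model of any particular drive or of smartd itself. The 65C limit is the figure quoted above; the warning and critical thresholds and both function names are made up for the example.

```python
DRIVE_LIMIT_C = 65   # figure quoted in the post above
WARN_AT_C = 40       # hypothetical monitoring threshold, pick your own
CRITICAL_AT_C = 45   # hypothetical monitoring threshold, pick your own


def monitoring_alert(temp_c: int):
    """Monitoring: alert as soon as a temperature you chose is reached."""
    if temp_c >= CRITICAL_AT_C:
        return "critical"
    if temp_c >= WARN_AT_C:
        return "warning"
    return None


def short_test_complains(temp_c: int) -> bool:
    """Testing, as described above: silent until the drive's own limit is exceeded."""
    return temp_c > DRIVE_LIMIT_C


for temp in (35, 42, 50, 66):
    print(temp, monitoring_alert(temp), short_test_complains(temp))
```

At 50C, for instance, monitoring has already raised a critical alert while the short test still has nothing to say.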

Now, I've had short tests fail, but not until after the drive had already failed a long test and/or SMART monitoring had already identified the disk as bad.

hungarianhc · Jul 28, 2014

Great thread, guys - My server is running great so I won't do it, but I'm tempted to pull out a drive, see how the system responds, and then put it back in to see how it recovers.
