User story: ZFS Filesystem isn't reliable

mosh · Dec 24, 2019

Hi,
I'm using FreeNAS in my company for ~3 years now.
I have one server with FreeNAS-11.1-U6.3 installed
mirroring OS SSD disks and HD spinning disks for the data.

Last week we had a crysis where the DC got hot and required us to bring down all servers until the heating issue will be resolved.
Let me just add that i'm hosting mysql servers, beegfs storage, and non-branded nfs storage in the same rack.
All servers and services came up fine once the issue resolved. except our FreeNAS that was reporting about a failed disk and filesystem corruption (receiving Input/Output errors).
zfs scrub didn't fix the filesystem issues and all disks are marked as DEGRADED and i have permanent errors that will require me to wipe the whole thing and rebuild it.

I'm telling you this, because in my opinion ZFS is still immature. I'd like to believe that in 2019 filesystems errors shouldn't take down the whole dataset.

Code:

  pool: nfs
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 392K in 0 days 00:07:52 with 21 errors on Tue Dec 24 17:37:24 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    nfs                                             DEGRADED     0     0 25.3K
      raidz2-0                                      DEGRADED     0     0  101K
        gptid/6ca54600-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     2  too many errors
        gptid/6da1bb59-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    20  too many errors
        gptid/6eabf943-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    18  too many errors
        gptid/6fb66a76-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    36  too many errors
        gptid/70c8b367-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     9  too many errors
        gptid/71e694dd-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    27  too many errors
        gptid/7304cac0-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    27  too many errors
        gptid/2318063a-f491-11e9-83b9-801844f2984a  DEGRADED     0     0     0  too many errors
        gptid/4bd7f20f-e67d-11e9-83b9-801844f2984a  DEGRADED     0     0    37  too many errors
        gptid/761d4c4b-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     3  too many errors
        gptid/831d0608-2d19-11e9-83b9-801844f2984a  DEGRADED     0     0    14  too many errors
        gptid/7854977f-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    10  too many errors

errors: Permanent errors have been detected in the following files:

        nfs/pma:<0x0>

G8One2 · Dec 24, 2019

Hardware spec's? Did you have regular scubs enabled?

sretalla · Dec 24, 2019

mosh said:
in my opinion ZFS is still immature

In my opinion it is not.

It's far more likely there's something missing in your operational procedures or some other fault has occurred.

With all drives having checksum errors, it would point at a disk controller card for me, so I would be checking the cabling to it and the card itself (firmware version for a start, but also for evidence of electrical burn).

Also, DEGRADED isn't OFFLINE, so you may still have access to the pool. Have you tried?

toadman · Dec 24, 2019

Very odd indeed.

HoneyBadger · Dec 24, 2019

Just how hot did it get in there? An overheating datacenter immediately makes me think your HBA failed as a result; some models run pretty toasty to begin with (especially the popular first-gen SAS2008s) and the swath of errors across the entire pool increases my suspicions. Take the pool offline, replace the HBA, do a zpool clear and scrub again. I've had an HBA literally "let the magic blue smoke out" and lose zero bytes from a live pool, so don't lose hope yet.

Any errant SMART results from your drives? As @G8One2 says, were you regularly scrubbing the pool and running scheduled SMART tests?

jgreco · Dec 24, 2019

mosh said:
I'm telling you this, because in my opinion ZFS is still immature. I'd like to believe that in 2019 filesystems errors shouldn't take down the whole dataset.

ZFS is predicated on the idea that there will be enough redundancy to recover failed blocks. Being unable to recover a block is possibly sufficient damage to leave permanent damage to a pool. Possibly. But if you run into a situation where you want to survive toasting a shelf of drives, or roast an HBA that's driving a bunch of disks, this is where it is necessary for you to design your hardware to make sure that one side of your mirrors is running on a disk shelf and HBA, and the other side of your mirrors is running on a *different* disk shelf and a *different* HBA.

ZFS is sufficiently complex and featureful that it lacks a filesystem "fixer" to pick out and eject obviously bad data, because it shouldn't be possible to get bad data into a pool to begin with. The complexity of validating metadata for snapshots and dedup is computationally and resource intensive, and since the design is unlike your conventional UFS or MSDOS filesystem in that it is able to detect and recover such errors from redundancy, you're basically expected not to roast your disks or do other things that cause massive failure that exceeds the recoverability threshold you designed into your filer. And since you can build in *huge* recoverability capabilities, I'd suggest you share at least a little responsibility here.

If you think that ZFS is immature, heaven knows what you would think of as mature for petabyte and beyond data storage.

AlexGG · Dec 24, 2019

mosh said:

I'm telling you this, because in my opinion ZFS is still immature. I'd like to believe that in 2019 filesystems errors shouldn't take down the whole dataset.

Code:

      raidz2-0                                      DEGRADED     0     0  101K
        gptid/6ca54600-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     2  too many errors
        gptid/6da1bb59-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    20  too many errors
        gptid/6eabf943-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    18  too many errors
        gptid/6fb66a76-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    36  too many errors
        gptid/70c8b367-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     9  too many errors
        gptid/71e694dd-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    27  too many errors
        gptid/7304cac0-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    27  too many errors
        gptid/2318063a-f491-11e9-83b9-801844f2984a  DEGRADED     0     0     0  too many errors
        gptid/4bd7f20f-e67d-11e9-83b9-801844f2984a  DEGRADED     0     0    37  too many errors
        gptid/761d4c4b-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0     3  too many errors
        gptid/831d0608-2d19-11e9-83b9-801844f2984a  DEGRADED     0     0    14  too many errors
        gptid/7854977f-3bf8-11e8-8ae1-801844f2984a  DEGRADED     0     0    10  too many errors

This does not look like a filesystem error. This looks like some piece of faulty hardware, I guess most likely the disk controller, and RAM is second most likely. No matter what filesystem you put onto faulty hardware, the filesystem is going down in some or other manner. From what I see, ZFS seem to have performed pretty good.

mosh · Dec 25, 2019

G8One2 said:
Hardware spec's? Did you have regular scubs enabled?

Scrubs were set on every Sunday
Server is Dell PowerEdge R730xd + PERC H730P Mini in HBA Mode

G8One2 · Dec 25, 2019

Im pretty sure IX doesn't support RAID cards, even when they are in JBOD or HBA mode. I suspect someone else could chime in on that with the specifics.

mosh · Dec 25, 2019

sretalla said:
In my opinion it is not.

It's far more likely there's something missing in your operational procedures or some other fault has occurred.

With all drives having checksum errors, it would point at a disk controller card for me, so I would be checking the cabling to it and the card itself (firmware version for a start, but also for evidence of electrical burn).

Also, DEGRADED isn't OFFLINE, so you may still have access to the pool. Have you tried?

Hi,
I still have access to the pool, I get Input/output errors on some of the data.

Here are the controller details, as reported by Dell's iDRAC

Code:

Advanced Properties
Status     
Name    PERC H730P Mini (Embedded)
Device Description    Integrated RAID Controller 1
Controller Mode    HBA
Security Status    Not Assigned
Encryption Mode    None
Firmware Version    25.5.3.0005
Driver Version    06.712.04.00-fbsd
Cache Memory Size    2048 MB
SAS Address    0x5D0946604C212500
PCI Vendor ID    0x1000
PCI Subvendor ID    0x1028
PCI Device ID    0x5d
PCI Subdevice ID    0x1f47
PCI Bus    0x3
PCI Device    0x0
PCI Function    0x0
Slot Type    Information Not Available
Slot Length    Information Not Available
Bus Width    Information Not Available
Copyback Mode    On
Patrol Read Rate    30%
Patrol Read State    Stopped
Patrol Read Mode    Auto
Check Consistency Rate    30%
Check Consistency Mode    Normal
Rebuild Rate    30%
BGI Rate    30%
Reconstruct Rate    30%
Max Capable Speed    12.0 Gbps
Persistent Hotspare    Disabled
Load Balance Setting    Auto
Preserved Cache    Not Present
Time Interval for Spin Down    30 minutes
Spindown Unconfigured Drives    Disabled
Spindown Hotspares    Disabled
Learn Mode    Not Supported
T10 PI Capability    Not Capable
Support RAID10 Uneven Spans    Supported
Support Enhanced Auto Foreign Import    Supported
Enhanced Auto Import Foreign Config    Disabled
Support Controller Boot Mode    Supported
Controller Boot Mode    Continue Boot On Error
Real-time Configuration Capability    Capable

jgreco · Dec 25, 2019

There is no true HBA or JBOD mode on LSI RAID's, including the PERC. You are passing everything through the RAID controller driver.

A RAID CARD IS NOT EXPECTED TO WORK IN THE NECESSARY MANNER.

And by "necessary manner", I don't care that it "mostly worked" or "you got it to work" -- I mean including edge cases you don't normally experience.

ZFS and FreeNAS actually really want a true HBA, i.e. a port that passes data back and forth to the drive without further abstraction, processing, or caching.

Confused about that LSI card? Join the crowd ...

A recent post prompted me to think, there's a lot of confusion regarding LSI HBA/RAID cards. If you're confused, it is probably understandable... it can be a bit complicated! ZFS users on FreeNAS should avoid using hardware RAID cards. ZFS and FreeNAS work best when the drive is managed...

www.ixsystems.com

The LSI controllers are well known for cooking themselves even in a normal environment, so if you had an issue where the data center got real hot, it is possible that damage was done to the silicon, and IT IS PROBABLE that the silicon spat corrupted crap onto your disks while trying to write normal data. This is a known failure mode of overheated LSI RAID cards -- various corruption of data and even LBA data sent to drives.

So this isn't a story about ZFS. This is a story about your RAID controller and how it was a single point of failure that corrupted your pool. :-(

seanm · Dec 25, 2019

jgreco said:
ZFS and FreeNAS actually really want a true HBA, i.e. a port that passes data back and forth to the drive without further abstraction, processing, or caching.

I wonder... does/could the FreeNAS web UI detect such devices and warn against them?

jgreco · Dec 25, 2019

seanm said:
I wonder... does/could the FreeNAS web UI detect such devices and warn against them?

No, and apparently that's by design. Last time the topic came up, years ago, the developers didn't really want to bless any particular hardware (which is the inevitable end of that rabbit hole). So here in the forums we emphasize the things known to work well. We know the Intel gigabit ethernets work well, for example. Many people don't have problems with the Broadcom bge's, but some of us *have* had problems with them. And of course Realtek is crap, but depending on whether you have knockoff silicon or which specific chip, you might be able to get stable (even if slow) operation out of it.

And look at the original poster in this thread. Yes you can absolutely get an LSI RAID controller to "work". The MFI driver isn't bad. It's meant for a traditional FreeBSD system, where it is pretty darn good. On FreeNAS, the biggest issue is that ZFS tends to throw massive transaction groups at it which swamps the write cache and causes poor performance. A typical RAID controller assumes normal "computer-y" things happening like "load program, open file, write some data" and has a cache sized to that task. However, if you don't have a BBU or CacheVault, and ZFS throws a bunch of data into the controller write cache, and you lose power, you can get critical damage to the pool because stuff that ZFS thought was written hasn't been. There are all sorts of edge cases like this. Further, the driver hasn't been thoroughly tested under the insane loads that FreeNAS can generate (think especially of scrubs or resilvers).

But these things can be MADE to work, and the developers have historically felt that merely having a suboptimal device is not a catastrophe that needs to be warned about. Even if there are risks and caveats that you should know (and don't).

Here in the forums, we push the other way. We tend to expect that people come to FreeNAS because they love their data, and they want to take as good care of it as they can. We analyze their systems and do try to warn people about this stuff. It is crazy awesome that a user with 78 posts ( @G8One2 ) spotted the RAID controller and replied within ten minutes. Because people deserve to know, and we know that not everybody reads all the documentation that's available. Life's too short. :^)

In any case, I feel a lot better about this thread because there's a very plausible sequence of events here. When the data center overheated, it's likely the PERC RAID controller went into bit vomit mode and threw some trash out into the pool. It's probably not the drives as the OP notes errors on each drive. It's also probably not recoverable because the redundancy information would have been written out through the same controller, so both the data and redundancy is corrupt. ZFS is, for better or for worse, not designed to deal with this. It has as an underlying design assumption that the pool as an entity is always in a good state. So the user's pool is probably trashed and needs to be rebuilt.

We can probably agree that the underlying design assumption about pool integrity is "unfortunate", but as I previously mentioned, there is a huge amount of complexity, and making an "fsck" for ZFS is effectively impractical. The amount of data you'd need to be validating at the petabyte scale would mean a consistency check could take weeks or months, and the amount of RAM would be ridiculous.

G8One2 · Dec 25, 2019

Hardware recommendations (read this first)

A new version is available in the Resources section of the forum. Or just click here. The old version will be unstickied after a while, but will still be available. If you want the FreeNAS Engineers official take on this subject then read this: http://www.freenas.org/hardware-requirements/...

www.ixsystems.com

G8One2 · Dec 25, 2019

It amazes me how many times I see posts about people having issues, but using desktop gaming hardware, i7's or just incompatible hardware in general. Everything is well documented, and easily searched. I've been down that road, I too tried using old desktop hardware when i first started playing around with FreeNas, but this was only because i had spare parts laying around. The end result was always the same, I would eventually lose all my data some way or another. When i got serious about it, is when i actually purchased server grade equipment, and followed the hardware recommendations posted by cyberjock.

G8One2 · Dec 25, 2019

A smart man once said:

Remember these rules:

1. This isn't Burger King. You can't build your server 'your way' and expect it all to "just work". It might. But it might not. Do you want to build it right the first time or the second time? You ready to throw down money to build a single working box with double the hardware?
2. I'm not a baker. I don't sugar coat things. I'm here to give you some cold hard facts and that's about it.
3. If you post a build in the forum that doesn't follow these recommendations, expect them to be reiterated to you all over again. Quite literally, the people looking at builds and comment have standards that are relatively in-line with this thread. So if you have a build on paper and it doesn't pass this post you can kind of guess what kind of responses you are going to get.
4. Be careful what you settle for. Often when you settle for less than you deserve you get even less than you expected.

-Cyberjock

styno · Dec 25, 2019

I don't want to make you more worried than you already are, and sorry if you lost some data. But... Are you sure that the other filesystems are not silently serving corrupted data and zfs is the only one that is actively checking (and healing if possible)?

Constantin · Dec 25, 2019

Healing? At this point, I’d run shutdown now followed by looking into what hardware might be toasted. The HBA certainly looks like a replacement candidate.

It’s pure hubris to be running potentially bad hardware with an extant pool unless the aim is to corrupt it.

styno · Dec 25, 2019

Well ofc! I was merely/bluntly comparing filesystems, not trying to fix the situation (as did OP)

Constantin · Dec 25, 2019

The OP may have gotten lucky with their other systems but I wouldn’t count on it.

Instead of making unsubstantiated criticisms about ZFS, I’d be running comparisons between known-good data and the data being hosted in the DC. I’d also take all hardware offline to run burn in tests to verify operation, ZFS or not.

I sure hope the OP followed good practices and has offsite data to compare against! Also am surprised that a temperature excursion in the DC would not trigger an alert followed by emergency shutdown via the sysadmins.

It’s a good reminder for me to check whether my FreeNAS is set up to contact me when something goes sideways. If I were smarter, I’d have auto-shutdown scripts ready to trigger should a disk or CPU go beyond normal limits.

Important Announcement for the TrueNAS Community.

User story: ZFS Filesystem isn't reliable

Explorer

Patron

Powered by Neutrality

Guru

actually does care

Resident Grinch

Contributor

Explorer

Patron

Explorer

Resident Grinch

Guru

Resident Grinch

Patron

Patron

Patron

Patron

Vampire Pig

Patron

Vampire Pig

Similar threads