Losing ZFS pool overnight

paul.warwicker.1 · Apr 28, 2019

I recently upgraded from 11.1-U7 to 11.2-U3. Since then I have been experiencing random failures, particularly when backing up to the server from a couple of Windows 10 machines overnight. Initially, I suspected the samba setup but the issue appears to be that I am losing a whole pool. It is nothing to do with the timing as such; I can cause the problem during the day, although I have to wait a random time, often a few hours after a manual start of the full backup. The share and the oracle02/.system/samba4 dataset just evaporates under its feet.

I have raised the smbd logging to try to give an indication of the time when the failure occurs, but checking the other logs, around the time of the failure has so far revealed nothing. If I stay logged in, I can issue some commands, but trying to read /var/log is impossible and df hangs. I have to reboot to inspect any logs.

It goes from this:

Code:

root@oracle[~]# zfs list

NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
freenas-boot                                               1.02G   107G   176K  none
freenas-boot/ROOT                                          1.02G   107G   136K  none
freenas-boot/ROOT/Initial-Install                             8K   107G  1.01G  legacy
freenas-boot/ROOT/default                                  1.02G   107G  1.01G  legacy
oracle01                                                   23.1G  5.08T   128K  /mnt/oracle01
oracle01/ubackups                                          23.1G  5.08T  23.1G  /mnt/oracle01/ubackups
oracle01/utest                                             1.07M  5.08T  1.07M  /mnt/oracle01/utest
oracle02                                                   6.97T  3.42T  99.0M  /mnt/oracle02
oracle02/.bhyve_containers                                 70.5M  3.42T  70.5M  /mnt/oracle02/.bhyve_containers
oracle02/.system                                           35.1M  3.42T   170K  legacy
oracle02/.system/configs-1e6cbdfb415748b98d868947a6e14a88   454K  3.42T   454K  legacy
oracle02/.system/cores                                      156K  3.42T   156K  legacy
oracle02/.system/rrd-1e6cbdfb415748b98d868947a6e14a88      24.2M  3.42T  24.2M  legacy
oracle02/.system/samba4                                     604K  3.42T   604K  legacy
oracle02/.system/syslog-1e6cbdfb415748b98d868947a6e14a88   9.34M  3.42T  9.34M  legacy
oracle02/.system/webui                                      156K  3.42T   156K  legacy
oracle02/docker                                             498M  3.42T   498M  /mnt/oracle02/docker
oracle02/iocage                                            4.97G  3.42T  4.13M  /mnt/oracle02/iocage
oracle02/iocage/download                                    272M  3.42T   156K  /mnt/oracle02/iocage/download
oracle02/iocage/download/11.2-RELEASE                       272M  3.42T   272M  /mnt/oracle02/iocage/download/11.2-RELEASE
oracle02/iocage/images                                      156K  3.42T   156K  /mnt/oracle02/iocage/images
oracle02/iocage/jails                                      3.35G  3.42T   156K  /mnt/oracle02/iocage/jails
oracle02/iocage/jails/jenkins                               648M  3.42T   334K  /mnt/oracle02/iocage/jails/jenkins
oracle02/iocage/jails/jenkins/root                          648M  3.42T  1.60G  /mnt/oracle02/iocage/jails/jenkins/root
oracle02/iocage/jails/plex                                 2.72G  3.42T   320K  /mnt/oracle02/iocage/jails/plex
oracle02/iocage/jails/plex/root                            2.72G  3.42T  3.66G  /mnt/oracle02/iocage/jails/plex/root
oracle02/iocage/log                                         170K  3.42T   170K  /mnt/oracle02/iocage/log
oracle02/iocage/releases                                   1.35G  3.42T   156K  /mnt/oracle02/iocage/releases
oracle02/iocage/releases/11.2-RELEASE                      1.35G  3.42T   156K  /mnt/oracle02/iocage/releases/11.2-RELEASE
oracle02/iocage/releases/11.2-RELEASE/root                 1.35G  3.42T  1.35G  /mnt/oracle02/iocage/releases/11.2-RELEASE/root
oracle02/iocage/templates                                   156K  3.42T   156K  /mnt/oracle02/iocage/templates
oracle02/jails                                             3.30G  3.42T   185K  /mnt/oracle02/jails
oracle02/jails/.warden-template-pluginjail-11.0-x64         648M  3.42T   648M  /mnt/oracle02/jails/.warden-template-pluginjail-11.0-x64
oracle02/jails/.warden-template-standard-11.0-x64          2.67G  3.42T  2.67G  /mnt/oracle02/jails/.warden-template-standard-11.0-x64
oracle02/local                                             1.19T  3.42T  1.19T  /mnt/oracle02/local
oracle02/media                                              740G  3.42T   740G  /mnt/oracle02/media
oracle02/ubackups                                          23.1G  3.42T  23.1G  /mnt/oracle02/ubackups
oracle02/utest                                             1.14M  3.42T  1.14M  /mnt/oracle02/utest
oracle02/vmware                                             415G  3.42T   415G  /mnt/oracle02/vmware
oracle02/wbackups                                          4.33T  3.42T  4.09T  /mnt/oracle02/wbackups
oracle02/wroot                                              304G  3.42T   304G  /mnt/oracle02/wroot
root@oracle[~]#

to this:

Code:

[paulw@oracle ~]$ zfs list
NAME                                USED  AVAIL  REFER  MOUNTPOINT
freenas-boot                       1.02G   107G   176K  none
freenas-boot/ROOT                  1.02G   107G   136K  none
freenas-boot/ROOT/Initial-Install     8K   107G  1.01G  legacy
freenas-boot/ROOT/default          1.02G   107G  1.01G  legacy
oracle01                           23.1G  5.08T   128K  /mnt/oracle01
oracle01/ubackups                  23.1G  5.08T  23.1G  /mnt/oracle01/ubackups
oracle01/utest                     1.07M  5.08T  1.07M  /mnt/oracle01/utest
[paulw@oracle ~]$

Any sage words of advice??

It all was all working okay at 11.1.

-paul

Heracles · Apr 28, 2019

Hi Paul,

Many people ended up loosing dataset when they migrate from 11.1 to 11.2. The ones with snapshots of their dataset were able to recover their data but those without snapshot or backup lost it all.

You can go and read this thread for more info about the situation you may be in. Maybe you can also provide them with more data for them to do the diagnosis.

Good luck,

paul.warwicker.1 · Apr 28, 2019

Thanks for the reply, but I don't think I am in that situation.

The main oracle02 dataset was rebuilt at 11.1 when I migrated the data to the new Red drives. AFAIK, there is no data loss at all. Even with this particular issue, after a reboot all the shares are accessible and the data is intact. It just seems to be stressing the oracle02/wbackups is causing the whole pool to be removed taking down any system datasets.

The dataset oracle01 was rebuilt at 11.2. oracle01 was previously the main dataset (also raidz2 stored in an external enclosure) but had one failed drive. So I recycled the Blacks as a degraded volume into the Gen8 itself replacing several Green disks of similar size. Once the data was copied to the new oracle02, oracle01 was completely rebuilt as a 4 disk volume.

-paul

Heracles · Apr 28, 2019

Hi again,

Good for you if you are not in that situation. When you said that the dataset evaporated, I thought that you were in that boat.

It may be some kind of self-protection; ZFS detects something bad and would rather unmount the pool than risking to loose it to whatever it detected. I never faced something like that yet, so I will let other people offer you better support...

Good luck,

Chris Moore · Apr 28, 2019

paul.warwicker.1 said:
I can cause the problem during the day, although I have to wait a random time, often a few hours after a manual start of the full backup. The share and the oracle02/.system/samba4 dataset just evaporates under its feet.

I would appreciate if you shared hardware details because this sounds like a hardware problem to me. Is the pool that 'evaporates' on a separate controller?

Please see this guidance:
https://www.ixsystems.com/community/threads/forum-guidelines.45124

I see what is in your signature, but that is not close to enough detail.

Jailer · Apr 28, 2019

paul.warwicker.1 said:
oracle01 was previously the main dataset (also raidz2 stored in an external enclosure)

Is this still the case? You're going to have to provide complete hardware specs and a little more information if others are going to be able to help you. If your drives are located in an external enclosure that's a very important piece of information that is missing.

https://www.ixsystems.com/community/threads/forum-guidelines.45124/

Chris Moore · Apr 28, 2019

Just to be clear, what you are talking about here is a "Micro Server"...

with only four drive bays and your pool that is evaporating is somehow connected externally.

paul.warwicker.1 · Apr 28, 2019

Yes, that is the Microserver (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03793258).

It is just the stock motherboard.
It has an HP Dynamic Smart Array B120i controller internally and I have an eSATA enclosure connected via a PCIe card. However all disks are JBOD, there is no use of the RAID controller.
HP Ethernet 1Gb 2-port 332i adapter plus ILO

oracle02 zpool is in the external enclosure, oracle01 zpool is on the internal controller.

I have run it this way for approximately 5 years and was fine until I went to 11.2. Previous versions were 9.3->Corral (persevered for quite some time as it worked for me)->11.1->11.2

There are no hardware errors as far as I am aware. In fact, it has been replicating selected datasets from oracle02 to oracle01 for the past 6 hours continuously without incident.

-paul

Chris Moore · Apr 28, 2019

paul.warwicker.1 said:
I have an eSATA enclosure connected via a PCIe card

There is your problem. Throw that eSATA hardware out. When I first started doing FreeNAS, I tried to use that kind of gear myself for several years and it is just NOT reliable. It could be the controller chip on the card or the one in the external enclosure or some incompatibility between the two but that is the problem. I have seen that kind of behavior myself and it is the reason I tell everyone that asks, get a SAS controller. SATA is terrible and eSATA is worse but eSATA to a port multiplier is the absolute worst of all.

paul.warwicker.1 said:
I have run it this way for approximately 5 years and was fine until I went to 11.2.

They probably change a driver between version 11.1 and 11.2 and that kind of thing is the reason I finally gave up on my SATA port multiplier hardware but it happened for me when we went from 9.10 to 11, if I recall correctly.

paul.warwicker.1 said:
There are no hardware errors as far as I am aware. In fact, it has been replicating selected datasets from oracle02 to oracle01 for the past 6 hours continuously without incident.

Sure. It works great, right up until it doesn't work at all. It could be that one of the chips is going bad and under certain conditions it overheats. You could try buying all new hardware.

paul.warwicker.1 · Apr 28, 2019

Chris Moore said:
You could try buying all new hardware.

I appreciate the advice, but not entirely practical. If anything reverting to 11.1 is more pragmatic because it worked almost flawlessly there. Obviously I would be stranded, but I had tried reverting to boot 11.1 off a different USB boot disk, so I know it is recoverable in that sense. Also one of the reasons I had not upgraded the zpool.

I can't be entirely sure that it is not the hardware, but if it was a chip overheating, would you not expect several hours of replication to have provoked this? Now that I have a large (740G) dataset replicated from 02->01, I will try replicating the other way to force oracle02 to be written to.

-paul

Chris Moore · Apr 28, 2019

paul.warwicker.1 said:
I will try replicating the other way to force oracle02 to be written to.

Out of mild curiosity, why are you calling these datasets oracle?

paul.warwicker.1 · Apr 28, 2019

Nothing to do with software. My machines are named after Matrix characters and this particular server is oracle.local :)

The replication test in the opposite direction has caused the same problem. I left it to resilver overnight. Most of the errors are either to oracle02/.system/<something>, so it may be related to that. Not entirely sure where they were previously. Will try moving those to freenas-boot. No point in not trying.

-paul

KrisBee · Apr 29, 2019

Paul,

I think I'd be pragmatic and revert to FN11.1 in the short term. As to a hardware change, why not replicate to a second microserver ( gen8 or earlier N54l, N40L model) ? Not thought it through, but can you link a second microserver via external SAS cards to your Gen8? Something along these lines: https://www.youtube.com/watch?v=WgfDKjp3njk

paul.warwicker.1 · Apr 29, 2019

KrisBee said:
why not replicate to a second microserver

That would be a 3rd Gen8 as I already have a maxed out/upgraded Gen8 running ESXi.
The Gen8s are well-engineered boxes but as you know they only have one low profile PCIe slot. If I went for a different controller, it probably would be the P222 SAS controller.
This controller can only connect to a maximum of 4 external drives. I have not seen a card with any more. Where they have 8, they usually split 4 internally. I have 5 disks in my enclosure in raidz2, so I would also need to compromise on available space or raid level. In addition to the cost of the controller, and I have not searched extensively yet, but have not yet found a genuinely decent backplane enclosure for four disks. I have seen some which are which have 8 bays (thinking ahead and putting a second P222 in the second Gen8), but they are ridiculously expensive. Unfortunately, I have to stay with 3.5" SATA.

Reluctantly and rather disappointingly, I have reverted back to 11.1. A couple of issues but I will start a new thread for that.

-paul

KrisBee · Apr 30, 2019

Paul,

I picked the wrong youtube ref, part 1 is here: https://www.youtube.com/watch?v=wQU8hsfCAz0 Gen8 are great boxes but with limited expansion capability. A P222 is not suitable for FreeNAS, AFAIK it cannot work as a pure HBA. For that kind of setup you would need a 9207/9217-4i4e or 8e card. The idea in the video is attractive as the N40L does have a reasonable backplane, After you switch the internal SAS cable to the SAS adaptor. it only needs any OS that boots with min of memory to keep power to the 4 drives. The NORCO brand adaptor, or its equivalent, is the difficult thing to source in the UK. HBA card and sff-8088 to sff-8088 cables can be had s/hand on fleabay.

But this kind of two machine configuration seems more about turning a 4 drive Gen8 into a 8 drive device for use with a single pool than replication between separate pools as you are doing.

Important Announcement for the TrueNAS Community.

Losing ZFS pool overnight

paul.warwicker.1

Dabbler

Heracles

Wizard

paul.warwicker.1

Dabbler

Heracles

Wizard

Chris Moore

Hall of Famer

Jailer

Not strong, but bad

Chris Moore

Hall of Famer

paul.warwicker.1

Dabbler

Chris Moore

Hall of Famer

paul.warwicker.1

Dabbler

Chris Moore

Hall of Famer

paul.warwicker.1

Dabbler

KrisBee

Wizard

paul.warwicker.1

Dabbler

KrisBee

Wizard

Similar threads