Update to 9.10-STABLE-20160418 failed and datasets are all empty

Status
Not open for further replies.

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
I've uploaded the debug log after creating a new system and just importing the 'main' pool in the gui.

memtest ran overnight w/o error.
 

Attachments

  • debug-freenas-20160422072844.tgz
    140.5 KB · Views: 234

styno

Patron
Joined
Apr 11, 2016
Messages
466
Is there a way to boot back into the old version and grab a debug as well?
(hoping the upgrade process left some traces/clues...)
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Maybe also post the output of "zfs history". I've found this can take some time to run, but it includes a timestamp for each entry, so you should be able to focus on the entries since your upgrade and just prior.
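Since "zfs history" timestamps each entry as YYYY-MM-DD.HH:MM:SS, a quick awk filter can narrow a long history to the upgrade window. A rough sketch — the here-doc below just stands in for real output; in practice you'd pipe `zfs history main` straight into the awk:

```shell
# Stand-in sample for real `zfs history main` output.
zfs_history_sample() {
cat <<'EOF'
2016-04-15.02:00:09 zpool scrub main
2016-04-15.08:33:49 zpool import -o cachefile=none -R /mnt -f main
2016-04-19.07:39:45 zpool import -o cachefile=none -R /mnt -f main
2016-04-19.07:39:45 zpool set cachefile=/data/zfs/zpool.cache main
EOF
}

# Keep only entries whose date falls in [start, end]; a plain string
# compare works because the dates are zero-padded YYYY-MM-DD.
zfs_history_sample | awk -F'.' -v start="2016-04-18" -v end="2016-04-20" \
    '$1 >= start && $1 <= end'
```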
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Is there a way to boot back into the old version and grab a debug as well?
(hoping the upgrade process left some traces/clues...)

Here's the debug log of a reboot with the 9.10-stable-20160418 release.
 

Attachments

  • debug-freenas-20160422184813_9.10_20160418.tgz
    338.9 KB · Views: 222

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Maybe also post the output of "zfs history". I've found this can take some time to run, but it includes a timestamp for each entry, so you should be able to focus on the entries since your upgrade and just prior.

Here's the relevant snippet for the 'main' pool:
Code:
2016-04-15.02:00:09 zpool scrub main
2016-04-15.08:33:49 zpool import -o cachefile=none -R /mnt -f main
2016-04-15.08:33:49 zpool set cachefile=/data/zfs/zpool.cache main
2016-04-19.07:39:45 zpool import -o cachefile=none -R /mnt -f main
2016-04-19.07:39:45 zpool set cachefile=/data/zfs/zpool.cache main
2016-04-19.20:59:32 zpool import -o cachefile=none -R /mnt -f main
2016-04-19.20:59:32 zpool set cachefile=/data/zfs/zpool.cache main


4/15 - biweekly scrub
4/15 - update with no issues
4/19 @ 7:39 - initiated update with the 4/18 release.
4/19 @ 19:00 - Discovered I couldn't log into the GUI or ssh (to check on the update from the morning). Hard reboot. Failed to boot with 4/18. Rolled back to 4/15. No jails or data in the pools.

The other pools look pretty similar:
Code:
2016-04-01.00:00:14 zpool scrub main01
2016-04-01.21:57:40 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-01.21:57:40 zpool set cachefile=/data/zfs/zpool.cache main01
2016-04-12.08:16:15 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-12.08:16:15 zpool set cachefile=/data/zfs/zpool.cache main01
2016-04-15.00:00:13 zpool scrub main01
2016-04-15.08:33:52 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-15.08:33:52 zpool set cachefile=/data/zfs/zpool.cache main01
2016-04-19.07:39:47 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-19.07:39:47 zpool set cachefile=/data/zfs/zpool.cache main01
2016-04-19.20:59:39 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-19.20:59:39 zpool set cachefile=/data/zfs/zpool.cache main01
2016-04-19.22:12:56 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 6087474642287709240
2016-04-19.22:12:56 zpool set cachefile=/data/zfs/zpool.cache main01


Code:
2016-04-02.03:00:08 zpool scrub main2
2016-04-12.08:16:18 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 698393427071308264
2016-04-12.08:16:18 zpool set cachefile=/data/zfs/zpool.cache main2
2016-04-15.08:33:58 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 698393427071308264
2016-04-15.08:33:58 zpool set cachefile=/data/zfs/zpool.cache main2
2016-04-16.03:00:16 zpool scrub main2
2016-04-19.07:39:51 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 698393427071308264
2016-04-19.07:39:51 zpool set cachefile=/data/zfs/zpool.cache main2
2016-04-19.20:59:41 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 698393427071308264
2016-04-19.20:59:41 zpool set cachefile=/data/zfs/zpool.cache main2
2016-04-19.22:12:58 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 698393427071308264
2016-04-19.22:12:58 zpool set cachefile=/data/zfs/zpool.cache main2
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
I exported my zpool main2 and had my neighbor take a look at it (he's a storage engineer but not a FreeNAS/FreeBSD expert). In Ubuntu he saw the same empty drives, but noticed that ashift was being reported as 0. Is that expected? From googling, 9 (512) and 12 (4k) are the common settings. With 0, ZFS ought to use the drive's setting (12/4k for my 4 TB WD Greens). The fact that all three pools were impacted similarly implies that something in the pools' configuration was possibly changed during the update (e.g. a zfs version update), which then resulted in some knowledge of the file system being lost. The boot cache remembered the datasets, though.

A couple questions:
1) How can I report the ashift value on the current pools? "zpool/zfs get ashift main" doesn't return a value.
2) What is the expected setting?
3) If ashift is incorrect and can result in an empty file system being presented, can the value be corrected? This FreeBSD thread discusses something similar: https://forums.freebsd.org/threads/46957/
The same thread implies reduced performance, but not lost data, as a result of having an incorrect value.
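For what it's worth on question 1: "zpool get ashift" only shows the pool-wide property, which stays 0 ("auto") unless it was set explicitly — the value that actually matters is stored per vdev in the labels, and zdb prints it in its MOS configuration dump. A sketch of pulling it out with awk; the here-doc stands in for real `zdb -U /data/zfs/zpool.cache -C main` output, and the -U cache-file path is an assumption based on where FreeNAS keeps its pool cache:

```shell
# Stand-in for real zdb MOS-configuration output
# (in practice: zdb -U /data/zfs/zpool.cache -C main).
zdb_sample() {
cat <<'EOF'
        vdev_tree:
            type: 'root'
            children[0]:
                type: 'mirror'
                metaslab_shift: 35
                ashift: 12
                asize: 4000782221312
EOF
}

# Print each vdev's ashift and the sector size it implies (2^ashift).
zdb_sample | awk '/ashift:/ { print "ashift =", $2, "-> sector size", 2^$2, "bytes" }'
```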

Weird that the pool still reports my original creation date back in December 2014 but the data has disappeared.

Code:
 ~# zfs get all main
NAME  PROPERTY              VALUE                  SOURCE
main  type                  filesystem             -
main  creation              Wed Dec 10 16:44 2014  -
main  used                  2.80G                  -
main  available             1.75T                  -
main  referenced            112K                   -
main  compressratio         1.97x                  -
main  mounted               yes                    -
main  quota                 none                   default
main  reservation           none                   default
main  recordsize            128K                   default
main  mountpoint            /mnt/main              default
main  sharenfs              off                    default
main  checksum              on                     default
main  compression           lz4                    local
main  atime                 on                     default
main  devices               on                     default
main  exec                  on                     default
main  setuid                on                     default
main  readonly              off                    default
main  jailed                off                    default
main  snapdir               hidden                 default
main  aclmode               passthrough            local
main  aclinherit            passthrough            local
main  canmount              on                     default
main  xattr                 off                    temporary
main  copies                1                      default
main  version               5                      -
main  utf8only              off                    -
main  normalization         none                   -
main  casesensitivity       sensitive              -
main  vscan                 off                    default
main  nbmand                off                    default
main  sharesmb              off                    default
main  refquota              none                   default
main  refreservation        none                   default
main  primarycache          all                    default
main  secondarycache        all                    default
main  usedbysnapshots       0                      -
main  usedbydataset         112K                   -
main  usedbychildren        2.80G                  -
main  usedbyrefreservation  0                      -
main  logbias               latency                default
main  dedup                 off                    default
main  mlslabel                                     -
main  sync                  standard               default
main  refcompressratio      1.00x                  -
main  written               112K                   -
main  logicalused           3.71G                  -
main  logicalreferenced     41.5K                  -
main  volmode               default                default
main  filesystem_limit      none                   default
main  snapshot_limit        none                   default
main  filesystem_count      none                   default
main  snapshot_count        none                   default
main  redundant_metadata    all                    default


Code:
 ~# zpool get all main
NAME  PROPERTY                       VALUE                          SOURCE
main  size                           1.81T                          -
main  capacity                       0%                             -
main  altroot                        /mnt                           local
main  health                         ONLINE                         -
main  guid                           11934946034099847969           default
main  version                        -                              default
main  bootfs                         -                              default
main  delegation                     on                             default
main  autoreplace                    off                            default
main  cachefile                      /data/zfs/zpool.cache          local
main  failmode                       continue                       local
main  listsnapshots                  off                            default
main  autoexpand                     on                             local
main  dedupditto                     0                              default
main  dedupratio                     1.00x                          -
main  free                           1.81T                          -
main  allocated                      2.81G                          -
main  readonly                       off                            -
main  comment                        -                              default
main  expandsize                     -                              -
main  freeing                        0                              default
main  fragmentation                  4%                             -
main  leaked                         0                              default
main  feature@async_destroy          enabled                        local
main  feature@empty_bpobj            active                         local
main  feature@lz4_compress           active                         local
main  feature@multi_vdev_crash_dump  enabled                        local
main  feature@spacemap_histogram     active                         local
main  feature@enabled_txg            active                         local
main  feature@hole_birth             active                         local
main  feature@extensible_dataset     enabled                        local
main  feature@embedded_data          active                         local
main  feature@bookmarks              enabled                        local
main  feature@filesystem_limits      enabled                        local
main  feature@large_blocks           enabled                        local


Thank you for the help.
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Additional comments (from main2 imported into ubuntu):

"Below is the output of 'zdb -C main2'. The things of note: the vdev shows ashift set to 12, but the pool is showing 0 via 'zpool get'.
Notice that the version is '5000'; this indicates that a 'zpool upgrade' was done on this pool. '5000' means we have feature flags, which I know was not set up initially.
Then there is the 'space map refcount mismatch: expected 140 != actual 116' at the very end. It feels like it is related in some way to the problem. However, this might not be a critical issue based on some of what I have read.
Either way, I'll bet that if you are going to find a way to fix this, it will involve the zdb command."

Code:
root@:/main2# zdb -C main2
MOS Configuration:
        version: 5000
        name: 'main2'
        state: 0
        txg: 5916858
        pool_guid: 698393427071308264
        errata: 0
        hostname: 'sun'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 698393427071308264
            children[0]:
                type: 'mirror'
                id: 0
                guid: 15743317423485594646
                metaslab_array: 23
                metaslab_shift: 35
                ashift: 12
                asize: 4000782221312
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 3818485169074205401
                    path: '/dev/sdb'
                    whole_disk: 0
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16257859811095562581
                    path: '/dev/sda2'
                    whole_disk: 1
                    create_txg: 4
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
space map refcount mismatch: expected 140 != actual 116
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Interestingly, zdb doesn't want to report on my empty pools, but the boot pool works.

Code:
[root@freenas] ~# zdb -C main
zdb: can't open 'main': No such file or directory
[root@freenas] ~# zdb -C main01
zdb: can't open 'main01': No such file or directory
[root@freenas] ~# zdb -C freenas-boot

MOS Configuration:
        version: 28
        name: 'freenas-boot'
        state: 0
        txg: 8190888
        pool_guid: 1980700216968690554
        hostid: 1928646676
        hostname: ''
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 1980700216968690554
            children[0]:
                type: 'mirror'
                id: 0
                guid: 2414297400553854654
                whole_disk: 0
                metaslab_array: 30
                metaslab_shift: 27
                ashift: 9
                asize: 15731523584
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 15638183002374690772
                    path: '/dev/gptid/5bb691d2-bbdf-11e4-a190-bc5ff4653672'
                    whole_disk: 1
                    DTL: 4240
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16578772712183022327
                    path: '/dev/gptid/7276d242-4ba5-11e5-90a5-bc5ff4653672'
                    whole_disk: 1
                    DTL: 4243
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 7873896676484202315
                    path: '/dev/gptid/72f882d9-1bd1-11e5-a49d-bc5ff4653672'
                    whole_disk: 1
                    DTL: 4241
                    create_txg: 4
        features_for_read:
[root@freenas] ~#
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
I've had a look at the logs but they start in the evening after you did this upgrade. Nothing there before or during :(
Just for your reference: the following is from my testsystem.
Code:
[root@goliath] ~# zdb -C freenas-boot

MOS Configuration:
        version: 5000
        name: 'freenas-boot'
        state: 0
        txg: 2161
        pool_guid: 15609061563420489255
        hostid: 717573323
        hostname: ''
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15609061563420489255
            create_txg: 4
            children[0]:
                type: 'mirror'
                id: 0
                guid: 2021053957597793508
                whole_disk: 0
                metaslab_array: 30
                metaslab_shift: 27
                ashift: 9
                asize: 15999959040
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 17630457644219948266
                    path: '/dev/da0p2'
                    whole_disk: 1
                    DTL: 90
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 5287390018985210882
                    path: '/dev/da1p2'
                    whole_disk: 1
                    DTL: 86
                    create_txg: 4
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
[root@goliath] ~# zdb -C goliathpool
zdb: can't open 'goliathpool': No such file or directory
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Thanks for taking a look.

Bug #14536 talks about a missing symlink to the zpool cache. With -e added to the zdb command it now finds my pools. main01 reports the refcount mismatch as well, but the ashift values look OK.

I'll generate logs from the boot image before the update and see if that shows something.
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Here are logs from the 9.10 Stable 4/14 release. They may have something leading up to the update.
 

Attachments

  • debug-freenas-20160423180149_9.10_20160414.tgz
    400.4 KB · Views: 230

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
This thread is extremely disturbing to me. It should not be possible to instantly lose all data from ZFS due to a minor upgrade in a STABLE train with code only 4 days apart! :(
It's probably not due to the upgrade; it's from something that was changed and only caused problems after a reboot. Upgrades only touch the boot device, so randomly losing data doesn't just happen. If the OP could tell us what they did when setting this pool up, or any modifications made outside the FreeNAS GUI, it might help to shed some light.
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
It's probably not due to the upgrade; it's from something that was changed and only caused problems after a reboot. Upgrades only touch the boot device, so randomly losing data doesn't just happen. If the OP could tell us what they did when setting this pool up, or any modifications made outside the FreeNAS GUI, it might help to shed some light.
The pools main and main01 were created through the FreeNAS GUI, and I didn't modify any pool settings.
main2 was created in Ubuntu and then imported (rsync of existing files). The version of zfs was changed to match that of the then-current FreeNAS version (~March 2015).
In February/March I migrated sys_logs to a USB HDD enclosure since the thumbdrives were starting to fail. That seemed to go OK.
No changes were made to the system's configuration leading up to 04/18's update. The only changes for a while were the 9.3 -> 9.10 upgrade, which went smoothly, and then subsequent 9.10 updates on the stable train.
I've mostly played with plugins for general media serving and have stayed away from the boot/file systems. The one major thing I did was create a jail to host a headless vncserver to run KDE/Gnome with. That was done in a jail a few months ago.
Next on the list was to add the crashplan plugin. :-/

The main recurring issue I've had during the life of the system (12/2014) has been USB thumbdrive failures (bad checksums and the occasional resilver), and one of the 4 TB drives on main2 started throwing SMART warnings, so I RMA'd the drive.

I think I've now reached the acceptance stage and will begin the rebuilding.
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
Here are logs from the 9.10 Stable 4/14 release. May have something leading up to the update.
Unfortunately they also start around "Apr 19 22:13:34", but the first issues were starting earlier that day? I also believe it should not be the update process that caused this, but there is no proof yet to rule it out.
OP, was this system accessible from the internet? (Maybe a VPN client was running on this system?) Can you have a closer look at the .history files and logins? (The debug logs don't include these files.)
If you reinstall and rebuild, can you keep one of the usb sticks aside or maybe create an image for further investigation? No-one wants to end up in a disturbing situation like this!
 

lomilomi

Dabbler
Joined
Jan 30, 2016
Messages
22
Apr 19 22:13 was when I tried to reboot and went back a release through grub.

That's an intriguing and scary thought. The only intentional service I had externally accessible was Plex (port forwarded). I played with creating an outgoing VPN connection for transmission, but nothing the other way around. I'll take a look at those logs (I have the USB sticks set aside). I haven't seen any other mischief on other machines, but that doesn't mean it isn't there.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Interestingly, zdb doesn't want to report on my empty pools, but the boot pool works.

Code:
[root@freenas] ~# zdb -C main
zdb: can't open 'main': No such file or directory

Use:

Code:
zdb -U /data/zfs/zpool.cache -C main
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This thread is extremely disturbing to me. It should not be possible to instantly lose all data from ZFS due to a minor upgrade in a STABLE train with code only 4 days apart! :(

I don't see any evidence that proves the upgrade is responsible for losing the zpool. Sure, the two may have happened within a relatively short period of time. But I'm seeing zero evidence linking them at all.

In fact, because this is the only known incident, I find it exceedingly unlikely that the two will ever be able to be linked together.
 