1. 23 Jan, 2016 17 commits
  2. 15 Dec, 2015 23 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.3.3 · 35483418
      Greg Kroah-Hartman authored
      35483418
    • Filipe Manana's avatar
      Btrfs: fix regression running delayed references when using qgroups · e0aa614d
      Filipe Manana authored
      commit b06c4bf5 upstream.
      
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarElias Probst <mail@eliasprobst.eu>
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      e0aa614d
    • Hans Verkuil's avatar
      cobalt: fix Kconfig dependency · 4f2347e6
      Hans Verkuil authored
      commit fc88dd16 upstream.
      
      The cobalt driver should depend on VIDEO_V4L2_SUBDEV_API.
      
      This fixes this kbuild error:
      
      tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
      head:   99bc7215
      commit: 85756a06 [media] cobalt: add new driver
      config: x86_64-randconfig-s0-09201514 (attached as .config)
      reproduce:
        git checkout 85756a06
        # save the attached .config to linux build tree
        make ARCH=x86_64
      
      All error/warnings (new ones prefixed by >>):
      
         drivers/media/i2c/adv7604.c: In function 'adv76xx_get_format':
      >> drivers/media/i2c/adv7604.c:1853:9: error: implicit declaration of function 'v4l2_subdev_get_try_format' [-Werror=implicit-function-declaration]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                  ^
         drivers/media/i2c/adv7604.c:1853:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                ^
         drivers/media/i2c/adv7604.c: In function 'adv76xx_set_format':
         drivers/media/i2c/adv7604.c:1882:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                ^
         cc1: some warnings being treated as errors
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f2347e6
    • Lu, Han's avatar
      ALSA: hda/hdmi - apply Skylake fix-ups to Broxton display codec · de48652f
      Lu, Han authored
      commit e2656412 upstream.
      
      Broxton and Skylake have the same behavior on display audio. So this patch
      applys Skylake fix-ups to Broxton.
      Signed-off-by: default avatarLu, Han <han.lu@intel.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de48652f
    • Dan Williams's avatar
      ALSA: pci: depend on ZONE_DMA · 4ba2bd2c
      Dan Williams authored
      commit 2db1a579 upstream.
      
      There are several sound drivers that 'select ZONE_DMA'.  This is
      backwards as ZONE_DMA is an architecture capability exported to drivers.
      Switch the polarity of the dependency to disable these drivers when the
      architecture does not support ZONE_DMA.  This was discovered in the
      context of testing/enabling devm_memremap_pages() which depends on
      ZONE_DEVICE.  ZONE_DEVICE in turn depends on !ZONE_DMA.
      Reported-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ba2bd2c
    • Arnd Bergmann's avatar
      ceph: fix message length computation · 4bf8855e
      Arnd Bergmann authored
      commit 777d738a upstream.
      
      create_request_message() computes the maximum length of a message,
      but uses the wrong type for the time stamp: sizeof(struct timespec)
      may be 8 or 16 depending on the architecture, while sizeof(struct
      ceph_timespec) is always 8, and that is what gets put into the
      message.
      
      Found while auditing the uses of timespec for y2038 problems.
      
      Fixes: b8e69066 ("ceph: include time stamp in every MDS request")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarYan, Zheng <zyan@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bf8855e
    • Ming Lei's avatar
      block: fix segment split · 278b54df
      Ming Lei authored
      commit 578270bf upstream.
      
      Inside blk_bio_segment_split(), previous bvec pointer(bvprvp)
      always points to the iterator local variable, which is obviously
      wrong, so fix it by pointing to the local variable of 'bvprv'.
      
      Fixes: 5014c311(block: fix bogus compiler warnings in blk-merge.c)
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reported-by: default avatarMark Salter <msalter@redhat.com>
      Tested-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: default avatarMark Salter <msalter@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      278b54df
    • Junxiao Bi's avatar
      ocfs2: fix umask ignored issue · eeec50cb
      Junxiao Bi authored
      commit 8f1eb487 upstream.
      
      New created file's mode is not masked with umask, and this makes umask not
      work for ocfs2 volume.
      
      Fixes: 702e5bc6 ("ocfs2: use generic posix ACL infrastructure")
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eeec50cb
    • Jeff Layton's avatar
      nfs: if we have no valid attrs, then don't declare the attribute cache valid · 3994c2de
      Jeff Layton authored
      commit c812012f upstream.
      
      If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
      (correctly) not update any of the attributes, but it then clears the
      NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
      up to date. Don't clear the flag if the fattr struct has no valid
      attrs to apply.
      Reviewed-by: default avatarSteve French <steve.french@primarydata.com>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3994c2de
    • Jeff Layton's avatar
      nfs4: resend LAYOUTGET when there is a race that changes the seqid · 4ffcc6f9
      Jeff Layton authored
      commit 4f2e9dce upstream.
      
      pnfs_layout_process will check the returned layout stateid against what
      the kernel has in-core. If it turns out that the stateid we received is
      older, then we should resend the LAYOUTGET instead of falling back to
      MDS I/O.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ffcc6f9
    • Benjamin Coddington's avatar
      nfs4: start callback_ident at idr 1 · 9a08c6f6
      Benjamin Coddington authored
      commit c68a027c upstream.
      
      If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
      it when the nfs_client is freed.  A decoding or server bug can then find
      and try to put that first nfs_client which would lead to a crash.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Fixes: d6870312 ("nfs4client: convert to idr_alloc()")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a08c6f6
    • Daniel Borkmann's avatar
      debugfs: fix refcount imbalance in start_creating · 84897330
      Daniel Borkmann authored
      commit 0ee9608c upstream.
      
      In debugfs' start_creating(), we pin the file system to safely access
      its root. When we failed to create a file, we unpin the file system via
      failed_creating() to release the mount count and eventually the reference
      of the vfsmount.
      
      However, when we run into an error during lookup_one_len() when still
      in start_creating(), we only release the parent's mutex but not so the
      reference on the mount. Looks like it was done in the past, but after
      splitting portions of __create_file() into start_creating() and
      end_creating() via 190afd81 ("debugfs: split the beginning and the
      end of __create_file() off"), this seemed missed. Noticed during code
      review.
      
      Fixes: 190afd81 ("debugfs: split the beginning and the end of __create_file() off")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84897330
    • Andrew Elble's avatar
      nfsd: eliminate sending duplicate and repeated delegations · 4e9f4529
      Andrew Elble authored
      commit 34ed9872 upstream.
      
      We've observed the nfsd server in a state where there are
      multiple delegations on the same nfs4_file for the same client.
      The nfs client does attempt to DELEGRETURN these when they are presented to
      it - but apparently under some (unknown) circumstances the client does not
      manage to return all of them. This leads to the eventual
      attempt to CB_RECALL more than one delegation with the same nfs
      filehandle to the same client. The first recall will succeed, but the
      next recall will fail with NFS4ERR_BADHANDLE. This leads to the server
      having delegations on cl_revoked that the client has no way to FREE
      or DELEGRETURN, with resulting inability to recover. The state manager
      on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED,
      and the state manager on the client will be looping unable to satisfy
      the server.
      
      List discussion also reports a race between OPEN and DELEGRETURN that
      will be avoided by only sending the delegation once to the
      client. This is also logically in accordance with RFC5561 9.1.1 and 10.2.
      
      So, let's:
      
      1.) Not hand out duplicate delegations.
      2.) Only send them to the client once.
      
      RFC 5561:
      
      9.1.1:
      "Delegations and layouts, on the other hand, are not associated with a
      specific owner but are associated with the client as a whole
      (identified by a client ID)."
      
      10.2:
      "...the stateid for a delegation is associated with a client ID and may be
      used on behalf of all the open-owners for the given client.  A
      delegation is made to the client as a whole and not to any specific
      process or thread of control within it."
      Reported-by: default avatarEric Meddaugh <etmsys@rit.edu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Olga Kornievskaia <aglo@umich.edu>
      Signed-off-by: default avatarAndrew Elble <aweits@rit.edu>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e9f4529
    • Jeff Layton's avatar
      nfsd: serialize state seqid morphing operations · 9a4a7278
      Jeff Layton authored
      commit 35a92fe8 upstream.
      
      Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
      running in parallel. The server would receive the OPEN_DOWNGRADE first
      and check its seqid, but then an OPEN would race in and bump it. The
      OPEN_DOWNGRADE would then complete and bump the seqid again.  The result
      was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
      it should have been rejected since the seqid changed.
      
      The only recourse we have here I think is to serialize operations that
      bump the seqid in a stateid, particularly when we're given a seqid in
      the call. To address this, we add a new rw_semaphore to the
      nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
      after looking up the stateid to ensure that nothing else is going to
      bump it while we're operating on it.
      
      In the case of OPEN, we do a down_read, as the call doesn't contain a
      seqid. Those can run in parallel -- we just need to serialize them when
      there is a concurrent OPEN_DOWNGRADE or CLOSE.
      
      LOCK and LOCKU however always take the write lock as there is no
      opportunity for parallelizing those.
      Reported-and-Tested-by: default avatarAndrew W Elble <aweits@rit.edu>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a4a7278
    • Stefan Richter's avatar
      firewire: ohci: fix JMicron JMB38x IT context discovery · eee26fbf
      Stefan Richter authored
      commit 100ceb66 upstream.
      
      Reported by Clifford and Craig for JMicron OHCI-1394 + SDHCI combo
      controllers:  Often or even most of the time, the controller is
      initialized with the message "added OHCI v1.10 device as card 0, 4 IR +
      0 IT contexts, quirks 0x10".  With 0 isochronous transmit DMA contexts
      (IT contexts), applications like audio output are impossible.
      
      However, OHCI-1394 demands that at least 4 IT contexts are implemented
      by the link layer controller, and indeed JMicron JMB38x do implement
      four of them.  Only their IsoXmitIntMask register is unreliable at early
      access.
      
      With my own JMB381 single function controller I found:
        - I can reproduce the problem with a lower probability than Craig's.
        - If I put a loop around the section which clears and reads
          IsoXmitIntMask, then either the first or the second attempt will
          return the correct initial mask of 0x0000000f.  I never encountered
          a case of needing more than a second attempt.
        - Consequently, if I put a dummy reg_read(...IsoXmitIntMaskSet)
          before the first write, the subsequent read will return the correct
          result.
        - If I merely ignore a wrong read result and force the known real
          result, later isochronous transmit DMA usage works just fine.
      
      So let's just fix this chip bug up by the latter method.  Tested with
      JMB381 on kernel 3.13 and 4.3.
      
      Since OHCI-1394 generally requires 4 IT contexts at a minium, this
      workaround is simply applied whenever the initial read of IsoXmitIntMask
      returns 0, regardless whether it's a JMicron chip or not.  I never heard
      of this issue together with any other chip though.
      
      I am not 100% sure that this fix works on the OHCI-1394 part of JMB380
      and JMB388 combo controllers exactly the same as on the JMB381 single-
      function controller, but so far I haven't had a chance to let an owner
      of a combo chip run a patched kernel.
      
      Strangely enough, IsoRecvIntMask is always reported correctly, even
      though it is probed right before IsoXmitIntMask.
      
      Reported-by: Clifford Dunn
      Reported-by: default avatarCraig Moore <craig.moore@qenos.com>
      Signed-off-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eee26fbf
    • Daeho Jeong's avatar
      ext4, jbd2: ensure entering into panic after recording an error in superblock · 5cbe4871
      Daeho Jeong authored
      commit 4327ba52 upstream.
      
      If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
      journaling will be aborted first and the error number will be recorded
      into JBD2 superblock and, finally, the system will enter into the
      panic state in "errors=panic" option.  But, in the rare case, this
      sequence is little twisted like the below figure and it will happen
      that the system enters into panic state, which means the system reset
      in mobile environment, before completion of recording an error in the
      journal superblock. In this case, e2fsck cannot recognize that the
      filesystem failure occurred in the previous run and the corruption
      wouldn't be fixed.
      
      Task A                        Task B
      ext4_handle_error()
      -> jbd2_journal_abort()
        -> __journal_abort_soft()
          -> __jbd2_journal_abort_hard()
          | -> journal->j_flags |= JBD2_ABORT;
          |
          |                         __ext4_abort()
          |                         -> jbd2_journal_abort()
          |                         | -> __journal_abort_soft()
          |                         |   -> if (journal->j_flags & JBD2_ABORT)
          |                         |           return;
          |                         -> panic()
          |
          -> jbd2_journal_update_sb_errno()
      Tested-by: default avatarHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5cbe4871
    • Lukas Czerner's avatar
      ext4: fix potential use after free in __ext4_journal_stop · 5a4ead78
      Lukas Czerner authored
      commit 6934da92 upstream.
      
      There is a use-after-free possibility in __ext4_journal_stop() in the
      case that we free the handle in the first jbd2_journal_stop() because
      we're referencing handle->h_err afterwards. This was introduced in
      9705acd6 and it is wrong. Fix it by
      storing the handle->h_err value beforehand and avoid referencing
      potentially freed handle.
      
      Fixes: 9705acd6Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a4ead78
    • Theodore Ts'o's avatar
      ext4 crypto: fix bugs in ext4_encrypted_zeroout() · bcdde051
      Theodore Ts'o authored
      commit 36086d43 upstream.
      
      Fix multiple bugs in ext4_encrypted_zeroout(), including one that
      could cause us to write an encrypted zero page to the wrong location
      on disk, potentially causing data and file system corruption.
      Fortunately, this tends to only show up in stress tests, but even with
      these fixes, we are seeing some test failures with generic/127 --- but
      these are now caused by data failures instead of metadata corruption.
      
      Since ext4_encrypted_zeroout() is only used for some optimizations to
      keep the extent tree from being too fragmented, and
      ext4_encrypted_zeroout() itself isn't all that optimized from a time
      or IOPS perspective, disable the extent tree optimization for
      encrypted inodes for now.  This prevents the data corruption issues
      reported by generic/127 until we can figure out what's going wrong.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bcdde051
    • Theodore Ts'o's avatar
      ext4 crypto: fix memory leak in ext4_bio_write_page() · 99d17a1f
      Theodore Ts'o authored
      commit 937d7b84 upstream.
      
      There are times when ext4_bio_write_page() is called even though we
      don't actually need to do any I/O.  This happens when ext4_writepage()
      gets called by the jbd2 commit path when an inode needs to force its
      pages written out in order to provide data=ordered guarantees --- and
      a page is backed by an unwritten (e.g., uninitialized) block on disk,
      or if delayed allocation means the page's backing store hasn't been
      allocated yet.  In that case, we need to skip the call to
      ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
      bounce page and an ext4 crypto context getting leaked.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      99d17a1f
    • Ilya Dryomov's avatar
      rbd: don't put snap_context twice in rbd_queue_workfn() · 4d8dce2c
      Ilya Dryomov authored
      commit 70b16db8 upstream.
      
      Commit 4e752f0a ("rbd: access snapshot context and mapping size
      safely") moved ceph_get_snap_context() out of rbd_img_request_create()
      and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
      error path in rbd_queue_workfn().  However, rbd_img_request_create()
      consumes a ref on snapc, so calling ceph_put_snap_context() after
      a successful rbd_img_request_create() leads to an extra put.  Fix it.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJosh Durgin <jdurgin@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d8dce2c
    • David Sterba's avatar
      btrfs: fix signed overflows in btrfs_sync_file · f0009492
      David Sterba authored
      commit 9dcbeed4 upstream.
      
      The calculation of range length in btrfs_sync_file leads to signed
      overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
      
      https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      
      The fsync call passes 0 and LLONG_MAX, the range length does not fit to
      loff_t and overflows, but the value is converted to u64 so it silently
      works as expected.
      
      The minimal fix is a typecast to u64, switching functions to take
      (start, end) instead of (start, len) would be more intrusive.
      
      Coccinelle script found that there's one more opencoded calculation of
      the length.
      
      <smpl>
      @@
      loff_t start, end;
      @@
      * end - start
      </smpl>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f0009492
    • Filipe Manana's avatar
      Btrfs: fix race when listing an inode's xattrs · 82c82fad
      Filipe Manana authored
      commit f1cd1f0b upstream.
      
      When listing a inode's xattrs we have a time window where we race against
      a concurrent operation for adding a new hard link for our inode that makes
      us not return any xattr to user space. In order for this to happen, the
      first xattr of our inode needs to be at slot 0 of a leaf and the previous
      leaf must still have room for an inode ref (or extref) item, and this can
      happen because an inode's listxattrs callback does not lock the inode's
      i_mutex (nor does the VFS does it for us), but adding a hard link to an
      inode makes the VFS lock the inode's i_mutex before calling the inode's
      link callback.
      
      If we have the following leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
                 slot N - 2         slot N - 1              slot 0
      
      The race illustrated by the following sequence diagram is possible:
      
             CPU 1                                               CPU 2
      
        btrfs_listxattr()
      
          searches for key (257 XATTR_ITEM 0)
      
          gets path with path->nodes[0] == leaf X
          and path->slots[0] == N
      
          because path->slots[0] is >=
          btrfs_header_nritems(leaf X), it calls
          btrfs_next_leaf()
      
          btrfs_next_leaf()
            releases the path
      
                                                         adds key (257 INODE_REF 666)
                                                         to the end of leaf X (slot N),
                                                         and leaf X now has N + 1 items
      
            searches for the key (257 INODE_REF 256),
            with path->keep_locks == 1, because that
            is the last key it saw in leaf X before
            releasing the path
      
            ends up at leaf X again and it verifies
            that the key (257 INODE_REF 256) is no
            longer the last key in leaf X, so it
            returns with path->nodes[0] == leaf X
            and path->slots[0] == N, pointing to
            the new item with key (257 INODE_REF 666)
      
          btrfs_listxattr's loop iteration sees that
          the type of the key pointed by the path is
          different from the type BTRFS_XATTR_ITEM_KEY
          and so it breaks the loop and stops looking
          for more xattr items
            --> the application doesn't get any xattr
                listed for our inode
      
      So fix this by breaking the loop only if the key's type is greater than
      BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82c82fad
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 42c6cb15
      Filipe Manana authored
      commit 1d512cb7 upstream.
      
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      42c6cb15