1. 06 Jan, 2017 40 commits
    • Ondrej Kozina's avatar
      dm crypt: mark key as invalid until properly loaded · 26011e67
      Ondrej Kozina authored
      commit 265e9098 upstream.
      
      In crypt_set_key(), if a failure occurs while replacing the old key
      (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag
      set.  Otherwise, the crypto layer would have an invalid key that still
      has DM_CRYPT_KEY_VALID flag set.
      Signed-off-by: default avatarOndrej Kozina <okozina@redhat.com>
      Reviewed-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      26011e67
    • Wei Yongjun's avatar
      dm flakey: return -EINVAL on interval bounds error in flakey_ctr() · bd5fcd18
      Wei Yongjun authored
      commit bff7e067 upstream.
      
      Fix to return error code -EINVAL instead of 0, as is done elsewhere in
      this function.
      
      Fixes: e80d1c80 ("dm: do not override error code returned from dm_get_device()")
      Signed-off-by: default avatarWei Yongjun <weiyj.lk@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd5fcd18
    • Bart Van Assche's avatar
      dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device · 1ca66d6a
      Bart Van Assche authored
      commit 301fc3f5 upstream.
      
      When dm_table_set_type() is used by a target to establish a DM table's
      type (e.g. DM_TYPE_MQ_REQUEST_BASED in the case of DM multipath) the
      DM core must go on to verify that the devices in the table are
      compatible with the established type.
      
      Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ca66d6a
    • Mike Snitzer's avatar
      dm table: fix 'all_blk_mq' inconsistency when an empty table is loaded · d948d3b1
      Mike Snitzer authored
      commit 6936c12c upstream.
      
      An earlier DM multipath table could have been build ontop of underlying
      devices that were all using blk-mq.  In that case, if that active
      multipath table is replaced with an empty DM multipath table (that
      reflects all paths have failed) then it is important that the
      'all_blk_mq' state of the active table is transfered to the new empty DM
      table.  Otherwise dm-rq.c:dm_old_prep_tio() will incorrectly clone a
      request that isn't needed by the DM multipath target when it is to issue
      IO to an underlying blk-mq device.
      
      Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature")
      Reported-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d948d3b1
    • Bart Van Assche's avatar
      blk-mq: Do not invoke .queue_rq() for a stopped queue · 45f63111
      Bart Van Assche authored
      commit bc27c01b upstream.
      
      The meaning of the BLK_MQ_S_STOPPED flag is "do not call
      .queue_rq()". Hence modify blk_mq_make_request() such that requests
      are queued instead of issued if a queue has been stopped.
      Reported-by: default avatarMing Lei <tom.leiming@gmail.com>
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45f63111
    • Stephen Boyd's avatar
      PM / OPP: Pass opp_table to dev_pm_opp_put_regulator() · e3742a15
      Stephen Boyd authored
      commit 91291d9a upstream.
      
      Joonyoung Shim reported an interesting problem on his ARM octa-core
      Odoroid-XU3 platform. During system suspend, dev_pm_opp_put_regulator()
      was failing for a struct device for which dev_pm_opp_set_regulator() is
      called earlier.
      
      This happened because an earlier call to
      dev_pm_opp_of_cpumask_remove_table() function (from cpufreq-dt.c file)
      removed all the entries from opp_table->dev_list apart from the last CPU
      device in the cpumask of CPUs sharing the OPP.
      
      But both dev_pm_opp_set_regulator() and dev_pm_opp_put_regulator()
      routines get CPU device for the first CPU in the cpumask. And so the OPP
      core failed to find the OPP table for the struct device.
      
      This patch attempts to fix this problem by returning a pointer to the
      opp_table from dev_pm_opp_set_regulator() and using that as the
      parameter to dev_pm_opp_put_regulator(). This ensures that the
      dev_pm_opp_put_regulator() doesn't fail to find the opp table.
      
      Note that similar design problem also exists with other
      dev_pm_opp_put_*() APIs, but those aren't used currently by anyone and
      so we don't need to update them for now.
      Reported-by: default avatarJoonyoung Shim <jy0922.shim@samsung.com>
      Signed-off-by: default avatarStephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      [ Viresh: Wrote commit log and tested on exynos 5250 ]
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3742a15
    • Felipe Balbi's avatar
      usb: gadget: composite: always set ep->mult to a sensible value · 8b63a922
      Felipe Balbi authored
      commit eaa496ff upstream.
      
      ep->mult is supposed to be set to Isochronous and
      Interrupt Endapoint's multiplier value. This value
      is computed from different places depending on the
      link speed.
      
      If we're dealing with HighSpeed, then it's part of
      bits [12:11] of wMaxPacketSize. This case wasn't
      taken into consideration before.
      
      While at that, also make sure the ep->mult defaults
      to one so drivers can use it unconditionally and
      assume they'll never multiply ep->maxpacket to zero.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b63a922
    • Mel Gorman's avatar
      mm, page_alloc: keep pcp count and list contents in sync if struct page is corrupted · d4f4b2e6
      Mel Gorman authored
      commit a6de734b upstream.
      
      Vlastimil Babka pointed out that commit 479f854a ("mm, page_alloc:
      defer debugging checks of pages allocated from the PCP") will allow the
      per-cpu list counter to be out of sync with the per-cpu list contents if
      a struct page is corrupted.
      
      The consequence is an infinite loop if the per-cpu lists get fully
      drained by free_pcppages_bulk because all the lists are empty but the
      count is positive.  The infinite loop occurs here
      
                      do {
                              batch_free++;
                              if (++migratetype == MIGRATE_PCPTYPES)
                                      migratetype = 0;
                              list = &pcp->lists[migratetype];
                      } while (list_empty(list));
      
      What the user sees is a bad page warning followed by a soft lockup with
      interrupts disabled in free_pcppages_bulk().
      
      This patch keeps the accounting in sync.
      
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f4b2e6
    • Shaohua Li's avatar
      mm/vmscan.c: set correct defer count for shrinker · 0927d281
      Shaohua Li authored
      commit 5f33a080 upstream.
      
      Our system uses significantly more slab memory with memcg enabled with
      the latest kernel.  With 3.10 kernel, slab uses 2G memory, while with
      4.6 kernel, 6G memory is used.  The shrinker has problem.  Let's see we
      have two memcg for one shrinker.  In do_shrink_slab:
      
      1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  batch size
         is 1024, then no memory is freed.  nr_deferred = 700
      
      2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
         total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
         nr_deferred = 10.
      
      The deferred share of cg1 is lost in this case.  kswapd will free no
      memory even run above steps again and again.
      
      The fix makes sure one memcg's deferred share isn't lost.
      
      Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.comSigned-off-by: default avatarShaohua Li <shli@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0927d281
    • Solganik Alexander's avatar
      nvmet: Fix possible infinite loop triggered on hot namespace removal · 3e0ef1b8
      Solganik Alexander authored
      commit e4fcf07c upstream.
      
      When removing a namespace we delete it from the subsystem namespaces
      list with list_del_init which allows us to know if it is enabled or
      not.
      
      The problem is that list_del_init initialize the list next and does
      not respect the RCU list-traversal we do on the IO path for locating
      a namespace. Instead we need to use list_del_rcu which is allowed to
      run concurrently with the _rcu list-traversal primitives (keeps list
      next intact) and guarantees concurrent nvmet_find_naespace forward
      progress.
      
      By changing that, we cannot rely on ns->dev_link for knowing if the
      namspace is enabled, so add enabled indicator entry to nvmet_ns for
      that.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarSolganik Alexander <sashas@lightbitslabs.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e0ef1b8
    • Omar Sandoval's avatar
      loop: return proper error from loop_queue_rq() · 6290a3bc
      Omar Sandoval authored
      commit b4a567e8 upstream.
      
      ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
      an errno.
      
      Fixes: f4aa4c7b ("block: loop: convert to per-device workqueue")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6290a3bc
    • Jaegeuk Kim's avatar
      f2fs: fix overflow due to condition check order · bf0f0207
      Jaegeuk Kim authored
      commit e87f7329 upstream.
      
      In the last ilen case, i was already increased, resulting in accessing out-
      of-boundary entry of do_replace and blkaddr.
      Fix to check ilen first to exit the loop.
      
      Fixes: 2aa8fbb9693020 ("f2fs: refactor __exchange_data_block for speed up")
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf0f0207
    • Nicolai Stange's avatar
      f2fs: set ->owner for debugfs status file's file_operations · 154d83a8
      Nicolai Stange authored
      commit 05e6ea26 upstream.
      
      The struct file_operations instance serving the f2fs/status debugfs file
      lacks an initialization of its ->owner.
      
      This means that although that file might have been opened, the f2fs module
      can still get removed. Any further operation on that opened file, releasing
      included,  will cause accesses to unmapped memory.
      
      Indeed, Mike Marshall reported the following:
      
        BUG: unable to handle kernel paging request at ffffffffa0307430
        IP: [<ffffffff8132a224>] full_proxy_release+0x24/0x90
        <...>
        Call Trace:
         [] __fput+0xdf/0x1d0
         [] ____fput+0xe/0x10
         [] task_work_run+0x8e/0xc0
         [] do_exit+0x2ae/0xae0
         [] ? __audit_syscall_entry+0xae/0x100
         [] ? syscall_trace_enter+0x1ca/0x310
         [] do_group_exit+0x44/0xc0
         [] SyS_exit_group+0x14/0x20
         [] do_syscall_64+0x61/0x150
         [] entry_SYSCALL64_slow_path+0x25/0x25
        <...>
        ---[ end trace f22ae883fa3ea6b8 ]---
        Fixing recursive fault but reboot is needed!
      
      Fix this by initializing the f2fs/status file_operations' ->owner with
      THIS_MODULE.
      
      This will allow debugfs to grab a reference to the f2fs module upon any
      open on that file, thus preventing it from getting removed.
      
      Fixes: 902829aa ("f2fs: move proc files to debugfs")
      Reported-by: default avatarMike Marshall <hubcap@omnibond.com>
      Reported-by: default avatarMartin Brandenburg <martin@omnibond.com>
      Signed-off-by: default avatarNicolai Stange <nicstange@gmail.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      154d83a8
    • Jaegeuk Kim's avatar
      Revert "f2fs: use percpu_counter for # of dirty pages in inode" · 67e5239c
      Jaegeuk Kim authored
      commit 204706c7 upstream.
      
      This reverts commit 1beba1b3.
      
      The perpcu_counter doesn't provide atomicity in single core and consume more
      DRAM. That incurs fs_mark test failure due to ENOMEM.
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67e5239c
    • Sergey Karamov's avatar
      ext4: do not perform data journaling when data is encrypted · d06eaf28
      Sergey Karamov authored
      commit 73b92a2a upstream.
      
      Currently data journalling is incompatible with encryption: enabling both
      at the same time has never been supported by design, and would result in
      unpredictable behavior. However, users are not precluded from turning on
      both features simultaneously. This change programmatically replaces data
      journaling for encrypted regular files with ordered data journaling mode.
      
      Background:
      Journaling encrypted data has not been supported because it operates on
      buffer heads of the page in the page cache. Namely, when the commit
      happens, which could be up to five seconds after caching, the commit
      thread uses the buffer heads attached to the page to copy the contents of
      the page to the journal. With encryption, it would have been required to
      keep the bounce buffer with ciphertext for up to the aforementioned five
      seconds, since the page cache can only hold plaintext and could not be
      used for journaling. Alternatively, it would be required to setup the
      journal to initiate a callback at the commit time to perform deferred
      encryption - in this case, not only would the data have to be written
      twice, but it would also have to be encrypted twice. This level of
      complexity was not justified for a mode that in practice is very rarely
      used because of the overhead from the data journalling.
      
      Solution:
      If data=journaled has been set as a mount option for a filesystem, or if
      journaling is enabled on a regular file, do not perform journaling if the
      file is also encrypted, instead fall back to the data=ordered mode for the
      file.
      
      Rationale:
      The intent is to allow seamless and proper filesystem operation when
      journaling and encryption have both been enabled, and have these two
      conflicting features gracefully resolved by the filesystem.
      
      Fixes: 44614711Signed-off-by: default avatarSergey Karamov <skaramov@google.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d06eaf28
    • Dan Carpenter's avatar
      ext4: return -ENOMEM instead of success · e33673be
      Dan Carpenter authored
      commit 578620f4 upstream.
      
      We should set the error code if kzalloc() fails.
      
      Fixes: 67cf5b09 ("ext4: add the basic function for inline data support")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e33673be
    • Darrick J. Wong's avatar
      ext4: reject inodes with negative size · 36648770
      Darrick J. Wong authored
      commit 7e6e1ef4 upstream.
      
      Don't load an inode with a negative size; this causes integer overflow
      problems in the VFS.
      
      [ Added EXT4_ERROR_INODE() to mark file system as corrupted. -TYT]
      
      Fixes: a48380f7 (ext4: rename i_dir_acl to i_size_high)
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      36648770
    • Theodore Ts'o's avatar
      ext4: add sanity checking to count_overhead() · 1bfcffbb
      Theodore Ts'o authored
      commit c48ae41b upstream.
      
      The commit "ext4: sanity check the block and cluster size at mount
      time" should prevent any problems, but in case the superblock is
      modified while the file system is mounted, add an extra safety check
      to make sure we won't overrun the allocated buffer.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bfcffbb
    • Theodore Ts'o's avatar
      ext4: fix in-superblock mount options processing · 9689eb99
      Theodore Ts'o authored
      commit 5aee0f8a upstream.
      
      Fix a large number of problems with how we handle mount options in the
      superblock.  For one, if the string in the superblock is long enough
      that it is not null terminated, we could run off the end of the string
      and try to interpret superblocks fields as characters.  It's unlikely
      this will cause a security problem, but it could result in an invalid
      parse.  Also, parse_options is destructive to the string, so in some
      cases if there is a comma-separated string, it would be modified in
      the superblock.  (Fortunately it only happens on file systems with a
      1k block size.)
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9689eb99
    • Theodore Ts'o's avatar
      ext4: use more strict checks for inodes_per_block on mount · 52a9daa3
      Theodore Ts'o authored
      commit cd6bb35b upstream.
      
      Centralize the checks for inodes_per_block and be more strict to make
      sure the inodes_per_block_group can't end up being zero.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52a9daa3
    • Chandan Rajendra's avatar
      ext4: fix stack memory corruption with 64k block size · 75055843
      Chandan Rajendra authored
      commit 30a9d7af upstream.
      
      The number of 'counters' elements needed in 'struct sg' is
      super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
      elements in the array. This is insufficient for block sizes >= 32k. In
      such cases the memcpy operation performed in ext4_mb_seq_groups_show()
      would cause stack memory corruption.
      
      Fixes: c9de560dSigned-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75055843
    • Chandan Rajendra's avatar
      ext4: fix mballoc breakage with 64k block size · 86efd99f
      Chandan Rajendra authored
      commit 69e43e8c upstream.
      
      'border' variable is set to a value of 2 times the block size of the
      underlying filesystem. With 64k block size, the resulting value won't
      fit into a 16-bit variable. Hence this commit changes the data type of
      'border' to 'unsigned int'.
      
      Fixes: c9de560dSigned-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86efd99f
    • Alex Porosanu's avatar
      crypto: caam - fix AEAD givenc descriptors · 8022387d
      Alex Porosanu authored
      commit d128af17 upstream.
      
      The AEAD givenc descriptor relies on moving the IV through the
      output FIFO and then back to the CTX2 for authentication. The
      SEQ FIFO STORE could be scheduled before the data can be
      read from OFIFO, especially since the SEQ FIFO LOAD needs
      to wait for the SEQ FIFO LOAD SKIP to finish first. The
      SKIP takes more time when the input is SG than when it's
      a contiguous buffer. If the SEQ FIFO LOAD is not scheduled
      before the STORE, the DECO will hang waiting for data
      to be available in the OFIFO so it can be transferred to C2.
      In order to overcome this, first force transfer of IV to C2
      by starting the "cryptlen" transfer first and then starting to
      store data from OFIFO to the output buffer.
      
      Fixes: 1acebad3 ("crypto: caam - faster aead implementation")
      Signed-off-by: default avatarAlex Porosanu <alexandru.porosanu@nxp.com>
      Signed-off-by: default avatarHoria Geantă <horia.geanta@nxp.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8022387d
    • Eric W. Biederman's avatar
      ptrace: Capture the ptracer's creds not PT_PTRACE_CAP · ade692b8
      Eric W. Biederman authored
      commit 64b875f7 upstream.
      
      When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
      overlooked.  This can result in incorrect behavior when an application
      like strace traces an exec of a setuid executable.
      
      Further PT_PTRACE_CAP does not have enough information for making good
      security decisions as it does not report which user namespace the
      capability is in.  This has already allowed one mistake through
      insufficient granulariy.
      
      I found this issue when I was testing another corner case of exec and
      discovered that I could not get strace to set PT_PTRACE_CAP even when
      running strace as root with a full set of caps.
      
      This change fixes the above issue with strace allowing stracing as
      root a setuid executable without disabling setuid.  More fundamentaly
      this change allows what is allowable at all times, by using the correct
      information in it's decision.
      
      Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ade692b8
    • Linus Torvalds's avatar
      vfs,mm: fix return value of read() at s_maxbytes · 23d179ac
      Linus Torvalds authored
      commit d05c5f7b upstream.
      
      We truncated the possible read iterator to s_maxbytes in commit
      c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
      but our end condition handling was wrong: it's not an error to try to
      read at the end of the file.
      
      Reading past the end should return EOF (0), not EINVAL.
      
      See for example
      
        https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
        http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html
      
      where a md5sum of a maximally sized file fails because the final read is
      exactly at s_maxbytes.
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      Reported-by: default avatarJoseph Salisbury <joseph.salisbury@canonical.com>
      Cc: Wei Fang <fangwei1@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23d179ac
    • Eric W. Biederman's avatar
      mm: Add a user_ns owner to mm_struct and fix ptrace permission checks · e45692fa
      Eric W. Biederman authored
      commit bfedb589 upstream.
      
      During exec dumpable is cleared if the file that is being executed is
      not readable by the user executing the file.  A bug in
      ptrace_may_access allows reading the file if the executable happens to
      enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
      unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
      
      This problem is fixed with only necessary userspace breakage by adding
      a user namespace owner to mm_struct, captured at the time of exec, so
      it is clear in which user namespace CAP_SYS_PTRACE must be present in
      to be able to safely give read permission to the executable.
      
      The function ptrace_may_access is modified to verify that the ptracer
      has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
      This ensures that if the task changes it's cred into a subordinate
      user namespace it does not become ptraceable.
      
      The function ptrace_attach is modified to only set PT_PTRACE_CAP when
      CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
      PT_PTRACE_CAP is to be a flag to note that whatever permission changes
      the task might go through the tracer has sufficient permissions for
      it not to be an issue.  task->cred->user_ns is always the same
      as or descendent of mm->user_ns.  Which guarantees that having
      CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
      credentials.
      
      To prevent regressions mm->dumpable and mm->user_ns are not considered
      when a task has no mm.  As simply failing ptrace_may_attach causes
      regressions in privileged applications attempting to read things
      such as /proc/<pid>/stat
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Fixes: 8409cca7 ("userns: allow ptrace from non-init user namespaces")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e45692fa
    • NeilBrown's avatar
      block_dev: don't test bdev->bd_contains when it is not stable · 04804d83
      NeilBrown authored
      commit bcc7f5b4 upstream.
      
      bdev->bd_contains is not stable before calling __blkdev_get().
      When __blkdev_get() is called on a parition with ->bd_openers == 0
      it sets
        bdev->bd_contains = bdev;
      which is not correct for a partition.
      After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
      and then ->bd_contains is stable.
      
      When FMODE_EXCL is used, blkdev_get() calls
         bd_start_claiming() ->  bd_prepare_to_claim() -> bd_may_claim()
      
      This call happens before __blkdev_get() is called, so ->bd_contains
      is not stable.  So bd_may_claim() cannot safely use ->bd_contains.
      It currently tries to use it, and this can lead to a BUG_ON().
      
      This happens when a whole device is already open with a bd_holder (in
      use by dm in my particular example) and two threads race to open a
      partition of that device for the first time, one opening with O_EXCL and
      one without.
      
      The thread that doesn't use O_EXCL gets through blkdev_get() to
      __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;
      
      Immediately thereafter the other thread, using FMODE_EXCL, calls
      bd_start_claiming() from blkdev_get().  This should fail because the
      whole device has a holder, but because bdev->bd_contains == bdev
      bd_may_claim() incorrectly reports success.
      This thread continues and blocks on bd_mutex.
      
      The first thread then sets bdev->bd_contains correctly and drops the mutex.
      The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
      again in:
      			BUG_ON(!bd_may_claim(bdev, whole, holder));
      The BUG_ON fires.
      
      Fix this by removing the dependency on ->bd_contains in
      bd_may_claim().  As bd_may_claim() has direct access to the whole
      device, it can simply test if the target bdev is the whole device.
      
      Fixes: 6b4517a7 ("block: implement bd_claiming and claiming block")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      04804d83
    • Aleksa Sarai's avatar
      fs: exec: apply CLOEXEC before changing dumpable task flags · 52d69727
      Aleksa Sarai authored
      commit 613cc2b6 upstream.
      
      If you have a process that has set itself to be non-dumpable, and it
      then undergoes exec(2), any CLOEXEC file descriptors it has open are
      "exposed" during a race window between the dumpable flags of the process
      being reset for exec(2) and CLOEXEC being applied to the file
      descriptors. This can be exploited by a process by attempting to access
      /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
      
      The race in question is after set_dumpable has been (for get_link,
      though the trace is basically the same for readlink):
      
      [vfs]
      -> proc_pid_link_inode_operations.get_link
         -> proc_pid_get_link
            -> proc_fd_access_allowed
               -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
      
      Which will return 0, during the race window and CLOEXEC file descriptors
      will still be open during this window because do_close_on_exec has not
      been called yet. As a result, the ordering of these calls should be
      reversed to avoid this race window.
      
      This is of particular concern to container runtimes, where joining a
      PID namespace with file descriptors referring to the host filesystem
      can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
      against access of CLOEXEC file descriptors -- file descriptors which may
      reference filesystem objects the container shouldn't have access to).
      
      Cc: dev@opencontainers.org
      Reported-by: default avatarMichael Crosby <crosbymichael@gmail.com>
      Signed-off-by: default avatarAleksa Sarai <asarai@suse.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52d69727
    • Eric W. Biederman's avatar
      exec: Ensure mm->user_ns contains the execed files · 781e976a
      Eric W. Biederman authored
      commit f84df2a6 upstream.
      
      When the user namespace support was merged the need to prevent
      ptrace from revealing the contents of an unreadable executable
      was overlooked.
      
      Correct this oversight by ensuring that the executed file
      or files are in mm->user_ns, by adjusting mm->user_ns.
      
      Use the new function privileged_wrt_inode_uidgid to see if
      the executable is a member of the user namespace, and as such
      if having CAP_SYS_PTRACE in the user namespace should allow
      tracing the executable.  If not update mm->user_ns to
      the parent user namespace until an appropriate parent is found.
      Reported-by: default avatarJann Horn <jann@thejh.net>
      Fixes: 9e4a36ec ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      781e976a
    • Wang Xiaoguang's avatar
      btrfs: make file clone aware of fatal signals · fc1d3e5f
      Wang Xiaoguang authored
      commit 69ae5e44 upstream.
      
      Indeed this just make the behavior similar to xfs when process has
      fatal signals pending, and it'll make fstests/generic/298 happy.
      Signed-off-by: default avatarWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc1d3e5f
    • Filipe Manana's avatar
      Btrfs: fix incremental send failure caused by balance · 8c59356c
      Filipe Manana authored
      commit d5e84fd8 upstream.
      
      Commit 95155585 ("Btrfs: send, don't bug on inconsistent snapshots")
      removed some BUG_ON() statements (replacing them with returning errors
      to user space and logging error messages) when a snapshot is in an
      inconsistent state due to failures to update a delayed inode item (ENOMEM
      or ENOSPC) after adding/updating/deleting references, xattrs or file
      extent items.
      
      However there is a case, when no errors happen, where a file extent item
      can be modified without having the corresponding inode item updated. This
      case happens during balance under very specific timings, when relocation
      is in the stage where it updates data pointers and a leaf that contains
      file extent items is COWed. When that happens file extent items get their
      disk_bytenr field updated to a new value that reflects the post relocation
      logical address of the extent, without updating their respective inode
      items (as there is nothing that needs to be updated on them). This is
      performed at relocation.c:replace_file_extents() through
      relocation.c:btrfs_reloc_cow_block().
      
      So make an incremental send deal with this case and don't do any processing
      for a file extent item that got its disk_bytenr field updated by relocation,
      since the extent's data is the same as the one pointed by the file extent
      item in the parent snapshot.
      
      After the recent commit mentioned above this case resulted in EIO errors
      returned to user space (and an error message logged to dmesg/syslog) when
      doing an incremental send, while before it, it resulted in hitting a
      BUG_ON leading to the following trace:
      
      [  952.206705] ------------[ cut here ]------------
      [  952.206714] kernel BUG at ../fs/btrfs/send.c:5653!
      [  952.206719] Internal error: Oops - BUG: 0 [#1] SMP
      [  952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common virtio_pci virtio_ring virtio drm sg efivarfs
      [  952.228333] Supported: Yes
      [  952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1
      [  952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [  952.231683] task: ffff800058e94100 ti: ffff8000d866c000 task.ti: ffff8000d866c000
      [  952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs]
      [  952.234375] LR is at changed_cb+0x58/0xa48 [btrfs]
      [  952.236552] pc : [<ffff7ffffc39de7c>] lr : [<ffff7ffffc39d4e0>] pstate: 80000145
      [  952.238049] sp : ffff8000d866fa20
      [  952.238732] x29: ffff8000d866fa20 x28: 0000000000000019
      [  952.239840] x27: 00000000000028d5 x26: 00000000000024a2
      [  952.241008] x25: 0000000000000002 x24: ffff8000e66e92f0
      [  952.242131] x23: ffff8000b8c76800 x22: ffff800092879140
      [  952.243238] x21: 0000000000000002 x20: ffff8000d866fb78
      [  952.244348] x19: ffff8000b8f8c200 x18: 0000000000002710
      [  952.245607] x17: 0000ffff90d42480 x16: ffff800000237dc0
      [  952.246719] x15: 0000ffff90de7510 x14: ab000c000a2faf08
      [  952.247835] x13: 0000000000577c2b x12: ab000c000b696665
      [  952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f
      [  952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671
      [  952.251352] x7 : 8000000000577c2b x6 : ffff800053eadf45
      [  952.252468] x5 : 0000000000000000 x4 : ffff80005e169494
      [  952.253582] x3 : 0000000000000004 x2 : ffff8000d866fb78
      [  952.254695] x1 : 000000000003e2a3 x0 : 000000000003e2a4
      [  952.255803]
      [  952.256150] Process snapperd (pid: 12779, stack limit = 0xffff8000d866c020)
      [  952.257516] Stack: (0xffff8000d866fa20 to 0xffff8000d8670000)
      [  952.258654] fa20: ffff8000d866fae0 ffff7ffffc308fc0 ffff800092879140 ffff8000e66e92f0
      [  952.260219] fa40: 0000000000000035 ffff800055de6000 ffff8000b8c76800 ffff8000d866fb78
      [  952.261745] fa60: 0000000000000002 00000000000024a2 00000000000028d5 0000000000000019
      [  952.263269] fa80: ffff8000d866fae0 ffff7ffffc3090f0 ffff8000d866fae0 ffff7ffffc309128
      [  952.264797] faa0: ffff800092879140 ffff8000e66e92f0 0000000000000035 ffff800055de6000
      [  952.268261] fac0: ffff8000b8c76800 ffff8000d866fb78 0000000000000002 0000000000001000
      [  952.269822] fae0: ffff8000d866fbc0 ffff7ffffc39ecfc ffff8000b8f8c200 ffff8000b8f8c368
      [  952.271368] fb00: ffff8000b8f8c378 ffff800055de6000 0000000000000001 ffff8000ecb17500
      [  952.272893] fb20: ffff8000b8c76800 ffff800092879140 ffff800062b6d000 ffff80007a9e2470
      [  952.274420] fb40: ffff8000b8f8c208 0000000005784000 ffff8000580a8000 ffff8000b8f8c200
      [  952.276088] fb60: ffff7ffffc39d488 00000002b8f8c368 0000000000000000 000000000003e2a4
      [  952.280275] fb80: 000000000000006c ffff7ffffc39ec00 000000000003e2a4 000000000000006c
      [  952.283219] fba0: ffff8000b8f8c300 0000000000000100 0000000000000001 ffff8000ecb17500
      [  952.286166] fbc0: ffff8000d866fcd0 ffff7ffffc3643c0 ffff8000f8842700 0000ffff8ffe9278
      [  952.289136] fbe0: 0000000040489426 ffff800055de6000 0000ffff8ffe9278 0000000040489426
      [  952.292083] fc00: 000000000000011d 000000000000001d ffff80007a9e4598 ffff80007a9e43e8
      [  952.294959] fc20: ffff8000b8c7693f 0000000000003b24 0000000000000019 ffff8000b8f8c218
      [  952.301161] fc40: 00000001d866fc70 ffff8000b8c76800 0000000000000128 ffffffffffffff84
      [  952.305749] fc60: ffff800058e941ff 0000000000003a58 ffff8000d866fcb0 ffff8000000f7390
      [  952.308875] fc80: 000000000000012a 0000000000010290 ffff8000d866fc00 000000000000007b
      [  952.311915] fca0: 0000000000010290 ffff800046c1b100 74732d7366727462 000001006d616572
      [  952.314937] fcc0: ffff8000fffc4100 cb88537fdc8ba60e ffff8000d866fe10 ffff8000002499e8
      [  952.318008] fce0: 0000000040489426 ffff8000f8842700 0000ffff8ffe9278 ffff80007a9e4598
      [  952.321321] fd00: 0000ffff8ffe9278 0000000040489426 000000000000011d 000000000000001d
      [  952.324280] fd20: ffff80000072c000 ffff8000d866c000 ffff8000d866fda0 ffff8000000e997c
      [  952.327156] fd40: ffff8000fffc4180 00000000000031ed ffff8000fffc4180 ffff800046c1b7d4
      [  952.329895] fd60: 0000000000000140 0000ffff907ea170 000000000000011d 00000000000000dc
      [  952.334641] fd80: ffff80000072c000 ffff8000d866c000 0000000000000000 0000000000000002
      [  952.338002] fda0: ffff8000d866fdd0 ffff8000000ebacc ffff800046c1b080 ffff800046c1b7d4
      [  952.340724] fdc0: ffff8000d866fdf0 ffff8000000db67c 0000000000000040 ffff800000e69198
      [  952.343415] fde0: 0000ffff8ffea790 00000000000031ed ffff8000d866fe20 ffff800000254000
      [  952.346101] fe00: 000000000000001d 0000000000000004 ffff8000d866fe90 ffff800000249d3c
      [  952.348980] fe20: ffff8000f8842700 0000000000000000 ffff8000f8842701 0000000000000008
      [  952.351696] fe40: ffff8000d866fe70 0000000000000008 ffff8000d866fe90 ffff800000249cf8
      [  952.354387] fe60: ffff8000f8842700 0000ffff8ffe9170 ffff8000f8842701 0000000000000008
      [  952.357083] fe80: 0000ffff8ffe9278 ffff80008ff85500 0000ffff8ffe90c0 ffff800000085c84
      [  952.359800] fea0: 0000000000000000 0000ffff8ffe9170 ffffffffffffffff 0000ffff90d473bc
      [  952.365351] fec0: 0000000000000000 0000000000000015 0000000000000008 0000000040489426
      [  952.369550] fee0: 0000ffff8ffe9278 0000ffff907ea790 0000ffff907ea170 0000ffff907ea790
      [  952.372416] ff00: 0000ffff907ea170 0000000000000000 000000000000001d 0000000000000004
      [  952.375223] ff20: 0000ffff90a32220 00000000003d0f00 0000ffff907ea0a0 0000ffff8ffe8f30
      [  952.378099] ff40: 0000ffff9100f554 0000ffff91147000 0000ffff91117bc0 0000ffff90d473b0
      [  952.381115] ff60: 0000ffff9100f620 0000ffff880069b0 0000ffff8ffe9170 0000ffff8ffe91a0
      [  952.384003] ff80: 0000ffff8ffe9160 0000ffff8ffe9140 0000ffff88006990 0000ffff8ffe9278
      [  952.386860] ffa0: 0000ffff88008a60 0000ffff8ffe9480 0000ffff88014ca0 0000ffff8ffe90c0
      [  952.389654] ffc0: 0000ffff910be8e8 0000ffff8ffe90c0 0000ffff90d473bc 0000000000000000
      [  952.410986] ffe0: 0000000000000008 000000000000001d 6e2079747265706f 72616d223d656d61
      [  952.415497] Call trace:
      [  952.417403] [<ffff7ffffc39de7c>] changed_cb+0x9f4/0xa48 [btrfs]
      [  952.420023] [<ffff7ffffc308fc0>] btrfs_compare_trees+0x500/0x6b0 [btrfs]
      [  952.422759] [<ffff7ffffc39ecfc>] btrfs_ioctl_send+0xb4c/0xe10 [btrfs]
      [  952.425601] [<ffff7ffffc3643c0>] btrfs_ioctl+0x374/0x29a4 [btrfs]
      [  952.428031] [<ffff8000002499e8>] do_vfs_ioctl+0x33c/0x600
      [  952.430360] [<ffff800000249d3c>] SyS_ioctl+0x90/0xa4
      [  952.432552] [<ffff800000085c84>] el0_svc_naked+0x38/0x3c
      [  952.434803] Code: 2a1503e0 17fffdac b9404282 17ffff28 (d4210000)
      [  952.437457] ---[ end trace 9afd7090c466cf15 ]---
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c59356c
    • Josef Bacik's avatar
      Btrfs: don't BUG() during drop snapshot · 02fffa11
      Josef Bacik authored
      commit 4867268c upstream.
      
      Really there's lots of things that can go wrong here, kill all the
      BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO
      instead.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      [ switched to btrfs_err, errors go to common label ]
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      02fffa11
    • Anand Jain's avatar
      btrfs: fix a possible umount deadlock · 0f2e022d
      Anand Jain authored
      commit 0ccd0528 upstream.
      
      btrfs_show_devname() is using the device_list_mutex, sometimes
      a call to blkdev_put() leads vfs calling into this func. So
      call blkdev_put() outside of device_list_mutex, as of now.
      
      [  983.284212] ======================================================
      [  983.290401] [ INFO: possible circular locking dependency detected ]
      [  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
      [  983.302081] -------------------------------------------------------
      [  983.308357] umount/21720 is trying to acquire lock:
      [  983.313243]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150
      [  983.321264]
      [  983.321264] but task is already holding lock:
      [  983.327101]  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
      [  983.337839]
      [  983.337839] which lock already depends on the new lock.
      [  983.337839]
      [  983.346024]
      [  983.346024] the existing dependency chain (in reverse order) is:
      [  983.353512]
      -> #4 (&fs_devs->device_list_mutex){+.+...}:
      [  983.359096]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.365143]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
      [  983.371521]        [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs]
      [  983.378710]        [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150
      [  983.384593]        [<ffffffff9126ffc7>] m_show+0x17/0x20
      [  983.389957]        [<ffffffff91276405>] seq_read+0x2b5/0x3b0
      [  983.395669]        [<ffffffff9124c808>] __vfs_read+0x28/0x100
      [  983.401464]        [<ffffffff9124eb3b>] vfs_read+0xab/0x150
      [  983.407080]        [<ffffffff9124ec32>] SyS_read+0x52/0xb0
      [  983.412609]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
      [  983.419617]
      -> #3 (namespace_sem){++++++}:
      [  983.424024]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.430074]        [<ffffffff918239e9>] down_write+0x49/0x80
      [  983.435785]        [<ffffffff91272457>] lock_mount+0x67/0x1c0
      [  983.441582]        [<ffffffff91272ab2>] do_add_mount+0x32/0xf0
      [  983.447458]        [<ffffffff9127363a>] finish_automount+0x5a/0xc0
      [  983.453682]        [<ffffffff91259513>] follow_managed+0x1b3/0x2a0
      [  983.459912]        [<ffffffff9125b750>] lookup_fast+0x300/0x350
      [  983.465875]        [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0
      [  983.471846]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
      [  983.477731]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
      [  983.483702]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
      [  983.489240]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
      [  983.496254]
      -> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}:
      [  983.501798]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.507855]        [<ffffffff918239e9>] down_write+0x49/0x80
      [  983.513558]        [<ffffffff91366237>] start_creating+0x87/0x100
      [  983.519703]        [<ffffffff91366647>] debugfs_create_dir+0x17/0x100
      [  983.526195]        [<ffffffff911df153>] bdi_register+0x93/0x210
      [  983.532165]        [<ffffffff911df313>] bdi_register_owner+0x43/0x70
      [  983.538570]        [<ffffffff914080fb>] device_add_disk+0x1fb/0x450
      [  983.544888]        [<ffffffff91580226>] loop_add+0x1e6/0x290
      [  983.550596]        [<ffffffff91fec358>] loop_init+0x10b/0x14f
      [  983.556394]        [<ffffffff91002207>] do_one_initcall+0xa7/0x180
      [  983.562618]        [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266
      [  983.569370]        [<ffffffff918174be>] kernel_init+0xe/0x100
      [  983.575166]        [<ffffffff9182620f>] ret_from_fork+0x1f/0x40
      [  983.581131]
      -> #1 (loop_index_mutex){+.+.+.}:
      [  983.585801]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.591858]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
      [  983.598256]        [<ffffffff9157ed3f>] lo_open+0x1f/0x60
      [  983.603704]        [<ffffffff9128eec3>] __blkdev_get+0x123/0x400
      [  983.609757]        [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350
      [  983.615639]        [<ffffffff9128f554>] blkdev_open+0x64/0x80
      [  983.621428]        [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0
      [  983.627651]        [<ffffffff9124c029>] vfs_open+0x69/0x80
      [  983.633181]        [<ffffffff9125db74>] path_openat+0x834/0xaa0
      [  983.639152]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
      [  983.645035]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
      [  983.650999]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
      [  983.656535]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
      [  983.663541]
      -> #0 (&bdev->bd_mutex){+.+.+.}:
      [  983.668107]        [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
      [  983.674510]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.680561]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
      [  983.686967]        [<ffffffff9128ec51>] blkdev_put+0x31/0x150
      [  983.692761]        [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
      [  983.699699]        [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
      [  983.707178]        [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
      [  983.714380]        [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
      [  983.721061]        [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
      [  983.727908]        [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
      [  983.734744]        [<ffffffff91250f56>] kill_anon_super+0x16/0x30
      [  983.740888]        [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
      [  983.747909]        [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
      [  983.754745]        [<ffffffff912515fd>] deactivate_super+0x5d/0x70
      [  983.760977]        [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
      [  983.766773]        [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
      [  983.772738]        [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
      [  983.778708]        [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
      [  983.785373]        [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
      [  983.792212]        [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1
      [  983.799225]
      [  983.799225] other info that might help us debug this:
      [  983.799225]
      [  983.807291] Chain exists of:
        &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex
      
      [  983.816521]  Possible unsafe locking scenario:
      [  983.816521]
      [  983.822489]        CPU0                    CPU1
      [  983.827043]        ----                    ----
      [  983.831599]   lock(&fs_devs->device_list_mutex);
      [  983.836289]                                lock(namespace_sem);
      [  983.842268]                                lock(&fs_devs->device_list_mutex);
      [  983.849478]   lock(&bdev->bd_mutex);
      [  983.853127]
      [  983.853127]  *** DEADLOCK ***
      [  983.853127]
      [  983.859113] 3 locks held by umount/21720:
      [  983.863145]  #0:  (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70
      [  983.872713]  #1:  (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs]
      [  983.882206]  #2:  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
      [  983.893422]
      [  983.893422] stack backtrace:
      [  983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1
      [  983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
      [  983.913492]  0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0
      [  983.921018]  ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050
      [  983.928542]  ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0
      [  983.936072] Call Trace:
      [  983.938545]  [<ffffffff91429521>] dump_stack+0x85/0xc4
      [  983.943715]  [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c
      [  983.949748]  [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
      [  983.955613]  [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
      [  983.961123]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
      [  983.966550]  [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
      [  983.972407]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
      [  983.977832]  [<ffffffff9128ec51>] blkdev_put+0x31/0x150
      [  983.983101]  [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
      [  983.989500]  [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
      [  983.996415]  [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
      [  984.003068]  [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
      [  984.009189]  [<ffffffff9126cc5e>] ? evict_inodes+0x15e/0x170
      [  984.014881]  [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
      [  984.021176]  [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
      [  984.027476]  [<ffffffff91250f56>] kill_anon_super+0x16/0x30
      [  984.033082]  [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
      [  984.039548]  [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
      [  984.045839]  [<ffffffff912515fd>] deactivate_super+0x5d/0x70
      [  984.051525]  [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
      [  984.056774]  [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
      [  984.062201]  [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
      [  984.067625]  [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
      [  984.073747]  [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
      [  984.080038]  [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1
      Reported-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f2e022d
    • Liu Bo's avatar
      Btrfs: fix memory leak in do_walk_down · 65563ab7
      Liu Bo authored
      commit a958eab0 upstream.
      
      The extent buffer 'next' needs to be free'd conditionally.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      65563ab7
    • Jeff Mahoney's avatar
      btrfs: clean the old superblocks before freeing the device · 364b85c5
      Jeff Mahoney authored
      commit cea67ab9 upstream.
      
      btrfs_rm_device frees the block device but then re-opens it using
      the saved device name.  A race exists between the close and the
      re-open that allows the block size to be changed.  The result
      is getting stuck forever in the reclaim loop in __getblk_slow.
      
      This patch moves the superblock cleanup before closing the block
      device, which is also consistent with other callers.  We also don't
      need a private copy of dev_name as the whole routine operates under
      the uuid_mutex.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      364b85c5
    • Josef Bacik's avatar
      Btrfs: don't leak reloc root nodes on error · 6a6e9276
      Josef Bacik authored
      commit 6bdf131f upstream.
      
      We don't track the reloc roots in any sort of normal way, so the only way the
      root/commit_root nodes get free'd is if the relocation finishes successfully and
      the reloc root is deleted.  Fix this by free'ing them in free_reloc_roots.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a6e9276
    • Liu Bo's avatar
      Btrfs: return gracefully from balance if fs tree is corrupted · 4d3d9b59
      Liu Bo authored
      commit 3561b9db upstream.
      
      When relocating tree blocks, we firstly get block information from
      back references in the extent tree, we then search fs tree to try to
      find all parents of a block.
      
      However, if fs tree is corrupted, eg. if there're some missing
      items, we could come across these WARN_ONs and BUG_ONs.
      
      This makes us print some error messages and return gracefully
      from balance.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d3d9b59
    • Liu Bo's avatar
      Btrfs: bail out if block group has different mixed flag · a6522e48
      Liu Bo authored
      commit 49303381 upstream.
      
      Currently we allow inconsistence about mixed flag
       (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA).
      
      We'd get ENOSPC if block group has mixed flag and btrfs doesn't.
      If that happens, we have one space_info with mixed flag and another
      space_info only with BTRFS_BLOCK_GROUP_METADATA, and
      global_block_rsv.space_info points to the latter one, but all bytes
      from block_group contributes to the mixed space_info, thus all the
      allocation will fail with ENOSPC.
      
      This adds a check for the above case.
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      [ updated message ]
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6522e48
    • Liu Bo's avatar
      Btrfs: fix memory leak in reading btree blocks · d7839adc
      Liu Bo authored
      commit 2571e739 upstream.
      
      So we can read a btree block via readahead or intentional read,
      and we can end up with a memory leak when something happens as
      follows,
      1) readahead starts to read block A but does not wait for read
         completion,
      2) btree_readpage_end_io_hook finds that block A is corrupted,
         and it needs to clear all block A's pages' uptodate bit.
      3) meanwhile an intentional read kicks in and checks block A's
         pages' uptodate to decide which page needs to be read.
      4) when some pages have the uptodate bit during 3)'s check so
         3) doesn't count them for eb->io_pages, but they are later
         cleared by 2) so we has to readpage on the page, we get
         the wrong eb->io_pages which results in a memory leak of
         this block.
      
      This fixes the problem by firstly getting all pages's locking and
      then checking pages' uptodate bit.
      
         t1(readahead)                              t2(readahead endio)                                       t3(the following read)
      read_extent_buffer_pages                    end_bio_extent_readpage
        for pg in eb:                                for page 0,1,2 in eb:
            if pg is uptodate:                           btree_readpage_end_io_hook(pg)
                num_reads++                              if uptodate:
        eb->io_pages = num_reads                             SetPageUptodate(pg)              _______________
        for pg in eb:                                for page 3 in eb:                                     read_extent_buffer_pages
             if pg is NOT uptodate:                      btree_readpage_end_io_hook(pg)                       for pg in eb:
                 __extent_read_full_page(pg)                 sanity check reports something wrong                 if pg is uptodate:
                                                             clear_extent_buffer_uptodate(eb)                         num_reads++
                                                                 for pg in eb:                                eb->io_pages = num_reads
                                                                     ClearPageUptodate(page)  _______________
                                                                                                              for pg in eb:
                                                                                                                  if pg is NOT uptodate:
                                                                                                                      __extent_read_full_page(pg)
      
      So t3's eb->io_pages is not consistent with the number of pages it's reading,
      and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
      number so that we're not able to free the eb.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7839adc
    • Richard Watts's avatar
      clk: ti: omap36xx: Work around sprz319 advisory 2.1 · 1a087cd8
      Richard Watts authored
      commit 035cd485 upstream.
      
      The OMAP36xx DPLL5, driving EHCI USB, can be subject to a long-term
      frequency drift. The frequency drift magnitude depends on the VCO update
      rate, which is inversely proportional to the PLL divider. The kernel
      DPLL configuration code results in a high value for the divider, leading
      to a long term drift high enough to cause USB transmission errors. In
      the worst case the USB PHY's ULPI interface can stop responding,
      breaking USB operation completely. This manifests itself on the
      Beagleboard xM by the LAN9514 reporting 'Cannot enable port 2. Maybe the
      cable is bad?' in the kernel log.
      
      Errata sprz319 advisory 2.1 documents PLL values that minimize the
      drift. Use them automatically when DPLL5 is used for USB operation,
      which we detect based on the requested clock rate. The clock framework
      will still compute the PLL parameters and resulting rate as usual, but
      the PLL M and N values will then be overridden. This can result in the
      effective clock rate being slightly different than the rate cached by
      the clock framework, but won't cause any adverse effect to USB
      operation.
      Signed-off-by: default avatarRichard Watts <rrw@kynesim.co.uk>
      [Upported from v3.2 to v4.9]
      Signed-off-by: default avatarLaurent Pinchart <laurent.pinchart@ideasonboard.com>
      Tested-by: default avatarLadislav Michl <ladis@linux-mips.org>
      Signed-off-by: default avatarStephen Boyd <sboyd@codeaurora.org>
      Cc: Adam Ford <aford173@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a087cd8