1. 19 Feb, 2016 40 commits
    • Mauricio Faria de Oliveira's avatar
      Revert "dm mpath: fix stalls when handling invalid ioctls" · 3662b6cf
      Mauricio Faria de Oliveira authored
      commit 47796938 upstream.
      
      This reverts commit a1989b33.
      
      That commit introduced a regression at least for the case of the SG_IO ioctl()
      running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
      are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
      than blocking due to queue_if_no_path until a path becomes active, for example.
      
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
      from multipath devices; which leads to SCSI/filesystem errors in such a guest.
      
      More general scenarios can hit that regression too. The following demonstration
      employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
      (some output & user changes omitted for brevity and comments added for clarity).
      
      Reverting that commit restores normal operation (queueing) in failing scenarios;
      tested on linux-next (next-20151022).
      
      1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
      
          $ cat sg_simple0.c
          ... see [3] ...
          $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
          $ gcc sgio_inquiry.c -o sgio_inquiry
      
      2) The ioctl() works fine with active paths present.
      
          # multipath -l 85ag56
          85ag56 (...) dm-19 IBM     ,2145
          size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
          |-+- policy='service-time 0' prio=0 status=active
          | |- 8:0:11:0  sdz  65:144  active undef running
          | `- 9:0:9:0   sdbf 67:144  active undef running
          `-+- policy='service-time 0' prio=0 status=enabled
            |- 8:0:12:0  sdae 65:224  active undef running
            `- 9:0:12:0  sdbo 68:32   active undef running
      
          $ ./sgio_inquiry /dev/mapper/85ag56
          Some of the INQUIRY command's response:
              IBM       2145              0000
          INQUIRY duration=0 millisecs, resid=0
      
      3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
         for unprivileged users (rather than blocking due to queue_if_no_path).
      
          # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
                do multipathd -k"fail path $path"; done
      
          # multipath -l 85ag56
          85ag56 (...) dm-19 IBM     ,2145
          size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
          |-+- policy='service-time 0' prio=0 status=enabled
          | |- 8:0:11:0  sdz  65:144  failed undef running
          | `- 9:0:9:0   sdbf 67:144  failed undef running
          `-+- policy='service-time 0' prio=0 status=enabled
            |- 8:0:12:0  sdae 65:224  failed undef running
            `- 9:0:12:0  sdbo 68:32   failed undef running
      
          $ ./sgio_inquiry /dev/mapper/85ag56
          sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
      
      4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
         it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
      
          $ dmesg
          <...>
          [] device-mapper: multipath: Failing path 65:144.
          [] device-mapper: multipath: Failing path 67:144.
          [] device-mapper: multipath: Failing path 65:224.
          [] device-mapper: multipath: Failing path 68:32.
          [] sgio_inquiry: sending ioctl 2285 to a partition!
      
      5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
         (then queueing happens -- in this example, queue_if_no_path is set);
         this is due to a conditional check in scsi_verify_blk_ioctl().
      
          # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
          sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
      
          # ./sgio_inquiry /dev/mapper/85ag56 &
          [1] 72830
      
          # cat /proc/72830/stack
          [<c00000171c0df700>] 0xc00000171c0df700
          [<c000000000015934>] __switch_to+0x204/0x350
          [<c000000000152d4c>] msleep+0x5c/0x80
          [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
          [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
          [<c0000000003128e4>] block_ioctl+0x64/0xd0
          [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
          [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
          [<c000000000009358>] system_call+0x38/0xd0
      
      6) This is the function call chain exercised in this analysis:
      
      SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
          -> do_vfs_ioctl()
              -> vfs_ioctl()
                  ...
                  error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
                  ...
                      -> dm_blk_ioctl() @ drivers/md/dm.c
                          -> multipath_ioctl() @ drivers/md/dm-mpath.c
                              ...
                              (bdev = NULL, due to no active paths)
                              ...
                              if (!bdev || <...>) {
                                  int err = scsi_verify_blk_ioctl(NULL, cmd);
                                  if (err)
                                      r = err;
                              }
                              ...
                                  -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                      ...
                                      if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                          return 0;
                                      ...
                                      if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                          return 0;
                                      ...
                                      printk_ratelimited(KERN_WARNING
                                                 "%s: sending ioctl %x to a partition!\n" <...>);
      
                                      return -ENOIOCTLCMD;
                                  <-
                              ...
                              return r ? : <...>
                          <-
                  ...
                  if (error == -ENOIOCTLCMD)
                      error = -ENOTTY;
                   out:
                      return error;
                  ...
      
      Links:
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
      [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
      Signed-off-by: default avatarMauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3662b6cf
    • Mikulas Patocka's avatar
      dm: initialize non-blk-mq queue data before queue is used · ac44b98e
      Mikulas Patocka authored
      commit ad5f498f upstream.
      
      Commit bfebd1cd ("dm: add full blk-mq
      support to request-based DM") moves the initialization of the fields
      backing_dev_info.congested_fn, backing_dev_info.congested_data and
      queuedata from the function dm_init_md_queue (that is called when the
      device is created) to dm_init_old_md_queue (that is called after the
      device type is determined).
      
      There is no locking when accessing these variables, thus it is possible
      for other parts of the kernel to briefly see this data in a transient
      state (e.g. queue->backing_dev_info.congested_fn initialized and
      md->queue->backing_dev_info.congested_data uninitialized, resulting in
      passing an incorrect parameter to the function dm_any_congested).
      
      This queue data is left initialized for blk-mq devices even though they
      that don't use it.
      
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac44b98e
    • Dmitry V. Levin's avatar
      sh64: fix __NR_fgetxattr · 98216c7c
      Dmitry V. Levin authored
      commit 2d33fa10 upstream.
      
      According to arch/sh/kernel/syscalls_64.S and common sense, __NR_fgetxattr
      has to be defined to 259, but it doesn't.  Instead, it's defined to 269,
      which is of course used by another syscall, __NR_sched_setaffinity in this
      case.
      
      This bug was found by strace test suite.
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Acked-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98216c7c
    • xuejiufei's avatar
      ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup · c6a5eba5
      xuejiufei authored
      commit c95a5180 upstream.
      
      When recovery master down, dlm_do_local_recovery_cleanup() only remove
      the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
      Which will make umount thread falling in dead loop migrating $RECOVERY
      to the dead node.
      Signed-off-by: default avatarxuejiufei <xuejiufei@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6a5eba5
    • xuejiufei's avatar
      ocfs2/dlm: ignore cleaning the migration mle that is inuse · 62e3e821
      xuejiufei authored
      commit bef5502d upstream.
      
      We have found that migration source will trigger a BUG that the refcount
      of mle is already zero before put when the target is down during
      migration.  The situation is as follows:
      
      dlm_migrate_lockres
        dlm_add_migration_mle
        dlm_mark_lockres_migrating
        dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
        dlm_send_one_lockres and wait for the target to become the
        new master.
        <<<<<< o2hb detect the target down and clean the migration
        mle. Now the refcount is 1.
      
      dlm_migrate_lockres woken, and put the mle twice when found the target
      goes down which trigger the BUG with the following message:
      
        "ERROR: bad mle: ".
      Signed-off-by: default avatarJiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      62e3e821
    • Joseph Qi's avatar
      ocfs2: fix BUG when calculate new backup super · 959c944f
      Joseph Qi authored
      commit 5c9ee4cb upstream.
      
      When resizing, it firstly extends the last gd.  Once it should backup
      super in the gd, it calculates new backup super and update the
      corresponding value.
      
      But it currently doesn't consider the situation that the backup super is
      already done.  And in this case, it still sets the bit in gd bitmap and
      then decrease from bg_free_bits_count, which leads to a corrupted gd and
      trigger the BUG in ocfs2_block_group_set_bits:
      
          BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);
      
      So check whether the backup super is done and then do the updates.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarJiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      959c944f
    • Junxiao Bi's avatar
      ocfs2: fix SGID not inherited issue · c553fd80
      Junxiao Bi authored
      commit 854ee2e9 upstream.
      
      Commit 8f1eb487 ("ocfs2: fix umask ignored issue") introduced an
      issue, SGID of sub dir was not inherited from its parents dir.  It is
      because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
      is overwritten by "mode" which don't have SGID set later.
      
      Fixes: 8f1eb487 ("ocfs2: fix umask ignored issue")
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Acked-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c553fd80
    • Mike Kravetz's avatar
      mm/hugetlb.c: fix resv map memory leak for placeholder entries · 11445dc8
      Mike Kravetz authored
      commit dbe409e4 upstream.
      
      Dmitry Vyukov reported the following memory leak
      
      unreferenced object 0xffff88002eaafd88 (size 32):
        comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
        hex dump (first 32 bytes):
          28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff  (.Nc....(.Nc....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           kmalloc include/linux/slab.h:458
           region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
           __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
           vma_needs_reservation mm/hugetlb.c:1813
           alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
           hugetlb_no_page mm/hugetlb.c:3543
           hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
           follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
           __get_user_pages+0x542/0xf30 mm/gup.c:497
           populate_vma_page_range+0xde/0x110 mm/gup.c:919
           __mm_populate+0x1c7/0x310 mm/gup.c:969
           do_mlock+0x291/0x360 mm/mlock.c:637
           SYSC_mlock2 mm/mlock.c:658
           SyS_mlock2+0x4b/0x70 mm/mlock.c:648
      
      Dmitry identified a potential memory leak in the routine region_chg,
      where a region descriptor is not free'ed on an error path.
      
      However, the root cause for the above memory leak resides in region_del.
      In this specific case, a "placeholder" entry is created in region_chg.
      The associated page allocation fails, and the placeholder entry is left
      in the reserve map.  This is "by design" as the entry should be deleted
      when the map is released.  The bug is in the region_del routine which is
      used to delete entries within a specific range (and when the map is
      released).  region_del did not handle the case where a placeholder entry
      exactly matched the start of the range range to be deleted.  In this
      case, the entry would not be deleted and leaked.  The fix is to take
      these special placeholder entries into account in region_del.
      
      The region_chg error path leak is also fixed.
      
      Fixes: feba16e2 ("mm/hugetlb: add region_del() to delete a specific range of entries")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11445dc8
    • Richard Weinberger's avatar
      kernel/signal.c: unexport sigsuspend() · 0440e1ce
      Richard Weinberger authored
      commit 9d8a7652 upstream.
      
      sigsuspend() is nowhere used except in signal.c itself, so we can mark it
      static do not pollute the global namespace.
      
      But this patch is more than a boring cleanup patch, it fixes a real issue
      on UserModeLinux.  UML has a special console driver to display ttys using
      xterm, or other terminal emulators, on the host side.  Vegard reported
      that sometimes UML is unable to spawn a xterm and he's facing the
      following warning:
      
        WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()
      
      It turned out that this warning makes absolutely no sense as the UML
      xterm code calls sigsuspend() on the host side, at least it tries.  But
      as the kernel itself offers a sigsuspend() symbol the linker choose this
      one instead of the glibc wrapper.  Interestingly this code used to work
      since ever but always blocked signals on the wrong side.  Some recent
      kernel change made the WARN_ON() trigger and uncovered the bug.
      
      It is a wonderful example of how much works by chance on computers. :-)
      
      Fixes: 68f3f16d ("new helper: sigsuspend()")
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Tested-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0440e1ce
    • Naoya Horiguchi's avatar
      mm: hugetlb: call huge_pte_alloc() only if ptep is null · ceb9eea3
      Naoya Horiguchi authored
      commit 0d777df5 upstream.
      
      Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
      and check whether the obtained *ptep is a migration/hwpoison entry or
      not.  And if not, then we get to call huge_pte_alloc().  This is racy
      because the *ptep could turn into migration/hwpoison entry after the
      huge_pte_offset() check.  This race results in BUG_ON in
      huge_pte_alloc().
      
      We don't have to call huge_pte_alloc() when the huge_pte_offset()
      returns non-NULL, so let's fix this bug with moving the code into else
      block.
      
      Note that the *ptep could turn into a migration/hwpoison entry after
      this block, but that's not a problem because we have another
      !pte_present check later (we never go into hugetlb_no_page() in that
      case.)
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ceb9eea3
    • OGAWA Hirofumi's avatar
      fat: fix fake_offset handling on error path · f3fc333b
      OGAWA Hirofumi authored
      commit 928a4771 upstream.
      
      For the root directory, .  and ..  are faked (using dir_emit_dots()) and
      ctx->pos is reset from 2 to 0.
      
      A corrupted root directory could cause fat_get_entry() to fail, but
      ->iterate() (fat_readdir()) reports progress to the VFS (with ctx->pos
      rewound to 0), so any following calls to ->iterate() continue to return
      the same entries again and again.
      
      The result is that userspace will never see the end of the directory,
      causing e.g.  'ls' to hang in a getdents() loop.
      
      [hirofumi@mail.parknet.co.jp: cleanup and make sure to correct fake_offset]
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Tested-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarRichard Weinberger <richard.weinberger@gmail.com>
      Signed-off-by: default avatarOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f3fc333b
    • Mike Kravetz's avatar
      mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes · 525893dd
      Mike Kravetz authored
      commit 1817889e upstream.
      
      Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
      punch code.  These problems are in the routine remove_inode_hugepages and
      mostly occur in the case where there are holes in the range of pages to be
      removed.  These holes could be the result of a previous hole punch or
      simply sparse allocation.  The current code could access pages outside the
      specified range.
      
      remove_inode_hugepages handles both hole punch and truncate operations.
      Page index handling was fixed/cleaned up so that the loop index always
      matches the page being processed.  The code now only makes a single pass
      through the range of pages as it was determined page faults could not race
      with truncate.  A cond_resched() was added after removing up to
      PAGEVEC_SIZE pages.
      
      Some totally unnecessary code in hugetlbfs_fallocate() that remained from
      early development was also removed.
      
      Tested with fallocate tests submitted here:
      http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
      And, some ftruncate tests under development
      
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      525893dd
    • Michal Hocko's avatar
      mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 16c61891
      Michal Hocko authored
      commit 373ccbe5 upstream.
      
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies that worker would have to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if the one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent from issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads which are
      the ones of the interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16c61891
    • Naoya Horiguchi's avatar
      mm: hugetlb: fix hugepage memory leak caused by wrong reserve count · 3f5fae4d
      Naoya Horiguchi authored
      commit a88c7695 upstream.
      
      When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
      alloc_buddy_huge_page() to directly create a hugepage from the buddy
      allocator.
      
      In that case, however, if alloc_buddy_huge_page() succeeds we don't
      decrement h->resv_huge_pages, which means that successful
      hugetlb_fault() returns without releasing the reserve count.  As a
      result, subsequent hugetlb_fault() might fail despite that there are
      still free hugepages.
      
      This patch simply adds decrementing code on that code path.
      
      I reproduced this problem when testing v4.3 kernel in the following situation:
       - the test machine/VM is a NUMA system,
       - hugepage overcommiting is enabled,
       - most of hugepages are allocated and there's only one free hugepage
         which is on node 0 (for example),
       - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
         node 1, tries to allocate a hugepage,
       - the allocation should fail but the reserve count is still hold.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f5fae4d
    • Michal Hocko's avatar
      memcg: fix thresholds for 32b architectures. · 01a79daa
      Michal Hocko authored
      commit c12176d3 upstream.
      
      Commit 424cdc14 ("memcg: convert threshold to bytes") has fixed a
      regression introduced by 3e32cb2e ("mm: memcontrol: lockless page
      counters") where thresholds were silently converted to use page units
      rather than bytes when interpreting the user input.
      
      The fix is not complete, though, as properly pointed out by Ben Hutchings
      during stable backport review.  The page count is converted to bytes but
      unsigned long is used to hold the value which would be obviously not
      sufficient for 32b systems with more than 4G thresholds.  The same applies
      to usage as taken from mem_cgroup_usage which might overflow.
      
      Let's remove this bytes vs.  pages internal tracking differences and
      handle thresholds in page units internally.  Chage mem_cgroup_usage() to
      return the value in page units and revert 424cdc14 because this should
      be sufficient for the consistent handling.  mem_cgroup_read_u64 as the
      only users of mem_cgroup_usage outside of the threshold handling code is
      converted to give the proper in bytes result.  It is doing that already
      for page_counter output so this is more consistent as well.
      
      The value presented to the userspace is still in bytes units.
      
      Fixes: 424cdc14 ("memcg: convert threshold to bytes")
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      From: Michal Hocko <mhocko@kernel.org>
      Subject: memcg: fix thresholds for 32b architectures.
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      From: Andrew Morton <akpm@linux-foundation.org>
      Subject: memcg: fix thresholds for 32b architectures.
      
      don't attempt to inline mem_cgroup_usage()
      
      The compiler ignores the inline anwyay.  And __always_inlining it adds 600
      bytes of goop to the .o file.
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01a79daa
    • Greg Thelen's avatar
      fs, seqfile: always allow oom killer · 56c5e58f
      Greg Thelen authored
      commit 0f930902 upstream.
      
      Since 5cec38ac ("fs, seq_file: fallback to vmalloc instead of oom kill
      processes") seq_buf_alloc() avoids calling the oom killer for PAGE_SIZE or
      smaller allocations; but larger allocations can use the oom killer via
      vmalloc().  Thus reads of small files can return ENOMEM, but larger files
      use the oom killer to avoid ENOMEM.
      
      The effect of this bug is that reads from /proc and other virtual
      filesystems can return ENOMEM instead of the preferred behavior - oom
      killing something (possibly the calling process).  I don't know of anyone
      except Google who has noticed the issue.
      
      I suspect the fix is more needed in smaller systems where there isn't any
      reclaimable memory.  But these seem like the kinds of systems which
      probably don't use the oom killer for production situations.
      
      Memory overcommit requires use of the oom killer to select a victim
      regardless of file size.
      
      Enable oom killer for small seq_buf_alloc() allocations.
      
      Fixes: 5cec38ac ("fs, seq_file: fallback to vmalloc instead of oom kill processes")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      56c5e58f
    • Andy Shevchenko's avatar
      lib/hexdump.c: truncate output in case of overflow · 2343eaa3
      Andy Shevchenko authored
      commit 9f029f54 upstream.
      
      There is a classical off-by-one error in case when we try to place, for
      example, 1+1 bytes as hex in the buffer of size 6.  The expected result is
      to get an output truncated, but in the reality we get 6 bytes filed
      followed by terminating NUL.
      
      Change the logic how we fill the output in case of byte dumping into
      limited space.  This will follow the snprintf() behaviour by truncating
      output even on half bytes.
      
      Fixes: 114fc1af (hexdump: make it return number of bytes placed in buffer)
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@nokia.com>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@nokia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2343eaa3
    • Tetsuo Handa's avatar
      mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL · 83d79c0d
      Tetsuo Handa authored
      commit 426fb5e7 upstream.
      
      It was confirmed that a local unprivileged user can consume all memory
      reserves and hang up that system using time lag between the OOM killer
      sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for
      printk() inside for_each_process() loop at oom_kill_process() can consume
      many seconds when there are many thread groups sharing the same memory.
      
      Before starting oom-depleter process:
      
          Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
          Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
      
      As of invoking the OOM killer:
      
          Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
          Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
      
      Between the thread group leader got TIF_MEMDIE and receives SIGKILL:
      
          Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
          Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
      
      The oom-depleter's thread group leader which got TIF_MEMDIE started
      memset() in user space after the OOM killer set TIF_MEMDIE, and it was
      free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
      until SIGKILL is delivered.  If SIGKILL is delivered before TIF_MEMDIE is
      set, the oom-depleter can terminate without touching memory reserves.
      
      Although the possibility of hitting this time lag is very small for 3.19
      and earlier kernels because TIF_MEMDIE is set immediately before sending
      SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
      step between and allow memory allocations which are not needed for
      terminating the OOM victim.
      
      Fixes: 83363b91 ("oom: make sure that TIF_MEMDIE is set under task_lock")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      83d79c0d
    • Catalin Marinas's avatar
      mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE · 64615901
      Catalin Marinas authored
      commit d4322d88 upstream.
      
      On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
      configurations defining ARCH_DMA_MINALIGN to 128), the first
      kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
      "kmalloc-128" with index 7.  Depending on the debug kernel configuration,
      sizeof(struct kmem_cache) can be larger than 128 resulting in an
      INDEX_NODE of 8.
      
      Commit 8fc9cf42 ("slab: make more slab management structure off the
      slab") enables off-slab management objects for sizes starting with
      PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
      of the "kmalloc-128" cache would try to place the management objects
      off-slab.  However, since KMALLOC_MIN_SIZE is already 128 and
      freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
      returns NULL (kmalloc_caches[7] not populated yet).  This triggers the
      following bug on arm64:
      
        kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
        Hardware name: Juno (DT)
        PC is at __kmem_cache_create+0x21c/0x280
        LR is at __kmem_cache_create+0x210/0x280
        [...]
        Call trace:
          __kmem_cache_create+0x21c/0x280
          create_boot_cache+0x48/0x80
          create_kmalloc_cache+0x50/0x88
          create_kmalloc_caches+0x4c/0xf4
          kmem_cache_init+0x100/0x118
          start_kernel+0x214/0x33c
      
      This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
      management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
      
      Fixes: 8fc9cf42 ("slab: make more slab management structure off the slab")
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      64615901
    • Colin Ian King's avatar
      proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter · 38ff7eb0
      Colin Ian King authored
      commit 41a0c249 upstream.
      
      Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
      774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
      the setting of ret after the get_proc_task call and incorrectly left it as
      -ESRCH.  Instead, return 0 when successful.
      
      Example breakage:
      
        echo 0 > /proc/self/coredump_filter
        bash: echo: write error: No such process
      
      Fixes: 774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      38ff7eb0
    • Arnd Bergmann's avatar
      remoteproc: avoid stack overflow in debugfs file · bffecd38
      Arnd Bergmann authored
      commit 92792e48 upstream.
      
      Recent gcc versions warn about reading from a negative offset of
      an on-stack array:
      
      drivers/remoteproc/remoteproc_debugfs.c: In function 'rproc_recovery_write':
      drivers/remoteproc/remoteproc_debugfs.c:167:9: warning: 'buf[4294967295u]' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      I don't see anything in sys_write() that prevents us from
      being called with a zero 'count' argument, so we should
      add an extra check in rproc_recovery_write() to prevent the
      access and avoid the warning.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Fixes: 2e37abb8 ("remoteproc: create a 'recovery' debugfs entry")
      Signed-off-by: default avatarOhad Ben-Cohen <ohad@wizery.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bffecd38
    • Oleg Nesterov's avatar
      proc: actually make proc_fd_permission() thread-friendly · a793d2b5
      Oleg Nesterov authored
      commit 54708d28 upstream.
      
      The commit 96d0df79 ("proc: make proc_fd_permission() thread-friendly")
      fixed the access to /proc/self/fd from sub-threads, but introduced another
      problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd
      if generic_permission() fails.
      
      Change proc_fd_permission() to check same_thread_group(pid_task(), current).
      
      Fixes: 96d0df79 ("proc: make proc_fd_permission() thread-friendly")
      Reported-by: default avatar"Jin, Yihua" <yihua.jin@intel.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a793d2b5
    • Takashi Iwai's avatar
      ALSA: hda - Implement loopback control switch for Realtek and other codecs · 788e6ddf
      Takashi Iwai authored
      commit e7fdd527 upstream.
      
      Many codecs, typically found on Realtek codecs, have the analog
      loopback path merged to the secondary input of the middle of the
      output paths.  Currently, we don't offer the dynamic switching in such
      configuration but let each loopback path mute by itself.
      
      This should work well in theory, but in reality, we often see that
      such a dead loopback path causes some background noises even if all
      the elements get muted.  Such a problem has been fixed by adding the
      quirk accordingly to disable aamix, and it's the right fix, per se.
      The only problem is that it's not so trivial to achieve it; user needs
      to pass a hint string via patch module option or sysfs.
      
      This patch gives a bit improvement on the situation: it adds "Loopback
      Mixing" control element for such codecs like other codecs (e.g. IDT or
      VIA codecs) with the individual loopback paths.  User can turn on/off
      the loopback path simply via a mixer app.
      
      For keeping the compatibility, the loopback is still enabled on these
      codecs.  But user can try to turn it off if experiencing a suspicious
      background or click noise on the fly, then build a static fixup later
      once after the problem is addressed.
      
      Other than the addition of the loopback enable/disablement control,
      there should be no changes.
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      788e6ddf
    • Ioan-Adrian Ratiu's avatar
      HID: usbhid: fix recursive deadlock · e52edae1
      Ioan-Adrian Ratiu authored
      commit e470127e upstream.
      
      The critical section protected by usbhid->lock in hid_ctrl() is too
      big and because of this it causes a recursive deadlock. "Too big" means
      the case statement and the call to hid_input_report() do not need to be
      protected by the spinlock (no URB operations are done inside them).
      
      The deadlock happens because in certain rare cases drivers try to grab
      the lock while handling the ctrl irq which grabs the lock before them
      as described above. For example newer wacom tablets like 056a:033c try
      to reschedule proximity reads from wacom_intuos_schedule_prox_event()
      calling hid_hw_request() -> usbhid_request() -> usbhid_submit_report()
      which tries to grab the usbhid lock already held by hid_ctrl().
      
      There are two ways to get out of this deadlock:
          1. Make the drivers work "around" the ctrl critical region, in the
          wacom case for ex. by delaying the scheduling of the proximity read
          request itself to a workqueue.
          2. Shrink the critical region so the usbhid lock protects only the
          instructions which modify usbhid state, calling hid_input_report()
          with the spinlock unlocked, allowing the device driver to grab the
          lock first, finish and then grab the lock afterwards in hid_ctrl().
      
      This patch implements the 2nd solution.
      Signed-off-by: default avatarIoan-Adrian Ratiu <adi@adirat.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarJason Gerecke <jason.gerecke@wacom.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e52edae1
    • Tariq Saeed's avatar
      ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock · 5b419b90
      Tariq Saeed authored
      commit b1b1e15e upstream.
      
      NFS on a 2 node ocfs2 cluster each node exporting dir.  The lock causing
      the hang is the global bit map inode lock.  Node 1 is master, has the
      lock granted in PR mode; Node 2 is in the converting list (PR -> EX).
      There are no holders of the lock on the master node so it should
      downconvert to NL and grant EX to node 2 but that does not happen.
      BLOCKED + QUEUED in lock res are set and it is on osb blocked list.
      Threads are waiting in __ocfs2_cluster_lock on BLOCKED.  One thread
      wants EX, rest want PR.  So it is as though the downconvert thread needs
      to be kicked to complete the conv.
      
      The hang is caused by an EX req coming into __ocfs2_cluster_lock on the
      heels of a PR req after it sets BUSY (drops l_lock, releasing EX
      thread), forcing the incoming EX to wait on BUSY without doing anything.
      PR has called ocfs2_dlm_lock, which sets the node 1 lock from NL -> PR,
      queues ast.
      
      At this time, upconvert (PR ->EX) arrives from node 2, finds conflict
      with node 1 lock in PR, so the lock res is put on dlm thread's dirty
      listt.
      
      After ret from ocf2_dlm_lock, PR thread now waits behind EX on BUSY till
      awoken by ast.
      
      Now it is dlm_thread that serially runs dlm_shuffle_lists, ast, bast, in
      that order.  dlm_shuffle_lists ques a bast on behalf of node 2 (which
      will be run by dlm_thread right after the ast).  ast does its part, sets
      UPCONVERT_FINISHING, clears BUSY and wakes its waiters.  Next,
      dlm_thread runs bast.  It sets BLOCKED and kicks dc thread.  dc thread
      runs ocfs2_unblock_lock, but since UPCONVERT_FINISHING set, skips doing
      anything and reques.
      
      Inside of __ocfs2_cluster_lock, since EX has been waiting on BUSY ahead
      of PR, it wakes up first, finds BLOCKED set and skips doing anything but
      clearing UPCONVERT_FINISHING (which was actually "meant" for the PR
      thread), and this time waits on BLOCKED.  Next, the PR thread comes out
      of wait but since UPCONVERT_FINISHING is not set, it skips updating the
      l_ro_holders and goes straight to wait on BLOCKED.  So there, we have a
      hang! Threads in __ocfs2_cluster_lock wait on BLOCKED, lock res in osb
      blocked list.  Only when dc thread is awoken, it will run
      ocfs2_unblock_lock and things will unhang.
      
      One way to fix this is to wake the dc thread on the flag after clearing
      UPCONVERT_FINISHING
      
      Orabug: 20933419
      Signed-off-by: default avatarTariq Saeed <tariq.x.saeed@oracle.com>
      Signed-off-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Eric Ren <zren@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b419b90
    • Trond Myklebust's avatar
      NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn · 368f974c
      Trond Myklebust authored
      commit 1a093ceb upstream.
      
      Since commit 2d8ae84f, nothing is bumping lo->plh_block_lgets in the
      layoutreturn path, so it should not be touched in nfs4_layoutreturn_release
      either.
      
      Fixes: 2d8ae84f ("NFSv4.1/pnfs: Remove redundant lo->plh_block_lgets...")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      368f974c
    • Junichi Nomura's avatar
      block: ensure to split after potentially bouncing a bio · ecf177df
      Junichi Nomura authored
      commit 23688bf4 upstream.
      
      blk_queue_bio() does split then bounce, which makes the segment
      counting based on pages before bouncing and could go wrong. Move
      the split to after bouncing, like we do for blk-mq, and the we
      fix the issue of having the bio count for segments be wrong.
      
      Fixes: 54efd50b ("block: make generic_make_request handle arbitrarily sized bios")
      Tested-by: default avatarArtem S. Tashkinov <t.artem@lycos.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecf177df
    • Seth Jennings's avatar
      drivers/base/memory.c: prohibit offlining of memory blocks with missing sections · c854fc16
      Seth Jennings authored
      commit 26bbe7ef upstream.
      
      Commit bdee237c ("x86: mm: Use 2GB memory block size on large-memory
      x86-64 systems") and 982792c7 ("x86, mm: probe memory block size for
      generic x86 64bit") introduced large block sizes for x86.  This made it
      possible to have multiple sections per memory block where previously,
      there was a only every one section per block.
      
      Since blocks consist of contiguous ranges of section, there can be holes
      in the blocks where sections are not present.  If one attempts to
      offline such a block, a crash occurs since the code is not designed to
      deal with this.
      
      This patch is a quick fix to gaurd against the crash by not allowing
      blocks with non-present sections to be offlined.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781Signed-off-by: default avatarSeth Jennings <sjennings@variantweb.net>
      Reported-by: default avatarAndrew Banman <abanman@sgi.com>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Russ Anderson <rja@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c854fc16
    • Mike Snitzer's avatar
      dm btree: fix leak of bufio-backed block in btree_split_sibling error path · 7435c303
      Mike Snitzer authored
      commit 30ce6e1c upstream.
      
      The block allocated at the start of btree_split_sibling() is never
      released if later insert_at() fails.
      
      Fix this by releasing the previously allocated bufio block using
      unlock_block().
      Reported-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7435c303
    • Hannes Reinecke's avatar
      block: Always check queue limits for cloned requests · f6d45a66
      Hannes Reinecke authored
      commit bf4e6b4e upstream.
      
      When a cloned request is retried on other queues it always needs
      to be checked against the queue limits of that queue.
      Otherwise the calculations for nr_phys_segments might be wrong,
      leading to a crash in scsi_init_sgtable().
      
      To clarify this the patch renames blk_rq_check_limits()
      to blk_cloned_rq_check_limits() and removes the symbol
      export, as the new function should only be used for
      cloned requests and never exported.
      
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Ewan Milne <emilne@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarHannes Reinecke <hare@suse.de>
      Fixes: e2a60da7 ("block: Clean up special command handling logic")
      Acked-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6d45a66
    • LABBE Corentin's avatar
      crypto: sun4i-ss - add missing statesize · f7b03cd7
      LABBE Corentin authored
      commit 4f9ea866 upstream.
      
      sun4i-ss implementaton of md5/sha1 is via ahash algorithms.
      Commit 8996eafd ("crypto: ahash - ensure statesize is non-zero")
      made impossible to load them without giving statesize. This patch
      specifiy statesize for sha1 and md5.
      
      Fixes: 6298e948 ("crypto: sunxi-ss - Add Allwinner Security System crypto accelerator")
      Tested-by: default avatarChen-Yu Tsai <wens@csie.org>
      Signed-off-by: default avatarLABBE Corentin <clabbe.montjoie@gmail.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7b03cd7
    • Herbert Xu's avatar
      crypto: algif_skcipher - Use new skcipher interface · e3d57c5b
      Herbert Xu authored
      commit 0d96e4ba upstream.
      
      This patch replaces uses of ablkcipher with the new skcipher
      interface.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: <smueller@chronox.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3d57c5b
    • Jason A. Donenfeld's avatar
      crypto: skcipher - Copy iv from desc even for 0-len walks · f15e5364
      Jason A. Donenfeld authored
      commit 70d906bc upstream.
      
      Some ciphers actually support encrypting zero length plaintexts. For
      example, many AEAD modes support this. The resulting ciphertext for
      those winds up being only the authentication tag, which is a result of
      the key, the iv, the additional data, and the fact that the plaintext
      had zero length. The blkcipher constructors won't copy the IV to the
      right place, however, when using a zero length input, resulting in
      some significant problems when ciphers call their initialization
      routines, only to find that the ->iv parameter is uninitialized. One
      such example of this would be using chacha20poly1305 with a zero length
      input, which then calls chacha20, which calls the key setup routine,
      which eventually OOPSes due to the uninitialized ->iv member.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f15e5364
    • David Gstir's avatar
      crypto: talitos - Fix timing leak in ESP ICV verification · 1f8b5200
      David Gstir authored
      commit 79960943 upstream.
      
      Using non-constant time memcmp() makes the verification of the authentication
      tag in the decrypt path vulnerable to timing attacks. Fix this by using
      crypto_memneq() instead.
      Signed-off-by: default avatarDavid Gstir <david@sigma-star.at>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f8b5200
    • David Gstir's avatar
      crypto: nx - Fix timing leak in GCM and CCM decryption · 9af08bfe
      David Gstir authored
      commit cb8affb5 upstream.
      
      Using non-constant time memcmp() makes the verification of the authentication
      tag in the decrypt path vulnerable to timing attacks. Fix this by using
      crypto_memneq() instead.
      Signed-off-by: default avatarDavid Gstir <david@sigma-star.at>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9af08bfe
    • Tadeusz Struk's avatar
      crypto: qat - don't use userspace pointer · 08b8fa61
      Tadeusz Struk authored
      commit 176155da upstream.
      
      Bugfix - don't dereference userspace pointer.
      Signed-off-by: default avatarTadeusz Struk <tadeusz.struk@intel.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08b8fa61
    • Herbert Xu's avatar
      crypto: algif_hash - Only export and import on sockets with data · d4f9756c
      Herbert Xu authored
      commit 4afa5f96 upstream.
      
      The hash_accept call fails to work on sockets that have not received
      any data.  For some algorithm implementations it may cause crashes.
      
      This patch fixes this by ensuring that we only export and import on
      sockets that have received data.
      Reported-by: default avatarHarsh Jain <harshjain.prof@gmail.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: default avatarStephan Mueller <smueller@chronox.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f9756c
    • Jaegeuk Kim's avatar
      f2fs crypto: allocate buffer for decrypting filename · a349da65
      Jaegeuk Kim authored
      commit 569cf187 upstream.
      
      We got dentry pages from high_mem, and its address space directly goes into the
      decryption path via f2fs_fname_disk_to_usr.
      But, sg_init_one assumes the address is not from high_mem, so we can get this
      panic since it doesn't call kmap_high but kunmap_high is triggered at the end.
      
      kernel BUG at ../../../../../../kernel/mm/highmem.c:290!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
      ...
       (kunmap_high+0xb0/0xb8) from [<c0114534>] (__kunmap_atomic+0xa0/0xa4)
       (__kunmap_atomic+0xa0/0xa4) from [<c035f028>] (blkcipher_walk_done+0x128/0x1ec)
       (blkcipher_walk_done+0x128/0x1ec) from [<c0366c24>] (crypto_cbc_decrypt+0xc0/0x170)
       (crypto_cbc_decrypt+0xc0/0x170) from [<c0367148>] (crypto_cts_decrypt+0xc0/0x114)
       (crypto_cts_decrypt+0xc0/0x114) from [<c035ea98>] (async_decrypt+0x40/0x48)
       (async_decrypt+0x40/0x48) from [<c032ca34>] (f2fs_fname_disk_to_usr+0x124/0x304)
       (f2fs_fname_disk_to_usr+0x124/0x304) from [<c03056fc>] (f2fs_fill_dentries+0xac/0x188)
       (f2fs_fill_dentries+0xac/0x188) from [<c03059c8>] (f2fs_readdir+0x1f0/0x300)
       (f2fs_readdir+0x1f0/0x300) from [<c0218054>] (vfs_readdir+0x90/0xb4)
       (vfs_readdir+0x90/0xb4) from [<c0218418>] (SyS_getdents64+0x64/0xcc)
       (SyS_getdents64+0x64/0xcc) from [<c0105ba0>] (ret_fast_syscall+0x0/0x30)
      Reviewed-by: default avatarChao Yu <chao2.yu@samsung.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a349da65
    • Russell King's avatar
      crypto: caam - fix non-block aligned hash calculation · c084696a
      Russell King authored
      commit c7556ff7 upstream.
      
      caam does not properly calculate the size of the retained state
      when non-block aligned hashes are requested - it uses the wrong
      buffer sizes, which results in errors such as:
      
      caam_jr 2102000.jr1: 40000501: DECO: desc idx 5: SGT Length Error. The descriptor is trying to read more data than is contained in the SGT table.
      
      We end up here with:
      
      in_len 0x46 blocksize 0x40 last_bufsize 0x0 next_bufsize 0x6
      to_hash 0x40 ctx_len 0x28 nbytes 0x20
      
      which results in a job descriptor of:
      
      jobdesc@889: ed03d918: b0861c08 3daa0080 f1400000 3d03d938
      jobdesc@889: ed03d928: 00000068 f8400000 3cde2a40 00000028
      
      where the word at 0xed03d928 is the expected data size (0x68), and a
      scatterlist containing:
      
      sg@892: ed03d938: 00000000 3cde2a40 00000028 00000000
      sg@892: ed03d948: 00000000 3d03d100 00000006 00000000
      sg@892: ed03d958: 00000000 7e8aa700 40000020 00000000
      
      0x68 comes from 0x28 (the context size) plus the "in_len" rounded down
      to a block size (0x40).  in_len comes from 0x26 bytes of unhashed data
      from the previous operation, plus the 0x20 bytes from the latest
      operation.
      
      The fixed version would create:
      
      sg@892: ed03d938: 00000000 3cde2a40 00000028 00000000
      sg@892: ed03d948: 00000000 3d03d100 00000026 00000000
      sg@892: ed03d958: 00000000 7e8aa700 40000020 00000000
      
      which replaces the 0x06 length with the correct 0x26 bytes of previously
      unhashed data.
      
      This fixes a previous commit which erroneously "fixed" this due to a
      DMA-API bug report; that commit indicates that the bug was caused via a
      test_ahash_pnum() function in the tcrypt module.  No such function has
      ever existed in the mainline kernel.  Given that the change in this
      commit has been tested with DMA API debug enabled and shows no issue,
      I can only conclude that test_ahash_pnum() was triggering that bad
      behaviour by CAAM.
      
      Fixes: 7d5196ab ("crypto: caam - Correct DMA unmap size in ahash_update_ctx()")
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c084696a
    • Nicolas Iooss's avatar
      crypto: crc32c-pclmul - use .rodata instead of .rotata · 7333f7d6
      Nicolas Iooss authored
      commit 97bce7e0 upstream.
      
      Module crc32c-intel uses a special read-only data section named .rotata.
      This section is defined for K_table, and its name seems to be a spelling
      mistake for .rodata.
      
      Fixes: 473946e6 ("crypto: crc32c-pclmul - Shrink K_table to 32-bit words")
      Signed-off-by: default avatarNicolas Iooss <nicolas.iooss_linux@m4x.org>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7333f7d6