1. 19 Aug, 2016 1 commit
    • Florian Westphal's avatar
      netfilter: x_tables: speed up jump target validation · 26bb96ed
      Florian Westphal authored
      commit f4dc7771 upstream.
      
      The dummy ruleset I used to test the original validation change was broken,
      most rules were unreachable and were not tested by mark_source_chains().
      
      In some cases rulesets that used to load in a few seconds now require
      several minutes.
      
      sample ruleset that shows the behaviour:
      
      echo "*filter"
      for i in $(seq 0 100000);do
              printf ":chain_%06x - [0:0]\n" $i
      done
      for i in $(seq 0 100000);do
         printf -- "-A INPUT -j chain_%06x\n" $i
         printf -- "-A INPUT -j chain_%06x\n" $i
         printf -- "-A INPUT -j chain_%06x\n" $i
      done
      echo COMMIT
      
      [ pipe result into iptables-restore ]
      
      This ruleset will be about 74mbyte in size, with ~500k searches
      though all 500k[1] rule entries. iptables-restore will take forever
      (gave up after 10 minutes)
      
      Instead of always searching the entire blob for a match, fill an
      array with the start offsets of every single ipt_entry struct,
      then do a binary search to check if the jump target is present or not.
      
      After this change ruleset restore times get again close to what one
      gets when reverting 36472341 (~3 seconds on my workstation).
      
      [1] every user-defined rule gets an implicit RETURN, so we get
      300k jumps + 100k userchains + 100k returns -> 500k rule entries
      
      Fixes: 36472341 ("netfilter: x_tables: validate targets of jumps")
      Reported-by: default avatarJeff Wu <wujiafu@gmail.com>
      Tested-by: default avatarJeff Wu <wujiafu@gmail.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      26bb96ed
  2. 22 Jul, 2016 3 commits
  3. 21 Jul, 2016 36 commits
    • Richard Weinberger's avatar
      um: Stop abusing __KERNEL__ · ad6896a7
      Richard Weinberger authored
      commit 298e20ba upstream.
      
      Currently UML is abusing __KERNEL__ to distinguish between
      kernel and host code (os-Linux). It is better to use a custom
      define such that existing users of __KERNEL__ don't get confused.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      ad6896a7
    • Tejun Heo's avatar
      printk: do cond_resched() between lines while outputting to consoles · d47e078e
      Tejun Heo authored
      commit 8d91f8b1 upstream.
      
      @console_may_schedule tracks whether console_sem was acquired through
      lock or trylock.  If the former, we're inside a sleepable context and
      console_conditional_schedule() performs cond_resched().  This allows
      console drivers which use console_lock for synchronization to yield
      while performing time-consuming operations such as scrolling.
      
      However, the actual console outputting is performed while holding
      irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
      before starting outputting lines.  Also, only a few drivers call
      console_conditional_schedule() to begin with.  This means that when a
      lot of lines need to be output by console_unlock(), for example on a
      console registration, the task doing console_unlock() may not yield for
      a long time on a non-preemptible kernel.
      
      If this happens with a slow console devices, for example a serial
      console, the outputting task may occupy the cpu for a very long time.
      Long enough to trigger softlockup and/or RCU stall warnings, which in
      turn pile more messages, sometimes enough to trigger the next cycle of
      warnings incapacitating the system.
      
      Fix it by making console_unlock() insert cond_resched() between lines if
      @console_may_schedule.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarCalvin Owens <calvinowens@fb.com>
      Acked-by: default avatarJan Kara <jack@suse.com>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Kyle McMartin <kyle@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Charles (Chas) Williams <ciwillia@brocade.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d47e078e
    • Vitaly Kuznetsov's avatar
      panic: release stale console lock to always get the logbuf printed out · 35e182c9
      Vitaly Kuznetsov authored
      commit 08d78658 upstream.
      
      In some cases we may end up killing the CPU holding the console lock
      while still having valuable data in logbuf. E.g. I'm observing the
      following:
      
      - A crash is happening on one CPU and console_unlock() is being called on
        some other.
      
      - console_unlock() tries to print out the buffer before releasing the lock
        and on slow console it takes time.
      
      - in the meanwhile crashing CPU does lots of printk()-s with valuable data
        (which go to the logbuf) and sends IPIs to all other CPUs.
      
      - console_unlock() finishes printing previous chunk and enables interrupts
        before trying to print out the rest, the CPU catches the IPI and never
        releases console lock.
      
      This is not the only possible case: in VT/fb subsystems we have many other
      console_lock()/console_unlock() users.  Non-masked interrupts (or
      receiving NMI in case of extreme slowness) will have the same result.
      Getting the whole console buffer printed out on crash should be top
      priority.
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      35e182c9
    • Hugh Dickins's avatar
      mm: migrate dirty page without clear_page_dirty_for_io etc · 2c789028
      Hugh Dickins authored
      commit 42cb14b1 upstream.
      
      clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
      since v2.6.16 first introduced page migration; and the set_page_dirty()
      which completed its migration of PageDirty, later had to be moderated to
      __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
      
      No actual problems seen with this procedure recently, but if you look into
      what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
      achieving, it turns out to be nothing more than moving the PageDirty flag,
      and its NR_FILE_DIRTY stat from one zone to another.
      
      It would be good to avoid a pile of irrelevant decrementations and
      incrementations, and improper event counting, and unnecessary descent of
      the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
      radix_tree_replace_slot() left in place anyway).
      
      Do the NR_FILE_DIRTY movement, like the other stats movements, while
      interrupts still disabled in migrate_page_move_mapping(); and don't even
      bother if the zone is the same.  Do the PageDirty movement there under
      tree_lock too, where old page is frozen and newpage not yet visible:
      bearing in mind that as soon as newpage becomes visible in radix_tree, an
      un-page-locked set_page_dirty() might interfere (or perhaps that's just
      not possible: anything doing so should already hold an additional
      reference to the old page, preventing its migration; but play safe).
      
      But we do still need to transfer PageDirty in migrate_page_copy(), for
      those who don't go the mapping route through migrate_page_move_mapping().
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Charles (Chas) Williams <ciwillia@brocade.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2c789028
    • Andy Lutomirski's avatar
      x86/mm: Add barriers and document switch_mm()-vs-flush synchronization · aa8f21d0
      Andy Lutomirski authored
      commit 71b3c126 upstream.
      
      When switch_mm() activates a new PGD, it also sets a bit that
      tells other CPUs that the PGD is in use so that TLB flush IPIs
      will be sent.  In order for that to work correctly, the bit
      needs to be visible prior to loading the PGD and therefore
      starting to fill the local TLB.
      
      Document all the barriers that make this work correctly and add
      a couple that were missing.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Charles (Chas) Williams <ciwillia@brocade.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      aa8f21d0
    • Jiri Slaby's avatar
      Linux 3.12.62 · a656195a
      Jiri Slaby authored
      a656195a
    • Vladimir Davydov's avatar
      signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfo · 2b012f59
      Vladimir Davydov authored
      commit 69828dce upstream.
      
      Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
      a warning on the first attempt of doing it.  We use WARN_ON_ONCE, which is
      not informative and, what is worse, taints the kernel, making the trinity
      syscall fuzzer complain false-positively from time to time.
      
      It does not look like we need this warning at all, because the behaviour
      changed quite a long time ago (2.6.39), and if an application relies on
      the old API, it gets EPERM anyway and can issue a warning by itself.
      
      So let us zap the warning in kernel.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2b012f59
    • James Hogan's avatar
      MIPS: KVM: Fix modular KVM under QEMU · 948546f8
      James Hogan authored
      commit 797179bc upstream.
      
      Copy __kvm_mips_vcpu_run() into unmapped memory, so that we can never
      get a TLB refill exception in it when KVM is built as a module.
      
      This was observed to happen with the host MIPS kernel running under
      QEMU, due to a not entirely transparent optimisation in the QEMU TLB
      handling where TLB entries replaced with TLBWR are copied to a separate
      part of the TLB array. Code in those pages continue to be executable,
      but those mappings persist only until the next ASID switch, even if they
      are marked global.
      
      An ASID switch happens in __kvm_mips_vcpu_run() at exception level after
      switching to the guest exception base. Subsequent TLB mapped kernel
      instructions just prior to switching to the guest trigger a TLB refill
      exception, which enters the guest exception handlers without updating
      EPC. This appears as a guest triggered TLB refill on a host kernel
      mapped (host KSeg2) address, which is not handled correctly as user
      (guest) mode accesses to kernel (host) segments always generate address
      error exceptions.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: kvm@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [james.hogan@imgtec.com: backported for stable 3.14]
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      948546f8
    • Bjørn Mork's avatar
      cdc_ncm: workaround for EM7455 "silent" data interface · 2cb8ebaa
      Bjørn Mork authored
      [ Upstream commit c086e709 ]
      
      Several Lenovo users have reported problems with their Sierra
      Wireless EM7455 modem. The driver has loaded successfully and
      the MBIM management channel has appeared to work, including
      establishing a connection to the mobile network. But no frames
      have been received over the data interface.
      
      The problem affects all EM7455 and MC7455, and is assumed to
      affect other modems based on the same Qualcomm chipset and
      baseband firmware.
      
      Testing narrowed the problem down to what seems to be a
      firmware timing bug during initialization. Adding a short sleep
      while probing is sufficient to make the problem disappear.
      Experiments have shown that 1-2 ms is too little to have any
      effect, while 10-20 ms is enough to reliably succeed.
      Reported-by: default avatarStefan Armbruster <ml001@armbruster-it.de>
      Reported-by: default avatarRalph Plawetzki <ralph@purejava.org>
      Reported-by: default avatarAndreas Fett <andreas.fett@secunet.com>
      Reported-by: default avatarRasmus Lerdorf <rasmus@lerdorf.com>
      Reported-by: default avatarSamo Ratnik <samo.ratnik@gmail.com>
      Reported-and-tested-by: default avatarAleksander Morgado <aleksander@aleksander.es>
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2cb8ebaa
    • Oliver Neukum's avatar
      HID: elo: kill not flush the work · 5925ce0a
      Oliver Neukum authored
      commit ed596a4a upstream.
      
      Flushing a work that reschedules itself is not a sensible operation. It needs
      to be killed. Failure to do so leads to a kernel panic in the timer code.
      Signed-off-by: default avatarOliver Neukum <ONeukum@suse.com>
      Reviewed-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5925ce0a
    • Dan Carpenter's avatar
      ALSA: compress: fix an integer overflow check · 9deea4dd
      Dan Carpenter authored
      commit 6217e5ed upstream.
      
      I previously added an integer overflow check here but looking at it now,
      it's still buggy.
      
      The bug happens in snd_compr_allocate_buffer().  We multiply
      ".fragments" and ".fragment_size" and that doesn't overflow but then we
      save it in an unsigned int so it truncates the high bits away and we
      allocate a smaller than expected size.
      
      Fixes: b35cc822 ('ALSA: compress_core: integer overflow in snd_compr_allocate_buffer()')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9deea4dd
    • Scott Bauer's avatar
      HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands · 5b900329
      Scott Bauer authored
      commit 93a2001b upstream.
      
      This patch validates the num_values parameter from userland during the
      HIDIOCGUSAGES and HIDIOCSUSAGES commands. Previously, if the report id was set
      to HID_REPORT_ID_UNKNOWN, we would fail to validate the num_values parameter
      leading to a heap overflow.
      Signed-off-by: default avatarScott Bauer <sbauer@plzdonthack.me>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5b900329
    • Lukasz Odzioba's avatar
      mm/swap.c: flush lru pvecs on compound page arrival · 93257ab3
      Lukasz Odzioba authored
      commit 8f182270 upstream.
      
      Currently we can have compound pages held on per cpu pagevecs, which
      leads to a lot of memory unavailable for reclaim when needed.  In the
      systems with hundreads of processors it can be GBs of memory.
      
      On of the way of reproducing the problem is to not call munmap
      explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
      that some pages (with THP enabled also huge pages) may end up on
      lru_add_pvec, example below.
      
        void main() {
        #pragma omp parallel
        {
      	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
      	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
      		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
      	if (p != MAP_FAILED)
      		memset(p, 0, size);
      	//munmap(p, size); // uncomment to make the problem go away
        }
        }
      
      When we run it with THP enabled it will leave significant amount of
      memory on lru_add_pvec.  This memory will be not reclaimed if we hit
      OOM, so when we run above program in a loop:
      
      	for i in `seq 100`; do ./a.out; done
      
      many processes (95% in my case) will be killed by OOM.
      
      The primary point of the LRU add cache is to save the zone lru_lock
      contention with a hope that more pages will belong to the same zone and
      so their addition can be batched.  The huge page is already a form of
      batched addition (it will add 512 worth of memory in one go) so skipping
      the batching seems like a safer option when compared to a potential
      excess in the caching which can be quite large and much harder to fix
      because lru_add_drain_all is way to expensive and it is not really clear
      what would be a good moment to call it.
      
      Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
      madvise(p, size, MADV_FREE); after memset.
      
      This patch flushes lru pvecs on compound page arrival making the problem
      less severe - after applying it kill rate of above example drops to 0%,
      due to reducing maximum amount of memory held on pvec from 28MB (with
      THP) to 56kB per CPU.
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Ming Li <mingli199x@qq.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      93257ab3
    • Marcelo Tosatti's avatar
      KVM: x86: expose invariant tsc cpuid bit (v2) · a992f642
      Marcelo Tosatti authored
      commit e4c9a5a1 upstream.
      
      Invariant TSC is a property of TSC, no additional
      support code necessary.
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a992f642
    • Jiri Slaby's avatar
      base: make module_create_drivers_dir race-free · 8617fe9e
      Jiri Slaby authored
      commit 7e1b1fc4 upstream.
      
      Modules which register drivers via standard path (driver_register) in
      parallel can cause a warning:
      WARNING: CPU: 2 PID: 3492 at ../fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
      sysfs: cannot create duplicate filename '/module/saa7146/drivers'
      Modules linked in: hexium_gemini(+) mxb(+) ...
      ...
      Call Trace:
      ...
       [<ffffffff812e63a2>] sysfs_warn_dup+0x62/0x80
       [<ffffffff812e6487>] sysfs_create_dir_ns+0x77/0x90
       [<ffffffff8140f2c4>] kobject_add_internal+0xb4/0x340
       [<ffffffff8140f5b8>] kobject_add+0x68/0xb0
       [<ffffffff8140f631>] kobject_create_and_add+0x31/0x70
       [<ffffffff8157a703>] module_add_driver+0xc3/0xd0
       [<ffffffff8155e5d4>] bus_add_driver+0x154/0x280
       [<ffffffff815604c0>] driver_register+0x60/0xe0
       [<ffffffff8145bed0>] __pci_register_driver+0x60/0x70
       [<ffffffffa0273e14>] saa7146_register_extension+0x64/0x90 [saa7146]
       [<ffffffffa0033011>] hexium_init_module+0x11/0x1000 [hexium_gemini]
      ...
      
      As can be (mostly) seen, driver_register causes this call sequence:
        -> bus_add_driver
          -> module_add_driver
            -> module_create_drivers_dir
      The last one creates "drivers" directory in /sys/module/<...>. When
      this is done in parallel, the directory is attempted to be created
      twice at the same time.
      
      This can be easily reproduced by loading mxb and hexium_gemini in
      parallel:
      while :; do
        modprobe mxb &
        modprobe hexium_gemini
        wait
        rmmod mxb hexium_gemini saa7146_vv saa7146
      done
      
      saa7146 calls pci_register_driver for both mxb and hexium_gemini,
      which means /sys/module/saa7146/drivers is to be created for both of
      them.
      
      Fix this by a new mutex in module_create_drivers_dir which makes the
      test-and-create "drivers" dir atomic.
      
      I inverted the condition and removed 'return' to avoid multiple
      unlocks or a goto.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Fixes: fe480a26 (Modules: only add drivers/ direcory if needed)
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8617fe9e
    • Dan Carpenter's avatar
      KEYS: potential uninitialized variable · 8c903c05
      Dan Carpenter authored
      commit 38327424 upstream.
      
      If __key_link_begin() failed then "edit" would be uninitialized.  I've
      added a check to fix that.
      
      This allows a random user to crash the kernel, though it's quite
      difficult to achieve.  There are three ways it can be done as the user
      would have to cause an error to occur in __key_link():
      
       (1) Cause the kernel to run out of memory.  In practice, this is difficult
           to achieve without ENOMEM cropping up elsewhere and aborting the
           attempt.
      
       (2) Revoke the destination keyring between the keyring ID being looked up
           and it being tested for revocation.  In practice, this is difficult to
           time correctly because the KEYCTL_REJECT function can only be used
           from the request-key upcall process.  Further, users can only make use
           of what's in /sbin/request-key.conf, though this does including a
           rejection debugging test - which means that the destination keyring
           has to be the caller's session keyring in practice.
      
       (3) Have just enough key quota available to create a key, a new session
           keyring for the upcall and a link in the session keyring, but not then
           sufficient quota to create a link in the nominated destination keyring
           so that it fails with EDQUOT.
      
      The bug can be triggered using option (3) above using something like the
      following:
      
      	echo 80 >/proc/sys/kernel/keys/root_maxbytes
      	keyctl request2 user debug:fred negate @t
      
      The above sets the quota to something much lower (80) to make the bug
      easier to trigger, but this is dependent on the system.  Note also that
      the name of the keyring created contains a random number that may be
      between 1 and 10 characters in size, so may throw the test off by
      changing the amount of quota used.
      
      Assuming the failure occurs, something like the following will be seen:
      
      	kfree_debugcheck: out of range ptr 6b6b6b6b6b6b6b68h
      	------------[ cut here ]------------
      	kernel BUG at ../mm/slab.c:2821!
      	...
      	RIP: 0010:[<ffffffff811600f9>] kfree_debugcheck+0x20/0x25
      	RSP: 0018:ffff8804014a7de8  EFLAGS: 00010092
      	RAX: 0000000000000034 RBX: 6b6b6b6b6b6b6b68 RCX: 0000000000000000
      	RDX: 0000000000040001 RSI: 00000000000000f6 RDI: 0000000000000300
      	RBP: ffff8804014a7df0 R08: 0000000000000001 R09: 0000000000000000
      	R10: ffff8804014a7e68 R11: 0000000000000054 R12: 0000000000000202
      	R13: ffffffff81318a66 R14: 0000000000000000 R15: 0000000000000001
      	...
      	Call Trace:
      	  kfree+0xde/0x1bc
      	  assoc_array_cancel_edit+0x1f/0x36
      	  __key_link_end+0x55/0x63
      	  key_reject_and_link+0x124/0x155
      	  keyctl_reject_key+0xb6/0xe0
      	  keyctl_negate_key+0x10/0x12
      	  SyS_keyctl+0x9f/0xe7
      	  do_syscall_64+0x63/0x13a
      	  entry_SYSCALL64_slow_path+0x25/0x25
      
      Fixes: f70e2e06 ('KEYS: Do preallocation for __key_link()')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8c903c05
    • Brian King's avatar
      SCSI: Increase REPORT_LUNS timeout · e0856a10
      Brian King authored
      commit b39c9a66 upstream.
      
      This patch fixes an issue seen with an IBM 2145 (SVC) where, following an error
      injection test which results in paths going offline, when they came
      back online, the path would timeout the REPORT_LUNS issued during the
      scan. This timeout situation continued until retries were expired, resulting in
      falling back to a sequential LUN scan. Then, since the target responds
      with PQ=1, PDT=0 for all possible LUNs, due to the way the sequential
      LUN scan code works, we end up adding 512 LUNs for each target, when there
      is really only a small handful of LUNs that are actually present.
      
      This patch increases the timeout used on the REPORT_LUNS to 30 seconds.
      This patch solves the issue of 512 non existent LUNs showing up after
      this event.
      Signed-off-by: default avatarBrian King <brking@linux.vnet.ibm.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e0856a10
    • Tony Luck's avatar
      EDAC: Remove arbitrary limit on number of channels · 97f3455a
      Tony Luck authored
      commit c44696ff upstream.
      
      Currently set to "6", but the reset of the code will dynamically
      allocate as needed.  We need to go to "8" today, but drop the check
      completely to save doing this again when we need even larger numbers.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarAristeu Rozanski <aris@redhat.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      97f3455a
    • Kangjie Lu's avatar
      rds: fix an infoleak in rds_inc_info_copy · 3360c517
      Kangjie Lu authored
      commit 4116def2 upstream.
      
      The last field "flags" of object "minfo" is not initialized.
      Copying this object out may leak kernel stack data.
      Assign 0 to it to avoid leak.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3360c517
    • Gavin Shan's avatar
      net/qlge: Avoids recursive EEH error · f3c9e9b1
      Gavin Shan authored
      commit 3275c0c6 upstream.
      
      One timer, whose handler keeps reading on MMIO register for EEH
      core to detect error in time, is started when the PCI device driver
      is loaded. MMIO register can't be accessed during PE reset in EEH
      recovery. Otherwise, the unexpected recursive error is triggered.
      The timer isn't closed that time if the interface isn't brought
      up. So the unexpected recursive error is seen during EEH recovery
      when the interface is down.
      
      This avoids the unexpected recursive EEH error by closing the timer
      in qlge_io_error_detected() before EEH PE reset unconditionally. The
      timer is started unconditionally after EEH PE reset in qlge_io_resume().
      Also, the timer should be closed unconditionally when the device is
      removed from the system permanently in qlge_io_error_detected().
      Reported-by: default avatarShriya R. Kulkarni <shriyakul@in.ibm.com>
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f3c9e9b1
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt · bdb8bc5f
      Kangjie Lu authored
      commit e4ec8cc8 upstream.
      
      The stack object “r1” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bdb8bc5f
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in events via snd_timer_user_ccallback · 640b1f79
      Kangjie Lu authored
      commit 9a47e9cf upstream.
      
      The stack object “r1” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      640b1f79
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS · 16e5f4c6
      Kangjie Lu authored
      commit cec8f96e upstream.
      
      The stack object “tread” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      16e5f4c6
    • Takashi Iwai's avatar
      ALSA: hrtimer: Handle start/stop more properly · 6210a912
      Takashi Iwai authored
      commit d2c5cf88 upstream.
      
      This patch tries to address the still remaining issues in ALSA hrtimer
      driver:
      - Spurious use-after-free was detected in hrtimer callback
      - Incorrect rescheduling due to delayed start
      - WARN_ON() is triggered in hrtimer_forward() invoked in hrtimer
        callback
      
      The first issue happens only when the new timer is scheduled even
      while hrtimer is being closed.  It's related with the second and third
      items; since ALSA timer core invokes hw.start callback during hrtimer
      interrupt, this may result in the explicit call of hrtimer_start().
      
      Also, the similar problem is seen for the stop; ALSA timer core
      invokes hw.stop callback even in the hrtimer handler, too.  Since we
      must not call the synced hrtimer_cancel() in such a context, it's just
      a hrtimer_try_to_cancel() call that doesn't properly work.
      
      Another culprit of the second and third items is the call of
      hrtimer_forward_now() before snd_timer_interrupt().  The timer->stick
      value may change during snd_timer_interrupt() call, but this
      possibility is ignored completely.
      
      For covering these subtle and messy issues, the following changes have
      been done in this patch:
      - A new flag, in_callback, is introduced in the private data to
        indicate that the hrtimer handler is being processed.
      - Both start and stop callbacks skip when called from (during)
        in_callback flag.
      - The hrtimer handler returns properly HRTIMER_RESTART and NORESTART
        depending on the running state now.
      - The hrtimer handler reprograms the expiry properly after
        snd_timer_interrupt() call, instead of before.
      - The close callback clears running flag and sets in_callback flag
        to block any further start/stop calls.
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6210a912
    • Jiri Slaby's avatar
      ktime: export ktime_divns · a882416d
      Jiri Slaby authored
      ktime_divns was exported in upstream as a side-effect of commit
      166afb64 (ktime: Sanitize
      ktime_to_us/ms conversion). But we do not want the commit given ktime
      is not nanoseconds in 3.12 yet.
      
      So we only export the function here as it is needed by upstream commit
      d2c5cf88 (ALSA: hrtimer: Handle
      start/stop more properly):
      ERROR: "ktime_divns" [sound/core/snd-hrtimer.ko] undefined!
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      a882416d
    • Kangjie Lu's avatar
      USB: usbfs: fix potential infoleak in devio · fd0d40b9
      Kangjie Lu authored
      commit 681fef83 upstream.
      
      The stack object “ci” has a total size of 8 bytes. Its last 3 bytes
      are padding bytes which are not initialized and leaked to userland
      via “copy_to_user”.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fd0d40b9
    • daniel's avatar
      Bridge: Fix ipv6 mc snooping if bridge has no ipv6 address · 0da1bf82
      daniel authored
      [ Upstream commit 0888d5f3 ]
      
      The bridge is falsly dropping ipv6 mulitcast packets if there is:
       1. No ipv6 address assigned on the brigde.
       2. No external mld querier present.
       3. The internal querier enabled.
      
      When the bridge fails to build mld queries, because it has no
      ipv6 address, it slilently returns, but keeps the local querier enabled.
      This specific case causes confusing packet loss.
      
      Ipv6 multicast snooping can only work if:
       a) An external querier is present
       OR
       b) The bridge has an ipv6 address an is capable of sending own queries
      
      Otherwise it has to forward/flood the ipv6 multicast traffic,
      because snooping cannot work.
      
      This patch fixes the issue by adding a flag to the bridge struct that
      indicates that there is currently no ipv6 address assinged to the bridge
      and returns a false state for the local querier in
      __br_multicast_querier_exists().
      
      Special thanks to Linus Lüssing.
      
      Fixes: d1d81d4c ("bridge: check return value of ipv6_dev_get_saddr()")
      Signed-off-by: default avatarDaniel Danzberger <daniel@dd-wrt.com>
      Acked-by: default avatarLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0da1bf82
    • James Bottomley's avatar
      scsi_lib: correctly retry failed zero length REQ_TYPE_FS commands · 1558418f
      James Bottomley authored
      commit a621bac3 upstream.
      
      When SCSI was written, all commands coming from the filesystem
      (REQ_TYPE_FS commands) had data.  This meant that our signal for needing
      to complete the command was the number of bytes completed being equal to
      the number of bytes in the request.  Unfortunately, with the advent of
      flush barriers, we can now get zero length REQ_TYPE_FS commands, which
      confuse this logic because they satisfy the condition every time.  This
      means they never get retried even for retryable conditions, like UNIT
      ATTENTION because we complete them early assuming they're done.  Fix
      this by special casing the early completion condition to recognise zero
      length commands with errors and let them drop through to the retry code.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarSebastian Parschauer <s.parschauer@gmx.de>
      Signed-off-by: default avatarJames E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Tested-by: default avatarJack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      [ jwang: backport from upstream 4.7 to fix scsi resize issue ]
      Signed-off-by: default avatarJack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1558418f
    • Christoph Hellwig's avatar
      scsi: remove scsi_end_request · 68503738
      Christoph Hellwig authored
      commit bc85dc50 upstream.
      
      By folding scsi_end_request into its only caller we can significantly clean
      up the completion logic.  We can use simple goto labels now to only have
      a single place to finish or requeue command there instead of the previous
      convoluted logic.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Reviewed-by: default avatarMike Christie <michaelc@cs.wisc.edu>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      [jwang: backport to 3.12]
      Signed-off-by: default avatarJack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      68503738
    • Kirill A. Shutemov's avatar
      UBIFS: Implement ->migratepage() · c6666944
      Kirill A. Shutemov authored
      commit 4ac1c17b upstream.
      
      During page migrations UBIFS might get confused
      and the following assert triggers:
      [  213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
      [  213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
      [  213.490000] Hardware name: Allwinner sun4i/sun5i Families
      [  213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
      [  213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
      [  213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
      [  213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
      [  213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
      [  213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
      [  213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
      [  213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
      [  213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
      [  213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
      [  213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
      [  213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
      [  213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
      [  213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
      [  213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
      [  213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
      [  213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
      [  213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
      [  213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)
      
      UBIFS is using PagePrivate() which can have different meanings across
      filesystems. Therefore the generic page migration code cannot handle this
      case correctly.
      We have to implement our own migration function which basically does a
      plain copy but also duplicates the page private flag.
      UBIFS is not a block device filesystem and cannot use buffer_migrate_page().
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [rw: Massaged changelog, build fixes, etc...]
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c6666944
    • Richard Weinberger's avatar
      mm: Export migrate_page_move_mapping and migrate_page_copy · 0162107f
      Richard Weinberger authored
      commit 1118dce7 upstream.
      
      Export these symbols such that UBIFS can implement
      ->migratepage.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0162107f
    • Will Deacon's avatar
      ARM: 8578/1: mm: ensure pmd_present only checks the valid bit · 87119310
      Will Deacon authored
      commit 62453188 upstream.
      
      In a subsequent patch, pmd_mknotpresent will clear the valid bit of the
      pmd entry, resulting in a not-present entry from the hardware's
      perspective. Unfortunately, pmd_present simply checks for a non-zero pmd
      value and will therefore continue to return true even after a
      pmd_mknotpresent operation. Since pmd_mknotpresent is only used for
      managing huge entries, this is only an issue for the 3-level case.
      
      This patch fixes the 3-level pmd_present implementation to take into
      account the valid bit. For bisectability, the change is made before the
      fix to pmd_mknotpresent.
      
      [catalin.marinas@arm.com: comment update regarding pmd_mknotpresent patch]
      
      Fixes: 8d962507 ("ARM: mm: Transparent huge page support for LPAE systems.")
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steve Capper <Steve.Capper@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      87119310
    • Trond Myklebust's avatar
      NFS: Fix another OPEN_DOWNGRADE bug · 75815edd
      Trond Myklebust authored
      commit e547f262 upstream.
      
      Olga Kornievskaia reports that the following test fails to trigger
      an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
      
      	fd0 = open(foo, RDRW)   -- should be open on the wire for "both"
      	fd1 = open(foo, RDONLY)  -- should be open on the wire for "read"
      	close(fd0) -- should trigger an open_downgrade
      	read(fd1)
      	close(fd1)
      
      The issue is that we're missing a check for whether or not the current
      state transitioned from an O_RDWR state as opposed to having transitioned
      from a combination of O_RDONLY and O_WRONLY.
      Reported-by: default avatarOlga Kornievskaia <aglo@umich.edu>
      Fixes: cd9288ff ("NFSv4: Fix another bug in the close/open_downgrade code")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      75815edd
    • Al Viro's avatar
      make nfs_atomic_open() call d_drop() on all ->open_context() errors. · 11d27536
      Al Viro authored
      commit d20cb71d upstream.
      
      In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
      unconditional d_drop() after the ->open_context() had been removed.  It had
      been correct for success cases (there ->open_context() itself had been doing
      dcache manipulations), but not for error ones.  Only one of those (ENOENT)
      got a compensatory d_drop() added in that commit, but in fact it should've
      been done for all errors.  As it is, the case of O_CREAT non-exclusive open
      on a hashed negative dentry racing with e.g. symlink creation from another
      client ended up with ->open_context() getting an error and proceeding to
      call nfs_lookup().  On a hashed dentry, which would've instantly triggered
      BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
      d_splice_alias()).
      Tested-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      11d27536
    • Borislav Petkov's avatar
      x86/amd_nb: Fix boot crash on non-AMD systems · da252e41
      Borislav Petkov authored
      commit 1ead852d upstream.
      
      Fix boot crash that triggers if this driver is built into a kernel and
      run on non-AMD systems.
      
      AMD northbridges users call amd_cache_northbridges() and it returns
      a negative value to signal that we weren't able to cache/detect any
      northbridges on the system.
      
      At least, it should do so as all its callers expect it to do so. But it
      does return a negative value only when kmalloc() fails.
      
      Fix it to return -ENODEV if there are no NBs cached as otherwise, amd_nb
      users like amd64_edac, for example, which relies on it to know whether
      it should load or not, gets loaded on systems like Intel Xeons where it
      shouldn't.
      Reported-and-tested-by: default avatarTony Battersby <tonyb@cybernetics.com>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1466097230-5333-2-git-send-email-bp@alien8.de
      Link: https://lkml.kernel.org/r/5761BEB0.9000807@cybernetics.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      da252e41
    • Masami Hiramatsu's avatar
      kprobes/x86: Clear TF bit in fault on single-stepping · f7fd9803
      Masami Hiramatsu authored
      commit dcfc4724 upstream.
      
      Fix kprobe_fault_handler() to clear the TF (trap flag) bit of
      the flags register in the case of a fault fixup on single-stepping.
      
      If we put a kprobe on the instruction which caused a
      page fault (e.g. actual mov instructions in copy_user_*),
      that fault happens on the single-stepping buffer. In this
      case, kprobes resets running instance so that the CPU can
      retry execution on the original ip address.
      
      However, current code forgets to reset the TF bit. Since this
      fault happens with TF bit set for enabling single-stepping,
      when it retries, it causes a debug exception and kprobes
      can not handle it because it already reset itself.
      
      On the most of x86-64 platform, it can be easily reproduced
      by using kprobe tracer. E.g.
      
        # cd /sys/kernel/debug/tracing
        # echo p copy_user_enhanced_fast_string+5 > kprobe_events
        # echo 1 > events/kprobes/enable
      
      And you'll see a kernel panic on do_debug(), since the debug
      trap is not handled by kprobes.
      
      To fix this problem, we just need to clear the TF bit when
      resetting running kprobe.
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: default avatarAnanth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: systemtap@sourceware.org
      Link: http://lkml.kernel.org/r/20160611140648.25885.37482.stgit@devbox
      [ Updated the comments. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f7fd9803