1. 13 Feb, 2013 2 commits
    • Cyril Roelandt's avatar
      xen: remove redundant NULL check before unregister_and_remove_pcpu(). · 4f8c8527
      Cyril Roelandt authored
      unregister_and_remove_pcpu on a NULL pointer is a no-op, so the NULL check in
      sync_pcpu can be removed.
      Signed-off-by: default avatarCyril Roelandt <tipecaml@gmail.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      4f8c8527
    • Jan Beulich's avatar
      x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS. · 13d2b4d1
      Jan Beulich authored
      This fixes CVE-2013-0228 / XSA-42
      
      Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user
      in 32bit PV guest can use to crash the > guest with the panic like this:
      
      -------------
      general protection fault: 0000 [#1] SMP
      last sysfs file: /sys/devices/vbd-51712/block/xvda/dev
      Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
      iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
      xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4
      mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last
      unloaded: scsi_wait_scan]
      
      Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1
      EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0
      EIP is at xen_iret+0x12/0x2b
      EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010
      ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0
       DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069
      Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000)
      Stack:
       00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000
      Call Trace:
      Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00
      8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40
      10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02
      EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0
      general protection fault: 0000 [#2]
      ---[ end trace ab0d29a492dcd330 ]---
      Kernel panic - not syncing: Fatal exception
      Pid: 1250, comm: r Tainted: G      D    ---------------
      2.6.32-356.el6.i686 #1
      Call Trace:
       [<c08476df>] ? panic+0x6e/0x122
       [<c084b63c>] ? oops_end+0xbc/0xd0
       [<c084b260>] ? do_general_protection+0x0/0x210
       [<c084a9b7>] ? error_code+0x73/
      -------------
      
      Petr says: "
       I've analysed the bug and I think that xen_iret() cannot cope with
       mangled DS, in this case zeroed out (null selector/descriptor) by either
       xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT
       entry was invalidated by the reproducer. "
      
      Jan took a look at the preliminary patch and came up a fix that solves
      this problem:
      
      "This code gets called after all registers other than those handled by
      IRET got already restored, hence a null selector in %ds or a non-null
      one that got loaded from a code or read-only data descriptor would
      cause a kernel mode fault (with the potential of crashing the kernel
      as a whole, if panic_on_oops is set)."
      
      The way to fix this is to realize that the we can only relay on the
      registers that IRET restores. The two that are guaranteed are the
      %cs and %ss as they are always fixed GDT selectors. Also they are
      inaccessible from user mode - so they cannot be altered. This is
      the approach taken in this patch.
      
      Another alternative option suggested by Jan would be to relay on
      the subtle realization that using the %ebp or %esp relative references uses
      the %ss segment.  In which case we could switch from using %eax to %ebp and
      would not need the %ss over-rides. That would also require one extra
      instruction to compensate for the one place where the register is used
      as scaled index. However Andrew pointed out that is too subtle and if
      further work was to be done in this code-path it could escape folks attention
      and lead to accidents.
      Reviewed-by: default avatarPetr Matousek <pmatouse@redhat.com>
      Reported-by: default avatarPetr Matousek <pmatouse@redhat.com>
      Reviewed-by: default avatarAndrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      13d2b4d1
  2. 06 Feb, 2013 2 commits
  3. 16 Jan, 2013 2 commits
    • Andrew Cooper's avatar
      xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests. · 9174adbe
      Andrew Cooper authored
      This fixes CVE-2013-0190 / XSA-40
      
      There has been an error on the xen_failsafe_callback path for failed
      iret, which causes the stack pointer to be wrong when entering the
      iret_exc error path.  This can result in the kernel crashing.
      
      In the classic kernel case, the relevant code looked a little like:
      
              popl %eax      # Error code from hypervisor
              jz 5f
              addl $16,%esp
              jmp iret_exc   # Hypervisor said iret fault
      5:      addl $16,%esp
                             # Hypervisor said segment selector fault
      
      Here, there are two identical addls on either option of a branch which
      appears to have been optimised by hoisting it above the jz, and
      converting it to an lea, which leaves the flags register unaffected.
      
      In the PVOPS case, the code looks like:
      
              popl_cfi %eax         # Error from the hypervisor
              lea 16(%esp),%esp     # Add $16 before choosing fault path
              CFI_ADJUST_CFA_OFFSET -16
              jz 5f
              addl $16,%esp         # Incorrectly adjust %esp again
              jmp iret_exc
      
      It is possible unprivileged userspace applications to cause this
      behaviour, for example by loading an LDT code selector, then changing
      the code selector to be not-present.  At this point, there is a race
      condition where it is possible for the hypervisor to return back to
      userspace from an interrupt, fault on its own iret, and inject a
      failsafe_callback into the kernel.
      
      This bug has been present since the introduction of Xen PVOPS support
      in commit 5ead97c8 (xen: Core Xen implementation), in 2.6.23.
      Signed-off-by: default avatarFrediano Ziglio <frediano.ziglio@citrix.com>
      Signed-off-by: default avatarAndrew Cooper <andrew.cooper3@citrix.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      9174adbe
    • Konrad Rzeszutek Wilk's avatar
      Revert "xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic." · d55bf532
      Konrad Rzeszutek Wilk authored
      This reverts commit 41bd956d.
      
      The fix is incorrect and not appropiate for the latest kernels.
      In fact it _causes_ the BUG: scheduling while atomic while
      doing vCPU hotplug.
      Suggested-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      d55bf532
  4. 15 Jan, 2013 8 commits
    • Daniel De Graaf's avatar
      xen/gntdev: remove erronous use of copy_to_user · 1affa98d
      Daniel De Graaf authored
      Since there is now a mapping of granted pages in kernel address space in
      both PV and HVM, use it for UNMAP_NOTIFY_CLEAR_BYTE instead of accessing
      memory via copy_to_user and triggering sleep-in-atomic warnings.
      Signed-off-by: default avatarDaniel De Graaf <dgdegra@tycho.nsa.gov>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      1affa98d
    • Daniel De Graaf's avatar
      xen/gntdev: correctly unmap unlinked maps in mmu notifier · 16a1d022
      Daniel De Graaf authored
      If gntdev_ioctl_unmap_grant_ref is called on a range before unmapping
      it, the entry is removed from priv->maps and the later call to
      mn_invl_range_start won't find it to do the unmapping. Fix this by
      creating another list of freeable maps that the mmu notifier can search
      and use to unmap grants.
      Signed-off-by: default avatarDaniel De Graaf <dgdegra@tycho.nsa.gov>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      16a1d022
    • Daniel De Graaf's avatar
      xen/gntdev: fix unsafe vma access · 2512f298
      Daniel De Graaf authored
      In gntdev_ioctl_get_offset_for_vaddr, we need to hold mmap_sem while
      calling find_vma() to avoid potentially having the result freed out from
      under us.  Similarly, the MMU notifier functions need to synchronize with
      gntdev_vma_close to avoid map->vma being freed during their iteration.
      Signed-off-by: default avatarDaniel De Graaf <dgdegra@tycho.nsa.gov>
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      2512f298
    • Andres Lagar-Cavilla's avatar
      xen/privcmd: Fix mmap batch ioctl. · 99beae6c
      Andres Lagar-Cavilla authored
      1. If any individual mapping error happens, the V1 case will mark *all*
      operations as failed. Fixed.
      
      2. The err_array was allocated with kcalloc, resulting in potentially O(n) page
      allocations. Refactor code to not use this array.
      Signed-off-by: default avatarAndres Lagar-Cavilla <andres@lagarcavilla.org>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      99beae6c
    • Konrad Rzeszutek Wilk's avatar
      Merge tag 'v3.7' into stable/for-linus-3.8 · 7bcc1ec0
      Konrad Rzeszutek Wilk authored
      Linux 3.7
      
      * tag 'v3.7': (833 commits)
        Linux 3.7
        Input: matrix-keymap - provide proper module license
        Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage
        ipv4: ip_check_defrag must not modify skb before unsharing
        Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended"
        inet_diag: validate port comparison byte code to prevent unsafe reads
        inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
        inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
        inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
        mm: vmscan: fix inappropriate zone congestion clearing
        vfs: fix O_DIRECT read past end of block device
        net: gro: fix possible panic in skb_gro_receive()
        tcp: bug fix Fast Open client retransmission
        tmpfs: fix shared mempolicy leak
        mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones
        mm: compaction: validate pfn range passed to isolate_freepages_block
        mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
        Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"
        mmc: sdhci-s3c: fix missing clock for gpio card-detect
        lib/Makefile: Fix oid_registry build dependency
        ...
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      
      Conflicts:
      	arch/arm/xen/enlighten.c
      	drivers/xen/Makefile
      
      [We need to have the v3.7 base as the 'for-3.8' was based off v3.7-rc3
      and there are some patches in v3.7-rc6 that we to have in our branch]
      7bcc1ec0
    • Jan Beulich's avatar
      Xen: properly bound buffer access when parsing cpu/*/availability · e5c702d3
      Jan Beulich authored
      At the same time reduce the local buffers to 16 bytes each.
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      e5c702d3
    • Matt Wilson's avatar
      xen/grant-table: correctly initialize grant table version 1 · d0b4d64a
      Matt Wilson authored
      Commit 85ff6acb (xen/granttable: Grant
      tables V2 implementation) changed the GREFS_PER_GRANT_FRAME macro from
      a constant to a conditional expression. The expression depends on
      grant_table_version being appropriately set. Unfortunately, at init
      time grant_table_version will be 0. The GREFS_PER_GRANT_FRAME
      conditional expression checks for "grant_table_version == 1", and
      therefore returns the number of grant references per frame for v2.
      
      This causes gnttab_init() to allocate fewer pages for gnttab_list, as
      a frame can old half the number of v2 entries than v1 entries. After
      gnttab_resume() is called, grant_table_version is appropriately
      set. nr_init_grefs will then be miscalculated and gnttab_free_count
      will hold a value larger than the actual number of free gref entries.
      
      If a guest is heavily utilizing improperly initialized v1 grant
      tables, memory corruption can occur. One common manifestation is
      corruption of the vmalloc list, resulting in a poisoned pointer
      derefrence when accessing /proc/meminfo or /proc/vmallocinfo:
      
      [   40.770064] BUG: unable to handle kernel paging request at 0000200200001407
      [   40.770083] IP: [<ffffffff811a6fb0>] get_vmalloc_info+0x70/0x110
      [   40.770102] PGD 0
      [   40.770107] Oops: 0000 [#1] SMP
      [   40.770114] CPU 10
      
      This patch introduces a static variable, grefs_per_grant_frame, to
      cache the calculated value. gnttab_init() now calls
      gnttab_request_version() early so that grant_table_version and
      grefs_per_grant_frame can be appropriately set. A few BUG_ON()s have
      been added to prevent this type of bug from reoccurring in the future.
      Signed-off-by: default avatarMatt Wilson <msw@amazon.com>
      Reviewed-and-Tested-by: default avatarSteven Noonan <snoonan@amazon.com>
      Acked-by: default avatarIan Campbell <Ian.Campbell@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Annie Li <annie.li@oracle.com>
      Cc: xen-devel@lists.xen.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org # v3.3 and newer
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      d0b4d64a
    • Yang Zhang's avatar
      x86/xen : Fix the wrong check in pciback · 6337a239
      Yang Zhang authored
      Fix the wrong check in pciback.
      Signed-off-by: default avatarYang Zhang <yang.z.zhang@Intel.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      6337a239
  5. 11 Jan, 2013 1 commit
  6. 18 Dec, 2012 3 commits
  7. 11 Dec, 2012 3 commits
    • Linus Torvalds's avatar
      Linux 3.7 · 29594404
      Linus Torvalds authored
      29594404
    • Florian Fainelli's avatar
      Input: matrix-keymap - provide proper module license · 55220bb3
      Florian Fainelli authored
      The matrix-keymap module is currently lacking a proper module license,
      add one so we don't have this module tainting the entire kernel.  This
      issue has been present since commit 1932811f ("Input: matrix-keymap
      - uninline and prepare for device tree support")
      Signed-off-by: default avatarFlorian Fainelli <florian@openwrt.org>
      CC: stable@vger.kernel.org # v3.5+
      Signed-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55220bb3
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 2c68bc72
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Netlink socket dumping had several missing verifications and checks.
      
          In particular, address comparisons in the request byte code
          interpreter could access past the end of the address in the
          inet_request_sock.
      
          Also, address family and address prefix lengths were not validated
          properly at all.
      
          This means arbitrary applications can read past the end of certain
          kernel data structures.
      
          Fixes from Neal Cardwell.
      
       2) ip_check_defrag() operates in contexts where we're in the process
          of, or about to, input the packet into the real protocols
          (specifically macvlan and AF_PACKET snooping).
      
          Unfortunately, it does a pskb_may_pull() which can modify the
          backing packet data which is not legal if the SKB is shared.  It
          very much can be shared in this context.
      
          Deal with the possibility that the SKB is segmented by using
          skb_copy_bits().
      
          Fix from Johannes Berg based upon a report by Eric Leblond.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        ipv4: ip_check_defrag must not modify skb before unsharing
        inet_diag: validate port comparison byte code to prevent unsafe reads
        inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
        inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
        inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
      2c68bc72
  8. 10 Dec, 2012 4 commits
    • Linus Torvalds's avatar
      Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage · caf49191
      Linus Torvalds authored
      This reverts commits a5091539 and
      d7c3b937.
      
      This is a revert of a revert of a revert.  In addition, it reverts the
      even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
      original commits in linux-next.
      
      It turns out that the original patch really was bogus, and that the
      original revert was the correct thing to do after all.  We thought we
      had fixed the problem, and then reverted the revert, but the problem
      really is fundamental: waking up kswapd simply isn't the right thing to
      do, and direct reclaim sometimes simply _is_ the right thing to do.
      
      When certain allocations fail, we simply should try some direct reclaim,
      and if that fails, fail the allocation.  That's the right thing to do
      for THP allocations, which can easily fail, and the GPU allocations want
      to do that too.
      
      So starting kswapd is sometimes simply wrong, and removing the flag that
      said "don't start kswapd" was a mistake.  Let's hope we never revisit
      this mistake again - and certainly not this many times ;)
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      caf49191
    • Johannes Berg's avatar
      ipv4: ip_check_defrag must not modify skb before unsharing · 1bf3751e
      Johannes Berg authored
      ip_check_defrag() might be called from af_packet within the
      RX path where shared SKBs are used, so it must not modify
      the input SKB before it has unshared it for defragmentation.
      Use skb_copy_bits() to get the IP header and only pull in
      everything later.
      
      The same is true for the other caller in macvlan as it is
      called from dev->rx_handler which can also get a shared SKB.
      Reported-by: default avatarEric Leblond <eric@regit.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1bf3751e
    • Linus Torvalds's avatar
      Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" · 31f8d42d
      Linus Torvalds authored
      This reverts commit 782fd304.
      
      We are going to reinstate the __GFP_NO_KSWAPD flag that has been
      removed, the removal reverted, and then removed again.  Making this
      commit a pointless fixup for a problem that was caused by the removal of
      __GFP_NO_KSWAPD flag.
      
      The thing is, we really don't want to wake up kswapd for THP allocations
      (because they fail quite commonly under any kind of memory pressure,
      including when there is tons of memory free), and these patches were
      just trying to fix up the underlying bug: the original removal of
      __GFP_NO_KSWAPD in commit c6543459 ("mm: remove __GFP_NO_KSWAPD")
      was simply bogus.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31f8d42d
    • Neal Cardwell's avatar
      inet_diag: validate port comparison byte code to prevent unsafe reads · 5e1f5420
      Neal Cardwell authored
      Add logic to verify that a port comparison byte code operation
      actually has the second inet_diag_bc_op from which we read the port
      for such operations.
      
      Previously the code blindly referenced op[1] without first checking
      whether a second inet_diag_bc_op struct could fit there. So a
      malicious user could make the kernel read 4 bytes beyond the end of
      the bytecode array by claiming to have a whole port comparison byte
      code (2 inet_diag_bc_op structs) when in fact the bytecode was not
      long enough to hold both.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e1f5420
  9. 09 Dec, 2012 3 commits
    • Neal Cardwell's avatar
      inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() · f67caec9
      Neal Cardwell authored
      Add logic to check the address family of the user-supplied conditional
      and the address family of the connection entry. We now do not do
      prefix matching of addresses from different address families (AF_INET
      vs AF_INET6), except for the previously existing support for having an
      IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
      maintains as-is).
      
      This change is needed for two reasons:
      
      (1) The addresses are different lengths, so comparing a 128-bit IPv6
      prefix match condition to a 32-bit IPv4 connection address can cause
      us to unwittingly walk off the end of the IPv4 address and read
      garbage or oops.
      
      (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
      simple bit-wise comparison of the prefixes is not meaningful, and
      would lead to bogus results (except for the IPv4-mapped IPv6 case,
      which this commit maintains).
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f67caec9
    • Neal Cardwell's avatar
      inet_diag: validate byte code to prevent oops in inet_diag_bc_run() · 405c0059
      Neal Cardwell authored
      Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
      operations.
      
      Previously we did not validate the inet_diag_hostcond, address family,
      address length, and prefix length. So a malicious user could make the
      kernel read beyond the end of the bytecode array by claiming to have a
      whole inet_diag_hostcond when the bytecode was not long enough to
      contain a whole inet_diag_hostcond of the given address family. Or
      they could make the kernel read up to about 27 bytes beyond the end of
      a connection address by passing a prefix length that exceeded the
      length of addresses of the given family.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      405c0059
    • Neal Cardwell's avatar
      inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state · 1c95df85
      Neal Cardwell authored
      Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
      instantiated for IPv4 traffic and in the SYN-RECV state were actually
      created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
      means that for such connections inet6_rsk(req) returns a pointer to a
      random spot in memory up to roughly 64KB beyond the end of the
      request_sock.
      
      With this bug, for a server using AF_INET6 TCP sockets and serving
      IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
      inet_diag_fill_req() causing an oops or the export to user space of 16
      bytes of kernel memory as a garbage IPv6 address, depending on where
      the garbage inet6_rsk(req) pointed.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1c95df85
  10. 08 Dec, 2012 3 commits
    • Johannes Weiner's avatar
      mm: vmscan: fix inappropriate zone congestion clearing · ed23ec4f
      Johannes Weiner authored
      commit c702418f ("mm: vmscan: do not keep kswapd looping forever due
      to individual uncompactable zones") removed zone watermark checks from
      the compaction code in kswapd but left in the zone congestion clearing,
      which now happens unconditionally on higher order reclaim.
      
      This messes up the reclaim throttling logic for zones with
      dirty/writeback pages, where zones should only lose their congestion
      status when their watermarks have been restored.
      
      Remove the clearing from the zone compaction section entirely.  The
      preliminary zone check and the reclaim loop in kswapd will clear it if
      the zone is considered balanced.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed23ec4f
    • Linus Torvalds's avatar
      vfs: fix O_DIRECT read past end of block device · 684c9aae
      Linus Torvalds authored
      The direct-IO write path already had the i_size checks in mm/filemap.c,
      but it turns out the read path did not, and removing the block size
      checks in fs/block_dev.c (commit bbec0270: "blkdev_max_block: make
      private to fs/buffer.c") removed the magic "shrink IO to past the end of
      the device" code there.
      
      Fix it by truncating the IO to the size of the block device, like the
      write path already does.
      
      NOTE! I suspect the write path would be *much* better off doing it this
      way in fs/block_dev.c, rather than hidden deep in mm/filemap.c.  The
      mm/filemap.c code is extremely hard to follow, and has various
      conditionals on the target being a block device (ie the flag passed in
      to 'generic_write_checks()', along with a conditional update of the
      inode timestamp etc).
      
      It is also quite possible that we should treat this whole block device
      size as a "s_maxbytes" issue, and try to make the logic even more
      generic.  However, in the meantime this is the fairly minimal targeted
      fix.
      
      Noted by Milan Broz thanks to a regression test for the cryptsetup
      reencrypt tool.
      Reported-and-tested-by: default avatarMilan Broz <mbroz@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      684c9aae
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1b3c393c
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "Two stragglers:
      
         1) The new code that adds new flushing semantics to GRO can cause SKB
            pointer list corruption, manage the lists differently to avoid the
            OOPS.  Fix from Eric Dumazet.
      
         2) When TCP fast open does a retransmit of data in a SYN-ACK or
            similar, we update retransmit state that we shouldn't triggering a
            WARN_ON later.  Fix from Yuchung Cheng."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: gro: fix possible panic in skb_gro_receive()
        tcp: bug fix Fast Open client retransmission
      1b3c393c
  11. 07 Dec, 2012 3 commits
    • Eric Dumazet's avatar
      net: gro: fix possible panic in skb_gro_receive() · c3c7c254
      Eric Dumazet authored
      commit 2e71a6f8 (net: gro: selective flush of packets) added
      a bug for skbs using frag_list. This part of the GRO stack is rarely
      used, as it needs skb not using a page fragment for their skb->head.
      
      Most drivers do use a page fragment, but some of them use GFP_KERNEL
      allocations for the initial fill of their RX ring buffer.
      
      napi_gro_flush() overwrite skb->prev that was used for these skb to
      point to the last skb in frag_list.
      
      Fix this using a separate field in struct napi_gro_cb to point to the
      last fragment.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3c7c254
    • Yuchung Cheng's avatar
      tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng authored
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments lost recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and get 3 dupacks.
      
      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retranmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93b174ad
    • Linus Torvalds's avatar
      Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc · 1afa4717
      Linus Torvalds authored
      Pull MMC fixes from Chris Ball:
       "Two small regression fixes:
      
         - sdhci-s3c: Fix runtime PM regression against 3.7-rc1
         - sh-mmcif: Fix oops against 3.6"
      
      * tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
        mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
        Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"
        mmc: sdhci-s3c: fix missing clock for gpio card-detect
      1afa4717
  12. 06 Dec, 2012 6 commits