1. 11 Dec, 2012 24 commits
    • f2fs: adjust kernel coding style · 0a8165d7
      Jaegeuk Kim authored
      As pointed out by Randy Dunlap, this patch removes all usage of "/**" for comment
      blocks. Instead, just use "/*".
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: fix endian conversion bugs reported by sparse · 25ca923b
      Jaegeuk Kim authored
      This patch should resolve the bugs reported by the sparse tool.
      Initial reports were written by "kbuild test robot" managed by fengguang.wu.
      
      On my local machines, I also tested by running:
      > make C=2 CF="-D__CHECK_ENDIAN__"

      This turned up many warnings and bugs related to endian conversion, and I
      have fixed all of them in this patch.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
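      The class of bug that the sparse endian check catches is an on-disk
      little-endian field being used as if it were a CPU-native integer. Below
      is a minimal userspace sketch of explicit conversion at the disk boundary.
      The names demo_disk_sb, demo_le32_to_cpu and demo_cpu_to_le32 are
      hypothetical and are not the kernel helpers or f2fs structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record: every field is stored little-endian. */
struct demo_disk_sb {
	uint8_t magic[4];
	uint8_t block_count[4];
};

/* Decode a little-endian 32-bit field; the result is the same on any host. */
static uint32_t demo_le32_to_cpu(const uint8_t *b)
{
	assert(b != NULL);
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Encode a CPU-native value into a little-endian on-disk field. */
static void demo_cpu_to_le32(uint32_t v, uint8_t *b)
{
	assert(b != NULL);
	b[0] = (uint8_t)(v & 0xff);
	b[1] = (uint8_t)((v >> 8) & 0xff);
	b[2] = (uint8_t)((v >> 16) & 0xff);
	b[3] = (uint8_t)((v >> 24) & 0xff);
}
```

      Keeping the raw bytes and the decoded value in different types is what
      lets a checker (or a reviewer) spot a missing conversion.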
    • f2fs: remove unneeded version.h header file from f2fs.h · cf0e3a64
      Sachin Kamat authored
      Including <linux/version.h> is not necessary.
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
    • f2fs: update the f2fs document · 5bb446a2
      Jaegeuk Kim authored
      I moved the f2fs-tools.git into kernel.org.
      And I added a new mailing list, linux-f2fs-devel@lists.sourceforge.net.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: update Kconfig and Makefile · a14d5393
      Jaegeuk Kim authored
      This adds Makefile and Kconfig for f2fs, and updates Makefile and Kconfig files
      in the fs directory.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: move proc files to debugfs · 902829aa
      Greg Kroah-Hartman authored
      This moves all of the f2fs debugging files into debugfs. The files are
      located in /sys/kernel/debug/f2fs/
      
      Note, I think we are generating all of the same information in each of
      the files for every unique f2fs filesystem in the machine.  This copies
      the functionality that was present in the proc files, but this should be
      fixed up in the future.
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      [jaegeuk.kim@samsung.com: merged 3 debugfs entries into a *status* entry]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add recovery routines for roll-forward · d624c96f
      Jaegeuk Kim authored
      This adds roll-forward routines to recover fsynced data.
      
      - F2FS basically uses a roll-back model with checkpointing.
      
      - In order to implement fsync(), there are two approaches as follows.
      
      1. A roll-back model with checkpointing at every fsync()
       : This is a naive method, but suffers from very low performance.
      
      2. A roll-forward model
       : F2FS adopts this model, in which all fsynced data written after the last
         checkpoint should be recovered. To identify that data, F2FS keeps an
         "fsync" mark in direct node blocks. In addition, each direct node block
         records the location of the next node block, so that the chain of node
         blocks can be reconstructed during recovery.
      
      - To improve performance, F2FS also keeps a "dentry" mark in direct node
        blocks. If it is set during recovery, F2FS replays adding a dentry.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
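      The roll-forward walk described above can be sketched in userspace as
      follows. This is a simplified model, not the f2fs recovery code: the
      struct layout, the written_after_cp flag, and every demo_ name are
      assumptions made for illustration. It follows the next-block chain
      from the first node block written after the checkpoint and counts the
      blocks carrying the "fsync" and "dentry" marks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NULL_ADDR ((size_t)-1)

/* Hypothetical view of a direct node block, reduced to recovery fields. */
struct demo_node_blk {
	bool written_after_cp; /* belongs to the log written after the checkpoint */
	bool fsync_mark;       /* block was written by fsync() */
	bool dentry_mark;      /* a dentry must be re-added during recovery */
	size_t next_blkaddr;   /* location of the next node block in the chain */
};

struct demo_recovery_stats {
	size_t fsynced;
	size_t dentries_replayed;
};

/* Walk the node chain from 'start' and collect what has to be replayed. */
static struct demo_recovery_stats
demo_roll_forward(const struct demo_node_blk *blks, size_t nr, size_t start)
{
	struct demo_recovery_stats st = { 0, 0 };
	size_t addr = start;
	size_t steps = 0;

	assert(blks != NULL);
	/* 'steps' bounds the walk so a corrupted, cyclic chain terminates. */
	while (addr != DEMO_NULL_ADDR && addr < nr && steps++ < nr) {
		const struct demo_node_blk *b = &blks[addr];

		if (!b->written_after_cp)
			break; /* reached the end of the post-checkpoint chain */
		if (b->fsync_mark) {
			st.fsynced++;
			if (b->dentry_mark)
				st.dentries_replayed++;
		}
		addr = b->next_blkaddr;
	}
	return st;
}
```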
    • f2fs: add garbage collection functions · 7bc09003
      Jaegeuk Kim authored
      This adds on-demand and background cleaning functions.
      
      - The basic background cleaning policy is to do as much cleaning as possible
        whenever the system is idle. Once a round of background cleaning is done,
        the cleaner sleeps for a while so as not to interfere with VFS calls. The
        sleep time is adjusted dynamically according to the status of all segments;
        it is decreased when the following conditions are satisfied.

        . GC is not currently being conducted, and
        . the IO subsystem is idle, judged by the number of requests in the bdev's
          request list, and
        . there are enough dirty segments.

        Otherwise, the time is increased incrementally up to the maximum time.
        Note that the min and max times are 10 secs and 30 secs by default.
      
      - F2FS adopts a default victim selection policy where background cleaning uses
        a cost-benefit algorithm, while on-demand cleaning uses a greedy algorithm.
      
      - The method of moving data during the cleaning is slightly different between
        background and on-demand cleaning schemes. In the case of background cleaning,
        F2FS loads the data, and marks them as dirty. Then, F2FS expects that the data
        will be moved by flusher or VM. In the case of on-demand cleaning, F2FS should
        move the data right away.
      
      - In order to identify valid blocks in a victim segment, F2FS scans the bitmap
        of the segment managed as an SIT entry.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
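      A rough sketch of the two policies described above: the adaptive sleep
      time of the background cleaner, and greedy versus cost-benefit victim
      selection. The 10 s and 30 s bounds come from the text; the 5 s step,
      the age field, and the classic LFS score age * (1 - u) / (1 + u) are
      assumptions and may differ from what f2fs actually computes. All demo_
      names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MIN_SLEEP_MS 10000u /* default minimum from the text */
#define DEMO_MAX_SLEEP_MS 30000u /* default maximum from the text */
#define DEMO_STEP_MS       5000u /* assumed adjustment step */

/* Shorten the sleep when all three conditions hold, lengthen it otherwise. */
static unsigned int demo_next_sleep(unsigned int cur, bool gc_running,
				    bool io_idle, bool enough_dirty)
{
	if (!gc_running && io_idle && enough_dirty) {
		if (cur < DEMO_MIN_SLEEP_MS + DEMO_STEP_MS)
			return DEMO_MIN_SLEEP_MS;
		return cur - DEMO_STEP_MS;
	}
	if (cur > DEMO_MAX_SLEEP_MS - DEMO_STEP_MS)
		return DEMO_MAX_SLEEP_MS;
	return cur + DEMO_STEP_MS;
}

struct demo_segment {
	unsigned int valid_blocks; /* blocks that would have to be moved */
	unsigned int age;          /* assumed: time since last modification */
};

/* Greedy (on-demand cleaning): pick the segment with the fewest valid blocks. */
static size_t demo_pick_greedy(const struct demo_segment *s, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (s[i].valid_blocks < s[best].valid_blocks)
			best = i;
	return best;
}

/* Cost-benefit (background cleaning): maximize age * (1 - u) / (1 + u). */
static size_t demo_pick_cost_benefit(const struct demo_segment *s, size_t n,
				     unsigned int blocks_per_seg)
{
	size_t best = 0;
	double best_score = -1.0;

	assert(n > 0 && blocks_per_seg > 0);
	for (size_t i = 0; i < n; i++) {
		double u = (double)s[i].valid_blocks / blocks_per_seg;
		double score = s[i].age * (1.0 - u) / (1.0 + u);

		if (score > best_score) {
			best_score = score;
			best = i;
		}
	}
	return best;
}
```

      Greedy minimizes the immediate copy cost, while cost-benefit is willing
      to move more blocks out of an old, cold segment.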
    • f2fs: add xattr and acl functionalities · af48b85b
      Jaegeuk Kim authored
      This implements xattr and acl functionalities.

      - F2FS uses a node page to hold extended attributes.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add core directory operations · 6b4ea016
      Jaegeuk Kim authored
      This adds core functions to find, add, delete, and link dentries.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add inode operations for special inodes · 57397d86
      Jaegeuk Kim authored
      This adds inode operations for directory, symlink, and special inodes.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add core inode operations · 19f99cee
      Jaegeuk Kim authored
      This adds core functions to get, read, write, and evict an inode.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add address space operations for data · eb47b800
      Jaegeuk Kim authored
      This adds address space operations for data.

      - F2FS supports readpages(), writepages(), and direct_IO().

      - Because of out-of-place writes, f2fs_direct_IO() does not write data in place.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add file operations · fbfa2cc5
      Jaegeuk Kim authored
      This adds memory operations and file/file_inode operations.

      - F2FS supports fallocate(), mmap(), fsync(), and basic ioctl().
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add segment operations · 351df4b2
      Jaegeuk Kim authored
      This adds specific functions not only to manage dirty/free segments, SIT pages,
      a cache for SIT entries, and summary entries, but also to allocate free blocks
      and write three types of pages: data, node, and meta.
      
      - F2FS maintains three types of bitmaps in memory, which indicate free, prefree,
        and dirty segments respectively.
      
      - The key information of an SIT entry consists of a segment number, the number
        of valid blocks in the segment, and a bitmap identifying which blocks in
        the segment are valid.
      
      - An SIT page is composed of a certain range of SIT entries, which is maintained
        by the address space of meta_inode.
      
      - To cache SIT entries, a simple array is used. The index for the array is the
        segment number.
      
      - A summary entry for data contains the parent node information. A summary entry
        for node contains its node offset from the inode.
      
      - F2FS manages information about six active logs and those summary entries in
        memory. Whenever one of them is changed, its summary entries are flushed to
        its SIT page maintained by the address space of meta_inode.
      
      - This patch adds a default block allocation function which supports heap-based
        allocation policy.
      
      - This patch adds core functions to write data, node, and meta pages. Since LFS
        basically produces a series of sequential writes, F2FS merges sequential bios
        with a single one as much as possible to reduce the IO scheduling overhead.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
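      As an illustration of the SIT entry described above (segment number,
      valid block count, validity bitmap), here is a small userspace sketch.
      The 512-block segment size, the bit order within a byte, and all demo_
      names are assumptions; this is not the on-disk f2fs layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BLKS_PER_SEG 512u /* assumed segment size in blocks */
#define DEMO_BITMAP_BYTES (DEMO_BLKS_PER_SEG / 8)

/* Hypothetical in-memory SIT entry. */
struct demo_sit_entry {
	uint32_t segno;
	uint16_t valid_blocks;
	uint8_t valid_map[DEMO_BITMAP_BYTES];
};

/* Test the validity bit of the block at offset 'off' within the segment. */
static bool demo_block_is_valid(const struct demo_sit_entry *se, unsigned int off)
{
	assert(off < DEMO_BLKS_PER_SEG);
	return ((se->valid_map[off / 8] >> (off % 8)) & 1u) != 0;
}

/* Flip a block's validity and keep valid_blocks consistent with the bitmap. */
static void demo_mark_block(struct demo_sit_entry *se, unsigned int off, bool valid)
{
	if (demo_block_is_valid(se, off) == valid)
		return; /* nothing changes, so the counter must not move */
	if (valid) {
		se->valid_map[off / 8] |= (uint8_t)(1u << (off % 8));
		se->valid_blocks++;
	} else {
		se->valid_map[off / 8] &= (uint8_t)~(1u << (off % 8));
		se->valid_blocks--;
	}
}
```

      A cleaner can then find the blocks it must move by scanning valid_map,
      and rank segments by valid_blocks without touching the bitmap at all.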
    • f2fs: add node operations · e05df3b1
      Jaegeuk Kim authored
      This adds specific functions to manage NAT pages, a cache for NAT entries, free
      nids, direct/indirect node blocks for indexing data, and address space for node
      pages.
      
      - The key information of an NAT entry consists of a node id and a block address.
      
      - An NAT page is composed of block addresses covered by a certain range of NAT
        entries, which is maintained by the address space of meta_inode.
      
      - A radix tree structure is used to cache NAT entries. The index for the tree
        is a node id.
      
      - When there is no free nid, F2FS has to scan NAT entries to find a new one. To
        avoid scanning frequently, F2FS keeps a list of free nids in memory. The
        scanning process, build_free_nids(), is triggered only when the free nids
        in the list are exhausted.

      - F2FS has direct and indirect node blocks for indexing data. This patch adds
        functions related to node block management, such as getting, allocating, and
        truncating node blocks to index data.
      
      - In order to cache node blocks in memory, F2FS has a node_inode with an address
        space for node pages. This patch also adds the address space operations for
        node_inode.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
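      The free nid list described above amortizes NAT scans: a scan refills a
      small in-memory list, and allocation only rescans once that list is
      empty. A minimal sketch, assuming a flat array as the NAT, a list of
      four entries, and block address 0 meaning "unused"; the demo_ names are
      hypothetical and this is not the f2fs build_free_nids() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_NIDS       16u
#define DEMO_FREE_LIST_MAX  4u
#define DEMO_NULL_BLKADDR   0u /* assumed marker for an unused NAT entry */

struct demo_nm {
	uint32_t nat[DEMO_NR_NIDS];             /* nid -> block address */
	uint32_t free_nids[DEMO_FREE_LIST_MAX]; /* cached free nids */
	size_t nr_free;
	unsigned int scans;                     /* number of NAT scans so far */
};

/* Scan the NAT and refill the in-memory free nid list. */
static void demo_build_free_nids(struct demo_nm *nm)
{
	nm->scans++;
	for (uint32_t nid = 0; nid < DEMO_NR_NIDS &&
			       nm->nr_free < DEMO_FREE_LIST_MAX; nid++)
		if (nm->nat[nid] == DEMO_NULL_BLKADDR)
			nm->free_nids[nm->nr_free++] = nid;
}

/* Take a nid from the list; scan only when the list is exhausted. */
static int demo_alloc_nid(struct demo_nm *nm, uint32_t blkaddr, uint32_t *nid)
{
	assert(blkaddr != DEMO_NULL_BLKADDR);
	if (nm->nr_free == 0)
		demo_build_free_nids(nm);
	if (nm->nr_free == 0)
		return -1; /* the NAT has no free nid left */
	*nid = nm->free_nids[--nm->nr_free];
	nm->nat[*nid] = blkaddr; /* the nid is now in use */
	return 0;
}
```

      With a list of four entries, five allocations cost two scans instead of
      five.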
    • f2fs: add checkpoint operations · 127e670a
      Jaegeuk Kim authored
      This adds functions required by the checkpoint operations.
      
      Basically, f2fs adopts a roll-back model with checkpoint blocks written in the
      CP area. The checkpoint procedure is as follows.
      
      - write_checkpoint()
      1. block_operations() freezes VFS calls.
      2. submit cached bios.
      3. flush_nat_entries() writes NAT pages updated by dirty NAT entries.
      4. flush_sit_entries() writes SIT pages updated by dirty SIT entries.
      5. do_checkpoint() writes,
        - checkpoint block (#0)
        - orphan inode blocks
        - summary blocks made by active logs
        - checkpoint block (copy of #0)
      6. unblock_operations()
      
      In order to provide an address space for meta pages, f2fs_sb_info has a special
      inode, namely meta_inode. This patch also adds the address space operations for
      meta_inode.
      Signed-off-by: Chul Lee <chur.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
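      The ordering of the checkpoint procedure listed above can be sketched
      as a trace of steps. This models only the sequence; every demo_ name is
      hypothetical and nothing here performs IO or mirrors the real function
      signatures.

```c
#include <assert.h>
#include <stddef.h>

enum demo_step {
	DEMO_BLOCK_OPS,      /* 1. freeze VFS calls */
	DEMO_SUBMIT_BIOS,    /* 2. submit cached bios */
	DEMO_FLUSH_NAT,      /* 3. write NAT pages for dirty NAT entries */
	DEMO_FLUSH_SIT,      /* 4. write SIT pages for dirty SIT entries */
	DEMO_CP_BLOCK_0,     /* 5. checkpoint block (#0) */
	DEMO_ORPHAN_BLOCKS,  /*    orphan inode blocks */
	DEMO_SUMMARY_BLOCKS, /*    summary blocks made by active logs */
	DEMO_CP_BLOCK_COPY,  /*    checkpoint block (copy of #0) */
	DEMO_UNBLOCK_OPS     /* 6. resume VFS calls */
};

#define DEMO_MAX_STEPS 16

struct demo_trace {
	enum demo_step steps[DEMO_MAX_STEPS];
	size_t n;
};

static void demo_record(struct demo_trace *t, enum demo_step s)
{
	assert(t->n < DEMO_MAX_STEPS);
	t->steps[t->n++] = s;
}

/* Step 5: the checkpoint block is written first and again as a copy last. */
static void demo_do_checkpoint(struct demo_trace *t)
{
	demo_record(t, DEMO_CP_BLOCK_0);
	demo_record(t, DEMO_ORPHAN_BLOCKS);
	demo_record(t, DEMO_SUMMARY_BLOCKS);
	demo_record(t, DEMO_CP_BLOCK_COPY);
}

static void demo_write_checkpoint(struct demo_trace *t)
{
	demo_record(t, DEMO_BLOCK_OPS);
	demo_record(t, DEMO_SUBMIT_BIOS);
	demo_record(t, DEMO_FLUSH_NAT);
	demo_record(t, DEMO_FLUSH_SIT);
	demo_do_checkpoint(t);
	demo_record(t, DEMO_UNBLOCK_OPS);
}
```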
    • f2fs: add super block operations · aff063e2
      Jaegeuk Kim authored
      This adds the implementation of superblock operations for f2fs, which includes
      - init_f2fs_fs/exit_f2fs_fs
      - f2fs_mount
      - super_operations of f2fs
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add superblock and major in-memory structure · 39a53e0c
      Jaegeuk Kim authored
      This adds the following major in-memory structures in f2fs.
      
      - f2fs_sb_info:
        contains f2fs-specific information, two special inode pointers for node and
        meta address spaces, and orphan inode management.
      
      - f2fs_inode_info:
        contains vfs_inode and other fs-specific information.
      
      - f2fs_nm_info:
        contains node manager information such as NAT entry cache, free nid list,
        and NAT page management.
      
      - f2fs_node_info:
        represents a node as node id, inode number, block address, and its version.
      
      - f2fs_sm_info:
        contains segment manager information such as SIT entry cache, free segment
        map, current active logs, dirty segment management, and segment utilization.
        The specific structures are sit_info, free_segmap_info, dirty_seglist_info,
        curseg_info.
      
      In addition, add F2FS_SUPER_MAGIC in magic.h.
      Signed-off-by: Chul Lee <chur.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add on-disk layout · dd31866b
      Jaegeuk Kim authored
      This adds a header file describing the on-disk layout of f2fs.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Chul Lee <chur.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add document · 98e4da8c
      Jaegeuk Kim authored
      This adds a document describing the mount options, proc entries, usage, and
      design of Flash-Friendly File System, namely F2FS.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • Linux 3.7 · 29594404
      Linus Torvalds authored
    • Input: matrix-keymap - provide proper module license · 55220bb3
      Florian Fainelli authored
      The matrix-keymap module is currently lacking a proper module license,
      add one so we don't have this module tainting the entire kernel.  This
      issue has been present since commit 1932811f ("Input: matrix-keymap
      - uninline and prepare for device tree support")
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      CC: stable@vger.kernel.org # v3.5+
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 2c68bc72
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Netlink socket dumping had several missing verifications and checks.
      
          In particular, address comparisons in the request byte code
          interpreter could access past the end of the address in the
          inet_request_sock.
      
          Also, address family and address prefix lengths were not validated
          properly at all.
      
          This means arbitrary applications can read past the end of certain
          kernel data structures.
      
          Fixes from Neal Cardwell.
      
       2) ip_check_defrag() operates in contexts where we're in the process
          of, or about to, input the packet into the real protocols
          (specifically macvlan and AF_PACKET snooping).
      
          Unfortunately, it does a pskb_may_pull() which can modify the
          backing packet data which is not legal if the SKB is shared.  It
          very much can be shared in this context.
      
          Deal with the possibility that the SKB is segmented by using
          skb_copy_bits().
      
          Fix from Johannes Berg based upon a report by Eric Leblond.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        ipv4: ip_check_defrag must not modify skb before unsharing
        inet_diag: validate port comparison byte code to prevent unsafe reads
        inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
        inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
        inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
  2. 10 Dec, 2012 4 commits
    • Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage · caf49191
      Linus Torvalds authored
      This reverts commits a5091539 and
      d7c3b937.
      
      This is a revert of a revert of a revert.  In addition, it reverts the
      even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
      original commits in linux-next.
      
      It turns out that the original patch really was bogus, and that the
      original revert was the correct thing to do after all.  We thought we
      had fixed the problem, and then reverted the revert, but the problem
      really is fundamental: waking up kswapd simply isn't the right thing to
      do, and direct reclaim sometimes simply _is_ the right thing to do.
      
      When certain allocations fail, we simply should try some direct reclaim,
      and if that fails, fail the allocation.  That's the right thing to do
      for THP allocations, which can easily fail, and the GPU allocations want
      to do that too.
      
      So starting kswapd is sometimes simply wrong, and removing the flag that
      said "don't start kswapd" was a mistake.  Let's hope we never revisit
      this mistake again - and certainly not this many times ;)
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipv4: ip_check_defrag must not modify skb before unsharing · 1bf3751e
      Johannes Berg authored
      ip_check_defrag() might be called from af_packet within the
      RX path where shared SKBs are used, so it must not modify
      the input SKB before it has unshared it for defragmentation.
      Use skb_copy_bits() to get the IP header and only pull in
      everything later.
      
      The same is true for the other caller in macvlan as it is
      called from dev->rx_handler which can also get a shared SKB.
      Reported-by: Eric Leblond <eric@regit.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" · 31f8d42d
      Linus Torvalds authored
      This reverts commit 782fd304.
      
      We are going to reinstate the __GFP_NO_KSWAPD flag that has been
      removed, the removal reverted, and then removed again, making this
      commit a pointless fixup for a problem that was caused by the removal of
      the __GFP_NO_KSWAPD flag.
      
      The thing is, we really don't want to wake up kswapd for THP allocations
      (because they fail quite commonly under any kind of memory pressure,
      including when there is tons of memory free), and these patches were
      just trying to fix up the underlying bug: the original removal of
      __GFP_NO_KSWAPD in commit c6543459 ("mm: remove __GFP_NO_KSWAPD")
      was simply bogus.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • inet_diag: validate port comparison byte code to prevent unsafe reads · 5e1f5420
      Neal Cardwell authored
      Add logic to verify that a port comparison byte code operation
      actually has the second inet_diag_bc_op from which we read the port
      for such operations.
      
      Previously the code blindly referenced op[1] without first checking
      whether a second inet_diag_bc_op struct could fit there. So a
      malicious user could make the kernel read 4 bytes beyond the end of
      the bytecode array by claiming to have a whole port comparison byte
      code (2 inet_diag_bc_op structs) when in fact the bytecode was not
      long enough to hold both.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
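      The kind of bounds check this commit describes can be sketched in
      userspace as follows. struct demo_bc_op is a hypothetical stand-in for a
      4-byte bytecode op, and reading the port from the second op's 'no' field
      is an assumption of this sketch; the point is only that the second
      struct must fit inside the buffer before it is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte bytecode op. */
struct demo_bc_op {
	uint8_t code;
	uint8_t yes;
	uint16_t no;
};

/*
 * Read the port of a port-comparison op starting at byte offset 'off'.
 * The port lives in the second struct, so both structs must fit in the
 * bytecode. Returns 0 on success, -1 if the bytecode is too short.
 */
static int demo_read_port(const uint8_t *bc, size_t len, size_t off,
			  uint16_t *port)
{
	struct demo_bc_op second;

	assert(bc != NULL && port != NULL);
	if (off > len || len - off < 2 * sizeof(struct demo_bc_op))
		return -1; /* claimed op does not fit: refuse to read */

	memcpy(&second, bc + off + sizeof(struct demo_bc_op), sizeof(second));
	*port = second.no;
	return 0;
}
```

      Comparing 'len - off' against the required size, rather than
      'off + size' against 'len', avoids an overflow in the check itself.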
  3. 09 Dec, 2012 3 commits
    • inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() · f67caec9
      Neal Cardwell authored
      Add logic to check the address family of the user-supplied conditional
      and the address family of the connection entry. We now do not do
      prefix matching of addresses from different address families (AF_INET
      vs AF_INET6), except for the previously existing support for having an
      IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
      maintains as-is).
      
      This change is needed for two reasons:
      
      (1) The addresses are different lengths, so comparing a 128-bit IPv6
      prefix match condition to a 32-bit IPv4 connection address can cause
      us to unwittingly walk off the end of the IPv4 address and read
      garbage or oops.
      
      (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
      simple bit-wise comparison of the prefixes is not meaningful, and
      would lead to bogus results (except for the IPv4-mapped IPv6 case,
      which this commit maintains).
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
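      A userspace sketch of the matching rule described above: prefixes are
      compared only within one address family, with the single exception of
      an IPv4 condition against an IPv4-mapped IPv6 address. The types and
      demo_ names are hypothetical; the sketch also rejects a prefix length
      longer than the condition's address, which is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum demo_family { DEMO_AF_INET, DEMO_AF_INET6 };

struct demo_addr {
	enum demo_family family;
	uint8_t bytes[16]; /* IPv4 uses the first 4 bytes */
};

static unsigned int demo_addr_bits(enum demo_family f)
{
	return f == DEMO_AF_INET ? 32 : 128;
}

/* Compare the first 'bits' bits of two byte strings. */
static bool demo_bits_match(const uint8_t *a, const uint8_t *b, unsigned int bits)
{
	unsigned int bytes = bits / 8, rem = bits % 8;

	if (bytes && memcmp(a, b, bytes) != 0)
		return false;
	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		return (a[bytes] & mask) == (b[bytes] & mask);
	}
	return true;
}

static bool demo_prefix_match(const struct demo_addr *cond, unsigned int prefix_len,
			      const struct demo_addr *entry)
{
	static const uint8_t mapped[12] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff };

	if (prefix_len > demo_addr_bits(cond->family))
		return false; /* would read past the end of the address */
	if (cond->family == entry->family)
		return demo_bits_match(entry->bytes, cond->bytes, prefix_len);
	/* Only cross-family case kept: IPv4 prefix vs ::ffff:a.b.c.d */
	if (cond->family == DEMO_AF_INET && entry->family == DEMO_AF_INET6) {
		if (memcmp(entry->bytes, mapped, sizeof(mapped)) != 0)
			return false;
		return demo_bits_match(entry->bytes + 12, cond->bytes, prefix_len);
	}
	return false;
}
```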
    • inet_diag: validate byte code to prevent oops in inet_diag_bc_run() · 405c0059
      Neal Cardwell authored
      Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
      operations.
      
      Previously we did not validate the inet_diag_hostcond, address family,
      address length, and prefix length. So a malicious user could make the
      kernel read beyond the end of the bytecode array by claiming to have a
      whole inet_diag_hostcond when the bytecode was not long enough to
      contain a whole inet_diag_hostcond of the given address family. Or
      they could make the kernel read up to about 27 bytes beyond the end of
      a connection address by passing a prefix length that exceeded the
      length of addresses of the given family.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state · 1c95df85
      Neal Cardwell authored
      Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
      instantiated for IPv4 traffic and in the SYN-RECV state were actually
      created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
      means that for such connections inet6_rsk(req) returns a pointer to a
      random spot in memory up to roughly 64KB beyond the end of the
      request_sock.
      
      With this bug, for a server using AF_INET6 TCP sockets and serving
      IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
      inet_diag_fill_req() causing an oops or the export to user space of 16
      bytes of kernel memory as a garbage IPv6 address, depending on where
      the garbage inet6_rsk(req) pointed.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 08 Dec, 2012 3 commits
    • mm: vmscan: fix inappropriate zone congestion clearing · ed23ec4f
      Johannes Weiner authored
      commit c702418f ("mm: vmscan: do not keep kswapd looping forever due
      to individual uncompactable zones") removed zone watermark checks from
      the compaction code in kswapd but left in the zone congestion clearing,
      which now happens unconditionally on higher order reclaim.
      
      This messes up the reclaim throttling logic for zones with
      dirty/writeback pages, where zones should only lose their congestion
      status when their watermarks have been restored.
      
      Remove the clearing from the zone compaction section entirely.  The
      preliminary zone check and the reclaim loop in kswapd will clear it if
      the zone is considered balanced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: fix O_DIRECT read past end of block device · 684c9aae
      Linus Torvalds authored
      The direct-IO write path already had the i_size checks in mm/filemap.c,
      but it turns out the read path did not, and removing the block size
      checks in fs/block_dev.c (commit bbec0270: "blkdev_max_block: make
      private to fs/buffer.c") removed the magic "shrink IO to past the end of
      the device" code there.
      
      Fix it by truncating the IO to the size of the block device, like the
      write path already does.
      
      NOTE! I suspect the write path would be *much* better off doing it this
      way in fs/block_dev.c, rather than hidden deep in mm/filemap.c.  The
      mm/filemap.c code is extremely hard to follow, and has various
      conditionals on the target being a block device (ie the flag passed in
      to 'generic_write_checks()', along with a conditional update of the
      inode timestamp etc).
      
      It is also quite possible that we should treat this whole block device
      size as a "s_maxbytes" issue, and try to make the logic even more
      generic.  However, in the meantime this is the fairly minimal targeted
      fix.
      
      Noted by Milan Broz thanks to a regression test for the cryptsetup
      reencrypt tool.
      Reported-and-tested-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
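      "Truncating the IO to the size of the block device" amounts to a clamp
      like the one below. This is a userspace sketch with a hypothetical
      demo_clamp_to_device() helper, not the code added to fs/block_dev.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many of 'count' bytes starting at 'pos' lie inside a device of
 * 'dev_size' bytes. A request at or past the end is shrunk to zero bytes.
 */
static uint64_t demo_clamp_to_device(uint64_t dev_size, uint64_t pos,
				     uint64_t count)
{
	if (pos >= dev_size)
		return 0;
	/* dev_size - pos cannot underflow here, and pos + count is never formed. */
	if (count > dev_size - pos)
		return dev_size - pos;
	return count;
}
```

      Comparing against 'dev_size - pos' keeps the check safe even when
      'pos + count' would wrap around.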
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1b3c393c
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "Two stragglers:
      
         1) The new code that adds new flushing semantics to GRO can cause SKB
            pointer list corruption, manage the lists differently to avoid the
            OOPS.  Fix from Eric Dumazet.
      
         2) When TCP fast open does a retransmit of data in a SYN-ACK or
            similar, we update retransmit state that we shouldn't, triggering
            a WARN_ON later.  Fix from Yuchung Cheng."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: gro: fix possible panic in skb_gro_receive()
        tcp: bug fix Fast Open client retransmission
  5. 07 Dec, 2012 3 commits
    • net: gro: fix possible panic in skb_gro_receive() · c3c7c254
      Eric Dumazet authored
      commit 2e71a6f8 (net: gro: selective flush of packets) added
      a bug for skbs using frag_list. This part of the GRO stack is rarely
      used, as it needs skbs that do not use a page fragment for their skb->head.

      Most drivers do use a page fragment, but some of them use GFP_KERNEL
      allocations for the initial fill of their RX ring buffer.

      napi_gro_flush() overwrites skb->prev, which was used for these skbs to
      point to the last skb in frag_list.
      
      Fix this using a separate field in struct napi_gro_cb to point to the
      last fragment.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng authored
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments loss recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and gets 3 dupacks.

      Since the retransmission is not caused by network drop, it should not
      update the recovery state variables. Further, the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retransmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
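      Why a smaller MSS forces a loop can be seen in a small userspace model:
      data that fit in one segment at the cached MSS may need several segments
      at the MSS the server actually returned. The function below is a
      hypothetical illustration and does not model the kernel write queue.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split 'unacked_bytes' of SYN-data into retransmissions of at most 'mss'
 * bytes each. Writes the segment sizes to 'seg_sizes' and returns how many
 * segments were produced (at most 'max_segs').
 */
static size_t demo_retransmit_syn_data(size_t unacked_bytes, size_t mss,
				       size_t *seg_sizes, size_t max_segs)
{
	size_t n = 0;

	assert(mss > 0 && seg_sizes != NULL);
	while (unacked_bytes > 0 && n < max_segs) {
		size_t len = unacked_bytes < mss ? unacked_bytes : mss;

		seg_sizes[n++] = len;
		unacked_bytes -= len;
	}
	return n;
}
```

      Sending only the first segment would leave the rest waiting for a
      timeout, which is the situation the commit message warns about.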
    • Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc · 1afa4717
      Linus Torvalds authored
      Pull MMC fixes from Chris Ball:
       "Two small regression fixes:
      
         - sdhci-s3c: Fix runtime PM regression against 3.7-rc1
         - sh-mmcif: Fix oops against 3.6"
      
      * tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
        mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
        Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"
        mmc: sdhci-s3c: fix missing clock for gpio card-detect
  6. 06 Dec, 2012 3 commits
    • tmpfs: fix shared mempolicy leak · 18a2f371
      Mel Gorman authored
      This fixes a regression in 3.7-rc, which has since gone into stable.
      
      Commit 00442ad0 ("mempolicy: fix a memory corruption by refcount
      imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
      refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
      on expecting alloc_page_vma() to drop the refcount it had acquired.
      This deserves a rework: but for now fix the leak in shmem_alloc_page().
      
      Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
      the same refcounting there as in shmem_alloc_page(), delete its onstack
      mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
      those were invented to let swapin_readahead() make an unknown number of
      calls to alloc_pages_vma() with one mempolicy; but since 00442ad0,
      alloc_pages_vma() has kept refcount in balance, so now no problem.
      Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones · c702418f
      Johannes Weiner authored
      When a zone meets its high watermark and is compactable in case of
      higher order allocations, it contributes to the percentage of the node's
      memory that is considered balanced.
      
      This requirement, that a node be only partially balanced, came about
      when kswapd was desperately trying to balance tiny zones when all bigger
      zones in the node had plenty of free memory.  Arguably, the same should
      apply to compaction: if a significant part of the node is balanced
      enough to run compaction, do not get hung up on that tiny zone that
      might never get in shape.
      
      When the compaction logic in kswapd is reached, we know that at least
      25% of the node's memory is balanced properly for compaction (see
      zone_balanced and pgdat_balanced).  Remove the individual zone checks
      that restart the kswapd cycle.
      
      Otherwise, we may observe more endless looping in kswapd where the
      compaction code loops back to reclaim because of a single zone and
      reclaim does nothing because the node is considered balanced overall.
      
      See for example
      
        https://bugzilla.redhat.com/show_bug.cgi?id=866988
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: Thorsten Leemhuis <fedora@leemhuis.info>
      Reported-by: Jiri Slaby <jslaby@suse.cz>
      Tested-by: John Ellson <john.ellson@comcast.net>
      Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
      Tested-by: Bruno Wolff III <bruno@wolff.to>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
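      The "at least 25% of the node's memory is balanced" criterion can be
      sketched as below. struct demo_zone and demo_node_balanced() are
      hypothetical; the sketch only shows that a tiny unbalanced zone cannot
      keep the node from counting as balanced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone summary. */
struct demo_zone {
	unsigned long present_pages;
	bool balanced; /* zone meets its watermark (and is compactable) */
};

/* A node counts as balanced when >= 25% of its pages sit in balanced zones. */
static bool demo_node_balanced(const struct demo_zone *zones, size_t n)
{
	unsigned long present = 0, balanced = 0;

	assert(zones != NULL);
	for (size_t i = 0; i < n; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}
```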
    • mm: compaction: validate pfn range passed to isolate_freepages_block · 60177d31
      Mel Gorman authored
      Commit 0bf380bc ("mm: compaction: check pfn_valid when entering a
      new MAX_ORDER_NR_PAGES block during isolation for migration") added a
      check for pfn_valid() when isolating pages for migration as the scanner
      does not necessarily start pageblock-aligned.
      
      Since commit c89511ab ("mm: compaction: Restart compaction from near
      where it left off"), the free scanner has the same problem.  This patch
      makes sure that the pfn range passed to isolate_freepages_block() is
      within the same block so that pfn_valid() checks are unnecessary.
      
      In answer to Henrik's wondering why others have not reported this:
      reproducing this requires a large enough hole with the right alignment to
      have compaction walk into a PFN range with no memmap.  Size and
      alignment depend on the memory model - 4M for FLATMEM and 128M for
      SPARSEMEM on x86.  It needs a "lucky" machine.
      Reported-by: Henrik Rydberg <rydberg@euromail.se>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
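      Keeping a scan range "within the same block" when the start is not
      block-aligned is a matter of rounding up to the next block boundary and
      clamping to the zone end. A userspace sketch, assuming a power-of-two
      block of 512 pages; both the constant and the demo_ helpers are
      hypothetical.

```c
#include <assert.h>

#define DEMO_PAGEBLOCK_NR_PAGES 512ul /* assumed; must be a power of two */

/* First pfn of the block that follows the one containing 'pfn'. */
static unsigned long demo_block_end_pfn(unsigned long pfn)
{
	return (pfn + DEMO_PAGEBLOCK_NR_PAGES) & ~(DEMO_PAGEBLOCK_NR_PAGES - 1);
}

/* End of the range to scan: never past the current block or the zone. */
static unsigned long demo_isolate_range_end(unsigned long start_pfn,
					    unsigned long zone_end_pfn)
{
	unsigned long end = demo_block_end_pfn(start_pfn);

	assert(start_pfn < zone_end_pfn);
	return end < zone_end_pfn ? end : zone_end_pfn;
}
```

      An unaligned start such as pfn 700 then yields the range [700, 1024),
      which stays inside one block.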