1. 13 Feb, 2014 40 commits
    • Joe Thornber's avatar
      dm thin: fix discard support to a previously shared block · 0905e88e
      Joe Thornber authored
      commit 19fa1a67 upstream.
      
      If a snapshot is created and later deleted the origin dm_thin_device's
      snapshotted_time will have been updated to reflect the snapshot's
      creation time.  The 'shared' flag in the dm_thin_lookup_result struct
      returned from dm_thin_find_block() is an approximation based on
      snapshotted_time -- this is done to avoid 0(n), or worse, time
      complexity.  In this case, the shared flag would be true.
      
      But because the 'shared' flag reflects an approximation a block can be
      incorrectly assumed to be shared (e.g. false positive for 'shared'
      because the snapshot no longer exists).  This could result in discards
      issued to a thin device not being passed down to the pool's underlying
      data device.
      
      To fix this we double check that a thin block is really still in-use
      after a mapping is removed using dm_pool_block_is_used().  If the
      reference count for a block is now zero the discard is allowed to be
      passed down.
      
      Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
      structure -- reflects that the 'shared' flag in the response from
      dm_thin_find_block() can only be held as definitive if false is
      returned.
      
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0905e88e
    • Jeff Layton's avatar
      sunrpc: don't wait for write before allowing reads from use-gss-proxy file · bc2bab99
      Jeff Layton authored
      commit 1654a04c upstream.
      
      It doesn't make much sense to make reads from this procfile hang. As
      far as I can tell, only gssproxy itself will open this file and it
      never reads from it. Change it to just give the present setting of
      sn->use_gss_proxy without waiting for anything.
      
      Note that we do not want to call use_gss_proxy() in this codepath
      since an inopportune read of this file could cause it to be disabled
      prematurely.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc2bab99
    • Weston Andros Adamson's avatar
      sunrpc: Fix infinite loop in RPC state machine · 61b66025
      Weston Andros Adamson authored
      commit 6ff33b7d upstream.
      
      When a task enters call_refreshresult with status 0 from call_refresh and
      !rpcauth_uptodatecred(task) it enters call_refresh again with no rate-limiting
      or max number of retries.
      
      Instead of trying forever, make use of the retry path that other errors use.
      
      This only seems to be possible when the crrefresh callback is gss_refresh_null,
      which only happens when destroying the context.
      
      To reproduce:
      
      1) mount with sec=krb5 (or sec=sys with krb5 negotiated for non FSID specific
         operations).
      
      2) reboot - the client will be stuck and will need to be hard rebooted
      
      BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:2:46]
      Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache ppdev crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd serio_raw i2c_piix4 i2c_core e1000 parport_pc parport shpchp nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc autofs4 mptspi scsi_transport_spi mptscsih mptbase ata_generic floppy
      irq event stamp: 195724
      hardirqs last  enabled at (195723): [<ffffffff814a925c>] restore_args+0x0/0x30
      hardirqs last disabled at (195724): [<ffffffff814b0a6a>] apic_timer_interrupt+0x6a/0x80
      softirqs last  enabled at (195722): [<ffffffff8103f583>] __do_softirq+0x1df/0x276
      softirqs last disabled at (195717): [<ffffffff8103f852>] irq_exit+0x53/0x9a
      CPU: 0 PID: 46 Comm: kworker/0:2 Not tainted 3.13.0-rc3-branch-dros_testing+ #4
      Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      Workqueue: rpciod rpc_async_schedule [sunrpc]
      task: ffff8800799c4260 ti: ffff880079002000 task.ti: ffff880079002000
      RIP: 0010:[<ffffffffa0064fd4>]  [<ffffffffa0064fd4>] __rpc_execute+0x8a/0x362 [sunrpc]
      RSP: 0018:ffff880079003d18  EFLAGS: 00000246
      RAX: 0000000000000005 RBX: 0000000000000007 RCX: 0000000000000007
      RDX: 0000000000000007 RSI: ffff88007aecbae8 RDI: ffff8800783d8900
      RBP: ffff880079003d78 R08: ffff88006e30e9f8 R09: ffffffffa005a3d7
      R10: ffff88006e30e7b0 R11: ffff8800783d8900 R12: ffffffffa006675e
      R13: ffff880079003ce8 R14: ffff88006e30e7b0 R15: ffff8800783d8900
      FS:  0000000000000000(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3072333000 CR3: 0000000001a0b000 CR4: 00000000001407f0
      Stack:
       ffff880079003d98 0000000000000246 0000000000000000 ffff88007a9a4830
       ffff880000000000 ffffffff81073f47 ffff88007f212b00 ffff8800799c4260
       ffff8800783d8988 ffff88007f212b00 ffffe8ffff604800 0000000000000000
      Call Trace:
       [<ffffffff81073f47>] ? trace_hardirqs_on_caller+0x145/0x1a1
       [<ffffffffa00652d3>] rpc_async_schedule+0x27/0x32 [sunrpc]
       [<ffffffff81052974>] process_one_work+0x211/0x3a5
       [<ffffffff810528d5>] ? process_one_work+0x172/0x3a5
       [<ffffffff81052eeb>] worker_thread+0x134/0x202
       [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280
       [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280
       [<ffffffff810584a0>] kthread+0xc9/0xd1
       [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61
       [<ffffffff814afd6c>] ret_from_fork+0x7c/0xb0
       [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61
      Code: e8 87 63 fd e0 c6 05 10 dd 01 00 01 48 8b 43 70 4c 8d 6b 70 45 31 e4 a8 02 0f 85 d5 02 00 00 4c 8b 7b 48 48 c7 43 48 00 00 00 00 <4c> 8b 4b 50 4d 85 ff 75 0c 4d 85 c9 4d 89 cf 0f 84 32 01 00 00
      
      And the output of "rpcdebug -m rpc -s all":
      
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refresh (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refresh (status 0)
      RPC:    61 call_refreshresult (status 0)
      RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
      Signed-off-by: default avatarWeston Andros Adamson <dros@netapp.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61b66025
    • Trond Myklebust's avatar
      NFSv4: Fix a slot leak in nfs40_sequence_done · 3973a15d
      Trond Myklebust authored
      commit cab92c19 upstream.
      
      The check for whether or not we sent an RPC call in nfs40_sequence_done
      is insufficient to decide whether or not we are holding a session slot,
      and thus should not be used to decide when to free that slot.
      
      This patch replaces the RPC_WAS_SENT() test with the correct test for
      whether or not slot == NULL.
      
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3973a15d
    • Boaz Harrosh's avatar
      pnfs: Proper delay for NFS4ERR_RECALLCONFLICT in layout_get_done · 9e246e00
      Boaz Harrosh authored
      commit ed7e5423 upstream.
      
      An NFS4ERR_RECALLCONFLICT is returned by server from a GET_LAYOUT
      only when a Server Sent a RECALL do to that GET_LAYOUT, or
      the RECALL and GET_LAYOUT crossed on the wire.
      In any way this means we want to wait at most until in-flight IO
      is finished and the RECALL can be satisfied.
      
      So a proper wait here is more like 1/10 of a second, not 15 seconds
      like we have now. In case of a server bug we delay exponentially
      longer on each retry.
      
      Current code totally craps out performance of very large files on
      most pnfs-objects layouts, because of how the map changes when the
      file has grown into the next raid group.
      
      [Stable: This will patch back to 3.9. If there are earlier still
       maintained trees, please tell me I'll send a patch]
      Signed-off-by: default avatarBoaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9e246e00
    • Weston Andros Adamson's avatar
      nfs4: fix discover_server_trunking use after free · c39e05ef
      Weston Andros Adamson authored
      commit abad2fa5 upstream.
      
      If clp is new (cl_count = 1) and it matches another client in
      nfs4_discover_server_trunking, the nfs_put_client will free clp before
      ->cl_preserve_clid is set.
      Signed-off-by: default avatarWeston Andros Adamson <dros@primarydata.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c39e05ef
    • Trond Myklebust's avatar
      NFSv4.1: Handle errors correctly in nfs41_walk_client_list · 1ec9b465
      Trond Myklebust authored
      commit 64590daa upstream.
      
      Both nfs41_walk_client_list and nfs40_walk_client_list expect the
      'status' variable to be set to the value -NFS4ERR_STALE_CLIENTID
      if the loop fails to find a match.
      The problem is that the 'pos->cl_cons_state > NFS_CS_READY' changes
      the value of 'status', and sets it either to the value '0' (which
      indicates success), or to the value EINTR.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ec9b465
    • Scott Mayhew's avatar
      nfs: always make sure page is up-to-date before extending a write to cover the entire page · 4a3cbb28
      Scott Mayhew authored
      commit 263b4509 upstream.
      
      We should always make sure the cached page is up-to-date when we're
      determining whether we can extend a write to cover the full page -- even
      if we've received a write delegation from the server.
      
      Commit c7559663 added logic to skip this check if we have a write
      delegation, which can lead to data corruption such as the following
      scenario if client B receives a write delegation from the NFS server:
      
      Client A:
          # echo 123456789 > /mnt/file
      
      Client B:
          # echo abcdefghi >> /mnt/file
          # cat /mnt/file
          0�D0�abcdefghi
      
      Just because we hold a write delegation doesn't mean that we've read in
      the entire page contents.
      Signed-off-by: default avatarScott Mayhew <smayhew@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a3cbb28
    • Weston Andros Adamson's avatar
      nfs4.1: properly handle ENOTSUP in SECINFO_NO_NAME · 0c402bf5
      Weston Andros Adamson authored
      commit 78b19bae upstream.
      
      Don't check for -NFS4ERR_NOTSUPP, it's already been mapped to -ENOTSUPP
      by nfs4_stat_to_errno.
      
      This allows the client to mount v4.1 servers that don't support
      SECINFO_NO_NAME by falling back to the "guess and check" method of
      nfs4_find_root_sec.
      Signed-off-by: default avatarWeston Andros Adamson <dros@primarydata.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c402bf5
    • Trond Myklebust's avatar
      NFSv4: OPEN must handle the NFS4ERR_IO return code correctly · 8a362bd4
      Trond Myklebust authored
      commit c7848f69 upstream.
      
      decode_op_hdr() cannot distinguish between an XDR decoding error and
      the perfectly valid errorcode NFS4ERR_IO. This is normally not a
      problem, but for the particular case of OPEN, we need to be able
      to increment the NFSv4 open sequence id when the server returns
      a valid response.
      Reported-by: default avatarJ Bruce Fields <bfields@fieldses.org>
      Link: http://lkml.kernel.org/r/20131204210356.GA19452@fieldses.orgSigned-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a362bd4
    • Mika Westerberg's avatar
      spi/pxa2xx: initialize DMA channels to -1 to prevent inadvertent match · b1d74b18
      Mika Westerberg authored
      commit 483c3191 upstream.
      
      Commit cddb339b (spi/pxa2xx: convert to dma_request_slave_channel_compat())
      converted the driver to use ACPI provided DMA helpers but it forgot to
      initialize the platform data for the channels to -1. Failing to do so will
      result inadvertent match in the filter function because 0 is a valid
      channel number.
      
      Prevent this from happening by initializing both platform data channels
      correctly to -1.
      
      Fixes: cddb339b (spi/pxa2xx: convert to dma_request_slave_channel_compat())
      Signed-off-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarMark Brown <broonie@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1d74b18
    • Daniel Santos's avatar
      spidev: fix hang when transfer_one_message fails · 24175707
      Daniel Santos authored
      commit e120cc0d upstream.
      
      This corrects a problem in spi_pump_messages() that leads to an spi
      message hanging forever when a call to transfer_one_message() fails.
      This failure occurs in my MCP2210 driver when the cs_change bit is set
      on the last transfer in a message, an operation which the hardware does
      not support.
      
      Rationale
      Since the transfer_one_message() returns an int, we must presume that it
      may fail.  If transfer_one_message() should never fail, it should return
      void.  Thus, calls to transfer_one_message() should properly manage a
      failure.
      
      Fixes: ffbbdd21 (spi: create a message queueing infrastructure)
      Signed-off-by: default avatarDaniel Santos <daniel.santos@pobox.com>
      Signed-off-by: default avatarMark Brown <broonie@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24175707
    • Jonas Gorski's avatar
      spi/bcm63xx: don't substract prepend length from total length · dd00c353
      Jonas Gorski authored
      commit 86b3bde0 upstream.
      
      The spi command must include the full message length including any
      prepended writes, else transfers larger than 256 bytes will be
      incomplete.
      Signed-off-by: default avatarJonas Gorski <jogo@openwrt.org>
      Acked-by: default avatarFlorian Fainelli <florian@openwrt.org>
      Signed-off-by: default avatarMark Brown <broonie@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd00c353
    • Ira Weiny's avatar
      IB/qib: Fix QP check when looping back to/from QP1 · 6b390aa3
      Ira Weiny authored
      commit 6e0ea9e6 upstream.
      
      The GSI QP type is compatible with and should be allowed to send data
      to/from any UD QP.  This was found when testing ibacm on the same node
      as an SA.
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarRoland Dreier <roland@purestorage.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b390aa3
    • Max Filippov's avatar
      xtensa: xtfpga: fix definitions of platform devices · 452dc5f3
      Max Filippov authored
      commit a558d992 upstream.
      
      Remove __initdata attribute, as the devices may be used after init
      sections are freed.
      Signed-off-by: default avatarMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      452dc5f3
    • Boaz Harrosh's avatar
      ore: Fix wrong math in allocation of per device BIO · f149f695
      Boaz Harrosh authored
      commit aad560b7 upstream.
      
      At IO preparation we calculate the max pages at each device and
      allocate a BIO per device of that size. The calculation was wrong
      on some unaligned corner cases offset/length combination and would
      make prepare return with -ENOMEM. This would be bad for pnfs-objects
      that would in that case IO through MDS. And fatal for exofs were it
      would fail writes with EIO.
      
      Fix it by doing the proper math, that will work in all cases. (I
      ran a test with all possible offset/length combinations this time
      round).
      
      Also when reading we do not need to allocate for the parity units
      since we jump over them.
      
      Also lower the max_io_length to take into account the parity pages
      so not to allocate BIOs bigger than PAGE_SIZE
      Signed-off-by: default avatarBoaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f149f695
    • Michael Grzeschik's avatar
      mtd: mxc_nand: remove duplicated ecc_stats counting · 9af6a8e6
      Michael Grzeschik authored
      commit 05664777 upstream.
      
      The ecc_stats.corrected count variable will already be incremented in
      the above framework-layer just after this callback.
      Signed-off-by: default avatarMichael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9af6a8e6
    • Heiko Carstens's avatar
      tile: remove compat_sys_lookup_dcookie declaration to fix compile error · fb6071a0
      Heiko Carstens authored
      commit 5a5e75f4 upstream.
      
      With commit d8d14bd0 ("fs/compat: fix lookup_dcookie() parameter
      handling") I changed the type of the len parameter of the
      lookup_dcookie() syscall.
      
      However I missed that there was still a stale declaration in
      arch/tile/..  which now causes a compile error on tile:
      
        In file included from fs/dcookies.c:28:0:
        include/linux/compat.h:425:17: error: conflicting types for 'compat_sys_lookup_dcookie'
        fs/dcookies.c:207:1: error: conflicting types for 'compat_sys_lookup_dcookie'
      
      Simply remove the declaration in the tile architecture, which is only a
      leftover from before the different compat lookup_dcookie() versions have
      been merged.  The correct declaration is now in include/linux/compat.h
      
      The build error was reported by Fenguang's build bot.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: default avatarChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb6071a0
    • Heiko Carstens's avatar
      fs/compat: fix lookup_dcookie() parameter handling · e5c84a57
      Heiko Carstens authored
      commit d8d14bd0 upstream.
      
      Commit d5dc77bf ("consolidate compat lookup_dcookie()") coverted all
      architectures to the new compat_sys_lookup_dcookie() syscall.
      
      The "len" paramater of the new compat syscall must have the type
      compat_size_t in order to enforce zero extension for architectures where
      the ABI requires that the caller of a function performed zero and/or
      sign extension to 64 bit of all parameters.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5c84a57
    • Heiko Carstens's avatar
      fs/compat: fix parameter handling for compat readv/writev syscalls · 1670f13d
      Heiko Carstens authored
      commit dfd948e3 upstream.
      
      We got a report that the pwritev syscall does not work correctly in
      compat mode on s390.
      
      It turned out that with commit 72ec3516 ("switch compat readv/writev
      variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
      couple of syscall parameters because the some parameter types haven't
      been converted from unsigned long to compat_ulong_t.
      
      This is needed for architectures where the ABI requires that the caller
      of a function performed zero and/or sign extension to 64 bit of all
      parameters.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1670f13d
    • Heiko Carstens's avatar
      compat: fix sys_fanotify_mark · f6150c45
      Heiko Carstens authored
      commit 592f6b84 upstream.
      
      Commit 91c2e0bc ("unify compat fanotify_mark(2), switch to
      COMPAT_SYSCALL_DEFINE") added a new unified compat fanotify_mark syscall
      to be used by all architectures.
      
      Unfortunately the unified version merges the split mask parameter in a
      wrong way: the lower and higher word got swapped.
      
      This was discovered with glibc's tst-fanotify test case.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: default avatarAndreas Krebbel <krebbel@linux.vnet.ibm.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Acked-by: default avatar"David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6150c45
    • Mark Brown's avatar
      ACPI / init: Flag use of ACPI and ACPI idioms for power supplies to regulator API · 49d42873
      Mark Brown authored
      commit 49a12877 upstream.
      
      There is currently no facility in ACPI to express the hookup of voltage
      regulators, the expectation is that the regulators that exist in the
      system will be handled transparently by firmware if they need software
      control at all. This means that if for some reason the regulator API is
      enabled on such a system it should assume that any supplies that devices
      need are provided by the system at all relevant times without any software
      intervention.
      
      Tell the regulator core to make this assumption by calling
      regulator_has_full_constraints(). Do this as soon as we know we are using
      ACPI so that the information is available to the regulator core as early
      as possible. This will cause the regulator core to pretend that there is
      an always on regulator supplying any supply that is requested but that has
      not otherwise been mapped which is the behaviour expected on a system with
      ACPI.
      
      Should the ability to specify regulators be added in future revisions of
      ACPI then once we have support for ACPI mappings in the kernel the same
      assumptions will apply. It is also likely that systems will default to a
      mode of operation which does not require any interpretation of these
      mappings in order to be compatible with existing operating system releases
      so it should remain safe to make these assumptions even if the mappings
      exist but are not supported by the kernel.
      Signed-off-by: default avatarMark Brown <broonie@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      49d42873
    • Josh Triplett's avatar
      turbostat: Use GCC's CPUID functions to support PIC · ac7b8b31
      Josh Triplett authored
      commit 2b92865e upstream.
      
      turbostat uses inline assembly to call cpuid.  On 32-bit x86, on systems
      that have certain security features enabled by default that make -fPIC
      the default, this causes a build error:
      
      turbostat.c: In function ‘check_cpuid’:
      turbostat.c:1906:2: error: PIC register clobbered by ‘ebx’ in ‘asm’
        asm("cpuid" : "=a" (fms), "=c" (ecx), "=d" (edx) : "a" (1) : "ebx");
        ^
      
      GCC provides a header cpuid.h, containing a __get_cpuid function that
      works with both PIC and non-PIC.  (On PIC, it saves and restores ebx
      around the cpuid instruction.)  Use that instead.
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac7b8b31
    • Josh Triplett's avatar
      turbostat: Don't put unprocessed uapi headers in the include path · f5b6d821
      Josh Triplett authored
      commit b731f311 upstream.
      
      turbostat's Makefile puts arch/x86/include/uapi/ in the include path, so
      that it can include <asm/msr.h> from it.  It isn't in general safe to
      include even uapi headers directly from the kernel tree without
      processing them through scripts/headers_install.sh, but asm/msr.h
      happens to work.
      
      However, that include path can break with some versions of system
      headers, by overriding some system headers with the unprocessed versions
      directly from the kernel source.  For instance:
      
      In file included from /build/x86-generic/usr/include/bits/sigcontext.h:28:0,
                       from /build/x86-generic/usr/include/signal.h:339,
                       from /build/x86-generic/usr/include/sys/wait.h:31,
                       from turbostat.c:27:
      ../../../../arch/x86/include/uapi/asm/sigcontext.h:4:28: fatal error: linux/compiler.h: No such file or directory
      
      This occurs because the system bits/sigcontext.h on that build system
      includes <asm/sigcontext.h>, and asm/sigcontext.h in the kernel source
      includes <linux/compiler.h>, which scripts/headers_install.sh would have
      filtered out.
      
      Since turbostat really only wants a single header, just include that one
      header rather than putting an entire directory of kernel headers on the
      include path.
      
      In the process, switch from msr.h to msr-index.h, since turbostat just
      wants the MSR numbers.
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5b6d821
    • Li Zefan's avatar
      slub: Fix calculation of cpu slabs · f350eea0
      Li Zefan authored
      commit 8afb1474 upstream.
      
        /sys/kernel/slab/:t-0000048 # cat cpu_slabs
        231 N0=16 N1=215
        /sys/kernel/slab/:t-0000048 # cat slabs
        145 N0=36 N1=109
      
      See, the number of slabs is smaller than that of cpu slabs.
      
      The bug was introduced by commit 49e22585
      ("slub: per cpu cache for partial pages").
      
      We should use page->pages instead of page->pobjects when calculating
      the number of cpu partial slabs. This also fixes the mapping of slabs
      and nodes.
      
      As there's no variable storing the number of total/active objects in
      cpu partial slabs, and we don't have user interfaces requiring those
      statistics, I just add WARN_ON for those cases.
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarPekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f350eea0
    • Gregory CLEMENT's avatar
      ARM: mvebu: Fix kernel hang in mvebu_soc_id_init() when of_iomap failed · 020043ee
      Gregory CLEMENT authored
      commit dc4910d9 upstream.
      
      When pci_base is accessed whereas it has not been properly mapped by
      of_iomap() the kernel hang. The check of this pointer made an improper
      use of IS_ERR() instead of comparing to NULL. This patch fix this
      issue.
      Signed-off-by: default avatarGregory CLEMENT <gregory.clement@free-electrons.com>
      Reported-by: default avatarEzequiel Garcia <ezequiel.garcia@free-electrons.com>
      Fixes: 930ab3d4 (i2c: mv64xxx: Add I2C Transaction Generator support)
      Signed-off-by: default avatarJason Cooper <jason@lakedaemon.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      020043ee
    • Sebastian Hesselbarth's avatar
      ARM: orion: provide C-style interrupt handler for MULTI_IRQ_HANDLER · fd4042ce
      Sebastian Hesselbarth authored
      commit f28d7de6 upstream.
      
      DT-enabled Marvell Kirkwood and Dove SoCs make use of an irqchip
      driver. As expected for irqchip drivers, it uses a C-style
      interrupt handler and therefore selects MULTI_IRQ_HANDLER.
      
      Now, compiling a kernel with both non-DT and DT support enabled,
      selecting MULTI_IRQ_HANDLER will break ASM irq handler used by
      non-DT boards.
      
      Therefore, we provide a C-style irq handler even for non-DT boards,
      if MULTI_IRQ_HANDLER is set. By installing the C-style irq handler
      in orion_irq_init this is transparent to all non-DT board files.
      
      While the regression report was filed on Marvell Kirkwood, also
      Marvell Dove non-DT boards are affected and fixed by this patch.
      Signed-off-by: default avatarSebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Tested-by: default avatarIan Campbell <ijc@hellion.org.uk>
      Reported-by: default avatarIan Campbell <ijc@hellion.org.uk>
      Fixes: 2326f043 ("ARM: kirkwood: convert to DT irqchip and clocksource")
      Fixes: f07d73e3 ("ARM: dove: convert to DT irqchip and clocksource")
      Acked-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarJason Cooper <jason@lakedaemon.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd4042ce
    • Wolfram Sang's avatar
      mmc: core: sd: implement proper support for sd3.0 au sizes · d72025f6
      Wolfram Sang authored
      commit 9288cac0 upstream.
      
      This reverts and updates commit 77776fd0 ("mmc: sd: fix the
      maximum au_size for SD3.0"). The au_size for SD3.0 cannot be achieved
      by a simple bit shift, so this needs to be implemented differently.
      Also, don't print the warning in case of 0 since 'not defined' is
      different from 'invalid'.
      Signed-off-by: default avatarWolfram Sang <wsa@the-dreams.de>
      Acked-by: default avatarJaehoon Chung <jh80.chung@samsung.com>
      Reviewed-by: default avatarH Hartley Sweeten <hsweeten@visionengravers.com>
      Signed-off-by: default avatarChris Ball <chris@printf.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d72025f6
    • Ludovic Desroches's avatar
      mmc: atmel-mci: fix timeout errors in SDIO mode when using DMA · dc00440a
      Ludovic Desroches authored
      commit 66b512ed upstream.
      
      With some SDIO devices, timeout errors can happen when reading data.
      To solve this issue, the DMA transfer has to be activated before sending
      the command to the device. This order is incorrect in PDC mode. So we
      have to take care if we are using DMA or PDC to know when to send the
      MMC command.
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@atmel.com>
      Acked-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc00440a
    • Ray Jui's avatar
      mmc: fix host release issue after discard operation · 0efb3651
      Ray Jui authored
      commit f662ae48 upstream.
      
      Under function mmc_blk_issue_rq, after an MMC discard operation,
      the MMC request data structure may be freed in memory. Later in
      the same function, the check of req->cmd_flags & MMC_REQ_SPECIAL_MASK
      is dangerous and invalid. It causes the MMC host not to be released
      when it should.
      
      This patch fixes the issue by marking the special request down before
      the discard/flush operation.
      
      Reported by: Harold (SoonYeal) Yang <haroldsy@broadcom.com>
      Signed-off-by: default avatarRay Jui <rjui@broadcom.com>
      Reviewed-by: default avatarSeungwon Jeon <tgih.jun@samsung.com>
      Acked-by: default avatarSeungwon Jeon <tgih.jun@samsung.com>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0efb3651
    • Andrey Vagin's avatar
      mm: don't lose the SOFT_DIRTY flag on mprotect · 5fc120a5
      Andrey Vagin authored
      commit 24f91eba upstream.
      
      The SOFT_DIRTY bit shows that the content of memory was changed after a
      defined point in the past.  mprotect() doesn't change the content of
      memory, so it must not change the SOFT_DIRTY bit.
      
      This bug causes a malfunction: on the first iteration all pages are
      dumped.  On other iterations only pages with the SOFT_DIRTY bit are
      dumped.  So if the SOFT_DIRTY bit is cleared from a page by mistake, the
      page is not dumped and its content will be restored incorrectly.
      
      This patch does nothing with _PAGE_SWP_SOFT_DIRTY, becase pte_modify()
      is called only for present pages.
      
      Fixes commit 0f8975ec ("mm: soft-dirty bits for user memory changes
      tracking").
      Signed-off-by: default avatarAndrey Vagin <avagin@openvz.org>
      Acked-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5fc120a5
    • Cyrill Gorcunov's avatar
      mm: ignore VM_SOFTDIRTY on VMA merging · b64ba2f3
      Cyrill Gorcunov authored
      commit 34228d47 upstream.
      
      The VM_SOFTDIRTY bit affects vma merge routine: if two VMAs has all bits
      in vm_flags matched except dirty bit the kernel can't longer merge them
      and this forces the kernel to generate new VMAs instead.
      
      It finally may lead to the situation when userspace application reaches
      vm.max_map_count limit and get crashed in worse case
      
       | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
       |
       | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
       | xinit: connection to X server lost
       |
       | waiting for X server to shut down
       | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
      
        https://bugzilla.kernel.org/show_bug.cgi?id=67651
        https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0
      
      Initial problem came from missed VM_SOFTDIRTY in do_brk() routine but
      even if we would set up VM_SOFTDIRTY here, there is still a way to
      prevent VMAs from merging: one can call
      
       | echo 4 > /proc/$PID/clear_refs
      
      and clear all VM_SOFTDIRTY over all VMAs presented in memory map, then
      new do_brk() will try to extend old VMA and finds that dirty bit doesn't
      match thus new VMA will be generated.
      
      As discussed with Pavel, the right approach should be to ignore
      VM_SOFTDIRTY bit when we're trying to merge VMAs and if merge successed
      we mark extended VMA with dirty bit where needed.
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: default avatarBastian Hougaard <gnome@rvzt.net>
      Reported-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b64ba2f3
    • Michal Hocko's avatar
      memcg: fix css reference leak and endless loop in mem_cgroup_iter · 86e770e1
      Michal Hocko authored
      commit 0eef6156 upstream.
      
      Commit 19f39402 ("memcg: simplify mem_cgroup_iter") has reorganized
      mem_cgroup_iter code in order to simplify it.  A part of that change was
      dropping an optimization which didn't call css_tryget on the root of the
      walked tree.  The patch however didn't change the css_put part in
      mem_cgroup_iter which excludes root.
      
      This wasn't an issue at the time because __mem_cgroup_iter_next bailed
      out for root early without taking a reference as cgroup iterators
      (css_next_descendant_pre) didn't visit root themselves.
      
      Nevertheless cgroup iterators have been reworked to visit root by commit
      bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration") when the root bypass have been
      dropped in __mem_cgroup_iter_next.  This means that css_put is not
      called for root and so css along with mem_cgroup and other cgroup
      internal object tied by css lifetime are never freed.
      
      Fix the issue by reintroducing root check in __mem_cgroup_iter_next and
      do not take css reference for it.
      
      This reference counting magic protects us also from another issue, an
      endless loop reported by Hugh Dickins when reclaim races with root
      removal and css_tryget called by iterator internally would fail.  There
      would be no other nodes to visit so __mem_cgroup_iter_next would return
      NULL and mem_cgroup_iter would interpret it as "start looping from root
      again" and so mem_cgroup_iter would loop forever internally.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86e770e1
    • Michal Hocko's avatar
      memcg: fix endless loop caused by mem_cgroup_iter · b0a1f4ec
      Michal Hocko authored
      commit ecc736fc upstream.
      
      Hugh has reported an endless loop when the hardlimit reclaim sees the
      same group all the time.  This might happen when the reclaim races with
      the memcg removal.
      
      shrink_zone
                                                      [rmdir root]
        mem_cgroup_iter(root, NULL, reclaim)
          // prev = NULL
          rcu_read_lock()
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root || NULL
            css_tryget(last_visited)            // failed
            last_visited = NULL                 [1]
          memcg = root = __mem_cgroup_iter_next(root, NULL)
          mem_cgroup_iter_update
            iter->last_visited = root;
          reclaim->generation = iter->generation
      
       mem_cgroup_iter(root, root, reclaim)
         // prev = root
         rcu_read_lock
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root
            css_tryget(last_visited)            // failed
          [1]
      
      The issue seemed to be introduced by commit 5f578161 ("memcg: relax
      memcg iter caching") which has replaced unconditional css_get/css_put by
      css_tryget/css_put for the cached iterator.
      
      This patch fixes the issue by skipping css_tryget on the root of the
      tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
      in mem_cgroup_iter_update.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0a1f4ec
    • Johannes Weiner's avatar
      mm/page-writeback.c: do not count anon pages as dirtyable memory · 414f6b9f
      Johannes Weiner authored
      commit a1c3bfb2 upstream.
      
      The VM is currently heavily tuned to avoid swapping.  Whether that is
      good or bad is a separate discussion, but as long as the VM won't swap
      to make room for dirty cache, we can not consider anonymous pages when
      calculating the amount of dirtyable memory, the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      A simple workload that occupies a significant size (40+%, depending on
      memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
      uses the remainder for a streaming writer demonstrates this problem.  In
      that case, the actual cache pages are a small fraction of what is
      considered dirtyable overall, which results in an relatively large
      portion of the cache pages to be dirtied.  As kswapd starts rotating
      these, random tasks enter direct reclaim and stall on IO.
      
      Only consider free pages and file pages dirtyable.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      414f6b9f
    • Johannes Weiner's avatar
      mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory · 69b896d6
      Johannes Weiner authored
      commit a804552b upstream.
      
      Tejun reported stuttering and latency spikes on a system where random
      tasks would enter direct reclaim and get stuck on dirty pages.  Around
      50% of memory was occupied by tmpfs backed by an SSD, and another disk
      (rotating) was reading and writing at max speed to shrink a partition.
      
      : The problem was pretty ridiculous.  It's a 8gig machine w/ one ssd and 10k
      : rpm harddrive and I could reliably reproduce constant stuttering every
      : several seconds for as long as buffered IO was going on on the hard drive
      : either with tmpfs occupying somewhere above 4gig or a test program which
      : allocates about the same amount of anon memory.  Although swap usage was
      : zero, turning off swap also made the problem go away too.
      :
      : The trigger conditions seem quite plausible - high anon memory usage w/
      : heavy buffered IO and swap configured - and it's highly likely that this
      : is happening in the wild too.  (this can happen with copying large files
      : to usb sticks too, right?)
      
      This patch (of 2):
      
      The dirty_balance_reserve is an approximation of the fraction of free
      pages that the page allocator does not make available for page cache
      allocations.  As a result, it has to be taken into account when
      calculating the amount of "dirtyable memory", the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      However, currently the reserve is subtracted from the sum of free and
      reclaimable pages, which is non-sensical and leads to erroneous results
      when the system is dominated by unreclaimable pages and the
      dirty_balance_reserve is bigger than free+reclaimable.  In that case, at
      least the already allocated cache should be considered dirtyable.
      
      Fix the calculation by subtracting the reserve from the amount of free
      pages, then adding the reclaimable pages on top.
      
      [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69b896d6
    • Hugh Dickins's avatar
      mm/memcg: iteration skip memcgs not yet fully initialized · ab9682ff
      Hugh Dickins authored
      commit d8ad3055 upstream.
      
      It is surprising that the mem_cgroup iterator can return memcgs which
      have not yet been fully initialized.  By accident (or trial and error?)
      this appears not to present an actual problem; but it may be better to
      prevent such surprises, by skipping memcgs not yet online.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab9682ff
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: shift page lock from head page to tail page after thp split · 4d712df1
      Naoya Horiguchi authored
      commit 54b9dd14 upstream.
      
      After thp split in hwpoison_user_mappings(), we hold page lock on the
      raw error page only between try_to_unmap, hence we are in danger of race
      condition.
      
      I found in the RHEL7 MCE-relay testing that we have "bad page" error
      when a memory error happens on a thp tail page used by qemu-kvm:
      
        Triggering MCE exception on CPU 10
        mce: [Hardware Error]: Machine check events logged
        MCE exception done on CPU 10
        MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
        MCE 0x38c535: dirty LRU page recovery: Recovered
        qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
        BUG: Bad page state in process qemu-kvm  pfn:38c400
        page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
        page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
        Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
        CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
        Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
        Call Trace:
          dump_stack+0x19/0x1b
          bad_page.part.59+0xcf/0xe8
          free_pages_prepare+0x148/0x160
          free_hot_cold_page+0x31/0x140
          free_hot_cold_page_list+0x46/0xa0
          release_pages+0x1c1/0x200
          free_pages_and_swap_cache+0xad/0xd0
          tlb_flush_mmu.part.46+0x4c/0x90
          tlb_finish_mmu+0x55/0x60
          exit_mmap+0xcb/0x170
          mmput+0x67/0xf0
          vhost_dev_cleanup+0x231/0x260 [vhost_net]
          vhost_net_release+0x3f/0x90 [vhost_net]
          __fput+0xe9/0x270
          ____fput+0xe/0x10
          task_work_run+0xc4/0xe0
          do_exit+0x2bb/0xa40
          do_group_exit+0x3f/0xa0
          get_signal_to_deliver+0x1d0/0x6e0
          do_signal+0x48/0x5e0
          do_notify_resume+0x71/0xc0
          retint_signal+0x48/0x8c
      
      The reason of this bug is that a page fault happens before unlocking the
      head page at the end of memory_failure().  This strange page fault is
      trying to access to address 0x20 and I'm not sure why qemu-kvm does
      this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
      way we catch the bad page bug/warning because we try to free a locked
      page (which was the former head page.)
      
      To fix this, this patch suggests to shift page lock from head page to
      tail page just after thp split.  SIGSEGV still happens, but it affects
      only error affected VMs, not a whole system.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d712df1
    • Konrad Rzeszutek Wilk's avatar
      xen/pvhvm: If xen_platform_pci=0 is set don't blow up (v4). · d7c80b2d
      Konrad Rzeszutek Wilk authored
      commit 51c71a3b upstream.
      
      The user has the option of disabling the platform driver:
      00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
      
      which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
      and allow the PV drivers to take over. If the user wishes
      to disable that they can set:
      
        xen_platform_pci=0
        (in the guest config file)
      
      or
        xen_emul_unplug=never
        (on the Linux command line)
      
      except it does not work properly. The PV drivers still try to
      load and since the Xen platform driver is not run - and it
      has not initialized the grant tables, most of the PV drivers
      stumble upon:
      
      input: Xen Virtual Keyboard as /devices/virtual/input/input5
      input: Xen Virtual Pointer as /devices/virtual/input/input6M
      ------------[ cut here ]------------
      kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
      CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
      Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
      RIP: 0010:[<ffffffff813ddc40>]  [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
      Call Trace:
       [<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
       [<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
       [<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
       [<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
       [<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
       [<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
       [<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
       [<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
       [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
       [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
       [<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
       [<ffffffff8145e7d9>] driver_attach+0x19/0x20
       [<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
       [<ffffffff8145f1ff>] driver_register+0x5f/0xf0
       [<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
       [<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
       [<ffffffffa0015000>] ? 0xffffffffa0014fff
       [<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
       [<ffffffff81002049>] do_one_initcall+0x49/0x170
      
      .. snip..
      
      which is hardly nice. This patch fixes this by having each
      PV driver check for:
       - if running in PV, then it is fine to execute (as that is their
         native environment).
       - if running in HVM, check if user wanted 'xen_emul_unplug=never',
         in which case bail out and don't load any PV drivers.
       - if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
         does not exist, then bail out and not load PV drivers.
       - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
         then bail out for all PV devices _except_ the block one.
         Ditto for the network one ('nics').
       - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
         then load block PV driver, and also setup the legacy IDE paths.
         In (v3) make it actually load PV drivers.
      
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it
      Reported-by: default avatarAnthony PERARD <anthony.perard@citrix.com>
      Reported-and-Tested-by: default avatarFabio Fantoni <fabio.fantoni@m2r.biz>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
      can be used per Ian and Stefano suggestion]
      [v3: Make the unnecessary case work properly]
      [v4: s/disks/ide-disks/ spotted by Fabio]
      Reviewed-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7c80b2d
    • AKASHI Takahiro's avatar
      audit: correct a type mismatch in audit_syscall_exit() · b224b01f
      AKASHI Takahiro authored
      commit 06bdadd7 upstream.
      
      audit_syscall_exit() saves a result of regs_return_value() in intermediate
      "int" variable and passes it to __audit_syscall_exit(), which expects its
      second argument as a "long" value.  This will result in truncating the
      value returned by a system call and making a wrong audit record.
      
      I don't know why gcc compiler doesn't complain about this, but anyway it
      causes a problem at runtime on arm64 (and probably most 64-bit archs).
      Signed-off-by: default avatarAKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b224b01f