1. 28 Oct, 2013 17 commits
    • NFS: Add basic migration support to state manager thread · c9fdeb28
      Chuck Lever authored
      Migration recovery and state recovery must be serialized, so handle
      both in the state manager thread.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Add a super_block backpointer to the nfs_server struct · ce6cda18
      Chuck Lever authored
      NFS_SB() returns the pointer to an nfs_server struct, given a
      pointer to a super_block.  But we have no way to go back the other
      way.
      
      Add a super_block backpointer field so that, given an nfs_server
      struct, it is easy to get to the filesystem's root dentry.
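
      The relationship can be sketched in userspace with toy structures.
      This is only an illustration: the toy_* types, the field names and
      toy_link() are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; all names are hypothetical. */
struct toy_dentry { const char *name; };
struct toy_server;

struct toy_super_block {
	struct toy_dentry *s_root;     /* root dentry of the filesystem */
	struct toy_server *s_fs_info;  /* forward pointer, what NFS_SB() follows */
};

struct toy_server {
	struct toy_super_block *super; /* the added backpointer */
};

/* Forward direction, analogous to NFS_SB(). */
static struct toy_server *toy_nfs_sb(struct toy_super_block *sb)
{
	return sb->s_fs_info;
}

/* Reverse direction, possible only once the backpointer exists. */
static struct toy_dentry *toy_server_root(struct toy_server *server)
{
	return server->super ? server->super->s_root : NULL;
}

/* Wire both directions together, as mount-time setup might. */
static void toy_link(struct toy_super_block *sb, struct toy_server *server)
{
	sb->s_fs_info = server;
	server->super = sb;
}
```

      With both pointers set, either structure can be reached from the
      other without a search.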
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Add method to retrieve fs_locations during migration recovery · b03d735b
      Chuck Lever authored
      The nfs4_proc_fs_locations() function is invoked during referral
      processing to perform a GETATTR(fs_locations) on an object's parent
      directory in order to discover the target of the referral.  It
      performs a LOOKUP in the compound, so the client needs to know the
      parent's file handle a priori.
      
      Unfortunately this function is not adequate for handling migration
      recovery.  We need to probe fs_locations information on an FSID, but
      there's no parent directory available for many operations that
      can return NFS4ERR_MOVED.
      
      Another subtlety: recovering from NFS4ERR_LEASE_MOVED is a process
      of walking over a list of known FSIDs that reside on the server, and
      probing whether they have migrated.  Once the server has detected
      that the client has probed all migrated file systems, it stops
      returning NFS4ERR_LEASE_MOVED.
      
      A minor version zero server needs to know what client ID is
      requesting fs_locations information so it can clear the flag that
      forces it to continue returning NFS4ERR_LEASE_MOVED.  This flag is
      set per client ID and per FSID.  However, the client ID is not an
      argument of either the PUTFH or GETATTR operations.  Later minor
      versions have client ID information embedded in the compound's
      SEQUENCE operation.
      
      Therefore, by convention, minor version zero clients send a RENEW
      operation in the same compound as the GETATTR(fs_locations), since
      RENEW's one argument is a clientid4.  This allows a minor version
      zero server to identify correctly the client that is probing for a
      migration.
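
      As a rough sketch of the convention, the operation list for a
      migration probe could be assembled as below.  The enum, the function
      name and the exact ordering of operations are assumptions made for
      illustration; they are not taken from the NFS client source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation tags; not the protocol's numeric opcodes. */
enum toy_op { OP_SEQUENCE, OP_PUTFH, OP_GETATTR_FS_LOCATIONS, OP_RENEW };

/*
 * Fill ops[] with the operations for a migration probe and return the
 * count.  The ordering shown here is illustrative only.
 */
static size_t toy_build_probe(unsigned int minorversion, enum toy_op ops[4])
{
	size_t n = 0;

	if (minorversion > 0)
		ops[n++] = OP_SEQUENCE;  /* client identity comes with SEQUENCE */
	ops[n++] = OP_PUTFH;
	ops[n++] = OP_GETATTR_FS_LOCATIONS;
	if (minorversion == 0)
		ops[n++] = OP_RENEW;     /* its clientid4 argument names the client */
	return n;
}
```

      Only the minor version zero compound needs the extra RENEW, because
      neither PUTFH nor GETATTR carries a client ID.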
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Export _nfs_display_fhandle() · 9e6ee76d
      Chuck Lever authored
      Allow code in nfsv4.ko to use _nfs_display_fhandle().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Introduce a vector of migration recovery ops · ec011fe8
      Chuck Lever authored
      The differences between minor version 0 and minor version 1
      migration will be abstracted by the addition of a set of migration
      recovery ops.
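
      The general pattern is a table of function pointers selected by
      minor version.  The sketch below uses made-up names (toy_server,
      toy_mig_recovery_ops, get_locations) and does not reproduce the
      actual ops structure added by this patch.

```c
#include <assert.h>

struct toy_server { int probed; int used_renew; };

/* Hypothetical per-minor-version operations table. */
struct toy_mig_recovery_ops {
	int (*get_locations)(struct toy_server *server);
};

static int toy_v40_get_locations(struct toy_server *server)
{
	server->probed = 1;
	server->used_renew = 1;  /* v4.0 path: identify the client with RENEW */
	return 0;
}

static int toy_v41_get_locations(struct toy_server *server)
{
	server->probed = 1;
	server->used_renew = 0;  /* v4.1 path: SEQUENCE already identifies it */
	return 0;
}

static const struct toy_mig_recovery_ops toy_ops[] = {
	[0] = { .get_locations = toy_v40_get_locations },
	[1] = { .get_locations = toy_v41_get_locations },
};

/* Generic code picks the table entry once instead of branching per call site. */
static int toy_probe(struct toy_server *server, unsigned int minorversion)
{
	return toy_ops[minorversion].get_locations(server);
}
```

      Callers stay version-agnostic; supporting another minor version
      means adding one more table entry.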
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Add functions to swap transports during migration recovery · 800c06a5
      Chuck Lever authored
      Introduce functions that can walk through an array of returned
      fs_locations information and connect a transport to one of the
      destination servers listed therein.
      
      Note that NFS minor version 1 introduces "fs_locations_info" which
      extends the locations array sorting criteria available to clients.
      This is not supported yet.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFS: Add nfs4_update_server · 32e62b7c
      Chuck Lever authored
      New function nfs4_update_server() moves an nfs_server to a different
      nfs_client.  This is done as part of migration recovery.
      
      Though it may be appealing to think of them as the same thing,
      migration recovery is not the same as following a referral.
      
      For a referral, the client has not descended into the file system
      yet: it has no nfs_server, no super block, no inodes or open state.
      It is enough to simply instantiate the nfs_server and super block,
      and perform a referral mount.
      
      For a migration, however, we have all of those things already, and
      they have to be moved to a different nfs_client.  No local namespace
      changes are needed here.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • SUNRPC: Add a helper to switch the transport of an rpc_clnt · 40b00b6b
      Trond Myklebust authored
      Add an RPC client API to redirect an rpc_clnt's transport from a
      source server to a destination server during a migration event.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      [ cel: forward ported to 3.12 ]
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • SUNRPC: Modify synopsis of rpc_client_register() · d746e545
      Chuck Lever authored
      The rpc_client_register() helper was added in commit e73f4cc0,
      "SUNRPC: split client creation routine into setup and registration,"
      Mon Jun 24 11:52:52 2013.  In a subsequent patch, I'd like to invoke
      rpc_client_register() from a context where a struct rpc_create_args
      is not available.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: don't reprocess cached open CLAIM_PREVIOUS · d2bfda2e
      Weston Andros Adamson authored
      Cached opens have already been handled by _nfs4_opendata_reclaim_to_nfs4_state
      and can safely skip being reprocessed, but must still call update_open_stateid
      to make sure that all active fmodes are recovered.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Cc: stable@vger.kernel.org # 3.7.x: f494a607: NFSv4: fix NULL dereference
      Cc: stable@vger.kernel.org # 3.7.x: a43ec98b: NFSv4: don't fail on missin
      Cc: stable@vger.kernel.org # 3.7.x
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: Fix state reference counting in _nfs4_opendata_reclaim_to_nfs4_state · d49f042a
      Trond Myklebust authored
      Currently, if the call to nfs_refresh_inode fails, then we end up leaking
      a reference count, due to the call to nfs4_get_open_state.
      While we're at it, replace nfs4_get_open_state with a simple call to
      atomic_inc(); there is no need to do a full lookup of the struct nfs_state
      since it is passed as an argument in the struct nfs4_opendata, and
      is already assigned to the variable 'state'.
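
      The class of bug can be shown with a toy counter.  The functions
      below are hypothetical and only model the ordering problem (a
      reference taken before a step that can fail); they are not the
      patched kernel code.

```c
#include <assert.h>

struct toy_state { int count; };

/* Leaky shape: the reference is taken before a step that can fail. */
static int toy_reclaim_leaky(struct toy_state *state, int refresh_err)
{
	state->count++;          /* reference taken up front */
	if (refresh_err)
		return refresh_err;  /* error path never drops it */
	return 0;
}

/* Safe shape: take the reference only after the fallible step succeeded. */
static int toy_reclaim_fixed(struct toy_state *state, int refresh_err)
{
	if (refresh_err)
		return refresh_err;
	state->count++;          /* plain increment, no lookup required */
	return 0;
}
```

      In the leaky variant a failed refresh leaves the count one too
      high, so the object can never be freed.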
      
      Cc: stable@vger.kernel.org # 3.7.x: a43ec98b: NFSv4: don't fail on missing
      Cc: stable@vger.kernel.org # 3.7.x
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: don't fail on missing fattr in open recover · a43ec98b
      Weston Andros Adamson authored
      This is an unneeded check that could cause the client to fail to recover
      opens.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: fix NULL dereference in open recover · f494a607
      Weston Andros Adamson authored
      _nfs4_opendata_reclaim_to_nfs4_state doesn't expect to see a cached
      open CLAIM_PREVIOUS, but this can happen. An example is when there are
      RDWR openers and RDONLY openers on a delegation stateid. The recovery
      path will first try an open CLAIM_PREVIOUS for the RDWR openers. This
      marks the delegation as not needing RECLAIM anymore, so the open
      CLAIM_PREVIOUS for the RDONLY openers will not actually send an rpc.
      
      The NULL dereference is due to _nfs4_opendata_reclaim_to_nfs4_state
      returning ERR_PTR(rpc_status) when !rpc_done. When the open is
      cached, rpc_done == 0 and rpc_status == 0, thus
      _nfs4_opendata_reclaim_to_nfs4_state returns NULL - this is unexpected
      by callers of nfs4_opendata_to_nfs4_state().
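
      Why a zero status turns into NULL can be seen with userspace
      re-creations of the kernel's error-pointer helpers.  The toy_*
      functions below are written for this illustration and only mirror
      the idea behind ERR_PTR() and IS_ERR().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode an errno-style value as a pointer, in the manner of ERR_PTR(). */
static void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True only for pointers in the top TOY_MAX_ERRNO addresses, like IS_ERR(). */
static int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}
```

      A status of 0 encodes to NULL, and NULL is not recognized as an
      error pointer, so a caller that checks only for error pointers goes
      on to dereference it.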
      
      This can be reproduced easily by opening the same file two times on an
      NFSv4.0 mount with delegations enabled, once as RDWR and once as RDONLY then
      sleeping for a long time.  While the files are held open, kick off state
      recovery and this NULL dereference will be hit every time.
      
      An example OOPS:
      
      [   65.003602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [   65.005312] IP: [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.006820] PGD 7b0ea067 PUD 791ff067 PMD 0
      [   65.008075] Oops: 0000 [#1] SMP
      [   65.008802] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_ens1371 gameport nfsd snd_rawmidi snd_ac97_codec ac97_bus btusb snd_seq snd_seq_device snd_pcm ppdev bluetooth auth_rpcgss coretemp snd_page_alloc crc32_pclmul crc32c_intel ghash_clmulni_intel microcode rfkill nfs_acl vmw_balloon serio_raw snd_timer lockd parport_pc e1000 snd soundcore parport i2c_piix4 shpchp vmw_vmci sunrpc ata_generic mperf pata_acpi mptspi vmwgfx ttm scsi_transport_spi drm mptscsih mptbase i2c_core
      [   65.018684] CPU: 0 PID: 473 Comm: 192.168.10.85-m Not tainted 3.11.2-201.fc19.x86_64 #1
      [   65.020113] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [   65.022012] task: ffff88003707e320 ti: ffff88007b906000 task.ti: ffff88007b906000
      [   65.023414] RIP: 0010:[<ffffffffa037d6ee>]  [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.025079] RSP: 0018:ffff88007b907d10  EFLAGS: 00010246
      [   65.026042] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   65.027321] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000000
      [   65.028691] RBP: ffff88007b907d38 R08: 0000000000016f60 R09: 0000000000000000
      [   65.029990] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      [   65.031295] R13: 0000000000000050 R14: 0000000000000000 R15: 0000000000000001
      [   65.032527] FS:  0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
      [   65.033981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   65.035177] CR2: 0000000000000030 CR3: 000000007b27f000 CR4: 00000000000407f0
      [   65.036568] Stack:
      [   65.037011]  0000000000000000 0000000000000001 ffff88007b907d90 ffff88007a880220
      [   65.038472]  ffff88007b768de8 ffff88007b907d48 ffffffffa037e4a5 ffff88007b907d80
      [   65.039935]  ffffffffa036a6c8 ffff880037020e40 ffff88007a880000 ffff880037020e40
      [   65.041468] Call Trace:
      [   65.042050]  [<ffffffffa037e4a5>] nfs4_close_state+0x15/0x20 [nfsv4]
      [   65.043209]  [<ffffffffa036a6c8>] nfs4_open_recover_helper+0x148/0x1f0 [nfsv4]
      [   65.044529]  [<ffffffffa036a886>] nfs4_open_recover+0x116/0x150 [nfsv4]
      [   65.045730]  [<ffffffffa036d98d>] nfs4_open_reclaim+0xad/0x150 [nfsv4]
      [   65.046905]  [<ffffffffa037d979>] nfs4_do_reclaim+0x149/0x5f0 [nfsv4]
      [   65.048071]  [<ffffffffa037e1dc>] nfs4_run_state_manager+0x3bc/0x670 [nfsv4]
      [   65.049436]  [<ffffffffa037de20>] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
      [   65.050686]  [<ffffffffa037de20>] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
      [   65.051943]  [<ffffffff81088640>] kthread+0xc0/0xd0
      [   65.052831]  [<ffffffff81088580>] ? insert_kthread_work+0x40/0x40
      [   65.054697]  [<ffffffff8165686c>] ret_from_fork+0x7c/0xb0
      [   65.056396]  [<ffffffff81088580>] ? insert_kthread_work+0x40/0x40
      [   65.058208] Code: 5c 41 5d 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b 67 30 f0 41 ff 44 24 44 49 8d 7c 24 40 e8 0e 0a 2d e1 44
      [   65.065225] RIP  [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.067175]  RSP <ffff88007b907d10>
      [   65.068570] CR2: 0000000000000030
      [   65.070098] ---[ end trace 0d1fe4f5c7dd6f8b ]---
      
      Cc: <stable@vger.kernel.org> #3.7+
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4.1: Don't change the security label as part of open reclaim. · 83c78eb0
      Trond Myklebust authored
      The current caching model calls for the security label to be set on
      first lookup and/or on any subsequent label changes. There is no
      need to do it as part of an open reclaim.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • nfs: fix handling of invalid mount options in nfs_remount · 1966903f
      Jeff Layton authored
      nfs_parse_mount_options returns 0 on error, not -errno.
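
      The mismatch between the two return conventions can be modelled as
      follows.  The toy_* functions are hypothetical, and the choice of
      -EINVAL for the fixed caller is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical parser using the "nonzero on success, 0 on error" convention. */
static int toy_parse_options(const char *raw)
{
	return strcmp(raw, "bogus") != 0;
}

/* Mistaken caller: expects a negative errno, so a 0 failure looks like success. */
static int toy_remount_buggy(const char *raw)
{
	int error = toy_parse_options(raw);

	if (error < 0)
		return error;
	return 0;  /* invalid options slip through */
}

/* Corrected caller: maps the parser's 0 onto an errno value. */
static int toy_remount_fixed(const char *raw)
{
	if (!toy_parse_options(raw))
		return -EINVAL;
	return 0;
}
```

      The buggy caller never sees a negative value, so invalid options
      are silently accepted.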
      Reported-by: Karel Zak <kzak@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • Jeff Layton's avatar
    • NFSv4 Remove zeroing state kern warnings · 3660cd43
      Andy Adamson authored
      As of commit 5d422301 we no longer zero the
      state.
      Signed-off-by: Andy Adamson <andros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  2. 01 Oct, 2013 12 commits
  3. 30 Sep, 2013 11 commits
    • Merge branch 'akpm' (fixes from Andrew Morton) · 522d6d38
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
        pidns: fix free_pid() to handle the first fork failure
        ipc,msg: prevent race with rmid in msgsnd,msgrcv
        ipc/sem.c: update sem_otime for all operations
        mm/hwpoison: fix the lack of one reference count against poisoned page
        mm/hwpoison: fix false report on 2nd attempt at page recovery
        mm/hwpoison: fix test for a transparent huge page
        mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
        block: change config option name for cmdline partition parsing
        mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
        mm: avoid reinserting isolated balloon pages into LRU lists
        arch/parisc/mm/fault.c: fix uninitialized variable usage
        include/asm-generic/vtime.h: avoid zero-length file
        nilfs2: fix issue with race condition of competition between segments for dirty blocks
        Documentation/kernel-parameters.txt: replace kernelcore with Movable
        mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
        kernel/kmod.c: check for NULL in call_usermodehelper_exec()
        ipc/sem.c: synchronize the proc interface
        ipc/sem.c: optimize sem_lock()
        ipc/sem.c: fix race in sem_lock()
        mm/compaction.c: periodically schedule when freeing pages
        ...
    • pidns: fix free_pid() to handle the first fork failure · 314a8ad0
      Oleg Nesterov authored
      "case 0" in free_pid() assumes that disable_pid_allocation() should
      clear PIDNS_HASH_ADDING before the last pid goes away.
      
      However this doesn't happen if the first fork() fails to create the
      child reaper which should call disable_pid_allocation().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc,msg: prevent race with rmid in msgsnd,msgrcv · 4271b05a
      Davidlohr Bueso authored
      This fixes a race in both msgrcv() and msgsnd() between finding the msg
      and actually dealing with the queue, as another thread can delete the
      msqid underneath us if we are preempted before acquiring the
      kern_ipc_perm.lock.
      
      Manfred illustrates this nicely:
      
      Assume a preemptible kernel that is preempted just after
      
          msq = msq_obtain_object_check(ns, msqid)
      
      in do_msgrcv().  The only lock that is held is rcu_read_lock().
      
      Now the other thread processes IPC_RMID.  When the first task is
      resumed, then it will happily wait for messages on a deleted queue.
      
      Fix this by checking whether the queue has been deleted after taking
      the lock.
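
      The pattern is "look up, lock, then revalidate".  The single-threaded
      model below only shows the order of the checks; the toy_* names are
      hypothetical and the -EIDRM return value is an assumption for the
      sketch.

```c
#include <assert.h>
#include <errno.h>

struct toy_queue {
	int locked;
	int deleted;  /* set by the remover while it holds the lock */
};

static void toy_lock(struct toy_queue *q)   { q->locked = 1; }
static void toy_unlock(struct toy_queue *q) { q->locked = 0; }

/* Receiver: the lookup happened earlier, so revalidate once the lock is held. */
static int toy_receive(struct toy_queue *q)
{
	int err = 0;

	toy_lock(q);
	if (q->deleted)
		err = -EIDRM;  /* queue went away between lookup and lock */
	toy_unlock(q);
	return err;
}

/* Remover: stands in for an IPC_RMID issued by another thread. */
static void toy_remove(struct toy_queue *q)
{
	toy_lock(q);
	q->deleted = 1;
	toy_unlock(q);
}
```

      Without the check under the lock, the receiver would go on to wait
      on a queue that no longer exists.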
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reported-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: <stable@vger.kernel.org> 	[3.11]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem.c: update sem_otime for all operations · 0e8c6656
      Manfred Spraul authored
      In commit 0a2b9d4c ("ipc/sem.c: move wake_up_process out of the
      spinlock section"), the update of the semaphore's sem_otime (last
      semop time) was moved to one central position (do_smart_update).
      
      But since do_smart_update() is only called for operations that modify
      the array, this means that wait-for-zero semops do not update sem_otime
      anymore.
      
      The fix is simple:
      Non-alter operations must update sem_otime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reported-by: Jia He <jiakernel@gmail.com>
      Tested-by: Jia He <jiakernel@gmail.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix the lack of one reference count against poisoned page · fb31ba30
      Wanpeng Li authored
      When hwpoison_inject runs w/o hwpoison_filter enabled, it does not
      take a reference on the poisoned page.  As a result, hwpoison detects
      that -1 users still reference the page, whereas the number should be
      0, not counting the reference the poison handler holds, after a
      successful unmap.  This patch fixes it by holding one reference on
      the poisoned page for hwpoison_inject both w/ and w/o hwpoison_filter
      enabled.
      
      Before patch:
      
      [   71.902112] Injecting memory failure at pfn 224706
      [   71.902137] MCE 0x224706: dirty LRU page recovery: Failed
      [   71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users
      
      After patch:
      
      [   94.710860] Injecting memory failure at pfn 215b68
      [   94.710885] MCE 0x215b68: dirty LRU page recovery: Recovered
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix false report on 2nd attempt at page recovery · 2d421acd
      Wanpeng Li authored
      If the page is poisoned by software injection w/ the
      MF_COUNT_INCREASED flag, the report falsely describes the page
      recovery as a 2nd attempt.
      
      This patch fixes it by reporting free buddy page recovery as the
      first attempt if MF_COUNT_INCREASED is set.
      
      Before patch:
      
      [  346.332041] Injecting memory failure at pfn 200010
      [  346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed
      
      After patch:
      
      [  297.742600] Injecting memory failure at pfn 200010
      [  297.742941] MCE 0x200010: free buddy page recovery: Delayed
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix test for a transparent huge page · e76d30e2
      Wanpeng Li authored
      PageTransHuge() can't guarantee the page is a transparent huge page
      since it returns true for both transparent huge and hugetlbfs pages.
      
      This patch fixes it by also checking that the page is not a hugetlbfs
      page.
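
      The combined test can be modelled with a toy page type.  The
      predicates below are hypothetical stand-ins that only mimic the
      behaviour described above; they are not the kernel's page-flag
      helpers.

```c
#include <assert.h>

enum toy_kind { TOY_SMALL, TOY_THP, TOY_HUGETLB };
struct toy_page { enum toy_kind kind; };

/* Mimics the described behaviour: true for both kinds of huge page. */
static int toy_page_trans_huge(const struct toy_page *p)
{
	return p->kind == TOY_THP || p->kind == TOY_HUGETLB;
}

/* True only for hugetlbfs pages. */
static int toy_page_hugetlb(const struct toy_page *p)
{
	return p->kind == TOY_HUGETLB;
}

/* Combined test: matches transparent huge pages only. */
static int toy_is_thp(const struct toy_page *p)
{
	return toy_page_trans_huge(p) && !toy_page_hugetlb(p);
}
```

      Only the conjunction of the two predicates excludes hugetlbfs
      pages from the transparent huge page path.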
      
      Before patch:
      
      [  121.571128] Injecting memory failure at pfn 23a200
      [  121.571141] MCE 0x23a200: huge page recovery: Delayed
      [  140.355100] MCE: Memory failure is now running on 0x23a200
      
      After patch:
      
      [   94.290793] Injecting memory failure at pfn 23a000
      [   94.290800] MCE 0x23a000: huge page recovery: Delayed
      [  105.722303] MCE: Software-unpoisoned page 0x23a000
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood · 20cb6cab
      Wanpeng Li authored
      madvise_hwpoison doesn't check whether the page is a small page or a
      huge page and unconditionally traverses the range in small page
      granularity, which results in a printk flood "MCE xxx: already
      hardware poisoned" if the page is a huge page.
      
      This patch fixes it by using compound_order(compound_head(page)) for
      the huge page iterator.
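
      The effect of the step size can be illustrated by counting loop
      iterations.  The function and the order of 9 (512 small pages per
      huge page) are assumptions for the sketch, not the kernel's loop.

```c
#include <assert.h>

#define TOY_HUGE_ORDER 9  /* assumed: 2^9 small pages per huge page */

/*
 * Count poison attempts over nr_pages small pages.  With use_order set,
 * a huge page is visited once; otherwise every small page inside it is.
 */
static unsigned long toy_count_attempts(unsigned long nr_pages, int huge,
					int use_order)
{
	unsigned long step = (huge && use_order) ? (1UL << TOY_HUGE_ORDER) : 1;
	unsigned long attempts = 0;

	for (unsigned long pfn = 0; pfn < nr_pages; pfn += step)
		attempts++;
	return attempts;
}
```

      For three huge pages, stepping by small pages makes 1536 attempts,
      all but three of which hit an already poisoned page; stepping by
      the compound order makes three.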
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 3
      #define PAGE_SIZE	(4096 * 512)
      
      int main(void)
      {
      	char *mem;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      	if (mem == MAP_FAILED)
      		return -1;
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: change config option name for cmdline partition parsing · 080506ad
      Paul Gortmaker authored
      Recently commit bab55417 ("block: support embedded device command
      line partition") introduced CONFIG_CMDLINE_PARSER.  However, that name
      is too generic and sounds like it enables/disables generic kernel boot
      arg processing, when it really is block specific.
      
      Before this option becomes a part of a full/final release, add the BLK_
      prefix to it so that it is clear in absence of any other context that it
      is block specific.
      
      In addition, fix up the following less critical items:
       - help text was not really at all helpful.
       - index file for Documentation was not updated
       - add the new arg to Documentation/kernel-parameters.txt
       - clarify wording in source comments
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Cai Zhiyong <caizhiyong@huawei.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration · eadb41ae
      Vlastimil Babka authored
      The function __munlock_pagevec_fill() introduced in commit 7a8010cd
      ("mm: munlock: manual pte walk in fast path instead of
      follow_page_mask()") uses pmd_addr_end() to restrict its operation
      to the current page table.
      
      This is insufficient on architectures/configurations where pmd is folded
      and pmd_addr_end() just returns the end of the full range to be walked.
      In this case, it allows pte++ to walk off the end of a page table
      resulting in unpredictable behaviour.
      
      This patch fixes the function by using pgd_addr_end() and pud_addr_end()
      before pmd_addr_end(), which will yield correct page table boundary on
      all configurations.  This is similar to what existing page walkers do
      when walking each level of the page table.
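
      The difference between a clamping and a folded boundary helper can
      be sketched in userspace.  The toy_* names and the 2 MiB span are
      assumptions for the illustration; they are not the kernel macros.

```c
#include <assert.h>

#define TOY_PMD_SHIFT 21
#define TOY_PMD_SIZE  (1UL << TOY_PMD_SHIFT)
#define TOY_PMD_MASK  (~(TOY_PMD_SIZE - 1))

/* Next boundary after addr, clamped to end (behaviour when not folded). */
static unsigned long toy_pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PMD_SIZE) & TOY_PMD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Folded variant: no boundary at this level, the caller just gets end. */
static unsigned long toy_pmd_addr_end_folded(unsigned long addr,
					      unsigned long end)
{
	(void)addr;
	return end;
}
```

      With the folded variant a walker that increments a pte pointer up
      to the returned address can run past the page table it started in,
      which is why a boundary from a higher level is needed as well.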
      
      Additionally, the patch clarifies a comment for the get_locked_pte()
      call in the function.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Rafael Aquini authored
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such a balloon page leak opens a race window against the LRU
      list shrinkers that leads to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue by ensuring the proper tests are made in
      putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
      isolated balloon pages being wrongly reinserted in LRU lists.
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
      Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>