• Ross Zwisler's avatar
    dax: fix race between colliding PMD & PTE entries · e2093926
    Ross Zwisler authored
    We currently have two related PMD vs PTE races in the DAX code.  These
    can both be easily triggered by having two threads reading and writing
    simultaneously to the same private mapping, with the key being that
    private mapping reads can be handled with PMDs but private mapping
    writes are always handled with PTEs so that we can COW.
    
    Here is the first race:
    
      CPU 0					CPU 1
    
      (private mapping write)
      __handle_mm_fault()
        create_huge_pmd() - FALLBACK
        handle_pte_fault()
          passes check for pmd_devmap()
    
    					(private mapping read)
    					__handle_mm_fault()
    					  create_huge_pmd()
    					    dax_iomap_pmd_fault() inserts PMD
    
          dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
          			  installed in our page tables at this spot.
    
    Here's the second race:
    
      CPU 0					CPU 1
    
      (private mapping read)
      __handle_mm_fault()
        passes check for pmd_none()
        create_huge_pmd()
          dax_iomap_pmd_fault() inserts PMD
    
      (private mapping write)
      __handle_mm_fault()
        create_huge_pmd() - FALLBACK
    					(private mapping read)
    					__handle_mm_fault()
    					  passes check for pmd_none()
    					  create_huge_pmd()
    
        handle_pte_fault()
          dax_iomap_pte_fault() inserts PTE
    					    dax_iomap_pmd_fault() inserts PMD,
    					       but we already have a PTE at
    					       this spot.
    
    The core of the issue is that while there is isolation between faults to
    the same range in the DAX fault handlers via our DAX entry locking,
    there is no isolation between faults in the code in mm/memory.c.  This
    means for instance that this code in __handle_mm_fault() can run:
    
    	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
    		ret = create_huge_pmd(&vmf);
    
    But by the time we actually get to run the fault handler called by
    create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
    fault has installed a normal PMD here as a parent.  This is the cause of
    the 2nd race.  The first race is similar - there is the following check
    in handle_pte_fault():
    
    	} else {
    		/* See comment in pte_alloc_one_map() */
    		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
    			return 0;
    
    So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
    will bail and retry the fault.  This is correct, but there is nothing
    preventing the PMD from being installed after this check but before we
    actually get to the DAX PTE fault handlers.
    
    In my testing these races result in the following types of errors:
    
      BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
      BUG: non-zero nr_ptes on freeing mm: 15
    
    Fix this issue by having the DAX fault handlers verify that it is safe
    to continue their fault after they have taken an entry lock to block
    other racing faults.
    
    [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
      Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
    Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
    Reported-by: default avatarPawel Lebioda <pawel.lebioda@intel.com>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Pawel Lebioda <pawel.lebioda@intel.com>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Xiong Zhou <xzhou@redhat.com>
    Cc: Eryu Guan <eguan@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    e2093926
dax.c 41.2 KB