1. 01 Oct, 2013 17 commits
  2. 27 Sep, 2013 23 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.10.13 · cff43fc8
      Greg Kroah-Hartman authored
      cff43fc8
    • Miklos Szeredi's avatar
      fuse: readdir: check for slash in names · e5362560
      Miklos Szeredi authored
      commit efeb9e60 upstream.
      
      Userspace can add names containing a slash character to the directory
      listing.  Don't allow this as it could cause all sorts of trouble.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5362560
    • Maxim Patlasov's avatar
      fuse: hotfix truncate_pagecache() issue · 4e208303
      Maxim Patlasov authored
      commit 06a7c3c2 upstream.
      
      The way how fuse calls truncate_pagecache() from fuse_change_attributes()
      is completely wrong. Because, w/o i_mutex held, we never sure whether
      'oldsize' and 'attr->size' are valid by the time of execution of
      truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
      released fc->lock in the middle of fuse_change_attributes(), we completely
      loose control of actions which may happen with given inode until we reach
      truncate_pagecache. The list of potentially dangerous actions includes
      mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
      
      The typical outcome of doing truncate_pagecache() with outdated arguments
      is data corruption from user point of view. This is (in some sense)
      acceptable in cases when the issue is triggered by a change of the file on
      the server (i.e. externally wrt fuse operation), but it is absolutely
      intolerable in scenarios when a single fuse client modifies a file without
      any external intervention. A real life case I discovered by fsx-linux
      looked like this:
      
      1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
      FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
      2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
      to the server synchronously, then calls fuse_change_attributes(). The
      latter updates i_size, releases fc->lock, but before comparing oldsize vs
      attr->size..
      3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
      updating attributes and i_size, but now oldsize is equal to
      outarg.attr.size because i_size has just been updated (step 2). Hence,
      fuse_do_setattr() returns w/o calling truncate_pagecache().
      4. As soon as ftruncate(2) completes, the user extends file size by
      write(2) making a hole in the middle of file, then reads data from the hole
      either by read(2) or mmap-ed read. The user expects to get zero data from
      the hole, but gets stale data because truncate_pagecache() is not executed
      yet.
      
      The scenario above illustrates one side of the problem: not truncating the
      page cache even though we should. Another side corresponds to truncating
      page cache too late, when the state of inode changed significantly.
      Theoretically, the following is possible:
      
      1. As in the previous scenario fuse_dentry_revalidate() discovered that
      i_size changed (due to our own fuse_do_setattr()) and is going to call
      truncate_pagecache() for some 'new_size' it believes valid right now. But
      by the time that particular truncate_pagecache() is called ...
      2. fuse_do_setattr() returns (either having called truncate_pagecache() or
      not -- it doesn't matter).
      3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
      4. mmap-ed write makes a page in the extended region dirty.
      
      The result will be the lost of data user wrote on the fourth step.
      
      The patch is a hotfix resolving the issue in a simplistic way: let's skip
      dangerous i_size update and truncate_pagecache if an operation changing
      file size is in progress. This simplistic approach looks correct for the
      cases w/o external changes. And to handle them properly, more sophisticated
      and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
      postpone it until the issue is well discussed on the mailing list(s).
      
      Changed in v2:
       - improved patch description to cover both sides of the issue.
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e208303
    • Anand Avati's avatar
      fuse: invalidate inode attributes on xattr modification · eb97a45d
      Anand Avati authored
      commit d331a415 upstream.
      
      Calls like setxattr and removexattr result in updation of ctime.
      Therefore invalidate inode attributes to force a refresh.
      Signed-off-by: default avatarAnand Avati <avati@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb97a45d
    • Maxim Patlasov's avatar
      fuse: postpone end_page_writeback() in fuse_writepage_locked() · 3002e63b
      Maxim Patlasov authored
      commit 4a4ac4eb upstream.
      
      The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
      
      1) An user makes a page dirty via mmap-ed write.
      2) The user performs shrinking truncate(2) intended to purge the page.
      3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
         writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
         the original page by end_page_writeback.
      4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
         is free.
      5) Ordinary write(2) extends i_size back to cover the page. Note that
         fuse_send_write_pages do wait for fuse writeback, but for another
         page->index.
      6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
         fuse_send_writepage is supposed to crop inarg->size of the request,
         but it doesn't because i_size has already been extended back.
      
      Moving end_page_writeback to the end of fuse_writepage_locked fixes the
      race because now the fact that truncate_pagecache is successfully returned
      infers that fuse_writepage_locked has already called end_page_writeback.
      And this, in turn, infers that fuse_flush_writepages has already called
      fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
      could not extend it because of i_mutex held by ftruncate(2).
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3002e63b
    • Mark Brown's avatar
      clk: wm831x: Initialise wm831x pointer on init · d662986b
      Mark Brown authored
      commit 08442ce9 upstream.
      
      Otherwise any attempt to interact with the hardware will crash. This is
      what happens when drivers get written blind.
      Signed-off-by: default avatarMark Brown <broonie@linaro.org>
      Signed-off-by: default avatarMike Turquette <mturquette@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d662986b
    • Brian Norris's avatar
      mtd: nand: fix NAND_BUSWIDTH_AUTO for x16 devices · 497587f8
      Brian Norris authored
      commit 68e80780 upstream.
      
      The code for NAND_BUSWIDTH_AUTO is broken. According to Alexander:
      
        "I have a problem with attach NAND UBI in 16 bit mode.
         NAND works fine if I specify NAND_BUSWIDTH_16 option, but not
         working with NAND_BUSWIDTH_AUTO option. In second case NAND
         chip is identifyed with ONFI."
      
      See his report for the rest of the details:
      
        http://lists.infradead.org/pipermail/linux-mtd/2013-July/047515.html
      
      Anyway, the problem is that nand_set_defaults() is called twice, we
      intend it to reset the chip functions to their x16 buswidth verions
      if the buswidth changed from x8 to x16; however, nand_set_defaults()
      does exactly nothing if called a second time.
      
      Fix this by hacking nand_set_defaults() to reset the buswidth-dependent
      functions if they were set to the x8 version the first time. Note that
      this does not do anything to reset from x16 to x8, but that's not the
      supported use case for NAND_BUSWIDTH_AUTO anyway.
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Reported-by: default avatarAlexander Shiyan <shc_work@mail.ru>
      Tested-by: default avatarAlexander Shiyan <shc_work@mail.ru>
      Cc: Matthieu Castet <matthieu.castet@parrot.com>
      Signed-off-by: default avatarArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      497587f8
    • Grant Likely's avatar
      of: Fix missing memory initialization on FDT unflattening · 6830e9ab
      Grant Likely authored
      commit 0640332e upstream.
      
      Any calls to dt_alloc() need to be zeroed. This is a temporary fix, but
      the allocation function itself needs to zero memory before returning
      it. This is a follow up to patch 9e401275, "of: fdt: fix memory
      initialization for expanded DT" which fixed one call site but missed
      another.
      Signed-off-by: default avatarGrant Likely <grant.likely@linaro.org>
      Acked-by: default avatarWladislav Wiebe <wladislav.kw@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6830e9ab
    • Sergei Shtylyov's avatar
      mmc: tmio_mmc_dma: fix PIO fallback on SDHI · c0da0888
      Sergei Shtylyov authored
      commit f936f9b6 upstream.
      
      I'm testing SH-Mobile SDHI driver in DMA mode with  a new DMA controller  using
      'bonnie++' and getting DMA error after which the tmio_mmc_dma.c code falls back
      to PIO but all commands time out after that.  It turned out that the fallback
      code calls tmio_mmc_enable_dma() with RX/TX channels already freed and pointers
      to them cleared, so that the function bails out early instead  of clearing the
      DMA bit in the CTL_DMA_ENABLE register. The regression was introduced by commit
      162f43e3 (mmc: tmio: fix a deadlock).
      Moving tmio_mmc_enable_dma() calls to the top of the PIO fallback code in
      tmio_mmc_start_dma_{rx|tx}() helps.
      Signed-off-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Acked-by: default avatarGuennadi Liakhovetski <g.liakhovetski@gmx.de>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c0da0888
    • Josh Durgin's avatar
      rbd: fix I/O error propagation for reads · be4c4b85
      Josh Durgin authored
      commit 17c1cc1d upstream.
      
      When a request returns an error, the driver needs to report the entire
      extent of the request as completed.  Writes already did this, since
      they always set xferred = length, but reads were skipping that step if
      an error other than -ENOENT occurred.  Instead, rbd would end up
      passing 0 xferred to blk_end_request(), which would always report
      needing more data.  This resulted in an assert failing when more data
      was required by the block layer, but all the object requests were
      done:
      
      [ 1868.719077] rbd: obj_request read result -108 xferred 0
      [ 1868.719077]
      [ 1868.719518] end_request: I/O error, dev rbd1, sector 0
      [ 1868.719739]
      [ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
      [ 1868.719739]
      [ 1868.719739]   rbd_assert(more ^ (which == img_request->obj_request_count));
      
      Without this assert, reads that hit errors would hang forever, since
      the block layer considered them incomplete.
      
      Fixes: http://tracker.ceph.com/issues/5647Signed-off-by: default avatarJosh Durgin <josh.durgin@inktank.com>
      Reviewed-by: default avatarAlex Elder <alex.elder@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be4c4b85
    • majianpeng's avatar
    • Sage Weil's avatar
      libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc · fd5e2dea
      Sage Weil authored
      commit 9542cf0b upstream.
      
      Fix a typo that used the wrong bitmask for the pg.seed calculation.  This
      is normally unnoticed because in most cases pg_num == pgp_num.  It is, however,
      a bug that is easily corrected.
      Signed-off-by: default avatarSage Weil <sage@inktank.com>
      Reviewed-by: default avatarAlex Elder <alex.elder@linary.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd5e2dea
    • majianpeng's avatar
      libceph: unregister request in __map_request failed and nofail == false · 2ab0ad6a
      majianpeng authored
      commit 73d9f7ee upstream.
      
      For nofail == false request, if __map_request failed, the caller does
      cleanup work, like releasing the relative pages.  It doesn't make any sense
      to retry this request.
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Reviewed-by: default avatarSage Weil <sage@inktank.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ab0ad6a
    • Richard Weinberger's avatar
      um: Implement probe_kernel_read() · 4fdaa3d4
      Richard Weinberger authored
      commit f75b1b1b upstream.
      
      UML needs it's own probe_kernel_read() to handle kernel
      mode faults correctly.
      The implementation uses mincore() on the host side to detect
      whether a page is owned by the UML kernel process.
      
      This fixes also a possible crash when sysrq-t is used.
      Starting with 3.10 sysrq-t calls probe_kernel_read() to
      read details from the kernel workers. As kernel worker are
      completely async pointers may turn NULL while reading them.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Cc: <stian@nixia.no>
      Cc: <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fdaa3d4
    • Alex Deucher's avatar
      drm/edid: add quirk for Medion MD30217PG · 579db19e
      Alex Deucher authored
      commit 118bdbd8 upstream.
      
      This LCD monitor (1280x1024 native) has a completely
      bogus detailed timing (640x350@70hz).  User reports that
      1280x1024@60 has waves so prefer 1280x1024@75.
      
      Manufacturer: MED  Model: 7b8  Serial#: 99188
      Year: 2005  Week: 5
      EDID Version: 1.3
      Analog Display Input,  Input Voltage Level: 0.700/0.700 V
      Sync:  Separate
      Max Image Size [cm]: horiz.: 34  vert.: 27
      Gamma: 2.50
      DPMS capabilities: Off; RGB/Color Display
      First detailed timing is preferred mode
      redX: 0.645 redY: 0.348   greenX: 0.280 greenY: 0.605
      blueX: 0.142 blueY: 0.071   whiteX: 0.313 whiteY: 0.329
      Supported established timings:
      720x400@70Hz
      640x480@60Hz
      640x480@72Hz
      640x480@75Hz
      800x600@56Hz
      800x600@60Hz
      800x600@72Hz
      800x600@75Hz
      1024x768@60Hz
      1024x768@70Hz
      1024x768@75Hz
      1280x1024@75Hz
      Manufacturer's mask: 0
      Supported standard timings:
      Supported detailed timing:
      clock: 25.2 MHz   Image Size:  337 x 270 mm
      h_active: 640  h_sync: 688  h_sync_end 784 h_blank_end 800 h_border: 0
      v_active: 350  v_sync: 350  v_sync_end 352 v_blanking: 449 v_border: 0
      Monitor name: MD30217PG
      Ranges: V min: 56 V max: 76 Hz, H min: 30 H max: 83 kHz, PixClock max 145 MHz
      Serial No: 501099188
      EDID (in hex):
                00ffffffffffff0034a4b80774830100
                050f010368221b962a0c55a559479b24
                125054afcf00310a0101010101018180
                000000000000d60980a0205e63103060
                0200510e1100001e000000fc004d4433
                3032313750470a202020000000fd0038
                4c1e530e000a202020202020000000ff
                003530313039393138380a2020200078
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Reported-by: friedrich@mailstation.de
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      579db19e
    • Borislav Petkov's avatar
      amd64_edac: Fix single-channel setups · 34db3c07
      Borislav Petkov authored
      commit f0a56c48 upstream.
      
      It can happen that configurations are running in a single-channel mode
      even with a dual-channel memory controller, by, say, putting the DIMMs
      only on the one channel and leaving the other empty. This causes a
      problem in init_csrows which implicitly assumes that when the second
      channel is enabled, i.e. channel 1, the struct dimm hierarchy will be
      present. Which is not.
      
      So always allocate two channels unconditionally.
      
      This provides for the nice side effect that the data structures are
      initialized so some day, when memory hotplug is supported, it should
      just work out of the box when all of a sudden a second channel appears.
      Reported-and-tested-by: default avatarRoger Leigh <rleigh@debian.org>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      34db3c07
    • Jan Kara's avatar
      isofs: Refuse RW mount of the filesystem instead of making it RO · 4b5fe51a
      Jan Kara authored
      commit 17b7f7cf upstream.
      
      Refuse RW mount of isofs filesystem. So far we just silently changed it
      to RO mount but when the media is writeable, block layer won't notice
      this change and thus will think device is used RW and will block eject
      button of the drive. That is unexpected by users because for
      non-writeable media eject button works just fine.
      
      Userspace mount(8) command handles this just fine and retries mounting
      with MS_RDONLY set so userspace shouldn't see any regression.  Plus any
      tool mounting isofs is likely confronted with the case of read-only
      media where block layer already refuses to mount the filesystem without
      MS_RDONLY set so our behavior shouldn't be anything new for it.
      Reported-by: default avatarHui Wang <hui.wang@canonical.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b5fe51a
    • Eric W. Biederman's avatar
      proc: Restrict mounting the proc filesystem · 1ca91545
      Eric W. Biederman authored
      commit aee1c13d upstream.
      
      Don't allow mounting the proc filesystem unless the caller has
      CAP_SYS_ADMIN rights over the pid namespace.  The principle here is if
      you create or have capabilities over it you can mount it, otherwise
      you get to live with what other people have mounted.
      
      Andy pointed out that this is needed to prevent users in a user
      namespace from remounting proc and specifying different hidepid and gid
      options on already existing proc mounts.
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ca91545
    • Libin's avatar
      mm/huge_memory.c: fix potential NULL pointer dereference · 8b89ae8a
      Libin authored
      commit a8f531eb upstream.
      
      In collapse_huge_page() there is a race window between releasing the
      mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may
      return NULL.  So check the return value to avoid NULL pointer dereference.
      
      collapse_huge_page
      	khugepaged_alloc_page
      		up_read(&mm->mmap_sem)
      	down_write(&mm->mmap_sem)
      	vma = find_vma(mm, address)
      Signed-off-by: default avatarLibin <huawei.libin@huawei.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b89ae8a
    • Greg Thelen's avatar
      memcg: fix multiple large threshold notifications · d96fa179
      Greg Thelen authored
      commit 2bff24a3 upstream.
      
      A memory cgroup with (1) multiple threshold notifications and (2) at least
      one threshold >=2G was not reliable.  Specifically the notifications would
      either not fire or would not fire in the proper order.
      
      The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
      thresholds in sorted order.  mem_cgroup_usage_register_event() sorts them
      with compare_thresholds(), which returns the difference of two 64 bit
      thresholds as an int.  If the difference is positive but has bit[31] set,
      then sort() treats the difference as negative and breaks sort order.
      
      This fix compares the two arbitrary 64 bit thresholds returning the
      classic -1, 0, 1 result.
      
      The test below sets two notifications (at 0x1000 and 0x81001000):
        cd /sys/fs/cgroup/memory
        mkdir x
        for x in 4096 2164264960; do
          cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
        done
        echo $$ > x/cgroup.procs
        anon_leaker 500M
      
      v3.11-rc7 fails to signal the 4096 event listener:
        Leaking...
        Done leaking pages.
      
      Patched v3.11-rc7 properly notifies:
        Leaking...
        4096 listener:2013:8:31:14:13:36
        Done leaking pages.
      
      The fixed bug is old.  It appears to date back to the introduction of
      memcg threshold notifications in v2.6.34-rc1-116-g2e72b634 "memcg:
      implement memory thresholds"
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d96fa179
    • Jie Liu's avatar
      ocfs2: fix the end cluster offset of FIEMAP · 3c46f726
      Jie Liu authored
      commit 28e8be31 upstream.
      
      Call fiemap ioctl(2) with given start offset as well as an desired mapping
      range should show extents if possible.  However, we somehow figure out the
      end offset of mapping via 'mapping_end -= cpos' before iterating the
      extent records which would cause problems if the given fiemap length is
      too small to a cluster size, e.g,
      
      Cluster size 4096:
      debugfs.ocfs2 1.6.3
              Block Size Bits: 12   Cluster Size Bits: 12
      
      The extended fiemap test utility From David:
      https://gist.github.com/anonymous/6172331
      
      # dd if=/dev/urandom of=/ocfs2/test_file bs=1M count=1000
      # ./fiemap /ocfs2/test_file 4096 10
      start: 4096, length: 10
      File /ocfs2/test_file has 0 extents:
      #	Logical          Physical         Length           Flags
      	^^^^^ <-- No extent is shown
      
      In this case, at ocfs2_fiemap(): cpos == mapping_end == 1. Hence the
      loop of searching extent records was not executed at all.
      
      This patch remove the in question 'mapping_end -= cpos', and loops
      until the cpos is larger than the mapping_end as usual.
      
      # ./fiemap /ocfs2/test_file 4096 10
      start: 4096, length: 10
      File /ocfs2/test_file has 1 extents:
      #	Logical          Physical         Length           Flags
      0:	0000000000000000 0000000056a01000 0000000006a00000 0000
      Signed-off-by: default avatarJie Liu <jeff.liu@oracle.com>
      Reported-by: default avatarDavid Weber <wb@munzinger.de>
      Tested-by: default avatarDavid Weber <wb@munzinger.de>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Mark Fashen <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c46f726
    • Oleg Nesterov's avatar
      pidns: fix vfork() after unshare(CLONE_NEWPID) · f608ebd7
      Oleg Nesterov authored
      commit e79f525e upstream.
      
      Commit 8382fcac ("pidns: Outlaw thread creation after
      unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
      pid_ns, this obviously breaks vfork:
      
      	int main(void)
      	{
      		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
      		assert(vfork() >= 0);
      		_exit(0);
      		return 0;
      	}
      
      fails without this patch.
      
      Change this check to use CLONE_SIGHAND instead.  This also forbids
      CLONE_THREAD automatically, and this is what the comment implies.
      
      We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
      would be safer to not do this.  The current check denies CLONE_SIGHAND
      implicitely and there is no reason to change this.
      
      Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
      Having shared signal handling between two different pid namespaces is
      the case that we are fundamentally guarding against."
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarColin Walters <walters@redhat.com>
      Acked-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f608ebd7
    • Eric W. Biederman's avatar
      pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup · 5a48788c
      Eric W. Biederman authored
      commit a6064885 upstream.
      
      Serge Hallyn <serge.hallyn@ubuntu.com> writes:
      
      > Since commit af4b8a83 it's been
      > possible to get into a situation where a pidns reaper is
      > <defunct>, reparented to host pid 1, but never reaped.  How to
      > reproduce this is documented at
      >
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
      > (and see
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
      > In short, run repeated starts of a container whose init is
      >
      > Process.exit(0);
      >
      > sysrq-t when such a task is playing zombie shows:
      >
      > [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
      > [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
      > [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
      > [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
      > [  131.132978] Call Trace:
      > [  131.132978]  [<ffffffff816f6159>] schedule+0x29/0x70
      > [  131.132978]  [<ffffffff81064591>] do_exit+0x6e1/0xa40
      > [  131.132978]  [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30
      > [  131.132978]  [<ffffffff8106496f>] do_group_exit+0x3f/0xa0
      > [  131.132978]  [<ffffffff810649e4>] SyS_exit_group+0x14/0x20
      > [  131.132978]  [<ffffffff8170102f>] tracesys+0xe1/0xe6
      >
      > Further debugging showed that every time this happened, zap_pid_ns_processes()
      > started with nr_hashed being 3, while we were expecting it to drop to 2.
      > Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
      > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
      > if nr_hashed hits 1.
      
      The issue is that when the task group leader of an init process exits
      before other tasks of the init process when the init process finally
      exits it will be a secondary task sleeping in zap_pid_ns_processes and
      waiting to wake up when the number of hashed pids drops to two.  This
      case waits forever as free_pid only sends a wake up when the number of
      hashed pids drops to 1.
      
      To correct this the simple strategy of sending a possibly unncessary
      wake up when the number of hashed pids drops to 2 is adopted.
      
      Sending one extraneous wake up is relatively harmless, at worst we
      waste a little cpu time in the rare case when a pid namespace
      appropaches exiting.
      
      We can detect the case when the pid namespace drops to just two pids
      hashed race free in free_pid.
      
      Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
      without out the tasklist_lock because it is guaranteed that the
      detach_pid will be called on the child_reaper before it is freed and
      detach_pid calls __change_pid which calls free_pid which takes the
      pidmap_lock.  __change_pid only calls free_pid if this is the
      last use of the pid.  For a thread that is not the thread group leader
      the threads pid will only ever have one user because a threads pid
      is not allowed to be the pid of a process, of a process group or
      a session.  For a thread that is a thread group leader all of
      the other threads of that process will be reaped before it is allowed
      for the thread group leader to be reaped ensuring there will only
      be one user of the threads pid as a process pid.  Furthermore
      because the thread is the init process of a pid namespace all of the
      other processes in the pid namespace will have also been already freed
      leading to the fact that the pid will not be used as a session pid or
      a process group pid for any other running process.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Tested-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Reported-by: default avatarSerge Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a48788c