1. 12 Jul, 2016 40 commits
    • Bill Sommerfeld's avatar
      udp6: fix UDP/IPv6 encap resubmit path · 355b7d7b
      Bill Sommerfeld authored
      [ Upstream commit 59dca1d8 ]
      
      IPv4 interprets a negative return value from a protocol handler as a
      request to redispatch to a new protocol.  In contrast, IPv6 interprets a
      negative value as an error, and interprets a positive value as a request
      for redispatch.
      
      UDP for IPv6 was unaware of this difference.  Change __udp6_lib_rcv() to
      return a positive value for redispatch.  Note that the socket's
      encap_rcv hook still needs to return a negative value to request
      dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to
      identify the byte containing the next protocol.
      Signed-off-by: default avatarBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      355b7d7b
    • Oliver Neukum's avatar
      usbnet: cleanup after bind() in probe() · 17c09e33
      Oliver Neukum authored
      [ Upstream commit 1666984c ]
      
      In case bind() works, but a later error forces bailing
      in probe() in error cases work and a timer may be scheduled.
      They must be killed. This fixes an error case related to
      the double free reported in
      http://www.spinics.net/lists/netdev/msg367669.html
      and needs to go on top of Linus' fix to cdc-ncm.
      Signed-off-by: default avatarOliver Neukum <ONeukum@suse.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      17c09e33
    • Bjørn Mork's avatar
      cdc_ncm: toggle altsetting to force reset before setup · cb7669cf
      Bjørn Mork authored
      [ Upstream commit 48906f62 ]
      
      Some devices will silently fail setup unless they are reset first.
      This is necessary even if the data interface is already in
      altsetting 0, which it will be when the device is probed for the
      first time.  Briefly toggling the altsetting forces a function
      reset regardless of the initial state.
      
      This fixes a setup problem observed on a number of Huawei devices,
      appearing to operate in NTB-32 mode even if we explicitly set them
      to NTB-16 mode.
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      cb7669cf
    • Florian Westphal's avatar
      ipv6: re-enable fragment header matching in ipv6_find_hdr · 51fc48e9
      Florian Westphal authored
      [ Upstream commit 5d150a98 ]
      
      When ipv6_find_hdr is used to find a fragment header
      (caller specifies target NEXTHDR_FRAGMENT) we erronously return
      -ENOENT for all fragments with nonzero offset.
      
      Before commit 9195bb8e, when target was specified, we did not
      enter the exthdr walk loop as nexthdr == target so this used to work.
      
      Now we do (so we can skip empty route headers). When we then stumble upon
      a frag with nonzero frag_off we must return -ENOENT ("header not found")
      only if the caller did not specifically request NEXTHDR_FRAGMENT.
      
      This allows nfables exthdr expression to match ipv6 fragments, e.g. via
      
      nft add rule ip6 filter input frag frag-off gt 0
      
      Fixes: 9195bb8e ("ipv6: improve ipv6_find_hdr() to skip empty routing headers")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      51fc48e9
    • Bjørn Mork's avatar
      qmi_wwan: add Sierra Wireless EM74xx device ID · 3a49f491
      Bjørn Mork authored
      [ Upstream commit bf13c94c ]
      
      The MC74xx and EM74xx modules use different IDs by default, according
      to the Lenovo EM7455 driver for Windows.
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3a49f491
    • Xin Long's avatar
      sctp: lack the check for ports in sctp_v6_cmp_addr · d8f67670
      Xin Long authored
      [ Upstream commit 40b4f0fd ]
      
      As the member .cmp_addr of sctp_af_inet6, sctp_v6_cmp_addr should also check
      the port of addresses, just like sctp_v4_cmp_addr, cause it's invoked by
      sctp_cmp_addr_exact().
      
      Now sctp_v6_cmp_addr just check the port when two addresses have different
      family, and lack the port check for two ipv6 addresses. that will make
      sctp_hash_cmp() cannot work well.
      
      so fix it by adding ports comparison in sctp_v6_cmp_addr().
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d8f67670
    • Stefan Wahren's avatar
      net: qca_spi: clear IFF_TX_SKB_SHARING · 3d30cee8
      Stefan Wahren authored
      [ Upstream commit a4690afe ]
      
      ether_setup sets IFF_TX_SKB_SHARING but this is not supported by
      qca_spi as it modifies the skb on xmit.
      Signed-off-by: default avatarStefan Wahren <stefan.wahren@i2se.com>
      Fixes: 291ab06e (net: qualcomm: new Ethernet over SPI driver for QCA7000)
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3d30cee8
    • Stefan Wahren's avatar
      net: qca_spi: Don't clear IFF_BROADCAST · c65c8e2a
      Stefan Wahren authored
      [ Upstream commit 2b70bad2 ]
      
      Currently qcaspi_netdev_setup accidentally clears IFF_BROADCAST.
      So fix this by keeping the flags from ether_setup.
      Reported-by: default avatarMichael Heimpold <michael.heimpold@i2se.com>
      Signed-off-by: default avatarStefan Wahren <stefan.wahren@i2se.com>
      Fixes: 291ab06e (net: qualcomm: new Ethernet over SPI driver for QCA7000)
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      c65c8e2a
    • Diego Viola's avatar
      net: jme: fix suspend/resume on JMC260 · b9628dd1
      Diego Viola authored
      [ Upstream commit ee50c130 ]
      
      The JMC260 network card fails to suspend/resume because the call to
      jme_start_irq() was too early, moving the call to jme_start_irq() after
      the call to jme_reset_link() makes it work.
      
      Prior this change suspend/resume would fail unless /sys/power/pm_async=0
      was explicitly specified.
      
      Relevant bug report: https://bugzilla.kernel.org/show_bug.cgi?id=112351Signed-off-by: default avatarDiego Viola <diego.viola@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      b9628dd1
    • Konstantin Khlebnikov's avatar
      tcp: convert cached rtt from usec to jiffies when feeding initial rto · 5c16a050
      Konstantin Khlebnikov authored
      [ Upstream commit 9bdfb3b7 ]
      
      Currently it's converted into msecs, thus HZ=1000 intact.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5c16a050
    • Alex Deucher's avatar
      drm/radeon: add a dpm quirk for all R7 370 parts · 742a7a14
      Alex Deucher authored
      [ Upstream commit 0e5585dc ]
      
      Higher mclk values are not stable due to a bug somewhere.
      Limit them for now.
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      742a7a14
    • Alex Deucher's avatar
      848ff9da
    • Daniel Vetter's avatar
      drm/udl: Use unlocked gem unreferencing · 71c879eb
      Daniel Vetter authored
      [ Upstream commit 72b9ff06 ]
      
      For drm_gem_object_unreference callers are required to hold
      dev->struct_mutex, which these paths don't. Enforcing this requirement
      has become a bit more strict with
      
      commit ef4c6270
      Author: Daniel Vetter <daniel.vetter@ffwll.ch>
      Date:   Thu Oct 15 09:36:25 2015 +0200
      
          drm/gem: Check locking in drm_gem_object_unreference
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      71c879eb
    • Xishi Qiu's avatar
      mm: fix invalid node in alloc_migrate_target() · 2539420b
      Xishi Qiu authored
      [ Upstream commit 6f25a14a ]
      
      It is incorrect to use next_node to find a target node, it will return
      MAX_NUMNODES or invalid node.  This will lead to crash in buddy system
      allocation.
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Signed-off-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Laura Abbott" <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      2539420b
    • Takashi Iwai's avatar
      ALSA: timer: Use mod_timer() for rearming the system timer · d4f95a01
      Takashi Iwai authored
      [ Upstream commit 4a07083e ]
      
      ALSA system timer backend stops the timer via del_timer() without sync
      and leaves del_timer_sync() at the close instead.  This is because of
      the restriction by the design of ALSA timer: namely, the stop callback
      may be called from the timer handler, and calling the sync shall lead
      to a hangup.  However, this also triggers a kernel BUG() when the
      timer is rearmed immediately after stopping without sync:
       kernel BUG at kernel/time/timer.c:966!
       Call Trace:
        <IRQ>
        [<ffffffff8239c94e>] snd_timer_s_start+0x13e/0x1a0
        [<ffffffff8239e1f4>] snd_timer_interrupt+0x504/0xec0
        [<ffffffff8122fca0>] ? debug_check_no_locks_freed+0x290/0x290
        [<ffffffff8239ec64>] snd_timer_s_function+0xb4/0x120
        [<ffffffff81296b72>] call_timer_fn+0x162/0x520
        [<ffffffff81296add>] ? call_timer_fn+0xcd/0x520
        [<ffffffff8239ebb0>] ? snd_timer_interrupt+0xec0/0xec0
        ....
      
      It's the place where add_timer() checks the pending timer.  It's clear
      that this may happen after the immediate restart without sync in our
      cases.
      
      So, the workaround here is just to use mod_timer() instead of
      add_timer().  This looks like a band-aid fix, but it's a right move,
      as snd_timer_interrupt() takes care of the continuous rearm of timer.
      Reported-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d4f95a01
    • Nicolai Stange's avatar
      PKCS#7: pkcs7_validate_trust(): initialize the _trusted output argument · 772935da
      Nicolai Stange authored
      [ Upstream commit e5435891 ]
      
      Despite what the DocBook comment to pkcs7_validate_trust() says, the
      *_trusted argument is never set to false.
      
      pkcs7_validate_trust() only positively sets *_trusted upon encountering
      a trusted PKCS#7 SignedInfo block.
      
      This is quite unfortunate since its callers, system_verify_data() for
      example, depend on pkcs7_validate_trust() clearing *_trusted on non-trust.
      
      Indeed, UBSAN splats when attempting to load the uninitialized local
      variable 'trusted' from system_verify_data() in pkcs7_validate_trust():
      
        UBSAN: Undefined behaviour in crypto/asymmetric_keys/pkcs7_trust.c:194:14
        load of value 82 is not a valid value for type '_Bool'
        [...]
        Call Trace:
          [<ffffffff818c4d35>] dump_stack+0xbc/0x117
          [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
          [<ffffffff8194113b>] ubsan_epilogue+0xd/0x4e
          [<ffffffff819419fa>] __ubsan_handle_load_invalid_value+0x111/0x158
          [<ffffffff819418e9>] ? val_to_string.constprop.12+0xcf/0xcf
          [<ffffffff818334a4>] ? x509_request_asymmetric_key+0x114/0x370
          [<ffffffff814b83f0>] ? kfree+0x220/0x370
          [<ffffffff818312c2>] ? public_key_verify_signature_2+0x32/0x50
          [<ffffffff81835e04>] pkcs7_validate_trust+0x524/0x5f0
          [<ffffffff813c391a>] system_verify_data+0xca/0x170
          [<ffffffff813c3850>] ? top_trace_array+0x9b/0x9b
          [<ffffffff81510b29>] ? __vfs_read+0x279/0x3d0
          [<ffffffff8129372f>] mod_verify_sig+0x1ff/0x290
          [...]
      
      The implication is that pkcs7_validate_trust() effectively grants trust
      when it really shouldn't have.
      
      Fix this by explicitly setting *_trusted to false at the very beginning
      of pkcs7_validate_trust().
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarNicolai Stange <nicstange@gmail.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      772935da
    • Guenter Roeck's avatar
      hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated · 83fb7b87
      Guenter Roeck authored
      [ Upstream commit 3c2e2266 ]
      
      arm:pxa_defconfig can result in the following crash if the max1111 driver
      is not instantiated.
      
      Unhandled fault: page domain fault (0x01b) at 0x00000000
      pgd = c0004000
      [00000000] *pgd=00000000
      Internal error: : 1b [#1] PREEMPT ARM
      Modules linked in:
      CPU: 0 PID: 300 Comm: kworker/0:1 Not tainted 4.5.0-01301-g1701f680 #10
      Hardware name: SHARP Akita
      Workqueue: events sharpsl_charge_toggle
      task: c390a000 ti: c391e000 task.ti: c391e000
      PC is at max1111_read_channel+0x20/0x30
      LR is at sharpsl_pm_pxa_read_max1111+0x2c/0x3c
      pc : [<c03aaab0>]    lr : [<c0024b50>]    psr: 20000013
      ...
      [<c03aaab0>] (max1111_read_channel) from [<c0024b50>]
      					(sharpsl_pm_pxa_read_max1111+0x2c/0x3c)
      [<c0024b50>] (sharpsl_pm_pxa_read_max1111) from [<c00262e0>]
      					(spitzpm_read_devdata+0x5c/0xc4)
      [<c00262e0>] (spitzpm_read_devdata) from [<c0024094>]
      					(sharpsl_check_battery_temp+0x78/0x110)
      [<c0024094>] (sharpsl_check_battery_temp) from [<c0024f9c>]
      					(sharpsl_charge_toggle+0x48/0x110)
      [<c0024f9c>] (sharpsl_charge_toggle) from [<c004429c>]
      					(process_one_work+0x14c/0x48c)
      [<c004429c>] (process_one_work) from [<c0044618>] (worker_thread+0x3c/0x5d4)
      [<c0044618>] (worker_thread) from [<c004a238>] (kthread+0xd0/0xec)
      [<c004a238>] (kthread) from [<c000a670>] (ret_from_fork+0x14/0x24)
      
      This can occur because the SPI controller driver (SPI_PXA2XX) is built as
      module and thus not necessarily loaded. While building SPI_PXA2XX into the
      kernel would make the problem disappear, it appears prudent to ensure that
      the driver is instantiated before accessing its data structures.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      83fb7b87
    • Asai Thambi SP's avatar
      mtip32xx: Fix broken service thread handling · 32827995
      Asai Thambi SP authored
      [ Upstream commit cfc05bd3 ]
      
      Service thread does not detect the need for taskfile error hanlding. Fixed the
      flag condition to process taskfile error.
      Signed-off-by: default avatarSelvan Mani <smani@micron.com>
      Signed-off-by: default avatarAsai Thambi S P <asamymuthupa@micron.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      32827995
    • Asai Thambi SP's avatar
      mtip32xx: Fix for rmmod crash when drive is in FTL rebuild · 59872e56
      Asai Thambi SP authored
      [ Upstream commit 59cf70e2 ]
      
      When FTL rebuild is in progress, alloc_disk() initializes the disk
      but device node will be created by add_disk() only after successful
      completion of FTL rebuild. So, skip deletion of device node in
      removal path when FTL rebuild is in progress.
      Signed-off-by: default avatarSelvan Mani <smani@micron.com>
      Signed-off-by: default avatarAsai Thambi S P <asamymuthupa@micron.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      59872e56
    • Sebastian Frias's avatar
      8250: use callbacks to access UART_DLL/UART_DLM · dcfd994d
      Sebastian Frias authored
      [ Upstream commit 0b41ce99 ]
      
      Some UART HW has a single register combining UART_DLL/UART_DLM
      (this was probably forgotten in the change that introduced the
      callbacks, commit b32b19b8)
      
      Fixes: b32b19b8 ("[SERIAL] 8250: set divisor register correctly ...")
      Signed-off-by: default avatarSebastian Frias <sf84@laposte.net>
      Reviewed-by: default avatarPeter Hurley <peter@hurleysoftware.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      dcfd994d
    • Grazvydas Ignotas's avatar
      HID: logitech: fix Dual Action gamepad support · 2b14a87b
      Grazvydas Ignotas authored
      [ Upstream commit 5d74325a ]
      
      The patch that added Logitech Dual Action gamepad support forgot to
      update the special driver list for the device. This caused the logitech
      driver not to probe unless kernel module load order was favorable.
      Update the special driver list to fix it. Thanks to Simon Wood for the
      idea.
      
      Cc: Vitaly Katraew <zawullon@gmail.com>
      Fixes: 56d0c8b7 ("HID: add support for Logitech Dual Action gamepads")
      Signed-off-by: default avatarGrazvydas Ignotas <notasas@gmail.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      2b14a87b
    • Vladis Dronov's avatar
      ALSA: usb-audio: Fix double-free in error paths after snd_usb_add_audio_stream() call · 9279da1e
      Vladis Dronov authored
      [ Upstream commit 836b34a9 ]
      
      create_fixed_stream_quirk(), snd_usb_parse_audio_interface() and
      create_uaxx_quirk() functions allocate the audioformat object by themselves
      and free it upon error before returning. However, once the object is linked
      to a stream, it's freed again in snd_usb_audio_pcm_free(), thus it'll be
      double-freed, eventually resulting in a memory corruption.
      
      This patch fixes these failures in the error paths by unlinking the audioformat
      object before freeing it.
      
      Based on a patch by Takashi Iwai <tiwai@suse.de>
      
      [Note for stable backports:
       this patch requires the commit 902eb7fd ('ALSA: usb-audio: Minor
       code cleanup in create_fixed_stream_quirk()')]
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1283358Reported-by: default avatarRalf Spenneberg <ralf@spenneberg.net>
      Cc: <stable@vger.kernel.org> # see the note above
      Signed-off-by: default avatarVladis Dronov <vdronov@redhat.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      9279da1e
    • Takashi Iwai's avatar
      ALSA: usb-audio: Minor code cleanup in create_fixed_stream_quirk() · 4aaf3322
      Takashi Iwai authored
      [ Upstream commit 902eb7fd ]
      
      Just a minor code cleanup: unify the error paths.
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      4aaf3322
    • Arnd Bergmann's avatar
      ASoC: samsung: pass DMA channels as pointers · 00f9443a
      Arnd Bergmann authored
      [ Upstream commit b9a1a743 ]
      
      ARM64 allmodconfig produces a bunch of warnings when building the
      samsung ASoC code:
      
      sound/soc/samsung/dmaengine.c: In function 'samsung_asoc_init_dma_data':
      sound/soc/samsung/dmaengine.c:53:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
         playback_data->filter_data = (void *)playback->channel;
      sound/soc/samsung/dmaengine.c:60:31: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
         capture_data->filter_data = (void *)capture->channel;
      
      We could easily shut up the warning by adding an intermediate cast,
      but there is a bigger underlying problem: The use of IORESOURCE_DMA
      to pass data from platform code to device drivers is dubious to start
      with, as what we really want is a pointer that can be passed into
      a filter function.
      
      Note that on s3c64xx, the pl08x DMA data is already a pointer, but
      gets cast to resource_size_t so we can pass it as a resource, and it
      then gets converted back to a pointer. In contrast, the data we pass
      for s3c24xx is an index into a device specific table, and we artificially
      convert that into a pointer for the filter function.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      00f9443a
    • Krzysztof Hałasa's avatar
      PCI: Allow a NULL "parent" pointer in pci_bus_assign_domain_nr() · 68b8c387
      Krzysztof Hałasa authored
      [ Upstream commit 54c6e2dd ]
      
      pci_create_root_bus() passes a "parent" pointer to
      pci_bus_assign_domain_nr().  When CONFIG_PCI_DOMAINS_GENERIC is defined,
      pci_bus_assign_domain_nr() dereferences that pointer.  Many callers of
      pci_create_root_bus() supply a NULL "parent" pointer, which leads to a NULL
      pointer dereference error.
      
      7c674700 ("PCI: Move domain assignment from arm64 to generic code")
      moved the "parent" dereference from arm64 to generic code.  Only arm64 used
      that code (because only arm64 defined CONFIG_PCI_DOMAINS_GENERIC), and it
      always supplied a valid "parent" pointer.  Other arches supplied NULL
      "parent" pointers but didn't defined CONFIG_PCI_DOMAINS_GENERIC, so they
      used a no-op version of pci_bus_assign_domain_nr().
      
      8c7d1474 ("ARM/PCI: Move to generic PCI domains") defined
      CONFIG_PCI_DOMAINS_GENERIC on ARM, and many ARM platforms use
      pci_common_init(), which supplies a NULL "parent" pointer.
      These platforms (cns3xxx, dove, footbridge, iop13xx, etc.) crash
      with a NULL pointer dereference like this while probing PCI:
      
        Unable to handle kernel NULL pointer dereference at virtual address 000000a4
        PC is at pci_bus_assign_domain_nr+0x10/0x84
        LR is at pci_create_root_bus+0x48/0x2e4
        Kernel panic - not syncing: Attempted to kill init!
      
      [bhelgaas: changelog, add "Reported:" and "Fixes:" tags]
      Reported: http://forum.doozan.com/read.php?2,17868,22070,quote=1
      Fixes: 8c7d1474 ("ARM/PCI: Move to generic PCI domains")
      Fixes: 7c674700 ("PCI: Move domain assignment from arm64 to generic code")
      Signed-off-by: default avatarKrzysztof Hałasa <khalasa@piap.pl>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      CC: stable@vger.kernel.org	# v4.0+
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      68b8c387
    • Lorenzo Pieralisi's avatar
      PCI: Move domain assignment from arm64 to generic code · 1e7429d4
      Lorenzo Pieralisi authored
      [ Upstream commit 7c674700 ]
      
      The current logic in arm64 pci_bus_assign_domain_nr() is flawed in that
      depending on the host controller configuration for a platform and the
      initialization sequence, core code may end up allocating PCI domain numbers
      from both DT and the generic domain counter, which would result in PCI
      domain allocation aliases/errors.
      
      Fix the logic behind the PCI domain number assignment and move the
      resulting code to the PCI core so the same domain allocation logic is used
      on all platforms that select CONFIG_PCI_DOMAINS_GENERIC.
      
      [bhelgaas: tidy changelog]
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarLiviu Dudau <Liviu.Dudau@arm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      CC: Rob Herring <robh+dt@kernel.org>
      CC: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      1e7429d4
    • Miklos Szeredi's avatar
      locks: use file_inode() · 870957bf
      Miklos Szeredi authored
      [ Upstream commit 6343a212 ]
      
      (Another one for the f_path debacle.)
      
      ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.
      
      The reason is that generic_add_lease() used filp->f_path.dentry->inode
      while all the others use file_inode().  This makes a difference for files
      opened on overlayfs since the former will point to the overlay inode the
      latter to the underlying inode.
      
      So generic_add_lease() added the lease to the overlay inode and
      generic_delete_lease() removed it from the underlying inode.  When the file
      was released the lease remained on the overlay inode's lock list, resulting
      in use after free.
      Reported-by: default avatarEryu Guan <eguan@redhat.com>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      870957bf
    • Andrey Ulanov's avatar
      namespace: update event counter when umounting a deleted dentry · 81c379f5
      Andrey Ulanov authored
      [ Upstream commit e06b933e ]
      
      - m_start() in fs/namespace.c expects that ns->event is incremented each
        time a mount added or removed from ns->list.
      - umount_tree() removes items from the list but does not increment event
        counter, expecting that it's done before the function is called.
      - There are some codepaths that call umount_tree() without updating
        "event" counter. e.g. from __detach_mounts().
      - When this happens m_start may reuse a cached mount structure that no
        longer belongs to ns->list (i.e. use after free which usually leads
        to infinite loop).
      
      This change fixes the above problem by incrementing global event counter
      before invoking umount_tree().
      
      Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAndrey Ulanov <andreyu@google.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      81c379f5
    • Trond Myklebust's avatar
      NFS: Fix another OPEN_DOWNGRADE bug · 35bec119
      Trond Myklebust authored
      [ Upstream commit e547f262 ]
      
      Olga Kornievskaia reports that the following test fails to trigger
      an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
      
      	fd0 = open(foo, RDRW)   -- should be open on the wire for "both"
      	fd1 = open(foo, RDONLY)  -- should be open on the wire for "read"
      	close(fd0) -- should trigger an open_downgrade
      	read(fd1)
      	close(fd1)
      
      The issue is that we're missing a check for whether or not the current
      state transitioned from an O_RDWR state as opposed to having transitioned
      from a combination of O_RDONLY and O_WRONLY.
      Reported-by: default avatarOlga Kornievskaia <aglo@umich.edu>
      Fixes: cd9288ff ("NFSv4: Fix another bug in the close/open_downgrade code")
      Cc: stable@vger.kernel.org # 2.6.33+
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      35bec119
    • Michael Holzheu's avatar
      Revert "s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL" · 12978664
      Michael Holzheu authored
      [ Upstream commit 5419447e ]
      
      This reverts commit 852ffd0f.
      
      There are use cases where an intermediate boot kernel (1) uses kexec
      to boot the final production kernel (2). For this scenario we should
      provide the original boot information to the production kernel (2).
      Therefore clearing the boot information during kexec() should not
      be done.
      
      Cc: stable@vger.kernel.org # v3.17+
      Reported-by: default avatarSteffen Maier <maier@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      12978664
    • Alexey Brodkin's avatar
      arc: unwind: warn only once if DW2_UNWIND is disabled · c59ed1ff
      Alexey Brodkin authored
      [ Upstream commit 9bd54517 ]
      
      If CONFIG_ARC_DW2_UNWIND is disabled every time arc_unwind_core()
      gets called following message gets printed in debug console:
      ----------------->8---------------
      CONFIG_ARC_DW2_UNWIND needs to be enabled
      ----------------->8---------------
      
      That message makes sense if user indeed wants to see a backtrace or
      get nice function call-graphs in perf but what if user disabled
      unwinder for the purpose? Why pollute his debug console?
      
      So instead we'll warn user about possibly missing feature once and
      let him decide if that was what he or she really wanted.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      c59ed1ff
    • Vineet Gupta's avatar
      ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame) · e4c542a8
      Vineet Gupta authored
      [ Upstream commit f52e126c ]
      
      With recent binutils update to support dwarf CFI pseudo-ops in gas, we
      now get .eh_frame vs. .debug_frame. Although the call frame info is
      exactly the same in both, the CIE differs, which the current kernel
      unwinder can't cope with.
      
      This broke both the kernel unwinder as well as loadable modules (latter
      because of a new unhandled relo R_ARC_32_PCREL from .rela.eh_frame in
      the module loader)
      
      The ideal solution would be to switch unwinder to .eh_frame.
      For now however we can make do by just ensureing .debug_frame is
      generated by removing -fasynchronous-unwind-tables
      
       .eh_frame    generated with -gdwarf-2 -fasynchronous-unwind-tables
       .debug_frame generated with -gdwarf-2
      
      Fixes STAR 9001058196
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e4c542a8
    • Alan Stern's avatar
      USB: don't free bandwidth_mutex too early · 0c0ad079
      Alan Stern authored
      [ Upstream commit ab2a4bf8 ]
      
      The USB core contains a bug that can show up when a USB-3 host
      controller is removed.  If the primary (USB-2) hcd structure is
      released before the shared (USB-3) hcd, the core will try to do a
      double-free of the common bandwidth_mutex.
      
      The problem was described in graphical form by Chung-Geol Kim, who
      first reported it:
      
      =================================================
           At *remove USB(3.0) Storage
           sequence <1> --> <5> ((Problem Case))
      =================================================
                                        VOLD
      ------------------------------------|------------
                                       (uevent)
                                  ________|_________
                                 |<1>               |
                                 |dwc3_otg_sm_work  |
                                 |usb_put_hcd       |
                                 |peer_hcd(kref=2)|
                                 |__________________|
                                  ________|_________
                                 |<2>               |
                                 |New USB BUS #2    |
                                 |                  |
                                 |peer_hcd(kref=1)  |
                                 |                  |
                               --(Link)-bandXX_mutex|
                               | |__________________|
                               |
          ___________________  |
         |<3>                | |
         |dwc3_otg_sm_work   | |
         |usb_put_hcd        | |
         |primary_hcd(kref=1)| |
         |___________________| |
          _________|_________  |
         |<4>                | |
         |New USB BUS #1     | |
         |hcd_release        | |
         |primary_hcd(kref=0)| |
         |                   | |
         |bandXX_mutex(free) |<-
         |___________________|
                                     (( VOLD ))
                                  ______|___________
                                 |<5>               |
                                 |      SCSI        |
                                 |usb_put_hcd       |
                                 |peer_hcd(kref=0)  |
                                 |*hcd_release      |
                                 |bandXX_mutex(free*)|<- double free
                                 |__________________|
      
      =================================================
      
      This happens because hcd_release() frees the bandwidth_mutex whenever
      it sees a primary hcd being released (which is not a very good idea
      in any case), but in the course of releasing the primary hcd, it
      changes the pointers in the shared hcd in such a way that the shared
      hcd will appear to be primary when it gets released.
      
      This patch fixes the problem by changing hcd_release() so that it
      deallocates the bandwidth_mutex only when the _last_ hcd structure
      referencing it is released.  The patch also removes an unnecessary
      test, so that when an hcd is released, both the shared_hcd and
      primary_hcd pointers in the hcd's peer will be cleared.
      Signed-off-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Reported-by: default avatarChung-Geol Kim <chunggeol.kim@samsung.com>
      Tested-by: default avatarChung-Geol Kim <chunggeol.kim@samsung.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      0c0ad079
    • Al Viro's avatar
      make nfs_atomic_open() call d_drop() on all ->open_context() errors. · e9393f71
      Al Viro authored
      [ Upstream commit d20cb71d ]
      
      In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
      unconditional d_drop() after the ->open_context() had been removed.  It had
      been correct for success cases (there ->open_context() itself had been doing
      dcache manipulations), but not for error ones.  Only one of those (ENOENT)
      got a compensatory d_drop() added in that commit, but in fact it should've
      been done for all errors.  As it is, the case of O_CREAT non-exclusive open
      on a hashed negative dentry racing with e.g. symlink creation from another
      client ended up with ->open_context() getting an error and proceeding to
      call nfs_lookup().  On a hashed dentry, which would've instantly triggered
      BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
      d_splice_alias()).
      
      Cc: stable@vger.kernel.org # v3.10+
      Tested-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e9393f71
    • James Morse's avatar
      KVM: arm/arm64: Stop leaking vcpu pid references · 6ded7184
      James Morse authored
      [ Upstream commit 591d215a ]
      
      kvm provides kvm_vcpu_uninit(), which amongst other things, releases the
      last reference to the struct pid of the task that was last running the vcpu.
      
      On arm64 built with CONFIG_DEBUG_KMEMLEAK, starting a guest with kvmtool,
      then killing it with SIGKILL results (after some considerable time) in:
      > cat /sys/kernel/debug/kmemleak
      > unreferenced object 0xffff80007d5ea080 (size 128):
      >  comm "lkvm", pid 2025, jiffies 4294942645 (age 1107.776s)
      >  hex dump (first 32 bytes):
      >    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      >    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      >  backtrace:
      >    [<ffff8000001b30ec>] create_object+0xfc/0x278
      >    [<ffff80000071da34>] kmemleak_alloc+0x34/0x70
      >    [<ffff80000019fa2c>] kmem_cache_alloc+0x16c/0x1d8
      >    [<ffff8000000d0474>] alloc_pid+0x34/0x4d0
      >    [<ffff8000000b5674>] copy_process.isra.6+0x79c/0x1338
      >    [<ffff8000000b633c>] _do_fork+0x74/0x320
      >    [<ffff8000000b66b0>] SyS_clone+0x18/0x20
      >    [<ffff800000085cb0>] el0_svc_naked+0x24/0x28
      >    [<ffffffffffffffff>] 0xffffffffffffffff
      
      On x86 kvm_vcpu_uninit() is called on the path from kvm_arch_destroy_vm(),
      on arm no equivalent call is made. Add the call to kvm_arch_vcpu_free().
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Fixes: 749cf76c ("KVM: ARM: Initial skeleton to compile KVM support")
      Cc: <stable@vger.kernel.org> # 3.10+
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      6ded7184
    • Cyril Bur's avatar
      powerpc/tm: Always reclaim in start_thread() for exec() class syscalls · 8d596e6a
      Cyril Bur authored
      [ Upstream commit 8e96a87c ]
      
      Userspace can quite legitimately perform an exec() syscall with a
      suspended transaction. exec() does not return to the old process, rather
      it load a new one and starts that, the expectation therefore is that the
      new process starts not in a transaction. Currently exec() is not treated
      any differently to any other syscall which creates problems.
      
      Firstly it could allow a new process to start with a suspended
      transaction for a binary that no longer exists. This means that the
      checkpointed state won't be valid and if the suspended transaction were
      ever to be resumed and subsequently aborted (a possibility which is
      exceedingly likely as exec()ing will likely doom the transaction) the
      new process will jump to invalid state.
      
      Secondly the incorrect attempt to keep the transactional state while
      still zeroing state for the new process creates at least two TM Bad
      Things. The first triggers on the rfid to return to userspace as
      start_thread() has given the new process a 'clean' MSR but the suspend
      will still be set in the hardware MSR. The second TM Bad Thing triggers
      in __switch_to() as the processor is still transactionally suspended but
      __switch_to() wants to zero the TM sprs for the new process.
      
      This is an example of the outcome of calling exec() with a suspended
      transaction. Note the first 700 is likely the first TM bad thing
      decsribed earlier only the kernel can't report it as we've loaded
      userspace registers. c000000000009980 is the rfid in
      fast_exception_return()
      
        Bad kernel stack pointer 3fffcfa1a370 at c000000000009980
        Oops: Bad kernel stack pointer, sig: 6 [#1]
        CPU: 0 PID: 2006 Comm: tm-execed Not tainted
        NIP: c000000000009980 LR: 0000000000000000 CTR: 0000000000000000
        REGS: c00000003ffefd40 TRAP: 0700   Not tainted
        MSR: 8000000300201031 <SF,ME,IR,DR,LE,TM[SE]>  CR: 00000000  XER: 00000000
        CFAR: c0000000000098b4 SOFTE: 0
        PACATMSCRATCH: b00000010000d033
        GPR00: 0000000000000000 00003fffcfa1a370 0000000000000000 0000000000000000
        GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR12: 00003fff966611c0 0000000000000000 0000000000000000 0000000000000000
        NIP [c000000000009980] fast_exception_return+0xb0/0xb8
        LR [0000000000000000]           (null)
        Call Trace:
        Instruction dump:
        f84d0278 e9a100d8 7c7b03a6 e84101a0 7c4ff120 e8410170 7c5a03a6 e8010070
        e8410080 e8610088 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed023b
      
        Kernel BUG at c000000000043e80 [verbose debug info unavailable]
        Unexpected TM Bad Thing exception at c000000000043e80 (msr 0x201033)
        Oops: Unrecoverable exception, sig: 6 [#2]
        CPU: 0 PID: 2006 Comm: tm-execed Tainted: G      D
        task: c0000000fbea6d80 ti: c00000003ffec000 task.ti: c0000000fb7ec000
        NIP: c000000000043e80 LR: c000000000015a24 CTR: 0000000000000000
        REGS: c00000003ffef7e0 TRAP: 0700   Tainted: G      D
        MSR: 8000000300201033 <SF,ME,IR,DR,RI,LE,TM[SE]>  CR: 28002828  XER: 00000000
        CFAR: c000000000015a20 SOFTE: 0
        PACATMSCRATCH: b00000010000d033
        GPR00: 0000000000000000 c00000003ffefa60 c000000000db5500 c0000000fbead000
        GPR04: 8000000300001033 2222222222222222 2222222222222222 00000000ff160000
        GPR08: 0000000000000000 800000010000d033 c0000000fb7e3ea0 c00000000fe00004
        GPR12: 0000000000002200 c00000000fe00000 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000000000000 0000000000000000 c0000000fbea7410 00000000ff160000
        GPR24: c0000000ffe1f600 c0000000fbea8700 c0000000fbea8700 c0000000fbead000
        GPR28: c000000000e20198 c0000000fbea6d80 c0000000fbeab680 c0000000fbea6d80
        NIP [c000000000043e80] tm_restore_sprs+0xc/0x1c
        LR [c000000000015a24] __switch_to+0x1f4/0x420
        Call Trace:
        Instruction dump:
        7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
        4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
      
      This fixes CVE-2016-5828.
      
      Fixes: bc2a9408 ("powerpc: Hook in new transactional memory code")
      Cc: stable@vger.kernel.org # v3.9+
      Signed-off-by: default avatarCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      8d596e6a
    • Torsten Hilbrich's avatar
      fs/nilfs2: fix potential underflow in call to crc32_le · e23042d0
      Torsten Hilbrich authored
      [ Upstream commit 63d2f95d ]
      
      The value `bytes' comes from the filesystem which is about to be
      mounted.  We cannot trust that the value is always in the range we
      expect it to be.
      
      Check its value before using it to calculate the length for the crc32_le
      call.  It value must be larger (or equal) sumoff + 4.
      
      This fixes a kernel bug when accidentially mounting an image file which
      had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
      The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
      s_bytes value of 1.  This caused an underflow when substracting sumoff +
      4 (20) in the call to crc32_le.
      
        BUG: unable to handle kernel paging request at ffff88021e600000
        IP:  crc32_le+0x36/0x100
        ...
        Call Trace:
          nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
          nilfs_load_super_block+0x142/0x300 [nilfs2]
          init_nilfs+0x60/0x390 [nilfs2]
          nilfs_mount+0x302/0x520 [nilfs2]
          mount_fs+0x38/0x160
          vfs_kern_mount+0x67/0x110
          do_mount+0x269/0xe00
          SyS_mount+0x9f/0x100
          entry_SYSCALL_64_fastpath+0x16/0x71
      
      Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jpSigned-off-by: default avatarTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: default avatarTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e23042d0
    • David Rientjes's avatar
      mm, compaction: abort free scanner if split fails · 3c76752f
      David Rientjes authored
      [ Upstream commit 284f69fb ]
      
      [ Upstream commit a4f04f2c ]
      
      If the memory compaction free scanner cannot successfully split a free
      page (only possible due to per-zone low watermark), terminate the free
      scanner rather than continuing to scan memory needlessly.  If the
      watermark is insufficient for a free page of order <= cc->order, then
      terminate the scanner since all future splits will also likely fail.
      
      This prevents the compaction freeing scanner from scanning all memory on
      very large zones (very noticeable for zones > 128GB, for instance) when
      all splits will likely fail while holding zone->lock.
      
      compaction_alloc() iterating a 128GB zone has been benchmarked to take
      over 400ms on some systems whereas any free page isolated and ready to
      be split ends up failing in split_free_page() because of the low
      watermark check and thus the iteration continues.
      
      The next time compaction occurs, the freeing scanner will likely start
      at the end of the zone again since no success was made previously and we
      get the same lengthy iteration until the zone is brought above the low
      watermark.  All thp page faults can take >400ms in such a state without
      this fix.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3c76752f
    • Vlastimil Babka's avatar
      mm, compaction: skip compound pages by order in free scanner · f1f702e8
      Vlastimil Babka authored
      [ Upstream commit 68385427 ]
      
      [ Upstream commit 9fcd6d2e ]
      
      The compaction free scanner is looking for PageBuddy() pages and
      skipping all others.  For large compound pages such as THP or hugetlbfs,
      we can save a lot of iterations if we skip them at once using their
      compound_order().  This is generally unsafe and we can read a bogus
      value of order due to a race, but if we are careful, the only danger is
      skipping too much.
      
      When tested with stress-highalloc from mmtests on 4GB system with 1GB
      hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
      least 15%.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f1f702e8
    • Lukasz Odzioba's avatar
      mm/swap.c: flush lru pvecs on compound page arrival · a2d8c514
      Lukasz Odzioba authored
      [ Upstream commit 8f182270 ]
      
      Currently we can have compound pages held on per cpu pagevecs, which
      leads to a lot of memory unavailable for reclaim when needed.  In the
      systems with hundreads of processors it can be GBs of memory.
      
      On of the way of reproducing the problem is to not call munmap
      explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
      that some pages (with THP enabled also huge pages) may end up on
      lru_add_pvec, example below.
      
        void main() {
        #pragma omp parallel
        {
      	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
      	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
      		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
      	if (p != MAP_FAILED)
      		memset(p, 0, size);
      	//munmap(p, size); // uncomment to make the problem go away
        }
        }
      
      When we run it with THP enabled it will leave significant amount of
      memory on lru_add_pvec.  This memory will be not reclaimed if we hit
      OOM, so when we run above program in a loop:
      
      	for i in `seq 100`; do ./a.out; done
      
      many processes (95% in my case) will be killed by OOM.
      
      The primary point of the LRU add cache is to save the zone lru_lock
      contention with a hope that more pages will belong to the same zone and
      so their addition can be batched.  The huge page is already a form of
      batched addition (it will add 512 worth of memory in one go) so skipping
      the batching seems like a safer option when compared to a potential
      excess in the caching which can be quite large and much harder to fix
      because lru_add_drain_all is way to expensive and it is not really clear
      what would be a good moment to call it.
      
      Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
      madvise(p, size, MADV_FREE); after memset.
      
      This patch flushes lru pvecs on compound page arrival making the problem
      less severe - after applying it kill rate of above example drops to 0%,
      due to reducing maximum amount of memory held on pvec from 28MB (with
      THP) to 56kB per CPU.
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Ming Li <mingli199x@qq.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      a2d8c514