1. 17 Nov, 2019 7 commits
  2. 16 Nov, 2019 26 commits
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ffaf037
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Misc fixes: a handful of AUX event handling related fixes, a Sparse
        fix and two ABI fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/core: Fix missing static inline on perf_cgroup_switch()
        perf/core: Consistently fail fork on allocation failures
        perf/aux: Disallow aux_output for kernel events
        perf/core: Reattach a misplaced comment
        perf/aux: Fix the aux_output group inheritance fix
        perf/core: Disallow uncore-cgroup events
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 8be636dd
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix memory leak in xfrm_state code, from Steffen Klassert.
      
       2) Fix races between devlink reload operations and device
          setup/cleanup, from Jiri Pirko.
      
       3) Null deref in NFC code, from Stephan Gerhold.
      
       4) Refcount fixes in SMC, from Ursula Braun.
      
       5) Memory leak in slcan open error paths, from Jouni Hogander.
      
       6) Fix ETS bandwidth validation in hns3, from Yonglong Liu.
      
       7) Info leak on short USB request answers in ax88172a driver, from
          Oliver Neukum.
      
       8) Release mem region properly in ep93xx_eth, from Chuhong Yuan.
      
       9) PTP config timestamp flags validation, from Richard Cochran.
      
      10) Dangling pointers after SKB data realloc in seg6, from Andrea Mayer.
      
      11) Missing free_netdev() in gemini driver, from Chuhong Yuan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
        ipmr: Fix skb headroom in ipmr_get_route().
        net: hns3: cleanup of stray struct hns3_link_mode_mapping
        net/smc: fix fastopen for non-blocking connect()
        rds: ib: update WR sizes when bringing up connection
        net: gemini: add missed free_netdev
        net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid
        seg6: fix skb transport_header after decap_and_validate()
        seg6: fix srh pointer in get_srh()
        net: stmmac: Use the correct style for SPDX License Identifier
        octeontx2-af: Use the correct style for SPDX License Identifier
        ptp: Extend the test program to check the external time stamp flags.
        mlx5: Reject requests to enable time stamping on both edges.
        igb: Reject requests that fail to enable time stamping on both edges.
        dp83640: Reject requests to enable time stamping on both edges.
        mv88e6xxx: Reject requests to enable time stamping on both edges.
        ptp: Introduce strict checking of external time stamp options.
        renesas: reject unsupported external timestamp flags
        mlx5: reject unsupported external timestamp flags
        igb: reject unsupported external timestamp flags
        dp83640: reject unsupported external timestamp flags
        ...
    • ipmr: Fix skb headroom in ipmr_get_route(). · 7901cd97
      Guillaume Nault authored
      In route.c, inet_rtm_getroute_build_skb() creates an skb with no
      headroom. This skb is then used by inet_rtm_getroute() which may pass
      it to rt_fill_info() and, from there, to ipmr_get_route(). The latter
      might try to reuse this skb by cloning it and prepending an IPv4
      header. But since the original skb has no headroom, skb_push() triggers
      skb_under_panic():
      
      skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
      ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:108!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      RIP: 0010:skb_panic+0xbf/0xd0
      Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
      RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
      RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
      RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
      RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
      R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
      R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
      FS:  00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_push+0x7e/0x80
       ipmr_get_route+0x459/0x6fa
       rt_fill_info+0x692/0x9f0
       inet_rtm_getroute+0xd26/0xf20
       rtnetlink_rcv_msg+0x45d/0x630
       netlink_rcv_skb+0x1a5/0x220
       rtnetlink_rcv+0x15/0x20
       netlink_unicast+0x305/0x3a0
       netlink_sendmsg+0x575/0x730
       sock_sendmsg+0xb5/0xc0
       ___sys_sendmsg+0x497/0x4f0
       __sys_sendmsg+0xcb/0x150
       __x64_sys_sendmsg+0x48/0x50
       do_syscall_64+0xd2/0xac0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Actually the original skb used to have enough headroom, but the
      skb_reserve() call was lost with the introduction of
      inet_rtm_getroute_build_skb() by commit 404eb77e ("ipv4: support
      sport, dport and ip_proto in RTM_GETROUTE").
      
      We could reserve some headroom again in inet_rtm_getroute_build_skb(),
      but this function shouldn't be responsible for handling the special
      case of ipmr_get_route(). Let's handle that directly in
      ipmr_get_route() by calling skb_realloc_headroom() instead of
      skb_clone().
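
A minimal user-space sketch of the headroom problem, under stated assumptions: struct model_skb, model_push() and model_realloc_headroom() are hypothetical stand-ins invented for illustration, not the kernel's sk_buff API. The sketch only shows why prepending a 20-byte header fails on a buffer without headroom and works on a copy that reserves it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced model of an skb: one allocation, data after headroom. */
struct model_skb {
    unsigned char *head;   /* start of the allocation */
    size_t headroom;       /* free bytes between head and the data */
    size_t len;            /* bytes of data starting at head + headroom */
};

/*
 * Prepend n bytes. Returns the new start of data, or NULL when the
 * headroom is too small (the point where the kernel would panic).
 */
static unsigned char *model_push(struct model_skb *skb, size_t n)
{
    assert(skb != NULL && skb->head != NULL);
    if (skb->headroom < n)
        return NULL;
    skb->headroom -= n;
    skb->len += n;
    return skb->head + skb->headroom;
}

/*
 * Copy src into a fresh buffer that has `headroom` free bytes in front
 * of the data. Returns 0 on success, -1 on allocation failure.
 */
static int model_realloc_headroom(const struct model_skb *src,
                                  size_t headroom, struct model_skb *dst)
{
    assert(src != NULL && dst != NULL);
    dst->head = malloc(headroom + src->len);
    if (!dst->head)
        return -1;
    memcpy(dst->head + headroom, src->head + src->headroom, src->len);
    dst->headroom = headroom;
    dst->len = src->len;
    return 0;
}
```

With these definitions, model_push() on a zero-headroom buffer returns NULL, while a copy made with model_realloc_headroom(&orig, 20, &copy) accepts the same 20-byte push. The caller owns copy.head and must free() it.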
      
      Fixes: 404eb77e ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: cleanup of stray struct hns3_link_mode_mapping · b696083d
      Salil Mehta authored
      This patch cleans up stray leftover code. It has no functional
      impact.
      Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix fastopen for non-blocking connect() · 8204df72
      Ursula Braun authored
      FASTOPEN does not work with SMC-sockets. Since SMC allows fallback to
      TCP native during connection start, the FASTOPEN setsockopts trigger
      this fallback, if the SMC-socket is still in state SMC_INIT.
      But if a FASTOPEN setsockopt is called after a non-blocking connect(),
      this is broken, and fallback does not make sense.
      This change complements
      commit cd206360 ("net/smc: avoid fallback in case of non-blocking connect")
      and fixes the syzbot reported problem "WARNING in smc_unhash_sk".
      
      Reported-by: syzbot+8488cc4cf1c9e09b8b86@syzkaller.appspotmail.com
      Fixes: e1bbdd57 ("net/smc: reduce sock_put() for fallback sockets")
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: ib: update WR sizes when bringing up connection · a36e629e
      Dag Moxnes authored
      Currently WR sizes are updated from rds_ib_sysctl_max_send_wr and
      rds_ib_sysctl_max_recv_wr when a connection is shut down. As a result,
      a connection being down while rds_ib_sysctl_max_send_wr or
      rds_ib_sysctl_max_recv_wr are updated, will not update the sizes when
      it comes back up.
      
      Move resizing of WRs to rds_ib_setup_qp() so that connections will be set up
      with the most current WR sizes.
      Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gemini: add missed free_netdev · 18d647ae
      Chuhong Yuan authored
      This driver forgets to free the allocated netdev in its remove path,
      unlike what is done on probe failure.
      Add the free to fix it.
      Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid · c80ed84e
      Vladimir Oltean authored
      This sequence of operations:
      ip link set dev br0 type bridge vlan_filtering 1
      bridge vlan del dev swp2 vid 1
      ip link set dev br0 type bridge vlan_filtering 1
      ip link set dev br0 type bridge vlan_filtering 0
      
      apparently fails with the message:
      
      [   31.305716] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
      [   31.322161] sja1105 spi0.1: Couldn't determine PVID attributes (pvid 0)
      [   31.328939] sja1105 spi0.1: Failed to setup VLAN tagging for port 1: -2
      [   31.335599] ------------[ cut here ]------------
      [   31.340215] WARNING: CPU: 1 PID: 194 at net/switchdev/switchdev.c:157 switchdev_port_attr_set_now+0x9c/0xa4
      [   31.349981] br0: Commit of attribute (id=6) failed.
      [   31.354890] Modules linked in:
      [   31.357942] CPU: 1 PID: 194 Comm: ip Not tainted 5.4.0-rc6-01792-gf4f632e07665-dirty #2062
      [   31.366167] Hardware name: Freescale LS1021A
      [   31.370437] [<c03144dc>] (unwind_backtrace) from [<c030e184>] (show_stack+0x10/0x14)
      [   31.378153] [<c030e184>] (show_stack) from [<c11d1c1c>] (dump_stack+0xe0/0x10c)
      [   31.385437] [<c11d1c1c>] (dump_stack) from [<c034c730>] (__warn+0xf4/0x10c)
      [   31.392373] [<c034c730>] (__warn) from [<c034c7bc>] (warn_slowpath_fmt+0x74/0xb8)
      [   31.399827] [<c034c7bc>] (warn_slowpath_fmt) from [<c11ca204>] (switchdev_port_attr_set_now+0x9c/0xa4)
      [   31.409097] [<c11ca204>] (switchdev_port_attr_set_now) from [<c117036c>] (__br_vlan_filter_toggle+0x6c/0x118)
      [   31.418971] [<c117036c>] (__br_vlan_filter_toggle) from [<c115d010>] (br_changelink+0xf8/0x518)
      [   31.427637] [<c115d010>] (br_changelink) from [<c0f8e9ec>] (__rtnl_newlink+0x3f4/0x76c)
      [   31.435613] [<c0f8e9ec>] (__rtnl_newlink) from [<c0f8eda8>] (rtnl_newlink+0x44/0x60)
      [   31.443329] [<c0f8eda8>] (rtnl_newlink) from [<c0f89f20>] (rtnetlink_rcv_msg+0x2cc/0x51c)
      [   31.451477] [<c0f89f20>] (rtnetlink_rcv_msg) from [<c1008df8>] (netlink_rcv_skb+0xb8/0x110)
      [   31.459796] [<c1008df8>] (netlink_rcv_skb) from [<c1008648>] (netlink_unicast+0x17c/0x1f8)
      [   31.468026] [<c1008648>] (netlink_unicast) from [<c1008980>] (netlink_sendmsg+0x2bc/0x3b4)
      [   31.476261] [<c1008980>] (netlink_sendmsg) from [<c0f43858>] (___sys_sendmsg+0x230/0x250)
      [   31.484408] [<c0f43858>] (___sys_sendmsg) from [<c0f44c84>] (__sys_sendmsg+0x50/0x8c)
      [   31.492209] [<c0f44c84>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x28)
      [   31.500090] Exception stack(0xedf47fa8 to 0xedf47ff0)
      [   31.505122] 7fa0:                   00000002 b6f2e060 00000003 beabd6a4 00000000 00000000
      [   31.513265] 7fc0: 00000002 b6f2e060 5d6e3213 00000128 00000000 00000001 00000006 000619c4
      [   31.521405] 7fe0: 00086078 beabd658 0005edbc b6e7ce68
      
      The reason is the implementation of br_get_pvid:
      
      static inline u16 br_get_pvid(const struct net_bridge_vlan_group *vg)
      {
      	if (!vg)
      		return 0;
      
      	smp_rmb();
      	return vg->pvid;
      }
      
      Since VID 0 is an invalid pvid from the bridge's point of view, let's
      add this check in dsa_8021q_restore_pvid to avoid restoring a pvid that
      doesn't really exist.
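
The check described above can be sketched in user space as follows. This is a model under stated assumptions, not the driver's code: struct model_vlan_group, model_get_pvid() and model_restore_pvid() are hypothetical names, and the return convention (1 = restored, 0 = skipped) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct net_bridge_vlan_group. */
struct model_vlan_group {
    uint16_t pvid;   /* 0 means "no pvid configured" */
};

/* Same shape as the br_get_pvid() quoted above, minus the memory barrier. */
static uint16_t model_get_pvid(const struct model_vlan_group *vg)
{
    if (!vg)
        return 0;
    return vg->pvid;
}

/*
 * Assumed shape of the added check: skip the restore when the bridge
 * reports pvid 0. Returns 1 if a pvid was written to *restored,
 * 0 if the restore was skipped.
 */
static int model_restore_pvid(const struct model_vlan_group *vg,
                              uint16_t *restored)
{
    uint16_t pvid = model_get_pvid(vg);

    assert(restored != NULL);
    if (!pvid)
        return 0;   /* VID 0 is not a valid pvid: nothing to restore */
    *restored = pvid;
    return 1;
}
```

Both a missing VLAN group and a group whose pvid was deleted yield 0 here, so the caller leaves its state untouched instead of trying to look up attributes for VID 0.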
      
      Fixes: 5f33183b ("net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'seg6-fixes-to-Segment-Routing-in-IPv6' · e84fa0ae
      David S. Miller authored
      Andrea Mayer says:
      
      ====================
      seg6: fixes to Segment Routing in IPv6
      
      This patchset is divided in 2 patches and it introduces some fixes
      to Segment Routing in IPv6, which are:
      
      - in function get_srh() fix the srh pointer after calling
        pskb_may_pull();
      
      - fix the skb->transport_header after calling decap_and_validate()
        function;
      
      Any comments on the patchset are welcome.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • seg6: fix skb transport_header after decap_and_validate() · c71644d0
      Andrea Mayer authored
      In the receive path (more precisely in ip6_rcv_core()) the
      skb->transport_header is set to skb->network_header + sizeof(*hdr). As a
      consequence, after routing operations, destination input expects to find
      skb->transport_header correctly set to the next protocol (or extension
      header) that follows the network protocol. However, decap behaviors (DX*,
      DT*) remove the outer IPv6 and SRH extension and do not set again the
      skb->transport_header pointer correctly. For this reason, the patch sets
      the skb->transport_header to the skb->network_header + sizeof(*hdr) in each
      DX* and DT* behavior.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • seg6: fix srh pointer in get_srh() · 7f91ed8c
      Andrea Mayer authored
      pskb_may_pull() may change pointers into the skb header. For this reason, it is
      mandatory to reload any pointer that points into skb header.
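
The reload rule can be illustrated with a small user-space model. pskb_may_pull() itself is not called here: struct model_buf, model_may_pull() and model_get_hdr() are hypothetical stand-ins, and the model deliberately moves the buffer every time it has to grow so that a stale pointer would be visibly wrong.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a linear packet buffer. */
struct model_buf {
    unsigned char *data;
    size_t len;
};

/*
 * Stand-in for a "make sure `need` bytes are available" helper.
 * When the buffer must grow it is moved to a new allocation, so any
 * pointer computed from the old b->data becomes invalid.
 * Returns 1 on success, 0 on allocation failure.
 */
static int model_may_pull(struct model_buf *b, size_t need)
{
    unsigned char *moved;

    assert(b != NULL && b->data != NULL);
    if (need <= b->len)
        return 1;
    moved = calloc(1, need);
    if (!moved)
        return 0;
    memcpy(moved, b->data, b->len);
    free(b->data);
    b->data = moved;
    b->len = need;
    return 1;
}

/*
 * Return a pointer to a header of hdr_len bytes at offset off.
 * The pointer is derived from b->data only after the pull.
 */
static unsigned char *model_get_hdr(struct model_buf *b, size_t off,
                                    size_t hdr_len)
{
    if (!model_may_pull(b, off + hdr_len))
        return NULL;
    return b->data + off;   /* reloaded after the pull, never cached before it */
}
```

A caller that computed b->data + off before calling model_may_pull() would be left holding a pointer into freed memory whenever the buffer moved, which is the class of bug the patch addresses.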
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: Use the correct style for SPDX License Identifier · acb9bdc1
      Nishad Kamdar authored
      This patch corrects the SPDX License Identifier style in
      header files related to the STMicroelectronics-based Multi-Gigabit
      Ethernet driver. For C header files, Documentation/process/license-rules.rst
      mandates C-like comments (as opposed to C source files, where
      C++ style should be used).

      Changes made by using a script provided by Joe Perches here:
      https://lkml.org/lkml/2019/2/7/46.
      Suggested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-af: Use the correct style for SPDX License Identifier · 26b3f3cc
      Nishad Kamdar authored
      This patch corrects the SPDX License Identifier style in
      header files related to Marvell OcteonTX2 network devices.
      It uses an explicit block comment for the SPDX License
      Identifier.

      Changes made by using a script provided by Joe Perches here:
      https://lkml.org/lkml/2019/2/7/46.
      Suggested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'akpm' (patches from Andrew) · bec8b6e9
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "11 fixes"
      
      MM fixes and one xz decompressor fix.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/debug.c: PageAnon() is true for PageKsm() pages
        mm/debug.c: __dump_page() prints an extra line
        mm/page_io.c: do not free shared swap slots
        mm/memory_hotplug: fix try_offline_node()
        mm,thp: recheck each page before collapsing file THP
        mm: slub: really fix slab walking for init_on_free
        mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
        mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
        lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocations
        mm: fix trying to reclaim unevictable lru page when calling madvise_pageout
        mm: mempolicy: fix the wrong return value and potential pages leak of mbind
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 6c9594bd
      Linus Torvalds authored
      Pull more input fixes from Dmitry Torokhov:
       "A couple of fixes in driver teardown paths and another ID for
        Synaptics RMI mode"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: synaptics - enable RMI mode for X1 Extreme 2nd Generation
        Input: synaptics-rmi4 - destroy F54 poller workqueue when removing
        Input: ff-memless - kill timer in destroy()
    • mm/debug.c: PageAnon() is true for PageKsm() pages · 6855ac4a
      Ralph Campbell authored
      PageAnon() and PageKsm() use the low two bits of the page->mapping
      pointer to indicate the page type.  PageAnon() only checks the LSB while
      PageKsm() checks that the two least significant bits are equal to 3.
      
      Therefore, PageAnon() is true for KSM pages.  __dump_page() incorrectly
      will never print "ksm" because it checks PageAnon() first.  Fix this by
      checking PageKsm() first.
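
A small user-space model of the bit tests makes the ordering issue concrete. The MODEL_* constants and model_* functions are hypothetical names chosen for illustration; only the bit layout (LSB for anon, both low bits for KSM) is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the two flag bits kept in the low bits of page->mapping. */
#define MODEL_MAPPING_ANON    0x1UL
#define MODEL_MAPPING_MOVABLE 0x2UL
#define MODEL_MAPPING_KSM     (MODEL_MAPPING_ANON | MODEL_MAPPING_MOVABLE)
#define MODEL_MAPPING_FLAGS   (MODEL_MAPPING_ANON | MODEL_MAPPING_MOVABLE)

static_assert(MODEL_MAPPING_KSM == 3, "KSM is encoded as both low bits set");

/* Anon test: looks at the least significant bit only. */
static int model_page_anon(uintptr_t mapping)
{
    return (mapping & MODEL_MAPPING_ANON) != 0;
}

/* KSM test: both low bits must be set. */
static int model_page_ksm(uintptr_t mapping)
{
    return (mapping & MODEL_MAPPING_FLAGS) == MODEL_MAPPING_KSM;
}

/*
 * Classify in the corrected order: KSM first, then anon.
 * Testing anon first would label every KSM mapping as "anon".
 */
static const char *model_page_type(uintptr_t mapping)
{
    if (model_page_ksm(mapping))
        return "ksm";
    if (model_page_anon(mapping))
        return "anon";
    return "other";
}
```

For a mapping value whose low bits are 3, model_page_anon() is also true, so the more specific KSM test has to come first for "ksm" to ever be reported.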
      
      Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
      Fixes: 1c6fb1d8 ("mm: print more information about mapping in __dump_page")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/debug.c: __dump_page() prints an extra line · 76a1850e
      Ralph Campbell authored
      When dumping struct page information, __dump_page() prints the page type
      with a trailing blank followed by the page flags on a separate line:
      
        anon
        flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      It looks like the intent was to use pr_cont() for printing "flags:" but
      pr_cont() usage is discouraged so fix this by extending the format to
      include the flags into a single line:
      
        anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      If the page is file backed, the name might be long so use two lines:
      
        shmem_aops name:"dev/zero"
        flags: 0x10000000008000c(uptodate|dirty|swapbacked)
      
      Eliminate pr_cont() usage as well for appending compound_mapcount.

      Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_io.c: do not free shared swap slots · 5df373e9
      Vinayak Menon authored
      The following race has been observed, in which a process faulting on a
      swap entry finds the page in neither the swap cache nor swap.  This causes
      zram to give a zero-filled page that gets mapped to the process,
      resulting in a user space crash later.
      
      Consider parent and child processes Pa and Pb sharing the same swap slot
      with swap_count 2.  Swap is on zram with SWP_SYNCHRONOUS_IO set.
      Virtual address 'VA' of Pa and Pb points to the shared swap entry.
      
      Pa                                       Pb
      
      fault on VA                              fault on VA
      do_swap_page                             do_swap_page
      lookup_swap_cache fails                  lookup_swap_cache fails
                                               Pb scheduled out
      swapin_readahead (deletes zram entry)
      swap_free (makes swap_count 1)
                                               Pb scheduled in
                                               swap_readpage (swap_count == 1)
                                               Takes SWP_SYNCHRONOUS_IO path
                                               zram entry absent
                                               zram gives a zero filled page
      
      Fix this by making sure that swap slot is freed only when swap count
      drops down to one.
      
      Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
      Fixes: aa8d22a1 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Suggested-by: Minchan Kim <minchan@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: fix try_offline_node() · 2c91f8fc
      David Hildenbrand authored
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was never onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggyback on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      associated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,thp: recheck each page before collapsing file THP · 4655e5e5
      Song Liu authored
      In collapse_file(), for !is_shmem case, current check cannot guarantee
      the locked page is up-to-date.  Specifically, xas_unlock_irq() should
      not be called before lock_page() and get_page(); and it is necessary to
      recheck PageUptodate() after locking the page.
      
      With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
      may contain corrupted data.  This is because khugepaged mistakenly
      collapses some not up-to-date sub pages into a huge page, and assumes
      the huge page is up-to-date.  This will NOT corrupt data in the disk,
      because the page is read-only and never written back.  Fix this by
      properly checking PageUptodate() after locking the page.  This check
      replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
      
      Also, move PageDirty() check after locking the page.  Current khugepaged
      should not try to collapse dirty file THP, because it is limited to
      read-only .text.  The only case we hit a dirty page here is when the
      page hasn't been written back since write.  Bail out and retry when this
      happens.

      syzbot reported a bug on a previous version of this patch.
      
      Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slub: really fix slab walking for init_on_free · aea4df4c
      Laura Abbott authored
      Commit 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      fixed one problem with the slab walking but missed a key detail: When
      walking the list, the head and tail pointers need to be updated since we
      end up reversing the list as a result.  Without doing this, bulk free is
      broken.
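
The head/tail bookkeeping can be sketched with a plain singly linked list. This is a user-space model, not the SLUB code: struct model_obj and model_free_hook() are hypothetical names, and the "poisoned" flag merely stands in for the per-object work done while walking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an object on a bulk-free list. */
struct model_obj {
    struct model_obj *next;
    int poisoned;            /* stand-in for per-object init_on_free work */
};

/*
 * Walk the list from *head to *tail, processing each object and
 * rebuilding the list as we go. Rebuilding by pushing at the front
 * reverses the order, so both *head and *tail must be updated:
 * the old head becomes the new tail and the old tail the new head.
 */
static void model_free_hook(struct model_obj **head, struct model_obj **tail)
{
    struct model_obj *object;
    struct model_obj *next;
    struct model_obj *old_tail;

    assert(head != NULL && tail != NULL && *head != NULL && *tail != NULL);

    next = *head;
    old_tail = *tail;
    *head = NULL;
    *tail = NULL;

    do {
        object = next;
        next = object->next;

        object->poisoned = 1;

        object->next = *head;    /* push front: reverses the list */
        *head = object;
        if (*tail == NULL)
            *tail = object;      /* first object processed ends up last */
    } while (object != old_tail);
}
```

If the caller kept using its original head and tail after such a walk, the range between them would no longer describe the list, which is the kind of breakage in bulk free that the commit message refers to.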
      
      One way this is exposed is a NULL pointer with slub_debug=F:
      
        =============================================================================
        BUG skbuff_head_cache (Tainted: G                T): Object already free
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        Oops: 0000 [#1] PREEMPT SMP PTI
        RIP: 0010:print_trailer+0x70/0x1d5
        Call Trace:
         <IRQ>
         free_debug_processing.cold.37+0xc9/0x149
         __slab_free+0x22a/0x3d0
         kmem_cache_free_bulk+0x415/0x420
         __kfree_skb_flush+0x30/0x40
         net_rx_action+0x2dd/0x480
         __do_softirq+0xf0/0x246
         irq_exit+0x93/0xb0
         do_IRQ+0xa0/0x110
         common_interrupt+0xf/0xf
         </IRQ>
      
      Given we're now almost identical to the existing debugging code which
      correctly walks the list, combine with that.
      
      Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
      Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
      Fixes: 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <clipos@ssi.gouv.fr>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aea4df4c
    • Roman Gushchin's avatar
      mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() · 0362f326
      Roman Gushchin authored
      An exiting task might belong to an offline cgroup.  In this case an
      attempt to grab a cgroup reference from the task can end up with an
      infinite loop in hugetlb_cgroup_charge_cgroup(), because the cgroup
      will never come back online and the task will not be migrated to a
      live cgroup.
      
      Fix this by switching over to css_tryget().  As css_tryget_online()
      can't guarantee that the cgroup won't go offline, in most cases the
      check doesn't make sense.  In this particular case users of
      hugetlb_cgroup_charge_cgroup() are not affected by this change.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
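
      The difference between the two calls can be modelled in userspace.
      The sketch below is an assumption-laden illustration, not the kernel
      API: struct model_css, model_tryget(), model_tryget_online() and
      acquire_with_retries() are hypothetical names, and the retry bound
      exists only so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a cgroup subsystem state. */
struct model_css {
    int refcnt;     /* > 0 while the object can still be referenced */
    bool online;    /* false once the cgroup has been taken offline */
};

/* Models css_tryget(): succeeds as long as the object is still alive. */
static bool model_tryget(struct model_css *css)
{
    if (css->refcnt <= 0)
        return false;
    css->refcnt++;
    return true;
}

/* Models css_tryget_online(): additionally requires the cgroup to be online. */
static bool model_tryget_online(struct model_css *css)
{
    if (!css->online)
        return false;
    return model_tryget(css);
}

/*
 * Retry loop in the style described above. The loop in the commit message
 * has no bound; max_tries is an addition for this sketch. Returns the
 * number of attempts used, or -1 if the bound was reached.
 */
static int acquire_with_retries(struct model_css *css,
                                bool (*tryget)(struct model_css *),
                                int max_tries)
{
    assert(max_tries > 0);
    for (int i = 1; i <= max_tries; i++) {
        if (tryget(css))
            return i;
    }
    return -1;
}
```

      With an offline but still referenced object, the online variant never
      succeeds no matter how often it is retried, while the plain variant
      succeeds on the first attempt.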
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0362f326
    • Roman Gushchin's avatar
      mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · 00d484f3
      Roman Gushchin authored
      We've encountered an RCU stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We iterate over and over in the do {} while
      (!css_tryget_online()) loop, but the memcg will never come back online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except in some rare cases, for example
      when it determines whether something should be presented to a user.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeeb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00d484f3
    • Lasse Collin's avatar
      lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocations · 8e20ba2e
      Lasse Collin authored
      s->dict.allocated was initialized to 0 but never set after a successful
      allocation, so the code always assumed that the dictionary buffer had
      to be reallocated.
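
      The pattern can be sketched in userspace as follows.  This is only an
      illustration under stated assumptions, not the lib/xz code: struct
      dict_buf, dict_reserve() and the reallocs counter are hypothetical,
      and malloc()/free() stand in for whatever allocator the decoder uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical dictionary buffer, loosely modelled on the description above. */
struct dict_buf {
    unsigned char *buf;
    size_t allocated;   /* size of buf in bytes, 0 if nothing is allocated */
    unsigned reallocs;  /* instrumentation for this sketch only */
};

/*
 * Make sure the buffer holds at least `need` bytes.
 * Returns 0 on success and -1 on allocation failure.
 */
static int dict_reserve(struct dict_buf *d, size_t need)
{
    assert(d != NULL);

    if (d->allocated >= need)
        return 0;               /* existing buffer is large enough */

    free(d->buf);
    d->allocated = 0;
    d->buf = malloc(need);
    if (d->buf == NULL)
        return -1;

    /* Recording the new size is the step the commit describes as missing. */
    d->allocated = need;
    d->reallocs++;
    return 0;
}
```

      Without the assignment to allocated, the size check at the top could
      never pass for a non-zero request, and every call would free and
      allocate the buffer again.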
      
      Link: http://lkml.kernel.org/r/20191104185107.3b6330df@tukaani.org
      Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
      Reported-by: Yu Sun <yusun2@cisco.com>
      Acked-by: Daniel Walker <danielwa@cisco.com>
      Cc: "Yixia Si (yisi)" <yisi@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e20ba2e
    • zhong jiang's avatar
      mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 82072962
      zhong jiang authored
      Recently, I hit the following issue when running an upstream kernel:
      
        kernel BUG at mm/vmscan.c:1521!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
        RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
        Call Trace:
         reclaim_pages+0x499/0x800 mm/vmscan.c:2188
         madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
         walk_pmd_range mm/pagewalk.c:53 [inline]
         walk_pud_range mm/pagewalk.c:112 [inline]
         walk_p4d_range mm/pagewalk.c:139 [inline]
         walk_pgd_range mm/pagewalk.c:166 [inline]
         __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
         walk_page_range+0x179/0x310 mm/pagewalk.c:349
         madvise_pageout_page_range mm/madvise.c:506 [inline]
         madvise_pageout+0x1f0/0x330 mm/madvise.c:542
         madvise_vma mm/madvise.c:931 [inline]
         __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
         do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      madvise_pageout() walks the specified range of the vma and isolates its
      pages, then runs shrink_page_list() to reclaim their memory.  But it
      also isolates unevictable pages for reclaim, which is what triggers the
      BUG in shrink_page_list().

      The root cause is that we scan the page tables instead of a specific
      LRU list, so we need to filter out the unevictable LRU pages on our
      end.
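
      The filtering idea can be sketched in userspace.  The names below
      (struct model_page, pick_reclaim_candidates()) are hypothetical and
      the two flags are a simplification; the sketch only shows that a
      walker which does not start from an evictable LRU list has to skip
      unevictable pages itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page state for illustration only. */
struct model_page {
    bool on_lru;        /* page is on some LRU list */
    bool unevictable;   /* page must not be handed to reclaim */
};

/*
 * Walk an arbitrary array of pages, as a page-table walk would, and store
 * the indices of pages that may be reclaimed in `out`. Pages that are not
 * on an LRU list or that are unevictable are skipped.
 * Returns the number of indices stored.
 */
static size_t pick_reclaim_candidates(const struct model_page *pages, size_t n,
                                      size_t *out, size_t out_cap)
{
    size_t picked = 0;

    assert(pages != NULL && out != NULL);
    for (size_t i = 0; i < n && picked < out_cap; i++) {
        if (!pages[i].on_lru)
            continue;
        if (pages[i].unevictable)
            continue;   /* the filter the walker has to apply itself */
        out[picked++] = i;
    }
    return picked;
}
```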
      
      Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82072962
    • Yang Shi's avatar
      mm: mempolicy: fix the wrong return value and potential pages leak of mbind · a85dfc30
      Yang Shi authored
      Commit d8835445 ("mm: mempolicy: make the behavior consistent when
      MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
      of mbind() for a couple of corner cases.  But it altered the errno for
      some other cases.  For example, mbind() should return -EFAULT when part
      or all of the memory range specified by nodemask and maxnode points
      outside your accessible address space, or when there is an unmapped
      hole in the memory range specified by addr and len.

      Fix this by preserving the errno returned by queue_pages_range().
      Also, the pagelist may be non-empty even though queue_pages_range()
      returns an error.  Put those pages back on the LRU: mbind_range() is
      not called to apply the policy, so they should not be migrated.  This
      is also the behavior from before the problematic commit.
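
      The error path can be modelled in userspace as follows.  This is a
      sketch under stated assumptions, not the mempolicy code: struct
      model_pagelist, model_queue_pages(), model_putback() and model_mbind()
      are hypothetical, and the fail_at/err parameters exist only to inject
      a failure partway through queueing.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bookkeeping: where the pages of a range currently are. */
struct model_pagelist {
    int queued;     /* isolated onto the private list */
    int on_lru;     /* still on (or returned to) the LRU */
    int migrated;   /* moved after the policy was applied */
};

/*
 * Queue `want` pages. If fail_at is in [0, want), fail with -err when that
 * index is reached, leaving the pages queued so far on the list.
 */
static int model_queue_pages(struct model_pagelist *pl, int want,
                             int fail_at, int err)
{
    assert(want <= pl->on_lru);
    for (int i = 0; i < want; i++) {
        if (i == fail_at)
            return -err;
        pl->on_lru--;
        pl->queued++;
    }
    return 0;
}

/* Return every queued page to the LRU. */
static void model_putback(struct model_pagelist *pl)
{
    pl->on_lru += pl->queued;
    pl->queued = 0;
}

/* Caller: on failure, put the pages back and pass the errno through. */
static int model_mbind(struct model_pagelist *pl, int want,
                       int fail_at, int err)
{
    int ret = model_queue_pages(pl, want, fail_at, err);

    if (ret < 0) {
        model_putback(pl);  /* list may be non-empty despite the error */
        return ret;         /* preserve the original errno */
    }

    /* The policy would be applied here; model migration of queued pages. */
    pl->migrated += pl->queued;
    pl->queued = 0;
    return 0;
}
```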
      
      Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: d8835445 ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[4.19 and 5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85dfc30
  3. 15 Nov, 2019 7 commits