1. 26 Feb, 2020 4 commits
    • netfilter: xt_hashlimit: unregister proc file before releasing mutex · 99b79c39
      Cong Wang authored
      Before releasing the global mutex, we only unlink the hashtable
      from the hash list; its proc file is still registered at this
      point. syzbot could therefore trigger a race condition in which a
      parallel htable_create() registers the same file immediately
      after the mutex is released.
      
      Move htable_remove_proc_entry() back under mutex protection to
      fix this. Also fold htable_destroy() into htable_put() to make
      the code slightly easier to understand.
      
      Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com
      Fixes: c4a3922d ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()")
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • selftests: nft_concat_range: Add test for reported add/flush/add issue · 0954df70
      Stefano Brivio authored
      Add a specific test for the crash reported by Phil Sutter and addressed
      in the previous patch. The test cases that, in my intention, should
      have covered these cases, that is, the ones from the 'concurrency'
      section, don't run these sequences tightly enough and spectacularly
      failed to catch this.
      
      While at it, define a convenient way to add this kind of test, by
      adding a "reported issues" test section.
      
      It's more convenient, for this particular test, to execute the set
      setup in its own function. However, future test cases like this one
      might need to call setup functions, and will typically need no tools
      other than nft, so allow for this in check_tools().
      
      The original form of the reproducer used here was provided by Phil.
      Reported-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nft_set_pipapo: Actually fetch key data in nft_pipapo_remove() · 212d58c1
      Stefano Brivio authored
      Phil reports that adding elements, flushing and re-adding them
      right away:
      
        nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }'
        nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
        nft flush set t s
        nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'
      
      triggers, almost reliably, a crash like this one:
      
        [   71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
        [   71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e #192
        [   71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
        [   71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
        [   71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
        [   71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
        [   71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
        [   71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
        [   71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
        [   71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
        [   71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
        [   71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
        [   71.334596] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
        [   71.335780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [   71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
        [   71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [   71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [   71.339718] Call Trace:
        [   71.340093]  nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
        [   71.340973]  nft_set_destroy+0x20/0x50 [nf_tables]
        [   71.341879]  nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
        [   71.342916]  process_one_work+0x1d5/0x3c0
        [   71.343601]  worker_thread+0x4a/0x3c0
        [   71.344229]  kthread+0xfb/0x130
        [   71.344780]  ? process_one_work+0x3c0/0x3c0
        [   71.345477]  ? kthread_park+0x90/0x90
        [   71.346129]  ret_from_fork+0x35/0x40
        [   71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
        [   71.348153] ---[ end trace 2eaa8149ca759bcc ]---
        [   71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
        [   71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
        [   71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
        [   71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
        [   71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
        [   71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
        [   71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
        [   71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
        [   71.350025] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
        [   71.350026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [   71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
        [   71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [   71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [   71.350030] Kernel panic - not syncing: Fatal exception
        [   71.350412] Kernel Offset: disabled
        [   71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      which is caused by dangling elements that have been deactivated, but
      never removed.
      
      On a flush operation, nft_pipapo_walk() walks through all the elements
      in the mapping table, which are then deactivated by nft_flush_set(),
      one by one, and added to the commit list for removal. Element data is
      then freed.
      
      On transaction commit, nft_pipapo_remove() is called but fails to
      remove these elements, leading to stale references in the mapping.
      The first symptom of this, revealed by KASan, is a one-byte
      use-after-free in subsequent calls to nft_pipapo_walk(), which is
      usually not enough to trigger a panic. When stale elements are used
      more heavily, though, such as double-free via nft_pipapo_destroy()
      as in Phil's case, the problem becomes more noticeable.
      
      The issue comes from the fact that, on a flush operation,
      nft_pipapo_remove() won't get the actual key data via elem->key, so
      elements to be deleted upon commit won't be found by the lookup via
      pipapo_get(), and removal will be skipped. Key data should be fetched
      via nft_set_ext_key(), instead.
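
      The lookup failure described above can be illustrated with a toy
      model. This is only a sketch, not the nf_tables code: every name
      prefixed with toy_ is hypothetical, and the assumption that the
      request key is simply zeroed on a flush is a simplification made
      for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_KEY_LEN 4

/* Hypothetical stand-in for the data stored with a set element. */
struct toy_ext {
	unsigned char key[TOY_KEY_LEN];
	int in_use;
};

/* Hypothetical stand-in for the request handed to the remove path. */
struct toy_elem {
	unsigned char key[TOY_KEY_LEN]; /* left zeroed on a flush in this model */
	struct toy_ext *priv;           /* always points at the stored element */
};

/* Linear lookup by key; returns NULL when no active element matches. */
static struct toy_ext *toy_lookup(struct toy_ext *set, size_t n,
				  const unsigned char *key)
{
	for (size_t i = 0; i < n; i++)
		if (set[i].in_use && memcmp(set[i].key, key, TOY_KEY_LEN) == 0)
			return &set[i];
	return NULL;
}

/* Removes the element matching key; returns 1 if something was removed. */
static int toy_remove_by_key(struct toy_ext *set, size_t n,
			     const unsigned char *key)
{
	struct toy_ext *e = toy_lookup(set, n, key);

	if (!e)
		return 0; /* removal silently skipped */
	e->in_use = 0;
	return 1;
}

/* Buggy variant: trusts the key carried by the request itself. */
static int toy_remove_buggy(struct toy_ext *set, size_t n,
			    const struct toy_elem *elem)
{
	return toy_remove_by_key(set, n, elem->key);
}

/* Fixed variant: reads the key from the stored element data. */
static int toy_remove_fixed(struct toy_ext *set, size_t n,
			    const struct toy_elem *elem)
{
	assert(elem->priv != NULL);
	return toy_remove_by_key(set, n, elem->priv->key);
}
```

      With a request whose own key is empty, toy_remove_buggy() finds
      nothing and leaves the element in place, while toy_remove_fixed()
      removes it.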
      Reported-by: Phil Sutter <phil@nwl.cc>
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge branch 'master' of git://blackhole.kfki.hu/nf · 9ea4894b
      Pablo Neira Ayuso authored
      Jozsef Kadlecsik says:
      
      ====================
      ipset patches for nf
      
      The first one is larger than usual, but the issue could not be solved in a
      simpler way. It is also a resend of the patch I submitted a few days ago,
      with a one-line fix on top: the size of the comment extensions was not
      taken into account when reporting the full size of the set.
      
      - Fix "INFO: rcu detected stall in hash_xxx" reports of syzbot
        by introducing region locking and using workqueue instead of timer based
        gc of timed out entries in hash types of sets in ipset.
      - Fix the forceadd evaluation path - the bug was also uncovered by the syzbot.
      ====================
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 25 Feb, 2020 1 commit
  3. 24 Feb, 2020 16 commits
    • Merge tag 'mac80211-for-net-2020-02-24' of... · 3614d05b
      David S. Miller authored
      Merge tag 'mac80211-for-net-2020-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      A few fixes:
       * remove a double mutex-unlock
       * fix a leak in an error path
       * NULL pointer check
       * include if_vlan.h where needed
       * avoid RCU list traversal when not under RCU
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix stale eth hdr pointer in br_dev_xmit · 823d81b0
      Nikolay Aleksandrov authored
      In br_dev_xmit() we perform vlan filtering in br_allowed_ingress(), but
      if the packet carries the vlan header inline (e.g. a bridge with
      tx-vlan-offload disabled), the vlan filtering code uses skb_vlan_untag()
      to extract the vid before filtering. That in turn calls pskb_may_pull(),
      so we may end up with a stale eth pointer. Moreover, the cached eth header
      pointer will generally be wrong after that operation. Remove the eth header
      caching and just use eth_hdr() directly; the compiler does the right thing
      and calculates it only once, so we don't lose anything.
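
      The stale pointer hazard can be shown with a small userspace
      sketch. This is not the bridge code: struct toy_skb, toy_hdr() and
      toy_may_pull() are hypothetical stand-ins, and toy_may_pull()
      always moves the data, whereas a real pull only sometimes does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified packet buffer. */
struct toy_skb {
	unsigned char *head; /* start of the header data */
	size_t len;
};

/* Derive the header pointer from the buffer on every use. */
static unsigned char *toy_hdr(const struct toy_skb *skb)
{
	assert(skb != NULL);
	return skb->head;
}

/*
 * Stand-in for an operation that may reallocate the buffer.
 * Here it always copies the data to a fresh allocation, so any
 * pointer obtained before the call no longer points into the buffer.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int toy_may_pull(struct toy_skb *skb)
{
	unsigned char *fresh = malloc(skb->len);

	if (!fresh)
		return -1;
	memcpy(fresh, skb->head, skb->len);
	free(skb->head);
	skb->head = fresh;
	return 0;
}
```

      A caller that stores the result of toy_hdr() before calling
      toy_may_pull() holds a dangling pointer afterwards; calling
      toy_hdr() again at the point of use avoids the problem.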
      
      Fixes: 057658cb ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-ll_temac-Bugfixes' · e4686c2d
      David S. Miller authored
      Esben Haabendal says:
      
      ====================
      net: ll_temac: Bugfixes
      
      Fix a number of bugs which have been present since the first commit.
      
      The bugs fixed in patches 1, 2 and 4 have all been observed in real systems,
      and were relatively easy to reproduce given an appropriate stress setup.
      
      Changes since v1:
      
      - Changed error handling of dma_map_single() in temac_start_xmit() to drop
        packet instead of returning NETDEV_TX_BUSY.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Handle DMA halt condition caused by buffer underrun · 1d63b8d6
      Esben Haabendal authored
      The SDMA engine used by TEMAC halts operation when it has finished
      processing of the last buffer descriptor in the buffer ring.
      Unfortunately, no interrupt event is generated when this happens,
      so we need to set up another mechanism to make sure DMA operation is
      restarted when enough buffers have been added to the ring.
      
      Fixes: 92744989 ("net: add Xilinx ll_temac device driver")
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix RX buffer descriptor handling on GFP_ATOMIC pressure · 770d9c67
      Esben Haabendal authored
      Failures caused by GFP_ATOMIC memory pressure have been observed and,
      due to the missing error handling, result in kernel crashes such as
      
      [1876998.350133] kernel BUG at mm/slub.c:3952!
      [1876998.350141] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [1876998.350147] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.3.0-scnxt #1
      [1876998.350150] Hardware name: N/A N/A/COMe-bIP2, BIOS CCR2R920 03/01/2017
      [1876998.350160] RIP: 0010:kfree+0x1ca/0x220
      [1876998.350164] Code: 85 db 74 49 48 8b 95 68 01 00 00 48 31 c2 48 89 10 e9 d7 fe ff ff 49 8b 04 24 a9 00 00 01 00 75 0b 49 8b 44 24 08 a8 01 75 02 <0f> 0b 49 8b 04 24 31 f6 a9 00 00 01 00 74 06 41 0f b6 74 24 5b
      [1876998.350172] RSP: 0018:ffffc900000f0df0 EFLAGS: 00010246
      [1876998.350177] RAX: ffffea00027f0708 RBX: ffff888008d78000 RCX: 0000000000391372
      [1876998.350181] RDX: 0000000000000000 RSI: ffffe8ffffd01400 RDI: ffff888008d78000
      [1876998.350185] RBP: ffff8881185a5d00 R08: ffffc90000087dd8 R09: 000000000000280a
      [1876998.350189] R10: 0000000000000002 R11: 0000000000000000 R12: ffffea0000235e00
      [1876998.350193] R13: ffff8881185438a0 R14: 0000000000000000 R15: ffff888118543870
      [1876998.350198] FS:  0000000000000000(0000) GS:ffff88811f300000(0000) knlGS:0000000000000000
      [1876998.350203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1876998.350206] CR2: 00007f8dac7b09f0 CR3: 000000011e20a006 CR4: 00000000001606e0
      [1876998.350210] Call Trace:
      [1876998.350215]  <IRQ>
      [1876998.350224]  ? __netif_receive_skb_core+0x70a/0x920
      [1876998.350229]  kfree_skb+0x32/0xb0
      [1876998.350234]  __netif_receive_skb_core+0x70a/0x920
      [1876998.350240]  __netif_receive_skb_one_core+0x36/0x80
      [1876998.350245]  process_backlog+0x8b/0x150
      [1876998.350250]  net_rx_action+0xf7/0x340
      [1876998.350255]  __do_softirq+0x10f/0x353
      [1876998.350262]  irq_exit+0xb2/0xc0
      [1876998.350265]  do_IRQ+0x77/0xd0
      [1876998.350271]  common_interrupt+0xf/0xf
      [1876998.350274]  </IRQ>
      
      In order to handle such failures more gracefully, this change splits the
      receive loop into one for consuming the received buffers, and one for
      allocating new buffers.
      
      When GFP_ATOMIC allocations fail, the receive will continue with the
      buffers that are still there, with the expectation that the allocations
      will succeed in a later call to receive.
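
      The two-loop structure can be sketched as follows. This is an
      illustration under simplifying assumptions, not the driver code:
      the ring is a plain array scanned from slot 0, frames are
      "delivered" by freeing them, and all toy_ names (including the
      injected allocator used to simulate failure) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_RING_SIZE 4

struct toy_ring {
	void *buf[TOY_RING_SIZE];   /* NULL: no buffer attached to the slot */
	int filled[TOY_RING_SIZE];  /* nonzero: a frame was received into it */
};

/*
 * Loop 1: consume every received buffer and leave its slot empty.
 * Returns the number of frames consumed.
 */
static int toy_rx_consume(struct toy_ring *r)
{
	int n = 0;

	assert(r != NULL);
	for (int i = 0; i < TOY_RING_SIZE; i++) {
		if (r->buf[i] && r->filled[i]) {
			free(r->buf[i]); /* stand-in for passing the frame up */
			r->buf[i] = NULL;
			r->filled[i] = 0;
			n++;
		}
	}
	return n;
}

/*
 * Loop 2: attach new buffers to empty slots. On allocation failure,
 * stop and keep the buffers that are still there; a later call retries.
 * Returns the number of slots refilled.
 */
static int toy_rx_refill(struct toy_ring *r, void *(*alloc)(size_t))
{
	int n = 0;

	assert(r != NULL);
	for (int i = 0; i < TOY_RING_SIZE; i++) {
		void *p;

		if (r->buf[i])
			continue;
		p = alloc(64);
		if (!p)
			break;
		r->buf[i] = p;
		n++;
	}
	return n;
}

/* Allocator that always fails, to simulate memory pressure. */
static void *toy_alloc_fail(size_t size)
{
	(void)size;
	return NULL;
}
```

      Calling toy_rx_refill() with toy_alloc_fail leaves the ring
      partially populated but consistent, and a later call with a
      working allocator fills the remaining slots.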
      
      Fixes: 92744989 ("net: add Xilinx ll_temac device driver")
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Add more error handling of dma_map_single() calls · d07c849c
      Esben Haabendal authored
      This adds error handling to the remaining dma_map_single() calls, so that
      behavior is well defined if/when we run out of DMA memory.
      
      Fixes: 92744989 ("net: add Xilinx ll_temac device driver")
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix race condition causing TX hang · 84823ff8
      Esben Haabendal authored
      It is possible that the interrupt handler fires and frees up space in
      the TX ring in between checking for sufficient TX ring space and
      stopping the TX queue in temac_start_xmit. If this happens, the
      queue wake from the interrupt handler will occur before the queue is
      stopped, causing a lost wakeup and the adapter's transmit hanging.
      
      To avoid this, after stopping the queue, check again whether there is
      sufficient space in the TX ring. If so, wake up the queue again.
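
      The stop-then-recheck pattern can be modelled in a few lines. This
      is a single-threaded sketch, not the driver code: the interrupt is
      simulated by a hook that runs between the space check and the
      queue stop, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_txq {
	int free_slots;
	int stopped;
	/* Test hook simulating an interrupt at the racy point; may be NULL. */
	void (*between_check_and_stop)(struct toy_txq *q);
};

/* Simulated TX-done interrupt: frees ring space and wakes the queue. */
static void toy_tx_irq(struct toy_txq *q)
{
	q->free_slots += 4;
	q->stopped = 0;
}

/*
 * Returns 1 if there is room to transmit, 0 if the queue was left stopped.
 * With recheck == 0 this models the old behaviour; with recheck != 0 the
 * space is checked again after stopping and the queue is woken if needed.
 */
static int toy_reserve(struct toy_txq *q, int needed, int recheck)
{
	assert(q != NULL);
	if (q->free_slots >= needed)
		return 1;

	if (q->between_check_and_stop)
		q->between_check_and_stop(q); /* wakeup happens here, too early */

	q->stopped = 1;

	if (recheck && q->free_slots >= needed) {
		q->stopped = 0; /* space appeared meanwhile: wake up again */
		return 1;
	}
	return 0;
}
```

      Without the recheck, the model ends with the queue stopped even
      though the ring has free space, and nothing is left to wake it.
      With the recheck, the queue is running again.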
      
      This is a port of the similar fix in axienet driver,
      commit 7de44285 ("net: axienet: Fix race condition causing TX hang").
      
      Fixes: 23ecc4bd ("net: ll_temac: fix checksum offload logic")
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mac80211: rx: avoid RCU list traversal under mutex · 253216ff
      Madhuparna Bhowmik authored
      local->sta_mtx is held in __ieee80211_check_fast_rx_iface(), so
      there is no need to use list_for_each_entry_rcu(), which also requires
      a cond argument to avoid false lockdep warnings when not used in an
      RCU read-side section (with CONFIG_PROVE_RCU_LIST).
      Therefore use list_for_each_entry().
      Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
      Link: https://lore.kernel.org/r/20200223143302.15390-1-madhuparnabhowmik10@gmail.com
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • nl80211: explicitly include if_vlan.h · e3ae39ed
      Johannes Berg authored
      We use that here, and do seem to get it through some recursive
      include, but better include it explicitly.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Link: https://lore.kernel.org/r/20200224093814.1b9c258fec67.I45ac150d4e11c72eb263abec9f1f0c7add9bef2b@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • net: core: devlink.c: Hold devlink->lock from the beginning of devlink_dpipe_table_register() · 6132c1d9
      Madhuparna Bhowmik authored
      devlink_dpipe_table_find() should be called under either
      rcu_read_lock() or devlink->lock. devlink_dpipe_table_register()
      calls devlink_dpipe_table_find() without holding the lock
      and acquires it later. Therefore hold the devlink->lock
      from the beginning of devlink_dpipe_table_register().
      Suggested-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Avoid multiple suspends · 503ba7c6
      Florian Fainelli authored
      It is currently possible for a PHY device to be suspended as part of a
      network device driver's suspend call while it is still being attached to
      that net_device, either via phy_suspend() or implicitly via phy_stop().
      
      Later on, when the MDIO bus controller gets suspended, we would attempt
      to suspend the PHY again because it is still attached to a network
      device.
      
      This is both a waste of time and creates an opportunity for improper
      clock/power management bugs to creep in.
      
      Fixes: 803dd9c7 ("net: phy: avoid suspending twice a PHY")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ks8851-ml: Fix IRQ handling and locking · 44343418
      Marek Vasut authored
      The KS8851 requires that packet RX and TX are mutually exclusive.
      Currently, the driver hopes to achieve this by disabling interrupts
      from the card by writing the card registers and by disabling the
      interrupt on the interrupt controller. This, however, is racy on SMP.
      
      Replace this approach by extending the spinlock used around the
      ks_start_xmit() TX path to the ks_irq() RX path to ensure true mutual
      exclusion, and remove the interrupt enabling/disabling, which is
      no longer needed. Furthermore, also disable interrupts in
      ks_net_stop(), which was missing before.
      
      Note that a massive improvement here would be to re-use the KS8851
      driver approach, which is to move the TX path into a worker thread,
      interrupt handling to threaded interrupt, and synchronize everything
      with mutexes, but that would be a much bigger rework, for a separate
      patch.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Lukas Wunner <lukas@wunner.de>
      Cc: Petr Stetiar <ynezz@true.cz>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • docs: networking: phy: Rephrase paragraph for clarity · 52df1e56
      Jonathan Neuschäfer authored
      Let's make it a little easier to read.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix TFO SYNACK undo to avoid double-timestamp-undo · dad8cea7
      Neal Cardwell authored
      In a rare corner case the new logic for undo of SYNACK RTO could
      result in triggering the warning in tcp_fastretrans_alert() that says:
              WARN_ON(tp->retrans_out != 0);
      
      The warning looked like:
      
      WARNING: CPU: 1 PID: 1 at net/ipv4/tcp_input.c:2818 tcp_ack+0x13e0/0x3270
      
      The sequence that tickles this bug is:
       - Fast Open server receives TFO SYN with data, sends SYNACK
       - (client receives SYNACK and sends ACK, but ACK is lost)
       - server app sends some data packets
       - (N of the first data packets are lost)
       - server receives client ACK that has a TS ECR matching first SYNACK,
         and also SACKs suggesting the first N data packets were lost
          - server performs TS undo of SYNACK RTO, then immediately
            enters recovery
          - buggy behavior then performed a *second* undo that caused
            the connection to be in CA_Open with retrans_out != 0
      
      Basically, the incoming ACK packet with SACK blocks causes us to first
      undo the cwnd reduction from the SYNACK RTO, but then immediately
      enter fast recovery, which then makes us eligible for undo again. And
      then tcp_rcv_synrecv_state_fastopen() accidentally performs an undo
      using a "mash-up" of state from two different loss recovery phases: it
      uses the timestamp info from the ACK of the original SYNACK, and the
      undo_marker from the fast recovery.
      
      This fix refines the logic to only invoke the tcp_try_undo_loss()
      inside tcp_rcv_synrecv_state_fastopen() if the connection is still in
      CA_Loss.  If peer SACKs triggered fast recovery, then
      tcp_rcv_synrecv_state_fastopen() can't safely undo.
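
      The effect of the added state check can be shown with a toy state
      model. This is only a sketch and does not reproduce the TCP code:
      the struct, the enum and toy_try_undo() are hypothetical, and the
      undo is reduced to a bare state change.

```c
#include <assert.h>
#include <stddef.h>

enum toy_ca_state { TOY_CA_OPEN, TOY_CA_RECOVERY, TOY_CA_LOSS };

struct toy_sock {
	enum toy_ca_state ca_state;
	int retrans_out; /* retransmitted segments still in flight */
};

/*
 * Model of the undo attempted while handling the ACK in the
 * Fast Open SYN-RECV path. With require_ca_loss == 0 the undo always
 * runs (old behaviour); otherwise it only runs while still in the
 * loss state. Returns 1 if an undo was performed.
 */
static int toy_try_undo(struct toy_sock *sk, int require_ca_loss)
{
	assert(sk != NULL);
	if (require_ca_loss && sk->ca_state != TOY_CA_LOSS)
		return 0;
	sk->ca_state = TOY_CA_OPEN;
	return 1;
}
```

      Starting from the recovery state with retransmissions in flight,
      the unguarded variant ends in the open state with
      retrans_out != 0, which is the condition the warning checks for.
      The guarded variant leaves the recovery state untouched.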
      
      Fixes: 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_netvsc: Fix unwanted wakeup in netvsc_attach() · f6f13c12
      Haiyang Zhang authored
      When netvsc_attach() is called by operations such as changing the MTU,
      an extra wakeup may happen while netvsc_attach() is calling
      rndis_filter_device_add(), which sends rndis messages while the queue is
      still stopped by netvsc_detach(). The completion message will wake up
      queue 0.
      
      The issue can be reproduced by changing the MTU; the wake_queue
      counter from "ethtool -S" will then exceed the stop_queue counter:
           stop_queue: 0
           wake_queue: 1
      The issue causes a queue wakeup and a counter increment, with no other
      ill effects in the current code, so we have not seen any network
      problem so far.
      
      To fix this, initialize tx_disable to true, and set it to false when
      the NIC is ready to be attached or registered.
      
      Fixes: 7b2ee50c ("hv_netvsc: common detach logic")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: usb: qmi_wwan: restore mtu min/max values after raw_ip switch · eae7172f
      Daniele Palmas authored
      usbnet creates network interfaces with min_mtu = 0 and
      max_mtu = ETH_MAX_MTU.
      
      These values are not modified by qmi_wwan when the network interface
      is created initially, which allows, for example, setting an mtu greater
      than 1500.
      
      When a raw_ip switch is done (raw_ip set to 'Y', then set to 'N') the mtu
      values for the network interface are set through ether_setup, with
      min_mtu = ETH_MIN_MTU and max_mtu = ETH_DATA_LEN, which no longer allows
      setting an mtu greater than 1500 (error: mtu greater than device maximum).
      
      The patch restores the original min/max mtu values set by usbnet after a
      raw_ip switch.
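
      The sequence above can be sketched with a toy device structure.
      This is an illustration only, not the qmi_wwan code: the toy_
      names are hypothetical, and the constants are defined locally with
      the values I understand the kernel headers to use (68, 1500 and
      65535).

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the limits named above; values assumed from the kernel. */
#define TOY_ETH_MIN_MTU  68U
#define TOY_ETH_DATA_LEN 1500U
#define TOY_ETH_MAX_MTU  0xFFFFU

struct toy_netdev {
	unsigned int min_mtu;
	unsigned int max_mtu;
};

/* Model of the Ethernet defaults applied during the raw_ip switch. */
static void toy_ether_setup(struct toy_netdev *dev)
{
	dev->min_mtu = TOY_ETH_MIN_MTU;
	dev->max_mtu = TOY_ETH_DATA_LEN;
}

/*
 * Model of switching raw_ip back to 'N'. With restore != 0 the limits
 * that usbnet originally set (0 and ETH_MAX_MTU) are put back afterwards.
 */
static void toy_raw_ip_switch(struct toy_netdev *dev, int restore)
{
	assert(dev != NULL);
	toy_ether_setup(dev);
	if (restore) {
		dev->min_mtu = 0;
		dev->max_mtu = TOY_ETH_MAX_MTU;
	}
}

/* Returns 1 if the requested mtu is within the device limits. */
static int toy_mtu_allowed(const struct toy_netdev *dev, unsigned int mtu)
{
	return mtu >= dev->min_mtu && mtu <= dev->max_mtu;
}
```

      Without the restore step, toy_mtu_allowed() rejects an mtu of 2000
      after the switch; with it, the same request is accepted again.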
      Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 23 Feb, 2020 3 commits
  5. 22 Feb, 2020 2 commits
    • netfilter: ipset: Fix forceadd evaluation path · 8af1c6fb
      Jozsef Kadlecsik authored
      When the forceadd option is enabled, the hash:* types should find and replace
      the first entry in the bucket with the new one if there are no reusable
      (deleted or timed out) entries. However, the position index was just not set
      to zero and remained the invalid -1 if there were no reusable entries.
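
      The intended behaviour can be sketched with a toy bucket. This is
      not the ipset code: struct toy_bucket and toy_add() are
      hypothetical, and "reusable" is reduced to a simple unused flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUCKET_SIZE 4

struct toy_bucket {
	int used[TOY_BUCKET_SIZE];
	int value[TOY_BUCKET_SIZE];
};

/*
 * Stores value in the bucket and returns the slot index used,
 * or -1 if the bucket is full and forceadd is not set.
 */
static int toy_add(struct toy_bucket *b, int value, int forceadd)
{
	int j = -1; /* position of a reusable slot; -1 means none found */

	assert(b != NULL);
	for (int i = 0; i < TOY_BUCKET_SIZE; i++) {
		if (!b->used[i]) {
			j = i;
			break;
		}
	}

	if (j == -1) {
		if (!forceadd)
			return -1;
		j = 0; /* forceadd: replace the first entry in the bucket */
	}

	b->value[j] = value;
	b->used[j] = 1;
	return j;
}
```

      Leaving out the j = 0 assignment would let the code index the
      arrays with -1, which corresponds to the invalid position
      described above.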
      
      Reported-by: syzbot+6a86565c74ebe30aea18@syzkaller.appspotmail.com
      Fixes: 23c42a40 ("netfilter: ipset: Introduction of new commands and protocol version 7")
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
    • netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports · f66ee041
      Jozsef Kadlecsik authored
      In the case of huge hash:* types of sets, due to the single spinlock of
      a set the processing of the whole set under spinlock protection could take
      too long.
      
      There were four places where the whole hash table of the set was processed
      from bucket to bucket while holding the spinlock:
      
      - During resizing a set, the original set was locked to exclude kernel side
        add/del element operations (userspace add/del is excluded by the
        nfnetlink mutex). The original set is actually just read during the
        resize, so the spinlocking is replaced with rcu locking of regions.
        As a result, however, there can be parallel kernel side add/del of
        entries. In order not to lose those operations, a backlog is added and
        replayed after the successful resize.
      - Garbage collection of timed out entries was also protected by the spinlock.
        In order not to lock too long, region locking is introduced and a single
        region is processed in one gc go. Also, the simple timer based gc running
        is replaced with a workqueue based solution. The internal book-keeping
        (number of elements, size of extensions) is moved to region level due to
        the region locking.
      - Adding elements: when the max number of the elements is reached, the gc
        was called to evict the timed out entries. The new approach is that the gc
        is called just for the matching region, assuming that if the region
        (proportionally) seems to be full, then the whole set does. We could scan
        the other regions to check every entry under rcu locking, but for huge
        sets it'd mean a slowdown at adding elements.
      - Listing the set header data: when the set was defined with timeout
        support, the garbage collector was called to clean up timed out entries
        to get the correct element numbers and set size values. Now the set is
        scanned to check non-timed out entries, without actually calling the gc
        for the whole set.
      
      Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
      SOFTIRQ-unsafe lock order issues while working on the patch.
      
      Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com
      Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com
      Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com
      Fixes: 23c42a40 ("netfilter: ipset: Introduction of new commands and protocol version 7")
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
  6. 21 Feb, 2020 14 commits
    • Merge tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog · 0c0ddd6a
      Linus Torvalds authored
      Pull watchdog fixes from Wim Van Sebroeck:
      
       - mtk_wdt needs RESET_CONTROLLER to build
      
       - da9062 driver fixes:
           - fix power management ops
           - do not ping the hw during stop()
           - add dependency on I2C
      
      * tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: da9062: Add dependency on I2C
        watchdog: da9062: fix power management ops
        watchdog: da9062: do not ping the hw during stop()
        watchdog: fix mtk_wdt.c RESET_CONTROLLER build error
    • Merge tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · bb65619e
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small char/misc driver fixes for 5.6-rc3.
      
        Also included in here are some updates for some documentation files
        that I seem to be maintaining these days.
      
        The driver fixes are:
         - small fixes for the habanalabs driver
         - fsi driver bugfix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        Documentation/process: Swap out the ambassador for Canonical
        habanalabs: patched cb equals user cb in device memset
        habanalabs: do not halt CoreSight during hard reset
        habanalabs: halt the engines before hard-reset
        MAINTAINERS: remove unnecessary ':' characters
        fsi: aspeed: add unspecified HAS_IOMEM dependency
        COPYING: state that all contributions really are covered by this file
        Documentation/process: Change Microsoft contact for embargoed hardware issues
        embargoed-hardware-issues: drop Amazon contact as the email address now bounces
        Documentation/process: Add Arm contact for embargoed HW issues
    • Merge tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · e5553ac7
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are some small staging driver fixes for 5.6-rc3, along with the
        removal of an unused/unneeded driver as well.
      
        The android vsoc driver is not needed anymore by anyone, so it was
        removed.
      
        The other driver fixes are:
         - ashmem bugfixes
         - greybus audio driver bugfix
         - wireless driver bugfixes and tiny cleanups to error paths
      
        All of these have been in linux-next for a while now with no reported
        issues"
      
      * tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: rtl8723bs: Remove unneeded goto statements
        staging: rtl8188eu: Remove some unneeded goto statements
        staging: rtl8723bs: Fix potential overuse of kernel memory
        staging: rtl8188eu: Fix potential overuse of kernel memory
        staging: rtl8723bs: Fix potential security hole
        staging: rtl8188eu: Fix potential security hole
        staging: greybus: use after free in gb_audio_manager_remove_all()
        staging: android: Delete the 'vsoc' driver
        staging: rtl8723bs: fix copy of overlapping memory
        staging: android: ashmem: Disallow ashmem memory from being remapped
        staging: vt6656: fix sign of rx_dbm to bb_pre_ed_rssi.
      e5553ac7
    • Merge tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · ef11f1b7
      Linus Torvalds authored
      Pull tty/serial driver fixes from Greg KH:
       "Here are a number of small tty and serial driver fixes for 5.6-rc3
        that resolve a bunch of reported issues.
      
        They are:
         - vt selection and ioctl fixes
         - serdev bugfix
         - atmel serial driver fixes
         - qcom serial driver fixes
         - other minor serial driver fixes
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        vt: selection, close sel_buffer race
        vt: selection, handle pending signals in paste_selection
        serial: cpm_uart: call cpm_muram_init before registering console
        tty: serial: qcom_geni_serial: Fix RX cancel command failure
        serial: 8250: Check UPF_IRQ_SHARED in advance
        tty: serial: imx: setup the correct sg entry for tx dma
        vt: vt_ioctl: fix race in VT_RESIZEX
        vt: fix scrollback flushing on background consoles
        tty: serial: tegra: Handle RX transfer in PIO mode if DMA wasn't started
        tty/serial: atmel: manage shutdown in case of RS485 or ISO7816 mode
        serdev: ttyport: restore client ops on deregistration
        serial: ar933x_uart: set UART_CS_{RX,TX}_READY_ORIDE
      ef11f1b7
    • Merge tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · cee853e8
      Linus Torvalds authored
      Pull USB/Thunderbolt fixes from Greg KH:
       "Here are a number of small USB driver fixes for 5.6-rc3.
      
        Included in here are:
        - MAINTAINERS file updates
        - USB gadget driver fixes
        - usb core quirk additions and fixes for regressions
        - xhci driver fixes
        - usb serial driver id additions and fixes
        - thunderbolt bugfix
      
        Thunderbolt patches come in through here now that USB4 is really
        thunderbolt.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (34 commits)
        USB: misc: iowarrior: add support for the 100 device
        thunderbolt: Prevent crash if non-active NVMem file is read
        usb: gadget: udc-xilinx: Fix xudc_stop() kernel-doc format
        USB: misc: iowarrior: add support for the 28 and 28L devices
        USB: misc: iowarrior: add support for 2 OEMed devices
        USB: Fix novation SourceControl XL after suspend
        xhci: Fix memory leak when caching protocol extended capability PSI tables - take 2
        Revert "xhci: Fix memory leak when caching protocol extended capability PSI tables"
        MAINTAINERS: Sort entries in database for THUNDERBOLT
        usb: dwc3: debug: fix string position formatting mixup with ret and len
        usb: gadget: serial: fix Tx stall after buffer overflow
        usb: gadget: ffs: ffs_aio_cancel(): Save/restore IRQ flags
        usb: dwc2: Fix SET/CLEAR_FEATURE and GET_STATUS flows
        usb: dwc2: Fix in ISOC request length checking
        usb: gadget: composite: Support more than 500mA MaxPower
        usb: gadget: composite: Fix bMaxPower for SuperSpeedPlus
        usb: gadget: u_audio: Fix high-speed max packet size
        usb: dwc3: gadget: Check for IOC/LST bit in TRB->ctrl fields
        USB: core: clean up endpoint-descriptor parsing
        USB: quirks: blacklist duplicate ep on Sound Devices USBPre2
        ...
      cee853e8
    • Merge tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm · 88f8bbfa
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Varied fixes for rc3.
      
        i915 is the largest, they are seeing some ACPI problems with their CI
        which hopefully get solved soon [1].
      
        msm has a bunch of fixes for new hw added in the merge, a bunch of
        amdgpu fixes, and nouveau adds support for some new firmwares for
        turing tu11x GPUs that were just released into linux-firmware by
        nvidia, they operate the same as the ones we already have for tu10x so
        should be fine to hook up.
      
        Otherwise it's just misc fixes for panfrost and sun4i.
      
        core:
         - Allow only one rotation argument, and allow zero rotation in video
           cmdline.
      
        i915:
         - Workaround missing Display Stream Compression (DSC) state readout
           by forcing modeset when it's enabled at probe
         - Fix EHL port clock voltage level requirements
         - Fix queuing retire workers on the virtual engine
         - Fix use of partially initialized waiters
         - Stop using drm_pci_alloc/drm_pci_free
         - Fix rewind of RING_TAIL by forcing a context reload
         - Fix locking on resetting ring->head
         - Propagate our bug filing URL change to stable kernels
      
        panfrost:
         - Small compiler warning fix for panfrost.
         - Fix use of performance counters in panfrost with a per-fd
           address space.
      
        sun4i:
         - Fix dt binding
      
        nouveau:
         - tu11x modesetting fix
         - ACR/GR firmware support for tu11x (fw is public now)
      
        msm:
         - fix UBWC on GPU and display side for sc7180
         - fix DSI suspend/resume issue encountered on sc7180
         - fix some breakage on so called "linux-android" devices
            (fallout from sc7180/a618 support, not seen earlier due to
             bootloader/firmware differences)
         - couple other misc fixes
      
        amdgpu:
         - HDCP fixes
         - xclk fix for raven
         - GFXOFF fixes"
      
      [1] The Intel suspend testing should now be fixed by commit 63fb9623
          ("ACPI: PM: s2idle: Check fixed wakeup events in acpi_s2idle_wake()")
      
      * tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm: (39 commits)
        drm/amdgpu/display: clean up hdcp workqueue handling
        drm/amdgpu: add is_raven_kicker judgement for raven1
        drm/i915/gt: Avoid resetting ring->head outside of its timeline mutex
        drm/i915/execlists: Always force a context reload when rewinding RING_TAIL
        drm/i915: Wean off drm_pci_alloc/drm_pci_free
        drm/i915/gt: Protect defer_request() from new waiters
        drm/i915/gt: Prevent queuing retire workers on the virtual engine
        drm/i915/dsc: force full modeset whenever DSC is enabled at probe
        drm/i915/ehl: Update port clock voltage level requirements
        drm/i915: Update drm/i915 bug filing URL
        MAINTAINERS: Update drm/i915 bug filing URL
        drm/i915: Initialise basic fence before acquiring seqno
        drm/i915/gem: Require per-engine reset support for non-persistent contexts
        drm/nouveau/kms/gv100-: Re-set LUT after clearing for modesets
        drm/nouveau/gr/tu11x: initial support
        drm/nouveau/acr/tu11x: initial support
        drm/amdgpu/gfx10: disable gfxoff when reading rlc clock
        drm/amdgpu/gfx9: disable gfxoff when reading rlc clock
        drm/amdgpu/soc15: fix xclk for raven
        drm/amd/powerplay: always refetch the enabled features status on dpm enablement
        ...
      88f8bbfa
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 3dc55dba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
          Cong Wang.
      
       2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
          is finished, from Magnus Karlsson.
      
       3) Set table field properly to RT_TABLE_COMPAT when necessary, from
          Jethro Beekman.
      
       4) NLA_STRING attributes are not necessarily NULL terminated, deal with
          that in IFLA_ALT_IFNAME. From Eric Dumazet.
      
       5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.
      
       6) Handle mtu==0 devices properly in wireguard, from Jason A.
          Donenfeld.
      
       7) Fix several lockdep warnings in bonding, from Taehee Yoo.
      
       8) Fix cls_flower port blocking, from Jason Baron.
      
       9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.
      
      10) Fix RDMA race in qede driver, from Michal Kalderon.
      
      11) Fix several false lockdep warnings by adding conditions to
          list_for_each_entry_rcu(), from Madhuparna Bhowmik.
      
      12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.
      
      13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.
      
      14) Hey, variables declared in switch statement before any case
          statements are not initialized. I learn something every day. Get
          rid of this stuff in several parts of the networking, from Kees
          Cook.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
        bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
        bnxt_en: Improve device shutdown method.
        net: netlink: cap max groups which will be considered in netlink_bind()
        net: thunderx: workaround BGX TX Underflow issue
        ionic: fix fw_status read
        net: disable BRIDGE_NETFILTER by default
        net: macb: Properly handle phylink on at91rm9200
        s390/qeth: fix off-by-one in RX copybreak check
        s390/qeth: don't warn for napi with 0 budget
        s390/qeth: vnicc Fix EOPNOTSUPP precedence
        openvswitch: Distribute switch variables for initialization
        net: ip6_gre: Distribute switch variables for initialization
        net: core: Distribute switch variables for initialization
        udp: rehash on disconnect
        net/tls: Fix to avoid gettig invalid tls record
        bpf: Fix a potential deadlock with bpf_map_do_batch
        bpf: Do not grab the bucket spinlock by default on htab batch ops
        ice: Wait for VF to be reset/ready before configuration
        ice: Don't tell the OS that link is going down
        ice: Don't reject odd values of usecs set by user
        ...
      3dc55dba
    • Merge branch 'akpm' (patches from Andrew) · b0dd1eb2
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
      
       - A few y2038 fixes which missed the merge window while dependencies
         in NFS were being sorted out.
      
       - A bunch of fixes. Some minor, some not.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        MAINTAINERS: use tabs for SAFESETID
        lib/stackdepot.c: fix global out-of-bounds in stack_slabs
        mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
        mm/vmscan.c: don't round up scan size for online memory cgroup
        lib/string.c: update match_string() doc-strings with correct behavior
        mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()
        mm/swapfile.c: fix a comment in sys_swapon()
        scripts/get_maintainer.pl: deprioritize old Fixes: addresses
        get_maintainer: remove uses of P: for maintainer name
        selftests/vm: add missed tests in run_vmtests
        include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
        Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"
        y2038: hide timeval/timespec/itimerval/itimerspec types
        y2038: remove unused time32 interfaces
        y2038: remove ktime to/from timespec/timeval conversion
      b0dd1eb2
    • MAINTAINERS: use tabs for SAFESETID · bb8d00ff
      Randy Dunlap authored
      Use tabs for indentation instead of spaces for SAFESETID.  All (!) other
      entries in MAINTAINERS use tabs (according to my simple grepping).
      
      Link: http://lkml.kernel.org/r/2bb2e52a-2694-816d-57b4-6cabfadd6c1a@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Micah Morton <mortonm@chromium.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8d00ff
    • lib/stackdepot.c: fix global out-of-bounds in stack_slabs · 305e519c
      Alexander Potapenko authored
      Walter Wu has reported a potential case in which init_stack_slab() is
      called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
      initialized.  In that case init_stack_slab() will overwrite
      stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in a memory
      corruption.
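
      The overflow described above can be modelled with a fixed-size table.
      This is a minimal Python sketch, not the kernel code: the table size of
      4 and the two function names are hypothetical, and the guard is only an
      assumption about the kind of bounds check that addresses the report.
      Python raises IndexError where C would silently write past the array.

```python
STACK_ALLOC_MAX_SLABS = 4  # hypothetical small value, for illustration only


def init_stack_slab_unchecked(stack_slabs, depot_index, prealloc):
    """Model of the reported problem: always writes slot depot_index + 1."""
    # When depot_index is the last valid index, this is one past the end.
    stack_slabs[depot_index + 1] = prealloc


def init_stack_slab_checked(stack_slabs, depot_index, prealloc):
    """Model with a guard: fill the next slot only if it exists."""
    if depot_index + 1 < STACK_ALLOC_MAX_SLABS:
        stack_slabs[depot_index + 1] = prealloc
        return True
    return False


if __name__ == "__main__":
    slabs = ["slab"] * STACK_ALLOC_MAX_SLABS
    last = STACK_ALLOC_MAX_SLABS - 1
    print(init_stack_slab_checked(slabs, last, "new"))  # guard refuses the write
```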
      
      Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      305e519c
    • mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM · 18e19f19
      Wei Yang authored
      When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
      doesn't work before sparse_init_one_section() is called.
      
      This leads to a crash when hotplugging memory:
      
          BUG: unable to handle page fault for address: 0000000006400000
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G        W         5.5.0-next-20200205+ #343
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:__memset+0x24/0x30
          Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
          RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
          RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
          RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
          RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
          R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
          R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
          FS:  0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           sparse_add_section+0x1c9/0x26a
           __add_pages+0xbf/0x150
           add_pages+0x12/0x60
           add_memory_resource+0xc8/0x210
           __add_memory+0x62/0xb0
           acpi_memory_device_add+0x13f/0x300
           acpi_bus_attach+0xf6/0x200
           acpi_bus_scan+0x43/0x90
           acpi_device_hotplug+0x275/0x3d0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x1a7/0x370
           worker_thread+0x30/0x380
           kthread+0x112/0x130
           ret_from_fork+0x35/0x40
      
      We should use memmap, as the code did previously.
      
      On x86 the impact is limited to x86_32 builds, or x86_64 configurations
      that override the default setting for SPARSEMEM_VMEMMAP.
      
      Other memory hotplug archs (arm64, ia64, and ppc) also default to
      SPARSEMEM_VMEMMAP=y.
      
      [dan.j.williams@intel.com: changelog update]
      [rppt@linux.ibm.com: changelog update]
      Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18e19f19
    • mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan authored
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          (((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
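
      The arithmetic above can be checked with a short Python sketch. The
      function names are hypothetical; the fraction 60 and the denominator
      60 + 140 + 1 are taken from the example, and the round-up is modelled
      as an ordinary ceiling division, which is an assumption about how the
      "+ 200" term arises.

```python
def scan_truncating(lru_size, priority, fraction, denominator):
    """Scan size using plain truncating integer division."""
    return (lru_size >> priority) * fraction // denominator


def scan_rounded_up(lru_size, priority, fraction, denominator):
    """Scan size using division rounded up (adds denominator - 1 first)."""
    return ((lru_size >> priority) * fraction + denominator - 1) // denominator


FRACTION = 60               # swappiness from the example
PRIORITY = 12               # priority from the example
DENOMINATOR = 60 + 140 + 1  # denominator from the example

# One 512MB huge page expressed in 64KB base pages.
HUGE_PAGE_IN_BASE_PAGES = (512 * 1024 * 1024) // (64 * 1024)

if __name__ == "__main__":
    for size in (0x1000, HUGE_PAGE_IN_BASE_PAGES, 0x4000):
        print(hex(size),
              scan_truncating(size, PRIORITY, FRACTION, DENOMINATOR),
              scan_rounded_up(size, PRIORITY, FRACTION, DENOMINATOR))
```

      With these inputs a list holding a single huge page (0x2000 base
      pages) yields a scan size of zero with truncation and a non-zero scan
      size with the round-up, which is the behaviour change described above.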
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaning the further iteration on @sc->priority and the sibling and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaiming memory from offlined memory
      cgroups.  It sounds reasonable as that memory is unlikely to be used by
      anyone.

      The issue was found by starting up 8 VMs on an Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap was consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do the same thing.  So there is a 10% performance
      improvement in my case.  Note that KSM is disabled while THP is enabled
      in the testing.
      
      Without the patch:

               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391

      With the patch:

               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76073c64
    • lib/string.c: update match_string() doc-strings with correct behavior · c11d3fa0
      Alexandru Ardelean authored
      There were a few attempts at changing behavior of the match_string()
      helpers (i.e.  'match_string()' & 'sysfs_match_string()'), to change &
      extend the behavior according to the doc-string.
      
      But the simplest approach is to just fix the doc-strings.  The current
      behavior is fine as-is, and some bugs were introduced trying to fix it.
      
      As for extending the behavior, new helpers can always be introduced if
      needed.
      
      The match_string() helpers behave more like 'strncmp()' in the sense
      that they go up to n elements or until the first NULL element in the
      array of strings.
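
      The documented behaviour can be illustrated with a small Python model,
      offered as a sketch rather than the kernel implementation. None stands
      in for a NULL array element, and the extra bound on the list length is
      only there to keep the Python version safe.

```python
import errno


def match_string(array, n, string):
    """Return the index of string in array, or -EINVAL if it is not found.

    The search covers at most n elements and stops early at the first
    None entry, which models a NULL element in the C array.
    """
    for index in range(min(n, len(array))):
        item = array[index]
        if item is None:
            break  # NULL element: the rest of the array is never examined
        if item == string:
            return index
    return -errno.EINVAL


if __name__ == "__main__":
    modes = ["off", "on", None, "auto"]
    print(match_string(modes, 4, "on"))
    print(match_string(modes, 4, "auto"))
```

      In this model "auto" is never matched even though n covers it, because
      the search ends at the None entry that precedes it.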
      
      This change updates the doc-strings with this info.
      
      Link: http://lkml.kernel.org/r/20200213072722.8249-1-alexandru.ardelean@analog.com
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c11d3fa0
    • mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · 75866af6
      Vasily Averin authored
      for_each_mem_cgroup() increases the css reference counter of the memory
      cgroup and requires mem_cgroup_iter_break() to be called if the walk is
      cancelled.
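
      The reference-counting rule can be sketched with a simplified Python
      model. The class and the helpers iter_next(), iter_break() and walk()
      are hypothetical stand-ins, not the kernel API; they only show why
      leaving the walk early without the break helper leaks one reference.

```python
class Cgroup:
    """Toy cgroup carrying only a name and a reference count."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0


def iter_next(cgroups, prev):
    """Drop the reference on prev (if any) and take one on the next cgroup."""
    index = 0
    if prev is not None:
        prev.refcount -= 1
        index = cgroups.index(prev) + 1
    if index >= len(cgroups):
        return None
    cgroups[index].refcount += 1
    return cgroups[index]


def iter_break(cgroup):
    """Drop the reference held on the cgroup where the walk is abandoned."""
    cgroup.refcount -= 1


def walk(cgroups, fail_on, use_break):
    """Visit each cgroup; stop early at the one named fail_on."""
    cgroup = iter_next(cgroups, None)
    while cgroup is not None:
        if cgroup.name == fail_on:
            if use_break:
                iter_break(cgroup)
            return False
        cgroup = iter_next(cgroups, cgroup)
    return True


if __name__ == "__main__":
    groups = [Cgroup("a"), Cgroup("b"), Cgroup("c")]
    walk(groups, fail_on="b", use_break=False)
    print([g.refcount for g in groups])  # the abandoned cgroup keeps its reference
```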
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75866af6