1. 27 Aug, 2019 5 commits
    • Merge tag 'mfd-fixes-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · 8d645408
      Linus Torvalds authored
      Pull MFD fix from Lee Jones:
       "Identify potentially unused functions in rk808 driver when !PM"
      
      * tag 'mfd-fixes-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
        mfd: rk808: Make PM function declaration static
        mfd: rk808: Mark pm functions __maybe_unused
    • Merge tag 'sound-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 0004654f
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A collection of small fixes as usual:
      
         - More coverage of USB-audio descriptor sanity checks
      
         - A fix for mute LED regression on Conexant HD-audio codecs
      
         - A few device-specific fixes and quirks for USB-audio and HD-audio
      
         - A fix for (die-hard remaining) possible race in sequencer core
      
         - FireWire oxfw regression fix that was introduced in 5.3-rc1"
      
      * tag 'sound-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: oxfw: fix to handle correct stream for PCM playback
        ALSA: seq: Fix potential concurrent access to the deleted pool
        ALSA: usb-audio: Check mixer unit bitmap yet more strictly
        ALSA: line6: Fix memory leak at line6_init_pcm() error path
        ALSA: usb-audio: Fix invalid NULL check in snd_emuusb_set_samplerate()
        ALSA: hda/ca0132 - Add new SBZ quirk
        ALSA: usb-audio: Add implicit fb quirk for Behringer UFX1604
        ALSA: hda - Fixes inverted Conexant GPIO mic mute led
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 452a0444
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Use 32-bit index for tail calls in s390 bpf JIT, from Ilya
          Leoshkevich.
      
       2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
          SMC from Jason Baron.
      
       3) ipv6_mc_may_pull() should return 0 for malformed packets, not
          -EINVAL. From Stefano Brivio.
      
       4) Don't forget to unpin umem xdp pages in error path of
          xdp_umem_reg(). From Ivan Khoronzhuk.
      
       5) Fix sta object leak in mac80211, from Johannes Berg.
      
       6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
          switches. From Florian Fainelli.
      
       7) Revert DMA sync removal from r8169 which was causing regressions on
          some MIPS Loongson platforms. From Heiner Kallweit.
      
       8) Use after free in flow dissector, from Jakub Sitnicki.
      
       9) Fix NULL derefs of net devices during ICMP processing across
          collect_md tunnels, from Hangbin Liu.
      
      10) proto_register() memory leaks, from Zhang Lin.
      
      11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
          from John Fastabend.
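
      Item 3 above turns on a common C convention: a predicate's return value
      is tested for truth, so a negative errno reads as "yes".  The sketch
      below is a minimal illustration, assuming callers test the predicate
      with `!`.  All demo_* names are hypothetical and are not the kernel's
      functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical predicate: nonzero means "the header may be pulled". */
int demo_may_pull_buggy(size_t pkt_len, size_t hdr_len)
{
	if (pkt_len < hdr_len)
		return -EINVAL;	/* nonzero, so a truth test reads it as "yes" */
	return 1;
}

/* Same predicate, but a malformed packet yields 0 ("no"). */
int demo_may_pull_fixed(size_t pkt_len, size_t hdr_len)
{
	if (pkt_len < hdr_len)
		return 0;
	return 1;
}

/* Hypothetical caller: 0 on success, -EINVAL for a malformed packet. */
int demo_parse(int (*may_pull)(size_t, size_t), size_t pkt_len, size_t hdr_len)
{
	if (!may_pull(pkt_len, hdr_len))
		return -EINVAL;
	return 0;
}
```

      With a 4-byte packet and an 8-byte header, demo_parse() accepts the
      packet when given the buggy predicate and rejects it with the fixed one.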
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
        r8152: Set memory to all 0xFFs on failed reg reads
        openvswitch: Fix conntrack cache with timeout
        ipv4: mpls: fix mpls_xmit for iptunnel
        nexthop: Fix nexthop_num_path for blackhole nexthops
        net: rds: add service level support in rds-info
        net: route dump netlink NLM_F_MULTI flag missing
        s390/qeth: reject oversized SNMP requests
        sock: fix potential memory leak in proto_register()
        MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
        xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
        ipv4/icmp: fix rt dst dev null pointer dereference
        openvswitch: Fix log message in ovs conntrack
        bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
        bpf: fix use after free in prog symbol exposure
        bpf: fix precision tracking in presence of bpf2bpf calls
        flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
        Revert "r8169: remove not needed call to dma_sync_single_for_device"
        ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
        net/ncsi: Fix the payload copying for the request coming from Netlink
        qed: Add cleanup in qed_slowpath_start()
        ...
    • mfd: rk808: Make PM function declaration static · 4d82fa67
      Lee Jones authored
      Avoids:
        ../drivers/mfd/rk808.c:771:1: warning: symbol 'rk8xx_pm_ops' \
          was not declared. Should it be static?
      
      Fixes: 5752bc43 ("mfd: rk808: Mark pm functions __maybe_unused")
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
    • mfd: rk808: Mark pm functions __maybe_unused · 5752bc43
      Arnd Bergmann authored
      The newly added suspend/resume functions are only used if CONFIG_PM
      is enabled:
      
      drivers/mfd/rk808.c:752:12: error: 'rk8xx_resume' defined but not used [-Werror=unused-function]
      drivers/mfd/rk808.c:732:12: error: 'rk8xx_suspend' defined but not used [-Werror=unused-function]
      
      Mark them as __maybe_unused so the compiler can silently drop them
      when they are not needed.
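
      The pattern can be tried in userspace.  The following is a minimal
      sketch, not the driver code: the macro is defined locally on top of the
      GCC/Clang `unused` attribute, and DEMO_CONFIG_PM and every demo_* name
      are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel annotation (GCC/Clang attribute). */
#define __maybe_unused __attribute__((__unused__))

/* Hypothetical switch standing in for CONFIG_PM; left undefined here. */
/* #define DEMO_CONFIG_PM */

struct demo_pm_ops {
	int (*suspend)(void);
	int (*resume)(void);
};

/*
 * With DEMO_CONFIG_PM undefined nothing references these two functions.
 * Without the annotation, -Werror=unused-function would reject them.
 */
static int __maybe_unused demo_suspend(void) { return 0; }
static int __maybe_unused demo_resume(void) { return 0; }

#ifdef DEMO_CONFIG_PM
static const struct demo_pm_ops demo_ops = { demo_suspend, demo_resume };
#else
static const struct demo_pm_ops demo_ops = { NULL, NULL };
#endif

/* Returns 1 when both PM callbacks are wired up, 0 otherwise. */
int demo_has_pm(void)
{
	return demo_ops.suspend != NULL && demo_ops.resume != NULL;
}
```

      Building with -Wall -Werror and removing the annotation should
      reproduce the class of error quoted above.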
      
      Fixes: 586c1b41 ("mfd: rk808: Add RK817 and RK809 support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
  2. 26 Aug, 2019 2 commits
  3. 25 Aug, 2019 26 commits
    • openvswitch: Fix conntrack cache with timeout · 71778951
      Yi-Hung Wei authored
      This patch addresses a conntrack cache issue with timeout policy.
      Currently, we do not check whether the timeout extension is set properly
      in the cached conntrack entry.  Thus, after a packet recirculates from
      the conntrack action, the timeout policy is not applied properly.  This
      patch fixes the aforementioned issue.
      
      Fixes: 06bd2bdf ("openvswitch: Add timeout support to ct action")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: mpls: fix mpls_xmit for iptunnel · 803f3e22
      Alexey Kodanev authored
      In an mpls over gre/gre6 setup, the rt->rt_gw4 address is not set, and
      neither is rt->rt_gw_family.  Therefore, when rt->rt_gw_family is checked
      in mpls_xmit(), the neigh_xmit() call is skipped.  As a result, such a
      setup no longer works.
      
      This issue was found with LTP mpls03 tests.
      
      Fixes: 1550c171 ("ipv4: Prepare rtable for IPv6 gateway")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Fix nexthop_num_path for blackhole nexthops · 9b5f6841
      David Ahern authored
      Donald reported this sequence:
        ip next add id 1 blackhole
        ip next add id 2 blackhole
        ip ro add 1.1.1.1/32 nhid 1
        ip ro add 1.1.1.2/32 nhid 2
      
      would cause a crash. Backtrace is:
      
      [  151.302790] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  151.304043] CPU: 1 PID: 277 Comm: ip Not tainted 5.3.0-rc5+ #37
      [  151.305078] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
      [  151.306526] RIP: 0010:fib_add_nexthop+0x8b/0x2aa
      [  151.307343] Code: 35 f7 81 48 8d 14 01 c7 02 f1 f1 f1 f1 c7 42 04 01 f4 f4 f4 48 89 f2 48 c1 ea 03 65 48 8b 0c 25 28 00 00 00 48 89 4d d0 31 c9 <80> 3c 02 00 74 08 48 89 f7 e8 1a e8 53 ff be 08 00 00 00 4c 89 e7
      [  151.310549] RSP: 0018:ffff888116c27340 EFLAGS: 00010246
      [  151.311469] RAX: dffffc0000000000 RBX: ffff8881154ece00 RCX: 0000000000000000
      [  151.312713] RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffff888115649b40
      [  151.313968] RBP: ffff888116c273d8 R08: ffffed10221e3757 R09: ffff888110f1bab8
      [  151.315212] R10: 0000000000000001 R11: ffff888110f1bab3 R12: ffff888115649b40
      [  151.316456] R13: 0000000000000020 R14: ffff888116c273b0 R15: ffff888115649b40
      [  151.317707] FS:  00007f60b4d8d800(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
      [  151.319113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  151.320119] CR2: 0000555671ffdc00 CR3: 00000001136ba005 CR4: 0000000000020ee0
      [  151.321367] Call Trace:
      [  151.321820]  ? fib_nexthop_info+0x635/0x635
      [  151.322572]  fib_dump_info+0xaa4/0xde0
      [  151.323247]  ? fib_create_info+0x2431/0x2431
      [  151.324008]  ? napi_alloc_frag+0x2a/0x2a
      [  151.324711]  rtmsg_fib+0x2c4/0x3be
      [  151.325339]  fib_table_insert+0xe2f/0xeee
      ...
      
      fib_dump_info incorrectly has nhs = 0 for blackhole nexthops, so it
      believes the nexthop object is a multipath group (nhs != 1) and ends
      up down the nexthop_mpath_fill_node() path which is wrong for a
      blackhole.
      
      The blackhole check in nexthop_num_path is leftover from early days
      of the blackhole implementation which did not initialize the device.
      In the end the design was simpler (fewer special case checks) to set
      the device to loopback in nh_info, so the check in nexthop_num_path
      should have been removed.
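
      The effect of the leftover check can be sketched with simplified
      stand-ins.  This is an illustration under assumptions, not the kernel
      code: the struct layout and all demo_* names are hypothetical, and only
      the "nhs != 1 means multipath" test is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a nexthop object. */
struct demo_nexthop {
	bool is_group;		/* multipath group? */
	unsigned int num_nh;	/* member count when is_group is set */
	bool reject_nh;		/* blackhole nexthop */
};

/* Behaviour before the fix, as described: a blackhole counts as 0 paths. */
unsigned int demo_num_path_old(const struct demo_nexthop *nh)
{
	if (nh->is_group)
		return nh->num_nh;
	return nh->reject_nh ? 0 : 1;
}

/* After the fix: any non-group nexthop, blackhole included, is 1 path. */
unsigned int demo_num_path_new(const struct demo_nexthop *nh)
{
	if (nh->is_group)
		return nh->num_nh;
	return 1;
}

/* The dump-side test from the message: anything but 1 means multipath. */
bool demo_dump_treats_as_multipath(unsigned int nhs)
{
	return nhs != 1;
}
```

      A single blackhole nexthop is sent down the multipath path by the old
      count and handled as a plain nexthop by the new one.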
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Linux 5.3-rc6 · a55aa89a
      Linus Torvalds authored
    • Merge tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux · c749088f
      Linus Torvalds authored
      Pull auxdisplay cleanup from Miguel Ojeda:
       "Make ht16k33_fb_fix and ht16k33_fb_var constant (Nishka Dasgupta)"
      
      * tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux:
        auxdisplay: ht16k33: Make ht16k33_fb_fix and ht16k33_fb_var constant
    • Merge tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · 32ae83ff
      Linus Torvalds authored
      Pull UML fix from Richard Weinberger:
       "Fix time travel mode"
      
      * tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: fix time travel mode
    • Merge tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs · 94a76d9b
      Linus Torvalds authored
      Pull UBIFS and JFFS2 fixes from Richard Weinberger:
       "UBIFS:
         - Don't block too long in writeback_inodes_sb()
         - Fix for a possible overrun of the log head
         - Fix double unlock in orphan_delete()
      
        JFFS2:
         - Remove C++ style from UAPI header and unbreak picky toolchains"
      
      * tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
        ubifs: Limit the number of pages in shrink_liability
        ubifs: Correctly initialize c->min_log_bytes
        ubifs: Fix double unlock around orphan_delete()
        jffs2: Remove C++ style comments from uapi header
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 146c3d32
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A few fixes for x86:
      
         - Fix a boot regression caused by the recent bootparam sanitizing
           change, which escaped the attention of all people who reviewed that
           code.
      
         - Address a boot problem on machines with broken E820 tables caused
           by an underflow which ended up placing the trampoline start at
           physical address 0.
      
         - Handle machines which do not advertise a legacy timer of any form,
           but need calibration of the local APIC timer gracefully by making
           the calibration routine independent from the tick interrupt. Marked
           for stable as well, as there seem to be quite a few new laptops
           rolled out which expose this.
      
         - Clear the RDRAND CPUID bit on AMD family 15h and 16h CPUs which are
           affected by broken firmware which does not initialize RDRAND
           correctly after resume. Add a command line parameter to override
           this for machines which either do not use suspend/resume or have a
           fixed BIOS. Unfortunately there is no way to detect this on boot,
           so the only safe decision is to turn it off by default.
      
         - Prevent RFLAGS from being clobbered in CALL_NOSPEC on 32bit which
           caused fast KVM instruction emulation to break.
      
         - Explain the Intel CPU model naming convention so that the repeating
           discussions come to an end"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
        x86/boot: Fix boot regression caused by bootparam sanitizing
        x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
        x86/boot/compressed/64: Fix boot on machines with broken E820 table
        x86/apic: Handle missing global clockevent gracefully
        x86/cpu: Explain Intel model naming convention
    • Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5a13fc3d
      Linus Torvalds authored
      Pull timekeeping fix from Thomas Gleixner:
       "A single fix for a regression caused by the generic VDSO
        implementation where a math overflow causes CLOCK_BOOTTIME to become a
        random number generator"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
    • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8a04c2ee
      Linus Torvalds authored
      Pull scheduler fix from Thomas Gleixner:
       "Handle the worker management in situations where a task is scheduled
        out on a PI lock contention correctly and schedule a new worker if
        possible"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/core: Schedule new worker even if PI-blocked
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 05bbb936
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "Two small fixes for kprobes and perf:
      
         - Prevent a deadlock in kprobe_optimizer() caused by reverse lock
           ordering
      
         - Fix a comment typo"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kprobes: Fix potential deadlock in kprobe_optimizer()
        perf/x86: Fix typo in comment
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 44c471e4
      Linus Torvalds authored
      Pull irq fix from Thomas Gleixner:
       "A single fix for an imbalanced kobject operation in the irq descriptor
        code which was unearthed by the new warnings in the kobject code"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Properly pair kobject_del() with kobject_add()
    • Merge branch 'akpm' (patches from Andrew) · f47edb59
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "11 fixes"
      
      Mostly VM fixes, one psi polling fix, and one parisc build fix.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
        mm/zsmalloc.c: fix race condition in zs_destroy_pool
        mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
        mm, page_owner: handle THP splits correctly
        userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
        psi: get poll_work to run when calling poll syscall next time
        mm: memcontrol: flush percpu vmevents before releasing memcg
        mm: memcontrol: flush percpu vmstats before releasing memcg
        parisc: fix compilation errrors
        mm, page_alloc: move_freepages should not examine struct page of reserved memory
        mm/z3fold.c: fix race between migration and destruction
    • ALSA: seq: Fix potential concurrent access to the deleted pool · 75545304
      Takashi Iwai authored
      The input pool of a client might be deleted via the resize ioctl, so
      access to it should be covered by the proper locks.  Currently the
      only missing place is the call in snd_seq_ioctl_get_client_pool(), and
      this patch papers over it.
      
      Reported-by: syzbot+4a75454b9ca2777f35c7@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
    • Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping · e67095fd
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
       "Two fixes for regressions in this merge window:
      
         - select the Kconfig symbols for the noncoherent dma arch helpers on
           arm if swiotlb is selected, not just for LPAE, so as not to break the
           Xen build, which uses swiotlb indirectly through swiotlb-xen
      
         - fix the page allocator fallback in dma_alloc_contiguous if the CMA
           allocation fails"
      
      * tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
        dma-direct: fix zone selection after an unaddressable CMA allocation
        arm: select the dma-noncoherent symbols for all swiotlb builds
    • mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y · 00fb24a4
      Andrey Ryabinin authored
      The code like this:
      
      	ptr = kmalloc(size, GFP_KERNEL);
      	page = virt_to_page(ptr);
      	offset = offset_in_page(ptr);
      	kfree(page_address(page) + offset);
      
      may produce false-positive invalid-free reports on the kernel with
      CONFIG_KASAN_SW_TAGS=y.
      
      In the example above we lose the original tag assigned to 'ptr', so
      kfree() gets the pointer with the 0xFF tag.  kfree() then finds that the
      0xFF tag differs from the tag in the shadow and prints a false report.
      
      Instead of just comparing tags, do the following:
      
      1) Check that the shadow doesn't contain KASAN_TAG_INVALID.  Otherwise it's
         a double-free and it doesn't matter what tag the pointer has.
      
      2) If pointer tag is different from 0xFF, make sure that tag in the
         shadow is the same as in the pointer.
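
      The two steps can be written as a small decision function.  This is a
      sketch under assumptions rather than the patch itself: the demo_* names
      are hypothetical, 0xFF is the tag named in the message, and the value
      used for the invalid tag (0xFE) is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TAG_KERNEL  0xFF	/* the 0xFF tag named in the message */
#define DEMO_TAG_INVALID 0xFE	/* assumed value, stands in for KASAN_TAG_INVALID */

/* Old behaviour as described: report whenever the two tags differ. */
bool demo_invalid_free_old(uint8_t ptr_tag, uint8_t shadow_tag)
{
	return ptr_tag != shadow_tag;
}

/* New behaviour following steps 1) and 2). Returns true to report. */
bool demo_invalid_free_new(uint8_t ptr_tag, uint8_t shadow_tag)
{
	/* Step 1: freed memory is a double-free whatever the pointer tag. */
	if (shadow_tag == DEMO_TAG_INVALID)
		return true;
	/* Step 2: only a pointer that still carries a real tag is compared. */
	if (ptr_tag == DEMO_TAG_KERNEL)
		return false;
	return ptr_tag != shadow_tag;
}
```

      A pointer that lost its tag (0xFF) against a live object is reported by
      the old check and accepted by the new one, while a double-free is still
      caught.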
      
      Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
      Fixes: 7f94ffbc ("kasan: add hooks implementation for tag-based mode")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: fix race condition in zs_destroy_pool · 701d6785
      Henry Burns authored
      In zs_destroy_pool() we call flush_work(&pool->free_work).  However, we
      have no guarantee that migration isn't happening in the background at
      that time.
      
      Since migration can't directly free pages, it relies on free_work being
      scheduled to free the pages.  But there's nothing preventing an
      in-progress migrate from queuing the work *after*
      zs_unregister_migration() has called flush_work().  That would leave
      pages still pointing at the inode when we free it.
      
      Since we know at destroy time all objects should be free, no new
      migrations can come in (since zs_page_isolate() fails for fully-free
      zspages).  This means it is sufficient to track a "# isolated zspages"
      count by class, and have the destroy logic ensure all such pages have
      drained before proceeding.  Keeping that state under the class spinlock
      keeps the logic straightforward.
      
      In this case a memory leak could lead to an eventual crash if compaction
      hits the leaked page.  This crash would only occur if people are
      changing their zswap backend at runtime (which eventually starts
      destruction).
      
      Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
      Fixes: 48b4800a ("zsmalloc: page migration support")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely · 1a87aa03
      Henry Burns authored
      In zs_page_migrate() we call putback_zspage() after we have finished
      migrating all pages in this zspage.  However, the return value is
      ignored.  If a zs_free() races in between zs_page_isolate() and
      zs_page_migrate(), freeing the last object in the zspage,
      putback_zspage() will leave the page in ZS_EMPTY for potentially an
      unbounded amount of time.
      
      To fix this, we need to do the same thing as zs_page_putback() does:
      schedule free_work to occur.
      
      To avoid duplicated code, move the sequence to a new
      putback_zspage_deferred() function which both zs_page_migrate() and
      zs_page_putback() call.
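
      The shape of such a helper can be sketched as follows.  This is a
      simplified model, not the zsmalloc code: the structs, the fullness
      classes and every demo_* name are hypothetical, and a flag stands in for
      scheduling free_work.

```c
#include <assert.h>

enum demo_fullness { DEMO_ZS_EMPTY, DEMO_ZS_IN_USE, DEMO_ZS_FULL };

struct demo_zspage {
	int inuse;		/* objects currently allocated */
	int max_objects;	/* capacity of the zspage */
};

struct demo_pool {
	int free_work_scheduled;	/* stands in for scheduling free_work */
};

/* Puts the zspage back and reports which fullness class it landed in. */
enum demo_fullness demo_putback_zspage(const struct demo_zspage *zspage)
{
	if (zspage->inuse == 0)
		return DEMO_ZS_EMPTY;
	if (zspage->inuse == zspage->max_objects)
		return DEMO_ZS_FULL;
	return DEMO_ZS_IN_USE;
}

/* Shared helper: never ignore the result, defer the free when empty. */
void demo_putback_zspage_deferred(struct demo_pool *pool,
				  const struct demo_zspage *zspage)
{
	if (demo_putback_zspage(zspage) == DEMO_ZS_EMPTY)
		pool->free_work_scheduled = 1;
}
```

      Because both call sites go through the one helper, an empty zspage is
      handed to the deferred free path no matter which of them put it back.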
      
      Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
      Fixes: 48b4800a ("zsmalloc: page migration support")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_owner: handle THP splits correctly · f7da677b
      Vlastimil Babka authored
      THP splitting path is missing the split_page_owner() call that
      split_page() has.
      
      As a result, split THP pages are wrongly reported in the page_owner file
      as order-9 pages.  Furthermore when the former head page is freed, the
      remaining former tail pages are not listed in the page_owner file at
      all.  This patch fixes that by adding the split_page_owner() call into
      __split_huge_page().
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
      Fixes: a9627bc5 ("mm/page_owner: introduce split_page_owner and replace manual handling")
      Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx · 46d0b24c
      Oleg Nesterov authored
      userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
      mm->core_state != NULL.
      
      Otherwise a page fault can see userfaultfd_missing() == T and use an
      already freed userfaultfd_ctx.
      
      Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
      Fixes: 04f5866e ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • psi: get poll_work to run when calling poll syscall next time · 7b2b55da
      Jason Xing authored
      A user receives POLLPRI correctly only on the first call to the poll
      syscall.  After that, the user always fails to acquire the event
      signal.
      
      Reproduce case:
       1. Get the monitor code in Documentation/accounting/psi.txt
       2. Run it, and wait for the event triggered.
       3. Kill and restart the process.
      
      The question is why we can end up with poll_scheduled = 1 but the work
      not running (which would reset it to 0).  And the answer is because the
      scheduling side sees group->poll_kworker under RCU protection and then
      schedules it, but here we cancel the work and destroy the worker.  The
      cancel needs to pair with resetting the poll_scheduled flag.
      
      Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
      Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: flush percpu vmevents before releasing memcg · bb65f89b
      Roman Gushchin authored
      Similar to vmstats, percpu caching of local vmevents leads to an
      accumulation of errors on non-leaf levels.  This happens because some
      leftovers may remain in percpu caches, so that they are never propagated
      up the cgroup tree and simply disappear when the memory cgroup is
      released.
      
      To fix this issue let's accumulate and propagate percpu vmevents values
      before releasing the memory cgroup similar to what we're doing with
      vmstats.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: flush percpu vmstats before releasing memcg · c350a99e
      Roman Gushchin authored
      Percpu caching of local vmstats with the conditional propagation by the
      cgroup tree leads to an accumulation of errors on non-leaf levels.
      
      Let's imagine two nested memory cgroups A and A/B.  Say, a process
      belonging to A/B allocates 100 pagecache pages on the CPU 0.  The percpu
      cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
      and A atomic vmstat counters, 4 pages will remain in the percpu cache.
      
      Imagine A/B is close to memory.max, so that every following allocation
      triggers a direct reclaim on the local CPU.  Say, each such attempt will
      free 16 pages on a new cpu.  That means every percpu cache will have -16
      pages, except the first one, which will have 4 - 16 = -12.  A/B and A
      atomic counters will not be touched at all.
      
      Now a user removes A/B.  All percpu caches are freed and corresponding
      vmstat numbers are forgotten.  A has 96 pages more than expected.
      
      As memory cgroups are created and destroyed, errors do accumulate.  Even
      1-2 pages differences can accumulate into large numbers.
      
      To fix this issue let's accumulate and propagate percpu vmstat values
      before releasing the memory cgroup.  At this point these numbers are
      stable and cannot be changed.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
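
      The arithmetic of the A/B example can be replayed with a toy model.
      This is a sketch under assumptions, not the memcg code: one counter
      stands in for the atomic vmstat value, the batch size of 32 and the
      spill rule (spill once the cached value reaches the batch) are chosen to
      reproduce the numbers in the message, four CPUs are assumed, and all
      demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_BATCH   32	/* assumed spill threshold, matches 32*3=96 */
#define DEMO_NR_CPUS 4	/* assumed CPU count for the example */

struct demo_memcg {
	long atomic_pages;		/* value visible up the tree */
	long percpu[DEMO_NR_CPUS];	/* per-CPU cached deltas */
};

/* Cache the delta per CPU and spill it once it reaches the batch size. */
void demo_mod_state(struct demo_memcg *cg, int cpu, long delta)
{
	long x = cg->percpu[cpu] + delta;

	if (labs(x) >= DEMO_BATCH) {
		cg->atomic_pages += x;
		x = 0;
	}
	cg->percpu[cpu] = x;
}

/* The fix in miniature: fold every cached delta in before release. */
void demo_flush_on_release(struct demo_memcg *cg)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		cg->atomic_pages += cg->percpu[cpu];
		cg->percpu[cpu] = 0;
	}
}
```

      Charging 100 pages one at a time on CPU 0 leaves 96 in the atomic
      counter and 4 in the cache.  Freeing 16 pages on each of the four CPUs
      then changes only the caches (-12, -16, -16, -16), so without the flush
      the counter would stay at 96; with it, the counter ends at 36, which is
      100 - 64.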
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c350a99e
    • Qian Cai's avatar
      parisc: fix compilation errors · bbcb03a9
      Qian Cai authored
      Commit 0cfaee2a ("include/asm-generic/5level-fixup.h: fix variable
      'p4d' set but not used") converted a few functions from macros to static
      inline, which causes parisc to complain,
      
        In file included from include/asm-generic/4level-fixup.h:38:0,
                         from arch/parisc/include/asm/pgtable.h:5,
                         from arch/parisc/include/asm/io.h:6,
                         from include/linux/io.h:13,
                         from sound/core/memory.c:9:
        include/asm-generic/5level-fixup.h:14:18: error: unknown type name 'pgd_t'; did you mean 'pid_t'?
         #define p4d_t    pgd_t
                          ^
        include/asm-generic/5level-fixup.h:24:28: note: in expansion of macro 'p4d_t'
         static inline int p4d_none(p4d_t p4d)
                                    ^~~~~
      
      It is because "4level-fixup.h" is included before "asm/page.h" where
      "pgd_t" is defined.
      
      Link: http://lkml.kernel.org/r/20190815205305.1382-1-cai@lca.pw
      Fixes: 0cfaee2a ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbcb03a9
    • David Rientjes's avatar
      mm, page_alloc: move_freepages should not examine struct page of reserved memory · cd961038
      David Rientjes authored
      After commit 907ec5fc ("mm: zero remaining unavailable struct
      pages"), struct page of reserved memory is zeroed.  This causes
      page->flags to be 0 and fixes issues related to reading
      /proc/kpageflags, for example, of reserved memory.
      
      The VM_BUG_ON() in move_freepages_block(), however, assumes that
      page_zone() is meaningful even for reserved memory.  That assumption is
      no longer true after the aforementioned commit.
      
      There's no reason why move_freepages_block() should be testing the
      legitimacy of page_zone() for reserved memory; its scope is limited only
      to pages on the zone's freelist.
      
      Note that pfn_valid() can be true for reserved memory: there is a
      backing struct page.  The check for page_to_nid(page) is also buggy but
      reserved memory normally only appears on node 0 so the zeroing doesn't
      affect this.
      
      Move the debug checks to after verifying PageBuddy is true.  This
      isolates the scope of the checks to only be for buddy pages which are on
      the zone's freelist which move_freepages_block() is operating on.  In
      this case, an incorrect node or zone is a bug worthy of being warned
      about (and the examination of struct page is acceptable because this
      memory is not reserved).
      
      Why does move_freepages_block() get called on reserved memory? It's
      simply math after finding a valid free page from the per-zone free area
      to use as fallback.  We find the beginning and end of the pageblock of
      the valid page and that can bring us into memory that was reserved per
      the e820.  pfn_valid() is still true (it's backed by a struct page), but
      since it's zeroed we shouldn't make any inferences here about comparing
      its node or zone.  The current node check just happens to succeed most
      of the time by luck because reserved memory typically appears on node 0.
      
      The fix here is to validate that we actually have buddy pages before
      testing if there's any type of zone or node strangeness going on.
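
The reordering can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel loop: `struct sim_page` and `sim_move_freepages()` are hypothetical, a zeroed entry stands in for a reserved struct page, and a mismatch is counted instead of triggering VM_BUG_ON().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; all-zero models reserved memory. */
struct sim_page {
    int is_buddy;     /* models PageBuddy() */
    int zone_id;      /* models page_zone() */
    unsigned order;   /* models the buddy order of a free block */
};

/*
 * Walk a pageblock and count the pages that would be moved.  The zone
 * check runs only after the buddy test, so zeroed (reserved) entries
 * are skipped without their zone ever being examined.
 */
static size_t sim_move_freepages(const struct sim_page *pages, size_t n,
                                 int zone_id, size_t *mismatches)
{
    size_t moved = 0;
    size_t i = 0;

    *mismatches = 0;
    while (i < n) {
        const struct sim_page *p = &pages[i];

        if (!p->is_buddy) {
            i++;
            continue;
        }
        /* Debug check, now limited to pages on the free list. */
        if (p->zone_id != zone_id)
            (*mismatches)++;

        moved += (size_t)1 << p->order;
        i += (size_t)1 << p->order;
    }
    return moved;
}
```

With the check placed before the buddy test, every zeroed entry in a pageblock belonging to a non-zero zone would be reported as a mismatch, which is the false positive described above.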
      
      We noticed it almost immediately after bringing 907ec5fc in on
      CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
      the per-zone free area where the math in move_freepages() will bring the
      start or end pfn into reserved memory and wanting to claim that entire
      pageblock as a new migratetype.  So the path will be rare, require
      CONFIG_DEBUG_VM, and require fallback to a different migratetype.
      
      Some struct pages were already zeroed from reserve pages before
      907ec5fca3c so it theoretically could trigger before this commit.  I
      think it's rare enough under a config option that most people don't run
      that others may not have noticed.  I wouldn't argue against a stable tag
      and the backport should be easy enough, but probably wouldn't single out
      a commit that this is fixing.
      
      Mel said:
      
      : The overhead of the debugging check is higher with this patch although
      : it'll only affect debug builds and the path is not particularly hot.
      : If this was a concern, I think it would be reasonable to simply remove
      : the debugging check as the zone boundaries are checked in
      : move_freepages_block and we never expect a zone/node to be smaller than
      : a pageblock and stuck in the middle of another zone.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd961038
    • Henry Burns's avatar
      mm/z3fold.c: fix race between migration and destruction · d776aaa9
      Henry Burns authored
      In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
      However, we have no guarantee that migration isn't happening in the
      background at that time.
      
      Migration directly calls queue_work_on(pool->compact_wq); if destruction
      wins that race, we are using a destroyed workqueue.
      
      Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
      Signed-off-by: Henry Burns <henryburns@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d776aaa9
  4. 24 Aug, 2019 7 commits
    • Zhu Yanjun's avatar
      net: rds: add service level support in rds-info · e0e6d062
      Zhu Yanjun authored
      From IB specification 7.6.5 SERVICE LEVEL, Service Level (SL)
      is used to identify different flows within an IBA subnet.
      It is carried in the local route header of the packet.
      
      Before this commit, running "rds-info -I" gives the output
      below:
      "
      RDS IB Connections:
       LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
      192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
      192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
      192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
      "
      After this commit, the output is as below:
      "
      RDS IB Connections:
       LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
      192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
      192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
      192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
      "
      
      The commit fe3475af ("net: rds: add per rds connection cache
      statistics") adds cache_allocs in struct rds_info_rdma_connection
      as below:
      struct rds_info_rdma_connection {
      ...
              __u32           rdma_mr_max;
              __u32           rdma_mr_size;
              __u8            tos;
              __u32           cache_allocs;
       };
      The peer struct in rds-tools of struct rds_info_rdma_connection is as
      below:
      struct rds_info_rdma_connection {
      ...
              uint32_t        rdma_mr_max;
              uint32_t        rdma_mr_size;
              uint8_t         tos;
              uint8_t         sl;
              uint32_t        cache_allocs;
      };
      The difference between userspace and kernel is the member variable sl,
      which is missing from the kernel struct. The two sides therefore disagree
      on the struct definition, which is risky. This commit adds sl on the
      kernel side to avoid that risk.
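
How much the missing member matters depends on padding, which the message does not spell out. The sketch below compares only the trailing fields shown above (the elided leading members are left out on purpose) under two assumed layouts: natural alignment and a packed layout. The `*_tail` struct names are hypothetical, and `__attribute__((packed))` is a GCC/Clang extension used here purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Kernel-side tail as shown above: no sl member. */
struct kernel_tail {
    uint32_t rdma_mr_max;
    uint32_t rdma_mr_size;
    uint8_t  tos;
    uint32_t cache_allocs;
};

/* rds-tools-side tail as shown above: sl sits between tos and cache_allocs. */
struct user_tail {
    uint32_t rdma_mr_max;
    uint32_t rdma_mr_size;
    uint8_t  tos;
    uint8_t  sl;
    uint32_t cache_allocs;
};

/* Same fields, assuming a packed layout (no padding inserted). */
struct kernel_tail_packed {
    uint32_t rdma_mr_max;
    uint32_t rdma_mr_size;
    uint8_t  tos;
    uint32_t cache_allocs;
} __attribute__((packed));

struct user_tail_packed {
    uint32_t rdma_mr_max;
    uint32_t rdma_mr_size;
    uint8_t  tos;
    uint8_t  sl;
    uint32_t cache_allocs;
} __attribute__((packed));
```

With natural alignment, `offsetof(..., cache_allocs)` is the same for both unpacked structs, so userspace reads sl from a padding byte that the kernel never fills in. With a packed layout the offsets differ by one byte, so cache_allocs itself would be misread.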
      
      Fixes: fe3475af ("net: rds: add per rds connection cache statistics")
      CC: Joe Jin <joe.jin@oracle.com>
      CC: JUNXIAO_BI <junxiao.bi@oracle.com>
      Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0e6d062
    • John Fastabend's avatar
      net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      John Fastabend authored
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last  header  which  has the type
        NLMSG_DONE.
      
      but after commit ee28906f there is a missing NLM_F_MULTI flag in the
      middle of a FIB dump. As a result, user space applications that follow the
      above man page excerpt may get confused and stop parsing the message,
      believing something went wrong.
      
      In the golang netlink lib [0] the library logic stops parsing believing the
      message is not a multipart message. Found this running Cilium[1] against
      net-next while adding a feature to auto-detect routes. I noticed with
      multiple route tables we no longer could detect the default routes on net
      tree kernels because the library logic was not returning them.
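
A strict userspace reader of this kind can be modelled as follows. This is a sketch, not the golang library's code: `struct sim_hdr` and `parse_multipart()` are hypothetical, and the `SIM_` constants are local copies that mirror the values of NLM_F_MULTI and NLMSG_DONE from linux/netlink.h so the example builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_NLM_F_MULTI 0x2  /* mirrors NLM_F_MULTI */
#define SIM_NLMSG_DONE  0x3  /* mirrors NLMSG_DONE */

/* Hypothetical reduced header: only the fields the parser looks at. */
struct sim_hdr {
    uint16_t type;
    uint16_t flags;
};

/*
 * Count the messages a strict reader accepts from one dump.  A header
 * without NLM_F_MULTI is taken to be a complete single-part reply, so
 * the reader stops there and never sees the rest of the dump.
 */
static size_t parse_multipart(const struct sim_hdr *h, size_t n)
{
    size_t parsed = 0;

    for (size_t i = 0; i < n; i++) {
        if (h[i].type == SIM_NLMSG_DONE)
            break;
        parsed++;
        if (!(h[i].flags & SIM_NLM_F_MULTI))
            break;
    }
    return parsed;
}
```

For a dump of four route messages followed by NLMSG_DONE, this reader returns all four when every header carries the flag, but only two when the second header is missing it, which matches the symptom of routes silently disappearing.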
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
      
      Tested with the Cilium stack, and auto-detection of routes works again.
      Also annotated the libs to dump netlink msgs and confirmed that the
      NLM_F_MULTI and NLMSG_DONE flags look correct after this.
      
      Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
      as is done for fib_dump_info() so this looks correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e93fb3e9
    • Julian Wiedmann's avatar
      s390/qeth: reject oversized SNMP requests · 292a50e3
      Julian Wiedmann authored
      Commit d4c08afa ("s390/qeth: streamline SNMP cmd code") removed
      the bounds checking for req_len, under the assumption that the check in
      qeth_alloc_cmd() would suffice.
      
      But that code path isn't sufficiently robust to handle a user-provided
      data_length, which could overflow (when adding the cmd header overhead)
      before being checked against QETH_BUFSIZE. We end up allocating just a
      tiny iob, and the subsequent copy_from_user() writes past the end of
      that iob.
      
      Special-case this path and add a coarse bounds check, to protect against
      malicious requests. This lets the subsequent code flow do its normal
      job and precise checking, without risk of overflow.
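
The overflow and the coarse check can be sketched in isolation. This is not the driver code: `naive_len_ok()` and `guarded_len_ok()` are hypothetical helpers, `SIM_BUFSIZE` stands in for QETH_BUFSIZE with an assumed value, and `SIM_HDR_LEN` is an assumed header overhead.

```c
#include <stdint.h>

#define SIM_BUFSIZE 4096u  /* assumption: stands in for QETH_BUFSIZE */
#define SIM_HDR_LEN 64u    /* assumption: cmd header overhead */

/* Adds the header first, so a huge data_len wraps around and passes. */
static int naive_len_ok(uint32_t data_len)
{
    uint32_t total = (uint32_t)(data_len + SIM_HDR_LEN);

    return total <= SIM_BUFSIZE;
}

/* Rejects oversized requests before any arithmetic can wrap. */
static int guarded_len_ok(uint32_t data_len)
{
    if (data_len > SIM_BUFSIZE)
        return 0;  /* coarse bounds check */

    return (uint32_t)(data_len + SIM_HDR_LEN) <= SIM_BUFSIZE;
}
```

A request with data_len close to UINT32_MAX passes the naive check because the sum wraps to a small value, while the guarded version rejects it and still leaves the precise check to the later code.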
      
      Fixes: d4c08afa ("s390/qeth: streamline SNMP cmd code")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      292a50e3
    • zhanglin's avatar
      sock: fix potential memory leak in proto_register() · b45ce321
      zhanglin authored
      If the number of registered protocols exceeds PROTO_INUSE_NR, prot will
      be added to proto_list, but there is no available bit left for prot in
      proto_inuse_idx.
      
      Changes since v2:
      * Propagate the error code properly
      Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b45ce321
    • David S. Miller's avatar
      Merge tag 'mlx5-fixes-2019-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · d37fb975
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 fixes 2019-08-22
      
      This series introduces some fixes to mlx5 driver.
      
      1) From Moshe, two fixes for firmware health reporter
      2) From Eran, two ktls fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d37fb975
    • Andrew Lunn's avatar
      MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT · 0c69b19f
      Andrew Lunn authored
      Russell King maintains phylink, as part of the SFP module support.
      However, much of the review work is about drivers swapping from phylib
      to phylink. Such changes don't make changes to the phylink core, and
      so the F: rules in MAINTAINERS don't match. Add a K:, keyword rule,
      which hopefully get_maintainers will match against for patches to MAC
      drivers swapping to phylink.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c69b19f
    • David S. Miller's avatar
      Merge branch 'collect_md-mode-dev-null' · 9b45ff91
      David S. Miller authored
      Hangbin Liu says:
      
      ====================
      fix dev null pointer dereference when sending packets larger than mtu in collect_md mode
      
      When we send a packet larger than PMTU, we need to reply with
      icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).
      
      But with collect_md mode, the kernel will crash while accessing the dst dev,
      as __metadata_dst_init() initializes dst->dev to NULL by default. Here is what
      the code path looks like, for GRE:
      
      - ip6gre_tunnel_xmit
        - ip6gre_xmit_ipv4
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmp_send
            - net = dev_net(rt->dst.dev); <-- here
        - ip6gre_xmit_ipv6
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmpv6_send
            ...
            - decode_session4
              - oif = skb_dst(skb)->dev->ifindex; <-- here
            - decode_session6
              - oif = skb_dst(skb)->dev->ifindex; <-- here
      
      We could not fix it in __metadata_dst_init() as there is no dev supplied.
      Looking into the __icmp_send()/decode_session{4,6} code, we find that the dst
      dev is actually not needed. In __icmp_send(), we could get the net from skb->dev.
      For decode_session{4,6}, as it is called by xfrm_decode_session_reverse()
      in this scenario, the oif is not used, because of
      fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
      
      The reproducer is easy:
      
      ovs-vsctl add-br br0
      ip link set br0 up
      ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$dst_addr
      ip link set gre0 up
      ip addr add ${local_gre6}/64 dev br0
      ping6 $remote_gre6 -s 1500
      
      The kernel will crash like
      [40595.821651] BUG: kernel NULL pointer dereference, address: 0000000000000108
      [40595.822411] #PF: supervisor read access in kernel mode
      [40595.822949] #PF: error_code(0x0000) - not-present page
      [40595.823492] PGD 0 P4D 0
      [40595.823767] Oops: 0000 [#1] SMP PTI
      [40595.824139] CPU: 0 PID: 2831 Comm: handler12 Not tainted 5.2.0 #57
      [40595.824788] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8.1.0+2983+b2ae9c0a 04/01/2014
      [40595.825680] RIP: 0010:__xfrm_decode_session+0x6b/0x930
      [40595.826219] Code: b7 c0 00 00 00 b8 06 00 00 00 66 85 d2 0f b7 ca 48 0f 45 c1 44 0f b6 2c 06 48 8b 47 58 48 83 e0 fe 0f 84 f4 04 00 00 48 8b 00 <44> 8b 80 08 01 00 00 41 f6 c4 01 4c 89 e7 ba 58 00 00 00 0f 85 47
      [40595.828155] RSP: 0018:ffffc90000a73438 EFLAGS: 00010286
      [40595.828705] RAX: 0000000000000000 RBX: ffff8881329d7100 RCX: 0000000000000000
      [40595.829450] RDX: 0000000000000000 RSI: ffff8881339e70ce RDI: ffff8881329d7100
      [40595.830191] RBP: ffffc90000a73470 R08: 0000000000000000 R09: 000000000000000a
      [40595.830936] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a73490
      [40595.831682] R13: 000000000000002c R14: ffff888132ff1301 R15: ffff8881329d7100
      [40595.832427] FS:  00007f5bfcfd6700(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
      [40595.833266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [40595.833883] CR2: 0000000000000108 CR3: 000000013a368000 CR4: 00000000000006f0
      [40595.834633] Call Trace:
      [40595.835392]  ? rt6_multipath_hash+0x4c/0x390
      [40595.835853]  icmpv6_route_lookup+0xcb/0x1d0
      [40595.836296]  ? icmpv6_xrlim_allow+0x3e/0x140
      [40595.836751]  icmp6_send+0x537/0x840
      [40595.837125]  icmpv6_send+0x20/0x30
      [40595.837494]  tnl_update_pmtu.isra.27+0x19d/0x2a0 [ip_tunnel]
      [40595.838088]  ip_md_tunnel_xmit+0x1b6/0x510 [ip_tunnel]
      [40595.838633]  gre_tap_xmit+0x10c/0x160 [ip_gre]
      [40595.839103]  dev_hard_start_xmit+0x93/0x200
      [40595.839551]  sch_direct_xmit+0x101/0x2d0
      [40595.839967]  __dev_queue_xmit+0x69f/0x9c0
      [40595.840399]  do_execute_actions+0x1717/0x1910 [openvswitch]
      [40595.840987]  ? validate_set.isra.12+0x2f5/0x3d0 [openvswitch]
      [40595.841596]  ? reserve_sfa_size+0x31/0x130 [openvswitch]
      [40595.842154]  ? __ovs_nla_copy_actions+0x1b4/0xad0 [openvswitch]
      [40595.842778]  ? __kmalloc_reserve.isra.50+0x2e/0x80
      [40595.843285]  ? should_failslab+0xa/0x20
      [40595.843696]  ? __kmalloc+0x188/0x220
      [40595.844078]  ? __alloc_skb+0x97/0x270
      [40595.844472]  ovs_execute_actions+0x47/0x120 [openvswitch]
      [40595.845041]  ovs_packet_cmd_execute+0x27d/0x2b0 [openvswitch]
      [40595.845648]  genl_family_rcv_msg+0x3a8/0x430
      [40595.846101]  genl_rcv_msg+0x47/0x90
      [40595.846476]  ? __alloc_skb+0x83/0x270
      [40595.846866]  ? genl_family_rcv_msg+0x430/0x430
      [40595.847335]  netlink_rcv_skb+0xcb/0x100
      [40595.847777]  genl_rcv+0x24/0x40
      [40595.848113]  netlink_unicast+0x17f/0x230
      [40595.848535]  netlink_sendmsg+0x2ed/0x3e0
      [40595.848951]  sock_sendmsg+0x4f/0x60
      [40595.849323]  ___sys_sendmsg+0x2bd/0x2e0
      [40595.849733]  ? sock_poll+0x6f/0xb0
      [40595.850098]  ? ep_scan_ready_list.isra.14+0x20b/0x240
      [40595.850634]  ? _cond_resched+0x15/0x30
      [40595.851032]  ? ep_poll+0x11b/0x440
      [40595.851401]  ? _copy_to_user+0x22/0x30
      [40595.851799]  __sys_sendmsg+0x58/0xa0
      [40595.852180]  do_syscall_64+0x5b/0x190
      [40595.852574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [40595.853105] RIP: 0033:0x7f5c00038c7d
      [40595.853489] Code: c7 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e f7 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d7 f7 ff ff 48 89 d0 48 83 c4 08 48 3d 01
      [40595.855443] RSP: 002b:00007f5bfcf73c00 EFLAGS: 00003293 ORIG_RAX: 000000000000002e
      [40595.856244] RAX: ffffffffffffffda RBX: 00007f5bfcf74a60 RCX: 00007f5c00038c7d
      [40595.856990] RDX: 0000000000000000 RSI: 00007f5bfcf73c60 RDI: 0000000000000015
      [40595.857736] RBP: 0000000000000004 R08: 0000000000000b7c R09: 0000000000000110
      [40595.858613] R10: 0001000800050004 R11: 0000000000003293 R12: 000055c2d8329da0
      [40595.859401] R13: 00007f5bfcf74120 R14: 0000000000000347 R15: 00007f5bfcf73c60
      [40595.860185] Modules linked in: ip_gre ip_tunnel gre openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc bochs_drm ttm drm_kms_helper drm pcspkr joydev i2c_piix4 qemu_fw_cfg xfs libcrc32c virtio_net net_failover serio_raw failover ata_generic virtio_blk pata_acpi floppy
      [40595.863155] CR2: 0000000000000108
      [40595.863551] ---[ end trace 22209bbcacb4addd ]---
      
      v4: Julian Anastasov reminded that skb->dev could also be NULL in icmp_send.
      We had better still use dst.dev and add a check to avoid the crash.

      v3: only replaced pkg with packets in the cover letter, so I didn't update
      the version info in the follow-up patches.
      
      v2: fix it in __icmp_send() and decode_session{4,6} separately instead of
      updating shared dst dev in {ip_md, ip6}_tunnel_xmit.
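
The kind of check described in the v4 note can be sketched in userspace as below. This is only an illustration under assumptions: `struct sim_net_device`, `struct sim_dst_entry` and `sim_oif()` are hypothetical stand-ins, and falling back to 0 when no device is attached is a choice made for this example rather than something stated in the cover letter.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct dst_entry. */
struct sim_net_device {
    int ifindex;
};

struct sim_dst_entry {
    struct sim_net_device *dev;  /* NULL for a metadata dst in collect_md mode */
};

/* Return the output interface index, or 0 when no device is attached. */
static int sim_oif(const struct sim_dst_entry *dst)
{
    if (dst && dst->dev)
        return dst->dev->ifindex;
    return 0;
}
```

Reading `dst->dev->ifindex` without the check is the pattern marked "<-- here" in the code path above; with the check, a metadata dst without a device yields 0 instead of a NULL pointer dereference.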
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b45ff91