1. 27 Jan, 2015 15 commits
    • net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding · a81b0a4f
      Jay Vosburgh authored
      [ Upstream commit 2c26d34b ]
      
      When using VXLAN tunnels and a sky2 device, I have experienced
      checksum failures of the following type:
      
      [ 4297.761899] eth0: hw csum failure
      [...]
      [ 4297.765223] Call Trace:
      [ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
      [ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
      [ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
      [ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
      [ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
      [ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360
      
      	These are reliably reproduced in a network topology of:
      
      container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch
      
      	When VXLAN encapsulated traffic is received from a similarly
      configured peer, the above warning is generated in the receive
      processing of the encapsulated packet.  Note that the warning is
      associated with the container eth0.
      
              The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
      because the packet is an encapsulated Ethernet frame, the checksum
      generated by the hardware includes the inner protocol and Ethernet
      headers.
      
      	The receive code is careful to update the skb->csum, except in
      __dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
      calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
      to skip over the Ethernet header, but does not update skb->csum when
      doing so.
      
      	This patch resolves the problem by adding a call to
      skb_postpull_rcsum to update the skb->csum after the call to
      eth_type_trans.
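
      The checksum adjustment described above can be modeled in user space.
      This is a minimal sketch, not the kernel code: csum_fold16, csum_bytes,
      csum_sub16 and pull_and_fix_csum are hypothetical names, and the stored
      value is a plain 16-bit ones'-complement sum over the whole frame. It
      shows that once the header bytes are pulled, their contribution has to
      be subtracted for the stored sum to match the remaining data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum over len bytes, read as big-endian 16-bit words. */
static uint16_t csum_bytes(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len <= 65536);		/* keeps the 32-bit accumulator from overflowing */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)			/* odd trailing byte, zero-padded */
		sum += (uint32_t)p[i] << 8;
	return csum_fold16(sum);
}

/* Subtract a pulled prefix's sum: a - b in ones'-complement arithmetic. */
static uint16_t csum_sub16(uint16_t csum, uint16_t pulled)
{
	return csum_fold16((uint32_t)csum + (uint16_t)~pulled);
}

/* Hypothetical model of "pull hdr_len bytes, then fix up the stored sum". */
static uint16_t pull_and_fix_csum(const uint8_t *frame, size_t hdr_len,
				  uint16_t csum)
{
	assert(hdr_len % 2 == 0);	/* even length, like ETH_HLEN == 14 */
	return csum_sub16(csum, csum_bytes(frame, hdr_len));
}
```

      In this model, skipping the fix-up leaves a sum that still covers the
      14 pulled bytes, which is the kind of mismatch the "hw csum failure"
      warning reports.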
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a81b0a4f
    • enic: fix rx skb checksum · 0042721e
      Govindarajulu Varadarajan authored
      [ Upstream commit 17e96834 ]
      
      Hardware always provides the complement of the IP pseudo checksum. The stack
      expects a whole-packet checksum without the pseudo checksum if CHECKSUM_COMPLETE is set.
      
      This causes checksum error in nf & ovs.
      
      kernel: qg-19546f09-f2: hw csum failure
      kernel: CPU: 9 PID: 0 Comm: swapper/9 Tainted: GF          O--------------   3.10.0-123.8.1.el7.x86_64 #1
      kernel: Hardware name: Cisco Systems Inc UCSB-B200-M3/UCSB-B200-M3, BIOS B200M3.2.2.3.0.080820141339 08/08/2014
      kernel: ffff881218f40000 df68243feb35e3a8 ffff881237a43ab8 ffffffff815e237b
      kernel: ffff881237a43ad0 ffffffff814cd4ca ffff8829ec71eb00 ffff881237a43af0
      kernel: ffffffff814c6232 0000000000000286 ffff8829ec71eb00 ffff881237a43b00
      kernel: Call Trace:
      kernel: <IRQ>  [<ffffffff815e237b>] dump_stack+0x19/0x1b
      kernel: [<ffffffff814cd4ca>] netdev_rx_csum_fault+0x3a/0x40
      kernel: [<ffffffff814c6232>] __skb_checksum_complete_head+0x62/0x70
      kernel: [<ffffffff814c6251>] __skb_checksum_complete+0x11/0x20
      kernel: [<ffffffff8155a20c>] nf_ip_checksum+0xcc/0x100
      kernel: [<ffffffffa049edc7>] icmp_error+0x1f7/0x35c [nf_conntrack_ipv4]
      kernel: [<ffffffff814cf419>] ? netif_rx+0xb9/0x1d0
      kernel: [<ffffffffa040eb7b>] ? internal_dev_recv+0xdb/0x130 [openvswitch]
      kernel: [<ffffffffa04c8330>] nf_conntrack_in+0xf0/0xa80 [nf_conntrack]
      kernel: [<ffffffff81509380>] ? inet_del_offload+0x40/0x40
      kernel: [<ffffffffa049e302>] ipv4_conntrack_in+0x22/0x30 [nf_conntrack_ipv4]
      kernel: [<ffffffff815005ca>] nf_iterate+0xaa/0xc0
      kernel: [<ffffffff81509380>] ? inet_del_offload+0x40/0x40
      kernel: [<ffffffff81500664>] nf_hook_slow+0x84/0x140
      kernel: [<ffffffff81509380>] ? inet_del_offload+0x40/0x40
      kernel: [<ffffffff81509dd4>] ip_rcv+0x344/0x380
      
      The hardware verifies the IP and TCP/UDP header checksums but does not provide
      a payload checksum, so use CHECKSUM_UNNECESSARY. Set it only if the packet is a
      valid IP TCP/UDP packet.
      
      Cc: Jiri Benc <jbenc@redhat.com>
      Cc: Stefan Assmann <sassmann@redhat.com>
      Reported-by: Sunil Choudhary <schoudha@redhat.com>
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0042721e
    • team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin · 52bf2a12
      Jiri Pirko authored
      [ Upstream commit b0d11b42 ]
      
      This patch fixes a race condition that may set count_pending to -1,
      which results in an unwanted large burst of ARP messages (in the case
      of "notify peers").
      
      Consider the following scenario:
      
      count_pending == 2
         CPU0                                           CPU1
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
       team_notify_peers
         atomic_add (adding 1 to count_pending)
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 0)
         schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to -1)
      
      Fix this race by using atomic_dec_if_positive - that will prevent
      count_pending running under 0.
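
      The "decrement only if the result stays non-negative" idea can be
      sketched with C11 atomics. This is a user-space illustration, not the
      kernel's implementation; dec_if_positive is a hypothetical name. It
      returns the old value minus one and stores that value only when it is
      not negative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Decrement *v unless that would make it negative.
 * Returns the old value minus one; a negative return means *v was left alone.
 */
static int dec_if_positive(atomic_int *v)
{
	int old, dec;

	assert(v != NULL);
	old = atomic_load(v);
	do {
		dec = old - 1;
		if (dec < 0)
			break;		/* would underflow: do not store */
		/* on failure, old is refreshed with the current value and we retry */
	} while (!atomic_compare_exchange_weak(v, &old, dec));

	return dec;
}
```

      A caller can treat a negative return value as "nothing was pending"
      and skip the notification, so an extra work invocation cannot push the
      counter below zero.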
      
      Fixes: fc423ff0 ("team: add peer notification")
      Fixes: 492b200e  ("team: add support for sending multicast rejoins")
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      52bf2a12
    • alx: fix alx_poll() · 07d187d6
      Eric Dumazet authored
      [ Upstream commit 7a05dc64 ]
      
      Commit d75b1ade ("net: less interrupt masking in NAPI") uncovered
      wrong alx_poll() behavior.
      
      A NAPI poll() handler is supposed to return exactly the budget when/if
      napi_complete() has not been called.
      
      It is also supposed to return number of frames that were received, so
      that netdev_budget can have a meaning.
      
      Also, in case of TX pressure, we still have to dequeue received
      packets : alx_clean_rx_irq() has to be called even if
      alx_clean_tx_irq(alx) returns false, otherwise device is half duplex.
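
      The poll contract described above can be illustrated with a small
      user-space model. It is only a sketch: struct model_dev, model_clean_rx
      and model_poll are hypothetical stand-ins, and setting the completed
      flag takes the place of calling napi_complete(). RX is always drained,
      the full budget is returned while work remains, and the real frame
      count is returned once polling is complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state for the model; not the alx driver's structures. */
struct model_dev {
	int rx_pending;		/* frames waiting in the RX ring */
	bool tx_done;		/* true if TX cleanup finished in one pass */
	bool completed;		/* set when the model "calls napi_complete()" */
};

/* Dequeue up to budget frames and report how many were handled. */
static int model_clean_rx(struct model_dev *d, int budget)
{
	int work = d->rx_pending < budget ? d->rx_pending : budget;

	d->rx_pending -= work;
	return work;
}

static int model_poll(struct model_dev *d, int budget)
{
	bool tx_complete;
	int work;

	assert(budget > 0);
	tx_complete = d->tx_done;		/* result of TX cleanup */
	work = model_clean_rx(d, budget);	/* RX runs regardless of TX */

	if (!tx_complete || work == budget)
		return budget;			/* more work: stay scheduled */

	d->completed = true;			/* stands in for napi_complete() */
	return work;				/* frames actually received */
}
```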
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: d75b1ade ("net: less interrupt masking in NAPI")
      Reported-by: Oded Gabbay <oded.gabbay@amd.com>
      Bisected-by: Oded Gabbay <oded.gabbay@amd.com>
      Tested-by: Oded Gabbay <oded.gabbay@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      07d187d6
    • tcp: Do not apply TSO segment limit to non-TSO packets · 3a136049
      Herbert Xu authored
      [ Upstream commit 843925f3 ]
      
      Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
      
      In fact the problem was completely unrelated to IPsec.  The bug is
      also reproducible if you just disable TSO/GSO.
      
      The problem is that when the MSS goes down, existing queued packets
      on the TX queue that have not been transmitted yet all look like
      TSO packets and get treated as such.
      
      This then triggers a bug where tcp_mss_split_point tells us to
      generate a zero-sized packet on the TX queue.  Once that happens
      we're screwed because the zero-sized packet can never be removed
      by ACKs.
      
      Fixes: 1485348d ("tcp: Apply device TSO segment limit earlier")
      Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      
      Cheers,
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a136049
    • net: Reset secmark when scrubbing packet · 7d19bd80
      Thomas Graf authored
      [ Upstream commit b8fb4e06 ]
      
      skb_scrub_packet() is called when a packet switches between contexts,
      such as between underlay and overlay, between namespaces, or between
      L3 subnets.
      
      While we already scrub the packet mark, connection tracking entry,
      and cached destination, the security mark/context is left intact.
      
      It seems wrong to inherit the security context of a packet when going
      from overlay to underlay or across forwarding paths.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Flavio Leitner <fbl@sysclose.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d19bd80
    • net: Fix stacked vlan offload features computation · 72111ffa
      Toshiaki Makita authored
      [ Upstream commit 796f2da8 ]
      
      When vlan tags are stacked, it is very likely that the outer tag is stored
      in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
      Currently netif_skb_features() first looks at skb->protocol even if there
      is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
      encapsulated by the inner vlan instead of the inner vlan protocol.
      This allows GSO packets to be passed to HW and they end up being
      corrupted.
      
      Fixes: 58e998c6 ("offloading: Force software GSO for multiple vlan tags.")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      72111ffa
    • batman-adv: avoid NULL dereferences and fix if check · b27b36d5
      Antonio Quartulli authored
      [ Upstream commit 0d164491 ]
      
      Gateways having bandwidth_down equal to zero are not accepted
      at all and so are never added to the Gateway list.
      For this reason checking the bandwidth_down member in
      batadv_gw_out_of_range() is useless.
      
      This is probably a copy/paste error and this check was supposed
      to be "!gw_node" only. Moreover, the way the check is written
      now may also lead to a NULL dereference.
      
      Fix this by rewriting the if-condition properly.
      
      Introduced by 414254e3
      ("batman-adv: tvlv - gateway download/upload bandwidth container")
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Reported-by: David Binderman <dcb314@hotmail.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b27b36d5
    • batman-adv: Unify fragment size calculation · 22afb682
      batman-adv: Unify fragment size calculation · 22afb682
      Sven Eckelmann authored
      [ Upstream commit 0402e444 ]
      
      The fragmentation code was replaced in 610bfc6b
      ("batman-adv: Receive fragmented packets and merge") by an implementation which
      can handle up to 16 fragments of a packet. The packet is prepared for the split
      in fragments by the function batadv_frag_send_packet and the actual split is
      done by batadv_frag_create.
      
      Both functions calculate the size of a fragment themselves. But their calculations
      differ because batadv_frag_send_packet also subtracts ETH_HLEN. Therefore,
      the check in batadv_frag_send_packet "can a full fragment be created?" may
      return true even when batadv_frag_create cannot create a full fragment.
      
      The function batadv_frag_create doesn't check the size of the skb before
      splitting it and therefore might try to create a larger fragment than the
      remaining buffer. This creates an integer underflow and an invalid len is given
      to skb_split.
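
      The underflow mentioned above is easy to reproduce with unsigned
      arithmetic. The sketch below is an illustration only, not the patch
      itself (which unifies the two size calculations); frag_split_offset is
      a hypothetical helper that refuses to compute len - frag_size when the
      requested fragment is larger than the remaining data.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute the offset at which a buffer of len bytes is split so that the
 * tail part carries frag_size bytes. Returns 0 and stores the offset, or
 * -1 when the subtraction would wrap around.
 */
static int frag_split_offset(size_t len, size_t frag_size, size_t *offset)
{
	assert(offset != NULL);
	if (frag_size == 0 || frag_size > len)
		return -1;	/* len - frag_size would underflow */
	*offset = len - frag_size;
	return 0;
}
```

      Without the guard, a 100-byte buffer and a 120-byte fragment size would
      yield a huge unsigned offset instead of an error.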
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      22afb682
    • batman-adv: Calculate extra tail size based on queued fragments · 73ef2a1c
      Sven Eckelmann authored
      [ Upstream commit 5b6698b0 ]
      
      The fragmentation code was replaced in 610bfc6b
      ("batman-adv: Receive fragmented packets and merge"). The new code provided a
      mostly unused parameter skb for the merging function. It is used inside the
      function to calculate the additionally needed skb tailroom. But instead of
      increasing its own tailroom, it is only increasing the tailroom of the first
      queued skb. This is not correct in some situations because the first queued
      entry can be a different one than the parameter.
      
      An observed problem was:
      
      1. packet with size 104, total_size 1464, fragno 1 was received
         - packet is queued
      2. packet with size 1400, total_size 1464, fragno 0 was received
         - packet is queued at the end of the list
      3. enough data was received and can be given to the merge function
         (1464 == (1400 - 20) + (104 - 20))
         - merge function gets the 1400 byte large packet as skb argument
      4. merge function gets first entry in queue (104 byte)
         - stored as skb_out
      5. merge function calculates the required extra tail as total_size - skb->len
         - pskb_expand_head tail of skb_out with 64 bytes
      6. merge function tries to squeeze the extra 1380 bytes from the second queued
         skb (1400 byte aka skb parameter) in the 64 extra tail bytes of skb_out
      
      Instead calculate the extra required tail bytes for skb_out also using skb_out
      instead of using the parameter skb. The skb parameter is only used to get the
      total_size from the last received packet. This is also the total_size used to
      decide that all fragments were received.
      Reported-by: Philipp Psurek <philipp.psurek@gmail.com>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Acked-by: Martin Hundebøll <martin@hundeboll.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      73ef2a1c
    • tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts · f26c07f9
      Prashant Sreedharan authored
      [ Upstream commit 05b0aa57 ]
      
      During driver load in tg3_init_one, if the driver detects DMA activity before
      initializing the chip, tg3_halt is called. As part of tg3_halt, interrupts are
      disabled using the routine tg3_disable_ints. This routine was using a mailbox value
      which was not initialized (default value is 0). As a result the driver was writing
      0x00000001 to pci config space register 0, which is the vendor id / device id.
      
      This driver bug was exposed because of the commit a7877b17a667 (PCI: Check only
      the Vendor ID to identify Configuration Request Retry). Also this issue is only
      seen in older generation chipsets like 5722 because config space write to offset
      0 from driver is possible. The newer generation chips ignore writes to offset 0.
      Also without commit a7877b17a667, for these older chips when a GRC reset is
      issued the Bootcode would reprogram the vendor id/device id, which is the reason
      this bug was masked earlier.
      
      Fixed by initializing the interrupt mailbox registers before calling tg3_halt.
      
      Please queue for -stable.
      Reported-by: Nils Holland <nholland@tisys.org>
      Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f26c07f9
    • in6: fix conflict with glibc · 1bd686b1
      stephen hemminger authored
      [ Upstream commit 6d08acd2 ]
      
      Resolve conflicts between glibc definition of IPV6 socket options
      and those defined in Linux headers. Looks like earlier efforts to
      solve this did not cover all the definitions.
      
      It resolves warnings during iproute2 build.
      Please consider for stable as well.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bd686b1
    • netlink: Don't reorder loads/stores before marking mmap netlink frame as available · ef82260c
      Thomas Graf authored
      [ Upstream commit a18e6a18 ]
      
      Each mmap Netlink frame contains a status field which indicates
      whether the frame is unused, reserved, contains data or needs to
      be skipped. Neither loads nor stores may be reordered; they must
      complete before the status field is changed and another CPU might
      pick up the frame for use. Use an smp_mb() to cover the needs of both
      types of callers of netlink_set_status(): callers which have been
      reading data from the frame, and callers which have been
      filling or releasing and thus writing to the frame.
      
      - Example code path requiring a smp_rmb():
        memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
        netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
      
      - Example code path requiring a smp_wmb():
        hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
        hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
        netlink_frame_flush_dcache(hdr);
        netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
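
      The ordering requirement in both code paths above can be sketched in
      user space with C11 atomics. This is an analogy under stated
      assumptions, not the kernel code: struct frame, frame_set_status,
      frame_fill and frame_consume are hypothetical, and a sequentially
      consistent fence plays the role of smp_mb() in front of the status
      store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

enum frame_status { FRAME_UNUSED, FRAME_VALID };

/* Hypothetical shared frame; the status field hands ownership over. */
struct frame {
	_Atomic int status;
	unsigned int len;
	char data[64];
};

static void frame_set_status(struct frame *f, enum frame_status s)
{
	/* Full fence: earlier loads and stores finish before the status store. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&f->status, s, memory_order_relaxed);
}

/* Writer path: fill the frame, then publish it. */
static void frame_fill(struct frame *f, const char *msg, unsigned int len)
{
	assert(len <= sizeof(f->data));
	memcpy(f->data, msg, len);
	f->len = len;
	frame_set_status(f, FRAME_VALID);
}

/* Reader path: copy the data out, then release the frame. */
static unsigned int frame_consume(struct frame *f, char *out)
{
	unsigned int len;

	if (atomic_load_explicit(&f->status, memory_order_acquire) != FRAME_VALID)
		return 0;
	len = f->len;
	memcpy(out, f->data, len);
	frame_set_status(f, FRAME_UNUSED);
	return len;
}
```

      A single full fence in frame_set_status covers both kinds of callers,
      which mirrors the reasoning given for choosing smp_mb() over separate
      read and write barriers.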
      
      Fixes: f9c228 ("netlink: implement memory mapped recvmsg()")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef82260c
    • netlink: Always copy on mmap TX. · 0c6de524
      David Miller authored
      [ Upstream commit 4682a035 ]
      
      Checking the file f_count and the nlk->mapped count is not completely
      sufficient to prevent the mmap'd area contents from changing from
      under us during netlink mmap sendmsg() operations.
      
      Be careful to sample the header's length field only once, because this
      could change from under us as well.
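
      Sampling a shared length field only once can be sketched as follows.
      This is a user-space illustration with hypothetical names
      (struct shared_hdr, copy_frame_once), not the netlink code: the length
      is read into a local variable a single time, and that local value is
      used for both the bounds check and the copy into a private buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header living in memory that another party can rewrite. */
struct shared_hdr {
	volatile unsigned int len;
	unsigned char data[256];
};

/* Copy the frame into a private buffer; returns the length or -1. */
static long copy_frame_once(const struct shared_hdr *hdr,
			    unsigned char *dst, size_t cap)
{
	unsigned int len;

	assert(hdr != NULL && dst != NULL);
	len = hdr->len;		/* single read; hdr->len is not read again */
	if (len > sizeof(hdr->data) || len > cap)
		return -1;
	memcpy(dst, hdr->data, len);
	return (long)len;
}
```

      Reading hdr->len a second time after the check would reopen the window
      in which the other side can change it.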
      
      Fixes: 5fd96123 ("netlink: implement memory mapped sendmsg()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c6de524
    • gre: fix the inner mac header in nbma tunnel xmit path · eaf90cde
      Timo Teräs authored
      [ Upstream commit 8a0033a9 ]
      
      The NBMA GRE tunnels temporarily push a GRE header that contains the
      per-packet NBMA destination on the skb via header ops early in the xmit
      path. It is later pulled before the real GRE header is constructed.
      
      The inner mac was thus set differently in nbma case: the GRE header
      has been pushed by neighbor layer, and mac header points to beginning
      of the temporary gre header (set by dev_queue_xmit).
      
      Now that the offloads expect mac header to point to the gre payload,
      fix the xmit path to:
       - pull first the temporary gre header away
       - and reset mac header to point to gre payload
      
      This fixes tso to work again with nbma tunnels.
      
      Fixes: 14051f04 ("gre: Use inner mac length when computing tunnel length")
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eaf90cde
  2. 16 Jan, 2015 25 commits
    • Linux 3.14.29 · a2ab9187
      Greg Kroah-Hartman authored
      a2ab9187
    • mm: Don't count the stack guard page towards RLIMIT_STACK · 1bec714a
      Linus Torvalds authored
      commit 690eac53 upstream.
      
      Commit fee7e49d ("mm: propagate error from stack expansion even for
      guard page") made sure that we return the error properly for stack
      growth conditions.  It also theorized that counting the guard page
      towards the stack limit might break something, but also said "Let's see
      if anybody notices".
      
      Somebody did notice.  Apparently android-x86 sets the stack limit very
      close to the limit indeed, and including the guard page in the rlimit
      check causes the android 'zygote' process problems.
      
      So this adds the (fairly trivial) code to make the stack rlimit check be
      against the actual real stack size, rather than the size of the vma that
      includes the guard page.
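
      The adjusted limit check can be modeled like this. It is a sketch under
      stated assumptions, not the kernel function: stack_within_rlimit and
      MODEL_PAGE_SIZE are hypothetical, and a 4 KiB page size is assumed. The
      guard page is subtracted from the vma size before the comparison
      against the stack rlimit.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Returns true if the real stack (vma minus guard page) fits the limit. */
static bool stack_within_rlimit(unsigned long vma_size, bool has_guard_page,
				unsigned long rlim_cur)
{
	unsigned long actual_size = vma_size;

	assert(!has_guard_page || vma_size == 0 || vma_size >= MODEL_PAGE_SIZE);
	if (vma_size && has_guard_page)
		actual_size -= MODEL_PAGE_SIZE;	/* do not count the guard page */
	return actual_size <= rlim_cur;
}
```

      With this check, a process that uses exactly its stack limit is still
      accepted even though its vma is one page larger than the limit.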
      Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org>
      Cc: Jay Foad <jay.foad@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bec714a
    • mm: propagate error from stack expansion even for guard page · 11e4f3bf
      Linus Torvalds authored
      commit fee7e49d upstream.
      
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      means that the actual real stack is one page less than the stack limit.
      
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.
      Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      11e4f3bf
    • mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed · 18d9304b
      Vlastimil Babka authored
      commit 9e5e3661 upstream.
      
      Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
      stuck in a busy loop with nothing left to balance, but
      kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
      to be a combination of several factors:
      
      1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
      
      2. The process has been killed (by OOM in this case), but has not yet been
         scheduled to remove itself from the waitqueue and die.
      
      3. kswapd checks for throttled processes in prepare_kswapd_sleep():
      
              if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                      wake_up(&pgdat->pfmemalloc_wait);
                      return false; // kswapd will not go to sleep
              }
      
         However, for a process that was already killed, wake_up() does not remove
         the process from the waitqueue, since try_to_wake_up() checks its state
         first and returns false when the process is no longer waiting.
      
      4. kswapd is running on the same CPU as the only CPU that the process is
         allowed to run on (through cpus_allowed, or possibly single-cpu system).
      
      5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
         encounters no voluntary preemption points and repeatedly fails
         prepare_kswapd_sleep(), blocking the process from running and removing
         itself from the waitqueue, which would let kswapd sleep.
      
      So, the source of the problem is that we prevent kswapd from going to
      sleep as long as there are processes waiting on the pfmemalloc_wait queue,
      and a process waiting on a queue is guaranteed to be removed from the
      queue only when it gets scheduled.  This was done to make sure that no
      process is left sleeping on pfmemalloc_wait when kswapd itself goes to
      sleep.
      
      However, it isn't necessary to postpone kswapd sleep until the
      pfmemalloc_wait queue actually empties.  To prevent processes from being
      left sleeping, it's actually enough to guarantee that all processes
      waiting on pfmemalloc_wait queue have been woken up by the time we put
      kswapd to sleep.
      
      This patch therefore fixes this issue by substituting 'wake_up' with
      'wake_up_all' and removing 'return false' in the code snippet from
      prepare_kswapd_sleep() above.  Note that if any process puts itself in
      the queue after this waitqueue_active() check, or after the wake up
      itself, it means that the process will also wake up kswapd - and since
      we are under prepare_to_wait(), the wake up won't be missed.  Also we
      update the comment in prepare_kswapd_sleep() to hopefully more clearly
      describe the races it is preventing.
      
      Fixes: 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      18d9304b
    • mmc: sdhci: Fix sleep in atomic after inserting SD card · b36cd20d
      Krzysztof Kozlowski authored
      commit 2836766a upstream.
      
      Sleep in atomic context happened on Trats2 board after inserting or
      removing SD card because mmc_gpio_get_cd() was called under spin lock.
      
      Fix this by moving card detection earlier, before acquiring spin lock.
      The mmc_gpio_get_cd() call does not have to be protected by spin lock
      because it does not access any sdhci internal data.
      The sdhci_do_get_cd() call accesses host flags (SDHCI_DEVICE_DEAD). After
      moving it outside of the spin lock it could theoretically race with driver
      removal, but there is still no actual protection against manual card
      eject.
      
      Dmesg after inserting SD card:
      [   41.663414] BUG: sleeping function called from invalid context at drivers/gpio/gpiolib.c:1511
      [   41.670469] in_atomic(): 1, irqs_disabled(): 128, pid: 30, name: kworker/u8:1
      [   41.677580] INFO: lockdep is turned off.
      [   41.681486] irq event stamp: 61972
      [   41.684872] hardirqs last  enabled at (61971): [<c0490ee0>] _raw_spin_unlock_irq+0x24/0x5c
      [   41.693118] hardirqs last disabled at (61972): [<c04907ac>] _raw_spin_lock_irq+0x18/0x54
      [   41.701190] softirqs last  enabled at (61648): [<c0026fd4>] __do_softirq+0x234/0x2c8
      [   41.708914] softirqs last disabled at (61631): [<c00273a0>] irq_exit+0xd0/0x114
      [   41.716206] Preemption disabled at:[<  (null)>]   (null)
      [   41.721500]
      [   41.722985] CPU: 3 PID: 30 Comm: kworker/u8:1 Tainted: G        W      3.18.0-rc5-next-20141121 #883
      [   41.732111] Workqueue: kmmcd mmc_rescan
      [   41.735945] [<c0014d2c>] (unwind_backtrace) from [<c0011c80>] (show_stack+0x10/0x14)
      [   41.743661] [<c0011c80>] (show_stack) from [<c0489d14>] (dump_stack+0x70/0xbc)
      [   41.750867] [<c0489d14>] (dump_stack) from [<c0228b74>] (gpiod_get_raw_value_cansleep+0x18/0x30)
      [   41.759628] [<c0228b74>] (gpiod_get_raw_value_cansleep) from [<c03646e8>] (mmc_gpio_get_cd+0x38/0x58)
      [   41.768821] [<c03646e8>] (mmc_gpio_get_cd) from [<c036d378>] (sdhci_request+0x50/0x1a4)
      [   41.776808] [<c036d378>] (sdhci_request) from [<c0357934>] (mmc_start_request+0x138/0x268)
      [   41.785051] [<c0357934>] (mmc_start_request) from [<c0357cc8>] (mmc_wait_for_req+0x58/0x1a0)
      [   41.793469] [<c0357cc8>] (mmc_wait_for_req) from [<c0357e68>] (mmc_wait_for_cmd+0x58/0x78)
      [   41.801714] [<c0357e68>] (mmc_wait_for_cmd) from [<c0361c00>] (mmc_io_rw_direct_host+0x98/0x124)
      [   41.810480] [<c0361c00>] (mmc_io_rw_direct_host) from [<c03620f8>] (sdio_reset+0x2c/0x64)
      [   41.818641] [<c03620f8>] (sdio_reset) from [<c035a3d8>] (mmc_rescan+0x254/0x2e4)
      [   41.826028] [<c035a3d8>] (mmc_rescan) from [<c003a0e0>] (process_one_work+0x180/0x3f4)
      [   41.833920] [<c003a0e0>] (process_one_work) from [<c003a3bc>] (worker_thread+0x34/0x4b0)
      [   41.841991] [<c003a3bc>] (worker_thread) from [<c003fed8>] (kthread+0xe4/0x104)
      [   41.849285] [<c003fed8>] (kthread) from [<c000f268>] (ret_from_fork+0x14/0x2c)
      [   42.038276] mmc0: new high speed SDHC card at address 1234
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Fixes: 94144a46 ("mmc: sdhci: add get_cd() implementation")
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b36cd20d
    • spi: fsl: Fix problem with multi message transfers · 2937d5ac
      Stefan Roese authored
      commit 4302a596 upstream.
      
      When used via spidev with more than one message to transfer via
      SPI_IOC_MESSAGE the current implementation would return with
      -EINVAL, since bits_per_word and speed_hz are set in all
      transfer structs. And in the 2nd loop status will stay at
      -EINVAL as it's not overwritten again via fsl_spi_setup_transfer().
      
      This patch changes this behaviour by first checking if one of
      the messages uses different settings. If this is the case
      the function will return with -EINVAL. If not, the messages
      are transferred correctly.
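
A minimal sketch of the up-front check described above, assuming a simplified transfer structure. `struct xfer_sketch` and `check_transfers()` are hypothetical names used only for illustration; this is not the driver's actual code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a transfer descriptor. */
struct xfer_sketch {
	unsigned int bits_per_word;
	unsigned int speed_hz;
};

/*
 * Return 0 if every transfer uses the same settings as the first one,
 * -EINVAL otherwise. The check runs before any transfer is started, so
 * a message is either rejected as a whole or transferred as a whole.
 */
int check_transfers(const struct xfer_sketch *t, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++) {
		if (t[i].bits_per_word != t[0].bits_per_word ||
		    t[i].speed_hz != t[0].speed_hz)
			return -EINVAL;
	}
	return 0;
}
```

With two transfers that both use 8 bits per word at the same speed the function returns 0; changing either field in the second transfer makes it return -EINVAL.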
      Signed-off-by: Stefan Roese <sr@denx.de>
      Signed-off-by: Mark Brown <broonie@linaro.org>
      Cc: Esben Haabendal <esbenhaabendal@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2937d5ac
    • Jiri Olsa's avatar
      perf session: Do not fail on processing out of order event · ceaefcdf
      Jiri Olsa authored
      commit f61ff6c0 upstream.
      
      Linus reported perf report command being interrupted due to processing
      of 'out of order' event, with following error:
      
        Timestamp below last timeslice flush
        0x5733a8 [0x28]: failed to process type: 3
      
      I could reproduce the issue and in my case it was caused by one CPU
      (mmap) being behind during record and userspace mmap reader seeing the
      data after other CPUs data were already stored.
      
      This is expected under some circumstances because we need to limit the
      number of events that we queue for reordering when we receive a
      PERF_RECORD_FINISHED_ROUND or when we force flush due to memory
      pressure.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lkml.kernel.org/r/1417016371-30249-1-git-send-email-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      [zhangzhiqiang: backport to 3.10:
       - adjust context
       - commit f61ff6c0 struct events_stats was defined in tools/perf/util/event.h
         while 3.10 stable defined in tools/perf/util/hist.h.
       - 3.10 stable there is no pr_oe_time() which used for debug.
       - After the above adjustments, becomes same to the original patch:
         https://github.com/torvalds/linux/commit/f61ff6c06dc8f32c7036013ad802c899ec590607
      ]
      Signed-off-by: Zhiqiang Zhang <zhangzhiqiang.zhang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ceaefcdf
    • Jiri Olsa's avatar
      perf: Fix events installation during moving group · c8af8989
      Jiri Olsa authored
      commit 9fc81d87 upstream.
      
      We allow PMU driver to change the cpu on which the event
      should be installed to. This happened in patch:
      
        e2d37cd2 ("perf: Allow the PMU driver to choose the CPU on which to install events")
      
      This patch also forces all the group members to follow
      the currently opened events cpu if the group happened
      to be moved.
      
      This and the change of event->cpu in perf_install_in_context()
      function introduced in:
      
        0cda4c02 ("perf: Introduce perf_pmu_migrate_context()")
      
      forces group members to change their event->cpu,
      if the currently-opened-event's PMU changed the cpu
      and there is a group move.
      
      Above behaviour causes problems for breakpoint events,
      which use event->cpu to touch cpu specific data for
      breakpoints accounting. By changing event->cpu, some
      breakpoint slots were wrongly accounted for given
      cpu.
      
      Vince's perf fuzzer hit this issue and caused following
      WARN on my setup:
      
         WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
         Can't find any breakpoint slot
         [...]
      
      This patch changes the group moving code to keep the event's
      original cpu.
      Reported-by: Vince Weaver <vince@deater.net>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8af8989
    • Jiri Olsa's avatar
      perf/x86/intel/uncore: Make sure only uncore events are collected · ec9c772a
      Jiri Olsa authored
      commit af91568e upstream.
      
      The uncore_collect_events() function assumes that an event group
      can contain only uncore events, which is wrong, because it
      might contain any type of events.
      
      This bug leads to uncore framework touching 'not' uncore events,
      which could end up causing all sorts of bugs.
      
      One was triggered by Vince's perf fuzzer, when the uncore code
      touched breakpoint event private event space as if it was uncore
      event and caused BUG:
      
         BUG: unable to handle kernel paging request at ffffffff82822068
         IP: [<ffffffff81020338>] uncore_assign_events+0x188/0x250
         ...
      
      The code in uncore_assign_events() function was looking for
      event->hw.idx data while the event was initialized as a
      breakpoint with different members in event->hw union.
      
      This patch forces uncore_collect_events() to collect only uncore
      events.
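
A minimal sketch of the filtering idea, assuming each event carries a flag saying whether it belongs to the uncore PMU. `struct ev_sketch`, its `is_uncore` field and `collect_uncore()` are hypothetical; the real code identifies uncore events by other means.

```c
/* Hypothetical event descriptor; the real perf_event is far richer. */
struct ev_sketch {
	int is_uncore;
};

/*
 * Copy only the uncore events of a group into out[], skipping every
 * other event type. Returns the number of events collected, or -1 if
 * out[] is too small.
 */
int collect_uncore(struct ev_sketch **group, int n,
		   struct ev_sketch **out, int max)
{
	int i, cnt = 0;

	for (i = 0; i < n; i++) {
		if (!group[i]->is_uncore)
			continue;	/* e.g. a breakpoint event: leave it alone */
		if (cnt >= max)
			return -1;
		out[cnt++] = group[i];
	}
	return cnt;
}
```

For a group made of an uncore event, a breakpoint event and another uncore event, only the two uncore events end up in `out[]`, so later stages never interpret the breakpoint's private data as uncore data.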
      Reported-by: Vince Weaver <vince@deater.net>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Link: http://lkml.kernel.org/r/1418243031-20367-2-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec9c772a
    • Chris Mason's avatar
      Btrfs: don't delay inode ref updates during log replay · 33417386
      Chris Mason authored
      commit 6f896054 upstream.
      
      Commit 1d52c78a (Btrfs: try not to ENOSPC on log replay) added a
      check to skip delayed inode updates during log replay because it
      confuses the enospc code.  But the delayed processing will end up
      ignoring delayed refs from log replay because the inode itself wasn't
      put through the delayed code.
      
      This can end up triggering a warning at commit time:
      
      WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()
      
      Which is repeated for each commit because we never process the delayed
      inode ref update.
      
      The fix used here is to change btrfs_delayed_delete_inode_ref to return
      an error if we're currently in log replay.  The caller will do the ref
      deletion immediately and everything will work properly.
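
A minimal sketch of the control flow described above, assuming a simple flag for "log replay in progress". All names here are hypothetical, and the use of -EAGAIN as the error value is an assumption made for the sketch.

```c
#include <errno.h>

/* Hypothetical filesystem state used only for this sketch. */
struct fs_sketch {
	int log_replay;		/* non-zero while the log is being replayed */
	int delayed;		/* ref deletions queued for later */
	int immediate;		/* ref deletions done right away */
};

/* Refuse to queue a delayed ref deletion during log replay. */
int delay_ref_delete(struct fs_sketch *fs)
{
	if (fs->log_replay)
		return -EAGAIN;	/* assumed error code */
	fs->delayed++;
	return 0;
}

/* Caller: fall back to an immediate deletion when delaying fails. */
void delete_ref(struct fs_sketch *fs)
{
	if (delay_ref_delete(fs) != 0)
		fs->immediate++;
}
```

During log replay nothing is left in the delayed queue, so there is no pending update to trip the warning at commit time.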
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      33417386
    • Lorenzo Pieralisi's avatar
      arm64: kernel: fix __cpu_suspend mm switch on warm-boot · 852cacf6
      Lorenzo Pieralisi authored
      commit f43c2718 upstream.
      
      On arm64 the TTBR0_EL1 register is set to either the reserved TTBR0
      page tables on boot or to the active_mm mappings belonging to user space
      processes; it must never be set to swapper_pg_dir page tables mappings.
      
      When a CPU is booted its active_mm is set to init_mm even though its
      TTBR0_EL1 points at the reserved TTBR0 page mappings. This implies
      that when __cpu_suspend is triggered the active_mm can point at
      init_mm even if the current TTBR0_EL1 register contains the reserved
      TTBR0_EL1 mappings.
      
      Therefore, the mm save and restore executed in __cpu_suspend might
      turn out to be erroneous in that, if the current->active_mm corresponds
      to init_mm, on resume from low power it ends up restoring in the
      TTBR0_EL1 the init_mm mappings that are global and can cause speculation
      of TLB entries which end up being propagated to user space.
      
      This patch fixes the issue by checking the active_mm pointer before
      restoring the TTBR0 mappings. If the current active_mm == &init_mm,
      the code sets the TTBR0_EL1 to the reserved TTBR0 mapping instead of
      switching back to the active_mm, which is the expected behaviour
      corresponding to the TTBR0_EL1 settings when __cpu_suspend was entered.
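
A minimal sketch of the resume-time decision, assuming a plain pointer comparison against a stand-in for init_mm. `struct mm_sketch`, `init_mm_sketch` and `ttbr0_on_resume()` are hypothetical names; no register is actually written here.

```c
/* Hypothetical stand-in for the kernel's mm descriptor. */
struct mm_sketch {
	int id;
};

/* Stand-in for init_mm, the kernel's own address space. */
struct mm_sketch init_mm_sketch = { 0 };

enum ttbr0_target {
	TTBR0_RESERVED,		/* reserved TTBR0 page tables */
	TTBR0_ACTIVE_MM,	/* user-space mappings of active_mm */
};

/* Decide what TTBR0_EL1 should point at when resuming from low power. */
enum ttbr0_target ttbr0_on_resume(const struct mm_sketch *active_mm)
{
	if (active_mm == &init_mm_sketch)
		return TTBR0_RESERVED;	/* never install the global mappings */
	return TTBR0_ACTIVE_MM;
}
```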
      
      Fixes: 95322526 ("arm64: kernel: cpu_{suspend/resume} implementation")
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      852cacf6
    • Laura Abbott's avatar
      arm64: Move cpu_resume into the text section · 219591c5
      Laura Abbott authored
      commit c3684fbb upstream.
      
      The function cpu_resume currently lives in the .data section.
      There's no reason for it to be there since we can use relative
      instructions without a problem. Move a few cpu_resume data
      structures out of the assembly file so the .data annotation
      can be dropped completely and cpu_resume ends up in the read
      only text section.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      219591c5
    • Lorenzo Pieralisi's avatar
      arm64: kernel: refactor the CPU suspend API for retention states · 0e42d84b
      Lorenzo Pieralisi authored
      commit 714f5992 upstream.
      
      CPU suspend is the standard kernel interface to be used to enter
      low-power states on ARM64 systems. Current cpu_suspend implementation
      by default assumes that all low power states are losing the CPU context,
      so the CPU registers must be saved and cleaned to DRAM upon state
      entry. Furthermore, the current cpu_suspend() implementation assumes
      that if the CPU suspend back-end method returns when called, this has
      to be considered an error regardless of the return code (which can be
      successful) since the CPU was not expected to return from a code path that
      is different from cpu_resume code path - eg returning from the reset vector.
      
      All in all this means that the current API does not cope well with low-power
      states that preserve the CPU context when entered (ie retention states),
      since first of all the context is saved for nothing on state entry for
      those states and a successful state entry can return as a normal function
      return, which is considered an error by the current CPU suspend
      implementation.
      
      This patch refactors the cpu_suspend() API so that it can be split in
      two separate functionalities. The arm64 cpu_suspend API just provides
      a wrapper around CPU suspend operation hook. A new function is
      introduced (for architecture code use only) for states that require
      context saving upon entry:
      
      __cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
      
      __cpu_suspend() saves the context on function entry and calls the
      so called suspend finisher (ie fn) to complete the suspend operation.
      The finisher is not expected to return, unless it fails in which case
      the error is propagated back to the __cpu_suspend caller.
      
      The API refactoring results in the following pseudo code call sequence for a
      suspending CPU, when triggered from a kernel subsystem:
      
      /*
       * int cpu_suspend(unsigned long idx)
       * @idx: idle state index
       */
      {
      -> cpu_suspend(idx)
      	|---> CPU operations suspend hook called, if present
      		|--> if (retention_state)
      			|--> direct suspend back-end call (eg PSCI suspend)
      		     else
      			|--> __cpu_suspend(idx, &back_end_finisher);
      }
      
      By refactoring the cpu_suspend API this way, the CPU operations back-end
      has a chance to detect whether idle states require state saving or not
      and can call the required suspend operations accordingly either through
      simple function call or indirectly through __cpu_suspend() which carries out
      state saving and suspend finisher dispatching to complete idle state entry.
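
The call sequence above can be modelled in plain C as follows. This is only a user-space sketch under stated assumptions: `sketch_cpu_suspend()`, `sketch_save_and_finish()` and `backend_enter()` are hypothetical names, the context save is reduced to a counter, and the finisher simply returns 0 where real hardware would not return on success.

```c
/* Type of a suspend finisher, matching int (*fn)(unsigned long). */
typedef int (*finisher_fn)(unsigned long);

/* Counts how many times the (modelled) CPU context was saved. */
int context_saves;

/* Models __cpu_suspend(): save context, then run the finisher. */
int sketch_save_and_finish(unsigned long arg, finisher_fn fn)
{
	context_saves++;
	return fn(arg);		/* only returns here on failure in real code */
}

/* Models the back-end call, e.g. a firmware suspend request. */
int backend_enter(unsigned long idx)
{
	(void)idx;
	return 0;
}

/* Models the CPU operations suspend hook choosing between both paths. */
int sketch_cpu_suspend(unsigned long idx, int retention_state)
{
	if (retention_state)
		return backend_enter(idx);	/* context preserved: no save */
	return sketch_save_and_finish(idx, backend_enter);
}
```

Entering a retention state leaves `context_saves` untouched and treats a normal return as success, while a state that loses context goes through the save path first.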
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e42d84b
    • Lorenzo Pieralisi's avatar
      arm64: kernel: add missing __init section marker to cpu_suspend_init · 5ef30fef
      Lorenzo Pieralisi authored
      commit 18ab7db6 upstream.
      
      Suspend init function must be marked as __init, since it is not needed
      after the kernel has booted. This patch moves the cpu_suspend_init()
      function to the __init section.
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ef30fef
    • Rafael J. Wysocki's avatar
      ACPI / PM: Fix PM initialization for devices that are not present · 47abb28e
      Rafael J. Wysocki authored
      commit 1b1f3e16 upstream.
      
      If an ACPI device object whose _STA returns 0 (not present and not
      functional) has _PR0 or _PS0, its power_manageable flag will be set
      and acpi_bus_init_power() will return 0 for it.  Consequently, if
      such a device object is passed to the ACPI device PM functions, they
      will attempt to carry out the requested operation on the device,
      although they should not do that for devices that are not present.
      
      To fix that problem make acpi_bus_init_power() return an error code
      for devices that are not present which will cause power_manageable to
      be cleared for them as appropriate in acpi_bus_get_power_flags().
      However, the lists of power resources should not be freed for the
      device in that case, so modify acpi_bus_get_power_flags() to keep
      those lists even if acpi_bus_init_power() returns an error.
      Accordingly, when deciding whether or not the lists of power
      resources need to be freed, acpi_free_power_resources_lists()
      should check the power.flags.power_resources flag instead of
      flags.power_manageable, so make that change too.
      
      Furthermore, if acpi_bus_attach() sees that flags.initialized is
      unset for the given device, it should reset the power management
      settings of the device and re-initialize them from scratch instead
      of relying on the previous settings (the device may have appeared
      after being not present previously, for example), so make it use
      the 'valid' flag of the D0 power state as the initial value of
      flags.power_manageable for it and call acpi_bus_init_power() to
      discover its current power state.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      47abb28e
    • Thomas Petazzoni's avatar
      ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP · 8d33f514
      Thomas Petazzoni authored
      commit e5535545 upstream.
      
      Enabling the hardware I/O coherency on Armada 370, Armada 375, Armada
      38x and Armada XP requires a certain number of conditions:
      
       - On Armada 370, the cache policy must be set to write-allocate.
      
       - On Armada 375, 38x and XP, the cache policy must be set to
         write-allocate, the pages must be mapped with the shareable
         attribute, and the SMP bit must be set
      
      Currently, on Armada XP, when CONFIG_SMP is enabled, those conditions
      are met. However, when Armada XP is used in a !CONFIG_SMP kernel, none
      of these conditions are met. With Armada 370, the situation is worse:
      since the processor is single core, regardless of whether CONFIG_SMP
      or !CONFIG_SMP is used, the cache policy will be set to write-back by
      the kernel and not write-allocate.
      
      Since solving this problem turns out to be quite complicated, and we
      don't want to leave users with a mainline kernel known to have
      infrequent but existing data corruptions, this commit proposes to
      simply disable hardware I/O coherency in situations where it is known
      not to work.
      
      And basically, the is_smp() function of the kernel tells us whether it
      is OK to enable hardware I/O coherency or not, so this commit slightly
      refactors the coherency_type() function to return
      COHERENCY_FABRIC_TYPE_NONE when is_smp() is false, or the appropriate
      type of the coherency fabric in the other case.
      
      Thanks to this, the I/O coherency fabric will no longer be used at all
      in !CONFIG_SMP configurations. It will continue to be used in
      CONFIG_SMP configurations on Armada XP, Armada 375 and Armada 38x
      (which are multi-core processors), but will no longer be used on
      Armada 370 (which is a single core processor).
      
      In the process, it simplifies the implementation of the
      coherency_type() function, and adds a missing call to of_node_put().
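
A minimal sketch of the decision described above, assuming the SMP state and the detected fabric type are passed in as plain integers. Only COHERENCY_FABRIC_TYPE_NONE is named in the text; `SKETCH_FABRIC_TYPE_DETECTED` and `coherency_type_sketch()` are hypothetical.

```c
enum {
	COHERENCY_FABRIC_TYPE_NONE,
	SKETCH_FABRIC_TYPE_DETECTED,	/* placeholder for any real fabric type */
};

/*
 * Report no coherency fabric at all unless the kernel runs in SMP mode,
 * since the required cache policy and attributes are only set up there.
 */
int coherency_type_sketch(int smp, int detected_type)
{
	if (!smp)
		return COHERENCY_FABRIC_TYPE_NONE;
	return detected_type;
}
```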
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Fixes: e60304f8 ("arm: mvebu: Add hardware I/O Coherency support")
      Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Link: https://lkml.kernel.org/r/1415871540-20302-3-git-send-email-thomas.petazzoni@free-electrons.com
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      8d33f514
    • Pavel Machek's avatar
      Revert "ARM: 7830/1: delay: don't bother reporting bogomips in /proc/cpuinfo" · 3f4ddf1a
      Pavel Machek authored
      commit 4bf9636c upstream.
      
      Commit 9fc2105a ("ARM: 7830/1: delay: don't bother reporting
      bogomips in /proc/cpuinfo") breaks audio in python, and probably
      elsewhere, with message
      
        FATAL: cannot locate cpu MHz in /proc/cpuinfo
      
      I'm not the first one to hit it, see for example
      
        https://theredblacktree.wordpress.com/2014/08/10/fatal-cannot-locate-cpu-mhz-in-proccpuinfo/
        https://devtalk.nvidia.com/default/topic/765800/workaround-for-fatal-cannot-locate-cpu-mhz-in-proc-cpuinf/?offset=1
      
      Reading original changelog, I have to say "Stop breaking working setups.
      You know who you are!".
      Signed-off-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f4ddf1a
    • Nishanth Menon's avatar
      ARM: OMAP4: PM: Only do static dependency configuration in omap4_init_static_deps · dab35042
      Nishanth Menon authored
      commit 9008d83f upstream.
      
      Commit 705814b5 ("ARM: OMAP4+: PM: Consolidate OMAP4 PM code to
      re-use it for OMAP5") moved logic generic for OMAP5+ as part of the
      init routine by introducing omap4_pm_init. However, the patch left the
      powerdomain initial setup, an unused omap4430 es1.0 check and a
      spurious log "Power Management for TI OMAP4." in the original code.
      
      Remove the duplicate code which is already present in omap4_pm_init from
      omap4_init_static_deps.
      
      As part of this change, also move the u-boot version print out of the
      static dependency function to the omap4_pm_init function.
      
      Fixes: 705814b5 ("ARM: OMAP4+: PM: Consolidate OMAP4 PM code to re-use it for OMAP5")
      Signed-off-by: Nishanth Menon <nm@ti.com>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dab35042
    • Tomasz Figa's avatar
      ARM: dts: Enable PWM node by default for s3c64xx · 8be47453
      Tomasz Figa authored
      commit 5e794de5 upstream.
      
      The PWM block is required for system clock source so it must be always
      enabled. This patch fixes boot issues on SMDK6410 which did not have
      the node enabled explicitly for other purposes.
      
      Fixes: eeb93d02 ("clocksource: of: Respect device tree node status")
      Signed-off-by: Tomasz Figa <tomasz.figa@gmail.com>
      Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8be47453
    • Lokesh Vutla's avatar
      ARM: dts: DRA7: wdt: Fix compatible property for watchdog node · 9feeb8f3
      Lokesh Vutla authored
      commit be668835 upstream.
      
      OMAP wdt driver supports only ti,omap3-wdt compatible. In DRA7 dt
      wdt compatible property is defined as ti,omap4-wdt by mistake instead of
      ti,omap3-wdt. Correcting the typo.
      
      Fixes: 6e58b8f1 ("ARM: dts: DRA7: Add the dts files for dra7 SoC and dra7-evm board")
      Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9feeb8f3
    • Luca Abeni's avatar
      sched/deadline: Avoid double-accounting in case of missed deadlines · cae817ad
      Luca Abeni authored
      commit 269ad801 upstream.
      
      The dl_runtime_exceeded() function is supposed to check if
      a SCHED_DEADLINE task must be throttled, by checking if its
      current runtime is <= 0. However, it also checks if the
      scheduling deadline has been missed (the current time is
      larger than the current scheduling deadline), further
      decreasing the runtime if this happens.
      This "double accounting" is wrong:
      
      - In case of partitioned scheduling (or single CPU), this
        happens if task_tick_dl() has been called later than expected
        (due to small HZ values). In this case, the current runtime is
        also negative, and replenish_dl_entity() can take care of the
        deadline miss by recharging the current runtime to a value smaller
        than dl_runtime
      
      - In case of global scheduling on multiple CPUs, scheduling
        deadlines can be missed even if the task did not consume more
        runtime than expected, hence penalizing the task is wrong
      
      This patch fixes this problem by throttling a SCHED_DEADLINE task
      only when its runtime becomes negative, and not modifying the runtime.
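
A minimal sketch of the resulting check, assuming a reduced entity structure. `struct dl_entity_sketch` and `runtime_exceeded()` are hypothetical names; the comparison follows the "<= 0" wording used earlier in this message.

```c
#include <stdint.h>

/* Hypothetical, reduced view of a SCHED_DEADLINE entity. */
struct dl_entity_sketch {
	int64_t runtime;	/* remaining runtime, may go negative */
	uint64_t deadline;	/* absolute scheduling deadline */
};

/*
 * Throttle only when the runtime is used up. A missed deadline alone is
 * not a reason to throttle, and the runtime is never modified here.
 */
int runtime_exceeded(const struct dl_entity_sketch *se)
{
	return se->runtime <= 0;
}
```

A task that is past its deadline but still has runtime left is therefore not throttled, which covers the global scheduling case described above.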
      Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@gmail.com>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cae817ad
    • Luca Abeni's avatar
      sched/deadline: Fix migration of SCHED_DEADLINE tasks · 678c8bb7
      Luca Abeni authored
      commit 6a503c3b upstream.
      
      According to global EDF, tasks should be migrated between runqueues
      without checking if their scheduling deadlines and runtimes are valid.
      However, SCHED_DEADLINE currently performs such a check.
      A migration is carried out by doing:
      
      	deactivate_task(rq, next_task, 0);
      	set_task_cpu(next_task, later_rq->cpu);
      	activate_task(later_rq, next_task, 0);
      
      which ends up calling dequeue_task_dl(), setting the new CPU, and then
      calling enqueue_task_dl().
      
      enqueue_task_dl() then calls enqueue_dl_entity(), which calls
      update_dl_entity(), which can modify scheduling deadline and runtime,
      breaking global EDF scheduling.
      
      As a result, some of the properties of global EDF are not respected:
      for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
      two cores can have unbounded response times for the third task even
      if 30/80+40/80+120/170 = 1.5809 < 2
      
      This can be fixed by invoking update_dl_entity() only in case of
      wakeup, or if this is a new SCHED_DEADLINE task.
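
A minimal sketch of both points above: the utilization arithmetic for the example task set, and the condition under which deadline and runtime are refreshed. `struct dl_task_sketch`, `total_utilization()`, `should_update_dl_entity()` and `SKETCH_ENQUEUE_WAKEUP` are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical (runtime, period) pair describing one task. */
struct dl_task_sketch {
	double runtime;
	double period;
};

/* Sum of runtime/period over the whole task set. */
double total_utilization(const struct dl_task_sketch *t, size_t n)
{
	double u = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		u += t[i].runtime / t[i].period;
	return u;
}

/* Hypothetical flag standing in for "this enqueue is a wakeup". */
#define SKETCH_ENQUEUE_WAKEUP 0x1

/*
 * Refresh deadline and runtime only for a new task or on wakeup.
 * A plain migration (neither new nor wakeup) keeps the old parameters.
 */
int should_update_dl_entity(int is_new, int flags)
{
	return is_new || (flags & SKETCH_ENQUEUE_WAKEUP);
}
```

For the task set {(30, 80), (40, 80), (120, 170)} the sum is about 1.58, below the capacity of two cores, and an enqueue caused by migration alone does not trigger a parameter refresh.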
      Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@gmail.com>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      678c8bb7
    • Johannes Berg's avatar
      scripts/kernel-doc: don't eat struct members with __aligned · 11d1b5db
      Johannes Berg authored
      commit 7b990789 upstream.
      
      The change from \d+ to .+ inside __aligned() means that the following
      structure:
      
        struct test {
              u8 a __aligned(2);
              u8 b __aligned(2);
        };
      
      essentially gets modified to
      
        struct test {
              u8 a;
        };
      
      for purposes of kernel-doc, thus dropping a struct member, which in
      turn causes warnings and invalid kernel-doc generation.
      
      Fix this by replacing the catch-all (".") with anything that's not a
      semicolon ("[^;]").
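
The effect of the two patterns can be reproduced with POSIX extended regular expressions. This is only a sketch under stated assumptions: kernel-doc itself is not written in C, POSIX matching rules differ in detail from the script's regex engine, and `strip_matches()` and `count_char()` are hypothetical helpers.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove every match of the POSIX extended regex `pattern` from `src`,
 * writing the result to `dst` (capacity `cap`). Returns 0 on success,
 * -1 on a bad pattern or if `dst` is too small.
 */
int strip_matches(const char *pattern, const char *src, char *dst, size_t cap)
{
	regex_t re;
	regmatch_t m;
	size_t out = 0, keep, tail;

	if (cap == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	/* Copy the text before each match, then skip over the match. */
	while (regexec(&re, src, 1, &m, 0) == 0 && m.rm_eo > m.rm_so) {
		keep = (size_t)m.rm_so;
		if (out + keep >= cap) {
			regfree(&re);
			return -1;
		}
		memcpy(dst + out, src, keep);
		out += keep;
		src += m.rm_eo;
	}

	/* Copy whatever follows the last match. */
	tail = strlen(src);
	if (out + tail >= cap) {
		regfree(&re);
		return -1;
	}
	memcpy(dst + out, src, tail);
	dst[out + tail] = '\0';
	regfree(&re);
	return 0;
}

/* Count occurrences of `c` in `s`; used to count struct members. */
size_t count_char(const char *s, char c)
{
	size_t n = 0;

	for (; *s; s++)
		if (*s == c)
			n++;
	return n;
}
```

Applied to the two-member struct body above, the pattern `__aligned\(.+\)` matches from the first `__aligned(` to the last `)`, leaving a single semicolon, while `__aligned\([^;]+\)` stops at each member's own closing parenthesis and leaves both.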
      
      Fixes: 9dc30918 ("scripts/kernel-doc: handle struct member __aligned without numbers")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Cc: Nishanth Menon <nm@ti.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      11d1b5db
    • Ryusuke Konishi's avatar
      nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races · 1cde1254
      Ryusuke Konishi authored
      commit 705304a8 upstream.
      
      Same story as in commit 41080b5a ("nfsd race fixes: ext2") (similar
      ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
      of insert_inode_locked() and a bug of a check for dead inodes needs to
      be fixed.
      
      If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
      insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
      the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.
      
      If nilfs_iget() is called before nilfs_new_inode() calls
      insert_inode_locked4(), it will create an in-core inode and read its
      data from the on-disk inode.  But, nilfs_iget() will find i_nlink equals
      zero and fail at nilfs_read_inode_common(), which will lead it to call
      iget_failed() and cleanly fail.
      
      However, this sanity check doesn't work as expected for reused on-disk
      inodes because they leave a non-zero value in i_mode field and it
      hinders the test of i_nlink.  This patch also fixes the issue by
      removing the test on i_mode that nilfs2 doesn't need.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cde1254
    • Brian Norris's avatar
      mtd: tests: abort torturetest on erase errors · b9c2571e
      Brian Norris authored
      commit 68f29815 upstream.
      
      The torture test should quit once it actually induces an error in the
      flash. This step was accidentally removed during refactoring.
      
      Without this fix, the torturetest just continues infinitely, or until
      the maximum cycle count is reached. e.g.:
      
         ...
         [ 7619.218171] mtd_test: error -5 while erasing EB 100
         [ 7619.297981] mtd_test: error -5 while erasing EB 100
         [ 7619.377953] mtd_test: error -5 while erasing EB 100
         [ 7619.457998] mtd_test: error -5 while erasing EB 100
         [ 7619.537990] mtd_test: error -5 while erasing EB 100
         ...
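
A minimal sketch of the intended behaviour, assuming the erase operation is supplied as a callback. `torture_sketch()`, `failing_erase()` and the callback type are hypothetical; the point is only that the first erase error ends the run.

```c
/* Hypothetical erase callback: returns 0 on success, negative on error. */
typedef int (*erase_fn)(int eb);

/*
 * Run up to max_cycles erase passes over ebcnt eraseblocks and stop at
 * the first error. *erases_done receives the number of erase attempts.
 */
int torture_sketch(erase_fn erase, int ebcnt, int max_cycles, int *erases_done)
{
	int cycle, eb, err;

	*erases_done = 0;
	for (cycle = 0; cycle < max_cycles; cycle++) {
		for (eb = 0; eb < ebcnt; eb++) {
			err = erase(eb);
			(*erases_done)++;
			if (err)
				return err;	/* abort instead of retrying forever */
		}
	}
	return 0;
}

/* Example callback: eraseblock 3 always fails with -5, as in the log. */
int failing_erase(int eb)
{
	return eb == 3 ? -5 : 0;
}
```

With `failing_erase`, the run stops after the fourth erase attempt and reports -5 once, rather than repeating the same error on every cycle.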
      
      Fixes: 6cf78358 ("mtd: mtd_torturetest: use mtd_test helpers")
      Signed-off-by: Brian Norris <computersforpeace@gmail.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9c2571e