1. 08 Aug, 2016 40 commits
    • net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug · 678a6241
      Michal Soltys authored
      This patch simplifies how we update fsc and calculate vt from it - while
      keeping the expected functionality identical with how hfsc behaves
      currently. It also fixes a certain issue introduced with
      a very old patch.
      
      The idea is, that instead of correcting cl_vt before fsc curve update
      (rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
      cl_vt local to the current period - we can simply rely on virtual times
      and curve values always being in sync - analogously to how rsc and usc
      function, except that we use virtual time here.
      
      Why hasn't it been done this way since the beginning? The likely reason
      (judging by the code trying to correct curves whenever possible) was to
      keep the virtual times as small as possible - as they have a tendency to
      "gallop" forward whenever their siblings and other fair sharing
      subtrees are idling. On top of that, the current code is subtly bugged, so
      cumulative time (without any corrections) is always kept and used in
      init_vf() when a new backlog period begins (using cl_cvtoff).
      
      Is a cumulative value safe? Generally yes, though corner cases are easy
      to create. For example, consider:
      
      1gbit interface
      some 100kbit leaf, everything else idle
      
      With the current tick (64ns), 1s is 15625000 ticks, but the leaf is alone
      and this is virtual time, so in reality it advances 10000 times faster.
      In other words, 38 bits are needed to hold 1 second, 54 for 1 day, 59 for
      1 month and 63 for 1 year (all logarithms rounded up). That is getting
      somewhat dangerous, but it also requires a setup that justifies such
      values, not to mention a class permanently backlogged for a year. In a
      near most extreme case (10gbit interface, 10kbit leaf), 64 bits are
      "enough" to hold ~13.6 days.
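
The arithmetic above can be checked with a small userspace sketch. It assumes the 64ns tick stated in the message and link rates that are whole multiples of the leaf rate. The names TICKS_PER_SEC, vt_ticks(), bits_needed() and seconds_until_wrap() are made up for this illustration and do not exist in sch_hfsc.c.

```c
#include <assert.h>
#include <stdint.h>

/* 1 s / 64 ns, the tick length stated in the commit message. */
#define TICKS_PER_SEC 15625000ULL

/* Number of bits needed to store v (position of the highest set bit, plus one). */
unsigned bits_needed(uint64_t v)
{
	unsigned n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}

/*
 * Virtual time ticks accumulated by a lone leaf of rate leaf_bps on a link
 * of rate link_bps after `seconds` of wall-clock time. Illustration only:
 * assumes link_bps is a whole multiple of leaf_bps.
 */
uint64_t vt_ticks(uint64_t link_bps, uint64_t leaf_bps, uint64_t seconds)
{
	assert(leaf_bps > 0 && link_bps % leaf_bps == 0);
	return TICKS_PER_SEC * (link_bps / leaf_bps) * seconds;
}

/* Whole seconds of wall-clock time before a 64-bit tick counter wraps. */
uint64_t seconds_until_wrap(uint64_t link_bps, uint64_t leaf_bps)
{
	assert(leaf_bps > 0 && link_bps % leaf_bps == 0);
	return UINT64_MAX / (TICKS_PER_SEC * (link_bps / leaf_bps));
}
```

For the 1gbit/100kbit case, bits_needed(vt_ticks(1000000000, 100000, 86400)) covers one day, and seconds_until_wrap(10000000000, 10000) divided by 86400 gives the number of whole days for the 10gbit/10kbit case.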
      
      Well, the issue remains mostly theoretical and cl_cvtoff has been
      working fine for all those years. Sensible configurations are de facto
      immune to this issue, and not so sensible ones can solve it with a cronjob
      whose period is inversely proportional to the insanity of such a setup =)
      
      Now let's explain the subtle bug mentioned earlier.
      
      The issue is related to how offsets are kept and how we calculate
      virtual times and update fair service curve(s). The issue itself is
      subtle, but easy to observe with long m1 segments. It was introduced in
      a rather old patch:
      
      Commit 99296150: "[NET_SCHED]: O(1) children vtoff adjustment
      in HFSC scheduler"
      
      (available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)
      
      Originally when a new backlog period was started, cl_vtoff of each
      sibling was updated with cl_cvtmax from past period - naturally moving
      all cl_vt to proper starting point. That patch adjusted it so cumulative
      offset is kept in the parent, and there is no need for traversing the
      list (as any subsequent child activation derives new vt from already
      active sibling(s)).
      
      But with this change, cl_vtoff (of each sibling) is no longer persistent
      across the inactivity periods, as it's calculated from parent's
      cl_cvtoff on a new backlog period, conflicting with the following curve
      correction from the previous period:
      
      if (cl->cl_virtual.x == vt) {
              cl->cl_virtual.x -= cl->cl_vtoff;
              cl->cl_vtoff = 0;
      }
      
      This essentially tries to keep curve as if it was local to the period
      and resets cl_vtoff (cumulative vt offset of the class) to 0 when
      possible (read: when we have an intersection or if a new curve is below
      the old one). But then it's recalculated from cl_cvtoff on next active
      period.  Then rtsc_min() call preceding the above if() doesn't really
      do what we expect it to do in such scenario - as it calculates the
      minimum of corrected curve (from the previous backlog period) and the
      new uncorrected curve (with offset derived from cl_cvtoff).
      
      Example:
      
      tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
      tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
      tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit
      
      With A being class 1:10 and B being class 1:11:

      start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
      pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
      start A, let it run 10s
      pause A briefly to force rtsc_min()
      
      At this point we would expect A to continue at 20mbit after a brief
      moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
      the effect of first correcting A (during 'start A'), and then - after
      unpausing - calculating rtsc_min() from old corrected and new uncorrected
      curve.
      
      The patch fixes this bug and keeps vt and fsc in sync (virtual times
      are cumulative, not local to the backlog period).
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Use DEFINE_SPINLOCK() for spinlock · 0caf5b26
      Wei Yongjun authored
      spinlock can be initialized automatically with DEFINE_SPINLOCK()
      rather than explicitly calling spin_lock_init().
      Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
      Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/multicast: should not send source list records when have filter mode change · a052517a
      Hangbin Liu authored
      Based on RFC3376 5.1 and RFC3810 6.1
      
         If the per-interface listening change that triggers the new report is
         a filter mode change, then the next [Robustness Variable] State
         Change Reports will include a Filter Mode Change Record.  This
         applies even if any number of source list changes occur in that
         period.
      
         Old State         New State         State Change Record Sent
         ---------         ---------         ------------------------
         INCLUDE (A)       EXCLUDE (B)       TO_EX (B)
         EXCLUDE (A)       INCLUDE (B)       TO_IN (B)
      
      So we should not send a source-list change if there is a filter-mode change.

      Here are two scenarios:
      1. The group is deleted and the filter mode is EXCLUDE, which means we
         need to send a TO_IN { }.
      2. The group is not deleted but has pcm->crcount, which means we need to
         send a normal filter-mode change.

      At the same time, if the type is ALLOW or BLOCK and psf->sf_crcount is
      set, we stop adding records and decrease sf_crcount directly.
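
The rule quoted from the RFCs can be sketched as a small decision function. This is a minimal illustration of the table above, not the kernel's implementation. The enum names and state_change_record() are hypothetical.

```c
/* Per-interface filter mode for a multicast group. */
enum filter_mode { MODE_INCLUDE, MODE_EXCLUDE };

/* Kind of State Change Record to put in the next report. */
enum record_type {
	REC_NONE,               /* nothing changed */
	REC_TO_IN,              /* filter mode changed to INCLUDE */
	REC_TO_EX,              /* filter mode changed to EXCLUDE */
	REC_SOURCE_LIST_CHANGE  /* ALLOW/BLOCK records */
};

/*
 * A filter-mode change takes precedence: even if the source list changed
 * in the same period, only the Filter Mode Change Record is sent.
 */
enum record_type state_change_record(enum filter_mode old_mode,
				     enum filter_mode new_mode,
				     int sources_changed)
{
	if (old_mode != new_mode)
		return new_mode == MODE_EXCLUDE ? REC_TO_EX : REC_TO_IN;
	if (sources_changed)
		return REC_SOURCE_LIST_CHANGE;
	return REC_NONE;
}
```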
      
      Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: marvell: mvneta: use new api ethtool_{get|set}_link_ksettings · 013ad40d
      Philippe Reynes authored
      The ethtool api {get|set}_settings is deprecated.
      We move the mvneta driver to new api {get|set}_link_ksettings.
      
      We use the generic function phy_ethtool_get_link_ksettings,
      and update old mvneta_ethtool_set_settings to the new api.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: marvell: mvneta: use phydev from struct net_device · c6c022e3
      Philippe Reynes authored
      The private structure contains a pointer to phydev, but the structure
      net_device already contains such a pointer. So we can remove the pointer
      phy_dev in the private structure, and update the driver to use the
      one contained in struct net_device.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: greth: use phy_ethtool_{get|set}_link_ksettings · 72582fdb
      Philippe Reynes authored
      There are two generic functions phy_ethtool_{get|set}_link_ksettings,
      so we can use them instead of defining the same code in the driver.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: greth: use phydev from struct net_device · 65752dda
      Philippe Reynes authored
      The private structure contains a pointer to phydev, but the structure
      net_device already contains such a pointer. So we can remove the pointer
      phy in the private structure, and update the driver to use the
      one contained in struct net_device.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: octeon: use phy_ethtool_{get|set}_link_ksettings · 0d5704bf
      Philippe Reynes authored
      There are two generic functions phy_ethtool_{get|set}_link_ksettings,
      so we can use them instead of defining the same code in the driver.
      
      There was a check on CAP_NET_ADMIN in cvm_oct_set_settings, but this
      check is already done in dev_ethtool, so no need to repeat it before
      calling the generic function.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: octeon: use phydev from struct net_device · 86bc5ed6
      Philippe Reynes authored
      The private structure contains a pointer to phydev, but the structure
      net_device already contains such a pointer. So we can remove the pointer
      phydev in the private structure, and update the driver to use the
      one contained in struct net_device.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bna-next' · 11de8e62
      David S. Miller authored
      Ivan Vecera says:
      
      ====================
      bna: remove useless global variables
      
      The set removes useless global bnad_list as well as bnad->entry that track
      a list of driver instances but it is not used anywhere. The associated
      bnad_list_mutex is removed as well but as it is also used to protect
      bna_id increment it is necessary to convert bna_id to atomic_t.
      ====================
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
    • bna: remove global bnad_list_mutex · 09e36360
      Ivan Vecera authored
      Remove global bnad_list_mutex as it is not used anymore. This makes
      bnad_add_to_list() and bnad_remove_from_list() empty so remove them too.
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bna: change type of bna_id to atomic_t · 285eb9c3
      Ivan Vecera authored
      Change type of bna_id to atomic_t. The bnad_list_mutex is used to prevent
      a race when bna_id is incremented. After the change the mutex can be
      removed in the next step.
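
As a rough userspace analogy (C11 <stdatomic.h>, not the kernel's atomic_t API), an atomic counter hands out unique ids without any lock around the increment. The names next_id and alloc_id() are hypothetical and are not taken from the bna driver.

```c
#include <stdatomic.h>

/* Shared id counter; objects with static storage start at zero. */
static atomic_uint next_id;

/*
 * Return a unique id. atomic_fetch_add() reads and increments the counter
 * as one indivisible step, so two concurrent callers can never observe the
 * same value and no mutex is needed around the increment.
 */
unsigned alloc_id(void)
{
	return atomic_fetch_add(&next_id, 1);
}
```

A mutex-protected `id = counter++` gives the same guarantee, but the lock then exists only for that one statement, which is what makes it removable once the counter is atomic.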
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bna: remove useless linked list · a1f4064b
      Ivan Vecera authored
      Remove the global variable bnad_list and bnad->list_entry, which are used
      as a list of bna driver instances. The list is not needed.
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipconfig-improve-dhcp-timeouts' · 9f3377ef
      David S. Miller authored
      Uwe Kleine-König says:
      
      ====================
      net: ipconfig: improve DHCP timeout handling
      
      this series teaches the ipconfig code to handle a DHCP reply on eth0 even if a
      request on eth1 was already sent out.
      This is a follow-up fix to 2513dfb8 ("ipconfig: handle case of delayed DHCP
      server") that dropped a late reply.

      This makes it possible to work with slow DHCP servers at all in some
      configurations and improves boot speed in general.
      
      The first patch is not really necessary, it only helps decoding debug messages
      when there is more than one device.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipconfig: drop inter-device timeout · e0688534
      Uwe Kleine-König authored
      Now that ipconfig learned to handle "delayed replies" in the previous
      commit, there is no reason any more to delay sending a first request per
      device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipconfig: Support using "delayed" DHCP replies · 2647cffb
      Uwe Kleine-König authored
      The dhcp code only waits 1s between sending DHCP requests on different
      devices and only accepts an answer for the device that sent out the last
      request. Only the timeout at the end of a loop is increased iteratively
      which favours only the last device. This makes it impossible to work
      with a DHCP server that takes a little more than 1s when it is connected
      to a device that is not the last one.
      
      Instead of also increasing the inter-device timeout, teach the code to
      handle delayed replies.
      
      To accomplish that, make *ic_dev track the current ic_device instead of
      the current net_device and adapt all users accordingly. The relevant
      change then is to reset d to ic_dev on a reply to assert that the
      followup request goes through the right device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipconfig: Add device name to debug messages · 22fc5388
      Uwe Kleine-König authored
      This simplifies understanding what happens when there is more than one
      device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'be2net-next' · 874e1b75
      David S. Miller authored
      Sathya Perla says:
      
      ====================
      be2net: patch set
      
      Patch 1 fixes the driver to work around a bug in the Lancer FW in the
      vlan-config cmd processing. The FW in some cases clears the vlan-promisc
      setting even if it cannot apply the vlan filter. The driver has no means
      of knowing if the vlan-promisc setting has been cleared or not. This
      patch now explicitly clears the vlan-promisc setting via the RX-Filter cmd
      and then tries to program the vlan-list.
      
      Patch 2 fixes the failure path in the vlan vid add code.
      The driver currently removes a new vid from the adapter->vids[] array if
      be_vid_config() returns an error, which occurs when there is an error in
      HW/FW. This is wrong. After the HW/FW error is recovered from, we need the
      complete vids[] array to re-program the vlan list.
      
      Patch 3 fixes the ndo_set_rx_mode() path to avoid unnecessary multicast
      list updates to the FW. Each time the ndo_set_rx_mode() routine is called,
      the driver programs the multicast list in the adapter without checking
      if there are any changes to the list. This leads to a flood of RX_FILTER
      cmds when a number of vlan interfaces are configured over the device,
      as the ndo_ gets called for each vlan interface. To avoid this, we now
      use __dev_mc_sync() and __dev_uc_sync() API, but only to detect if there
      is a change in the mc/uc lists. Now that we use this API, the code has to
      be designed to issue these API calls for each invocation of the ndo_ call.
      
      Patch 4 replaces polling with sleeping in the FW completion path.
      The ndo_set_rx_mode() and ndo_add/del_vxlan_port() calls may be called with
      BHs disabled. The driver currently issues the required cmds to the FW in
      these contexts and polls on completions from the FW, while BHs remain
      disabled.  This can cause either packet loss or packet reception to be
      delayed on that CPU.  This patch defers processing of the above cmds to a
      separate workqueue. With this change, FW cmds are now issued only in process
      context. Now that the FW cmds are issued only in process context, they can
      sleep waiting for a completion instead of polling.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: replace polling with sleeping in the FW completion path · b7172414
      Sathya Perla authored
      The ndo_set_rx_mode() and ndo_add/del_vxlan_port() calls may be called with
      BHs disabled. The driver currently issues the required cmds to the FW in
      these contexts and polls on completions from the FW, while BHs remain
      disabled.  This can cause either packet loss or packet reception to be
      delayed on that CPU.
      
      This patch defers processing of the above cmds to a separate workqueue.
      With this change, FW cmds are now issued only in process context.
      Now that the FW cmds are issued only in process context, they can sleep
      waiting for a completion instead of polling. All the spin_lock_bh(mcc_lock)
      calls are now replaced with mutex calls.
      
      Also a new rx_filter_lock is now needed to protect the RX filtering fields
      like vids[] between be_vlan_add/rem_vid() and __be_set_rx_mode() contexts.
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: Avoid unnecessary firmware updates of multicast list · 92fbb1df
      Sriharsha Basavapatna authored
      Each time the ndo_set_rx_mode() routine is called, the driver programs the
      multicast list in the adapter without checking if there are any changes to
      the list. This leads to a flood of RX_FILTER cmds when a number of vlan
      interfaces are configured over the device, as the ndo_ gets
      called for each vlan interface. To avoid this, we now use __dev_mc_sync()
      and __dev_uc_sync() API, but only to detect if there is a change in the
      mc/uc lists. Now that we use this API, the code has to be designed to
      issue these API calls for each invocation of the be_set_rx_mode() call.
      Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: do not remove vids from driver table if be_vid_config() fails. · 0aff1fbf
      Sathya Perla authored
      The driver currently removes a new vid from the adapter->vids[] array if
      be_vid_config() returns an error, which occurs when there is an error in
      HW/FW. This is wrong. After the HW/FW error is recovered from, we need the
      complete vids[] array to re-program the vlan list.
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: clear vlan-promisc setting before programming the vlan list · 841f60fc
      Somnath Kotur authored
      The Lancer FW has a bug due to which in some cases the vlan-promisc setting
      is cleared even though the vlan-list programming (via the VLAN_CONFIG
      cmd) did not succeed. The driver has no way of knowing if the vlan-promisc
      mode was cleared or not when this cmd fails. To work around this issue,
      this patch first explicitly clears the vlan-promisc mode via RX_FILTER
      cmd and then tries to program the vlan list.
      Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neigh: allow admin to set NUD_STALE · 0e7bbcc1
      Julian Anastasov authored
      Admin should be able to set any state. Currently, this fails
      when lladdr is not changed and state is changed from
      NUD_CONNECTED to NUD_STALE:
      
      ip neigh add 192.168.8.1 lladdr 00:11:22:33:44:55 nud perm dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      ip neigh change 192.168.8.1 lladdr 00:11:22:33:44:55 nud stale dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      
      Problem may be from 2.1.X days.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: use event->chunk when it's valid · 1fe323aa
      Xin Long authored
      Commit 52253db9 ("sctp: also point GSO head_skb to the sk when
      it's available") used event->chunk->head_skb to get the head_skb in
      sctp_ulpevent_set_owner().
      
      But at that moment, the event->chunk was NULL, as it cloned the skb
      in sctp_ulpevent_make_rcvmsg(). Therefore, that patch didn't really
      work.
      
      This patch is to move the event->chunk initialization before calling
      sctp_ulpevent_receive_data() so that it uses event->chunk when it's
      valid.
      
      Fixes: 52253db9 ("sctp: also point GSO head_skb to the sk when it's available")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vxlan: lwt: Fix vxlan local traffic. · bbec7802
      pravin shelar authored
      The vxlan driver has a bypass for local vxlan traffic, but that
      depends on information about all VNIs on the local system in the
      vxlan driver. This is not available in the case of LWT.
      Therefore the following patch disables encap bypass for LWT
      vxlan traffic.
      
      Fixes: ee122c79 ("vxlan: Flow based tunneling").
      Reported-by: Jakub Libosvar <jlibosva@redhat.com>
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vxlan: lwt: Use source ip address during route lookup. · 272d96a5
      pravin shelar authored
      An LWT user can specify the destination as well as the source IP address
      for a given tunnel endpoint. But vxlan is ignoring the given source
      IP address. The following patch uses both IP addresses to route the
      tunnel packet. This is consistent with other LWT implementations,
      like GENEVE and GRE.
      
      Fixes: ee122c79 ("vxlan: Flow based tunneling").
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-csum-complete' · da1b4195
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      Few BPF helper related checksum fixes
      
      The set contains three fixes with regards to CHECKSUM_COMPLETE
      and BPF helper functions. For details please see individual
      patches.
      
      Thanks!
      
      v1 -> v2:
        - Fixed make htmldocs issue reported by kbuild bot.
        - Rest as is.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix checksum for vlan push/pop helper · 8065694e
      Daniel Borkmann authored
      When having skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs don't
      push rcsum of mac header back in and after BPF run back pull out again as
      opposed to some other subsystems (ovs, for example).
      
      For cases like q-in-q, meaning when a vlan tag for offloading is already
      present and we're about to push another one, then skb_vlan_push() pushes the
      inner one into the skb, increasing mac header and skb_postpush_rcsum()'ing
      the 4 bytes vlan header diff. Likewise, for the reverse operation in
      skb_vlan_pop() for the case where vlan header needs to be pulled out of the
      skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the 4 bytes
      rcsum of the vlan header that was removed.
      
      However mangling the rcsum here will lead to hw csum failure for BPF case,
      since we're pulling or pushing data that was not part of the current rcsum.
      Changing tc BPF programs in general to push/pull rcsum around BPF_PROG_RUN()
      is also not really an option since current behaviour is ABI by now, but apart
      from that would also mean to do quite a bit of useless work in the sense that
      usually 12 bytes need to be rcsum pushed/pulled also when we don't need to
      touch this vlan related corner case. One way to fix it would be to push the
      necessary rcsum fixup down into vlan helpers that are (mostly) slow-path
      anyway.
      
      Fixes: 4e10df9a ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix checksum fixups on bpf_skb_store_bytes · 479ffccc
      Daniel Borkmann authored
      bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
      flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
      Where we ran into an issue with bpf_skb_store_bytes() is when we did a
      single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
      flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
      underlying issue has been tracked down to a buffer alignment issue.
      
      Meaning, that csum_partial() computations via skb_postpull_rcsum() and
      skb_postpush_rcsum() pair invoked had a wrong result since they operated on
      an odd address for the hoplimit, while other computations were done on an
      even address. This mix doesn't work as-is with skb_postpull_rcsum(),
      skb_postpush_rcsum() pair as it always expects at least half-word alignment
      of input buffers, which is normally the case. Thus, instead of these helpers
      using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
      csum_block_add(), respectively. For unaligned offsets, they rotate the sum
      to align it to a half-word boundary again, otherwise they work the same as
      csum_sub() and csum_add().
      
      Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
      offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
      csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
      use a 0 constant for offset so that the compiler optimizes the offset & 1
      test away and generates the same code as with csum_sub()/_add().
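
The effect of the offset can be reproduced in userspace with a simplified 16-bit ones' complement sum. This is only a sketch of the idea: the kernel helpers operate on 32-bit partial sums, and every name below (csum16(), csum16_block_add(), csum16_block_sub() and friends) is hypothetical. A block that sits at an odd offset contributes to the total with its bytes swapped within each 16-bit word, so its standalone sum has to be rotated by 8 bits before it is added or subtracted.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back in (end-around carry) to get a 16-bit sum. */
static uint16_t csum16_fold(uint32_t s)
{
	while (s >> 16)
		s = (s & 0xffffu) + (s >> 16);
	return (uint16_t)s;
}

/* Ones' complement addition and subtraction of two 16-bit sums. */
uint16_t csum16_add(uint16_t a, uint16_t b)
{
	return csum16_fold((uint32_t)a + b);
}

uint16_t csum16_sub(uint16_t a, uint16_t b)
{
	return csum16_add(a, (uint16_t)~b);
}

/* Sum a buffer as big-endian 16-bit words, as if it began at an even offset. */
uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t s = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		s = csum16_fold(s + (((uint32_t)p[i] << 8) | p[i + 1]));
	if (len & 1)
		s = csum16_fold(s + ((uint32_t)p[len - 1] << 8));
	return (uint16_t)s;
}

/* Rotate a 16-bit sum by 8 bits (swap its two bytes). */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Add or remove the sum of a block located at byte `offset` in the packet. */
uint16_t csum16_block_add(uint16_t sum, uint16_t blk, size_t offset)
{
	return csum16_add(sum, (offset & 1) ? swap16(blk) : blk);
}

uint16_t csum16_block_sub(uint16_t sum, uint16_t blk, size_t offset)
{
	return csum16_sub(sum, (offset & 1) ? swap16(blk) : blk);
}
```

Updating a single byte at offset 7 (where the hop limit sits in an IPv6 header) through csum16_block_sub() and csum16_block_add() matches a full recomputation, while plain csum16_sub() and csum16_add() on the same values do not.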
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: also call skb_postpush_rcsum on xmit occasions · a2bfe6bf
      Daniel Borkmann authored
      Follow-up to commit f8ffad69 ("bpf: add skb_postpush_rcsum and fix
      dev_forward_skb occasions") to fix an issue for dev_queue_xmit() redirect
      locations which need CHECKSUM_COMPLETE fixups on ingress.
      
      For the same reasons as described in f8ffad69 already, we of course
      also need this here, since dev_queue_xmit() on a veth device will let us
      end up in the dev_forward_skb() helper again to cross namespaces.
      
      Latter then calls into skb_postpull_rcsum() to pull out L2 header, so
      that netif_rx_internal() sees CHECKSUM_COMPLETE as it is expected. That
      is, CHECKSUM_COMPLETE on ingress covering L2 _payload_, not L2 headers.
      
      Also here we have to address bpf_redirect() and bpf_clone_redirect().
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethernet: tundra: fix dump_eth_one warning in tsi108_eth · 66cf3504
      Paul Gortmaker authored
      The call site for this function appears as:
      
        #ifdef DEBUG
              data->msg_enable = DEBUG;
              dump_eth_one(dev);
        #endif
      
      ...leading to the following warning for !DEBUG builds:
      
      drivers/net/ethernet/tundra/tsi108_eth.c:169:13: warning: 'dump_eth_one' defined but not used [-Wunused-function]
       static void dump_eth_one(struct net_device *dev)
                   ^
      
      ...when using the arch/powerpc/configs/mpc7448_hpc2_defconfig
      
      Put the function definition under the same #ifdef as the call site
      to avoid the warning.
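
The pattern can be shown with a self-contained sketch. struct device, dump_device() and init_device() are hypothetical stand-ins and are not taken from tsi108_eth.c. The static helper is wrapped in the same #ifdef as its only caller, so a build without DEBUG never sees an unused static function.

```c
#include <stdio.h>

/* Hypothetical stand-in for the driver's private data. */
struct device {
	int msg_enable;
	int id;
};

#ifdef DEBUG
/* Compiled only for DEBUG builds, matching its single call site below. */
static void dump_device(const struct device *dev)
{
	printf("device %d: msg_enable=%d\n", dev->id, dev->msg_enable);
}
#endif

int init_device(struct device *dev)
{
	dev->msg_enable = 0;
#ifdef DEBUG
	dev->msg_enable = 1;
	dump_device(dev);
#endif
	return 0;
}
```

Building this with and without -DDEBUG (for example with -Wall) should produce no -Wunused-function warning in either case.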
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-dcb-fixes' · 61ec4bce
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: DCB fixes
      
      Patches 1 and 2 fix a problem in which PAUSE frames settings are wrongly
      overridden when ieee_setpfc() gets called.
      
      Patch 3 adds a missing rollback in port's creation error path.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61ec4bce
    • Ido Schimmel's avatar
      mlxsw: spectrum: Add missing DCB rollback in error path · 4de34eb5
      Ido Schimmel authored
      We correctly execute mlxsw_sp_port_dcb_fini() when port is removed, but
      I missed its rollback in the error path of port creation, so add it.
      
      Fixes: f00817df ("mlxsw: spectrum: Introduce support for Data Center Bridging (DCB)")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4de34eb5
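The kind of error-path rollback the commit above adds is usually written with goto-based unwinding. The sketch below is a generic illustration under assumed names (`port_sketch`, `buffers_init`, `dcb_init`, `netdev_register` and their counterparts are all hypothetical); it is not the mlxsw code.

```c
#include <stdbool.h>

/* Hypothetical port state tracking which setup steps are active. */
struct port_sketch {
	bool buffers_ready;
	bool dcb_ready;
	bool registered;
};

static int buffers_init(struct port_sketch *p) { p->buffers_ready = true; return 0; }
static void buffers_fini(struct port_sketch *p) { p->buffers_ready = false; }
static int dcb_init(struct port_sketch *p) { p->dcb_ready = true; return 0; }
static void dcb_fini(struct port_sketch *p) { p->dcb_ready = false; }

/* 'fail' lets a caller simulate a late failure during creation. */
static int netdev_register(struct port_sketch *p, bool fail)
{
	if (fail)
		return -1;
	p->registered = true;
	return 0;
}

static int port_create(struct port_sketch *p, bool fail_register)
{
	int err;

	err = buffers_init(p);
	if (err)
		return err;
	err = dcb_init(p);
	if (err)
		goto err_dcb_init;
	err = netdev_register(p, fail_register);
	if (err)
		goto err_register;
	return 0;

err_register:
	dcb_fini(p);		/* the rollback step that is easy to miss */
err_dcb_init:
	buffers_fini(p);
	return err;
}
```

Each label undoes exactly the steps that had succeeded before the failure, in reverse order, so the removal path and the creation error path stay symmetric.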
    • Ido Schimmel's avatar
      mlxsw: spectrum: Do not override PAUSE settings · 07d50cae
      Ido Schimmel authored
      The PFCC register is used to configure both PAUSE and PFC frames.
      Therefore, when PFC frames are disabled we must make sure we don't
      mistakenly also disable PAUSE frames (which might be enabled).
      
      Fix this by packing the PFCC register with the current PAUSE settings.
      
      Note that this register is also accessed via ethtool ops, but there we
      are guaranteed to have PFC disabled.
      
      Fixes: d81a6bdb ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07d50cae
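Why a shared register needs to be packed with the current PAUSE settings can be shown with a toy layout. Everything below is assumed for illustration: the bit positions, `struct port_state` and both pack functions are hypothetical and do not reflect the real PFCC register format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bits 0-7 enable PFC per priority,
 * bit 8 enables PAUSE on receive, bit 9 enables PAUSE on transmit. */
#define PFC_MASK	0x00ffu
#define PAUSE_RX	(1u << 8)
#define PAUSE_TX	(1u << 9)

struct port_state {
	bool pause_rx;
	bool pause_tx;
};

/* Naive packing: only the PFC bits are written, so any PAUSE bits that
 * were set in the register are cleared by the write. */
static uint32_t pfcc_pack_naive(uint8_t pfc_en)
{
	return pfc_en;
}

/* Packing that carries the port's current PAUSE settings along. */
static uint32_t pfcc_pack(const struct port_state *st, uint8_t pfc_en)
{
	uint32_t reg = pfc_en;

	if (st->pause_rx)
		reg |= PAUSE_RX;
	if (st->pause_tx)
		reg |= PAUSE_TX;
	return reg;
}
```

With PAUSE enabled in both directions, disabling PFC through the naive variant writes an all-zero value, while the second variant keeps the PAUSE bits set.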
    • Ido Schimmel's avatar
      mlxsw: spectrum: Do not assume PAUSE frames are disabled · b489a200
      Ido Schimmel authored
      When ieee_setpfc() gets called, PAUSE frames are not necessarily
      disabled on the port.
      
      Check if PAUSE frames are disabled or enabled and configure the port's
      headroom buffer accordingly.
      
      Fixes: d81a6bdb ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b489a200
    • Phil Sutter's avatar
      rhashtable-test: Fix max_size parameter description · 3b3bf80b
      Phil Sutter authored
      Looks like a simple copy'n'paste error.
      
      Fixes: 1aa661f5 ("rhashtable-test: Measure time to insert, remove & traverse entries")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b3bf80b
    • David S. Miller's avatar
      Merge branch 'sctp_diag-fixes' · 05ec40f0
      David S. Miller authored
      Phil Sutter says:
      
      ====================
      sctp_diag: A bunch of fixes for upcoming 'ss' support
      
      The following series contains a number of fixes necessary to make my yet
      unpublished 'ss' support patch functional.
      
      Changes since v1:
      - Fixed patch 2/3
      - Rebased whole series onto current net-next/master
      
      Changes since v2:
      - Improved description of patch 2/3
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05ec40f0
    • Phil Sutter's avatar
      sctp_diag: Respect ss adding TCPF_CLOSE to idiag_states · 1ba8d77f
      Phil Sutter authored
      Since 'ss' always adds TCPF_CLOSE to the idiag_states flags, sctp_diag
      can't rely on TCPF_LISTEN being the only flag present when listening
      sockets are requested.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ba8d77f
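The difference between a strict and a tolerant state-mask check, as described in the commit above, can be sketched like this. The `SK_`-prefixed constants are local copies that mirror the kernel's state numbering (TCP_CLOSE is 7, TCP_LISTEN is 10); the two helper functions are hypothetical and are not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the state flag bits, defined only for this sketch. */
#define SK_TCPF_CLOSE	(1u << 7)
#define SK_TCPF_LISTEN	(1u << 10)

/* Strict check: true only if LISTEN is the sole bit in the mask. */
static bool listen_only_strict(uint32_t states)
{
	return states == SK_TCPF_LISTEN;
}

/* Tolerant check: ignore the CLOSE bit that 'ss' always adds. */
static bool listen_only_tolerant(uint32_t states)
{
	return (states & ~SK_TCPF_CLOSE) == SK_TCPF_LISTEN;
}
```

A request carrying LISTEN plus CLOSE fails the strict test but passes the tolerant one, which is the case the commit message describes.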
    • Phil Sutter's avatar
      sctp_diag: Fix T3_rtx timer export · 12474e8e
      Phil Sutter authored
      The asoc's timer value is not kept in asoc->timeouts array but in its
      primary transport instead.
      
      Furthermore, we must export the timer only if it is pending, otherwise
      the value will underrun when stored in an unsigned variable and
      user space will only see a very large timeout value.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12474e8e
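The wrap-around mentioned in the commit above is ordinary unsigned arithmetic and can be reproduced in a few lines. This is a stand-alone sketch with hypothetical function names and a fixed 32-bit width chosen for determinism; the kernel's timer fields and types differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive export: if the timer is not pending, 'expires' may already lie
 * in the past and the unsigned subtraction wraps to a huge value. */
static uint32_t remaining_naive(uint32_t expires, uint32_t now)
{
	return expires - now;
}

/* Guarded export: report a remaining time only for a pending timer. */
static uint32_t remaining_if_pending(bool pending, uint32_t expires,
				     uint32_t now)
{
	return pending ? expires - now : 0;
}
```

With `expires = 100` and `now = 150`, the naive variant returns 4294967246 (2^32 - 50), while the guarded variant reports 0 for a timer that is not pending.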
    • Phil Sutter's avatar
      sctp: Export struct sctp_info to userspace · dca3f53c
      Phil Sutter authored
      This is required to correctly interpret INET_DIAG_INFO messages exported
      by sctp_diag module.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dca3f53c