1. 21 Nov, 2022 18 commits
    • fs: dlm: parallelize lowcomms socket handling · dbb751ff
      Alexander Aring authored
      This patch reworks the lowcomms handling. The main goal is to let
      recvmsg() and sendpage() run in parallel. Parallel in two senses:
      1. per connection and 2. recvmsg() and sendpage() don't block each
      other.

      Currently recvmsg()/sendpage() cannot run in parallel because the two
      workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means
      only one work item can be executed at a time. The number of queued
      items grows with the number of nodes in the cluster. The current two
      workqueues for sending and receiving can also block each other if the
      same connection is handled at the same time in the dlm_recv and
      dlm_send workqueues, because a per connection mutex protects the
      socket handling.

      To make it more parallel we introduce one "dlm_io" workqueue which is
      not an ordered workqueue; the number of workers is not limited. Using
      per connection SEND/RECV pending flags we schedule workers ordered per
      connection and per send and receive task. To get rid of the mutex that
      blocks workers of the same connection from doing socket handling, we
      switched to a semaphore which handles socket operations as read lock
      and sock releases as write operations, to prevent sock_release() being
      called while the socket is being used.

      There might be more optimization possible by removing the semaphore and
      replacing it with another synchronization mechanism. However, due to
      other circumstances, e.g. the othercon behaviour, this change seems
      complicated. I added comments to remove the othercon handling and move
      to a different synchronization mechanism once this is done. We need to
      do that in the next dlm major version upgrade because it is not
      backwards compatible with the current connect mechanism.

      The processing of dlm messages still needs to be handled by an ordered
      workqueue. A dlm_process ordered workqueue was introduced which gets
      filled by the receive worker. This is probably the next bottleneck of
      DLM, but the application can't currently parse dlm messages in
      parallel. A comment was added about lifting the workqueue context of
      dlm processing into a non-sleepable softirq to get message processing
      done fast.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
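The per connection SEND/RECV pending flags described in the commit above can be pictured with a small userspace sketch. This is not the fs/dlm code: `struct conn`, `conn_mark_pending()` and `conn_clear_pending()` are hypothetical names, and C11 atomics stand in for the kernel's bit operations. The idea shown is only that a work item is queued when the flag flips from clear to set, so each connection has at most one send worker and one receive worker in flight.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names for illustration only, not the fs/dlm API. */
enum conn_task { CONN_SEND = 0, CONN_RECV = 1, CONN_TASK_MAX };

struct conn {
	/* one flag per task: true while a work item is queued or running */
	atomic_bool pending[CONN_TASK_MAX];
};

static void conn_init(struct conn *c)
{
	for (int i = 0; i < CONN_TASK_MAX; i++)
		atomic_init(&c->pending[i], false);
}

/*
 * Try to claim a task for this connection. Returns true when the caller
 * should queue a new work item, false when one is already pending.
 */
static bool conn_mark_pending(struct conn *c, enum conn_task task)
{
	assert(task < CONN_TASK_MAX);
	/* atomic_exchange() returns the previous value of the flag */
	return !atomic_exchange(&c->pending[task], true);
}

/* Called by the worker once it has finished its task. */
static void conn_clear_pending(struct conn *c, enum conn_task task)
{
	assert(task < CONN_TASK_MAX);
	atomic_store(&c->pending[task], false);
}
```

Send and receive use separate flags, so a pending send never prevents a receive from being queued on the same connection, and different connections never share a flag.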
    • fs: dlm: don't init error value · 1351975a
      Alexander Aring authored
      This patch removes an unnecessary initialization of an error value to
      -EINVAL.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: use saved sk_error_report() · c852a6d7
      Alexander Aring authored
      This patch changes how the original sk_error_report() is called: the
      saved callback is used directly instead of being put on the stack and
      called later. If listen_sock.sk_error_report() is NULL at this point,
      it indicates a bug in our implementation.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: use sock2con without checking null · e9dd5fd8
      Alexander Aring authored
      This patch removes null checks on private data for sockets. If we hit a
      null dereference there, we have a bug in our implementation, because
      such a callback should not occur in this state.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: remove dlm_node_addrs lookup list · 6f0b0b5d
      Alexander Aring authored
      This patch merges the dlm_node_addrs lookup list into the connection
      structure. It is a per node mapping to some configuration set up by
      configfs. We don't need two lookup structures. The connection hash now
      has the same lifetime as the dlm_node_addrs entries. That means we only
      add new entries when the cluster is configured and not when new
      connections come in, remove a connection when its node got fenced and
      clean up all connections when dlm exits. It should work the same and
      will even expose more issues because we no longer try to somehow keep
      those two data structures in sync with the current cluster
      configuration.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: don't put dlm_local_addrs on heap · c51c9cd8
      Alexander Aring authored
      This patch stops allocating the dlm_local_addr[] pointers on the heap.
      Instead we directly store entries of type "struct sockaddr_storage".
      This removes the function deinit_local() because it only freed memory.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: cleanup listen sock handling · c3d88dfd
      Alexander Aring authored
      This patch removes save_listen_callbacks() and add_listen_sock() as they
      are only used once in the lowcomms functionality. For lowcomms shutdown
      it's not necessary to flush the whole workqueues to synchronize with
      restoring the old sk_data_ready() callback. Only the listen con receive
      work needs to be cancelled. For each individual node shutdown we should
      be sure that the last ack has been transmitted, which is done by
      flushing the per connection swork worker.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: remove socket shutdown handling · 4f567acb
      Alexander Aring authored
      Since commit 489d8e55 ("fs: dlm: add reliable connection if
      reconnect") we have functionality like TCP offers for half-closed
      sockets on the dlm application protocol layer. This feature is required
      because the cluster manager events about leaving resource memberships
      may already have occurred locally while other cluster nodes still have
      a leave membership pending over the cluster manager protocol. During
      this time the local dlm node has already shut down its connection and
      doesn't transmit any new dlm messages, but it still needs to be able to
      accept dlm messages because of the pending leave membership request of
      the cluster manager protocol, over which the dlm kernel implementation
      has no control.

      We have this functionality on the application layer for two reasons.
      The main reason is that SCTP does not support such functionality on the
      socket layer. But we can do it inside the application layer.

      Another small issue is that this feature is broken in the TCP world
      because some NAT devices do not implement such functionality
      correctly. This is the same reason why the reliable connection session
      layer in DLM exists. We give up on middle devices in the network which
      send out e.g. TCP resets. In DLM we cannot have any message dropped and
      we ensure over a session layer that it can't happen.

      Back to the half-closed graceful shutdown handling. It's no longer
      necessary to do it on the socket layer (which is only supported for TCP
      sockets) because we do it on the application layer. This patch removes
      this handling; if there are still issues then we have a problem in the
      application layer handling.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: use listen sock as dlm running indicator · 1037c2a9
      Alexander Aring authored
      This patch switches from dlm_allow_conn to checking whether a listen
      socket is actually set in order to tell if dlm lowcomms is running. The
      listen socket will be set and unset in the lowcomms start and shutdown
      functionality. To synchronize with the data_ready() callback we will
      set the socket callback to NULL while the socket lock is held.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: use list_first_entry_or_null · dd070a56
      Alexander Aring authored
      Instead of checking list_empty() we can do the same with
      list_first_entry_or_null() and return NULL if the returned value is NULL.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
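As a rough illustration of the pattern in the commit above, here is a minimal userspace sketch. The kernel provides `list_first_entry_or_null(ptr, type, member)` in its list header; the `struct list_head`, `container_of()` and macro definitions below are simplified reimplementations written for this example, and `struct entry` and `first_entry()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* First entry of the list, or NULL when the list is empty. */
#define list_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? \
	 container_of((head)->next, type, member) : NULL)

/* Hypothetical list element used only for this example. */
struct entry {
	int id;
	struct list_head list;
};

/* One call replaces a list_empty() check followed by list_first_entry(). */
static struct entry *first_entry(struct list_head *head)
{
	return list_first_entry_or_null(head, struct entry, list);
}
```

The caller only has to test the returned pointer against NULL, which keeps the emptiness check and the lookup in one place.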
    • fs: dlm: remove twice INIT_WORK · 01ea3d77
      Alexander Aring authored
      This patch removes a duplicate INIT_WORK() call. We already do this
      inside dlm_lowcomms_init(), which is called only once when dlm is
      loaded.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: add midcomms init/start functions · 8b0188b0
      Alexander Aring authored
      This patch introduces the leftover init, start, stop and exit
      functionality. The dlm application layer should always call the
      midcomms layer, which becomes aware of such events and redirects them
      to the lowcomms layer. Some functionality which is currently handled
      inside the start functionality of midcomms and lowcomms should be
      handled in the init functionality as it only needs to be initialized
      once when dlm is loaded.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: add dst nodeid for msg tracing · 17827754
      Alexander Aring authored
      In DLM it is easy to add the lock resource name when we send a dlm
      message, but an additional lookup is required to trace the receive
      side. The idea here is to move the lookup work to the user, who matches
      a received message to the right sent message. Note that DLM can't drop
      any message, which is guaranteed by a special session layer.

      For the lookup a 3-tuple is required as a unique identification: dst
      nodeid, src nodeid and sequence number. This patch adds the destination
      nodeid to the dlm message trace points. The source nodeid is given by
      the h_nodeid field inside the header.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
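The user side lookup that the commit above describes could look roughly like the following sketch. It assumes the trace events have already been parsed into an array of keys; `struct msg_key`, `msg_key_equal()` and `find_send_event()` are hypothetical names, and the 32-bit field widths are an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key for one traced message: the 3-tuple from the commit. */
struct msg_key {
	uint32_t dst_nodeid;	/* destination node, added by this patch */
	uint32_t src_nodeid;	/* source node, h_nodeid in the header */
	uint32_t seq;		/* sequence number */
};

static bool msg_key_equal(const struct msg_key *a, const struct msg_key *b)
{
	return a->dst_nodeid == b->dst_nodeid &&
	       a->src_nodeid == b->src_nodeid &&
	       a->seq == b->seq;
}

/*
 * Find the send event that belongs to a receive event. Returns the index
 * into sends[], or -1 when no send event carries the same 3-tuple.
 */
static long find_send_event(const struct msg_key *sends, size_t n,
			    const struct msg_key *recv)
{
	assert(sends != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (msg_key_equal(&sends[i], recv))
			return (long)i;
	}
	return -1;
}
```

A linear scan keeps the sketch short; a real trace analysis tool would more likely index the send events in a hash table keyed by the same three fields.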
    • fs: dlm: rename seq to h_seq for msg tracing · 81889255
      Alexander Aring authored
      This patch renames seq to h_seq as it is named in the dlm header
      structure.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: rename DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING · 554d8496
      Alexander Aring authored
      This patch renames DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING because
      CB_PENDING is a more accurate name for this flag. The flag is set when
      the callback enqueue returns DLM_ENQUEUE_CALLBACK_NEED_SCHED because
      the callback worker needs to be queued. The flag tells that callbacks
      are currently pending to be called and will be unset once the callback
      work for the specific lkb is done. The term "need schedule" describes
      part of this, but a more accurate name says that there are some
      callbacks pending to be called.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: ast do WARN_ON_ONCE() on hotpath · 740bb8fc
      Alexander Aring authored
      This patch changes the ast hotpath to use WARN_ON_ONCE() instead of
      WARN_ON() for very unlikely cases, so that we don't spam the console
      output if we run into states where the warning would occur over and
      over again.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
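The difference that matters on a hot path is that a warn-once macro remembers that it already fired at a given call site. The sketch below shows that mechanism in userspace; `WARN_ON_ONCE_SKETCH()`, `warnings_printed` and `hot_path()` are hypothetical names, and unlike the kernel's WARN_ON_ONCE() this simplified macro prints no stack trace and does not evaluate to the condition.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings; exists only to make the effect observable. */
static int warnings_printed;

/*
 * Warn the first time cond is true at this call site, then stay silent.
 * Each expansion gets its own static flag, so call sites are independent.
 */
#define WARN_ON_ONCE_SKETCH(cond) do {					\
	static bool warned_;						\
	if ((cond) && !warned_) {					\
		warned_ = true;						\
		warnings_printed++;					\
		fprintf(stderr, "WARNING at %s:%d\n",			\
			__FILE__, __LINE__);				\
	}								\
} while (0)

/* Hypothetical hot path that checks a state which should never occur. */
static void hot_path(int refcount)
{
	WARN_ON_ONCE_SKETCH(refcount < 0);
}
```

Calling `hot_path(-1)` in a loop emits a single warning, where a plain warn macro would emit one per iteration.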
    • fs: dlm: drop lkb ref in bug case · 9267c857
      Alexander Aring authored
      This patch drops the lkb reference in a very unlikely case which should
      not happen in practice. However, if it happens we clean up the
      reference just in case.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
    • fs: dlm: avoid false-positive checker warning · f217d7cc
      Alexander Aring authored
      This patch avoids the false-positive checker warning about writing 112
      bytes into an 88 byte field "e->request", see:
      
      [   54.891560] dlm: csmb1: dlm_recover_directory 23 out 2 messages
      [   54.990542] ------------[ cut here ]------------
      [   54.991012] memcpy: detected field-spanning write (size 112) of single field "&e->request" at fs/dlm/requestqueue.c:47 (size 88)
      [   54.992150] WARNING: CPU: 0 PID: 297 at fs/dlm/requestqueue.c:47 dlm_add_requestqueue+0x177/0x180
      [   54.993002] CPU: 0 PID: 297 Comm: kworker/u4:3 Not tainted 6.1.0-rc5-00008-ge01d50cb #248
      [   54.993878] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
      [   54.994718] Workqueue: dlm_recv process_recv_sockets
      [   54.995230] RIP: 0010:dlm_add_requestqueue+0x177/0x180
      [   54.995731] Code: e7 01 0f 85 3b ff ff ff b9 58 00 00 00 48 c7 c2 c0 41 74 82 4c 89 ee 48 c7 c7 20 42 74 82 c6 05 8b 8d 30 02 01 e8 51 07 be 00 <0f> 0b e9 12 ff ff ff 66 90 0f 1f 44 00 00 41 57 48 8d 87 10 08 00
      [   54.997483] RSP: 0018:ffffc90000b1fbe8 EFLAGS: 00010282
      [   54.997990] RAX: 0000000000000000 RBX: ffff888024fc3d00 RCX: 0000000000000000
      [   54.998667] RDX: 0000000000000001 RSI: ffffffff81155014 RDI: fffff52000163f73
      [   54.999342] RBP: ffff88800dbac000 R08: 0000000000000001 R09: ffffc90000b1fa5f
      [   54.999997] R10: fffff52000163f4b R11: 203a7970636d656d R12: ffff88800cfb0018
      [   55.000673] R13: 0000000000000070 R14: ffff888024fc3d18 R15: 0000000000000000
      [   55.001344] FS:  0000000000000000(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000
      [   55.002078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   55.002603] CR2: 00007f35d4f0b9a0 CR3: 0000000025495002 CR4: 0000000000770ef0
      [   55.003258] PKRU: 55555554
      [   55.003514] Call Trace:
      [   55.003756]  <TASK>
      [   55.003953]  dlm_receive_buffer+0x1c0/0x200
      [   55.004348]  dlm_process_incoming_buffer+0x46d/0x780
      [   55.004786]  ? kernel_recvmsg+0x8b/0xc0
      [   55.005150]  receive_from_sock.isra.0+0x168/0x420
      [   55.005582]  ? process_listen_recv_socket+0x10/0x10
      [   55.006018]  ? finish_task_switch.isra.0+0xe0/0x400
      [   55.006469]  ? __switch_to+0x2fe/0x6a0
      [   55.006808]  ? read_word_at_a_time+0xe/0x20
      [   55.007197]  ? strscpy+0x146/0x190
      [   55.007505]  process_one_work+0x3d0/0x6b0
      [   55.007863]  worker_thread+0x8d/0x620
      [   55.008209]  ? __kthread_parkme+0xd8/0xf0
      [   55.008565]  ? process_one_work+0x6b0/0x6b0
      [   55.008937]  kthread+0x171/0x1a0
      [   55.009251]  ? kthread_exit+0x60/0x60
      [   55.009582]  ret_from_fork+0x1f/0x30
      [   55.009903]  </TASK>
      [   55.010120] ---[ end trace 0000000000000000 ]---
      [   55.025783] dlm: csmb1: dlm_recover 5 generation 3 done: 201 ms
      [   55.026466] gfs2: fsid=smbcluster:csmb1.0: recover generation 3 done
      
      It seems the checker is unable to detect the additional bytes which
      were allocated for the flexible array in struct dlm_message. To solve
      it we split the memcpy() into one copy for the 88 byte struct and
      another memcpy() for the flexible array m_extra field.
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
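The split copy described in the commit above can be sketched in userspace as follows. The structures are simplified, hypothetical stand-ins (`struct msg` for struct dlm_message, `extra` for m_extra, `struct rq_entry` for the request queue entry), and `rq_entry_create()` is an illustrative helper, not the kernel function.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct dlm_message. */
struct msg {
	uint32_t type;
	uint32_t length;	/* number of bytes in extra[] */
	char extra[];		/* flexible array member */
};

/*
 * Hypothetical stand-in for the request queue entry. Embedding a struct
 * that ends in a flexible array is a GNU C extension, as the kernel uses.
 */
struct rq_entry {
	int nodeid;
	struct msg request;
};

static struct rq_entry *rq_entry_create(int nodeid, const struct msg *ms)
{
	/* room for the entry plus the variable-length tail */
	struct rq_entry *e = malloc(sizeof(*e) + ms->length);

	if (!e)
		return NULL;
	e->nodeid = nodeid;
	/* copy 1: only the fixed-size part, which fits sizeof(e->request) */
	memcpy(&e->request, ms, sizeof(*ms));
	/* copy 2: the flexible array, addressed through its own field */
	memcpy(e->request.extra, ms->extra, ms->length);
	return e;
}
```

Each memcpy() now writes no more than the size of the field it names as destination, which is the property the field-spanning write check looks at, while the bytes that end up in memory are the same as with one large copy.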
  2. 08 Nov, 2022 18 commits
  3. 06 Nov, 2022 4 commits
    • Linux 6.1-rc4 · f0c4d9fc
      Linus Torvalds authored
    • Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 16c7a368
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "Several fixes for CXL region creation crashes, leaks and failures.
      
        This is mainly fallout from the original implementation of dynamic CXL
        region creation (instantiate new physical memory pools) that arrived
        in v6.0-rc1.
      
        Given the theme of "failures in the presence of pass-through decoders"
        this also includes new regression test infrastructure for that case.
      
        Summary:
      
         - Fix region creation crash with pass-through decoders
      
         - Fix region creation crash when no decoder allocation fails
      
         - Fix region creation crash when scanning regions to enforce the
           increasing physical address order constraint that CXL mandates
      
         - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
           1:1 memory-device-to-region associations.
      
         - Fix a memory leak for cxl_region objects when regions with active
           targets are deleted
      
         - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
           emulated proximity domains.
      
         - Fix region creation failure for switch attached devices downstream
           of a single-port host-bridge
      
         - Fix false positive memory leak of cxl_region objects by recycling
           recently used region ids rather than freeing them
      
         - Add regression test infrastructure for a pass-through decoder
           configuration
      
         - Fix some mailbox payload handling corner cases"
      
      * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/region: Recycle region ids
        cxl/region: Fix 'distance' calculation with passthrough ports
        tools/testing/cxl: Add a single-port host-bridge regression config
        tools/testing/cxl: Fix some error exits
        cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
        cxl/region: Fix cxl_region leak, cleanup targets at region delete
        cxl/region: Fix region HPA ordering validation
        cxl/pmem: Use size_add() against integer overflow
        cxl/region: Fix decoder allocation crash
        ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
        cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
        cxl/region: Fix null pointer dereference due to pass through decoder commit
        cxl/mbox: Add a check on input payload size
    • Merge tag 'hwmon-for-v6.1-rc4' of... · aa529949
      Linus Torvalds authored
      Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix two regressions:
      
         - Commit 54cc3dbf ("hwmon: (pmbus) Add regulator supply into
           macro") resulted in regulator undercount when disabling regulators.
           Revert it.
      
         - The thermal subsystem rework caused the scmi driver to no longer
           register with the thermal subsystem because index values no longer
           match. To fix the problem, the scmi driver now directly registers
           with the thermal subsystem, no longer through the hwmon core"
      
      * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        Revert "hwmon: (pmbus) Add regulator supply into macro"
        hwmon: (scmi) Register explicitly with Thermal Framework
    • Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 727ea09e
      Linus Torvalds authored
      Pull perf fixes from Borislav Petkov:
      
       - Add Cooper Lake's stepping to the PEBS guest/host events isolation
         fixed microcode revisions checking quirk
      
       - Update Icelake and Sapphire Rapids events constraints
      
       - Use the standard energy unit for Sapphire Rapids in RAPL
      
       - Fix the hw_breakpoint test to fail more graciously on !SMP configs
      
      * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
        perf/x86/intel: Fix pebs event constraints for SPR
        perf/x86/intel: Fix pebs event constraints for ICL
        perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
        perf/hw_breakpoint: test: Skip the test if dependencies unmet