    perf lock contention: Do not try to update if hash map is full
    The BPF program doesn't delete data from the task_data and lock_stat
    maps; the data is kept there until it's consumed by userspace at the
    end.  But it calls bpf_map_update_elem() again and again, and once the
    map is full the new data is simply discarded.  This is not good.
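
    Roughly, the lock_stat update path looks like this today (a simplified
    sketch of lock_contention.bpf.c with most fields and error handling
    omitted, so the details may differ from the actual file):

      data = bpf_map_lookup_elem(&lock_stat, &key);
      if (!data) {
              struct contention_data first = {
                      .total_time = duration,
                      /* ... */
              };

              /*
               * Once the map is full, this fails with -E2BIG for every new
               * key: the sample is dropped and the same expensive
               * allocation path is taken again on the next event.
               */
              bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
              return 0;
      }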
    
    Worse, bpf_map_update_elem() keeps trying to grab a new node even when
    the map is already full.  That behavior makes sense when the caller
    also deletes nodes, like in the tstamp map (that's why I didn't make
    the change there).
    
    In a pre-allocated hash map, that means iterating over every CPU to
    check its freelist, which has a bad performance impact on large
    machines.
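
    For reference, getting a free element from a pre-allocated hash map
    boils down to a per-CPU freelist walk roughly like the following.  This
    is only an illustration of the behavior, not the actual kernel code;
    try_pop_cpu() and next_possible_cpu() are made-up helpers standing in
    for the real per-CPU freelist operations:

      /* simplified: how a pre-allocated htab finds a free element */
      for (cpu = this_cpu; ;) {
              node = try_pop_cpu(freelist, cpu);   /* pop that CPU's list */
              if (node)
                      return node;                 /* got a free element */

              cpu = next_possible_cpu(cpu);        /* wrap over all CPUs */
              if (cpu == this_cpu)
                      return NULL;                 /* map full -> -E2BIG */
      }

    With the map full and 64 CPUs, every failed update pays for a full walk
    like this.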
    
    I've checked it on my 64-CPU machine with this workload:
    
      $ perf bench sched messaging -g 1000
      # Running 'sched/messaging' benchmark:
      # 20 sender and receiver processes per group
      # 1000 groups == 40000 processes run
    
           Total time: 2.825 [sec]
    
    And I used the task mode so that the map is guaranteed to fill up: the
    default number of map entries is 16K while this workload creates 40K
    tasks.
    
    Before:
      $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
      # Running 'sched/messaging' benchmark:
      # 20 sender and receiver processes per group
      # 1000 groups == 40000 processes run
    
           Total time: 11.299 [sec]
       contended   total wait     max wait     avg wait          pid   comm
    
           19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
             243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
             177     66.35 ms     12.08 ms    374.88 us      1220416   node
    
    For some reason, the run above didn't report the data failures.  But
    you can see that the total time of the workload increased a lot
    (2.8 -> 11.3 sec).  If the update fails early when the map is full, it
    goes back to normal.
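
    The subject of this change is exactly that: stop calling
    bpf_map_update_elem() once the map is known to be full.  One way to do
    that, assuming the fullness is remembered after the first -E2BIG, could
    look like the sketch below; the data_map_full flag name is illustrative
    and data_fail stands for the counter behind the "data:" line in the
    failure histogram:

      int data_map_full;      /* set once lock_stat returned -E2BIG */

      data = bpf_map_lookup_elem(&lock_stat, &key);
      if (!data) {
              if (data_map_full) {
                      /* fail early instead of walking the freelist again */
                      __sync_fetch_and_add(&data_fail, 1);
                      return 0;
              }

              err = bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
              if (err < 0) {
                      if (err == -E2BIG)
                              data_map_full = 1;   /* remember it's full */
                      __sync_fetch_and_add(&data_fail, 1);
              }
              return 0;
      }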
    
    After:
      $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
      # Running 'sched/messaging' benchmark:
      # 20 sender and receiver processes per group
      # 1000 groups == 40000 processes run
    
           Total time: 3.044 [sec]
       contended   total wait     max wait     avg wait          pid   comm
    
           18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
              51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
              81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging
    
      === output for debug ===
    
      bad: 1164137, total: 2253341
      bad rate: 51.66 %
      histogram of failure reasons
             task: 0
            stack: 0
             time: 0
             data: 1164137
    Signed-off-by: Namhyung Kim <namhyung@kernel.org>
    Acked-by: Ian Rogers <irogers@google.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Hao Luo <haoluo@google.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Song Liu <song@kernel.org>
    Cc: bpf@vger.kernel.org
    Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>