1. 20 Aug, 2021 1 commit
  2. 18 Aug, 2021 1 commit
    • dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 528b16bf
      Arne Welzel authored
      On systems with many cores using dm-crypt, heavy spinlock contention in
      percpu_counter_compare() can be observed when the page allocation limit
      for a given device is reached or close to being reached. This is due
      to percpu_counter_compare() taking a spinlock to compute an exact
      result on potentially many CPUs at the same time.
      
      Switch to non-exact comparison of allocated and allowed pages by using
      the value returned by percpu_counter_read_positive() to avoid taking
      the percpu_counter spinlock.
      
      This may over/under estimate the actual number of allocated pages by at
      most (batch-1) * num_online_cpus().
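The bound above follows from how a batched per-CPU counter defers updates. A minimal userspace sketch of that behaviour, not the kernel implementation: the class and function names are made up for illustration, and it assumes each CPU folds its local delta into the shared value once the delta's magnitude reaches `batch`.

```python
import random


class PerCpuCounter:
    """Userspace model of a batched per-CPU counter (not kernel code)."""

    def __init__(self, num_cpus, batch=32):
        self.batch = batch
        self.count = 0               # shared value; guarded by a spinlock in the kernel
        self.local = [0] * num_cpus  # per-CPU deltas not yet folded into count

    def add(self, cpu, amount):
        delta = self.local[cpu] + amount
        if abs(delta) >= self.batch:
            self.count += delta      # slow path: fold the delta into the shared value
            self.local[cpu] = 0
        else:
            self.local[cpu] = delta  # fast path: touch only this CPU's delta

    def read_positive(self):
        """Approximate read: shared value only, clamped at zero, no locking."""
        return max(self.count, 0)

    def exact_sum(self):
        """Exact read: also walks every CPU's delta (needs the lock in the kernel)."""
        return self.count + sum(self.local)


def max_observed_error(num_cpus=8, batch=32, steps=20000, seed=0):
    """Random alloc/free workload; returns the largest |exact - approximate| seen."""
    rng = random.Random(seed)
    counter = PerCpuCounter(num_cpus, batch)
    worst = 0
    for _ in range(steps):
        cpu = rng.randrange(num_cpus)
        # Free a page only while at least one page is outstanding.
        amount = 1 if counter.exact_sum() == 0 or rng.random() < 0.5 else -1
        counter.add(cpu, amount)
        worst = max(worst, abs(counter.exact_sum() - counter.read_positive()))
    return worst


bound = (32 - 1) * 8
print(max_observed_error() <= bound)  # True
```

Because every local delta stays strictly below `batch` in magnitude, the approximate read can differ from the exact sum by at most `(batch - 1)` per CPU, which is the error term quoted in the commit message.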
      
      Currently, batch is bounded by 32. The system on which this issue was
      first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
      change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
      allocations, this seems an acceptable error. Certainly preferred over
      running into the spinlock contention.
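The figures in the paragraph above can be checked with a few lines of arithmetic. This sketch only restates the numbers given in the commit message (batch of 32, 256 CPUs, 4k pages, 512GB of RAM, 2% allowance); the variable names are illustrative.

```python
PAGE_SIZE = 4 * 1024        # 4k pages, as in the commit message
BATCH = 32                  # batch value quoted in the commit message
NUM_CPUS = 256              # CPUs on the system where the issue was first seen
TOTAL_RAM = 512 * 1024**3   # 512GB of RAM

max_error_pages = (BATCH - 1) * NUM_CPUS
max_error_bytes = max_error_pages * PAGE_SIZE
allowed_bytes = TOTAL_RAM * 2 // 100  # the 2% allowance quoted above

print(max_error_pages)                                  # 7936
print(max_error_bytes // 1024**2)                       # 31 (MiB)
print(round(allowed_bytes / 1024**3, 2))                # 10.24 (GiB)
print(round(100 * max_error_bytes / allowed_bytes, 2))  # 0.3 (% of the allowance)
```

In other words, the worst-case misestimate is roughly 0.3% of the allowed dm-crypt allocation on that system.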
      
      This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
      and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
      as well.
      
       * Disable swap
       * Tune vm settings to promote regular writeback
           $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
           $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
           $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
      
       * Create 8 dm-crypt devices based on files on a tmpfs
       * Create and mount an ext4 filesystem on each crypt device
       * Run stress-ng --hdd 8 within one of the above filesystems
      
      Total %system usage collected from sysstat goes to ~35%. Write throughput
      on the underlying loop device is ~2GB/s. perf profiling an individual
      kworker kcryptd thread shows the following profile, indicating spinlock
      contention in percpu_counter_compare():
      
          99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
            |
            --ret_from_fork
              kthread
              worker_thread
              |
               --99.92%--process_one_work
                  |
                  |--80.52%--kcryptd_crypt
                  |    |
                  |    |--62.58%--mempool_alloc
                  |    |  |
                  |    |   --62.24%--crypt_page_alloc
                  |    |     |
                  |    |      --61.51%--__percpu_counter_compare
                  |    |        |
                  |    |         --61.34%--__percpu_counter_sum
                  |    |           |
                  |    |           |--58.68%--_raw_spin_lock_irqsave
                  |    |           |  |
                  |    |           |   --58.30%--native_queued_spin_lock_slowpath
                  |    |           |
                  |    |            --0.69%--cpumask_next
                  |    |                |
                  |    |                 --0.51%--_find_next_bit
                  |    |
                  |    |--10.61%--crypt_convert
                  |    |          |
                  |    |          |--6.05%--xts_crypt
                  ...
      
      After applying this patch and running the same test, %system usage is
      lowered to ~7% and write throughput on the loop device increases
      to ~2.7GB/s. perf report shows mempool_alloc() at ~8% rather than ~62%
      of the profile, and the percpu_counter spinlock is no longer hit.
      
          |--8.15%--mempool_alloc
          |    |
          |    |--3.93%--crypt_page_alloc
          |    |    |
          |    |     --3.75%--__alloc_pages
          |    |         |
          |    |          --3.62%--get_page_from_freelist
          |    |              |
          |    |               --3.22%--rmqueue_bulk
          |    |                   |
          |    |                    --2.59%--_raw_spin_lock
          |    |                      |
          |    |                       --2.57%--native_queued_spin_lock_slowpath
          |    |
          |     --3.05%--_raw_spin_lock_irqsave
          |               |
          |                --2.49%--native_queued_spin_lock_slowpath

      Suggested-by: DJ Gregor <dj@corelight.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
      Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 10 Aug, 2021 13 commits
    • dm: add documentation for IMA measurement support · 00d43995
      Tushar Sugandhi authored
      To interpret various DM target measurement data in IMA logs,
      a separate documentation page is needed under
      Documentation/admin-guide/device-mapper.
      
      Add documentation to help system administrators and attestation
      client/server component owners to interpret the measurement
      data generated by various DM targets, on various device/table state
      changes.
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: update target status functions to support IMA measurement · 8ec45662
      Tushar Sugandhi authored
      For device mapper targets to take advantage of IMA's measurement
      capabilities, the status functions for the individual targets need to be
      updated to handle the status_type_t case for value STATUSTYPE_IMA.
      
      Update status functions for the following target types, to log their
      respective attributes to be measured using IMA.
       01. cache
       02. crypt
       03. integrity
       04. linear
       05. mirror
       06. multipath
       07. raid
       08. snapshot
       09. striped
       10. verity
      
      For the rest of the targets, handle the STATUSTYPE_IMA case by setting the
      measurement buffer to NULL.
      
      For IMA to measure the data on a given system, the IMA policy on the
      system needs to be updated to have the following line, and the system
      needs to be restarted for the measurements to take effect.
      
      /etc/ima/ima-policy
       measure func=CRITICAL_DATA label=device-mapper template=ima-buf
      
      The measurements will be reflected in the IMA logs, which are located at:
      
      /sys/kernel/security/integrity/ima/ascii_runtime_measurements
      /sys/kernel/security/integrity/ima/binary_runtime_measurements
      
      These IMA logs can later be consumed by various attestation clients
      running on the system, which can send them to external services for
      attesting the system.
      
      The DM target data measured by IMA subsystem can alternatively
      be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
      DM_TABLE_STATUS_CMD.
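As a rough illustration of how an attestation client might read these entries, the sketch below parses one line of the ASCII log. It assumes the usual `ima-buf` layout of six whitespace-separated fields (PCR, template hash, template name, `algo:digest`, event name, hex-encoded buffer) and a sha256 digest. The sample line is synthetic, the event name `dm_table_load` is an assumption, and all function names are hypothetical.

```python
import hashlib


def parse_ima_buf_line(line):
    """Split one ima-buf entry of the ASCII IMA log into named fields."""
    fields = line.split()
    if len(fields) != 6 or fields[2] != "ima-buf":
        return None  # not an ima-buf entry (or not the layout assumed here)
    pcr, template_hash, _template, digest, event, buf_hex = fields
    return {
        "pcr": int(pcr),
        "template_hash": template_hash,
        "digest": digest,
        "event": event,
        "data": bytes.fromhex(buf_hex),
    }


def digest_matches(entry):
    """Check the recorded digest against the decoded buffer (sha256 assumed)."""
    algo, _, hexdigest = entry["digest"].partition(":")
    return algo == "sha256" and hashlib.sha256(entry["data"]).hexdigest() == hexdigest


def make_synthetic_line(event, payload):
    """Build a made-up log line for demonstration; the template hash is a placeholder."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"10 {'0' * 40} ima-buf sha256:{digest} {event} {payload.hex()}"


line = make_synthetic_line("dm_table_load", b"name=example;")
entry = parse_ima_buf_line(line)
print(entry["event"], entry["data"], digest_matches(entry))
```

On a real system the lines would come from /sys/kernel/security/integrity/ima/ascii_runtime_measurements, which normally requires root to read.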
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ima: measure data on device rename · 7d1d1df8
      Tushar Sugandhi authored
      A given block device is identified by its name and UUID.  However, both
      of these parameters can be changed.  For an external attestation service to
      correctly attest a given device, it needs to keep track of these rename
      events.
      
      Update the device data with the new values for IMA measurements.  Measure
      both old and new device name/UUID parameters in the same IMA measurement
      event, so that the old and the new values can be connected later.
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ima: measure data on table clear · 99169b93
      Tushar Sugandhi authored
      For a given block device, an inactive table slot contains the parameters
      to configure the device with.  The inactive table can be cleared
      multiple times, accidentally or maliciously, which may impact the
      functionality of the device, and compromise the system.  Therefore it is
      important to measure and log the event when a table is cleared.
      
      Measure device parameters, and table hashes when the inactive table slot
      is cleared.
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ima: measure data on device remove · 84010e51
      Tushar Sugandhi authored
      Presence of an active block-device, configured with expected parameters,
      is important for an external attestation service to determine if a system
      meets the attestation requirements.  Therefore it is important for DM to
      measure the device remove events.
      
      Measure device parameters and table hashes when the device is removed,
      using either remove or remove_all.
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ima: measure data on device resume · 8eb6fab4
      Tushar Sugandhi authored
      A given block device can load a table multiple times, with different
      input parameters, before eventually resuming it.  Further, a device may
      be suspended and then resumed.  The device may never resume after a
      table-load.  Because of the above valid scenarios for a given device,
      it is important to measure and log the device resume event using IMA.
      
      Also, if the table is large, measuring it in clear-text each time the
      device changes state will unnecessarily increase the size of the IMA log.
      Since the table clear-text is already measured during table-load event,
      measuring the hash during resume should be sufficient to validate the
      table contents.
      
      Measure the device parameters, and hash of the active table, when the
      device is resumed.
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ima: measure data on table load · 91ccbbac
      Tushar Sugandhi authored
      DM configures a block device with various target specific attributes
      passed to it as a table.  DM loads the table, and calls each target’s
      respective constructors with the attributes as input parameters.
      Some of these attributes are critical to ensure the device meets a
      certain security bar.  Thus, IMA should measure these attributes to
      ensure they are not tampered with during the lifetime of the device,
      so that external services can have high confidence in the
      configuration of the block devices on a given system.
      
      Some devices may have large tables.  And a given device may change its
      state (table-load, suspend, resume, rename, remove, table-clear etc.)
      many times.  Measuring these attributes each time when the device
      changes its state will significantly increase the size of the IMA logs.
      Further, once configured, these attributes are not expected to change
      unless a new table is loaded, or a device is removed and recreated.
      Therefore the clear-text of the attributes should only be measured
      during table load, and the hash of the active/inactive table should be
      measured for the remaining device state changes.
      
      Export IMA function ima_measure_critical_data() to allow measurement
      of DM device parameters, as well as target specific attributes, during
      table load.  Compute the hash of the inactive table and store it for
      measurements during future state changes.  If a load is called multiple
      times, update the inactive table hash with the hash of the latest
      populated table, so that the correct inactive table hash is measured
      when the device transitions to different states like resume, remove,
      rename, etc.
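The bookkeeping described above can be modelled in a few lines. This is a simplified userspace sketch, not the dm-ima code: the class, its method names, the event labels, and the choice of sha256 are all assumptions made for illustration.

```python
import hashlib


class DeviceMeasurements:
    """Toy model of per-device table-hash tracking (not the kernel implementation)."""

    def __init__(self, name):
        self.name = name
        self.inactive_hash = None
        self.active_hash = None
        self.log = []  # stand-in for the IMA log

    def table_load(self, table_text):
        # Clear text is recorded only at load time; the hash is kept for later events.
        self.inactive_hash = hashlib.sha256(table_text.encode()).hexdigest()
        self.log.append(("table_load", self.name, table_text))

    def resume(self):
        # The most recently loaded table becomes the active one.
        if self.inactive_hash is not None:
            self.active_hash = self.inactive_hash
            self.inactive_hash = None
        self.log.append(("device_resume", self.name, self.active_hash))

    def table_clear(self):
        # Later state changes record only the hash, never the clear text again.
        self.log.append(("table_clear", self.name, self.inactive_hash))
        self.inactive_hash = None


dev = DeviceMeasurements("example")
dev.table_load("0 1024 linear /dev/sda 0")
dev.table_load("0 2048 linear /dev/sda 0")  # a second load replaces the stored hash
dev.resume()
print(dev.log[-1][0], dev.active_hash[:12])
```

A verifier can then match the hash in a resume entry against the clear text recorded by the last load entry for the same device.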
      Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: add event counters · e3a35d03
      Mikulas Patocka authored
      Add 10 counters for various events (hit, miss, etc) and export them in
      the status line (accessed from userspace with "dmsetup status"). Also
      add a message "clear_stats" that resets these counters.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: report invalid return from writecache_map helpers · df699cc1
      Mikulas Patocka authored
      If some "writecache_map_*" function returns invalid state, it is a bug.
      So, we should report it and not fail silently.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: further writecache_map() cleanup · 15cb6f39
      Mike Snitzer authored
      Factor out writecache_map_flush() and writecache_map_discard() from
      writecache_map(). Also eliminate the various goto labels in
      writecache_map().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • Mike Snitzer's avatar
      4d020b3a
    • dm writecache: split up writecache_map() to improve code readability · cdd4d783
      Mike Snitzer authored
      writecache_map() has grown too large and can be confusing to read given
      all the goto statements.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • writeback: make the laptop_mode prototypes available unconditionally · 99d26de2
      Christoph Hellwig authored
      Fix the !CONFIG_BLOCK build after the recent cleanup.
      
      Fixes: 5ed964f8 ("mm: hide laptop_mode_wb_timer entirely behind the BDI API")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 09 Aug, 2021 14 commits
  5. 05 Aug, 2021 2 commits
    • loop: Select I/O scheduler 'none' from inside add_disk() · 2112f5c1
      Bart Van Assche authored
      We noticed that the user interface of Android devices becomes very slow
      under memory pressure. Several factors combine to cause this: Android
      uses the zram driver on top of the loop driver for swapping; under
      memory pressure the swap code alternates reads and writes quickly;
      mq-deadline is the default scheduler for loop devices; and mq-deadline
      delays writes by five seconds for such a workload with default settings.
      Fix this by making the kernel select I/O scheduler 'none' from inside
      add_disk() for loop devices. This default can be overridden at any time
      from user space, e.g. via a udev rule. This approach has an advantage
      compared to changing the I/O scheduler from user space from 'mq-deadline'
      to 'none', namely that synchronize_rcu() does not get called.
      
      This patch changes the default I/O scheduler for loop devices from
      'mq-deadline' to 'none'.
      
      Additionally, this patch reduces the Android boot time on my test setup
      by 0.5 seconds compared to configuring the loop I/O scheduler from user
      space.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Martijn Coenen <maco@android.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210805174200.3250718-3-bvanassche@acm.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flag · 90b71980
      Bart Van Assche authored
      elevator_get_default() uses the following algorithm to select an I/O
      scheduler from inside add_disk():
      - In case of a single hardware queue or if sharing hardware queues across
        multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
      - Otherwise, use 'none'.
      
      This is a good choice for most but not for all block drivers. Make it
      possible to override the selection of mq-deadline with a new flag,
      namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
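The selection rule described above, with the new override, can be sketched as a small decision function. This is a Python model for illustration only: the flag bit values are placeholders rather than the kernel's, and the function name is hypothetical.

```python
# Placeholder bit values for illustration; the kernel defines its own.
BLK_MQ_F_TAG_HCTX_SHARED = 1 << 0
BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 1


def default_scheduler(nr_hw_queues, flags=0):
    """Model of the default I/O scheduler choice made when a disk is added."""
    if flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT:
        return "none"         # the driver opted out of a default scheduler
    if nr_hw_queues == 1 or flags & BLK_MQ_F_TAG_HCTX_SHARED:
        return "mq-deadline"  # single queue, or hardware queues shared across request queues
    return "none"


print(default_scheduler(1))                                # mq-deadline
print(default_scheduler(4))                                # none
print(default_scheduler(1, BLK_MQ_F_NO_SCHED_BY_DEFAULT))  # none
```

A single-queue driver such as loop would otherwise get mq-deadline; setting the new flag lets it default to 'none' while user space can still choose another scheduler afterwards.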
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Martijn Coenen <maco@android.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 02 Aug, 2021 9 commits