push_rules: Implement bulk-checking of file sizes

The file size check compares each newly pushed blob's size against a configured threshold and, if the blob is bigger, rejects the ref update. This is an expensive check though: we need to go through all preexisting refs as well as all new refs in order to find new blobs via a graph walk. As such, this check scales not only with the number of changes, but also with the size of the repository itself.

Now that `#new_blobs` knows how to handle multiple new revisions at once in a single RPC call to Gitaly, we can convert this check to use a single bulk-load of new blobs. While this doesn't help much with walking the positive side of the graph walk, it does amortize the negative walk of all preexisting refs and will thus in most cases result in a significant speedup if multiple changes are to be checked.

Ideally, we'd go even further and enumerate new blobs directly via the quarantine directory: we wouldn't have to do a graph walk at all in this case, but could just directly look up all new blobs. While this would be as fast as we can get, the downside is that we wouldn't have blob paths available anymore, given that these blobs wouldn't have been walked via a tree object. We would still be able to at least present the blob ID to the user, but the user experience is definitely worse in this case.

We may still decide to take this step at a later point given that it is a huge performance win (e.g. on gitlab-org/gitlab, we're talking about 10ms vs 30s). But for now, this commit only does the uncontroversial part of batch-computing new blobs.

Changelog: performance

8a5681f2 · Patrick Steinhardt · 081d1780 · 8a5681f2
Commit 8a5681f2 authored Sep 02, 2021 by Patrick Steinhardt

Showing with 7 additions and 11 deletions

ee/lib/ee/gitlab/checks/push_rules/file_size_check.rb +7 -11

--- a/ee/lib/ee/gitlab/checks/push_rules/file_size_check.rb
+++ b/ee/lib/ee/gitlab/checks/push_rules/file_size_check.rb
@@ -13,20 +13,16 @@ module EE
            logger.log_timed(LOG_MESSAGE) do
              max_file_size = push_rule.max_file_size

-              changes_access.changes.each do |change|
-                newrev = change[:newrev]
+              newrevs = changes_access.changes.map { |change| change[:newrev] }

-                next if newrev.blank? || ::Gitlab::Git.blank_ref?(newrev)
+              blobs = project.repository.new_blobs(newrevs, dynamic_timeout: logger.time_left)

-                blobs = project.repository.new_blobs(newrev, dynamic_timeout: logger.time_left)
-
-                large_blob = blobs.find do |blob|
-                  ::Gitlab::Utils.bytes_to_megabytes(blob.size) > max_file_size
-                end
+              large_blob = blobs.find do |blob|
+                ::Gitlab::Utils.bytes_to_megabytes(blob.size) > max_file_size
+              end

-                if large_blob
-                  raise ::Gitlab::GitAccess::ForbiddenError, %Q{File "#{large_blob.path}" is larger than the allowed size of #{max_file_size} MB}
-                end
+              if large_blob
+                raise ::Gitlab::GitAccess::ForbiddenError, %Q{File "#{large_blob.path}" is larger than the allowed size of #{max_file_size} MB}
              end
            end
          end
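
The amortization argument from the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not GitLab's or Gitaly's implementation: `ToyRepository`, `first_large_blob`, the `BLOBS` table and the byte-based threshold are all hypothetical, and the model simply assumes that every `new_blobs` call pays for one negative walk over the preexisting refs.

```ruby
# Toy model (hypothetical, not GitLab code): counts how often the expensive
# negative walk over preexisting refs would run for each calling style.
class ToyRepository
  attr_reader :negative_walks

  # blobs_by_rev: hypothetical mapping of revision => [[path, size_in_bytes], ...]
  def initialize(blobs_by_rev)
    @blobs_by_rev = blobs_by_rev
    @negative_walks = 0
  end

  # Accepts a single revision or an array of revisions.
  def new_blobs(newrevs)
    revs = Array(newrevs)
    return [] if revs.empty?

    @negative_walks += 1 # assumption: one walk of existing refs per call
    revs.flat_map { |rev| @blobs_by_rev.fetch(rev, []) }
  end
end

BLOBS = {
  'rev-a' => [['README.md', 1_024]],
  'rev-b' => [['video.bin', 5 * 1024 * 1024]],
  'rev-c' => [['notes.txt', 2_048]]
}.freeze

MAX_BYTES = 1024 * 1024

# Returns the first [path, size] pair above the threshold, or nil.
def first_large_blob(blobs, max_bytes)
  blobs.find { |_path, size| size > max_bytes }
end

# Old shape: one call, and therefore one negative walk, per change.
per_change = ToyRepository.new(BLOBS)
BLOBS.each_key { |rev| first_large_blob(per_change.new_blobs(rev), MAX_BYTES) }

# New shape: all new revisions in a single call.
bulk = ToyRepository.new(BLOBS)
large = first_large_blob(bulk.new_blobs(BLOBS.keys), MAX_BYTES)

puts "per-change negative walks: #{per_change.negative_walks}"
puts "bulk negative walks:       #{bulk.negative_walks}"
puts "offending blob:            #{large.first}"
```

In this model the per-change variant performs three negative walks for three changes while the bulk variant performs one, and both find the same oversized blob.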