    pull: Speedup fetching by prebuilding index of objects we already have at start · 3efed898
    Kirill Smelkov authored
    As was already said in 899103bf (pull: Switch from porcelain `git
    fetch` to plumbing `git fetch-pack` + friends), `git-backup pull` has become
    slow on lab.nexedi.com, and most of the slowness was tracked down to the fact
    that `git fetch`, for every pulled repository, does a linear scan of the whole
    backup repository history just to find out that there is usually nothing to
    fetch. Quoting 899103bf:
    
    """
        `git fetch`, before fetching data from remote repository, first checks
        whether it already locally has all the objects remote advertises. This
        boils down to running
    
    	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all
    
        and checking whether it succeeds or not:
    
    	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8
    
        The "--not --all" in the query means that objects should be not
        reachable from all locally existing refs and is implemented by linearly
        scanning from tip of those existing refs and marking objects reachable
        from there as "do not print".
    
        In the case of git-backup, where we mostly have master, which is a super
        commit merging the whole histories of all projects and the backup history,
        linearly scanning from such a tip goes through lots of commits, up to the
        point where fetching a small, outdated repository, which was already pulled
        into the backup and has not changed for a long time, takes more than 30
        seconds, with almost 100% of that time being spent in quickfetch() only.
    """
    
    The solution is to build the index of objects we already have ourselves, only
    once at startup, and then in fetch, after checking the lsremote output, consult
    that index: if we see that we already have everything for an advertised
    reference, we just avoid giving it to fetch-pack to process. It turns out that
    for many pulled repositories no references have changed at all, and this way
    fetch-pack can be skipped completely. This leads to a dramatic speedup: before,
    `gitlab-backup pull` was taking ~ 2 hours, and now it takes something under
    ~ 5 minutes.
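
    For illustration, a minimal sketch of such filtering, assuming that having an
    advertised sha1 in the local object index means everything reachable from it
    was already fetched; Sha1, Ref and filterAdvertised are hypothetical names,
    not the actual git-backup API:

    	// pull is an illustrative package name; the index is a plain set here.
    	package pull

    	// Sha1 is a raw 20-byte object name.
    	type Sha1 [20]byte

    	// Ref is one reference advertised by ls-remote: its name and the sha1
    	// it points to.
    	type Ref struct {
    		Name string
    		Sha1 Sha1
    	}

    	// filterAdvertised returns only the advertised refs whose objects are
    	// not yet in the local index. If the result is empty, fetch-pack can be
    	// skipped for this repository entirely.
    	func filterAdvertised(have map[Sha1]struct{}, advertised []Ref) []Ref {
    		var missing []Ref
    		for _, ref := range advertised {
    			if _, ok := have[ref.Sha1]; !ok {
    				missing = append(missing, ref)
    			}
    		}
    		return missing
    	}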
    
    The index building itself takes ~ 30 seconds - the time which we were
    previously spending to fetch from just 1 unchanged repository. The index is
    small and so it can all be kept in RAM - please see the details in the code
    comments on this.
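
    To illustrate why it fits in RAM: at 20 raw bytes per sha1, one million objects
    take only ~20 MB, e.g. kept as a sorted slice queried with binary search. A
    hypothetical sketch (ObjIndex, Build and Contains are illustrative names only,
    not the actual data structures used):

    	// objindex is an illustrative package; a sorted []Sha1 with binary
    	// search is one compact way to keep the set of present objects in RAM.
    	package objindex

    	import (
    		"bytes"
    		"sort"
    	)

    	// Sha1 is a raw 20-byte object name.
    	type Sha1 [20]byte

    	// ObjIndex is a sorted list of sha1s of objects present in the backup
    	// repository.
    	type ObjIndex struct {
    		sha1v []Sha1
    	}

    	// Build sorts the collected sha1s so that Contains can binary-search.
    	func Build(sha1v []Sha1) *ObjIndex {
    		sort.Slice(sha1v, func(i, j int) bool {
    			return bytes.Compare(sha1v[i][:], sha1v[j][:]) < 0
    		})
    		return &ObjIndex{sha1v: sha1v}
    	}

    	// Contains reports whether sha1 is in the index.
    	func (idx *ObjIndex) Contains(sha1 Sha1) bool {
    		i := sort.Search(len(idx.sha1v), func(i int) bool {
    			return bytes.Compare(idx.sha1v[i][:], sha1[:]) >= 0
    		})
    		return i < len(idx.sha1v) && idx.sha1v[i] == sha1
    	}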
    
    I initially wanted to speed up fetching by teaching `git fetch-objects` to
    consult the backup repo bitmap reachability index (if, for a commit, there is
    an entry in this index, we know we already have all objects reachable from
    that commit and can skip fetching it). This won't, however, work fully for
    all our refs: about 40% of them are tags, and since in the backup repository
    we don't keep tag objects - we keep tags/trees/blobs encoded as commits - the
    sha1 of those 40% of references pointing to tags won't be in the bitmap index.
    
    So just do the indexing ourselves.