pull: Speedup fetching by prebuilding index of objects we already have at start
Like it was already said in 899103bf (pull: Switch from porcelain `git fetch` to plumbing `git fetch-pack` + friends) currently on lab.nexedi.com `git-backup pull` became slow and most of the slowness was tracked down to the fact that `git fetch` for every pulled repository does linear scan of whole backup repository history just to find out there is usually nothing to fetch. Quoting 899103bf: """ `git fetch`, before fetching data from remote repository, first checks whether it already locally has all the objects remote advertises. This boils down to running echo $remote_tips | git rev-list --quiet --objects --stdin --not --all and checking whether it succeeds or not: https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671 https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925 https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8 The "--not --all" in the query means that objects should be not reachable from all locally existing refs and is implemented by linearly scanning from tip of those existing refs and marking objects reachable from there as "do not print". In case of git-backup, where we have mostly master which is super commit merging from whole histories of all projects and from backup history, linearly scanning from such a tip goes through lots of commits. Up to the point where fetching a small, outdated repository, which was already pulled into backup and did not changed since long, takes more than 30 seconds with almost 100% of that time being spent in quickfetch() only. """ The solution is that we can build index of objects we already have ourselves only once at startup, and then in fetch, after checking lsremote output, consult that index, and if we see we already have everything for an advertised reference - just avoid giving it to fetch-pack to process. It turns out for many pulled repositories there is no references changed at all and this way fetch-pack can be skipped completely. This leads to dramatical speedup: before `gitlab-backup pull` was taking ~ 2 hours, and now something under ~ 5 minutes. The index building itself takes ~ 30 seconds - the time which we were previously spending to fetch just from 1 unchanged repository. The index size is small and so it all can be kept in RAM - please see details in the code comments on this. I initially wanted to speedup fetching by teaching `git fetch-objects` to consult backup repo bitmap reachability index (if, for a commit, we can see that there is an entry in this index -> we know we already have all reachable objects for this commit and can skip fetching). This won't however work fully for all our refs - 40% of them are mostly tags, and since in the backup repository we don't keep tag objects - we keep tags/tree/blobs encoded as commits - sha1 of those 40% references to tags won't be in bitmap index. So just do the indexing ourselves.
Showing
Please register or sign in to comment