Commit 3efed898 authored by Kirill Smelkov

pull: Speedup fetching by prebuilding index of objects we already have at start

As was already said in 899103bf (pull: Switch from porcelain `git
fetch` to plumbing `git fetch-pack` + friends), currently on
lab.nexedi.com `git-backup pull` became slow, and most of the slowness
was tracked down to the fact that `git fetch`, for every pulled repository,
does a linear scan of the whole backup repository history just to find out
that there is usually nothing to fetch. Quoting 899103bf:

"""
    `git fetch`, before fetching data from remote repository, first checks
    whether it already locally has all the objects remote advertises. This
    boils down to running

	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all

    and checking whether it succeeds or not:

	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8

    The "--not --all" in the query means that objects should not be
    reachable from any locally existing ref, and is implemented by linearly
    scanning from the tips of those existing refs and marking objects
    reachable from there as "do not print".

    In case of git-backup, where we have mostly master, which is a super
    commit merging whole histories of all projects and the backup history,
    linearly scanning from such a tip goes through lots of commits - up to
    the point where fetching a small, outdated repository, which was already
    pulled into backup and has not changed for a long time, takes more than
    30 seconds, with almost 100% of that time being spent in quickfetch() only.
"""

The solution is to build the index of objects we already have ourselves, only
once at startup, and then in fetch, after checking ls-remote output, consult
that index: if we see we already have everything for an advertised
reference, we simply avoid giving it to fetch-pack to process. It turns out
that for many pulled repositories no references change at all, and this way
fetch-pack can be skipped completely. This leads to a dramatic speedup:
before, `gitlab-backup pull` was taking ~ 2 hours; now it takes under ~ 5 minutes.

Building the index itself takes ~ 30 seconds - the time we were previously
spending to fetch from just 1 unchanged repository. The index is small, so
it can all be kept in RAM - please see details on this in the code comments.

I initially wanted to speed up fetching by teaching `git fetch-objects` to
consult the backup repo's bitmap reachability index (if, for a commit, we can
see that there is an entry in this index, we know we already have all objects
reachable from that commit and can skip fetching it). This won't, however,
work for all our refs: ~ 40% of them are tags, and since in the backup
repository we don't keep tag objects - tags/trees/blobs are kept encoded as
commits - the sha1s of those 40% of references, which point to tags, won't be
in the bitmap index.

So just do the indexing ourselves.
parent 1be6aaaa
@@ -377,12 +377,67 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
xgit("update-ref", backup_lock, mktree_empty(), Sha1{})
// make sure there is root commit
gerr, _, _ := ggit("rev-parse", "--verify", "HEAD")
var HEAD Sha1
var err error
gerr, __, _ := ggit("rev-parse", "--verify", "HEAD")
if gerr != nil {
infof("# creating root commit")
// NOTE `git commit` does not work in bare repo - do commit by hand
commit := xcommit_tree(gb, mktree_empty(), []Sha1{}, "Initialize git-backup repository")
xgit("update-ref", "-m", "git-backup pull init", "HEAD", commit)
HEAD = xcommit_tree(gb, mktree_empty(), []Sha1{}, "Initialize git-backup repository")
xgit("update-ref", "-m", "git-backup pull init", "HEAD", HEAD)
} else {
HEAD, err = Sha1Parse(__)
exc.Raiseif(err)
}
// build index of "already-have" objects: all commits + tag/tree/blob that
// were at heads of already pulled repositories.
//
// Build it once and use below to check ourselves whether a head from a pulled
// repository needs to be actually fetched. If we don't, `git fetch-pack`
// will do similar to "all commits" linear scan for every pulled repository,
// which are many out there.
alreadyHave := Sha1Set{}
infof("# building \"already-have\" index")
// already have: all commits
//
// As of lab.nexedi.com/20180612 there are ~ 1.7·10⁷ objects total in backup.
// Of those there are ~ 1.9·10⁶ commit objects, i.e. ~10% of total.
// Since 1 sha1 is 2·10¹ bytes, the space needed for keeping sha1 of all
// commits is ~ 4·10⁷B = ~40MB. It is thus ok to keep this index in RAM for now.
for _, __ := range xstrings.SplitLines(xgit("rev-list", HEAD), "\n") {
sha1, err := Sha1Parse(__)
exc.Raiseif(err)
alreadyHave.Add(sha1)
}
// already have: tag/tree/blob that were at heads of already pulled repositories
//
// As of lab.nexedi.com/20180612 there are ~ 8.4·10⁴ refs in total.
// Of those encoded tag/tree/blob are ~ 3.2·10⁴, i.e. ~40% of total.
// The number of tag/tree/blob objects in alreadyHave is thus negligible
// compared to the number of "all commits".
hcommit, err := gb.LookupCommit(HEAD.AsOid())
exc.Raiseif(err)
htree, err := hcommit.Tree()
exc.Raiseif(err)
if htree.EntryByName("backup.refs") != nil {
repotab, err := loadBackupRefs(fmt.Sprintf("%s:backup.refs", HEAD))
exc.Raiseif(err)
for _, repo := range repotab {
for _, xref := range repo.refs {
if xref.sha1 != xref.sha1_ && !alreadyHave.Contains(xref.sha1) {
// make sure encoded tag/tree/blob objects, represented as
// commits, are present. We do so because we promise that
// all objects reachable from alreadyHave are present.
obj_recreate_from_commit(gb, xref.sha1_)
alreadyHave.Add(xref.sha1)
}
}
}
}
// walk over specified dirs, pulling objects from git and blobbing non-git-object files
@@ -435,7 +490,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
// git repo - let's pull all refs from it to our backup refs namespace
infof("# git %s\t<- %s", prefix, path)
refv, err := fetch(path)
refv, _, err := fetch(path, alreadyHave)
exc.Raiseif(err)
reporefprefix := backup_refs_work +
@@ -531,8 +586,6 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
// index is ready - prepare tree and commit
backup_tree_sha1 := xgitSha1("write-tree")
HEAD := xgitSha1("rev-parse", "HEAD")
commit_sha1 := xcommit_tree(gb, backup_tree_sha1, append([]Sha1{HEAD}, backup_refs_parentv...),
"Git-backup " + backup_time)
@@ -545,7 +598,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
}
xgit("update-ref", "--stdin", RunWith{stdin: backup_refs_delete})
__ := xgit("for-each-ref", backup_refs_work)
__ = xgit("for-each-ref", backup_refs_work)
if __ != "" {
exc.Raisef("Backup refs under %s not deleted properly", backup_refs_work)
}
@@ -566,7 +619,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
// We can avoid quadratic behaviour via removing refs from just
// pulled repo right after the pull.
gitdir := xgit("rev-parse", "--git-dir")
err := os.RemoveAll(gitdir+"/"+backup_refs_work)
err = os.RemoveAll(gitdir+"/"+backup_refs_work)
exc.Raiseif(err) // NOTE err is nil if path does not exist
// if we have working copy - update it
@@ -595,25 +648,45 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
// repository in question. The objects considered to fetch are those, that are
// reachable from all repository references.
//
// Returned is list of all references in source repository.
// AlreadyHave can be given to indicate knowledge on what objects our repository
// already has. If remote advertises tip with sha1 in alreadyHave, that tip won't be
// fetched. Notice: alreadyHave is consulted directly - no reachability scan is
// performed on it.
//
// All objects reachable from alreadyHave must be in our repository.
// AlreadyHave does not need to be complete - if we have something that is not
// in alreadyHave - it can affect only speed, not correctness.
//
// Returned are 2 lists of references from the source repository:
//
// - list of all references, and
// - list of references we actually had to fetch.
//
// Note: fetch does not create any local references - the references returned
// only describe state of references in fetched source repository.
func fetch(repo string) (refv []Ref, err error) {
func fetch(repo string, alreadyHave Sha1Set) (refv, fetchedv []Ref, err error) {
defer xerr.Contextf(&err, "fetch %s", repo)
// first check which references are advertised
refv, err = lsremote(repo)
if err != nil {
return nil, err
return nil, nil, err
}
// check if we already have something
var fetchv []Ref // references we need to actually fetch.
for _, ref := range refv {
if !alreadyHave.Contains(ref.sha1) {
fetchv = append(fetchv, ref)
}
}
// if there is nothing to fetch - we are done
if len(refv) == 0 {
return refv, nil
if len(fetchv) == 0 {
return refv, fetchv, nil
}
// fetch all those advertised objects by sha1.
// fetch by sha1 what we don't already have from advertised.
//
// even if refs would change after ls-remote but before here, we should be
// getting exactly what was advertised.
@@ -636,14 +709,14 @@ func fetch(repo string) (refv []Ref, err error) {
" upload-pack",
repo)
for _, ref := range refv {
for _, ref := range fetchv {
arg(ref.sha1)
}
arg(RunWith{stderr: gitprogress()})
gerr, _, _ := ggit(argv...)
if gerr != nil {
return nil, gerr
return nil, nil, gerr
}
// fetch-pack ran ok - now check that all fetched tips are indeed fully
@@ -660,18 +733,18 @@ func fetch(repo string) (refv []Ref, err error) {
// https://git.kernel.org/pub/scm/git/git.git/commit/?h=6d4bb3833c
argv = nil
arg("rev-list", "--quiet", "--objects", "--not", "--all", "--not")
for _, ref := range refv {
for _, ref := range fetchv {
arg(ref.sha1)
}
arg(RunWith{stderr: gitprogress()})
gerr, _, _ = ggit(argv...)
if gerr != nil {
return nil, fmt.Errorf("remote did not send all necessary objects")
return nil, nil, fmt.Errorf("remote did not send all necessary objects")
}
// fetched ok
return refv, nil
return refv, fetchv, nil
}
// lsremote lists all references advertised by repo.
......
@@ -160,44 +160,63 @@ func TestPullRestore(t *testing.T) {
}
}
// verify no garbage is left under refs/backup/
dentryv, err := ioutil.ReadDir("refs/backup/")
if err != nil && !os.IsNotExist(err) {
t.Fatal(err)
}
if len(dentryv) != 0 {
namev := []string{}
for _, fi := range dentryv {
namev = append(namev, fi.Name())
// checks / cleanups after cmd_pull
afterPull := func() {
// verify no garbage is left under refs/backup/
dentryv, err := ioutil.ReadDir("refs/backup/")
if err != nil && !os.IsNotExist(err) {
t.Fatal(err)
}
if len(dentryv) != 0 {
namev := []string{}
for _, fi := range dentryv {
namev = append(namev, fi.Name())
}
t.Fatalf("refs/backup/ not empty after pull: %v", namev)
}
t.Fatalf("refs/backup/ not empty after pull: %v", namev)
}
// prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
xgit("prune")
// prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
xgit("prune")
// verify backup repo is all ok
xgit("fsck")
// verify backup repo is all ok
xgit("fsck")
// verify that just pulled tag objects are now gone after pruning -
// - they become not directly git-present. The only possibility to
// get them back is via recreating from encoded commit objects.
for _, nc := range noncommitv {
if !nc.istag {
continue
// verify that just pulled tag objects are now gone after pruning -
// - they become not directly git-present. The only possibility to
// get them back is via recreating from encoded commit objects.
for _, nc := range noncommitv {
if !nc.istag {
continue
}
gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
if gerr == nil {
t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
}
}
gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
if gerr == nil {
t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
// reopen backup repository - to avoid having stale cache with present
// objects we deleted above with `git prune`
gb, err = git.OpenRepository(".")
if err != nil {
t.Fatal(err)
}
}
// reopen backup repository - to avoid having stale cache with present
// objects we deleted above with `git prune`
gb, err = git.OpenRepository(".")
if err != nil {
t.Fatal(err)
afterPull()
// pull again - it should be a content-wise noop (a new backup commit is still created)
h1 := xgitSha1("rev-parse", "HEAD")
cmd_pull(gb, []string{my1+":b1"})
afterPull()
h2 := xgitSha1("rev-parse", "HEAD")
if h1 == h2 {
t.Fatal("pull: second run did not adjust HEAD")
}
δ12 := xgit("diff", h1, h2)
if δ12 != "" {
t.Fatalf("pull: second run was not noop: δ:\n%s", δ12)
}
// restore backup
work1 := workdir + "/1"
......