pull: Speedup fetching by prebuilding index of objects we already have at start

As already noted in 899103bf (pull: Switch from porcelain `git fetch` to
plumbing `git fetch-pack` + friends), `git-backup pull` on lab.nexedi.com
became slow, and most of the slowness was tracked down to the fact that
`git fetch`, for every pulled repository, does a linear scan of the whole
backup repository history just to find out that there is usually nothing
to fetch.

Quoting 899103bf:

"""
`git fetch`, before fetching data from remote repository, first checks
whether it already locally has all the objects remote advertises. This
boils down to running

    echo $remote_tips | git rev-list --quiet --objects --stdin --not --all

and checking whether it succeeds or not:

    https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8

The "--not --all" in the query means that objects should be not reachable
from all locally existing refs and is implemented by linearly scanning
from tip of those existing refs and marking objects reachable from there
as "do not print".

In case of git-backup, where we have mostly master which is super commit
merging from whole histories of all projects and from backup history,
linearly scanning from such a tip goes through lots of commits. Up to the
point where fetching a small, outdated repository, which was already
pulled into backup and did not changed since long, takes more than 30
seconds with almost 100% of that time being spent in quickfetch() only.
"""

The solution is to build an index of the objects we already have
ourselves, only once at startup, and then in fetch, after checking the
lsremote output, consult that index: if we see that we already have
everything for an advertised reference, we simply do not give it to
fetch-pack to process.
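
The filtering step described above can be sketched roughly as follows.
This is a minimal, self-contained illustration and not the code from
git-backup.go: `Sha1` is simplified to a string, `Sha1Set` is modelled as
a plain map, and the helper name `needFetch` is hypothetical.

```go
package main

import "fmt"

// Sha1 is a simplified stand-in for git-backup's Sha1 type
// (assumption: a hex string instead of a raw 20-byte value).
type Sha1 string

// Sha1Set is a minimal set of object ids (assumption: a plain map).
type Sha1Set map[Sha1]struct{}

func (s Sha1Set) Add(v Sha1)           { s[v] = struct{}{} }
func (s Sha1Set) Contains(v Sha1) bool { _, ok := s[v]; return ok }

// Ref pairs a reference name with the object it points to.
type Ref struct {
	name string
	sha1 Sha1
}

// needFetch (hypothetical helper) returns the advertised refs whose tips
// are not in alreadyHave. The set is consulted directly: no reachability
// scan is performed.
func needFetch(advertised []Ref, alreadyHave Sha1Set) []Ref {
	var fetchv []Ref
	for _, ref := range advertised {
		if !alreadyHave.Contains(ref.sha1) {
			fetchv = append(fetchv, ref)
		}
	}
	return fetchv
}

func main() {
	alreadyHave := Sha1Set{}
	alreadyHave.Add("1111111111111111111111111111111111111111")

	advertised := []Ref{
		{name: "refs/heads/master", sha1: "1111111111111111111111111111111111111111"},
		{name: "refs/tags/v1.0", sha1: "2222222222222222222222222222222222222222"},
	}

	// Only refs with unknown tips would be handed to fetch-pack.
	for _, ref := range needFetch(advertised, alreadyHave) {
		fmt.Println("need to fetch:", ref.name, ref.sha1)
	}
}
```

If the returned list is empty, fetch-pack does not need to be run for
that repository at all.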

It turns out that for many pulled repositories no references have changed
at all, so fetch-pack can be skipped completely. This leads to a dramatic
speedup: `gitlab-backup pull` used to take ~ 2 hours and now takes under
~ 5 minutes. Building the index itself takes ~ 30 seconds - the time we
previously spent fetching from just 1 unchanged repository. The index is
small enough to be kept entirely in RAM - please see the code comments
for details.

I initially wanted to speed up fetching by teaching `git fetch-objects` to
consult the backup repo bitmap reachability index (if, for a commit, there
is an entry in this index -> we know we already have all objects reachable
from this commit and can skip fetching). This however won't work fully for
all our refs - 40% of them are mostly tags, and since in the backup
repository we don't keep tag objects - we keep tags/tree/blobs encoded as
commits - the sha1 of those 40% of references to tags won't be in the
bitmap index. So just do the indexing ourselves.
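
The claim that the index fits in RAM can be checked with a quick
back-of-the-envelope computation. The sketch below uses the approximate
counts quoted in the code comments of this commit (~ 1.9·10⁶ commits and
~ 3.2·10⁴ encoded tag/tree/blob refs) and assumes 20 bytes per raw sha1.
The helper name `indexBytes` is hypothetical, and the estimate ignores
the per-entry overhead of whatever set implementation is actually used.

```go
package main

import "fmt"

// sha1Size is the size of one raw SHA-1 in bytes.
const sha1Size = 20

// indexBytes (hypothetical helper) estimates the raw payload of an index
// holding n object ids. Per-entry overhead of the set is not counted.
func indexBytes(n int) int {
	return n * sha1Size
}

func main() {
	commits := 1900000 // ~ 1.9·10⁶ commit objects (figure from the code comments)
	encoded := 32000   // ~ 3.2·10⁴ encoded tag/tree/blob refs (same source)

	fmt.Printf("commits:          ~%.1f MB\n", float64(indexBytes(commits))/1e6)
	fmt.Printf("tag/tree/blob:    ~%.1f MB\n", float64(indexBytes(encoded))/1e6)
	fmt.Printf("total (raw sha1): ~%.1f MB\n", float64(indexBytes(commits+encoded))/1e6)
}
```

The commit part dominates, which matches the remark in the code that the
number of tag/tree/blob entries is negligible compared to "all commits".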

Commit 3efed898 authored Jun 12, 2018 by Kirill Smelkov

Showing with 140 additions and 48 deletions

git-backup.go +92 -19
git-backup_test.go +48 -29

--- a/git-backup.go
+++ b/git-backup.go
@@ -377,12 +377,67 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
    xgit("update-ref", backup_lock, mktree_empty(), Sha1{})

    // make sure there is root commit
-    gerr, _, _ := ggit("rev-parse", "--verify", "HEAD")
+    var HEAD Sha1
+    var err  error
+    gerr, __, _ := ggit("rev-parse", "--verify", "HEAD")
    if gerr != nil {
        infof("# creating root commit")
        // NOTE `git commit` does not work in bare repo - do commit by hand
-        commit := xcommit_tree(gb, mktree_empty(), []Sha1{}, "Initialize git-backup repository")
-        xgit("update-ref", "-m", "git-backup pull init", "HEAD", commit)
+        HEAD = xcommit_tree(gb, mktree_empty(), []Sha1{}, "Initialize git-backup repository")
+        xgit("update-ref", "-m", "git-backup pull init", "HEAD", HEAD)
+    } else {
+        HEAD, err = Sha1Parse(__)
+        exc.Raiseif(err)
+    }
+
+    // build index of "already-have" objects: all commits + tag/tree/blob that
+    // were at heads of already pulled repositories.
+    //
+    // Build it once and use below to check ourselves whether a head from a pulled
+    // repository actually needs to be fetched. If we don't, `git fetch-pack`
+    // will do a similar linear scan over all commits for every pulled
+    // repository, and there are many of them.
+    alreadyHave := Sha1Set{}
+    infof("# building \"already-have\" index")
+
+    // already have: all commits
+    //
+    // As of lab.nexedi.com/20180612 there are ~ 1.7·10⁷ objects total in backup.
+    // Of those there are ~ 1.9·10⁶ commit objects, i.e. ~10% of total.
+    // Since 1 sha1 is 2·10¹ bytes, the space needed for keeping sha1 of all
+    // commits is ~ 4·10⁷B = ~40MB. It is thus ok to keep this index in RAM for now.
+    for _, __ := range xstrings.SplitLines(xgit("rev-list", HEAD), "\n") {
+        sha1, err := Sha1Parse(__)
+        exc.Raiseif(err)
+        alreadyHave.Add(sha1)
+    }
+
+    // already have: tag/tree/blob that were at heads of already pulled repositories
+    //
+    // As of lab.nexedi.com/20180612 there are ~ 8.4·10⁴ refs in total.
+    // Of those encoded tag/tree/blob are ~ 3.2·10⁴, i.e. ~40% of total.
+    // The number of tag/tree/blob objects in alreadyHave is thus negligible
+    // compared to the number of "all commits".
+    hcommit, err := gb.LookupCommit(HEAD.AsOid())
+    exc.Raiseif(err)
+    htree, err := hcommit.Tree()
+    exc.Raiseif(err)
+    if htree.EntryByName("backup.refs") != nil {
+        repotab, err := loadBackupRefs(fmt.Sprintf("%s:backup.refs", HEAD))
+        exc.Raiseif(err)
+
+        for _, repo := range repotab {
+            for _, xref := range repo.refs {
+                if xref.sha1 != xref.sha1_ && !alreadyHave.Contains(xref.sha1) {
+                    // make sure tag/tree/blob objects that are kept encoded as
+                    // commits are actually present. We do so because we promise
+                    // fetch() that all objects in alreadyHave are present.
+                    obj_recreate_from_commit(gb, xref.sha1_)
+
+                    alreadyHave.Add(xref.sha1)
+                }
+            }
+        }
    }

    // walk over specified dirs, pulling objects from git and blobbing non-git-object files
@@ -435,7 +490,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {

            // git repo - let's pull all refs from it to our backup refs namespace
            infof("# git  %s\t<- %s", prefix, path)
-            refv, err := fetch(path)
+            refv, _, err := fetch(path, alreadyHave)
            exc.Raiseif(err)

            reporefprefix := backup_refs_work +
@@ -531,8 +586,6 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {

    // index is ready - prepare tree and commit
    backup_tree_sha1 := xgitSha1("write-tree")
-
-    HEAD := xgitSha1("rev-parse", "HEAD")
    commit_sha1 := xcommit_tree(gb, backup_tree_sha1, append([]Sha1{HEAD}, backup_refs_parentv...),
            "Git-backup " + backup_time)

@@ -545,7 +598,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
    }

    xgit("update-ref", "--stdin", RunWith{stdin: backup_refs_delete})
-    __ := xgit("for-each-ref", backup_refs_work)
+    __ = xgit("for-each-ref", backup_refs_work)
    if __ != "" {
        exc.Raisef("Backup refs under %s not deleted properly", backup_refs_work)
    }
@@ -566,7 +619,7 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
    //       We can avoid quadratic behaviour via removing refs from just
    //       pulled repo right after the pull.
    gitdir := xgit("rev-parse", "--git-dir")
-    err := os.RemoveAll(gitdir+"/"+backup_refs_work)
+    err = os.RemoveAll(gitdir+"/"+backup_refs_work)
    exc.Raiseif(err) // NOTE err is nil if path does not exist

    // if we have working copy - update it
@@ -595,25 +648,45 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
 // repository in question. The objects considered to fetch are those, that are
 // reachable from all repository references.
 //
-// Returned is list of all references in source repository.
+// AlreadyHave can be given to indicate knowledge on what objects our repository
+// already has. If remote advertises tip with sha1 in alreadyHave, that tip won't be
+// fetched. Notice: alreadyHave is consulted directly - no reachability scan is
+// performed on it.
+//
+// All objects reachable from alreadyHave must be in our repository.
+// AlreadyHave does not need to be complete - if we have something that is not
+// in alreadyHave - it can affect only speed, not correctness.
+//
+// Returned are 2 lists of references from the source repository:
+//
+//  - list of all references, and
+//  - list of references we actually had to fetch.
 //
 // Note: fetch does not create any local references - the references returned
 // only describe state of references in fetched source repository.
-func fetch(repo string) (refv []Ref, err error) {
+func fetch(repo string, alreadyHave Sha1Set) (refv, fetchedv []Ref, err error) {
    defer xerr.Contextf(&err, "fetch %s", repo)

    // first check which references are advertised
    refv, err = lsremote(repo)
    if err != nil {
-        return nil, err
+        return nil, nil, err
+    }
+
+    // check if we already have something
+    var fetchv []Ref // references we need to actually fetch.
+    for _, ref := range refv {
+        if !alreadyHave.Contains(ref.sha1) {
+            fetchv = append(fetchv, ref)
+        }
    }

    // if there is nothing to fetch - we are done
-    if len(refv) == 0 {
-        return refv, nil
+    if len(fetchv) == 0 {
+        return refv, fetchv, nil
    }

-    // fetch all those advertised objects by sha1.
+    // fetch by sha1 what we don't already have from advertised.
    //
    // even if refs would change after ls-remote but before here, we should be
    // getting exactly what was advertised.
@@ -636,14 +709,14 @@ func fetch(repo string) (refv []Ref, err error) {
        " upload-pack",

        repo)
-    for _, ref := range refv {
+    for _, ref := range fetchv {
        arg(ref.sha1)
    }
    arg(RunWith{stderr: gitprogress()})

    gerr, _, _ := ggit(argv...)
    if gerr != nil {
-        return nil, gerr
+        return nil, nil, gerr
    }

    // fetch-pack ran ok - now check that all fetched tips are indeed fully
@@ -660,18 +733,18 @@ func fetch(repo string) (refv []Ref, err error) {
    // https://git.kernel.org/pub/scm/git/git.git/commit/?h=6d4bb3833c
    argv = nil
    arg("rev-list", "--quiet", "--objects", "--not", "--all", "--not")
-    for _, ref := range refv {
+    for _, ref := range fetchv {
        arg(ref.sha1)
    }
    arg(RunWith{stderr: gitprogress()})

    gerr, _, _ = ggit(argv...)
    if gerr != nil {
-        return nil, fmt.Errorf("remote did not send all neccessary objects")
+        return nil, nil, fmt.Errorf("remote did not send all neccessary objects")
    }

    // fetched ok
-    return refv, nil
+    return refv, fetchv, nil
 }

 // lsremote lists all references advertised by repo.

--- a/git-backup_test.go
+++ b/git-backup_test.go
@@ -160,44 +160,63 @@ func TestPullRestore(t *testing.T) {
        }
    }

-    // verify no garbage is left under refs/backup/
-    dentryv, err := ioutil.ReadDir("refs/backup/")
-    if err != nil && !os.IsNotExist(err) {
-        t.Fatal(err)
-    }
-    if len(dentryv) != 0 {
-        namev := []string{}
-        for _, fi := range dentryv {
-            namev = append(namev, fi.Name())
+    // checks / cleanups after cmd_pull
+    afterPull := func() {
+        // verify no garbage is left under refs/backup/
+        dentryv, err := ioutil.ReadDir("refs/backup/")
+        if err != nil && !os.IsNotExist(err) {
+            t.Fatal(err)
+        }
+        if len(dentryv) != 0 {
+            namev := []string{}
+            for _, fi := range dentryv {
+                namev = append(namev, fi.Name())
+            }
+            t.Fatalf("refs/backup/ not empty after pull: %v", namev)
        }
-        t.Fatalf("refs/backup/ not empty after pull: %v", namev)
-    }

-    // prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
-    xgit("prune")
+        // prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
+        xgit("prune")

-    // verify backup repo is all ok
-    xgit("fsck")
+        // verify backup repo is all ok
+        xgit("fsck")

-    // verify that just pulled tag objects are now gone after pruning -
-    // - they become not directly git-present. The only possibility to
-    // get them back is via recreating from encoded commit objects.
-    for _, nc := range noncommitv {
-        if !nc.istag {
-            continue
+        // verify that just pulled tag objects are now gone after pruning -
+        // - they become not directly git-present. The only possibility to
+        // get them back is via recreating from encoded commit objects.
+        for _, nc := range noncommitv {
+            if !nc.istag {
+                continue
+            }
+            gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
+            if gerr == nil {
+                t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
+            }
        }
-        gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
-        if gerr == nil {
-            t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
+
+        // reopen backup repository - to avoid having stale cache with present
+        // objects we deleted above with `git prune`
+        gb, err = git.OpenRepository(".")
+        if err != nil {
+            t.Fatal(err)
        }
    }

-    // reopen backup repository - to avoid having stale cache with present
-    // objects we deleted above with `git prune`
-    gb, err = git.OpenRepository(".")
-    if err != nil {
-        t.Fatal(err)
+    afterPull()
+
+    // pull again - it should be noop
+    h1 := xgitSha1("rev-parse", "HEAD")
+    cmd_pull(gb, []string{my1+":b1"})
+    afterPull()
+    h2 := xgitSha1("rev-parse", "HEAD")
+    if h1 == h2 {
+        t.Fatal("pull: second run did not adjust HEAD")
    }
+    δ12 := xgit("diff", h1, h2)
+    if δ12 != "" {
+        t.Fatalf("pull: second run was not noop: δ:\n%s", δ12)
+    }
+

    // restore backup
    work1 := workdir + "/1"