Commit 899103bf by Kirill Smelkov

pull: Switch from porcelain `git fetch` to plumbing `git fetch-pack` + friends

On lab.nexedi.com `git-backup pull` became slow, and most of the slowness
was tracked down to the following:

`git fetch`, before fetching data from remote repository, first checks
whether it already locally has all the objects remote advertises. This
boils down to running

	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all

and checking whether it succeeds or not:

	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8

The "--not --all" in the query means that objects should be not
reachable from all locally existing refs and is implemented by linearly
scanning from tip of those existing refs and marking objects reachable
from there as "do not print".

In case of git-backup, where we have mostly master which is super commit
merging from whole histories of all projects and from backup history,
linearly scanning from such a tip goes through lots of commits. Up to
the point where fetching a small, outdated repository, which was already
pulled into backup and did not changed since long, takes more than 30
seconds with almost 100% of that time being spent in quickfetch() only.

The solution will be to optimize checking whether we already have all the
remote objects and to not repeat whole backup-repo scanning for every
pulled repository. This will be done via first querying through `git
ls-remote` what tips remote repository has, then checking on
git-backup specific index which tips we already have and then fetching
only the rest. This way we are essentially moving most of quickfetch
phase of git into git-backup.

Since we'll be tailing to git to fetch only some of the remote refs, we
will either have to amend ourselves the refs `git fetch` creates after
fetching, or to not rely on `git fetch` creating any refs at all. Since
we already have a long standing issue that many many refs that are
coming live after `git fetch` slow down further git fetches

https://lab.nexedi.com/kirr/git-backup/blob/0ab7bbb6/git-backup.go#L551

the longer term plan will be not to create unneeded references.
Since 2 forks could have references covering the same commits, we would
either have to compare references created after git-fetch and deduplicate
them or manage references creation ourselves.

It is also generally better to split `git fetch` into steps at plumbing
layer, because after doing so, we can have the chance to optimize or
tweak any of the steps at our side with knowing full git-backup context
and indices.

This commit only switches from using `git fetch` to its plumbing
counterpart `git fetch-pack` + friends + manually creating fetched refs
the way `git fetch` used to do exactly. There should be neither
functionality changed nor any speedup.

Further commits will start to take advantage of the switch and optimize
`git-backup pull`.
1 parent 350a01f9
......@@ -435,15 +435,17 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
// git repo - let's pull all refs from it to our backup refs namespace
infof("# git %s\t<- %s", prefix, path)
// NOTE --no-tags : do not try to autoextend commit -> covering tag
// NOTE fetch.fsckObjects=true : check objects for corruption as they are fetched
xgit("-c", "fetch.fsckObjects=true",
"fetch", "--no-tags", path,
fmt.Sprintf("refs/*:%s%s/*", backup_refs_work,
// NOTE repo name is escaped as it can contain e.g. spaces, and refs must not
path_refescape(reprefix(dir, prefix, path))),
// TODO do not show which ref we pulled - show only pack transfer progress
RunWith{stderr: gitprogress()})
refv, err := fetch(path)
exc.Raiseif(err)
reporefprefix := backup_refs_work +
// NOTE repo name is escaped as it can contain e.g. spaces, and refs must not
path_refescape(reprefix(dir, prefix, path))
for _, ref := range refv {
err = mkref(gb, reporefprefix + "/" + ref.name, ref.sha1)
exc.Raiseif(err)
}
// XXX do we want to do full fsck of source git repo on pull as well ?
......@@ -587,6 +589,132 @@ func cmd_pull_(gb *git.Repository, pullspecv []PullSpec) {
xgit("update-ref", "-d", backup_lock)
}
// fetch makes sure all objects from a repository are present in backup place.
//
// It fetches objects that are potentially missing in backup from the
// repository in question. The objects considered to fetch are those, that are
// reachable from all repository references.
//
// Returned is list of all references in source repository.
//
// Note: fetch does not create any local references - the references returned
// only describe state of references in fetched source repository.
func fetch(repo string) (refv []Ref, err error) {
defer xerr.Contextf(&err, "fetch %s", repo)
// first check which references are advertised
refv, err = lsremote(repo)
if err != nil {
return nil, err
}
// if there is nothing to fetch - we are done
if len(refv) == 0 {
return refv, nil
}
// fetch all those advertised objects by sha1.
//
// even if refs would change after ls-remote but before here, we should be
// getting exactly what was advertised.
//
// related link on the subject:
// https://git.kernel.org/pub/scm/git/git.git/commit/?h=051e4005a3
var argv []interface{}
arg := func(v ...interface{}) { argv = append(argv, v...) }
arg(
// check objects for corruption as they are fetched
"-c", "fetch.fsckObjects=true",
"fetch-pack", "--thin",
// force upload-pack to allow us asking any sha1 we want.
// needed because advertised refs we got at lsremote time could have changed.
"--upload-pack=git -c uploadpack.allowAnySHA1InWant=true" +
// workarounds for git < 2.11.1, which does not have uploadpack.allowAnySHA1InWant:
" -c uploadpack.allowTipSHA1InWant=true -c uploadpack.allowReachableSHA1InWant=true" +
//
" upload-pack",
repo)
for _, ref := range refv {
arg(ref.sha1)
}
arg(RunWith{stderr: gitprogress()})
gerr, _, _ := ggit(argv...)
if gerr != nil {
return nil, gerr
}
// fetch-pack ran ok - now check that all fetched tips are indeed fully
// connected and that we also have all referenced blob/tree objects. The
// reason for this check is that source repository could send us a pack with
// e.g. some objects missing and this way even if fetch-pack would report
// success, chances could be we won't have all the objects we think we
// fetched.
//
// when checking we assume that the roots we already have at all our
// references are ok.
//
// related link on the subject:
// https://git.kernel.org/pub/scm/git/git.git/commit/?h=6d4bb3833c
argv = nil
arg("rev-list", "--quiet", "--objects", "--not", "--all", "--not")
for _, ref := range refv {
arg(ref.sha1)
}
arg(RunWith{stderr: gitprogress()})
gerr, _, _ = ggit(argv...)
if gerr != nil {
return nil, fmt.Errorf("remote did not send all neccessary objects")
}
// fetched ok
return refv, nil
}
// lsremote lists all references advertised by repo.
func lsremote(repo string) (refv []Ref, err error) {
defer xerr.Contextf(&err, "lsremote %s", repo)
// NOTE --refs instructs to omit peeled refs like
//
// c668db59ccc59e97ce81f769d9f4633e27ad3bdb refs/tags/v0.1
// 4b6821f4a4e4c9648941120ccbab03982e33104f refs/tags/v0.1^{} <--
//
// because fetch-pack errors on them:
//
// https://public-inbox.org/git/20180610143231.7131-1-kirr@nexedi.com/
//
// we don't need to pull them anyway.
gerr, stdout, _ := ggit("ls-remote", "--refs", repo)
if gerr != nil {
return nil, gerr
}
// oid refname
// oid refname
// ...
for _, entry := range xstrings.SplitLines(stdout, "\n") {
sha1, ref := Sha1{}, ""
_, err := fmt.Sscanf(entry, "%s %s\n", &sha1, &ref)
if err != nil {
return nil, fmt.Errorf("strange output entry: %q", entry)
}
// Ref says its name goes without "refs/" prefix.
if !strings.HasPrefix(ref, "refs/") {
return nil, fmt.Errorf("non-refs/ reference: %q", ref)
}
ref = strings.TrimPrefix(ref, "refs/")
refv = append(refv, Ref{ref, sha1})
}
return refv, nil
}
// -------- git-backup restore --------
func cmd_restore_usage() {
......
......@@ -272,10 +272,69 @@ func TestPullRestore(t *testing.T) {
func() {
defer exc.Catch(func(e *exc.Error) {
// it ok - pull should raise
// git-backup leaves backup repo locked on error
xgit("update-ref", "-d", "refs/backup.locked")
})
cmd_pull(gb, []string{my2+":b2"})
t.Fatal("fetching from corrupt.git did not complain")
t.Fatal("pull corrupt.git: did not complain")
}()
// now try to pull repo where `git pack-objects` misbehaves
my3 := mydir + "/testdata/3"
checkIncompletePack := func(kind, errExpect string) {
defer exc.Catch(func(e *exc.Error) {
estr := e.Error()
bad := ""
badf := func(format string, argv ...interface{}) {
bad += fmt.Sprintf(format+"\n", argv...)
}
if !strings.Contains(estr, errExpect) {
badf("- no %q", errExpect)
}
if bad != "" {
t.Fatalf("pull incomplete-send-pack.git/%s: complained, but error is wrong:\n%s\nerror: %s", kind, bad, estr)
}
// git-backup leaves backup repo locked on error
xgit("update-ref", "-d", "refs/backup.locked")
})
// for incomplete-send-pack.git to indeed send incomplete pack, its git
// config has to be activated via tweaked $HOME.
home, ok := os.LookupEnv("HOME")
defer func() {
if ok {
err = os.Setenv("HOME", home)
} else {
err = os.Unsetenv("HOME")
}
exc.Raiseif(err)
}()
err = os.Setenv("HOME", my3+"/incomplete-send-pack.git/"+kind)
exc.Raiseif(err)
cmd_pull(gb, []string{my3+":b3"})
t.Fatalf("pull incomplete-send-pack.git/%s: did not complain", kind)
}
// missing blob: should be caught by git itself, because unpack-objects
// performs full reachability checks of fetched tips.
checkIncompletePack("x-missing-blob", "fatal: unpack-objects")
// missing commit: remote sends a pack that is closed under reachability,
// but it has objects starting from only parent of requested tip. This way
// e.g. commit at tip itself is not sent and the fact that it is missing in
// the pack is not caught by fetch-pack. git-backup has to detect the
// problem itself.
checkIncompletePack("x-commit-send-parent", "remote did not send all neccessary objects")
// pulling incomplete-send-pack.git without pack-objects hook must succeed:
// without $HOME tweaks full and complete pack is sent.
cmd_pull(gb, []string{my3+":b3"})
}
func TestRepoRefSplit(t *testing.T) {
......
......@@ -221,6 +221,14 @@ func getDefaultIdent(g *git.Repository) AuthorInfo {
return ident
}
// mkref creates a git reference.
//
// it is an error if the reference already exists.
func mkref(g *git.Repository, name string, sha1 Sha1) error {
_, err := g.References.Create(name, sha1.AsOid(), false, "")
return err
}
// `git commit-tree` -> commit_sha1, raise on error
func xcommit_tree2(g *git.Repository, tree Sha1, parents []Sha1, msg string, author AuthorInfo, committer AuthorInfo) Sha1 {
ident := getDefaultIdent(g)
......
Unnamed repository; edit this file 'description' to name the repository.
This repository contains object with corrupt data.
See objects/corruptit.py for details.
[core]
repositoryformatversion = 0
filemode = true
bare = true
This repository is not corrupt. However it can be configured to force `git
pack-objects` to send valid, but incomplete pack(*). This is needed to test that
a fetcher really verifies whether it got complete pack after fetching from a
repository.
There are 2 scenarios to prepare incomplete pack:
1. x-missing-blob: drop a blob object from the pack,
2. x-commit-send-parent: generate a pack starting from only a parent of requested tip.
To activate a scenario one have to export HOME=<this-repository>/<scenario>
See x-missing-blob/ and x-commit-send-parent/ for details.
See also https://git.kernel.org/pub/scm/git/git.git/commit/?h=6d4bb3833c for related check in `git fetch`.
----
(*) git pack-objects is adjusted at runtime via uploadpack.packObjectsHook:
https://git.kernel.org/pub/scm/git/git.git/commit/?h=20b20a22f8.
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~
46094318d7ea2dab446294556097361409ca1e84
[uploadpack]
packObjectsHook = ./x-commit-send-parent/hook-pack-objects
#!/bin/sh -e
# hook for `git pack-objects` to force it to send refs starting from parent of requested commit.
echo "I: x-commit-send-parent/hook-pack-object is running ..." >&2
# filter to real `git pack-objects` input sha1 stream to be sha1~.
while read oid ; do
case "$oid" in
--*|"")
echo "$oid" # e.g. "--not" or empty line - leave as is
;;
*)
git rev-parse $oid~ # oid -> oid~
;;
esac
done | "$@"
[uploadpack]
packObjectsHook = ./x-missing-blob/hook-pack-objects
#!/bin/sh -e
# hook for `git pack-objects` to force it to omit 1 blob from this repository.
echo "I: x-missing-blob/hook-pack-object is running ..." >&2
# tell real `git pack-objects` to omit blobs larger then 20 bytes.
# this should keep hello.txt (12 bytes), but filter-out will-not-be-sent.txt (25 bytes).
exec "$@" --filter=blob:limit=20
Styling with Markdown is supported
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!