...
 
Commits (9)
  • *: Minimal fixes so that program documentation renders under godoc properly · 7f349cd9
    - remove blank line between main description and package clause, so that
      the main description is understood as such;
    - move notes describing what a file does after package clause, so that
      those notes do not get mixed into program description under godoc.
    Kirill Smelkov committed
  • *: Handle Git object types as git.ObjectType instead of string · cbfa78d2
    Kirill Smelkov committed
  • restore: Use bitmap index from backup repo, if present · 0ab7bbb6
    This way, if backup repository was freshly repacked with bitmap index
    generation turned on, we can get ~ 30% - 50% speedup for a typical
    erp5.git pack extraction.
    
    "--use-bitmap-index" option was added to git in v2.0, but was only
    active for to-stdout packs generation. It was enabled for to-file packs
    generation in git v2.11.
    
    Since git v2.0 was released in 2014 - 4 years ago - I'm not adding
    runtime detection of "--use-bitmap-index" availability.
    
    See https://git.kernel.org/pub/scm/git/git.git/commit/?h=645c432d61 for
    details.
    Kirill Smelkov committed
  • restore: Show details when extracted repo refs were found corrupt · 23e07d70
    Noticed this while changing how pull works and accidentally introducing
    an error there by leaving an extra "refs/" prefix. With that error, before
    this patch the tests show:
    
            git-backup_test.go:91: git-backup_test.go:204: lab.nexedi.com/kirr/git-backup.cmd_restore: 2 errors:
    			- E: extracted /tmp/t-git-backup981909377/1/dir 2 + β/repo with+fragile name %αβγ.git refs corrupt:
    			- E: extracted /tmp/t-git-backup981909377/1/dir/hello.git refs corrupt:
    
    with the patch the tests report:
    
            git-backup_test.go:91: git-backup_test.go:204: lab.nexedi.com/kirr/git-backup.cmd_restore: 2 errors:
                            - E: extracted /tmp/t-git-backup981909377/1/dir 2 + β/repo with+fragile name %αβγ.git refs corrupt:
    
                    want:
                    cbb6d3f205749888f77fb1a88fbac3b8a0b8000f refs/refs/heads/master
    
                    have:
                    cbb6d3f205749888f77fb1a88fbac3b8a0b8000f refs/heads/master
                            - E: extracted /tmp/t-git-backup981909377/1/dir/hello.git refs corrupt:
    
                    want:
                    647e137fd3b31939b36889eba854a298ef97b6ff refs/refs/heads/branch2
                    feeed96ca75fcf8dcf183008f61dbf72e91ab4de refs/refs/heads/master
                    11e67095628aa17b03436850e690faea3006c25d refs/refs/tags/tag-to-blob
                    f735011c9fcece41219729a33f7876cd8791f659 refs/refs/tags/tag-to-commit
                    7124713e403925bc772cd252b0dec099f3ced9c5 refs/refs/tags/tag-to-tag
                    ba899e5639273a6fa4d50d684af8db1ae070351e refs/refs/tags/tag-to-tree
                    7a3343f584218e973165d943d7c0af47a52ca477 refs/refs/test/ref-to-blob
                    61882eb85774ed4401681d800bb9c638031375e2 refs/refs/test/ref-to-tree
    
                    have:
                    647e137fd3b31939b36889eba854a298ef97b6ff refs/heads/branch2
                    feeed96ca75fcf8dcf183008f61dbf72e91ab4de refs/heads/master
                    11e67095628aa17b03436850e690faea3006c25d refs/tags/tag-to-blob
                    f735011c9fcece41219729a33f7876cd8791f659 refs/tags/tag-to-commit
                    7124713e403925bc772cd252b0dec099f3ced9c5 refs/tags/tag-to-tag
                    ba899e5639273a6fa4d50d684af8db1ae070351e refs/tags/tag-to-tree
                    7a3343f584218e973165d943d7c0af47a52ca477 refs/test/ref-to-blob
                    61882eb85774ed4401681d800bb9c638031375e2 refs/test/ref-to-tree
    
    It should be good to have these details if something really breaks after restore.
    Kirill Smelkov committed
  • Clarify git Ref* types a bit · 350a01f9
    - tell that reference name always goes without "refs/" prefix
    - use .name for reference name, not .ref: this way
    
    	ref.name
    
      is more readable than
    
    	ref.ref
    
      and so there is less need to use for __ in range loops.
    Kirill Smelkov committed
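
To make the naming convention above concrete, here is a small sketch of a reference type whose `name` field is stored without the "refs/" prefix. The `Ref` layout, `fullName` and `parseRef` are hypothetical and only illustrate the convention; they are not copied from git-backup.

```go
package main

import (
	"fmt"
	"strings"
)

// Ref is a hypothetical stand-in for a git-backup reference type.
type Ref struct {
	name string // reference name without "refs/" prefix, e.g. "heads/master"
	sha1 string
}

// fullName returns the name the way git itself spells it.
func (r Ref) fullName() string { return "refs/" + r.name }

// parseRef builds a Ref from a full git reference name.
// The prefix is stripped exactly once, at the boundary.
func parseRef(sha1, full string) (Ref, error) {
	if !strings.HasPrefix(full, "refs/") {
		return Ref{}, fmt.Errorf("%q: not under refs/", full)
	}
	return Ref{name: strings.TrimPrefix(full, "refs/"), sha1: sha1}, nil
}

func main() {
	ref, err := parseRef("cbb6d3f205749888f77fb1a88fbac3b8a0b8000f", "refs/heads/master")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref.name)
	fmt.Println(ref.fullName())
}
```

Keeping the prefix out of the stored name means there is a single place where "refs/" is added back, which is one way to avoid a doubled "refs/refs/..." name.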
  • pull: Switch from porcelain `git fetch` to plumbing `git fetch-pack` + friends · 899103bf
    On lab.nexedi.com `git-backup pull` became slow, and most of the slowness
    was tracked down to the following:
    
    `git fetch`, before fetching data from remote repository, first checks
    whether it already locally has all the objects remote advertises. This
    boils down to running
    
    	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all
    
    and checking whether it succeeds or not:
    
    	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8
    
    The "--not --all" in the query means that objects should not be
    reachable from any locally existing ref and is implemented by linearly
    scanning from tip of those existing refs and marking objects reachable
    from there as "do not print".
    
    In case of git-backup, where we mostly have master, a super commit that
    merges the whole histories of all projects and the backup history itself,
    linearly scanning from such a tip goes through lots of commits. Up to
    the point where fetching a small, outdated repository, which was already
    pulled into backup and has not changed for a long time, takes more than 30
    seconds with almost 100% of that time being spent in quickfetch() only.
    
    The solution will be to optimize checking whether we already have all the
    remote objects and to not repeat whole backup-repo scanning for every
    pulled repository. This will be done via first querying through `git
    ls-remote` what tips remote repository has, then checking on
    git-backup specific index which tips we already have and then fetching
    only the rest. This way we are essentially moving most of quickfetch
    phase of git into git-backup.
    
    Since we'll be telling git to fetch only some of the remote refs, we
    will either have to amend the refs `git fetch` creates after fetching
    ourselves, or not rely on `git fetch` creating any refs at all. Since
    we already have a long-standing issue that the many refs that
    come to life after `git fetch` slow down further git fetches
    
    https://lab.nexedi.com/kirr/git-backup/blob/0ab7bbb6/git-backup.go#L551
    
    the longer term plan will be not to create unneeded references.
    Since 2 forks could have references covering the same commits, we would
    either have to compare references created after git-fetch and deduplicate
    them or manage references creation ourselves.
    
    It is also generally better to split `git fetch` into steps at plumbing
    layer, because after doing so, we can have the chance to optimize or
    tweak any of the steps at our side with knowing full git-backup context
    and indices.
    
    This commit only switches from using `git fetch` to its plumbing
    counterpart `git fetch-pack` + friends + manually creating fetched refs
    exactly the way `git fetch` used to do. There should be neither
    functional change nor any speedup.
    
    Further commits will start to take advantage of the switch and optimize
    `git-backup pull`.
    Kirill Smelkov committed
  • Factor out backup.refs loading code from restore · 1be6aaaa
    In the next patch we will need to load backup.refs at the beginning of
    pull too. The factored-out function was changed to return a regular error
    instead of raising an exception (which will be the general plan from now on).
    Kirill Smelkov committed
  • pull: Speedup fetching by prebuilding index of objects we already have at start · 3efed898
    As was already said in 899103bf (pull: Switch from porcelain `git
    fetch` to plumbing `git fetch-pack` + friends), on lab.nexedi.com
    `git-backup pull` became slow and most of the slowness was tracked down
    to the fact that `git fetch`, for every pulled repository, does a linear
    scan of the whole backup repository history just to find out that there is
    usually nothing to fetch. Quoting 899103bf:
    
    """
        `git fetch`, before fetching data from remote repository, first checks
        whether it already locally has all the objects remote advertises. This
        boils down to running
    
    	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all
    
        and checking whether it succeeds or not:
    
    	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8
    
        The "--not --all" in the query means that objects should not be
        reachable from any locally existing ref and is implemented by linearly
        scanning from tip of those existing refs and marking objects reachable
        from there as "do not print".
    
        In case of git-backup, where we mostly have master, a super commit that
        merges the whole histories of all projects and the backup history itself,
        linearly scanning from such a tip goes through lots of commits. Up to
        the point where fetching a small, outdated repository, which was already
        pulled into backup and has not changed for a long time, takes more than 30
        seconds with almost 100% of that time being spent in quickfetch() only.
    """
    
    The solution is to build an index of objects we already have ourselves,
    only once at startup, and then in fetch, after checking lsremote output, consult
    that index, and if we see we already have everything for an advertised
    reference - just avoid giving it to fetch-pack to process. It turns out that for
    many pulled repositories no references changed at all and this way
    fetch-pack can be skipped completely. This leads to a dramatic speedup: before,
    `gitlab-backup pull` was taking ~ 2 hours, and now it takes under ~ 5 minutes.
    
    The index building itself takes ~ 30 seconds - the time which we were
    previously spending to fetch just from 1 unchanged repository. The index size
    is small and so it all can be kept in RAM - please see details in the code
    comments on this.
    
    I initially wanted to speedup fetching by teaching `git fetch-objects` to
    consult backup repo bitmap reachability index (if, for a commit, we can see
    that there is an entry in this index -> we know we already have all reachable
    objects for this commit and can skip fetching). This won't however work
    fully for all our refs - 40% of them are mostly tags, and since in the backup
    repository we don't keep tag objects - we keep tags/tree/blobs encoded as
    commits - sha1 of those 40% references to tags won't be in bitmap index.
    
    So just do the indexing ourselves.
    Kirill Smelkov committed
......@@ -60,6 +60,15 @@ func XSha1(s string) Sha1 {
return sha1
}
func xgittype(s string) git.ObjectType {
type_, ok := gittype(s)
if !ok {
exc.Raisef("unknown git type %q", s)
}
return type_
}
// verify end-to-end pull-restore
func TestPullRestore(t *testing.T) {
// if something raises -> don't let testing panic - report it as proper error with context.
......@@ -145,51 +154,70 @@ func TestPullRestore(t *testing.T) {
// encoding original object should give sha1_
obj_type := xgit("cat-file", "-t", nc.sha1)
sha1_ := obj_represent_as_commit(gb, nc.sha1, obj_type)
sha1_ := obj_represent_as_commit(gb, nc.sha1, xgittype(obj_type))
if sha1_ != nc.sha1_ {
t.Fatalf("encode %s -> %s ; want %s", sha1, sha1_, nc.sha1_)
}
}
// verify no garbage is left under refs/backup/
dentryv, err := ioutil.ReadDir("refs/backup/")
if err != nil && !os.IsNotExist(err) {
t.Fatal(err)
}
if len(dentryv) != 0 {
namev := []string{}
for _, fi := range dentryv {
namev = append(namev, fi.Name())
// checks / cleanups after cmd_pull
afterPull := func() {
// verify no garbage is left under refs/backup/
dentryv, err := ioutil.ReadDir("refs/backup/")
if err != nil && !os.IsNotExist(err) {
t.Fatal(err)
}
if len(dentryv) != 0 {
namev := []string{}
for _, fi := range dentryv {
namev = append(namev, fi.Name())
}
t.Fatalf("refs/backup/ not empty after pull: %v", namev)
}
t.Fatalf("refs/backup/ not empty after pull: %v", namev)
}
// prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
xgit("prune")
// prune all non-reachable objects (e.g. tags just pulled - they were encoded as commits)
xgit("prune")
// verify backup repo is all ok
xgit("fsck")
// verify backup repo is all ok
xgit("fsck")
// verify that just pulled tag objects are now gone after pruning -
// - they become not directly git-present. The only possibility to
// get them back is via recreating from encoded commit objects.
for _, nc := range noncommitv {
if !nc.istag {
continue
// verify that just pulled tag objects are now gone after pruning -
// - they become not directly git-present. The only possibility to
// get them back is via recreating from encoded commit objects.
for _, nc := range noncommitv {
if !nc.istag {
continue
}
gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
if gerr == nil {
t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
}
}
gerr, _, _ := ggit("cat-file", "-p", nc.sha1)
if gerr == nil {
t.Fatalf("tag %s still present in backup.git after git-prune", nc.sha1)
// reopen backup repository - to avoid having stale cache with present
// objects we deleted above with `git prune`
gb, err = git.OpenRepository(".")
if err != nil {
t.Fatal(err)
}
}
// reopen backup repository - to avoid having stale cache with present
// objects we deleted above with `git prune`
gb, err = git.OpenRepository(".")
if err != nil {
t.Fatal(err)
afterPull()
// pull again - it should be noop
h1 := xgitSha1("rev-parse", "HEAD")
cmd_pull(gb, []string{my1+":b1"})
afterPull()
h2 := xgitSha1("rev-parse", "HEAD")
if h1 == h2 {
t.Fatal("pull: second run did not adjust HEAD")
}
δ12 := xgit("diff", h1, h2)
if δ12 != "" {
t.Fatalf("pull: second run was not noop: δ:\n%s", δ12)
}
// restore backup
work1 := workdir + "/1"
cmd_restore(gb, []string{"HEAD", "b1:"+work1})
......@@ -263,10 +291,69 @@ func TestPullRestore(t *testing.T) {
func() {
defer exc.Catch(func(e *exc.Error) {
// it ok - pull should raise
// git-backup leaves backup repo locked on error
xgit("update-ref", "-d", "refs/backup.locked")
})
cmd_pull(gb, []string{my2+":b2"})
t.Fatal("fetching from corrupt.git did not complain")
t.Fatal("pull corrupt.git: did not complain")
}()
// now try to pull repo where `git pack-objects` misbehaves
my3 := mydir + "/testdata/3"
checkIncompletePack := func(kind, errExpect string) {
defer exc.Catch(func(e *exc.Error) {
estr := e.Error()
bad := ""
badf := func(format string, argv ...interface{}) {
bad += fmt.Sprintf(format+"\n", argv...)
}
if !strings.Contains(estr, errExpect) {
badf("- no %q", errExpect)
}
if bad != "" {
t.Fatalf("pull incomplete-send-pack.git/%s: complained, but error is wrong:\n%s\nerror: %s", kind, bad, estr)
}
// git-backup leaves backup repo locked on error
xgit("update-ref", "-d", "refs/backup.locked")
})
// for incomplete-send-pack.git to indeed send incomplete pack, its git
// config has to be activated via tweaked $HOME.
home, ok := os.LookupEnv("HOME")
defer func() {
if ok {
err = os.Setenv("HOME", home)
} else {
err = os.Unsetenv("HOME")
}
exc.Raiseif(err)
}()
err = os.Setenv("HOME", my3+"/incomplete-send-pack.git/"+kind)
exc.Raiseif(err)
cmd_pull(gb, []string{my3+":b3"})
t.Fatalf("pull incomplete-send-pack.git/%s: did not complain", kind)
}
// missing blob: should be caught by git itself, because unpack-objects
// performs full reachability checks of fetched tips.
checkIncompletePack("x-missing-blob", "fatal: unpack-objects")
// missing commit: remote sends a pack that is closed under reachability,
// but it has objects starting from only parent of requested tip. This way
// e.g. commit at tip itself is not sent and the fact that it is missing in
// the pack is not caught by fetch-pack. git-backup has to detect the
// problem itself.
checkIncompletePack("x-commit-send-parent", "remote did not send all neccessary objects")
// pulling incomplete-send-pack.git without pack-objects hook must succeed:
// without $HOME tweaks full and complete pack is sent.
cmd_pull(gb, []string{my3+":b3"})
}
func TestRepoRefSplit(t *testing.T) {
......
......@@ -17,8 +17,8 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Git-backup | Run git subprocess
package main
// Git-backup | Run git subprocess
import (
"bytes"
......
......@@ -17,8 +17,8 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Git-backup | Git object: Blob Tree Commit Tag
package main
// Git-backup | Git object: Blob Tree Commit Tag
import (
"errors"
......@@ -91,9 +91,8 @@ func (e *UnexpectedObjType) Error() string {
}
type Tag struct {
tagged_type string
tagged_type git.ObjectType
tagged_sha1 Sha1
// TODO msg
}
......@@ -127,10 +126,16 @@ func (e *TagLoadError) Error() string {
func tag_parse(tag_raw string) (*Tag, error) {
t := Tag{}
_, err := fmt.Sscanf(tag_raw, "object %s\ntype %s\n", &t.tagged_sha1, &t.tagged_type)
tagged_type := ""
_, err := fmt.Sscanf(tag_raw, "object %s\ntype %s\n", &t.tagged_sha1, &tagged_type)
if err != nil {
return nil, errors.New("invalid header")
}
var ok bool
t.tagged_type, ok = gittype(tagged_type)
if !ok {
return nil, fmt.Errorf("invalid tagged type %q", tagged_type)
}
return &t, nil
}
......@@ -216,6 +221,14 @@ func getDefaultIdent(g *git.Repository) AuthorInfo {
return ident
}
// mkref creates a git reference.
//
// it is an error if the reference already exists.
func mkref(g *git.Repository, name string, sha1 Sha1) error {
_, err := g.References.Create(name, sha1.AsOid(), false, "")
return err
}
// `git commit-tree` -> commit_sha1, raise on error
func xcommit_tree2(g *git.Repository, tree Sha1, parents []Sha1, msg string, author AuthorInfo, committer AuthorInfo) Sha1 {
ident := getDefaultIdent(g)
......@@ -244,3 +257,37 @@ func xcommit_tree2(g *git.Repository, tree Sha1, parents []Sha1, msg string, aut
func xcommit_tree(g *git.Repository, tree Sha1, parents []Sha1, msg string) Sha1 {
return xcommit_tree2(g, tree, parents, msg, AuthorInfo{}, AuthorInfo{})
}
// gittype converts string to git.ObjectType.
//
// Only valid concrete git types are converted successfully.
func gittype(typ string) (git.ObjectType, bool) {
switch typ {
case "commit": return git.ObjectCommit, true
case "tree": return git.ObjectTree, true
case "blob": return git.ObjectBlob, true
case "tag": return git.ObjectTag, true
}
return git.ObjectBad, false
}
// gittypestr converts git.ObjectType to string.
//
// We depend on this conversion being exact and matching how Git encodes it in
// objects. git.ObjectType.String() is different (e.g. "Blob" instead of
// "blob"), and can potentially change over time.
//
// gittypestr expects the type to be valid and concrete - else it panics.
func gittypestr(typ git.ObjectType) string {
switch typ {
case git.ObjectCommit: return "commit"
case git.ObjectTree: return "tree"
case git.ObjectBlob: return "blob"
case git.ObjectTag: return "tag"
}
panic(fmt.Sprintf("git type %#v invalid", typ))
}
......@@ -17,9 +17,9 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
package main
// Git-backup | Set "template" type
// TODO -> go:generate + template
package main
// Set<Sha1>
type Sha1Set map[Sha1]struct{}
......
......@@ -17,8 +17,8 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Git-backup | Sha1 type to work with SHA1 oids
package main
// Git-backup | Sha1 type to work with SHA1 oids
import (
"bytes"
......
Unnamed repository; edit this file 'description' to name the repository.
This repository contains object with corrupt data.
See objects/corruptit.py for details.
[core]
repositoryformatversion = 0
filemode = true
bare = true
This repository is not corrupt. However it can be configured to force `git
pack-objects` to send valid, but incomplete pack(*). This is needed to test that
a fetcher really verifies whether it got complete pack after fetching from a
repository.
There are 2 scenarios to prepare incomplete pack:
1. x-missing-blob: drop a blob object from the pack,
2. x-commit-send-parent: generate a pack starting from only a parent of requested tip.
To activate a scenario one has to export HOME=<this-repository>/<scenario>
See x-missing-blob/ and x-commit-send-parent/ for details.
See also https://git.kernel.org/pub/scm/git/git.git/commit/?h=6d4bb3833c for related check in `git fetch`.
----
(*) git pack-objects is adjusted at runtime via uploadpack.packObjectsHook:
https://git.kernel.org/pub/scm/git/git.git/commit/?h=20b20a22f8.
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~
46094318d7ea2dab446294556097361409ca1e84
[uploadpack]
packObjectsHook = ./x-commit-send-parent/hook-pack-objects
#!/bin/sh -e
# hook for `git pack-objects` to force it to send refs starting from parent of requested commit.
echo "I: x-commit-send-parent/hook-pack-object is running ..." >&2
# filter the sha1 stream fed to real `git pack-objects` so that each sha1 becomes sha1~.
while read oid ; do
case "$oid" in
--*|"")
echo "$oid" # e.g. "--not" or empty line - leave as is
;;
*)
git rev-parse $oid~ # oid -> oid~
;;
esac
done | "$@"
[uploadpack]
packObjectsHook = ./x-missing-blob/hook-pack-objects
#!/bin/sh -e
# hook for `git pack-objects` to force it to omit 1 blob from this repository.
echo "I: x-missing-blob/hook-pack-object is running ..." >&2
# tell real `git pack-objects` to omit blobs larger than 20 bytes.
# this should keep hello.txt (12 bytes), but filter-out will-not-be-sent.txt (25 bytes).
exec "$@" --filter=blob:limit=20
......@@ -17,8 +17,8 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Git-backup | Miscellaneous utilities
package main
// Git-backup | Miscellaneous utilities
import (
"encoding/hex"
......