- 01 Aug, 2016 2 commits
-
-
Kirill Smelkov authored
This allows us to leverage multiple CPUs on a system for pack extractions, which are computation-heavy operations. The approach is more-or-less classical:

- the main worker prepares requests for pack extraction jobs;
- there are multiple pack-extraction workers, which read requests from the jobs queue and perform them;
- at the end we wait for everything to stop, collect errors, and optionally signal the whole thing to cancel if we see an error coming (it is only a signal, and we still have to wait for everything to stop).

The default number of workers is N(CPU) on the system, because we spawn a separate `git pack-objects ...` for every request.

We also now explicitly limit the number of CPUs each `git pack-objects ...` can use to 1. This way control over how many resources to use is in git-backup's hands. Git also packs better this way (when using only 1 thread), because when deltifying, all objects are considered against each other, not only the objects inside one thread's object pool. And even when pack.threads is not 1, the first "objects counting" phase of pack is serial, wasting all but 1 core.

On lab.nexedi.com we already use pack.threads=1 by default in the global gitconfig, but the above change makes the code universal.

Time to restore nexedi/ from lab.nexedi.com backup:

2CPU laptop:

    before (pack.threads=1)     10m11s
    before (pack.threads=NCPU)   9m13s
    after -j1                   10m11s
    after                        6m17s

8CPU system (with other load present, noisy):

    before (pack.threads=1)     ~5m
    after                       ~1m30s
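The worker scheme described above can be sketched roughly as follows. This is only an illustration, not the actual git-backup code: `PackRequest`, `ExtractFunc`, `extractAll` and `errSimulated` are hypothetical names, and the real job would spawn `git pack-objects ...` instead of the stub used here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// PackRequest is a hypothetical description of one pack-extraction job.
type PackRequest struct{ Repo string }

// ExtractFunc performs one job; cancel is closed once some job has failed.
type ExtractFunc func(cancel <-chan struct{}, req PackRequest) error

// errSimulated stands in for a real extraction failure in the demo below.
var errSimulated = errors.New("simulated pack extraction failure")

// extractAll feeds reqv to njobs workers. The first error only signals
// cancellation; we still wait for all workers to stop and then return
// every error that was collected.
func extractAll(reqv []PackRequest, njobs int, extract ExtractFunc) []error {
	if njobs < 1 {
		njobs = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	jobs := make(chan PackRequest)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errv []error
	)
	for i := 0; i < njobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range jobs {
				if ctx.Err() != nil {
					continue // cancelled: drain the queue without working
				}
				if err := extract(ctx.Done(), req); err != nil {
					mu.Lock()
					errv = append(errv, err)
					mu.Unlock()
					cancel() // only a signal; others finish on their own
				}
			}
		}()
	}

	// main worker: prepare requests until done or cancelled
feed:
	for _, req := range reqv {
		select {
		case jobs <- req:
		case <-ctx.Done():
			break feed
		}
	}
	close(jobs)
	wg.Wait()
	return errv
}

func main() {
	reqv := []PackRequest{{Repo: "a.git"}, {Repo: "bad.git"}, {Repo: "c.git"}}
	errv := extractAll(reqv, runtime.NumCPU(), func(_ <-chan struct{}, r PackRequest) error {
		if r.Repo == "bad.git" {
			return errSimulated
		}
		return nil
	})
	fmt.Println(len(errv), "error(s):", errv)
}
```

Passing only a cancel channel to the job keeps the "it is only a signal" property explicit: a running job may look at it, but the pool never kills anything and always waits.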
-
Kirill Smelkov authored
like in 302aaaea (raiseif: Fix it wrt erraddcallingcontext()) now fix raisef, which I originally overlooked.
-
- 31 Jul, 2016 3 commits
-
-
Kirill Smelkov authored
Because spawning a separate process per commit is slow.

Libgit2 does not allow creating commits knowing only the tree & parentv sha1s, but we can create commit objects by hand pretty easily - their format is

    tree <sha1>
    parent <parent1-sha1>
    parent <parent2-sha1>
    ...
    author user <email> date +offset
    committer user <email> date +offset
    LF
    message

Time for pulling-in kirr/slapos.git

    before: 2.5s
    after:  0.9s

NOTE AuthorInfo is changed to inherit from git.Signature (same fields and semantics).

NOTE Since libgit2's default ident detection can fail, and does not look beyond user.name and user.email, we do backup identity detection (user/hostname) ourselves, in a similar way to what Git does.
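A minimal sketch of building such a commit object by hand is shown below. The function names `rawCommit` and `objectSha1` are made up for illustration and are not the git-backup API; the object id is computed the way Git does for any object, as the sha1 of `<type> <size>\0<content>`.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strings"
)

// rawCommit assembles raw commit content in the format listed above.
// author and committer are passed already formatted as
// "user <email> date +offset".
func rawCommit(tree string, parentv []string, author, committer, msg string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "tree %s\n", tree)
	for _, p := range parentv {
		fmt.Fprintf(&b, "parent %s\n", p)
	}
	fmt.Fprintf(&b, "author %s\n", author)
	fmt.Fprintf(&b, "committer %s\n", committer)
	b.WriteString("\n") // the LF separating headers from the message
	b.WriteString(msg)
	return b.String()
}

// objectSha1 returns the git object id for content of the given type.
func objectSha1(objType, content string) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", objType, len(content))
	h.Write([]byte(content))
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// 4b825dc6... is the well-known id of the empty tree.
	ident := "user <user@example.com> 1469923200 +0000"
	raw := rawCommit("4b825dc642cb6eb9a060e54bf8d69288fbee4904", nil, ident, ident, "initial\n")
	fmt.Print(raw)
	fmt.Println("commit id:", objectSha1("commit", raw))
}
```

To actually store the object, the raw content would still have to be written to the object database (the later patches in this log do that through git2go's Odb); this sketch only covers the encoding and hashing part.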
-
Kirill Smelkov authored
We are going to rework this function, but before adding changes let's move it to more appropriate place. Since xcommit_tree() creates commit object from tree and parents and is pretty standard git function - the appropriate place is gitobjects. NOTE we cannot just replace xcommit_tree() with g.CreateCommit() as the latter works with already loaded tree and parent objects, but we want to be able to make commits only knowing tree and parents sha1.
-
Kirill Smelkov authored
In an upcoming patch we are going to switch xcommit_tree() to our own implementation, and since this can potentially change how commits are represented, for backward compatibility we need to make sure objects encoded as commits stay the same.

So for all kinds of objects (they are present in testdata/ repositories) add checks that:

- encode/decode is idempotent;
- encoding and decoding produce exactly the expected sha1.

One nice side effect of this is that we can now remove the runtime consistency check from the tail of decoding. That check was there from the beginning - from 6f237f22 (git-backup: Initial draft) - mainly because there was no testsuite at that time.

The place of that check is, however, not even completely right: if we somehow pulled an object wrongly, it has to be detected at pull time, not at restore time. So the check was covering only 1/2 of the implementation, and not the main one - it only checked that decoding does not mess things up.

Since we now have a proper testsuite and add encode/decode tests in this patch, we can remove that partial runtime check. And even if decoding messes something up despite being covered by tests, it will be 100% caught by the restore process: if an object which needs to be present in an extracted repository is missing, pack generation for that repository will fail. So the removal is safe.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

    before: 5.5s
    after:  3.5s

(so much because there are ~500 tags in slapos.git, and currently tag encoding is done by spawning a separate subprocess per tag)
-
- 30 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
Do not waste resources by spawning `git update-index ...` for every file converted to a blob - we can queue the info and add all entries to the index in one go.

Time to pull the files part for lab.nexedi.com

    before: ~110s
    after:  ~3s
-
- 29 Jul, 2016 6 commits
-
-
Kirill Smelkov authored
Time for restoring kirr/slapos.git from lab.nexedi.com backup before: 7.4s after: 5.6s
-
Kirill Smelkov authored
We can reuse ReadObject() like for blob_to_file(). We cannot drop xload_tag() in favor of Repository.LookupTag() because upon tag loading we need to have not only the parsed tag, but also its raw content for encoding in another object.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

    before: 8.9s
    after:  7.4s

(it goes down because on restore, restored tags are re-encoded to verify that the restoration was ok. Pulling time should go down accordingly as well)
-
Kirill Smelkov authored
Replace `git cat-file` with Odb.Read() and `git hash-object -w` with Odb.Write().

Timing for restoring only files from lab.nexedi.com backup:

    before: ~95s
    after:  ~8s

Timings for making the file part of a backup should improve similarly.
-
Kirill Smelkov authored
This saves us one `git cat-file` call per recreated tag. Time for restoring kirr/slapos.git from lab.nexedi.com backup before: 10.3s after: 8.9s
-
Kirill Smelkov authored
Currently for every file -> blob and blob -> file conversion we invoke a git subprocess (cat-file or hash-object). We also invoke a git subprocess for every tag read/write, and the same for commits, and this 1 subprocess per 1 object has very high overhead.

The ways to avoid such overhead could be:

1) for every kind of operation spawn a git service process, like e.g. `git cat-file --batch` for reading files, and only do request/reply per object with it.
2) use some Go library to work with the git repository ourselves.

"1" can work but:

- at present there is no counterpart of `cat-file --batch` for e.g. `hash-object` - i.e. we cannot write objects without quirks or patching git.
- even if we add support for hashing via request/reply, as all requests are processed sequentially on the git side by e.g. `git cat-file --batch`, we won't be able to leverage parallelism.
- request/reply also has latency attached.

For "2" we have roughly the following choices:

- use cgo bindings to libgit2 (git2go)
- use some pure-go git library

The pure-go approach has the advantage that it by design avoids problems related to the tricky CGo rules for passing pointers between C and Go. The fact that this was sorted out by the Go team itself only during the 1.6 cycle

    https://github.com/golang/go/issues/12416

tells a lot. The net is full of examples where those were hard to get right, and git2go in particular has a story of e.g. heap corruption (the bug was on the golang side itself and fixed only for 1.5)

    https://github.com/libgit2/git2go/issues/223
    https://groups.google.com/forum/#!topic/golang-nuts/Vi1HD-54BTA/discussion

However there is no good (to my knowledge) pure-go git library, and the family of forks around github.com/speedata/gogit either:

- works 3x slower compared to git2go (or the same 3x in serial mode compared to e.g. `git cat-file --batch`, as in serial mode the git subservice and git2go have roughly similar performance)
- or does not work at all (e.g. barfing out on REF_DELTA pack entries, etc)

So because of the 3x slowdown, the pure-go way is currently a non-starter.

Since one person from the golang team cared to update git2go to properly follow the CGo rules

    https://github.com/libgit2/git2go/pull/282

we can be relatively confident about the quality of the git2go bindings and try to use them.

This commit only hooks git2go into the build, subcommands and Sha1 for to/from Oid conversion. We'll be switching places to git2go incrementally in upcoming patches.

NOTE for now we need git2go from the next branch for

    https://github.com/libgit2/git2go/commit/cf7553e7

The plan is to eventually switch to gopkg.in/libgit2/git2go.v25 once it is out.
-
Kirill Smelkov authored
We are going to use git2go (see next patch), whose canonical package name is git (import "github.com/libgit2/git2go" results in the package name being autotruncated to just "git"), so free up the "git" name for that package. The reason is: git() - as a function - is not used often, while the package will be used often.

Regarding naming: not sure it is a good choice, but ggit() is something like xgit(), only g is for "GitError".
-
- 27 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
We can do similar to what git does for blobs - searching in a window of repositories sorted by repo basename.
-
- 25 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
In 28986e0e (Rewrite in Go) I've added mypkgname() with a comment that Go escapes all '.' in a function name with %2e. That turned out not to be true: Go escapes only the dots in the last component after the last slash, e.g.

    lab.nexedi.com/kirr/git-backup/package%2ename.Function
    lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function

Correct mypkgname() accordingly.

Noted while trying to run git-backup in a GOPATH root, not as standalone.
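Under the escaping rule described above, recovering the package path from a runtime function name could look roughly like this. It is a sketch only: `splitFuncName` is a hypothetical helper, not the actual mypkgname(), and it unescapes only `%2e` because that is the only escape the message mentions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFuncName splits a runtime function name such as
// "host.com/a/pkg%2ename.Function" into package path and function name.
// Dots before the last slash are literal; in the last path component the
// first '.' separates package from function, and dots belonging to the
// package name itself appear escaped as %2e.
func splitFuncName(name string) (pkg, fn string) {
	slash := strings.LastIndexByte(name, '/') // -1 when there is no slash
	last := name[slash+1:]
	dot := strings.IndexByte(last, '.')
	if dot < 0 {
		return name, "" // no function part found
	}
	pkgLast := strings.Replace(last[:dot], "%2e", ".", -1)
	return name[:slash+1] + pkgLast, last[dot+1:]
}

func main() {
	for _, name := range []string{
		"lab.nexedi.com/kirr/git-backup/package%2ename.Function",
		"lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function",
		"main.main",
	} {
		pkg, fn := splitFuncName(name)
		fmt.Printf("%s -> pkg=%q func=%q\n", name, pkg, fn)
	}
}
```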
-
- 07 Jul, 2016 2 commits
-
-
Kirill Smelkov authored
erraddcallingcontext() already tries not to go beyond raise, but since raiseif was calling raise, it was omitting raise but not raiseif itself. So an error could look like this

    cmd_restore: raiseif: mkdir ../R/1: file exists

while it should be

    cmd_restore: mkdir ../R/1: file exists

Fix it.
-
Kirill Smelkov authored
when/if we ever get to need them.
-
- 06 Jul, 2016 2 commits
-
-
Kirill Smelkov authored
It was a default leftover to autodetect object type if obj_type=None, from the beginning - from bbee44ce (Start of git-backup.git) - because even there obj_represent_as_commit() is always called with obj_type explicitly passed in. So remove the leftover.
-
Kirill Smelkov authored
This is a more-or-less 1-to-1 port of git-backup to Go. There are things we handle a bit differently:

- there is a separate type for Sha1
- conversion of repo paths to git references is now more robust wrt avoiding constructs not allowed in git, like ".." or ".lock"

      https://git.kernel.org/cgit/git/git.git/tree/refs.c?h=v2.9.0-37-g6d523a3#n34

The rewrite happened because we need to optimize restore, and for e.g. the parallelizing part it should be convenient to use goroutines and channels.

I'm not very comfortable with how error handling is done, because contrary to what the canonical Go way seems to be, in a lot of places it still looks to me that exceptions are a better idea compared to just error codes, though in many places just error codes are better and make more sense. Probably there will be fewer exceptions over time, once the code starts to be a collaborating set of goroutines with communication done via channels.

Still a lot of python habits on my side.

And as a bonus we now have end-to-end pull/restore tests...
-
- 20 Jun, 2016 2 commits
-
-
Kirill Smelkov authored
Bug present since the beginning: 6f237f22 (git-backup: Initial draft).
-
Kirill Smelkov authored
Even though we delete all temporary refs after pull, git leaves empty directories in the place where the refs were - for example if there was a ref dir/ref and we delete ref `ref`, the empty dir/ is still left there.

That increasingly hurts the performance of the next pulls - before pulling, git wants to scan all local refs, and while doing so it descends into all directories under refs/. As after several pulls we can have many such empty directories under refs/backup/, this scanning can take quite some time: e.g. for lab.nexedi.com a normal pull currently takes ~3 minutes, but after doing pull ~60 times, it can become as bad as ~10 minutes for one pull.

And all that slowness goes away after cleaning refs/backup/ manually.

/cc https://lab.nexedi.com/lab.nexedi.com/lab.nexedi.com/issues/4
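The message does not say how the cleanup is implemented, so the following is only one possible way to remove leftover empty directories below a refs/backup/-like root, written as a hedged sketch. `pruneEmptyDirs` and `demo` are hypothetical names, and the demo works on a temporary directory rather than a real repository.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneEmptyDirs removes, bottom-up, every empty directory below root
// (root itself is kept) and returns how many directories were removed.
func pruneEmptyDirs(root string) (int, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return 0, err
	}
	removed := 0
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		sub := filepath.Join(root, e.Name())
		n, err := pruneEmptyDirs(sub) // children first
		removed += n
		if err != nil {
			return removed, err
		}
		left, err := os.ReadDir(sub)
		if err != nil {
			return removed, err
		}
		if len(left) == 0 {
			if err := os.Remove(sub); err != nil {
				return removed, err
			}
			removed++
		}
	}
	return removed, nil
}

// demo builds a throwaway tree with an empty a/b/ chain and a keep/ref
// file, prunes it, and reports the removed count and whether ref survived.
func demo() (int, bool, error) {
	tmp, err := os.MkdirTemp("", "prune-demo")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(tmp)

	backup := filepath.Join(tmp, "refs", "backup")
	if err := os.MkdirAll(filepath.Join(backup, "a", "b"), 0o755); err != nil {
		return 0, false, err
	}
	if err := os.MkdirAll(filepath.Join(backup, "keep"), 0o755); err != nil {
		return 0, false, err
	}
	ref := filepath.Join(backup, "keep", "ref")
	if err := os.WriteFile(ref, []byte("0000\n"), 0o644); err != nil {
		return 0, false, err
	}

	removed, err := pruneEmptyDirs(backup)
	if err != nil {
		return removed, false, err
	}
	_, statErr := os.Stat(ref)
	return removed, statErr == nil, nil
}

func main() {
	removed, kept, err := demo()
	fmt.Println("removed:", removed, "ref kept:", kept, "err:", err)
}
```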
-
- 13 Jun, 2016 1 commit
-
-
Kirill Smelkov authored
Same story as in e.g. wendelin.core@b0b2c52e ( in short: GitLab now prepends namespace/repo/blob/ref/ prefix by itself )
-
- 02 May, 2016 1 commit
-
-
Kirill Smelkov authored
No need to compute that twice. My mistake from original 6f237f22 (git-backup: Initial draft).
-
- 13 Apr, 2016 2 commits
-
-
Kirill Smelkov authored
README.txt was renamed -> README.rst in a695bdbe (readme: .txt -> .rst)
-
Kirill Smelkov authored
Add a short introduction which outlines what gitlab-backup program does.
-
- 29 Feb, 2016 1 commit
-
-
Kirill Smelkov authored
Following up on 48062989 (gitlab-backup/restore: Gitlab wants uploads/ to be 0750 and dirs inside uploads/ to be 0755): Starting from 8.5: https://gitlab.com/gitlab-org/gitlab-ce/commit/4f946f03 GitLab wants uploads/ to be 0700 and dirs inside uploads/ to be 0700 too.
-
- 28 Feb, 2016 1 commit
-
-
Kirill Smelkov authored
ln has several syntaxes.

man ln 1 ln:

    SYNOPSIS
        ln [OPTION]... [-T] TARGET LINK_NAME   (1st form)
        ln [OPTION]... TARGET                  (2nd form)
        ln [OPTION]... TARGET... DIRECTORY     (3rd form)
        ln [OPTION]... -t DIRECTORY TARGET...  (4th form)

so without -T or -t, what is the target and what is the link name is ambiguous, and ln tries to guess.

Now imagine:

    ln -sf /path/to/new/hook $H

and let us consider that $H is already a symlink, pointing to some place which _exists_, but which the current user does not have access to. Then ln will complain:

    ln: accessing `$H': Permission denied

and abort.

Fix it by explicitly specifying the ln form we use with -T.
-
- 10 Feb, 2016 1 commit
-
-
Kirill Smelkov authored
On restore we were initializing refs/ and objects/ for repositories obtained from the backed-up refs set, but this approach does not cover empty repositories - i.e. repositories without any ref at all.

A frequent case for this is *.wiki.git in gitlab, and if we restore only files for such a repo, without empty refs/ and objects/, it would look like it was restored ok, but any git-related operation on such a repo will fail.

Fix it by making sure to create refs/ and objects/ the first time we see a *.git while restoring files.

/cc @kazuhiko
-
- 09 Feb, 2016 9 commits
-
-
Kirill Smelkov authored
Add comments about what each function does, and add appropriate echo which were missing in several pull & restore places.
-
Kirill Smelkov authored
- don't start/stop services - we assume appropriate services start/stop will be done by the invoker, and tell people to do so via dumping proper comments. (Rationale: services are started/stopped differently on different systems, e.g. in omnibus and in slapos)
- mv repositories in atomically with just 1 mv + fix the case when there was no repositories/ previously at all.
- adjust `gitlab-rake gitlab:backup:restore` with force=yes, so it does not interactively ask whether it is ok to restore ssh keys - just do it.
- add `-go` option to actually run gitlab restoration in addition to preparing backup files.

/cc @kazuhiko
-
Kirill Smelkov authored
Currently GitLab backup restoration works only on exactly the same GitLab version as the one with which the backup was made:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/manager.rb#L132

However in many cases restoring a backup on a newer GitLab version is desirable - e.g. when moving a GitLab instance to upgraded software. The GitLab answer - that we should first prepare exactly the same GitLab version on the moved instance, restore the backup, then upgrade GitLab itself _inplace_ - is not satisfactory in e.g. the slapos case, as upgrading can take a long time, and in-place software changes can render the GitLab instance non-working.

What we prefer to do is to fully prepare the new GitLab software version, and then, knowing the software is ready, restore the backup in a quick manner.

The following analysis says we should be 99% ok to do so:

1. git-backup cares about backward compatibility of the repositories backup format.
2. the db dump is backward compatible, because Rails, when seeing an old db schema, will run migrations.
3. the rest is relatively minor - e.g. uploads, which are just files in a tar, and the format for such things changes seldom.

Because of 3, strictly speaking, it is not 100% correct to restore a backup from an older gitlab version to a newer one (since gitlab does not promise backward compatibility for e.g. the uploads/ backup format), but in practice it is 99% correct and is usually handy.

/cc @kazuhiko
-
Kirill Smelkov authored
As with repositories (see patch "gitlab-backup/restore: GitLab wants repositories to be drwxrws---") Gitlab wants proper permissions for uploads/ - else the following checks fail

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chmod 0750 .../var/gitlab/uploads
      For more information see:
      doc/install/installation.md in section "GitLab"
      Please fix the error above and rerun the checks.

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chown -R slapuser14 .../var/gitlab/uploads
      sudo find .../var/gitlab/uploads -type f -exec chmod 0644 {} \;
      sudo find .../var/gitlab/uploads -type d -not -path .../var/gitlab/uploads -exec chmod 0755 {} \;

and files are not served back from uploads - e.g. uploaded icons are not shown.

/cc @kazuhiko
-
Kirill Smelkov authored
By design Gitlab currently symlinks *.git/hooks to hooks in the gitlab-shell working tree. As the gitlab-shell worktree can be located in another place when restoring a backup on a different machine, all hooks need to be adjusted upon restoration.

Btw, Gitlab itself does the same:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/repository.rb#L103
    https://gitlab.com/gitlab-org/gitlab-ce/commit/1d03fa2e

/cc @kazuhiko
-
Kirill Smelkov authored
As git-backup does not currently preserve file permissions fully, we need to adjust them on restore. For repositories, the following gitlab check currently fails after restore:

    Repo base access is drwxrws---? ... no

Fix it.

/cc @kazuhiko
-
Kirill Smelkov authored
As was outlined 2 patches before (gitlab-backup: Dump DB ourselves), currently the DB dump is not git friendly, because for each table the dump is just one (potentially large) file which grows over time. In Gitlab there is one big table which dominates ~95% of the whole dump size.

So to avoid overloading git with large blobs, let's split each table into parts <= 16M in size; this way we do not store very large blobs in git, which it handles inefficiently.

The fact that table data is sorted (see previous patch) helps the splitting result to be more-or-less stable - as we split not purely by byte size, but by lines, and the max size of 16M is only approximate, if a row is changed in a part, it will be split the same way on the next backup run.

This does not work so well when row entries are large themselves (e.g. for big patches which change a lot of files with a big diff). For such cases splitting can be improved by splitting at edges found similarly to e.g. bup[1] - by finding nodes of a rolling checksum - but for now we are staying with the simpler way of doing the split.

This reduces load on git packing (for e.g. repack or when doing fetch and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko
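Line-based splitting with an approximate size cap can be illustrated with the sketch below. It is an assumption-laden illustration, not the gitlab-backup implementation (which is a shell tool and whose exact splitting command is not shown here): `splitLines` is a hypothetical helper that starts a new part whenever adding the next row would exceed the cap, and never cuts a row in the middle.

```go
package main

import "fmt"

// splitLines groups rows into consecutive parts. A new part is started
// when adding the next row (plus its newline) would push the current
// part over maxSize bytes. A single row larger than maxSize still gets a
// part of its own, so maxSize is only an approximate limit.
func splitLines(lines []string, maxSize int) [][]string {
	var parts [][]string
	var cur []string
	size := 0
	for _, l := range lines {
		n := len(l) + 1 // +1 for the trailing newline
		if len(cur) > 0 && size+n > maxSize {
			parts = append(parts, cur)
			cur, size = nil, 0
		}
		cur = append(cur, l)
		size += n
	}
	if len(cur) > 0 {
		parts = append(parts, cur)
	}
	return parts
}

func main() {
	rows := []string{"1\talpha", "2\tbeta", "3\tgamma", "4\tdelta"}
	// A tiny cap is used for the demo; the commit uses 16M.
	for i, part := range splitLines(rows, 16) {
		fmt.Printf("part %d: %q\n", i, part)
	}
}
```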
-
Kirill Smelkov authored
As was outlined in the previous patch, the DB dump is currently not git/rsync friendly because the order of rows in a PostgreSQL dump constantly changes:

pg_dump dumps table data with `COPY ... TO stdout`, which does not guarantee any ordering -

    http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
    http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

- in fact it dumps data as stored raw in DB pages, and every record update changes the row order.

On the other hand, Rails by default adds an integer `id` first column to every table as a convention -

    http://edgeguides.rubyonrails.org/active_record_basics.html

and GitLab does not override this. So we can sort tables on id and this way make the data order stable.

And even if there is no id column we can sort - as COPY does not guarantee ordering, we can change the order of rows in _whatever_ way and the dump will still be correct.

This change helps git a lot to find good object deltas in less time, and it should also help rsync to find smaller deltas between backup dumps.

NOTE no changes are needed on the restore side at all - the dump stays valid, sorted or not, and restores to semantically the same DB, even if the internal row ordering is different.

/cc @kazuhiko
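As an illustration of sorting dumped rows on the leading `id` column, here is a small sketch. It assumes the default tab-delimited text output of `COPY ... TO stdout` with `id` as the first column; `rowID` and `sortRows` are hypothetical names, and the actual gitlab-backup tool may well do this differently (e.g. with sort(1)).

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// rowID extracts the integer in the first tab-separated column of a row.
// ok is false when the first column is not an integer.
func rowID(row string) (id int64, ok bool) {
	first := row
	if i := strings.IndexByte(row, '\t'); i >= 0 {
		first = row[:i]
	}
	id, err := strconv.ParseInt(first, 10, 64)
	return id, err == nil
}

// sortRows orders rows by numeric id. Rows without a numeric id are
// placed after the numeric ones and ordered as plain strings, so the
// result is deterministic even for tables that have no id column.
func sortRows(rows []string) {
	sort.SliceStable(rows, func(i, j int) bool {
		a, aok := rowID(rows[i])
		b, bok := rowID(rows[j])
		if aok != bok {
			return aok // numeric ids first
		}
		if aok && a != b {
			return a < b
		}
		return rows[i] < rows[j]
	})
}

func main() {
	rows := []string{"10\tzeta", "2\tbeta", "1\talpha"}
	sortRows(rows)
	for _, r := range rows {
		fmt.Println(r)
	}
}
```

Comparing ids numerically rather than as strings matters here: with a plain string sort, "10" would come before "2".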
-
Kirill Smelkov authored
The reason to do this is that we want to have more control over the DB dump process. Current problems which lead to this decision are:

1. The DB dump is one large file whose size grows over time. This is not friendly to git;
2. The DB dump is currently not git/rsync friendly - when PostgreSQL does a dump, it just copies internal pages of data to the output. And the internal ordering changes every time a row is updated.

       http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
       http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

Both 1 and 2 currently bring our backup tool to its knees. We'll be handling those issues in the following patches.

For now we perform the dump manually and switch from dumping in plain-text SQL to dumping in the PostgreSQL native "directory" format, where there is a small table of contents with the schema (toc.dat) and the output of `COPY <table> TO stdout` for each table in a separate file.

    http://www.postgresql.org/docs/9.5/static/app-pgdump.html

On restore we recreate plain-text SQL with pg_restore and give this plain-text SQL back to gitlab, so it thinks it restores it the usual way.

NOTE: backward compatibility is preserved - the restore part, if it sees a backup made by an older version of gitlab-backup, which dumps database.sql in plain text, restores it correctly.

NOTE2: now gitlab-backup supports only PostgreSQL (e.g. not MySQL). Adding support for other databases is possible, but requires a custom handler for every DB (or maybe just a fallback to the usual plaintext).

NOTE3: even as we split the DB into separate tables, this does not currently help problem #1, as in GitLab it is mostly just one table which occupies the whole space.

/cc @kazuhiko
-
- 08 Feb, 2016 4 commits
-
-
Kirill Smelkov authored
For now having $tmpd worked ok, but in the next patch, we are going to pass this directory to a command, which, when run, automatically changes its working directory as a first step, so passing $tmpd as relative pathname won't work for it. So switch $tmpd to be an absolute path. /cc @kazuhiko
-
Kirill Smelkov authored
In the following patches we will be adding more and more settings to read from the gitlab config, so the structure of the code which does this is now better prepared:

- the part that emits the settings (in Ruby) is now multiline
- we prepare shortcuts c & s which are Gitlab.config and Gitlab.config.gitlab_shell
- in the end there is "END" emitted, and the reader checks this to make sure the generate and read parts stay in sync.

/cc @kazuhiko
-
Kirill Smelkov authored
It works ok without it:

---- 8< ---- z.sh
#!/bin/bash -e

{
    read A
    read B
} < <(echo -e 'AAA\nBBB')

echo $A
echo $B
---- 8< ----

    $ ./z.sh
    AAA
    BBB
    $ echo $?
    0

/cc @kazuhiko
-
Kirill Smelkov authored
In 495bd2fa (gitlab-backup: Unpack *.tar.gz before storing them in git) we used find(1) to find *.tar.gz and unpack/repack them on backup/restore. However `find -exec ...` does not stop on errors and does not report them. Compare:

---- 8< ---- x.sh
#!/bin/bash -e

echo AAA
find . -exec false ';'
echo BBB
---- 8< ----

---- 8< ---- y.sh
#!/bin/bash -e

echo XXX
find . | \
while read F; do
    false
done
echo YYY
---- 8< ----

    $ ./x.sh
    AAA
    BBB
    $ echo $?
    0

    $ ./y.sh
    XXX
    $ echo $?
    1

So we switch to the second style, where find passes entries to the processing program via a pipe. This second style is also cleaner, in my view, because the listing and processing parts are now better structured.

/cc @kazuhiko
-