Commits · fa5226c9378fe9686f31bb31cb743f1fd7af910e · iv / git-backup

13 Apr, 2016 1 commit

contrib/gitlab-backup: Intro · fa5226c9

Kirill Smelkov authored Apr 13, 2016

Add a short introduction which outlines what the gitlab-backup program does.

fa5226c9

29 Feb, 2016 1 commit

gitlab-backup/restore: Gitlab >= 8.5 now wants uploads to be 0700 · 6a2852cf

Kirill Smelkov authored Feb 29, 2016

Following up on 48062989 (gitlab-backup/restore: Gitlab wants uploads/
to be 0750 and dirs inside uploads/ to be 0755):

Starting from 8.5:

    https://gitlab.com/gitlab-org/gitlab-ce/commit/4f946f03

GitLab wants uploads/ to be 0700 and dirs inside uploads/ to be 0700
too.
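
A minimal sketch of such a permission fix-up, written in Python for
illustration only. The helper name `fix_uploads_perms` and its parameters
are hypothetical; the actual gitlab-backup code is not shown in this log.
The defaults follow the >= 8.5 requirement above, and the older 0750/0755
layout can be requested explicitly:

```python
import os

def fix_uploads_perms(uploads, top_mode=0o700, subdir_mode=0o700):
    """Set the mode of uploads/ and of every directory below it.

    Hypothetical helper: regular files are left untouched.
    """
    os.chmod(uploads, top_mode)
    for dirpath, dirnames, _filenames in os.walk(uploads):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):      # never chmod through a symlink
                os.chmod(path, subdir_mode)
```

For a pre-8.5 GitLab the call would be `fix_uploads_perms(path, 0o750, 0o755)`.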

6a2852cf

28 Feb, 2016 1 commit

gitlab-backup/restore: Don't allow ln ambiguity (which can lead to failures) · 7279754d

Kirill Smelkov authored Feb 28, 2016

ln has several syntaxes. From ln(1):

   SYNOPSIS
          ln [OPTION]... [-T] TARGET LINK_NAME   (1st form)
          ln [OPTION]... TARGET                  (2nd form)
          ln [OPTION]... TARGET... DIRECTORY     (3rd form)
          ln [OPTION]... -t DIRECTORY TARGET...  (4th form)

so without -T or -t what is target and what is link name is ambiguous and
ln tries to guess. Now imagine:

    ln -sf /path/to/new/hook    $H

and let us consider that $H is already a symlink, pointing to some place
which _exists_, but which the current user does not have access to. Then
ln will complain:

    ln: accessing `$H': Permission denied

and abort.

Fix it by specifying ln form we use explicitly with -T.
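
For illustration, here is a Python sketch of the same "always treat the
link name as the link itself" behaviour. The function name `force_symlink`
and the temporary-name suffix are assumptions. Note that it replaces the
old link with a rename, which is a different technique from what `ln -f`
does (unlink, then create):

```python
import os

def force_symlink(target, link_name):
    """Make link_name a symlink to target, replacing link_name itself.

    As with `ln -sfT`, link_name is never treated as a directory to put
    the new link into, even if it currently points to a directory.
    """
    tmp = link_name + ".tmp-link"      # hypothetical temporary name
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target, tmp)            # target is stored as-is, not accessed
    os.replace(tmp, link_name)         # rename overwrites the old symlink
```

Because neither call follows the old link, the place it pointed to is
never accessed.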

7279754d

10 Feb, 2016 1 commit

Make sure git will recognize *.git as repositories, even empty ones, after restore · b770b689

Kirill Smelkov authored Feb 10, 2016

On restore we were initializing refs/ and objects/ for repositories
obtained from the backed-up refs set, but this approach does not cover
empty repositories - i.e. repositories without any ref at all.

A frequent case for this is *.wiki.git in gitlab, and if we restore only
files for such a repo, without empty refs/ and objects/, it would look as
if it was restored ok, but any git-related operation on it will fail.

Fix it via making sure to create refs/ and objects/ the first time we
see a *.git while restoring files.
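
A minimal sketch of that step in Python (the helper name
`ensure_git_skeleton` is hypothetical). It relies on the fact that git
treats a directory as a repository only when refs/ and objects/ exist
next to HEAD, and that empty directories are not stored in a git tree:

```python
import os

def ensure_git_skeleton(repo):
    """Create empty refs/ and objects/ in a restored *.git directory if missing."""
    for sub in ("refs", "objects"):
        os.makedirs(os.path.join(repo, sub), exist_ok=True)
```

Calling it for every *.git seen while restoring files is harmless for
non-empty repositories, since existing directories are left as they are.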

/cc @kazuhiko

b770b689

09 Feb, 2016 9 commits

gitlab-backup: Cosmetics · 02c80d58

Kirill Smelkov authored Feb 09, 2016

Add comments about what each function does, and add the appropriate echo
calls which were missing in several pull & restore places.

02c80d58

gitlab-backup/restore: Review restoration commands + add way to actually run them on user request · 14ce9ff3

Kirill Smelkov authored Feb 09, 2016

- don't start/stop services - we assume appropriate services start/stop
  will be done by the invoker, and tell people to do so via dumping proper
  comments. (Rationale: services are started/stopped differently on
  different systems, e.g. in omnibus and in slapos)

- mv in repositories atomically with just 1 mv + fix the case when there
  was no repositories/ previously at all.

- adjust `gitlab-rake gitlab:backup:restore` with force=yes, so it does
  not interactively ask about whether ok to restore ssh keys - just do it.

- add `-go` option to actually run gitlab restoration in addition to
  preparing backup files.

/cc @kazuhiko

14ce9ff3

gitlab-backup/restore: Allow restoration on higher GitLab version, if user requests so · a8ba07d5

Kirill Smelkov authored Feb 09, 2016

Currently GitLab backup restoration works only on exactly the same GitLab
version as the one with which the backup was made:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/manager.rb#L132

However in many cases restoring a backup on a newer GitLab version is
desirable - e.g. when moving a GitLab instance to upgraded software.
GitLab's answer - that we should first prepare exactly the same GitLab
version on the moved instance, restore the backup, and then upgrade GitLab
itself _in place_ - is not satisfactory in e.g. the slapos case, as
upgrading can take a long time, and in-place software changes can render
the GitLab instance non-working.

What we prefer to do instead is to fully prepare the new GitLab software
version, and then, knowing the software is ready, restore the backup
quickly.

The following analysis says we should be 99% ok to do so:

1. git-backup cares about backward compatibility for the format of
   repositories backup.
2. db dump is backward compatible, because Rails, when seeing old db
   schema, will run migrations.
3. the rest is relatively minor - e.g. uploads, which is just files in
   tar, and the format for such things seldom changes.

Because of 3, strictly speaking, it is not 100% correct to restore a
backup from an older gitlab version to a newer one (since gitlab does not
provide a promise of backward compatibility on e.g. uploads/ backup
format), but in practice it is 99% correct and is usually handy.
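
The resulting policy can be sketched as follows. This is an illustrative
Python sketch, not the actual gitlab-backup code: the names
`parse_version` and `may_restore` and the `allow_newer` flag are
assumptions, and only plain dotted numeric versions are handled:

```python
def parse_version(v):
    """'8.5.1' -> (8, 5, 1); only plain dotted numeric versions are handled."""
    return tuple(int(part) for part in v.split("."))

def may_restore(backup_version, gitlab_version, allow_newer=False):
    """Same version is always ok; a newer GitLab only if explicitly requested."""
    b, g = parse_version(backup_version), parse_version(gitlab_version)
    if b == g:
        return True
    return allow_newer and g > b
```

Restoring onto an older GitLab is refused in either mode, because
migrations only run forward.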

/cc @kazuhiko

a8ba07d5

gitlab-backup/restore: Gitlab wants uploads/ to be 0750 and dirs inside uploads/ to be 0755 · 48062989

Kirill Smelkov authored Feb 09, 2016

As with repositories (see patch "gitlab-backup/restore: GitLab wants
repositories to be drwxrws---") Gitlab wants proper permissions for
uploads/ - otherwise the following checks fail:

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chmod 0750 .../var/gitlab/uploads
      For more information see:
      doc/install/installation.md in section "GitLab"
      Please fix the error above and rerun the checks.

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chown -R slapuser14 .../var/gitlab/uploads
      sudo find .../var/gitlab/uploads -type f -exec chmod 0644 {} \;
      sudo find .../var/gitlab/uploads -type d -not -path .../var/gitlab/uploads -exec chmod 0755 {} \;

and files are not served back from uploads - e.g. uploaded icons are not shown.

/cc @kazuhiko

48062989

gitlab-backup/restore: Adjust hooks links to point to current gitlab-shell location · a3e3e5ad

Kirill Smelkov authored Feb 09, 2016

By design Gitlab currently symlinks *.git/hooks to hooks in the
gitlab-shell working tree. As the gitlab-shell worktree can be located in
another place when restoring a backup on a different machine, all hooks
need to be adjusted upon restoration.

Btw, Gitlab itself does the same:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/repository.rb#L103
    https://gitlab.com/gitlab-org/gitlab-ce/commit/1d03fa2e

/cc @kazuhiko

a3e3e5ad

gitlab-backup/restore: GitLab wants repositories to be drwxrws--- · c8ac2f3a

Kirill Smelkov authored Feb 09, 2016

As git-backup does not currently preserve file permissions fully, we
need to adjust them on restore. For repositories after restore the
following gitlab check currently fails:

    Repo base access is drwxrws---? ... no

Fix it.
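
For reference, drwxrws--- corresponds to mode 2770 (rwx for owner and
group, plus the setgid bit). A small Python sketch, with hypothetical
names `REPO_BASE_MODE`, `mode_string` and `fix_repo_base`:

```python
import os
import stat

REPO_BASE_MODE = 0o2770   # rwxrws---: owner+group rwx, setgid, nothing for others

def mode_string(mode):
    """Render permission bits the way `ls -l` shows them for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)

def fix_repo_base(path):
    """Give the repositories directory the mode the GitLab check expects."""
    os.chmod(path, REPO_BASE_MODE)
```

`mode_string(REPO_BASE_MODE)` gives the string that the check above
compares against.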

/cc @kazuhiko

c8ac2f3a

gitlab-backup: Split each table to parts <= 16M in size · d31febed

Kirill Smelkov authored Feb 09, 2016

As was outlined 2 patches before (gitlab-backup: Dump DB ourselves),
currently DB dump is not git friendly, because the dump of each table is
just one (potentially large) file which grows over time. In Gitlab there
is one big table which dominates ~95% of whole dump size.

So to avoid overloading git with very large blobs, which it handles
inefficiently, let's split each table to parts <= 16M in size.

The fact that table data is sorted (see previous patch) helps the
splitting result to be more-or-less stable. We split not purely by byte
size, but by lines, and the max size of 16M is only approximate, so if a
row is changed in a part, the table will be split the same way on the
next backup run.
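
A minimal Python sketch of line-based splitting under these assumptions
(the generator name `split_by_lines` is hypothetical, and the real tool
may cut parts differently):

```python
def split_by_lines(lines, max_size=16 * 1024 * 1024):
    """Group whole lines (bytes) into parts of at most max_size bytes each.

    A single line larger than max_size still becomes a part of its own,
    which is why the limit is only approximate.
    """
    part, size = [], 0
    for line in lines:
        if part and size + len(line) > max_size:
            yield part                 # current part is full, start a new one
            part, size = [], 0
        part.append(line)
        size += len(line)
    if part:
        yield part
```

Since rows are never cut in the middle, an edit inside one row leaves the
part boundaries in place as long as the part still fits under the limit.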

This does not work so well when row entries are themselves large (e.g.
for big patches which change a lot of files with a big diff). For such
cases splitting can be improved by splitting at edges found similarly to
e.g. bup[1] - by finding nodes of a rolling checksum, but for now we are
staying with the simpler way of doing the split.

This reduces the load on git packing (for e.g. repack or when doing fetch
and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko

d31febed

gitlab-backup: Sort each DB table data · 5534e682

Kirill Smelkov authored Feb 09, 2016

As was outlined in previous patch, DB dump is currently not git/rsync
friendly because order of rows in PostgreSQL dump constantly changes:

pg_dump dumps table data with `COPY ... TO stdout` which does not guarantee any ordering -
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
  http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
- in fact it dumps data as stored raw in DB pages, and every record update changes row order.

On the other hand, Rails by default adds integer `id` first column to
every table as convention -
  http://edgeguides.rubyonrails.org/active_record_basics.html
and GitLab does not override this. So we can sort tables on id and this
way make data order stable.

And even if there is no id column we can sort - as COPY does not
guarantee ordering, we can change the order of rows in _whatever_ way and
the dump will still be correct.

This change helps git a lot to find good object deltas in less time, and
it should also help rsync to find less delta between backup dumps.

NOTE no changes are needed on restore side at all - the dump stays valid
    - sorted or not, and restores to semantically the same DB, even if
    internal rows ordering is different.
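
A minimal Python sketch of such sorting, assuming the rows are in the
COPY text format (one row per line, columns separated by tabs). The
function names are hypothetical and the real tool may sort in another
way, e.g. with an external sort program:

```python
def row_key(line):
    """Sort key for one COPY text-format row (columns are tab-separated)."""
    first = line.split(b"\t", 1)[0]
    try:
        return (0, int(first), line)      # integer first column: sort by value
    except ValueError:
        return (1, 0, line)               # otherwise: sort by the raw bytes

def sort_table_rows(lines):
    """Return the rows of one table in a stable, reproducible order."""
    return sorted(lines, key=row_key)
```

Any deterministic order would do for correctness; sorting by id merely
keeps unrelated rows from moving around between backup runs.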

/cc @kazuhiko

5534e682

gitlab-backup: Dump DB ourselves · 6fa6df4b

Kirill Smelkov authored Feb 08, 2016

The reason to do this is that we want to have more control over DB dump
process. Current problems which lead to this decision are:

1. DB dump is one large file whose size grows over time. This is not
friendly to git;

2. DB dump is currently not git/rsync friendly - when PostgreSQL
does a dump, it just copies internal pages for data to output.
And internal ordering changes every time a row is updated.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

Both 1 and 2 currently bring our backup tool to its knees. We'll be
handling those issues in the following patches.

For now we perform the dump manually and switch from dumping in
plain-text SQL to dumping in PostgreSQL native "directory" format, where
there is small table of contents with schema (toc.dat) and output of
`COPY <table> TO stdout` for each table in separate file.

http://www.postgresql.org/docs/9.5/static/app-pgdump.html

On restore we regenerate plain-text SQL with pg_restore and give this
plain-text SQL back to gitlab, so it thinks it restores it the usual way.
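
As a rough sketch, the two commands could be built like this in Python.
The function names are hypothetical, connection options are omitted, and
`--compress=0` is an assumption (it keeps the per-table files
uncompressed); the exact options gitlab-backup passes are not shown in
this log:

```python
def pg_dump_argv(dbname, outdir):
    """argv for a dump in "directory" format (toc.dat + one file per table)."""
    return ["pg_dump", "--format=directory", "--compress=0",
            "--file=" + outdir, dbname]

def pg_restore_to_sql_argv(dumpdir, sqlfile):
    """argv that turns such a dump back into a plain-text SQL script."""
    return ["pg_restore", "--file=" + sqlfile, dumpdir]
```

Without a target database, pg_restore writes the SQL script to the given
file instead of loading it, which is what allows handing the result back
to gitlab.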

NOTE: backward compatibility is preserved - restore part, if it sees
backup made by older version of gitlab-backup, which dumps
database.sql in plain text - restores it correctly.

NOTE2: now gitlab-backup supports only PostgreSQL (e.g. not MySQL).
Adding support for other databases is possible, but requires custom
handler for every DB (or just a fallback to usual plaintext maybe).

NOTE3: even as we split DB into separate tables, this does not currently
help problem #1, as in GitLab it is mostly just one table which
occupies the whole space.

/cc @kazuhiko

6fa6df4b

08 Feb, 2016 5 commits

gitlab-backup: Make $tmpd absolute · 5cdfd51e

Kirill Smelkov authored Feb 08, 2016

Until now having a relative $tmpd worked ok, but in the next patch we are
going to pass this directory to a command which, when run, automatically
changes its working directory as a first step, so passing $tmpd as a
relative pathname won't work for it.

So switch $tmpd to be an absolute path.

/cc @kazuhiko

5cdfd51e

gitlab-backup: Refactor need_gitlab_config() a bit · 8099a8bf

Kirill Smelkov authored Feb 08, 2016

In the following patches we will be adding more and more settings to
read from gitlab config, so the code which does this is now structured
to be better prepared for that:

    - part that emits the settings (in Ruby) is now multiline
    - we prepare shortcuts c & s which are Gitlab.config and
      Gitlab.config.gitlab_shell
    - in the end there is "END" emitted, and the reader checks this to
      make sure generate and read parts stay in sync.
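
The "END" check can be illustrated with a small Python sketch of the
reading side. The real reader is shell code in need_gitlab_config(); the
function name `read_settings` and the setting names and paths below are
made up for the example:

```python
import io

def read_settings(stream, names):
    """Read one line per expected setting, then require the "END" marker."""
    values = {}
    for name in names:
        values[name] = stream.readline().rstrip("\n")
    if stream.readline().rstrip("\n") != "END":
        raise RuntimeError("settings emitter and reader are out of sync")
    return values

# Example with made-up emitter output:
emitted = io.StringIO("/srv/repositories\n/srv/gitlab-shell\nEND\n")
settings = read_settings(emitted, ["repos_path", "shell_path"])
```

If the emitter prints more or fewer lines than the reader expects, the
marker is not found where it should be and the mismatch is reported
instead of silently assigning values to the wrong settings.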

/cc @kazuhiko

8099a8bf

gitlab-backup: There is no need to use ';' in inner block of need_gitlab_config() · bb03a6d7

Kirill Smelkov authored Feb 08, 2016

It works ok without it:

    ---- 8< ---- z.sh
    #!/bin/bash -e

    {
        read A
        read B
    } < <(echo -e 'AAA\nBBB')

    echo $A
    echo $B
    ---- 8< ----

    $ ./z.sh
    AAA
    BBB
    $ echo $?
    0

/cc @kazuhiko

bb03a6d7

gitlab-backup: Use find in a way, that does not hide errors · 64a16570

Kirill Smelkov authored Feb 08, 2016

In 495bd2fa (gitlab-backup: Unpack *.tar.gz before storing them in git)
we used find(1) to find *.tar.gz and unpack/repack them on
backup/restore. However `find -exec ...` does not stop on errors and
does not report them. Compare:

    ---- 8< ---- x.sh
    #!/bin/bash -e

    echo AAA
    find . -exec false ';'
    echo BBB
    ---- 8< ----

    ---- 8< ---- y.sh
    #!/bin/bash -e

    echo XXX
    find . | \
    while read F; do
        false
    done
    echo YYY
    ---- 8< ----

    $ ./x.sh
    AAA
    BBB
    $ echo $?
    0

    $ ./y.sh
    XXX
    $ echo $?
    1

So we switch to the second style, where find passes entries to the
processing program via a pipe. This new style is also cleaner, in my
view, because the listing and processing parts are now better structured.

/cc @kazuhiko

64a16570

*: Update copyright years · 70776a8f

Kirill Smelkov authored Feb 08, 2016

70776a8f

30 Dec, 2015 1 commit

gitlab-backup: Unpack *.tar.gz before storing them in git · 495bd2fa

Kirill Smelkov authored Dec 30, 2015

Starting from 8.2 GitLab backs up uploads and other directories not just
as a set of files, but as one tarball:

    https://gitlab.com/gitlab-org/gitlab-ce/commit/d3734fbd

and this does not play well with git - now the objects are stored as one
big compressed whole, so git cannot find good deltas.

So to help git properly deltify and find duplicates, let's unpack/repack
the archives, the same way we already do for database.sql.gz.

495bd2fa

14 Oct, 2015 1 commit

fsck incoming objects on pull · 7c0e3ff2

Kirill Smelkov authored Oct 14, 2015

Since objects are shared between backed up repositories, it is important
to make sure we do not pull a broken object even once, as that would set
up future corruption of that object, after restore, in all repositories
which use it.

Object corruption could happen for two reasons:

    - plain storage corruption, or
    - someone intentionally pushing corrupted object with known sha1 to
      any repository.

The second case is even more dangerous, as it potentially allows an
attacker to change data in repositories not available to them.

Now objects are checked on pull, and if corrupt, git-backup complains,
e.g. this way:

    RuntimeError: git -c fetch.fsckObjects=true fetch --no-tags ../D/corrupt.git refs/*:refs/backup/20151014-1914/aaa/corrupt.git/*
    error: inflate: data stream error (incorrect data check)
    fatal: loose object 52baccfe8479b61c2a0d5447bc0a6bf7c6827c60 (stored in ./objects/52/baccfe8479b61c2a0d5447bc0a6bf7c6827c60) is corrupt
    fatal: The remote end hung up unexpectedly
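
    A minimal Python sketch of such a checked fetch, built around the
    command shown in the error above. The function names `fetch_argv` and
    `checked_fetch` are hypothetical and git-backup's own wrapper may
    differ:

```python
import subprocess

def fetch_argv(repo, refprefix):
    """git fetch of all refs into refprefix/*, with fsck of incoming objects."""
    return ["git", "-c", "fetch.fsckObjects=true", "fetch", "--no-tags",
            repo, "refs/*:%s/*" % refprefix]

def checked_fetch(repo, refprefix, cwd="."):
    """Run the fetch in cwd; raise RuntimeError with git's output on failure."""
    argv = fetch_argv(repo, refprefix)
    proc = subprocess.run(argv, cwd=cwd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    if proc.returncode != 0:
        raise RuntimeError(" ".join(argv) + "\n" + proc.stdout)
    return proc.stdout
```

    The important part is `-c fetch.fsckObjects=true`, which makes git
    verify every received object before accepting it.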

7c0e3ff2

24 Sep, 2015 1 commit

readme: Turn what-should-be-reference into hyperlinks · 19b35be9

Kirill Smelkov authored Sep 24, 2015

19b35be9

22 Sep, 2015 1 commit

readme: .txt -> .rst · a695bdbe

Kirill Smelkov authored Sep 22, 2015

Current hostings don't recognize .txt as being reStructuredText, so
let's be explicit, so readme gets automatically rendered.

a695bdbe

08 Sep, 2015 2 commits

Fix typo · 73815b9f

Kirill Smelkov authored Sep 08, 2015

73815b9f

Don't forget to save symlinks pointing to directories · 380b65f1

Kirill Smelkov authored Sep 08, 2015

os.walk() yields symlinks to directories in dirnames and does not follow
them. Our backup cycle expects all files that need to go to blob to be
in filenames and that dirnames are only recursed-into by walk().

Thus, until now, a symlink to a directory was simply ignored and not
backed up. In particular *.git/hooks are usually symlinks to a common
place.

The fix is to adjust our xwalk() to always represent blob-ish things in
filenames, and leave dirnames only for real directories.
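
A minimal sketch of such a wrapper (the name `xwalk` comes from the text
above, but this body is only an illustration and not necessarily the
actual implementation):

```python
import os

def xwalk(top):
    """Like os.walk(), but symlinks to directories are reported in filenames."""
    for dirpath, dirnames, filenames in os.walk(top):
        real_dirs = []
        for name in dirnames:
            if os.path.islink(os.path.join(dirpath, name)):
                filenames.append(name)    # blob-ish: store the link itself
            else:
                real_dirs.append(name)
        dirnames[:] = real_dirs           # only real directories are recursed into
        yield dirpath, dirnames, filenames
```

With this, a *.git/hooks symlink shows up among the files of its
repository and gets saved like any other blob.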

/cc @kazuhiko

380b65f1

31 Aug, 2015 3 commits

gitlab-backup: Initial draft · 32e1f7af

Kirill Smelkov authored Aug 31, 2015

This is a convenience program to pull/restore backup data for a GitLab
instance into/from a git-backup managed repository.

Backup layout is:

    gitlab/misc   - db + uploads + ...
    gitlab/repo   - git repositories

On restoration we extract repositories into
.../git-data/repositories.<timestamp> and db backup into standard gitlab
backup tar and advise the user how to proceed with exact finishing commands.

This will hopefully be improved and changed to finish automatically,
after some testing.

32e1f7af

git-backup: Initial draft · 6f237f22

Kirill Smelkov authored Aug 31, 2015

This program backs up files and a set of bare Git repositories into one Git
repository. Files are copied to blobs and then added to tree under certain
place, and for Git repositories, all reachable objects are pulled in while
maintaining an index which remembers reference -> sha1 for every pulled repository.

After objects from backuped Git repositories are pulled in, we create new
commit which references tree with changed backup index and files, and also has
all head objects from pulled-in repositories in its parents(*). This way backup
has history and all pulled objects become reachable from single head commit in
backup repository. In particular that means that the whole state of backup can
be described with only single sha1, and that backup repository itself could be
synchronized via standard git pull/push, be repacked, etc.
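
As an illustration only, such a commit could be created with the
`git commit-tree` plumbing command. The Python helper below just builds
the argument list; its name and the way parents are collected are
assumptions, not a description of the actual git-backup code:

```python
def commit_tree_argv(tree, parents, message):
    """argv for `git commit-tree` making one commit with many parents."""
    argv = ["git", "commit-tree"]
    for sha1 in parents:          # e.g. previous backup commit + pulled heads
        argv += ["-p", sha1]
    argv += ["-m", message, tree]
    return argv
```

The command prints the sha1 of the new commit, which is the single value
that then describes the whole backup state.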

Restoration process is the opposite - from a particular backup state, files are
extracted at a proper place, and for Git repositories a pack with all objects
reachable from that repository heads is prepared and extracted from backup
repository object database.

This approach lets us leverage Git's good ability for object contents
deduplication and packing, especially for cases when there are many hosted
repositories which are forks of each other with relatively minor changes in
between each other and over time, and mostly common base. In the author's
experience the size of backup is dramatically smaller compared to the
straightforward "let's tar it all" approach.

Data for all backuped files and repositories can be accessed if one has access
to backup repository, so either they all should be in the same security domain,
or extra care has to be taken to protect access to backup repository.

File permissions are not managed with strict details due to inherent
nature of Git. This aspect can be improved with e.g. etckeeper-like
(http://etckeeper.branchable.com/) approach if needed.

Please see README.txt for a user-level overview of how to use git-backup.

NOTE the idea of pulling all refs together is similar to git-namespaces
     http://git-scm.com/docs/gitnamespaces

(*) Tag objects are handled specially - because in a lot of places Git insists and
    assumes commit parents can only be commit objects. We encode tag objects in
    specially-crafted commit object on pull, and decode back on backup restore.

    We do likewise if a ref points to tree or blob, which is valid in Git.

6f237f22

Start of git-backup.git · bbee44ce

Kirill Smelkov authored Aug 31, 2015

A project to implement efficient backing up of repositories on git
hosting.

bbee44ce