Commits · 5534e6829e4e2a063441ee9b50644247a6a7290c · Jérome Perrin / git-backup

09 Feb, 2016 2 commits

gitlab-backup: Sort each DB table data · 5534e682

Kirill Smelkov authored 9 years ago

As was outlined in the previous patch, DB dump is currently not git/rsync
friendly because the order of rows in a PostgreSQL dump constantly changes:

pg_dump dumps table data with `COPY ... TO stdout` which does not guarantee any ordering -
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
  http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
- in fact it dumps data as stored raw in DB pages, and every record update changes row order.

On the other hand, Rails by convention adds an integer `id` column as
the first column of every table -
  http://edgeguides.rubyonrails.org/active_record_basics.html
and GitLab does not override this. So we can sort tables on id and this
way make data order stable.

And even if there is no id column we can sort - as COPY does not
guarantee ordering, we can change the order of rows in _whatever_ way and
the dump will still be correct.

This change helps git a lot to find good object deltas in less time, and
it should also help rsync to find smaller deltas between backup dumps.

NOTE no changes are needed on restore side at all - the dump stays valid
    - sorted or not, and restores to semantically the same DB, even if
    internal rows ordering is different.
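
A minimal sketch of the sorting idea, assuming table data is in the
tab-separated text format that `COPY ... TO stdout` emits, that the first
column is the integer `id`, and that the input holds only data rows (no
trailing `\.` terminator line). The helper name `sort_copy_rows` is
hypothetical; this is not necessarily how gitlab-backup does it:

```shell
#!/bin/bash -e
# Hypothetical helper: sort COPY text-format rows read from stdin
# numerically by the first tab-separated column (assumed to be `id`).
# LC_ALL=C makes the result independent of the caller's locale.
sort_copy_rows() {
    LC_ALL=C sort -t $'\t' -k1,1n
}

# Rows in the order they happened to be stored in DB pages.
printf '10\tcarol\n2\tbob\n1\talice\n' | sort_copy_rows
```

When the first column is not numeric all keys compare equal, and sort falls
back to a byte-wise comparison of whole lines, so the output order is still
deterministic.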

/cc @kazuhiko

gitlab-backup: Dump DB ourselves · 6fa6df4b

Kirill Smelkov authored 9 years ago

The reason to do this is that we want to have more control over DB dump
process. Current problems which lead to this decision are:

1. DB dump is one large file whose size grows over time. This is not
friendly to git;

2. DB dump is currently not git/rsync friendly - when PostgreSQL
does a dump, it just copies internal pages for data to output.
And internal ordering changes every time a row is updated.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

Both 1 and 2 currently bring our backup tool to its knees. We'll be
handling those issues in the following patches.

For now we perform the dump manually and switch from dumping in
plain-text SQL to dumping in PostgreSQL native "directory" format, where
there is a small table of contents with schema (toc.dat) and the output of
`COPY <table> TO stdout` for each table in a separate file.

http://www.postgresql.org/docs/9.5/static/app-pgdump.html

On restore we regenerate plain-text SQL with pg_restore and give this
plain-text SQL back to gitlab, so it thinks it restores it the usual way.
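
The dump/restore round trip described above could look roughly like this.
It is a sketch under assumptions, not the actual gitlab-backup code: the
function names, the database name and the paths are placeholders, and
turning compression off with -Z0 is my own choice here so that the
per-table files stay plain text.

```shell
#!/bin/bash -e
# Hypothetical wrappers around pg_dump / pg_restore.

# Dump database $1 into directory $2 in pg_dump "directory" format.
# The target directory must not exist yet; pg_dump creates it.
dump_db() {
    pg_dump -Fd -Z0 -f "$2" "$1"
}

# Turn a directory-format dump $1 back into a plain-text SQL script $2.
# Without -d, pg_restore writes SQL instead of connecting to a database.
restore_sql() {
    pg_restore -f "$2" "$1"
}

# Usage (needs a reachable PostgreSQL server; names are examples only):
#   dump_db gitlabhq_production /tmp/db.dump
#   restore_sql /tmp/db.dump /tmp/database.sql
```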

NOTE: backward compatibility is preserved - restore part, if it sees
backup made by older version of gitlab-backup, which dumps
database.sql in plain text - restores it correctly.

NOTE2: now gitlab-backup supports only PostgreSQL (i.e. not MySQL).
Adding support for other databases is possible, but requires custom
handler for every DB (or just a fallback to usual plaintext maybe).

NOTE3: even as we split DB into separate tables, this does not currently
help problem #1, as in GitLab it is mostly just one table which
occupies the whole space.

/cc @kazuhiko

08 Feb, 2016 5 commits

gitlab-backup: Make $tmpd absolute · 5cdfd51e

Kirill Smelkov authored 9 years ago

For now having $tmpd as a relative path worked ok, but in the next patch we
are going to pass this directory to a command which, when run, changes its
working directory as a first step, so passing $tmpd as a relative
pathname won't work for it.

So switch $tmpd to be an absolute path.
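
One common way to do such a conversion in bash, shown only as an
illustration (the helper name `abs_dir` is made up and the actual patch may
do it differently):

```shell
#!/bin/bash -e
# Hypothetical helper: print the absolute path of an existing directory.
# The subshell keeps the caller's working directory unchanged.
abs_dir() {
    (cd "$1" && pwd)
}

# A relative "." becomes the absolute current directory.
abs_dir .
```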

/cc @kazuhiko

gitlab-backup: Refactor need_gitlab_config() a bit · 8099a8bf

Kirill Smelkov authored 9 years ago

In the following patches we will be adding more and more settings to
read from gitlab config, so the code which does this is now structured
to be better prepared for that:

    - part that emits the settings (in Ruby) is now multiline
    - we prepare shortcuts c & s which are Gitlab.config and
      Gitlab.config.gitlab_shell
    - in the end there is "END" emitted, and the reader checks this to
      make sure generate and read parts stay in sync.
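
The emit/read handshake with the "END" marker can be sketched as follows.
This is an illustration only: `emit_settings` stands in for the Ruby part
with made-up values, and the variable names are hypothetical.

```shell
#!/bin/bash -e
# Stand-in for the Ruby emitter: one setting per line, then "END".
emit_settings() {
    printf '%s\n' '/var/opt/repos' '/var/opt/backups' 'END'
}

# Read the settings in the same order they are emitted and verify the
# trailing marker, so a mismatch between the two sides is detected.
read_settings() {
    local end
    {
        read -r repos_path
        read -r backup_path
        read -r end
    } < <(emit_settings)
    test "$end" = "END" || {
        echo "E: settings emitter and reader are out of sync" >&2
        return 1
    }
}

read_settings
echo "repos=$repos_path backup=$backup_path"
```

If a setting is added on the emitting side only, the reader gets that new
value where it expects "END" and fails instead of silently misreading.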

/cc @kazuhiko

gitlab-backup: There is no need to use ';' in inner block of need_gitlab_config() · bb03a6d7

Kirill Smelkov authored 9 years ago

It works ok without it:

    ---- 8< ---- z.sh
    #!/bin/bash -e

    {
        read A
        read B
    } < <(echo -e 'AAA\nBBB')

    echo $A
    echo $B
    ---- 8< ----

    $ ./z.sh
    AAA
    BBB
    $ echo $?
    0

/cc @kazuhiko

gitlab-backup: Use find in a way, that does not hide errors · 64a16570

Kirill Smelkov authored 9 years ago

In 495bd2fa (gitlab-backup: Unpack *.tar.gz before storing them in git)
we used find(1) to find *.tar.gz and unpack/repack them on
backup/restore. However `find -exec ...` does not stop on errors and
does not report them. Compare:

    ---- 8< ---- x.sh
    #!/bin/bash -e

    echo AAA
    find . -exec false ';'
    echo BBB
    ---- 8< ----

    ---- 8< ---- y.sh
    #!/bin/bash -e

    echo XXX
    find . | \
    while read F; do
        false
    done
    echo YYY
    ---- 8< ----

    $ ./x.sh
    AAA
    BBB
    $ echo $?
    0

    $ ./y.sh
    XXX
    $ echo $?
    1

So we switch to the second style, where find passes entries to the
processing program via a pipe. This new style is also cleaner, in my view,
because the listing and processing parts are now better structured.
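
A possible hardening of the second style, sketched under assumptions and not
taken from the patch itself: `handle_archive` is a hypothetical placeholder
for the real per-file step, NUL-separated names protect against unusual
file names, and `pipefail` makes a failure of find itself visible too.

```shell
#!/bin/bash -e
set -o pipefail     # also report a failure of find, not only of the loop

# Hypothetical per-file step; here it merely rejects empty files.
handle_archive() {
    test -s "$1"
}

# List *.tar.gz under directory $1 and process them one by one.
process_archives() {
    find "$1" -type f -name '*.tar.gz' -print0 |
    while IFS= read -r -d '' f; do
        # The explicit exit leaves the loop's subshell with an error even
        # when the caller runs this function in a context that ignores -e.
        handle_archive "$f" || exit 1
    done
}

# Usage:
#   process_archives /path/to/backup
```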

/cc @kazuhiko

*: Update copyright years · 70776a8f

Kirill Smelkov authored 9 years ago

30 Dec, 2015 1 commit

gitlab-backup: Unpack *.tar.gz before storing them in git · 495bd2fa

Kirill Smelkov authored 9 years ago

Starting from 8.2 GitLab backs up uploads and other directories not just
as a set of files, but as one tarball:

    https://gitlab.com/gitlab-org/gitlab-ce/commit/d3734fbd

and this does not play well with git - now objects are stored as
one big compressed whole, so git cannot find good deltas.

So to help git properly deltify and find duplicates, let's unpack/repack
the archives, the same way we already do for database.sql.gz
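
A rough sketch of such an unpack/repack pair, assuming GNU or BSD tar with
gzip support. The function names and the `<archive>.d` directory naming are
made up for illustration and need not match what gitlab-backup uses:

```shell
#!/bin/bash -e
# Hypothetical helper: replace <archive> with an unpacked directory
# named <archive>.d, so git sees individual files.
unpack_archive() {
    local archive=$1 dir="$1.d"
    mkdir "$dir" &&
    tar -xzf "$archive" -C "$dir" &&
    rm "$archive"
}

# Hypothetical helper: the reverse step, used on restore.
repack_archive() {
    local dir=$1 archive="${1%.d}"
    tar -czf "$archive" -C "$dir" . &&
    rm -rf "$dir"
}

# Usage:
#   unpack_archive backup/uploads.tar.gz     # before committing to git
#   repack_archive backup/uploads.tar.gz.d   # after checking out for restore
```

The repacked tarball has the same file contents but is generally not
byte-identical to the original archive.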

31 Aug, 2015 1 commit

gitlab-backup: Initial draft · 32e1f7af

Kirill Smelkov authored 9 years ago

This is a convenience program to pull/restore backup data for a GitLab
instance into/from a git-backup managed repository.

Backup layout is:

    gitlab/misc   - db + uploads + ...
    gitlab/repo   - git repositories

On restoration we extract repositories into
.../git-data/repositories.<timestamp> and db backup into standard gitlab
backup tar and advise the user how to proceed with exact finishing commands.

This will hopefully be improved and changed to finish automatically,
after some testing.