1. 09 Feb, 2016 4 commits
    • gitlab-backup/restore: GitLab wants repositories to be drwxrws--- · c8ac2f3a
      Kirill Smelkov authored
      As git-backup does not currently preserve file permissions fully, we
      need to adjust them on restore. For repositories, after restore the
      following GitLab check currently fails:
      
          Repo base access is drwxrws---? ... no
      
      Fix it.
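A minimal sketch of such a fix (the path is a stand-in, not the actual gitlab-backup code):

```shell
# Hypothetical sketch: reset the restored repositories root to the
# drwxrws--- mode the GitLab check expects.
repo_root=repositories   # stand-in for .../git-data/repositories
mkdir -p "$repo_root"
chmod 2770 "$repo_root"  # rwx for user and group, setgid on the group, nothing for others
stat -c %A "$repo_root"  # → drwxrws---
```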
      
      /cc @kazuhiko
    • gitlab-backup: Split each table to parts <= 16M in size · d31febed
      Kirill Smelkov authored
      As was outlined two patches before (gitlab-backup: Dump DB ourselves),
      the DB dump is currently not git-friendly, because each table's dump is
      just one (potentially large) file which grows over time. In GitLab
      there is one big table which dominates ~95% of the whole dump size.
      
      So to avoid overloading git with large blobs, let's split each table
      into parts <= 16M in size; this way we do not store very large blobs
      in git, with which it is inefficient.
      
      The fact that table data is sorted (see previous patch) helps the
      splitting result to be more-or-less stable: as we split not purely by
      byte size but by lines, and the 16M maximum is only approximate, if a
      row is changed in one part, the table will still be split the same way
      on the next backup run.
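The line-oriented size cap can be sketched with GNU split's `-C` (line-bytes) mode; the file name here is made up:

```shell
# Hypothetical sketch: split one table's dump into parts of at most 16M,
# cutting only at line boundaries, so an unchanged run of leading rows
# keeps producing byte-identical parts from one backup to the next.
seq 1 100000 > table.dat                  # stand-in for a table's COPY output
split -C 16m -d table.dat table.dat.part  # -C: max bytes per part, whole lines only
cat table.dat.part* | cmp - table.dat     # the parts re-join to the original
```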
      
      This does not work so well when the rows themselves are large (e.g.
      for big patches which change a lot of files with a big diff). For such
      cases the splitting could be improved by finding split edges similarly
      to e.g. bup[1] - via nodes of a rolling checksum - but for now we stay
      with the simpler way of doing the split.
      
      This reduces the load on git packing (e.g. for repack, or when doing
      fetch and push) a lot.
      
      [1] https://github.com/bup/bup
      
      /cc @kazuhiko
    • gitlab-backup: Sort each DB table data · 5534e682
      Kirill Smelkov authored
      As was outlined in the previous patch, the DB dump is currently not
      git/rsync friendly because the order of rows in the PostgreSQL dump
      constantly changes:
      
      pg_dump dumps table data with `COPY ... TO stdout`, which does not guarantee any ordering -
        http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
        http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
      - in fact it dumps data as stored raw in DB pages, and every record update changes the row order.
      
      On the other hand, Rails by convention adds an integer `id` first
      column to every table -
        http://edgeguides.rubyonrails.org/active_record_basics.html
      and GitLab does not override this. So we can sort tables on id and
      this way make the data order stable.
      
      And even if there is no id column we can still sort - as COPY does not
      guarantee ordering, we can change the order of rows in _whatever_ way
      and the dump will still be correct.
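A sketch of the idea on made-up data (bash is assumed for the $'\t' quoting): COPY text output is tab-separated with the id in the first column, so a numeric sort on that column gives a stable order:

```shell
# Hypothetical sketch: stable-order a COPY dump by its leading id column.
printf '3\tcharlie\n1\talice\n2\tbob\n' > rows.dat
sort -t $'\t' -k1,1n rows.dat   # rows come out ordered by id: 1, 2, 3
```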
      
      This change helps git a lot to find good object deltas in less time,
      and it should also help rsync to find a smaller delta between backup
      dumps.
      
      NOTE no changes are needed on the restore side at all - the dump stays
          valid, sorted or not, and restores to a semantically identical DB,
          even if the internal row ordering is different.
      
      /cc @kazuhiko
    • gitlab-backup: Dump DB ourselves · 6fa6df4b
      Kirill Smelkov authored
      The reason to do this is that we want to have more control over the DB
      dump process. Current problems which lead to this decision are:
      
          1. The DB dump is one large file whose size grows over time. This
             is not friendly to git;
      
          2. The DB dump is currently not git/rsync friendly - when
             PostgreSQL does a dump, it just copies internal data pages to
             the output, and the internal ordering changes every time a row
             is updated.
      
              http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
              http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
      
      Both 1 and 2 currently bring our backup tool to its knees. We'll be
      handling those issues in the following patches.
      
      For now we perform the dump manually and switch from dumping
      plain-text SQL to PostgreSQL's native "directory" format, where there
      is a small table of contents with the schema (toc.dat) and the output
      of `COPY <table> TO stdout` for each table in a separate file.
      
          http://www.postgresql.org/docs/9.5/static/app-pgdump.html
      
      On restore we reconstruct plain-text SQL with pg_restore and give this
      plain-text SQL back to gitlab, so it thinks it restores it the usual
      way.
      
      NOTE: backward compatibility is preserved - if the restore part sees a
          backup made by an older version of gitlab-backup, which dumps
          database.sql in plain text, it restores it correctly.
      
      NOTE2: gitlab-backup now supports only PostgreSQL (i.e. not MySQL).
          Adding support for other databases is possible, but requires a
          custom handler for every DB (or maybe just a fallback to the usual
          plain text).
      
      NOTE3: even though we split the DB into separate tables, this does not
          currently help problem #1, as in GitLab it is mostly just one
          table which occupies the whole space.
      
      /cc @kazuhiko
  2. 08 Feb, 2016 5 commits
    • gitlab-backup: Make $tmpd absolute · 5cdfd51e
      Kirill Smelkov authored
      For now having $tmpd relative worked OK, but in the next patch we are
      going to pass this directory to a command which, when run, changes its
      working directory as a first step, so passing $tmpd as a relative
      pathname won't work for it.
      
      So switch $tmpd to be an absolute path.
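A minimal sketch of the conversion (the directory name is made up):

```shell
# Hypothetical sketch: resolve a possibly-relative $tmpd to an absolute
# path, so a command that chdirs as its first step can still find it.
tmpd=tmp.work
mkdir -p "$tmpd"
tmpd="$(cd "$tmpd" && pwd)"  # now absolute, e.g. /home/user/tmp.work
echo "$tmpd"
```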
      
      /cc @kazuhiko
    • gitlab-backup: Refactor need_gitlab_config() a bit · 8099a8bf
      Kirill Smelkov authored
      In the following patches we will be adding more and more settings to
      read from the GitLab config, so the structure of the code which does
      this is better prepared:
      
          - the part that emits the settings (in Ruby) is now multiline;
          - we prepare shortcuts c & s, which are Gitlab.config and
            Gitlab.config.gitlab_shell;
          - at the end an "END" marker is emitted, and the reader checks it
            to make sure the generate and read parts stay in sync.
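The handshake can be sketched roughly like this (variable names and paths are made up, not the actual code; bash is assumed):

```shell
# Hypothetical sketch: the emitting side prints one setting per line and
# then an END marker; the reader verifies the marker, so any drift between
# the emitted settings and the reads is caught immediately.
{
    read repos_path
    read secret_file
    read END
    test "$END" = END   # fails if generate and read parts got out of sync
} < <(printf '%s\n' /var/opt/gitlab/git-data/repositories /etc/gitlab/secret END)
echo "$repos_path"
```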
      
      /cc @kazuhiko
    • gitlab-backup: There is no need to use ';' in inner block of need_gitlab_config() · bb03a6d7
      Kirill Smelkov authored
      It works ok without it:
      
          ---- 8< ---- z.sh
          #!/bin/bash -e
      
          {
              read A
              read B
          } < <(echo -e 'AAA\nBBB')
      
          echo $A
          echo $B
          ---- 8< ----
      
          $ ./z.sh
          AAA
          BBB
          $ echo $?
          0
      
      /cc @kazuhiko
    • gitlab-backup: Use find in a way, that does not hide errors · 64a16570
      Kirill Smelkov authored
      In 495bd2fa (gitlab-backup: Unpack *.tar.gz before storing them in
      git) we used find(1) to find *.tar.gz files and unpack/repack them on
      backup/restore. However `find -exec ...` does not stop on errors and
      does not report them. Compare:
      
          ---- 8< ---- x.sh
          #!/bin/bash -e
      
          echo AAA
          find . -exec false ';'
          echo BBB
          ---- 8< ----
      
          ---- 8< ---- y.sh
          #!/bin/bash -e
      
          echo XXX
          find . | \
          while read F; do
              false
          done
          echo YYY
          ---- 8< ----
      
          $ ./x.sh
          AAA
          BBB
          $ echo $?
          0
      
          $ ./y.sh
          XXX
          $ echo $?
          1
      
      So we switch to the second style, where find passes entries to the
      processing program via a pipe. This new style is also cleaner, in my
      view, because the listing and processing parts are better structured.
      
      /cc @kazuhiko
    • *: Update copyright years · 70776a8f
      Kirill Smelkov authored
  3. 30 Dec, 2015 1 commit
  4. 14 Oct, 2015 1 commit
    • fsck incoming objects on pull · 7c0e3ff2
      Kirill Smelkov authored
      Since objects are shared between backed-up repositories, it is
      important to make sure we do not pull in a broken object even once,
      thus programming future corruption of that object, after restore, in
      all repositories which use it.
      
      Object corruption could happen for two reasons:
      
          - plain storage corruption, or
          - someone intentionally pushing a corrupted object with a known
            sha1 to any repository.
      
      The second case is even more dangerous, as it potentially allows an
      attacker to change data in repositories not available to him.
      
      Now objects are checked on pull, and if one is corrupt, git-backup
      complains, e.g. this way:
      
          RuntimeError: git -c fetch.fsckObjects=true fetch --no-tags ../D/corrupt.git refs/*:refs/backup/20151014-1914/aaa/corrupt.git/*
          error: inflate: data stream error (incorrect data check)
          fatal: loose object 52baccfe8479b61c2a0d5447bc0a6bf7c6827c60 (stored in ./objects/52/baccfe8479b61c2a0d5447bc0a6bf7c6827c60) is corrupt
          fatal: The remote end hung up unexpectedly
  5. 24 Sep, 2015 1 commit
  6. 22 Sep, 2015 1 commit
    • readme: .txt -> .rst · a695bdbe
      Kirill Smelkov authored
      Current hostings don't recognize .txt as being reStructuredText, so
      let's be explicit; this way the readme gets rendered automatically.
  7. 08 Sep, 2015 2 commits
    • Fix typo · 73815b9f
      Kirill Smelkov authored
    • Don't forget to save symlinks pointing to directories · 380b65f1
      Kirill Smelkov authored
      os.walk() yields symlinks to directories in dirnames and does not
      follow them. Our backup cycle expects all files that need to go to a
      blob to be in filenames, and that dirnames are only recursed into by
      walk().
      
      Thus, until now, a symlink to a directory was simply ignored and not
      backed up. In particular *.git/hooks are usually symlinks to a common
      place.
      
      The fix is to adjust our xwalk() to always represent blob-ish things in
      filenames, and leave dirnames only for real directories.
      
      /cc @kazuhiko
  8. 31 Aug, 2015 3 commits
    • gitlab-backup: Initial draft · 32e1f7af
      Kirill Smelkov authored
      This is a convenience program to pull/restore backup data for a GitLab
      instance into/from a git-backup managed repository.
      
      Backup layout is:
      
          gitlab/misc   - db + uploads + ...
          gitlab/repo   - git repositories
      
      On restoration we extract repositories into
      .../git-data/repositories.<timestamp> and the db backup into a
      standard gitlab backup tar, and advise the user how to proceed with
      the exact finishing commands.
      
      This will hopefully be improved and changed to finish automatically,
      after some testing.
    • git-backup: Initial draft · 6f237f22
      Kirill Smelkov authored
      This program backs up files and a set of bare Git repositories into
      one Git repository. Files are copied to blobs and then added to a tree
      under a certain place, and for Git repositories, all reachable objects
      are pulled in while maintaining an index which remembers
      reference -> sha1 for every pulled repository.
      
      After objects from backed-up Git repositories are pulled in, we create
      a new commit which references the tree with the changed backup index
      and files, and also has all head objects from the pulled-in
      repositories as its parents(*). This way the backup has history and
      all pulled objects become reachable from a single head commit in the
      backup repository. In particular that means that the whole state of
      the backup can be described with only a single sha1, and that the
      backup repository itself can be synchronized via standard git
      pull/push, be repacked, etc.
      
      The restoration process is the opposite - from a particular backup
      state, files are extracted at their proper place, and for Git
      repositories a pack with all objects reachable from that repository's
      heads is prepared and extracted from the backup repository's object
      database.
      
      This approach allows us to leverage Git's good object-content
      deduplication and packing, especially for cases when there are many
      hosted repositories which are forks of each other, with a mostly
      common base and relatively minor changes between each other and over
      time. In the author's experience the size of the backup is
      dramatically smaller compared to the straightforward "let's tar it
      all" approach.
      
      Data for all backed-up files and repositories can be accessed by
      anyone with access to the backup repository, so either they should all
      be in the same security domain, or extra care has to be taken to
      protect access to the backup repository.
      
      File permissions are not managed in strict detail due to the inherent
      nature of Git. This aspect can be improved with e.g. an etckeeper-like
      (http://etckeeper.branchable.com/) approach if needed.
      
      Please see README.txt with user-level overview on how to use git-backup.
      
      NOTE the idea of pulling all refs together is similar to git-namespaces
           http://git-scm.com/docs/gitnamespaces
      
      (*) Tag objects are handled specially, because in a lot of places Git
          insists and assumes commit parents can only be commit objects. We
          encode tag objects in specially-crafted commit objects on pull,
          and decode them back on backup restore.
      
          We do likewise if a ref points to a tree or blob, which is valid
          in Git.
    • Start of git-backup.git · bbee44ce
      Kirill Smelkov authored
      The project to implement backing up repositories on a git hosting
      service efficiently.