gitlab-backup: Split each table to parts <= 16M in size
As was outlined two patches earlier (gitlab-backup: Dump DB ourselves), the DB dump is currently not git friendly: each table dump is a single, potentially large, file that grows over time. In GitLab there is one big table which dominates ~95% of the whole dump size. So, to avoid overloading git with large blobs, with which it is inefficient, let's split each table into parts <= 16M in size.

The fact that table data is sorted (see previous patch) helps to keep the splitting more-or-less stable: since we split by lines, not purely by byte size, and the 16M maximum is only approximate, a part with a changed row will be split the same way on the next backup run.

This works less well when the rows themselves are large (e.g. for big patches which change a lot of files with big diffs). For such cases the splitting could be improved by finding split edges similarly to e.g. bup[1] - via nodes of a rolling checksum - but for now we stay with the simpler way of doing the split.

This reduces the load on git packing (e.g. for repack, or when doing fetch and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko
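For illustration, here is a minimal sketch (in Python, not the actual gitlab-backup code) of the kind of line-based splitting described above; the function name, part naming scheme and paths are hypothetical:

```python
# Sketch: cut a sorted table dump into numbered parts of at most ~16 MiB each,
# always splitting at line boundaries, so that a change to one row tends to
# disturb only the part that contains it.
import os

MAX_PART_SIZE = 16 * 1024 * 1024   # ~16 MiB; only an approximate upper bound

def split_table_dump(dump_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    part_no = 0
    part = None
    part_size = 0
    with open(dump_path, 'rb') as src:
        for line in src:
            # Start a new part if adding this line would exceed the limit.
            # A single line longer than the limit still goes into one part,
            # which is why the 16M limit is only approximate.
            if part is None or (part_size and part_size + len(line) > MAX_PART_SIZE):
                if part is not None:
                    part.close()
                part = open(os.path.join(out_dir, '%04d.part' % part_no), 'wb')
                part_no += 1
                part_size = 0
            part.write(line)
            part_size += len(line)
    if part is not None:
        part.close()

# e.g. split_table_dump('db/merge_request_diffs.dump', 'db/merge_request_diffs.parts/')
```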