    gitlab-backup: Split each table to parts <= 16M in size · d31febed
    Kirill Smelkov authored
    As was outlined two patches before (gitlab-backup: Dump DB ourselves),
    the DB dump is currently not git-friendly, because each table's dump is
    just one (potentially large) file which grows over time. In GitLab there
    is one big table which accounts for ~95% of the whole dump size.
    
    So, to avoid overloading git with large blobs, let's split each table
    into parts <= 16M in size; this way we do not store very large blobs in
    git, which handles them inefficiently.
    
    The fact that table data is sorted (see previous patch) helps keep the
    splitting result more-or-less stable: since we split by lines, not purely
    by byte size, and the 16M maximum is only approximate, if a row is
    changed within a part, the data will be split the same way on the next
    backup run.
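    
    For illustration only, here is a minimal Python sketch of such
    line-based splitting (the real gitlab-backup is a shell script and may
    do this differently; the names and the soft 16M limit below are just
    assumptions for the example):
    
        MAX_PART = 16 * 1024 * 1024  # approximate maximum part size
        
        def split_by_lines(lines, max_part=MAX_PART):
            # Accumulate whole lines into the current part; start a new part
            # once adding the next line would push it past the soft limit.
            # A single oversized line still becomes its own part, which is
            # why the 16M cap is only approximate.
            part, size = [], 0
            for line in lines:
                if part and size + len(line) > max_part:
                    yield part
                    part, size = [], 0
                part.append(line)
                size += len(line)
            if part:
                yield part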
    
    This does not work so well when the row entries are themselves large
    (e.g. for big patches which change a lot of files with big diffs). For
    such cases the splitting could be improved by finding split edges the way
    e.g. bup[1] does - via nodes of a rolling checksum - but for now we stay
    with the simpler way of doing the split.
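    
    As a hypothetical illustration of that bup-style idea (not part of this
    patch; the window size and mask below are made-up parameters, and bup's
    real rollsum is more elaborate), a boundary would be declared wherever a
    rolling checksum over the last few bytes matches a fixed bit pattern, so
    boundaries follow the content rather than absolute offsets:
    
        WINDOW = 64              # sliding-window size in bytes (assumed)
        MASK = (1 << 13) - 1     # boundary when low 13 bits of the sum are all ones
        
        def find_edges(data):
            # Return offsets after which a content-defined boundary occurs.
            edges, s = [], 0
            for i, b in enumerate(data):
                s += b
                if i >= WINDOW:
                    s -= data[i - WINDOW]   # keep the sum over the sliding window
                if (s & MASK) == MASK:
                    edges.append(i + 1)
            return edges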
    
    This greatly reduces the load on git packing (e.g. during repack, or when
    doing fetch and push).
    
    [1] https://github.com/bup/bup
    
    /cc @kazuhiko