gitlab-backup: Split each table to parts <= 16M in size
As was outlined two patches earlier (gitlab-backup: Dump DB ourselves), the DB dump is currently not git friendly: each table dump is a single, potentially large, file that grows over time. In GitLab there is one big table which dominates ~95% of the whole dump size. So, to avoid overloading git with large blobs, with which it is inefficient, let's split each table into parts <= 16M in size.

The fact that table data is sorted (see previous patch) helps to keep the splitting more-or-less stable: since we split by lines, not purely by byte size, and the 16M maximum is only approximate, a part with a changed row will be split the same way on the next backup run.

This works less well when the rows themselves are large (e.g. for big patches which change a lot of files with big diffs). For such cases the splitting could be improved by finding split edges similarly to e.g. bup[1] - via nodes of a rolling checksum - but for now we stay with the simpler way of doing the split.

This reduces the load on git packing (e.g. for repack, or when doing fetch and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko
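For illustration, here is a minimal sketch (in Python, not the actual gitlab-backup code) of the kind of line-based splitting described above; the function name, part naming scheme and paths are hypothetical:

```python
# Sketch: cut a sorted table dump into numbered parts of at most ~16 MiB each,
# always splitting at line boundaries, so that a change to one row tends to
# disturb only the part that contains it.
import os

MAX_PART_SIZE = 16 * 1024 * 1024   # ~16 MiB; only an approximate upper bound

def split_table_dump(dump_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    part_no = 0
    part = None
    part_size = 0
    with open(dump_path, 'rb') as src:
        for line in src:
            # Start a new part if adding this line would exceed the limit.
            # A single line longer than the limit still goes into one part,
            # which is why the 16M limit is only approximate.
            if part is None or (part_size and part_size + len(line) > MAX_PART_SIZE):
                if part is not None:
                    part.close()
                part = open(os.path.join(out_dir, '%04d.part' % part_no), 'wb')
                part_no += 1
                part_size = 0
            part.write(line)
            part_size += len(line)
    if part is not None:
        part.close()

# e.g. split_table_dump('db/merge_request_diffs.dump', 'db/merge_request_diffs.parts/')
```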