*: Fix memory corruptions caused by improper git2go usage
Alain reports that lab.nexedi.com backup restoration sometimes fails with error like ... # file gitlab/misc -> .../srv/backup/backup-gitlab.git/gitlab-backup.Pj0fpp/gitlab_backup/db/database.pgdump/7630.dat/7630.dat.ry main.cmd_restore: main.blob_to_file: write .../srv/backup/backup-gitlab.git/gitlab-backup.Pj0fpp/gitlab_backup/db/database.pgdump/7630.dat/7630.dat.ry: bad address which means that write system call invoked by writefile at tail of blob_to_file returned EFAULT. The blob_to_file function is organized approximately as this: blob_to_file(blob_sha1, path) { blob = ReadObject(blob_sha1, git.ObjectBlob) blob_content = blob.Data() writefile(path, blob_content) } and getting EFAULT inside writefile means that blob_content points to some unmapped memory. How that could be? The answer is that blob.Data(), as implemented by git2go, returns []byte that points to Cgo memory owned by blob object, and the blob object has finalizer that frees that memory, which sometimes leads to libc allocator to also return freed region completely back to OS by doing munmap: https://github.com/libgit2/git2go/blob/v31.7.9-0-gcbca5b8/odb.go#L345-L359 https://github.com/libgit2/git2go/blob/v31.7.9-0-gcbca5b8/odb.go#L162-L177 https://github.com/libgit2/git2go/blob/v31.7.9-0-gcbca5b8/odb.go#L322-L325 and if that happens we see the EFAULT, but if no munmap happens we can be saving corrupt data to restored file. The OdbObject.Data even has comment about that - that one needs to keep the object alive until retrieved data is used: // Data returns a slice pointing to the unmanaged object memory. You must make // sure the object is referenced for at least as long as the slice is used. func (object *OdbObject) Data() (data []byte) { but this comment was added in 2017 in https://github.com/libgit2/git2go/commit/55a109614151 as part of https://github.com/libgit2/git2go/pull/393 while doing "KeepAlive all the things" to fix segmentation faults and other misbehaviours. I missed all that because we switched blob_to_file from `git cat-file` to git2go in 2016 in fbd72c02 (Switch file_to_blob() and blob_to_file() to work without spawning Git subprocesses) and we never actively worked on that part of code anymore. For the reference the git2go introduction to git-backup happened on that same day in 2016 in 624393db (Hook in git2go (cgo bindings to libgit2)). The problem of memory corruption inside blob_to_file can be reliably reproduced via injecting the following patch blob_to_file(blob_sha1, path) { blob = ReadObject(blob_sha1, git.ObjectBlob) blob_content = blob.Data() + runtime.GC() writefile(path, blob_content) } which leads to e.g. the following test failure at every test run: === RUN TestPullRestore ... # file b1 -> /tmp/t-git-backup2575257088/1/symlink.file git-backup_test.go:109: git-backup_test.go:297: lab.nexedi.com/kirr/git-backup.cmd_restore: lab.nexedi.com/kirr/git-backup.blob_to_file: symlink ^D<80><8c>þ^@^@^2h space + α /tmp/t-git-backup2575257088/1/symlink.file: invalid argument and the memory corruption can be fixed reliably by adding proper runtime.KeepAlive so that the blob object assuredly stays alive during writefile call: blob_to_file(blob_sha1, path) { blob = ReadObject(blob_sha1, git.ObjectBlob) blob_content = blob.Data() writefile(path, blob_content) + runtime.KeepAlive(blob) } However going through git2go code it could be seen that it is full of Go <-> C interactions and given that there is a track records of catching many crashes due to not getting lifetime management right (see e.g. https://github.com/libgit2/git2go/issues/352, https://github.com/libgit2/git2go/issues/334, https://github.com/libgit2/git2go/issues/553, https://github.com/libgit2/git2go/issues/513, https://github.com/libgit2/git2go/issues/373, https://github.com/libgit2/git2go/pull/387 and once again https://github.com/libgit2/git2go/pull/393) there is no guarantee that no any other similar issue is there anywhere else besides OdbObject.Data(). With that we either need to put a lot of runtime.KeepAlive after every interaction with git2go, and put it properly, switch back to `git cat-file` and similar things reverting fbd72c02 and friends, or do something else. As fbd72c02 explains switching back to `git cat-file` will slowdown files restoration by an order of magnitude. Putting runtime.KeepAlive is also not practical because it is hard to see all the places where we interact with git2go, even indirectly, and so it is easy to make mistakes. -> Thus let's keep the code that interacts with git2go well localized (done by previous patch), and let's make a copy over every string or []byte object we receive from git2go with adding careful runtime.KeepAlive post after that. This fixes the problem of blob_to_file data corruption and it should fix all other potential memory corruption problems we might ever have with git2go due to potentially improper usage on git-backup side. The copy cost is smaller compared to the cost of either spawning e.g. `git cat-file` for every object, or interacting with `git cat-file --batch` server spawned once, but still spending context switches on every request and still making the copy on socket or pipe transfer. But most of all the copy cost is negligible to the cost of catching hard to reproduce crashes or data corruptions in the production environment. For the reference the time it takes to restore "files" part of lab.nexedi.com backup was ~ 1m51s before this patch, and became ~ 1m55s after this patch indicating ~ 3.5% slowdown for that part. Which could be said as noticeable but not big, and since most of the time is spent during git pack restoration, taking much more time than files, those several seconds of slowdown become completely negligible. /reported-by @alain.takoudjou, @tomo /reported-at https://www.erp5.com/group_section/forum/Gitlab-backup-zDVMZqaMAK/view?list_start=15&reset=1#2074747282 /cc @jerome, @rafael
Showing
internal/git/bytes_go1.19.go
0 → 100644
internal/git/bytes_go1.20.go
0 → 100644
Please register or sign in to comment