Atomically select replicas that meet LSN requirement
During a merge, we attempt to find a merge request matching a SHA using a replica that should be up-to-date with the primary for a given PostgreSQL log sequence number (LSN). However, a race condition can occur if service discovery alters the host list after this check has taken place.

This most likely happens when a Web worker starts:

1. When Rails starts up for the first time, there is a 1- to 2-minute delay before service discovery finds replicas (see https://gitlab.com/gitlab-org/gitlab/-/issues/271575).
2. During this time `LoadBalancer#all_caught_up?` will return `true`. This indicates to the Web worker that it can use replicas and does not have to use the primary.
3. During a request, service discovery may load all the replicas and change the host list. As a result, the next read may be directed to a lagging replica.

This may cause a merge to fail if it cannot find a match.

When a user merges a merge request, Sidekiq logs the minimum LSN needed to match a merge request for the API. If we have this LSN, we now:

1. Select, from the available list of replicas, those that meet this LSN requirement.
2. Store this subset for the given request.
3. Round-robin reads with this subset of replicas.

Relates to https://gitlab.com/gitlab-org/gitlab/-/issues/247857
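The select, store, and round-robin steps can be illustrated with a minimal, standalone sketch. It is not the code in this change: `Replica`, `RequestReplicaSet`, `lsn_to_i`, and `next_host` are hypothetical names, and the sketch assumes each replica's replay position is already known as a PostgreSQL LSN string such as `0/3000060`.

```ruby
# Hypothetical stand-in for a replica host; not GitLab's actual class.
Replica = Struct.new(:name, :replay_lsn)

# Convert a PostgreSQL LSN string ("high/low", both hexadecimal) into a
# single integer so that two positions can be compared.
def lsn_to_i(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) | low
end

# Hypothetical per-request holder for the replicas that met the LSN
# requirement at the moment the check ran.
class RequestReplicaSet
  attr_reader :hosts

  def initialize(all_replicas, min_lsn)
    required = lsn_to_i(min_lsn)
    # `select` returns a new array, so hosts added to `all_replicas`
    # later (for example by service discovery) never enter this subset.
    @hosts = all_replicas.select { |r| lsn_to_i(r.replay_lsn) >= required }.freeze
    @index = 0
  end

  def any?
    !@hosts.empty?
  end

  # Round-robin over the stored subset only. Returns nil when no replica
  # qualified, in which case a caller would fall back to the primary.
  def next_host
    return nil if @hosts.empty?

    host = @hosts[@index % @hosts.size]
    @index += 1
    host
  end
end
```

Because the subset is captured once and reused for every read in the request, a replica that service discovery adds mid-request cannot be picked even if it is lagging.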