Commit a2867b1b authored by Yorick Peterse

Handle outdated replicas in the DB load balancer

Instead of checking if a replica is online by just running "SELECT 1", we
now check its replication status. If a replica lags too far behind the
primary, both in time and in the amount of data, we stop using it until it
is back in sync. We also only check the status roughly every 30 seconds.
This reduces the overhead of the status checks, at the cost of the status
potentially lagging behind the real world by 30 seconds or so.

Checking the replicas happens as part of handling a request, without any
central coordination mechanism. This keeps the implementation simple and
keeps the processes independent of yet _another_ central service. To prevent
all processes from doing the same work at the same time we randomise the
checking intervals on a per-process basis. This won't prevent 2 processes
from checking at the same time, but it does reduce the likelihood of _all_
of them checking at the same time.
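
As an illustration, here is a minimal sketch of the per-process jitter,
mirroring the interval logic added to Host below (the interval and
last_checked_at values are made-up examples):

```ruby
# Each process builds a randomised set of check intervals between the
# configured replica_check_interval and twice that value.
check_interval = 30.0
intervals = (check_interval..(check_interval * 2)).step(0.5).to_a

last_checked_at = Time.now - 45 # e.g. the previous check ran 45 seconds ago

# A new status check is only due once a randomly sampled interval has
# elapsed, so processes drift apart instead of all checking at once.
check_due = (Time.now - last_checked_at) >= intervals.sample
```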

Fixes https://gitlab.com/gitlab-org/gitlab-ee/issues/2197, closes
https://gitlab.com/gitlab-org/gitlab-ee/issues/2866
parent 891acf9d
---
title: Handle outdated replicas in the DB load balancer
merge_request:
author:
type: added
......@@ -156,6 +156,40 @@ log entries easier. For example:
[DB-LB] Host 10.123.2.7 came back online
```

## Handling Stale Reads

> [Introduced][ee-3526] in [GitLab Enterprise Edition Premium][eep] 10.3.

To prevent reading from an outdated secondary the load balancer checks if the
secondary is in sync with the primary. If the data is determined to be recent
enough the secondary can be used, otherwise it is ignored. To reduce the
overhead of these checks they are only performed at certain intervals.

There are three configuration options that influence this behaviour:

| Option | Description | Default |
|------------------------------|----------------------------------------------------------------------------------------------------------------|------------|
| `max_replication_difference` | The amount of data (in bytes) a secondary is allowed to lag behind when it hasn't replicated data for a while. | 8 MB |
| `max_replication_lag_time` | The maximum number of seconds a secondary is allowed to lag behind before we stop using it. | 60 seconds |
| `replica_check_interval` | The minimum number of seconds we have to wait before checking the status of a secondary. | 60 seconds |

The defaults should be sufficient for most users. Should you want to change
them, you can specify them in `config/database.yml` like so:

```yaml
production:
username: gitlab
database: gitlab
encoding: unicode
load_balancing:
hosts:
- host1.example.com
- host2.example.com
max_replication_difference: 16777216 # 16 MB
max_replication_lag_time: 30
replica_check_interval: 30
```
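
For reference, the staleness check boils down to two PostgreSQL queries,
mirroring `Host#replication_lag_time` and `Host#replication_lag_size` in the
code below. This is an illustrative sketch only: `conn` stands in for a
PostgreSQL connection and `primary_location` for the primary's current write
location.

```ruby
max_replication_lag_time   = 60               # seconds, see the table above
max_replication_difference = 8 * 1024 * 1024  # 8 MB, see the table above

# Seconds since the replica last replayed a transaction.
lag_time = conn.select_all(
  'SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float AS lag'
).first['lag'].to_f

# Bytes of WAL the replica still has to replay to catch up with the primary.
lag_size = conn.select_all(
  "SELECT pg_xlog_location_diff(#{conn.quote(primary_location)}, " \
  "pg_last_xlog_replay_location())::float AS diff"
).first['diff'].to_i

# The replica is considered usable when either check passes.
usable = lag_time <= max_replication_lag_time ||
  lag_size <= max_replication_difference
```
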
[hot-standby]: https://www.postgresql.org/docs/9.6/static/hot-standby.html
[ee-1283]: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1283
[eep]: https://about.gitlab.com/gitlab-ee/
......@@ -163,3 +197,4 @@ log entries easier. For example:
[restart gitlab]: restart_gitlab.md#installations-from-source "How to restart GitLab"
[wikipedia]: https://en.wikipedia.org/wiki/Load_balancing_(computing)
[db-req]: ../install/requirements.md#database
[ee-3526]: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/3526
......@@ -24,15 +24,30 @@ module Gitlab
[].freeze
end
# Returns a Hash containing the load balancing configuration.
def self.configuration
ActiveRecord::Base.configurations[Rails.env]['load_balancing'] || {}
end
# Returns the maximum replica lag size in bytes.
def self.max_replication_difference
(configuration['max_replication_difference'] || 8.megabytes).to_i
end
# Returns the maximum lag time for a replica.
def self.max_replication_lag_time
(configuration['max_replication_lag_time'] || 60.0).to_f
end
# Returns the interval (in seconds) to use for checking the status of a
# replica.
def self.replica_check_interval
(configuration['replica_check_interval'] || 60).to_f
end
# Returns the additional hosts to use for load balancing.
def self.hosts
hash = ActiveRecord::Base.configurations[Rails.env]['load_balancing']
if hash
hash['hosts'] || []
else
[]
end
configuration['hosts'] || []
end
def self.log(level, message)
......
......@@ -3,15 +3,21 @@ module Gitlab
module LoadBalancing
# A single database host used for load balancing.
class Host
attr_reader :pool
attr_reader :pool, :last_checked_at, :intervals, :load_balancer
delegate :connection, :release_connection, to: :pool
# host - The address of the database.
def initialize(host)
# load_balancer - The LoadBalancer that manages this Host.
def initialize(host, load_balancer)
@host = host
@load_balancer = load_balancer
@pool = Database.create_connection_pool(LoadBalancing.pool_size, host)
@online = true
@last_checked_at = Time.zone.now
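# Spread the status checks out between processes by picking each check
# interval at random from a range between the configured interval and
# twice that value.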
interval = LoadBalancing.replica_check_interval
@intervals = (interval..(interval * 2)).step(0.5).to_a
end
def offline!
......@@ -23,30 +29,80 @@ module Gitlab
# Returns true if the host is online.
def online?
return true if @online
begin
retried = 0
@online = begin
connection.active?
rescue
if retried < 3
release_connection
retried += 1
retry
else
false
end
end
LoadBalancing.log(:info, "Host #{@host} came back online") if @online
@online
ensure
release_connection
return @online unless check_replica_status?
refresh_status
LoadBalancing.log(:info, "Host #{@host} came back online") if @online
@online
end
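# Refreshes the online status of the host by checking if the replica is
# sufficiently up to date, and records when this check took place.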
def refresh_status
@online = replica_is_up_to_date?
@last_checked_at = Time.zone.now
end
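# Returns true if enough time has passed since the previous check to
# warrant refreshing the replica status.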
def check_replica_status?
(Time.zone.now - last_checked_at) >= intervals.sample
end
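# Returns true if the replica lags behind by an acceptable amount of time,
# or has replicated enough data to still be usable.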
def replica_is_up_to_date?
replication_lag_below_threshold? || data_is_recent_enough?
end
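# Returns true if the replication lag time is known and does not exceed
# the configured maximum.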
def replication_lag_below_threshold?
if (lag_time = replication_lag_time)
lag_time <= LoadBalancing.max_replication_lag_time
else
false
end
end
# Returns true if the replica has replicated enough data to be useful.
def data_is_recent_enough?
# It's possible for a replica to not replay WAL data for a while,
# despite being up to date. This can happen when a primary does not
# receive any writes a for a while.
#
# To prevent this from happening we check if the lag size (in bytes)
# of the replica is small enough for the replica to be useful. We
# only do this if we haven't replicated in a while so we only need
# to connect to the primary when truly necessary.
if (lag_size = replication_lag_size)
lag_size <= LoadBalancing.max_replication_difference
else
false
end
end
# Returns the replication lag time of this secondary in seconds as a
# float.
#
# This method will return nil if no lag time could be calculated.
def replication_lag_time
row = query_and_release('SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float as lag')
row['lag'].to_f if row.any?
end
# Returns the number of bytes this secondary is lagging behind the
# primary.
#
# This method will return nil if no lag size could be calculated.
def replication_lag_size
location = connection.quote(primary_write_location)
row = query_and_release("SELECT pg_xlog_location_diff(#{location}, pg_last_xlog_replay_location())::float AS diff")
row['diff'].to_i if row.any?
end
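# Returns the write location of the primary, making sure the primary
# connection is released afterwards.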
def primary_write_location
load_balancer.primary_write_location
ensure
load_balancer.release_primary_connection
end
# Returns true if this host has caught up to the given transaction
# write location.
#
......@@ -60,9 +116,15 @@ module Gitlab
query = "SELECT NOT pg_is_in_recovery() OR " \
"pg_xlog_location_diff(pg_last_xlog_replay_location(), #{string}) >= 0 AS result"
row = connection.select_all(query).first
row = query_and_release(query)
row['result'] == 't'
row && row['result'] == 't'
end
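# Runs the given SQL query, returning the first row as a Hash and releasing
# the connection when done. An empty Hash is returned if the query fails or
# produces no rows.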
def query_and_release(sql)
connection.select_all(sql).first || {}
rescue
{}
ensure
release_connection
end
......
......@@ -15,7 +15,7 @@ module Gitlab
# hosts - The hostnames/addresses of the additional databases.
def initialize(hosts = [])
@host_list = HostList.new(hosts.map { |addr| Host.new(addr) })
@host_list = HostList.new(hosts.map { |addr| Host.new(addr, self) })
end
# Yields a connection that can be used for reads.
......
......@@ -2,13 +2,16 @@ require 'spec_helper'
describe Gitlab::Database::LoadBalancing::HostList do
before do
allow(Gitlab::Database).to receive(:create_connection_pool)
allow(Gitlab::Database)
.to receive(:create_connection_pool)
.and_return(ActiveRecord::Base.connection_pool)
end
let(:load_balancer) { double(:load_balancer) }
let(:host_list) do
hosts = Array.new(2) do
Gitlab::Database::LoadBalancing::Host.new('localhost')
Gitlab::Database::LoadBalancing::Host.new('localhost', load_balancer)
end
described_class.new(hosts)
......
require 'spec_helper'
describe Gitlab::Database::LoadBalancing::Host do
let(:host) { described_class.new('localhost') }
describe Gitlab::Database::LoadBalancing::Host, :postgresql do
let(:load_balancer) do
Gitlab::Database::LoadBalancing::LoadBalancer.new(%w[localhost])
end
let(:host) { load_balancer.host_list.hosts.first }
before do
allow(Gitlab::Database).to receive(:create_connection_pool)
......@@ -33,68 +37,186 @@ describe Gitlab::Database::LoadBalancing::Host do
end
describe '#online?' do
let(:error) { Class.new(RuntimeError) }
context 'when the replica status is recent enough' do
it 'returns the latest status' do
Timecop.freeze do
host = described_class.new('localhost', load_balancer)
before do
allow(host.pool).to receive(:disconnect!)
expect(host).not_to receive(:refresh_status)
expect(host).to be_online
end
end
end
it 'returns true when the host is online' do
expect(host).not_to receive(:connection)
expect(host).not_to receive(:release_connection)
context 'when the replica status is outdated' do
it 'refreshes the status' do
host.offline!
expect(host.online?).to eq(true)
expect(host)
.to receive(:check_replica_status?)
.and_return(true)
expect(host).to be_online
end
end
end
it 'returns true when the host was marked as offline but is online again' do
connection = double(:connection, active?: true)
describe '#refresh_status' do
it 'refreshes the status' do
host.offline!
allow(host).to receive(:connection).and_return(connection)
expect(host)
.to receive(:replica_is_up_to_date?)
.and_call_original
host.offline!
host.refresh_status
expect(host).to receive(:release_connection)
expect(host.online?).to eq(true)
expect(host).to be_online
end
end
it 'returns false when the host is offline' do
connection = double(:connection, active?: false)
describe '#check_replica_status?' do
it 'returns true when we need to check the replica status' do
allow(host)
.to receive(:last_checked_at)
.and_return(1.year.ago)
allow(host).to receive(:connection).and_return(connection)
expect(host).to receive(:release_connection)
expect(host.check_replica_status?).to eq(true)
end
host.offline!
it 'returns false when we do not need to check the replica status' do
Timecop.freeze do
allow(host)
.to receive(:last_checked_at)
.and_return(Time.zone.now)
expect(host.online?).to eq(false)
expect(host.check_replica_status?).to eq(false)
end
end
end
it 'returns false when a connection could not be established' do
expect(host).to receive(:connection).exactly(4).times.and_raise(error)
expect(host).to receive(:release_connection).exactly(4).times
describe '#replica_is_up_to_date?' do
context 'when the lag time is below the threshold' do
it 'returns true' do
expect(host)
.to receive(:replication_lag_below_threshold?)
.and_return(true)
host.offline!
expect(host.online?).to eq(false)
expect(host.replica_is_up_to_date?).to eq(true)
end
end
it 'retries when a connection error is thrown' do
connection = double(:connection, active?: true)
raised = false
context 'when the lag time exceeds the threshold' do
before do
allow(host)
.to receive(:replication_lag_below_threshold?)
.and_return(false)
end
allow(host).to receive(:connection) do
unless raised
raised = true
raise error.new
end
it 'returns true if the data is recent enough' do
expect(host)
.to receive(:data_is_recent_enough?)
.and_return(true)
connection
expect(host.replica_is_up_to_date?).to eq(true)
end
expect(host).to receive(:release_connection).twice
it 'returns false when the data is not recent enough' do
expect(host)
.to receive(:data_is_recent_enough?)
.and_return(false)
host.offline!
expect(host.replica_is_up_to_date?).to eq(false)
end
end
end
describe '#replication_lag_below_threshold' do
it 'returns true when the lag time is below the threshold' do
expect(host)
.to receive(:replication_lag_time)
.and_return(1)
expect(host.online?).to eq(true)
expect(host.replication_lag_below_threshold?).to eq(true)
end
it 'returns false when the lag time exceeds the threshold' do
expect(host)
.to receive(:replication_lag_time)
.and_return(9000)
expect(host.replication_lag_below_threshold?).to eq(false)
end
it 'returns false when no lag time could be calculated' do
expect(host)
.to receive(:replication_lag_time)
.and_return(nil)
expect(host.replication_lag_below_threshold?).to eq(false)
end
end
describe '#data_is_recent_enough?' do
it 'returns true when the data is recent enough' do
expect(host.data_is_recent_enough?).to eq(true)
end
it 'returns false when the data is not recent enough' do
diff = Gitlab::Database::LoadBalancing.max_replication_difference * 2
expect(host)
.to receive(:query_and_release)
.and_return({ 'diff' => diff })
expect(host.data_is_recent_enough?).to eq(false)
end
it 'returns false when no lag size could be calculated' do
expect(host)
.to receive(:replication_lag_size)
.and_return(nil)
expect(host.data_is_recent_enough?).to eq(false)
end
end
describe '#replication_lag_time' do
it 'returns the lag time as a Float' do
expect(host.replication_lag_time).to be_an_instance_of(Float)
end
it 'returns nil when the database query returned no rows' do
expect(host)
.to receive(:query_and_release)
.and_return({})
expect(host.replication_lag_time).to be_nil
end
end
describe '#replication_lag_size' do
it 'returns the lag size as an Integer' do
# On newer versions of Ruby the class is Integer, but on CI we run a
# version that still uses Fixnum.
classes = [Fixnum, Integer] # rubocop: disable Lint/UnifiedInteger
expect(classes).to include(host.replication_lag_size.class)
end
it 'returns nil when the database query returned no rows' do
expect(host)
.to receive(:query_and_release)
.and_return({})
expect(host.replication_lag_size).to be_nil
end
end
describe '#primary_write_location' do
it 'returns the write location of the primary' do
expect(host.primary_write_location).to be_an_instance_of(String)
expect(host.primary_write_location).not_to be_empty
end
end
......@@ -119,4 +241,29 @@ describe Gitlab::Database::LoadBalancing::Host do
expect(host.caught_up?('foo')).to eq(false)
end
end
describe '#query_and_release' do
it 'executes a SQL query' do
results = host.query_and_release('SELECT 10 AS number')
expect(results).to be_an_instance_of(Hash)
expect(results['number'].to_i).to eq(10)
end
it 'releases the connection after running the query' do
expect(host)
.to receive(:release_connection)
.once
host.query_and_release('SELECT 10 AS number')
end
it 'returns an empty Hash in the event of an error' do
expect(host.connection)
.to receive(:select_all)
.and_raise(RuntimeError, 'kittens')
expect(host.query_and_release('SELECT 10 AS number')).to eq({})
end
end
end
......@@ -9,10 +9,89 @@ describe Gitlab::Database::LoadBalancing do
end
end
describe '.configuration' do
it 'returns a Hash' do
config = { 'hosts' => %w(foo) }
allow(ActiveRecord::Base.configurations[Rails.env])
.to receive(:[])
.with('load_balancing')
.and_return(config)
expect(described_class.configuration).to eq(config)
end
end
describe '.max_replication_difference' do
context 'without an explicitly configured value' do
it 'returns the default value' do
allow(described_class)
.to receive(:configuration)
.and_return({})
expect(described_class.max_replication_difference).to eq(8.megabytes)
end
end
context 'with an explicitly configured value' do
it 'returns the configured value' do
allow(described_class)
.to receive(:configuration)
.and_return({ 'max_replication_difference' => 4 })
expect(described_class.max_replication_difference).to eq(4)
end
end
end
describe '.max_replication_lag_time' do
context 'without an explicitly configured value' do
it 'returns the default value' do
allow(described_class)
.to receive(:configuration)
.and_return({})
expect(described_class.max_replication_lag_time).to eq(60)
end
end
context 'with an explicitly configured value' do
it 'returns the configured value' do
allow(described_class)
.to receive(:configuration)
.and_return({ 'max_replication_lag_time' => 4 })
expect(described_class.max_replication_lag_time).to eq(4)
end
end
end
describe '.replica_check_interval' do
context 'without an explicitly configured value' do
it 'returns the default value' do
allow(described_class)
.to receive(:configuration)
.and_return({})
expect(described_class.replica_check_interval).to eq(60)
end
end
context 'with an explicitly configured value' do
it 'returns the configured value' do
allow(described_class)
.to receive(:configuration)
.and_return({ 'replica_check_interval' => 4 })
expect(described_class.replica_check_interval).to eq(4)
end
end
end
describe '.hosts' do
it 'returns a list of hosts' do
allow(ActiveRecord::Base.configurations[Rails.env]).to receive(:[])
.with('load_balancing')
allow(described_class)
.to receive(:configuration)
.and_return({ 'hosts' => %w(foo bar baz) })
expect(described_class.hosts).to eq(%w(foo bar baz))
......