Commit e3fd22d4 authored by Diana Logan

Merge branch 'rework-sidekiq-docs' into 'master'

Split Sidekiq development docs into separate files

See merge request gitlab-org/gitlab!78186
parents 364155fc bb308906
......@@ -296,7 +296,7 @@ Instead of a queue, a queue namespace can also be provided, to have the process
automatically listen on all queues in that namespace without needing to
explicitly list all the queue names. For more information about queue namespaces,
see the relevant section in the
[Sidekiq style guide](../../development/sidekiq_style_guide.md#queue-namespaces).
[Sidekiq development documentation](../../development/sidekiq/index.md#queue-namespaces).
For example, say you want to start 2 extra processes: one to process the
`process_commit` queue, and one to process the `post_receive` queue. This can be
......
......@@ -599,7 +599,7 @@ Enterprise Edition instance. This has some implications:
- [Background migrations](background_migrations.md) run in Sidekiq, and
should only be done for migrations that would take an extreme amount of
time at GitLab.com scale.
1. **Sidekiq workers** [cannot change in a backwards-incompatible way](sidekiq_style_guide.md#sidekiq-compatibility-across-updates):
1. **Sidekiq workers** [cannot change in a backwards-incompatible way](sidekiq/compatibility_across_updates.md):
1. Sidekiq queues are not drained before a deploy happens, so there are
workers in the queue from the previous version of GitLab.
1. If you need to change a method signature, try to do so across two releases,
......
......@@ -10,8 +10,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w
A Sidekiq job is enqueued whenever `deliver_later` is called on an `ActionMailer`.
If a mailer argument needs to be added or removed, it is important to ensure
both backward and forward compatibility. Adhere to the Sidekiq Style Guide steps for
[changing the arguments for a worker](sidekiq_style_guide.md#changing-the-arguments-for-a-worker).
both backward and forward compatibility. Adhere to the Sidekiq steps for
[changing the arguments for a worker](sidekiq/compatibility_across_updates.md#changing-the-arguments-for-a-worker).
In the following example from [`NotificationService`](https://gitlab.com/gitlab-org/gitlab/-/blob/33ccb22e4fc271dbaac94b003a7a1a2915a13441/app/services/notification_service.rb#L74)
adding or removing an argument in this mailer's definition may cause problems
......
......@@ -185,7 +185,7 @@ Changes to the schema require multiple rollouts. While the new version is being
- Events get persisted in the Sidekiq queue as job arguments, so we could have 2 versions of the schema during deployments.
As changing the schema ultimately impacts the Sidekiq arguments, please refer to our
[Sidekiq style guide](sidekiq_style_guide.md#changing-the-arguments-for-a-worker) with regards to multiple rollouts.
[Sidekiq style guide](sidekiq/compatibility_across_updates.md#changing-the-arguments-for-a-worker) with regards to multiple rollouts.
#### Add properties
......
......@@ -221,7 +221,7 @@ the [reviewer values](https://about.gitlab.com/handbook/engineering/workflow/rev
- [Geo development](geo.md)
- [Redis guidelines](redis.md)
- [Adding a new Redis instance](redis/new_redis_instance.md)
- [Sidekiq guidelines](sidekiq_style_guide.md) for working with Sidekiq workers
- [Sidekiq guidelines](sidekiq/index.md) for working with Sidekiq workers
- [Working with Gitaly](gitaly.md)
- [Elasticsearch integration docs](elasticsearch.md)
- [Working with Merge Request diffs](diffs.md)
......
......@@ -14,14 +14,14 @@ In a sense, these scenarios are all transient states. But they can often persist
### When modifying a Sidekiq worker
For example when [changing arguments](sidekiq_style_guide.md#changing-the-arguments-for-a-worker):
For example when [changing arguments](sidekiq/compatibility_across_updates.md#changing-the-arguments-for-a-worker):
- Is it ok if jobs are being enqueued with the old signature but executed by the new monthly release?
- Is it ok if jobs are being enqueued with the new signature but executed by the previous monthly release?
### When adding a new Sidekiq worker
Is it ok if these jobs don't get executed for several hours because [Sidekiq nodes are not yet updated](sidekiq_style_guide.md#adding-new-workers)?
Is it ok if these jobs don't get executed for several hours because [Sidekiq nodes are not yet updated](sidekiq/compatibility_across_updates.md#adding-new-workers)?
### When modifying JavaScript
......@@ -89,7 +89,7 @@ Many users [skip some monthly releases](../update/index.md#upgrading-to-a-new-ma
- 13.0 => 13.12
These users accept some downtime during the update. Unfortunately we can't ignore this case completely. For example, 13.12 may execute Sidekiq jobs from 13.0, which illustrates why [we avoid removing arguments from jobs until a major release](sidekiq_style_guide.md#deprecate-and-remove-an-argument). The main question is: Will the deployment get to a good state after the update is complete?
These users accept some downtime during the update. Unfortunately we can't ignore this case completely. For example, 13.12 may execute Sidekiq jobs from 13.0, which illustrates why [we avoid removing arguments from jobs until a major release](sidekiq/compatibility_across_updates.md#deprecate-and-remove-an-argument). The main question is: Will the deployment get to a good state after the update is complete?
## What kind of components can GitLab be broken down into?
......@@ -180,7 +180,7 @@ coexists in production.
### Changing Sidekiq worker's parameters
This topic is explained in detail in [Sidekiq Compatibility across Updates](sidekiq_style_guide.md#sidekiq-compatibility-across-updates).
This topic is explained in detail in [Sidekiq Compatibility across Updates](sidekiq/compatibility_across_updates.md).
When we need to add a new parameter to a Sidekiq worker class, we can split this into the following steps:
......
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Sidekiq Compatibility across Updates
The arguments for a Sidekiq job are stored in a queue while it is
scheduled for execution. During an online update, this could lead to
several possible situations:
1. An older version of the application publishes a job, which is executed by an
upgraded Sidekiq node.
1. A job is queued before an upgrade, but executed after an upgrade.
1. A job is queued by a node running the newer version of the application, but
executed on a node running an older version of the application.
## Adding new workers
On GitLab.com, we [do not currently have a Sidekiq deployment in the
canary stage](https://gitlab.com/gitlab-org/gitlab/-/issues/19239). This
means that a new worker that can be scheduled from an HTTP endpoint may
be scheduled from canary but not run on Sidekiq until the full
production deployment is complete. This can be several hours later than
scheduling the job. For some workers, this will not be a problem. For
others - particularly [latency-sensitive
jobs](worker_attributes.md#latency-sensitive-jobs) - this will result in a poor user
experience.
This only applies to new worker classes when they are first introduced.
As we recommend [using feature flags](../feature_flags/) as a general
development process, it's best to control the entire change (including
scheduling of the new Sidekiq worker) with a feature flag.
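For example, the call that schedules the new worker can itself be guarded by the flag, so no jobs are enqueued until the flag is enabled. A minimal sketch; the flag name, worker name, and actor are hypothetical:

```ruby
# Hypothetical example: `:schedule_example_worker` and `ExampleWorker` are
# illustrative names, not existing GitLab code.
if Feature.enabled?(:schedule_example_worker, project)
  ExampleWorker.perform_async(project.id)
end
```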
## Changing the arguments for a worker
Jobs need to be backward and forward compatible between consecutive versions
of the application. Adding or removing an argument may cause problems
during deployment before all Rails and Sidekiq nodes have the updated code.
### Deprecate and remove an argument
**Before you remove arguments from the `perform_async` and `perform` methods**, deprecate them. The
following example deprecates and then removes `arg2` from the `perform_async` method:
1. Provide a default value (usually `nil`) and use a comment to mark the
argument as deprecated in the coming minor release. (Release M)
```ruby
class ExampleWorker
# Keep arg2 parameter for backwards compatibility.
def perform(object_id, arg1, arg2 = nil)
# ...
end
end
```
1. One minor release later, stop using the argument in `perform_async`. (Release M+1)
```ruby
ExampleWorker.perform_async(object_id, arg1)
```
1. At the next major release, remove the value from the worker class. (Next major release)
```ruby
class ExampleWorker
def perform(object_id, arg1)
# ...
end
end
```
### Add an argument
There are two options for safely adding new arguments to Sidekiq workers:
1. Set up a [multi-step deployment](#multi-step-deployment) in which the new argument is first added to the worker.
1. Use a [parameter hash](#parameter-hash) for additional arguments. This is perhaps the most flexible option.
#### Multi-step deployment
This approach requires multiple releases.
1. Add the argument to the worker with a default value (Release M).
```ruby
class ExampleWorker
def perform(object_id, new_arg = nil)
# ...
end
end
```
1. Add the new argument to all the invocations of the worker (Release M+1).
```ruby
ExampleWorker.perform_async(object_id, new_arg)
```
1. Remove the default value (Release M+2).
```ruby
class ExampleWorker
def perform(object_id, new_arg)
# ...
end
end
```
#### Parameter hash
This approach doesn't require multiple releases if an existing worker already
uses a parameter hash.
1. Use a parameter hash in the worker to allow future flexibility.
```ruby
class ExampleWorker
def perform(object_id, params = {})
# ...
end
end
```
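Callers can then pass additional data through the hash without changing the method signature. An illustrative call; the `new_arg` key is hypothetical:

```ruby
# No signature change is needed to pass extra data through the params hash.
ExampleWorker.perform_async(object_id, { 'new_arg' => 123 })
```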
## Removing workers
Try to avoid removing workers and their queues in minor and patch
releases.
During an online update, an instance can have pending jobs, and removing the queue can
lead to those jobs being stuck forever. If you can't write a migration for those
Sidekiq jobs, please consider removing the worker in a major release only.
## Renaming queues
For the same reasons that removing workers is dangerous, care should be taken
when renaming queues.
When renaming queues, use the `sidekiq_queue_migrate` helper migration method
in a **post-deployment migration**:
```ruby
class MigrateTheRenamedSidekiqQueue < Gitlab::Database::Migration[1.0]
def up
sidekiq_queue_migrate 'old_queue_name', to: 'new_queue_name'
end
def down
sidekiq_queue_migrate 'new_queue_name', to: 'old_queue_name'
end
end
```
You must rename the queue in a post-deployment migration, not in a normal
migration. Otherwise, it runs too early, before all the workers that
schedule these jobs have stopped running. See also [other examples](../post_deployment_migrations.md#use-cases).
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Sidekiq idempotent jobs
A job can fail for multiple reasons, such as network outages or bugs.
To address this, Sidekiq has a built-in retry mechanism that is
used by default by most workers within GitLab.
It's expected that a job can run again after a failure without major side-effects for the
application or users, which is why Sidekiq encourages
jobs to be [idempotent and transactional](https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-job-idempotent-and-transactional).
As a general rule, a worker can be considered idempotent if:
- It can safely run multiple times with the same arguments.
- Application side-effects are expected to happen only once
(or side-effects of a second run do not have an effect).
A good example of that would be a cache expiration worker.
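A hypothetical sketch of such a worker, where running it twice with the same arguments leaves the system in the same state:

```ruby
class ExpireSomeCacheWorker # hypothetical example
  include ApplicationWorker

  idempotent!

  # Deleting the same cache key a second time is a no-op, so the job
  # can safely run multiple times with the same arguments.
  def perform(cache_key)
    Rails.cache.delete(cache_key)
  end
end
```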
A job scheduled for an idempotent worker is [deduplicated](#deduplication) when
an unstarted job with the same arguments is already in the queue.
## Ensuring a worker is idempotent
Make sure the worker tests pass using the following shared example:
```ruby
include_examples 'an idempotent worker' do
it 'marks the MR as merged' do
# Using subject inside this block will process the job multiple times
subject
expect(merge_request.state).to eq('merged')
end
end
```
Use the `perform_multiple` method directly instead of `job.perform` (this
helper method is automatically included for workers).
## Declaring a worker as idempotent
```ruby
class IdempotentWorker
include ApplicationWorker
# Declares a worker is idempotent and can
# safely run multiple times.
idempotent!
# ...
end
```
It's encouraged to only have the `idempotent!` call in the top-most worker class, even if
the `perform` method is defined in another class or module.
If the worker class isn't marked as idempotent, a cop fails. Consider skipping
the cop if you're not confident your job can safely run multiple times.
## Deduplication
When a job for an idempotent worker is enqueued while another
unstarted job is already in the queue, GitLab drops the second
job. The work is skipped because the same work would be
done by the job that was scheduled first; by the time the second
job would execute, the first job would have already done the work.
### Strategies
GitLab supports two deduplication strategies:
- `until_executing`
- `until_executed`
More [deduplication strategies have been
suggested](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/195). If
you are implementing a worker that could benefit from a different
strategy, please comment in the issue.
#### Until Executing
This strategy takes a lock when a job is added to the queue, and removes that lock before the job starts.
For example, `AuthorizedProjectsWorker` takes a user ID. When the
worker runs, it recalculates a user's authorizations. GitLab schedules
this job each time an action potentially changes a user's
authorizations. If the same user is added to two projects at the
same time, the second job can be skipped if the first job hasn't
begun, because when the first job runs, it creates the
authorizations for both projects.
```ruby
module AuthorizedProjectUpdate
class UserRefreshOverUserRangeWorker
include ApplicationWorker
deduplicate :until_executing
idempotent!
# ...
end
end
```
#### Until Executed
This strategy takes a lock when a job is added to the queue, and removes that lock after the job finishes.
It can be used to prevent jobs from running simultaneously multiple times.
```ruby
module Ci
class BuildTraceChunkFlushWorker
include ApplicationWorker
deduplicate :until_executed
idempotent!
# ...
end
end
```
You can also pass the `if_deduplicated: :reschedule_once` option to re-run a job once after
the currently running job has finished and deduplication has happened at least once.
This ensures that the latest result is always produced even if a race condition
occurred. See [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/342123) for more information.
### Scheduling jobs in the future
GitLab doesn't skip jobs scheduled in the future, as we assume that
the state has changed by the time the job is scheduled to
execute. Deduplication of jobs scheduled in the future is possible
for both `until_executed` and `until_executing` strategies.
If you do want to deduplicate jobs scheduled in the future,
this can be specified on the worker by passing the `including_scheduled: true` argument
when defining the deduplication strategy:
```ruby
module AuthorizedProjectUpdate
class UserRefreshOverUserRangeWorker
include ApplicationWorker
deduplicate :until_executing, including_scheduled: true
idempotent!
# ...
end
end
```
## Setting the deduplication time-to-live (TTL)
Deduplication depends on an idempotency key that is stored in Redis. This is normally
cleared by the configured deduplication strategy.
However, the key can remain until its TTL in certain cases like:
1. `until_executing` is used but the job was never enqueued or executed after the Sidekiq
client middleware was run.
1. `until_executed` is used but the job fails to finish due to retry exhaustion, gets
interrupted the maximum number of times, or gets lost.
The default value is 6 hours. During this time, jobs won't be enqueued even if the first
job never executed or finished.
The TTL can be configured with:
```ruby
class ProjectImportScheduleWorker
include ApplicationWorker
idempotent!
deduplicate :until_executing, ttl: 5.minutes
end
```
Duplicate jobs can happen when the TTL is reached, so make sure you lower this only for jobs
that can tolerate some duplication.
## Deduplication with load balancing
> [Introduced](https://gitlab.com/groups/gitlab-org/-/epics/6763) in GitLab 14.4.
Jobs that declare either `:sticky` or `:delayed` data consistency
are eligible for database load-balancing.
In both cases, jobs are [scheduled in the future](#scheduling-jobs-in-the-future) with a short delay (1 second).
This minimizes the chance of replication lag after a write.
If you really want to deduplicate jobs eligible for load balancing,
specify the `including_scheduled: true` argument when defining the deduplication strategy:
```ruby
class DelayedIdempotentWorker
include ApplicationWorker
data_consistency :delayed
deduplicate :until_executing, including_scheduled: true
idempotent!
# ...
end
```
### Preserve the latest WAL location for idempotent jobs
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69372) in GitLab 14.3.
> - [Enabled on GitLab.com](https://gitlab.com/gitlab-org/gitlab/-/issues/338350) in GitLab 14.4.
> - [Enabled on self-managed](https://gitlab.com/gitlab-org/gitlab/-/issues/338350) in GitLab 14.6.
The deduplication always takes into account the latest binary replication pointer, not the first one.
This happens because we drop the same job scheduled for the second time and the Write-Ahead Log (WAL) is lost.
This could lead to comparing the old WAL location and reading from a stale replica.
To support both deduplication and maintaining data consistency with load balancing,
we are preserving the latest WAL location for idempotent jobs in Redis.
This way we are always comparing the latest binary replication pointer,
making sure that we read from the replica that is fully caught up.
FLAG:
On self-managed GitLab, by default this feature is available. To hide the feature, ask an administrator to
[disable the feature flag](../../administration/feature_flags.md) named `preserve_latest_wal_locations_for_idempotent_jobs`.
However, this feature flag is related to GitLab development and is not intended to be used by GitLab administrators.
On GitLab.com, this feature is available.
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Sidekiq guides
We use [Sidekiq](https://github.com/mperham/sidekiq) as our background
job processor. These guides are for writing jobs that will work well on
GitLab.com and be consistent with our existing worker classes. For
information on administering GitLab, see [configuring Sidekiq](../../administration/sidekiq.md).
There are pages with additional detail on the following topics:
1. [Compatibility across updates](compatibility_across_updates.md)
1. [Job idempotency and job deduplication](idempotent_jobs.md)
1. [Limited capacity worker: continuously performing work with a specified concurrency](limited_capacity_worker.md)
1. [Logging](logging.md)
1. [Worker attributes](worker_attributes.md)
1. **Job urgency** specifies queuing and execution SLOs
1. **Resource boundaries** and **external dependencies** for describing the workload
1. **Feature categorization**
1. **Database load balancing**
## ApplicationWorker
All workers should include `ApplicationWorker` instead of `Sidekiq::Worker`,
which adds some convenience methods and automatically sets the queue based on
the [routing rules](../../administration/operations/extra_sidekiq_routing.md#queue-routing-rules).
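A minimal worker class then looks like the following (the class name is illustrative):

```ruby
class SomeProcessingWorker # illustrative name
  include ApplicationWorker

  # The queue is derived from the routing rules; there is no need to set
  # it manually with `sidekiq_options queue: ...`.
  def perform(some_id)
    # ...
  end
end
```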
## Retries
Sidekiq defaults to using [25
retries](https://github.com/mperham/sidekiq/wiki/Error-Handling#automatic-job-retry),
with back-off between each retry. 25 retries means that the last retry
would happen around three weeks after the first attempt (assuming all 24
prior retries failed).
For most workers - especially [idempotent workers](idempotent_jobs.md) -
the default of 25 retries is more than sufficient. Many of our older
workers declare 3 retries, which used to be the default within the
GitLab application. 3 retries happen over the course of a couple of
minutes, so the jobs are prone to failing completely.
A lower retry count may be applicable if any of the below apply:
1. The worker contacts an external service and we do not provide
guarantees on delivery. For example, webhooks.
1. The worker is not idempotent and running it multiple times could
leave the system in an inconsistent state. For example, a worker that
posts a system note and then performs an action: if the second step
fails and the worker retries, the system note will be posted again.
1. The worker is a cronjob that runs frequently. For example, if a cron
job runs every hour, then we don't need to retry beyond an hour
because we don't need two of the same job running at once.
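When a lower retry count is warranted, it can be declared with Sidekiq's `sidekiq_options`. A sketch; the worker name and count are illustrative:

```ruby
class ExternalServiceWorker # illustrative name
  include ApplicationWorker

  # Delivery to the external service is best-effort, so cap retries
  # instead of relying on the default of 25.
  sidekiq_options retry: 3

  def perform(payload_id)
    # ...
  end
end
```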
Each retry for a worker is counted as a failure in our metrics. A worker
which always fails 9 times and succeeds on the 10th would have a 90%
error rate.
## Sidekiq Queues
Previously, each worker had its own queue, which was automatically set based on the
worker class name. For a worker named `ProcessSomethingWorker`, the queue name
would be `process_something`. You can now route workers to a specific queue using
[queue routing rules](../../administration/operations/extra_sidekiq_routing.md#queue-routing-rules).
In GDK, new workers are routed to a queue named `default`.
If you're not sure what queue a worker uses,
you can find it using `SomeWorker.queue`. There is almost never a reason to
manually override the queue name using `sidekiq_options queue: :some_queue`.
After adding a new worker, run `bin/rake
gitlab:sidekiq:all_queues_yml:generate` to regenerate
`app/workers/all_queues.yml` or `ee/app/workers/all_queues.yml` so that
it can be picked up by
[`sidekiq-cluster`](../../administration/operations/extra_sidekiq_processes.md)
in installations that don't use routing rules. To learn more about potential changes,
read [Use routing rules by default and deprecate queue selectors for self-managed](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/596).
Additionally, run
`bin/rake gitlab:sidekiq:sidekiq_queues_yml:generate` to regenerate
`config/sidekiq_queues.yml`.
## Queue Namespaces
While different workers cannot share a queue, they can share a queue namespace.
Defining a queue namespace for a worker makes it possible to start a Sidekiq
process that automatically handles jobs for all workers in that namespace,
without needing to explicitly list all their queue names. If, for example, all
workers that are managed by `sidekiq-cron` use the `cronjob` queue namespace, we
can spin up a Sidekiq process specifically for these kinds of scheduled jobs.
If a new worker using the `cronjob` namespace is added later on, the Sidekiq
process also picks up jobs for that worker (after having been restarted),
without the need to change any configuration.
A queue namespace can be set using the `queue_namespace` DSL class method:
```ruby
class SomeScheduledTaskWorker
include ApplicationWorker
queue_namespace :cronjob
# ...
end
```
Behind the scenes, this sets `SomeScheduledTaskWorker.queue` to
`cronjob:some_scheduled_task`. Commonly used namespaces have their own
concern module that can easily be included into the worker class, and that may
set other Sidekiq options besides the queue namespace. `CronjobQueue`, for
example, sets the namespace, but also disables retries.
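For example, a scheduled worker can include the concern instead of setting the namespace directly (a sketch):

```ruby
class SomeScheduledTaskWorker
  include ApplicationWorker
  # Sets the `cronjob` queue namespace and disables retries.
  include CronjobQueue

  def perform
    # ...
  end
end
```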
`bundle exec sidekiq` is namespace-aware, and listens on all
queues in a namespace (technically: all queues prefixed with the namespace name)
when a namespace is provided instead of a simple queue name in the `--queue`
(`-q`) option, or in the `:queues:` section in `config/sidekiq_queues.yml`.
Note that adding a worker to an existing namespace should be done with care, as
the extra jobs take resources away from jobs from workers that were already
there, if the resources available to the Sidekiq process handling the namespace
are not adjusted appropriately.
## Versioning
A version can be specified on each Sidekiq worker class.
This is then sent along when the job is created.
```ruby
class FooWorker
include ApplicationWorker
version 2
def perform(*args)
if job_version == 2
foo = args.first['foo']
else
foo = args.first
end
end
end
```
Under this schema, any worker is expected to be able to handle any job that was
enqueued by an older version of that worker. This means that when changing the
arguments a worker takes, you must increment the `version` (or set `version 1`
if this is the first time a worker's arguments are changing), but also make sure
that the worker is still able to handle jobs that were queued with any earlier
version of the arguments. From the worker's `perform` method, you can read
`self.job_version` if you want to specifically branch on job version, or you
can read the number or type of provided arguments.
## Job size
GitLab stores Sidekiq jobs and their arguments in Redis. To avoid
excessive memory usage, we compress the arguments of Sidekiq jobs
if their original size is bigger than 100KB.
After compression, if their size still exceeds 5MB, it raises an
[`ExceedLimitError`](https://gitlab.com/gitlab-org/gitlab/-/blob/f3dd89e5e510ea04b43ffdcb58587d8f78a8d77c/lib/gitlab/sidekiq_middleware/size_limiter/exceed_limit_error.rb#L8)
error when scheduling the job.
If this happens, rely on other means of making the data
available in Sidekiq. There are possible workarounds such as:
- Rebuild the data in Sidekiq with data loaded from the database or
elsewhere.
- Store the data in [object storage](../file_storage.md#object-storage)
before scheduling the job, and retrieve it inside the job.
## Job weights
Some jobs have a weight declared. This is only used when running Sidekiq
in the default execution mode - using
[`sidekiq-cluster`](../../administration/operations/extra_sidekiq_processes.md)
does not account for weights.
As we are [moving towards using `sidekiq-cluster` in
Free](https://gitlab.com/gitlab-org/gitlab/-/issues/34396), newly-added
workers do not need to have weights specified. They can use the
default weight, which is 1.
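Where a weight is declared, it is typically set with the `weight` class method. A sketch; the class name and value are illustrative:

```ruby
class SomeHeavyWorker # illustrative name
  include ApplicationWorker

  # Only taken into account in the default execution mode;
  # `sidekiq-cluster` ignores weights.
  weight 2

  def perform(some_id)
    # ...
  end
end
```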
## Tests
Each Sidekiq worker must be tested using RSpec, just like any other class. These
tests should be placed in `spec/workers`.
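A minimal spec might look like this (the worker, factory, and expectation are hypothetical):

```ruby
# spec/workers/some_processing_worker_spec.rb
require 'spec_helper'

RSpec.describe SomeProcessingWorker do
  describe '#perform' do
    it 'processes the record without raising' do
      project = create(:project) # assumes a project factory exists

      expect { described_class.new.perform(project.id) }.not_to raise_error
    end
  end
end
```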
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Sidekiq limited capacity worker
It is possible to limit the number of concurrent running jobs for a worker class
by using the `LimitedCapacity::Worker` concern.
The worker must implement three methods:
- `perform_work`: The concern implements the usual `perform` method and calls
`perform_work` if there's any available capacity.
- `remaining_work_count`: Number of jobs that have work to perform.
- `max_running_jobs`: Maximum number of jobs allowed to run concurrently.
```ruby
class MyDummyWorker
include ApplicationWorker
include LimitedCapacity::Worker
def perform_work(*args)
end
def remaining_work_count(*args)
5
end
def max_running_jobs
25
end
end
```
In addition to the regular worker, a cron worker must be defined as well to
backfill the queue with jobs. The arguments passed to `perform_with_capacity`
are passed to the `perform_work` method.
```ruby
class ScheduleMyDummyCronWorker
include ApplicationWorker
include CronjobQueue
def perform(*args)
MyDummyWorker.perform_with_capacity(*args)
end
end
```
## How many jobs are running?
The worker runs `max_running_jobs` jobs at almost all times.
The cron worker checks the remaining capacity on each execution and
schedules at most `max_running_jobs` jobs. On completion, those jobs
re-enqueue themselves immediately, but they do not do so on failure. The cron worker is in
charge of replacing those failed jobs.
## Handling errors and idempotence
This concern disables Sidekiq retries, logs the errors, and sends the job to the
dead queue. This is done to have only one source that produces jobs and because
the retry would occupy a slot with a job to perform in the distant future.
We let the cron worker enqueue new jobs; this could be seen as our retry and
back-off mechanism, because the job might fail again if executed immediately.
This means that for every failed job, we run at a lower capacity
until the cron worker fills the capacity again. If it is important for the
worker not to get a backlog, exceptions must be handled in `#perform_work` and
the job should not raise.
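A sketch of swallowing errors inside `perform_work` so the job does not raise (the `process` call is a hypothetical unit of work):

```ruby
def perform_work(record_id)
  process(record_id) # hypothetical unit of work
rescue StandardError => e
  # Track the error instead of raising, so the job is not sent to the
  # dead queue and capacity is restored on the next cron run.
  Gitlab::ErrorTracking.track_exception(e, record_id: record_id)
end
```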
The jobs are deduplicated using the `:none` strategy, but the worker is not
marked as `idempotent!`.
## Metrics
This concern exposes three Prometheus metrics of gauge type with the worker class
name as label:
- `limited_capacity_worker_running_jobs`
- `limited_capacity_worker_max_running_jobs`
- `limited_capacity_worker_remaining_work_count`
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Sidekiq logging
## Worker context
> [Introduced](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/9) in GitLab 12.8.
To have some more information about workers in the logs, we add
[metadata to the jobs in the form of an
`ApplicationContext`](../logging.md#logging-context-metadata-through-rails-or-grape-requests).
In most cases, when scheduling a job from a request, this context is already
deduced from the request and added to the scheduled job.
When a job runs, the context that was active when it was scheduled
is restored. This causes the context to be propagated to any job
scheduled from within the running job.
All this means that in most cases, to add context to jobs, we don't
need to do anything.
There are however some instances when there would be no context
present when the job is scheduled, or the context that is present is
likely to be incorrect. For these instances, we've added Rubocop rules
to draw attention and avoid incorrect metadata in our logs.
As with most of our cops, there are perfectly valid reasons for disabling
them. In this case it could be that the context from the request is
correct. Or maybe you've specified a context already in a way that
isn't picked up by the cops. In any case, leave a code comment
pointing to which context to use when disabling the cops.
When you do provide objects to the context, make sure that the
route for namespaces and projects is pre-loaded. This can be done by using
the `.with_route` scope defined on all `Routable`s.
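For example, when passing a project into the context, the route can be preloaded like this (a sketch; the worker name is hypothetical):

```ruby
# Preload the route so serializing the project's full path for the
# context does not trigger extra queries.
project = Project.with_route.find(project_id)

with_context(project: project) do
  SomeWorker.perform_async(project.id)
end
```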
### Cron workers
The context is automatically cleared for workers in the cronjob queue
(`include CronjobQueue`), even when scheduling them from
requests. We do this to avoid incorrect metadata when other jobs are
scheduled from the cron worker.
Cron workers themselves run instance-wide, so they aren't scoped to
users, namespaces, projects, or other resources that should be added to
the context.
However, they often schedule other jobs that _do_ require context.
That is why there needs to be an indication of context somewhere in
the worker. This can be done by using one of the following methods
somewhere within the worker:
1. Wrap the code that schedules jobs in the `with_context` helper:
```ruby
def perform
deletion_cutoff = Gitlab::CurrentSettings
.deletion_adjourned_period.days.ago.to_date
projects = Project.with_route.with_namespace
.aimed_for_deletion(deletion_cutoff)
projects.find_each(batch_size: 100).with_index do |project, index|
delay = index * INTERVAL
with_context(project: project) do
AdjournedProjectDeletionWorker.perform_in(delay, project.id)
end
end
end
```
1. Use a batch scheduling method that provides context:
```ruby
def schedule_projects_in_batch(projects)
ProjectImportScheduleWorker.bulk_perform_async_with_contexts(
projects,
arguments_proc: -> (project) { project.id },
context_proc: -> (project) { { project: project } }
)
end
```
Or, when scheduling with delays:
```ruby
diffs.each_batch(of: BATCH_SIZE) do |diffs, index|
DeleteDiffFilesWorker
.bulk_perform_in_with_contexts(index * 5.minutes,
diffs,
arguments_proc: -> (diff) { diff.id },
context_proc: -> (diff) { { project: diff.merge_request.target_project } })
end
```
### Jobs scheduled in bulk
Often, when scheduling jobs in bulk, these jobs should have a separate
context rather than the overarching context.
If that is the case, `bulk_perform_async` can be replaced by the
`bulk_perform_async_with_context` helper, and instead of
`bulk_perform_in` use `bulk_perform_in_with_context`.
For example:
```ruby
ProjectImportScheduleWorker.bulk_perform_async_with_contexts(
projects,
arguments_proc: -> (project) { project.id },
context_proc: -> (project) { { project: project } }
)
```
Each object from the enumerable in the first argument is yielded into 2
blocks:
- The `arguments_proc` which needs to return the list of arguments the
job needs to be scheduled with.
- The `context_proc` which needs to return a hash with the context
information for the job.
## Arguments logging
As of GitLab 13.6, Sidekiq job arguments are logged by default, unless [`SIDEKIQ_LOG_ARGUMENTS`](../../administration/troubleshooting/sidekiq.md#log-arguments-to-sidekiq-jobs)
is disabled.
By default, the only arguments logged are numeric arguments, because
arguments of other types could contain sensitive information. To
override this, use `loggable_arguments` inside a worker with the indexes
of the arguments to be logged. (Numeric arguments do not need to be
specified here.)
For example:
```ruby
class MyWorker
include ApplicationWorker
loggable_arguments 1, 3
# object_id will be logged as it's numeric
# string_a will be logged due to the loggable_arguments call
# string_b will be filtered from logs
# string_c will be logged due to the loggable_arguments call
def perform(object_id, string_a, string_b, string_c)
end
end
```
......@@ -64,8 +64,8 @@ component can have 2 indicators:
[customize the request Apdex](application_slis/rails_request_apdex.md), this new Apdex
measurement is not yet part of the error budget.
For Sidekiq job execution, the threshold depends on the [job
urgency](sidekiq_style_guide.md#job-urgency). It is
For Sidekiq job execution, the threshold depends on the
[job urgency](sidekiq/worker_attributes.md#job-urgency). It is
[currently](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/lib/sidekiq-helpers.libsonnet#L25-38)
**10 seconds** for high-urgency jobs and **5 minutes** for other
jobs.
......
......@@ -120,7 +120,7 @@ When there are 2 jobs being worked on at the same time, it is possible that the
In this example, `Worker B` is meant to set the updated status. But `Worker A` calls `#update_state` a little too late.
This can be avoided by utilizing either database locks or `Gitlab::ExclusiveLease`. This way, jobs will be
worked on one at a time. This also allows them to be marked as [idempotent](../sidekiq_style_guide.md#idempotent-jobs).
worked on one at a time. This also allows them to be marked as [idempotent](../sidekiq/idempotent_jobs.md).
### Retry mechanism handling
......