Added documentation for GitLab Metrics

Commit 790c6868 authored Jan 18, 2016 by Yorick Peterse
5 changed files
--- a/doc/README.md
+++ b/doc/README.md
@@ -49,6 +49,13 @@
 - [Test Clojure applications](ci/examples/test-clojure-application.md)
 - Help your favorite programming language and GitLab by sending a merge request with a guide for that language.

+## GitLab Metrics
+
+- [Introduction](metrics/introduction.md)
+- [GitLab Configuration](metrics/gitlab_configuration.md)
+- [InfluxDB Configuration](metrics/influxdb_configuration.md)
+- [InfluxDB Schema](metrics/influxdb_schema.md)
+
 ## Administrator documentation

 - [Custom git hooks](hooks/custom_hooks.md) Custom git hooks (on the filesystem) for when web hooks aren't enough.

--- a/doc/metrics/gitlab_configuration.md
+++ b/doc/metrics/gitlab_configuration.md
+# GitLab Configuration
+
+By default GitLab Metrics is disabled. To enable GitLab Metrics and change any
+of its settings, open a web browser and navigate to
+`http://YOUR_GITLAB_HOST/admin/application_settings`. The settings can be found
+in the "Metrics" section. A restart of all GitLab processes is required for any
+changes to take effect.
+
+## Pending Migrations
+
+When any migrations are pending the metrics are disabled until the migrations
+have been performed.
--- a/doc/metrics/influxdb_configuration.md
+++ b/doc/metrics/influxdb_configuration.md
+# InfluxDB Configuration
+
+The default settings provided by InfluxDB are not sufficient for a high traffic
+GitLab environment. The settings discussed in this document are based on the
+settings GitLab uses for GitLab.com. Depending on your own needs you may need to
+adjust them further.
+
+## Requirements
+
+* InfluxDB 0.9 or newer
+* A fairly modern version of Linux
+* At least 4GB of RAM
+* At least 10GB of storage for InfluxDB data
+
+Note that the RAM and storage requirements can differ greatly depending on the
+amount of data received/stored. To limit the amount of stored data users can
+look into [InfluxDB Retention Policies][influxdb-retention].
+
+## InfluxDB Server Settings
+
+Since InfluxDB has many settings that users may wish to customize themselves
+(e.g. what port to run InfluxDB on) we'll only cover the essentials.
+
+### Storage Engine
+
+InfluxDB comes with different storage engines and as of InfluxDB 0.9 a new
+storage engine is available called "tsm1". All users _must_ use the new tsm1
+storage engine (this will be the default engine in upcoming InfluxDB releases).
+
+### Admin Panel
+
+Production environments should have the InfluxDB admin panel _disabled_. This
+feature can be disabled by adding the following to your InfluxDB configuration
+file:
+
+    [admin]
+      enabled = false
+
+### HTTP
+
+HTTP is required when using the InfluxDB CLI or other tools such as Grafana,
+thus it should be enabled. When enabling make sure to _also_ enable
+authentication:
+
+    [http]
+      enabled = true
+      auth-enabled = true
+
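With `auth-enabled = true`, clients have to send credentials with every HTTP request. The following is a minimal sketch of building an authenticated query request with Python's standard library. The host, port 8086, user name, and password are placeholder assumptions, and the request is only constructed, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(base_url, database, query, username, password):
    """Build (but do not send) a GET request for InfluxDB's /query endpoint."""
    params = urllib.parse.urlencode({"db": database, "q": query})
    request = urllib.request.Request(f"{base_url}/query?{params}")
    # HTTP Basic authentication: base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


# Placeholder host and credentials, for illustration only.
request = build_query_request(
    "http://localhost:8086", "gitlab", "SHOW MEASUREMENTS", "admin", "secret"
)
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request.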
+### UDP
+
+GitLab writes data to InfluxDB via UDP and thus this must be enabled. Enabling
+UDP can be done using the following settings:
+
+    [udp]
+      enabled = true
+      bind-address = ":8089"
+      database = "gitlab"
+      batch-size = 1000
+      batch-pending = 5
+      batch-timeout = 1s
+      read-buffer = 209715200
+
+This does the following:
+
+1. Enable UDP and bind it to port 8089 for all addresses.
+2. Store any data received in the "gitlab" database.
+3. Define a batch of points to be 1000 points in size, allow a maximum of 5
+   pending batches, and flush a batch automatically after 1 second even if it
+   is not yet full.
+4. Define a UDP read buffer size of 200 MB.
+
+One of the most important settings here is the UDP read buffer size: if this
+value is set too low, packets will be dropped. You must also make sure the OS
+buffer size is set to the same value, as the default value is almost never
+enough.
+
+To set the OS buffer size to 200 MB on Linux you can run the following command:
+
+    sysctl -w net.core.rmem_max=209715200
+
+To make this permanent, add the following to `/etc/sysctl.conf` and restart the
+server:
+
+    net.core.rmem_max=209715200
+
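The byte values used above can be derived and checked with a short script. This is a sketch that assumes "MB" means 1024 * 1024 bytes (which matches 209715200) and that the script runs on Linux, where the current limit is exposed under `/proc/sys/net/core/rmem_max`. The function names are illustrative.

```python
from pathlib import Path

# Linux-only location of the current net.core.rmem_max value.
RMEM_MAX_PATH = Path("/proc/sys/net/core/rmem_max")


def mib_to_bytes(mib):
    """Convert a size in units of 1024 * 1024 bytes to bytes."""
    return mib * 1024 * 1024


def os_buffer_is_sufficient(os_rmem_max, influx_read_buffer):
    """Return True when the OS limit is at least the InfluxDB read buffer."""
    return os_rmem_max >= influx_read_buffer


def read_rmem_max(path=RMEM_MAX_PATH):
    """Read the current OS receive buffer limit in bytes."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    wanted = mib_to_bytes(200)
    print(wanted)
    if RMEM_MAX_PATH.exists():
        print(os_buffer_is_sufficient(read_rmem_max(), wanted))
```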
+It is **very important** to make sure the buffer sizes are large enough to
+handle all data sent to InfluxDB as otherwise you _will_ lose data. The above
+buffer sizes are based on the traffic for GitLab.com. Depending on the amount of
+traffic users may be able to use a smaller buffer size, but we highly recommend
+using _at least_ 100 MB.
+
+When enabling UDP users should take care to not expose the port to the public as
+doing so will allow anybody to write data into your InfluxDB database (as
+InfluxDB's UDP protocol doesn't support authentication). We recommend either
+whitelisting the allowed IP addresses/ranges, or setting up a VLAN and only
+allowing traffic from members of said VLAN.
+
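To confirm that the UDP listener accepts data from an allowed host, a test point can be sent by hand. The sketch below assumes InfluxDB's line protocol format (`measurement,tag=value field=value`) and uses simple names that need no escaping. The `host` tag, the target address, and the helper names are illustrative assumptions, not necessarily what GitLab itself sends.

```python
import socket


def format_point(measurement, fields, tags=None):
    """Format one point as a line protocol string (no escaping is done)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted((tags or {}).items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part}"


def send_point(line, host, port=8089):
    """Send one line protocol string to InfluxDB over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    point = format_point("rails_memory_usage", {"value": 1024}, {"host": "web1"})
    print(point)
    # Placeholder address; replace with the InfluxDB server's address.
    send_point(point, "127.0.0.1")
```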
+[influxdb-retention]: https://docs.influxdata.com/influxdb/v0.9/query_language/database_management/#retention-policy-management
--- a/doc/metrics/influxdb_schema.md
+++ b/doc/metrics/influxdb_schema.md
+# InfluxDB Schema
+
+The following measurements are currently stored in InfluxDB:
+
+* `PROCESS_file_descriptors`
+* `PROCESS_gc_statistics`
+* `PROCESS_memory_usage`
+* `PROCESS_method_calls`
+* `PROCESS_object_counts`
+* `PROCESS_transactions`
+* `PROCESS_views`
+
+Here `PROCESS` is replaced with either "rails" or "sidekiq" depending on the
+process type. In all series any form of duration is stored in milliseconds.
+
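As an illustration of the naming scheme, the concrete measurement names for a process type can be generated as follows. This is a sketch based only on the list above, and the function name is hypothetical.

```python
MEASUREMENT_SUFFIXES = [
    "file_descriptors",
    "gc_statistics",
    "memory_usage",
    "method_calls",
    "object_counts",
    "transactions",
    "views",
]

PROCESS_TYPES = ("rails", "sidekiq")


def measurement_names(process_type):
    """Return the measurement names for a given process type."""
    if process_type not in PROCESS_TYPES:
        raise ValueError(f"unknown process type: {process_type!r}")
    return [f"{process_type}_{suffix}" for suffix in MEASUREMENT_SUFFIXES]


for name in measurement_names("sidekiq"):
    print(name)
```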
+## PROCESS_file_descriptors
+
+This measurement contains the number of open file descriptors over time. The
+value field `value` contains the number of descriptors.
+
+## PROCESS_gc_statistics
+
+This measurement contains Ruby garbage collection statistics such as the number
+of minor/major GC runs (relative to the last sampling interval), the time spent
+in garbage collection cycles, and all fields/values returned by `GC.stat`.
+
+## PROCESS_memory_usage
+
+This measurement contains the process' memory usage (in bytes) over time. The
+value field `value` contains the number of bytes.
+
+## PROCESS_method_calls
+
+This measurement contains the methods called during a transaction along with
+their durations and the name of the transaction action that invoked the method
+(if available). The method call duration is stored in the value field `duration`
+while the method name is stored in the tag `method`. The tag `action` contains
+the full name of the transaction action. Both the `method` and `action` tags
+are in the following format:
+
+    ClassName#method_name
+
+For example, a method called by the `show` method in the `UsersController` class
+would have `action` set to `UsersController#show`.
+
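When post-processing query results, a tag value in this format can be split into its class and method parts. This is a minimal sketch and the helper name is hypothetical.

```python
def split_action(name):
    """Split 'ClassName#method_name' into (class_name, method_name)."""
    class_name, separator, method_name = name.partition("#")
    if not separator or not class_name or not method_name:
        raise ValueError(f"expected ClassName#method_name, got {name!r}")
    return class_name, method_name


print(split_action("UsersController#show"))
```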
+## PROCESS_object_counts
+
+This measurement is used to store the number of retained Ruby objects per
+class. The number of objects is stored in the `count` value field while the
+class name is stored in the `type` tag.
+
+## PROCESS_transactions
+
+This measurement is used to store basic transaction details such as the time it
+took to complete a transaction, how much time was spent in SQL queries, etc. The
+following value fields are available:
+
+* `duration`: the total duration of the transaction.
+* `allocated_memory`: the number of bytes allocated while the transaction was
+  running. This value is only reliable when using single-threaded application
+  servers.
+* `method_duration`: the total time spent in method calls.
+* `sql_duration`: the total time spent in SQL queries.
+* `view_duration`: the total time spent in views.
+
+## PROCESS_views
+
+This measurement is used to store view rendering timings for a transaction. The
+following value fields are available:
+
+* `duration`: the rendering time of the view.
+* `view`: the path of the view, relative to the application's root directory.
+
+The `action` tag contains the action name of the transaction that rendered the
+view.
--- a/doc/metrics/introduction.md
+++ b/doc/metrics/introduction.md
+# Introduction to GitLab Metrics
+
+GitLab comes with its own application performance measuring system as of GitLab
+8.4, simply called "GitLab Metrics". GitLab Metrics is available in both the
+Community and Enterprise editions.
+
+GitLab Metrics makes it possible to measure a wide variety of statistics
+including (but not limited to):
+
+* The time it took to complete a transaction (a web request or Sidekiq job).
+* The time spent in running SQL queries and rendering HAML views.
+* The time spent executing (instrumented) Ruby methods.
+* Ruby object allocations, and retained objects in particular.
+* System statistics such as the process' memory usage and open file descriptors.
+* Ruby garbage collection statistics.
+
+Metrics data is written to [InfluxDB][influxdb] over [UDP][influxdb-udp]. Stored
+data can be visualized using [Grafana][grafana] or any other application that
+supports reading data from InfluxDB. Alternatively data can be queried using the
+InfluxDB CLI.
+
+## Metric Types
+
+Two types of metrics are collected:
+
+1. Transaction specific metrics.
+2. Sampled metrics, collected at a certain interval in a separate thread.
+
+### Transaction Metrics
+
+Transaction metrics are metrics that can be associated with a single
+transaction. This includes statistics such as the transaction duration, timings
+of any executed SQL queries, time spent rendering HAML views, etc. These metrics
+are collected for every Rack request and Sidekiq job processed.
+
+### Sampled Metrics
+
+Sampled metrics are metrics that can't be associated with a single transaction.
+Examples include garbage collection statistics and retained Ruby objects. These
+metrics are collected at a regular interval. This interval is made up out of two
+parts:
+
+1. A user defined interval.
+2. A randomly generated offset added on top of the interval. The same offset
+   can't be used twice in a row.
+
+The actual interval can be anywhere between half of the defined interval and one
+and a half times the defined interval. For example, for a user defined interval
+of 15 seconds the actual interval can be anywhere between 7.5 and 22.5 seconds.
+The interval is re-generated for every sampling run instead of being generated
+once and re-used for the duration of the process' lifetime.
+
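The interval calculation described above can be sketched as follows. The text only states the range and the "not twice in a row" rule, so the discrete set of offset steps used here (fractions from -50% to +50% in 10% increments) and the function name are assumptions made for illustration.

```python
import random

# Assumed offset steps: fractions of the user defined interval.
OFFSET_STEPS = [step / 10.0 for step in range(-5, 6)]


def next_sleep_interval(interval, last_step=None, rng=random):
    """Return (actual_interval, step) with a step different from last_step."""
    step = rng.choice(OFFSET_STEPS)
    while step == last_step:
        step = rng.choice(OFFSET_STEPS)
    return interval + interval * step, step


if __name__ == "__main__":
    last = None
    for _ in range(5):
        # The interval is re-generated for every sampling run.
        actual, last = next_sleep_interval(15, last)
        print(actual)
```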
+[influxdb]: https://influxdata.com/time-series-platform/influxdb/
+[influxdb-udp]: https://docs.influxdata.com/influxdb/v0.9/write_protocols/udp/
+[grafana]: http://grafana.org/