Metrics and Monitoring in GitLab

Overview

GitLab provides comprehensive metrics collection and monitoring through Prometheus integration, enabling teams to track system health, application performance, and business metrics in production environments.

What are Metrics?

Metrics are time-series data points that measure:

  • System health: CPU, memory, disk usage
  • Application performance: Response times, throughput, error rates
  • Business metrics: User registrations, transactions, revenue
  • Custom measurements: Any quantifiable aspect of your system

Prometheus Integration

GitLab natively integrates with Prometheus, the de facto standard for metrics in cloud-native environments.

Why Prometheus?

  • Native Kubernetes support: Built for containerized environments
  • Powerful query language: PromQL for flexible analysis
  • Multi-dimensional data: Labels for fine-grained filtering
  • Pull-based model: Services expose metrics, Prometheus scrapes
  • Active ecosystem: Wide tool and integration support

GitLab's Built-in Prometheus

GitLab bundles Prometheus in its Linux packages:

  • Prometheus services are on by default
  • Many GitLab dependencies are pre-configured to export metrics
  • Metrics available at /-/metrics endpoint

Accessing GitLab Metrics

Metrics Endpoint

GitLab exposes metrics at: https://your-gitlab-instance.com/-/metrics

Security Requirements:

  • Client IP must be explicitly allowed
  • Endpoint requires authentication
  • Configure in: Admin Area > Settings > Metrics and profiling

Example Metrics Response

    # HELP gitlab_cache_misses_total Cache read miss
    # TYPE gitlab_cache_misses_total counter
    gitlab_cache_misses_total{controller="Projects::MergeRequestsController",action="show"} 12345
    # HELP gitlab_transaction_duration_seconds Transaction duration
    # TYPE gitlab_transaction_duration_seconds histogram
    gitlab_transaction_duration_seconds_bucket{controller="Projects::IssuesController",action="index",le="0.1"} 1000
    gitlab_transaction_duration_seconds_bucket{controller="Projects::IssuesController",action="index",le="0.5"} 5000
    gitlab_transaction_duration_seconds_sum{controller="Projects::IssuesController",action="index"} 250.5
    gitlab_transaction_duration_seconds_count{controller="Projects::IssuesController",action="index"} 10000
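To make the response format concrete, here is a minimal sketch of reading one sample line into a name, a label dictionary, and a value. The function name `parse_sample` is hypothetical, and the sketch makes simplifying assumptions: it ignores optional timestamps and does not unescape label values. For real use, a maintained parser from a Prometheus client library is the safer choice.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: name="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value).

    Returns None for blank lines and for # HELP / # TYPE comment lines.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f'not a sample line: {line!r}')
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ''))
    return name, labels, float(raw_value)
```

Calling `parse_sample` on the `gitlab_cache_misses_total` line above yields the metric name, a dictionary with the `controller` and `action` labels, and the value as a float.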

Metric Types

1. Counter

Cumulative value that only increases:

    # Total number of requests
    http_requests_total{method="GET",endpoint="/api/users"} 12345

Use cases:

  • Request counts
  • Error counts
  • Task completions
  • Events processed

PromQL examples:

    # Rate of requests per second
    rate(http_requests_total[5m])

    # Total requests in last hour
    increase(http_requests_total[1h])
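The idea behind `rate()` and `increase()` can be sketched in a few lines. This is a simplified illustration, not Prometheus's implementation: the real functions also extrapolate to the edges of the time window. The function names and the sample data are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value is counted as new increase.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def counter_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None  # Not enough points to compute a rate.
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed
```

This is why a counter that restarts from zero does not produce a negative rate: the drop is read as a reset rather than as a decrease.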

2. Gauge

Value that can go up or down:

    # Current memory usage in bytes
    memory_usage_bytes{service="api"} 1073741824

Use cases:

  • Current memory/CPU usage
  • Queue depth
  • Active connections
  • Temperature readings

PromQL examples:

    # Current memory usage
    memory_usage_bytes

    # Average over 5 minutes
    avg_over_time(memory_usage_bytes[5m])

3. Histogram

Distribution of values in buckets:

    # Request duration histogram
    http_request_duration_seconds_bucket{le="0.1"} 1000
    http_request_duration_seconds_bucket{le="0.5"} 5000
    http_request_duration_seconds_bucket{le="1.0"} 8000
    http_request_duration_seconds_bucket{le="+Inf"} 10000
    http_request_duration_seconds_sum 2500
    http_request_duration_seconds_count 10000

Use cases:

  • Response time distributions
  • Request size distributions
  • Query durations

PromQL examples:

    # 95th percentile latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

    # Average request duration
    rate(http_request_duration_seconds_sum[5m])
      / rate(http_request_duration_seconds_count[5m])
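Quantiles from a histogram are estimates: the query finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below illustrates that idea on raw cumulative counts; it is a simplified approximation of what `histogram_quantile` does, and the function signature is an assumption made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies above the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Bucket counts from the histogram example above.
example = [(0.1, 1000), (0.5, 5000), (1.0, 8000), (math.inf, 10000)]
p75 = histogram_quantile(0.75, example)
p95 = histogram_quantile(0.95, example)
```

With the example counts, the 95th percentile falls into the `+Inf` bucket, so the estimate is capped at 1.0, the highest finite bound. This shows why bucket boundaries should extend past the latencies you want to measure.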

4. Summary

Pre-calculated quantiles:

    # Request duration summary
    http_request_duration_seconds{quantile="0.5"} 0.15
    http_request_duration_seconds{quantile="0.9"} 0.45
    http_request_duration_seconds{quantile="0.99"} 1.2
    http_request_duration_seconds_sum 2500
    http_request_duration_seconds_count 10000

Use cases:

  • When exact quantiles are required
  • Client-side quantile calculation
  • Lower cardinality than histograms

Custom Metrics

Node.js (prom-client)

    const express = require('express');
    const client = require('prom-client');

    const app = express();

    // Create a Registry
    const register = new client.Registry();

    // Add default metrics (CPU, memory, etc.)
    client.collectDefaultMetrics({ register });

    // Counter: Track total requests
    const httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'endpoint', 'status'],
      registers: [register],
    });

    // Gauge: Track active connections
    const activeConnections = new client.Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      registers: [register],
    });

    // Histogram: Track request duration
    const httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'endpoint'],
      buckets: [0.1, 0.5, 1, 2, 5],
      registers: [register],
    });

    // Instrument your application
    app.use((req, res, next) => {
      const start = Date.now();
      activeConnections.inc();

      res.on('finish', () => {
        activeConnections.dec();
        const duration = (Date.now() - start) / 1000;
        // Prefer the route template over the raw path to keep cardinality bounded
        const endpoint = req.route?.path || req.path;

        httpRequestsTotal.inc({
          method: req.method,
          endpoint,
          status: res.statusCode,
        });
        httpRequestDuration.observe({ method: req.method, endpoint }, duration);
      });

      next();
    });

    // Expose metrics endpoint
    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

    // The port is an arbitrary choice for this example
    app.listen(3000);

Python (prometheus_client)

    import time

    from flask import Flask, g, request
    from prometheus_client import Counter, Gauge, Histogram, make_wsgi_app
    from werkzeug.middleware.dispatcher import DispatcherMiddleware

    app = Flask(__name__)

    # Counter: Track total requests
    http_requests_total = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['method', 'endpoint', 'status']
    )

    # Gauge: Track active connections
    active_connections = Gauge(
        'active_connections',
        'Number of active connections'
    )

    # Histogram: Track request duration
    http_request_duration = Histogram(
        'http_request_duration_seconds',
        'HTTP request duration',
        ['method', 'endpoint'],
        buckets=[0.1, 0.5, 1, 2, 5]
    )

    # Instrument application
    @app.before_request
    def before_request():
        active_connections.inc()
        # Store the start time on the per-request context object
        g.start_time = time.time()

    @app.after_request
    def after_request(response):
        active_connections.dec()
        duration = time.time() - g.start_time
        # request.endpoint is None for unmatched routes (for example 404s)
        endpoint = request.endpoint or 'unknown'

        http_requests_total.labels(
            method=request.method,
            endpoint=endpoint,
            status=response.status_code
        ).inc()

        http_request_duration.labels(
            method=request.method,
            endpoint=endpoint
        ).observe(duration)

        return response

    # Add metrics endpoint
    app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
        '/metrics': make_wsgi_app()
    })

Go (prometheus/client_golang)

    package main

    import (
        "log"
        "net/http"
        "strconv"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        // Counter: Track total requests
        httpRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "endpoint", "status"},
        )

        // Gauge: Track active connections
        activeConnections = prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "active_connections",
                Help: "Number of active connections",
            },
        )

        // Histogram: Track request duration
        httpRequestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "HTTP request duration in seconds",
                Buckets: []float64{0.1, 0.5, 1, 2, 5},
            },
            []string{"method", "endpoint"},
        )
    )

    func init() {
        // Register metrics
        prometheus.MustRegister(httpRequestsTotal)
        prometheus.MustRegister(activeConnections)
        prometheus.MustRegister(httpRequestDuration)
    }

    // responseWriter wraps http.ResponseWriter to capture the status code
    type responseWriter struct {
        http.ResponseWriter
        statusCode int
    }

    func (rw *responseWriter) WriteHeader(code int) {
        rw.statusCode = code
        rw.ResponseWriter.WriteHeader(code)
    }

    // Middleware to track metrics
    func metricsMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            activeConnections.Inc()
            defer activeConnections.Dec()

            // Default to 200, which is what net/http sends if WriteHeader is never called
            rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
            next.ServeHTTP(rw, r)

            duration := time.Since(start).Seconds()

            // Note: r.URL.Path is the raw path; use a route template if paths contain IDs
            httpRequestsTotal.WithLabelValues(
                r.Method,
                r.URL.Path,
                strconv.Itoa(rw.statusCode),
            ).Inc()

            httpRequestDuration.WithLabelValues(
                r.Method,
                r.URL.Path,
            ).Observe(duration)
        })
    }

    // handleUsers is a placeholder handler for the example route
    func handleUsers(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("[]"))
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/api/users", handleUsers)

        // Expose metrics endpoint
        mux.Handle("/metrics", promhttp.Handler())

        // Apply metrics middleware
        log.Fatal(http.ListenAndServe(":8080", metricsMiddleware(mux)))
    }

Prometheus Configuration

Scrape Configuration

Configure Prometheus to scrape your services:

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      # GitLab metrics
      - job_name: 'gitlab'
        scheme: https
        metrics_path: '/-/metrics'
        static_configs:
          - targets: ['your-gitlab-instance.com']

      # Custom application metrics
      - job_name: 'user-api'
        metrics_path: '/metrics'
        scrape_interval: 10s
        static_configs:
          - targets: ['user-api:8080']

      # Kubernetes service discovery
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Use the prometheus.io/path annotation as the metrics path
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Use the prometheus.io/port annotation as the scrape port
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

Kubernetes Annotations

Enable automatic discovery:

    apiVersion: v1
    kind: Pod
    metadata:
      name: user-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api
          image: user-api:latest
          ports:
            - containerPort: 8080
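The last relabel rule in the scrape configuration is the least obvious one: it joins the pod address and the `prometheus.io/port` annotation with `;`, matches the result against the regex, and rewrites `__address__` to use the annotated port. The sketch below replays that rule outside Prometheus to show its effect. The function name and the IP addresses are hypothetical, and `re.fullmatch` is used because Prometheus anchors relabel regexes at both ends.

```python
import re

# Pattern copied from the relabel rule in the scrape configuration.
ADDRESS_RE = re.compile(r'([^:]+)(?::\d+)?;(\d+)')


def rewrite_address(address, port_annotation):
    """Return the scrape address after applying the port annotation."""
    # Prometheus joins the source_labels values with ';' before matching.
    joined = f'{address};{port_annotation}'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: the replace action leaves the target label unchanged.
        return address
    # Equivalent of the "$1:$2" replacement.
    return f'{match.group(1)}:{match.group(2)}'
```

A pod discovered at `10.0.0.5:9999` with the annotation `prometheus.io/port: "8080"` is therefore scraped at `10.0.0.5:8080`, while a pod without the annotation keeps its discovered address.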

PromQL Queries

Basic Queries

    # Current value
    http_requests_total

    # Filter by labels
    http_requests_total{method="GET"}

    # Multiple label filters
    http_requests_total{method="GET",status="200"}

    # Label matching operators
    http_requests_total{status=~"5.."}   # Regex match
    http_requests_total{status!="200"}   # Not equal

Rate Calculations

    # Requests per second (average over 5 minutes)
    rate(http_requests_total[5m])

    # Increase over 1 hour
    increase(http_requests_total[1h])

    # Instantaneous rate (per-second, based on the last two samples in the window)
    irate(http_requests_total[5m])

Aggregations

    # Sum across all instances
    sum(http_requests_total)

    # Sum by label
    sum by (method) (http_requests_total)

    # Average
    avg(memory_usage_bytes)

    # Maximum
    max(response_time_seconds)

    # Count healthy instances
    count(up == 1)

Advanced Queries

    # Error rate (4xx and 5xx responses)
    sum(rate(http_requests_total{status=~"4..|5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

    # 95th percentile latency
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    )

    # Top 10 endpoints by request rate
    topk(10, sum by (endpoint) (rate(http_requests_total[5m])))

    # Apdex score (Application Performance Index)
    # Buckets are cumulative, so the le="2" bucket already includes the le="0.5" requests
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      + sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
    ) / 2
      / sum(rate(http_request_duration_seconds_count[5m]))
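The Apdex arithmetic is easy to get wrong because histogram buckets are cumulative. Apdex is defined as (satisfied + tolerating / 2) / total. With cumulative counts, the tolerating requests are the difference between the two buckets, which simplifies to the sum of both buckets divided by two. A minimal sketch with hypothetical counts:

```python
def apdex(satisfied_le, tolerated_le, total):
    """Apdex score from cumulative histogram bucket counts.

    satisfied_le: requests at or below the target threshold.
    tolerated_le: requests at or below the tolerable threshold
                  (cumulative, so it already includes satisfied_le).
    total:        all requests.
    """
    if total == 0:
        return None
    return (satisfied_le + tolerated_le) / 2 / total


# Hypothetical counts: 5000 requests within 0.5s, 9000 within 2s, 10000 total.
score = apdex(5000, 9000, 10000)
# Same result from the textbook definition: satisfied + tolerating / 2.
textbook = (5000 + (9000 - 5000) / 2) / 10000
```

Both expressions give the same score, which confirms that the cumulative form and the textbook definition agree.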

Metric Cardinality Management

The Cardinality Problem

High cardinality = many unique label combinations:

    # Bad: User ID as label (millions of unique values)
    http_requests_total{user_id="12345"} 1

    # Good: Aggregate user metrics separately
    user_requests_total 1000000

Best Practices

  1. Limit label values: Keep the number of label combinations per metric below about 10,000
  2. Avoid high-cardinality labels: User IDs, timestamps, UUIDs
  3. Use aggregation: Pre-aggregate high-cardinality data
  4. Drop unnecessary labels: Only essential dimensions
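A quick way to reason about these practices is to estimate the worst-case number of series a metric can produce: every label multiplies the count by its number of distinct values. The sketch below is only an upper-bound estimate with hypothetical label values; real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


bounded = {
    'method': ['GET', 'POST', 'PUT', 'DELETE', 'PATCH'],
    'endpoint': [f'/route/{i}' for i in range(20)],  # 20 route templates
    'status': ['200', '201', '204', '400', '401', '404', '500', '503'],
}
# Adding a user ID label multiplies the total by the number of users.
unbounded = dict(bounded, user_id=range(1_000_000))

bounded_total = worst_case_series(bounded)
unbounded_total = worst_case_series(unbounded)
```

Here the bounded label set stays at 800 series, while a single `user_id` label with a million values raises the bound to 800 million.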

Example: Managing Cardinality

    // Bad: High cardinality
    httpRequestsTotal.inc({
      method: req.method,
      endpoint: req.url,      // Raw URL: a new value for every ID or query string
      userId: req.user.id,    // Millions of users!
      timestamp: Date.now(),  // Always unique!
    });

    // Good: Bounded cardinality
    httpRequestsTotal.inc({
      method: req.method,
      endpoint: req.route?.path || 'unknown',  // Template, not actual URL
      status: res.statusCode,                  // Limited values
    });

    // Track user metrics separately
    userRequestsTotal.inc();

Monitoring Best Practices

1. RED Method (Rate, Errors, Duration)

Essential metrics for every service:

    # Rate: Requests per second
    sum(rate(http_requests_total[5m]))

    # Errors: Error rate
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

    # Duration: 95th percentile latency
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    )

2. USE Method (Utilization, Saturation, Errors)

For resource monitoring:

    # Utilization: fraction of CPU time not spent idle, per instance
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

    # Saturation: Queue depth
    queue_depth

    # Errors: Errors per second
    sum(rate(errors_total[5m]))

3. Golden Signals (Google SRE)

Four key metrics:

    # Latency
    histogram_quantile(0.95, rate(request_duration_bucket[5m]))

    # Traffic
    sum(rate(requests_total[5m]))

    # Errors
    sum(rate(requests_total{status=~"5.."}[5m]))

    # Saturation
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
      / node_memory_MemTotal_bytes

Alerting on Metrics

Alert Rules

Define alerts in Prometheus:

    # prometheus_rules.yml
    groups:
      - name: api_alerts
        interval: 30s
        rules:
          # High error rate
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is {{ $value | humanizePercentage }}"

          # High latency
          - alert: HighLatency
            expr: |
              histogram_quantile(0.95,
                rate(http_request_duration_seconds_bucket[5m])
              ) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High latency detected"
              description: "P95 latency is {{ $value }}s"

          # Service down
          - alert: ServiceDown
            expr: up{job="user-api"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Service is down"
              description: "{{ $labels.instance }} is unreachable"

          # High memory usage
          - alert: HighMemoryUsage
            expr: |
              (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
                / node_memory_MemTotal_bytes > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage"
              description: "Memory usage is {{ $value | humanizePercentage }}"
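The `for` clause means an alert does not fire the moment its expression matches. It first becomes pending and fires only after the condition has held continuously for the configured duration; any evaluation where the condition is false resets the timer. The sketch below simulates that behavior. It is a simplified model with hypothetical names, not Prometheus's rule engine.

```python
def alert_states(breaches, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    breaches:      one boolean per evaluation, True when expr matched.
    eval_interval: seconds between evaluations.
    for_duration:  seconds the condition must hold before firing.
    """
    states = []
    active_since = None  # Index of the evaluation where the breach began.
    for i, breached in enumerate(breaches):
        if not breached:
            active_since = None  # A single clean evaluation resets the timer.
            states.append('inactive')
            continue
        if active_since is None:
            active_since = i
        held_for = (i - active_since) * eval_interval
        states.append('firing' if held_for >= for_duration else 'pending')
    return states


# Evaluations every 30s with a 60s hold: a brief recovery restarts the wait.
timeline = alert_states([True, True, False, True, True, True], 30, 60)
```

In this timeline the alert is pending twice, returns to inactive after the recovery, and fires only on the last evaluation. This is how `for` filters out short spikes.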

Integration with GitLab

Configure alert notifications in GitLab:

  1. Go to Settings > Monitor > Alerts
  2. Add Prometheus endpoint
  3. Configure notification channels (Slack, PagerDuty, email)

Performance Optimization

1. Recording Rules

Pre-compute expensive queries:

    # prometheus_rules.yml
    groups:
      - name: recording_rules
        interval: 30s
        rules:
          # Pre-compute request rate
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))

          # Pre-compute error rate
          - record: job:http_errors:rate5m
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
                / sum by (job) (rate(http_requests_total[5m]))

          # Pre-compute p95 latency
          - record: job:http_latency:p95
            expr: |
              histogram_quantile(0.95,
                sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
              )

2. Metric Relabeling

Reduce storage overhead:

    # prometheus.yml
    scrape_configs:
      - job_name: 'user-api'
        static_configs:
          - targets: ['user-api:8080']
        metric_relabel_configs:
          # Drop high-cardinality metrics
          - source_labels: [__name__]
            regex: 'debug_.*'
            action: drop

          # Copy old_label to new_label (add a labeldrop for old_label to complete a rename)
          - source_labels: [old_label]
            target_label: new_label

          # Drop specific labels
          - regex: 'unnecessary_label'
            action: labeldrop
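The `drop` and `labeldrop` actions behave differently: `drop` discards a whole sample when the joined source labels match the regex, while `labeldrop` removes only the labels whose names match and keeps the sample. The sketch below mimics those two rules on a single sample's label set. It is an illustration with a hypothetical function name and covers only these two actions.

```python
import re


def apply_metric_relabel(sample_labels):
    """Apply the drop and labeldrop rules to one sample's labels.

    Returns the remaining labels, or None if the sample is dropped.
    """
    # action: drop on __name__ matching 'debug_.*' (fully anchored).
    if re.fullmatch(r'debug_.*', sample_labels.get('__name__', '')):
        return None
    # action: labeldrop removes labels whose *name* matches the regex.
    return {
        name: value
        for name, value in sample_labels.items()
        if not re.fullmatch(r'unnecessary_label', name)
    }
```

A sample named `debug_heap_dump` disappears entirely before storage, whereas `http_requests_total` is kept and loses only its `unnecessary_label` label.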

3. Retention Policies

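Configure data retention with command-line flags when starting Prometheus (these settings are not part of the global block in prometheus.yml):

    # Keep data for 15 days
    prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d

    # Or limit by size
    prometheus --config.file=prometheus.yml --storage.tsdb.retention.size=50GB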
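To choose a retention value, it helps to estimate the disk space it implies. The Prometheus documentation gives a rough formula: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with samples averaging about 1 to 2 bytes after compression. The sketch below applies that formula; the function name, the series count, and the 2-byte default are assumptions chosen for illustration.

```python
def estimated_disk_bytes(retention_days, active_series, scrape_interval_s,
                         bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * samples/second * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 100,000 active series scraped every 15 seconds.
fifteen_days = estimated_disk_bytes(15, 100_000, 15)
# Same workload scraped every 60 seconds instead.
slower_scrape = estimated_disk_bytes(15, 100_000, 60)
```

For this workload the estimate is about 17 GB for 15 days, and quadrupling the scrape interval cuts it to roughly a quarter. Treat the result as an order-of-magnitude guide, since actual usage depends on how well the data compresses.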

Visualization with Grafana

Connecting GitLab to Grafana

  1. Add Prometheus data source

    • In Grafana, navigate to Configuration > Data Sources
    • Add Prometheus
    • URL: http://prometheus:9090
  2. Create dashboards

    • Import GitLab dashboard templates
    • Build custom dashboards

Example Dashboard Panels

    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m]))",
          "legendFormat": "Requests/sec"
        }
      ],
      "type": "graph"
    }

Cost Optimization

Strategies

  1. Sampling: Scrape less often where coarse resolution is acceptable
  2. Aggregation: Use recording rules
  3. Retention: Keep recent data only
  4. Cardinality control: Limit label values
  5. Drop unnecessary metrics: Focus on essentials

Example: Cost-Effective Configuration

    # prometheus.yml
    global:
      # Longer default scrape interval for non-critical services
      scrape_interval: 60s

    scrape_configs:
      # Critical services: frequent scraping
      - job_name: 'production-api'
        scrape_interval: 15s
        static_configs:
          - targets: ['api:8080']

      # Non-critical: infrequent scraping
      - job_name: 'batch-jobs'
        scrape_interval: 5m
        static_configs:
          - targets: ['batch:8080']
