Metrics and Monitoring in GitLab
Overview
GitLab provides comprehensive metrics collection and monitoring through Prometheus integration, enabling teams to track system health, application performance, and business metrics in production environments.
What are Metrics?
Metrics are time-series data points that measure:
- System health: CPU, memory, disk usage
- Application performance: Response times, throughput, error rates
- Business metrics: User registrations, transactions, revenue
- Custom measurements: Any quantifiable aspect of your system
Prometheus Integration
GitLab natively integrates with Prometheus, the de facto standard for metrics in cloud-native environments.
Why Prometheus?
- Native Kubernetes support: Built for containerized environments
- Powerful query language: PromQL for flexible analysis
- Multi-dimensional data: Labels for fine-grained filtering
- Pull-based model: Services expose metrics, Prometheus scrapes
- Active ecosystem: Wide tool and integration support
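The pull-based model can be illustrated with a minimal sketch that uses only the Python standard library: the service renders its counters as text and Prometheus fetches them over HTTP. The `REQUESTS_TOTAL` dictionary, the `serve` helper, and port 8000 are assumptions made for illustration; a real service would normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter storage, keyed by HTTP method.
REQUESTS_TOTAL = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve metrics; the port is an arbitrary example value."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the service never pushes data, it only answers when scraped.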
GitLab's Built-in Prometheus
GitLab bundles Prometheus in its Linux packages:
- Prometheus services are on by default
- Many GitLab dependencies are pre-configured to export metrics
- Metrics available at the /-/metrics endpoint
Accessing GitLab Metrics
Metrics Endpoint
GitLab exposes metrics at: https://your-gitlab-instance.com/-/metrics
Security Requirements:
- Client IP must be explicitly allowed
- Endpoint requires authentication
- Configure in: Admin Area > Settings > Metrics and profiling
Example Metrics Response
# HELP gitlab_cache_misses_total Cache read miss
# TYPE gitlab_cache_misses_total counter
gitlab_cache_misses_total{controller="Projects::MergeRequestsController",action="show"} 12345
# HELP gitlab_transaction_duration_seconds Transaction duration
# TYPE gitlab_transaction_duration_seconds histogram
gitlab_transaction_duration_seconds_bucket{controller="Projects::IssuesController",action="index",le="0.1"} 1000
gitlab_transaction_duration_seconds_bucket{controller="Projects::IssuesController",action="index",le="0.5"} 5000
gitlab_transaction_duration_seconds_sum{controller="Projects::IssuesController",action="index"} 250.5
gitlab_transaction_duration_seconds_count{controller="Projects::IssuesController",action="index"} 10000
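To make the response format concrete, the following sketch parses a single sample line into its metric name, labels, and value. It is a simplified reader written for illustration, not a full parser: it assumes no trailing timestamp and does not unescape label values. The names `parse_sample`, `SAMPLE_RE`, and `LABEL_RE` are hypothetical.

```python
import re

# metric_name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# label_name="label value" pairs inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value), or None for comment and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)
```

Lines starting with `# HELP` and `# TYPE` carry metadata rather than samples, which is why the sketch skips them.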
Metric Types
1. Counter
Cumulative value that only increases:
# Total number of requests
http_requests_total{method="GET",endpoint="/api/users"} 12345
Use cases:
- Request counts
- Error counts
- Task completions
- Events processed
PromQL examples:
# Rate of requests per second
rate(http_requests_total[5m])

# Total requests in last hour
increase(http_requests_total[1h])
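As a rough illustration of what rate() and increase() compute, the sketch below derives both from raw (timestamp, value) counter samples and treats any drop in value as a counter reset. It is a simplification: Prometheus additionally extrapolates to the edges of the query window, which this sketch omits. The function names mirror PromQL only for readability.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, treating drops as resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter that goes down was reset (for example by a restart),
        # so the whole current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total


def rate(samples):
    """Average per-second increase over the sampled window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed
```

For samples `[(0, 100), (60, 160), (120, 220)]` the increase is 120 over 120 seconds, so the rate is 1 request per second.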
2. Gauge
Value that can go up or down:
# Current memory usage in bytes
memory_usage_bytes{service="api"} 1073741824
Use cases:
- Current memory/CPU usage
- Queue depth
- Active connections
- Temperature readings
PromQL examples:
# Current memory usage
memory_usage_bytes

# Average over 5 minutes
avg_over_time(memory_usage_bytes[5m])
3. Histogram
Distribution of values in buckets:
# Request duration histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 5000
http_request_duration_seconds_bucket{le="1.0"} 8000
http_request_duration_seconds_bucket{le="+Inf"} 10000
http_request_duration_seconds_sum 2500
http_request_duration_seconds_count 10000
Use cases:
- Response time distributions
- Request size distributions
- Query durations
PromQL examples:
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
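The quantile estimate can be sketched in plain Python. The function below applies linear interpolation within the bucket that contains the target rank, which is the general approach histogram_quantile takes for classic histograms; it works on raw cumulative counts rather than rates and assumes 0 < q < 1. `BUCKETS` reuses the counts from the histogram example above.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Observations above the last finite bound cannot be located,
                # so the last finite bound is returned.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


BUCKETS = [(0.1, 1000), (0.5, 5000), (1.0, 8000), (math.inf, 10000)]
```

With these buckets, any quantile above 0.8 falls into the +Inf bucket and is reported as 1.0, which shows why bucket boundaries should cover the latencies you care about.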
4. Summary
Pre-calculated quantiles:
# Request duration summary
http_request_duration_seconds{quantile="0.5"} 0.15
http_request_duration_seconds{quantile="0.9"} 0.45
http_request_duration_seconds{quantile="0.99"} 1.2
http_request_duration_seconds_sum 2500
http_request_duration_seconds_count 10000
Use cases:
- When accurate quantiles are required for a single instance (summary quantiles cannot be meaningfully aggregated across instances)
- Client-side quantile calculation
- Lower cardinality than histograms
Custom Metrics
Node.js (prom-client)
const express = require('express');
const client = require('prom-client');

const app = express();

// Create a Registry
const register = new client.Registry();

// Add default metrics (CPU, memory, etc.)
client.collectDefaultMetrics({ register });

// Counter: Track total requests
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register],
});

// Gauge: Track active connections
const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [register],
});

// Histogram: Track request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register],
});

// Instrument your application
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the route template over the raw path to keep cardinality bounded
    const endpoint = req.route?.path || req.path;
    httpRequestsTotal.inc({
      method: req.method,
      endpoint,
      status: res.statusCode,
    });
    httpRequestDuration.observe({ method: req.method, endpoint }, duration);
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Port 3000 is an example value
app.listen(3000);
Python (prometheus_client)
import time

from flask import Flask, g, request
from prometheus_client import Counter, Gauge, Histogram, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)

# Counter: Track total requests
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Gauge: Track active connections
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Histogram: Track request duration
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1, 2, 5]
)

# Instrument application
@app.before_request
def before_request():
    active_connections.inc()
    # Store the start time on the per-request context object
    g.start_time = time.time()

@app.after_request
def after_request(response):
    active_connections.dec()
    duration = time.time() - g.start_time
    # request.endpoint is None when no route matched (for example, a 404)
    endpoint = request.endpoint or 'unknown'
    http_requests_total.labels(
        method=request.method,
        endpoint=endpoint,
        status=response.status_code
    ).inc()
    http_request_duration.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)
    return response

# Add metrics endpoint
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})
Go (prometheus/client_golang)
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: Track total requests
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Gauge: Track active connections
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Histogram: Track request duration
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{0.1, 0.5, 1, 2, 5},
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	// Register metrics
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(activeConnections)
	prometheus.MustRegister(httpRequestDuration)
}

// responseWriter wraps http.ResponseWriter to record the status code
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// Middleware to track metrics
func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Wrap response writer to capture status code
		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(rw, r)

		duration := time.Since(start).Seconds()
		// Note: r.URL.Path is the raw path; use a route template if paths contain IDs
		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			strconv.Itoa(rw.statusCode),
		).Inc()
		httpRequestDuration.WithLabelValues(
			r.Method,
			r.URL.Path,
		).Observe(duration)
	})
}

// handleUsers is a placeholder handler for the example route
func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("[]"))
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/users", handleUsers)

	// Expose metrics endpoint
	mux.Handle("/metrics", promhttp.Handler())

	// Apply metrics middleware
	http.ListenAndServe(":8080", metricsMiddleware(mux))
}
Prometheus Configuration
Scrape Configuration
Configure Prometheus to scrape your services:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # GitLab metrics (target the GitLab instance itself, not Prometheus on port 9090)
  - job_name: 'gitlab'
    scheme: https
    metrics_path: '/-/metrics'
    static_configs:
      - targets: ['your-gitlab-instance.com']

  # Custom application metrics
  - job_name: 'user-api'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['user-api:8080']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the prometheus.io/port annotation as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
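The last relabel rule is the least obvious part of this configuration, so here is a small Python sketch of what it does to a pod address. It relies on two documented relabeling behaviors: multiple source_labels are joined with ";" before matching, and the regex is anchored at both ends. The function name `rewrite_address` and the sample addresses are hypothetical.

```python
import re

# Same pattern as the relabel rule; re.fullmatch reproduces the anchoring
# that Prometheus applies to relabel regexes.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Return the scrape address after applying the port annotation."""
    # source_labels values are joined with ';' before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match (for example, no port annotation): the label is unchanged.
        return address
    # Equivalent of the "$1:$2" replacement: host from group 1, port from group 2.
    return f"{match.group(1)}:{match.group(2)}"
```

The effect is that the port from the prometheus.io/port annotation replaces whatever port service discovery found, while pods without the annotation keep their discovered address.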
Kubernetes Annotations
Enable automatic discovery:
apiVersion: v1
kind: Pod
metadata:
  name: user-api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: api
      image: user-api:latest
      ports:
        - containerPort: 8080
PromQL Queries
Basic Queries
# Current value
http_requests_total

# Filter by labels
http_requests_total{method="GET"}

# Multiple label filters
http_requests_total{method="GET",status="200"}

# Label matching operators
http_requests_total{status=~"5.."}   # Regex match
http_requests_total{status!="200"}   # Not equal
Rate Calculations
# Requests per second (average over 5 minutes)
rate(http_requests_total[5m])

# Increase over 1 hour
increase(http_requests_total[1h])

# Instantaneous per-second rate (based on the last two samples in the window)
irate(http_requests_total[5m])
Aggregations
# Sum across all instances
sum(http_requests_total)

# Sum by label
sum by (method) (http_requests_total)

# Average
avg(memory_usage_bytes)

# Maximum
max(response_time_seconds)

# Count healthy instances
count(up == 1)
Advanced Queries
# Error rate (4xx and 5xx responses)
sum(rate(http_requests_total{status=~"4..|5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# 95th percentile latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Top 10 endpoints by request rate
topk(10, sum by (endpoint) (rate(http_requests_total[5m])))

# Apdex score (Application Performance Index)
# Buckets are cumulative, so le="2" already includes le="0.5";
# the sum of both buckets is halved before dividing by the total.
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
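The Apdex arithmetic is easier to check with concrete numbers. The sketch below assumes cumulative histogram counts, a 0.5 s satisfaction target, and a 2 s tolerable limit; the function name and the request counts in the usage note are made up for illustration.

```python
def apdex(satisfied_le, tolerated_le, total):
    """Apdex from cumulative histogram counts.

    satisfied_le: requests at or under the target (for example le="0.5")
    tolerated_le: requests at or under the tolerable limit (for example le="2");
                  being cumulative, this already includes satisfied_le
    total:        all requests (the _count series)
    """
    if total == 0:
        raise ValueError("no requests observed")
    # (satisfied + tolerated_cumulative) / 2 equals
    # satisfied + (tolerating_only / 2), the usual Apdex numerator.
    return (satisfied_le + tolerated_le) / 2 / total
```

For example, with 5,000 requests under 0.5 s, 9,000 under 2 s, and 10,000 in total, 4,000 requests are merely tolerable, so the numerator is 5,000 + 4,000 / 2 = 7,000 and the score is 0.7.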
Metric Cardinality Management
The Cardinality Problem
High cardinality = many unique label combinations:
# Bad: User ID as label (millions of unique values)
http_requests_total{user_id="12345"} 1

# Good: Aggregate user metrics separately
user_requests_total 1000000
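Because every unique label combination is stored as its own time series, the worst-case series count for a metric is the product of the number of distinct values per label. The sketch below shows that arithmetic; the label names and value counts are assumptions chosen for illustration.

```python
from math import prod


def max_series(distinct_values_per_label):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(distinct_values_per_label.values())


# Hypothetical label sets for http_requests_total.
BOUNDED = {"method": 5, "endpoint": 50, "status": 10}
WITH_USER_ID = {**BOUNDED, "user_id": 1_000_000}
```

`max_series(BOUNDED)` gives 2,500 series, while adding a `user_id` label with a million values raises the bound to 2.5 billion. Real counts are usually lower because not every combination occurs, but the multiplication explains why a single unbounded label is enough to cause trouble.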
Best Practices
- Limit label values: Keep cardinality < 10,000
- Avoid high-cardinality labels: User IDs, timestamps, UUIDs
- Use aggregation: Pre-aggregate high-cardinality data
- Drop unnecessary labels: Only essential dimensions
Example: Managing Cardinality
// Bad: High cardinality
httpRequestsTotal.inc({
  method: req.method,
  endpoint: req.url,       // Unique per request!
  userId: req.user.id,     // Millions of users!
  timestamp: Date.now(),   // Always unique!
});

// Good: Bounded cardinality
httpRequestsTotal.inc({
  method: req.method,
  endpoint: req.route?.path || 'unknown',  // Template, not actual URL
  status: res.statusCode,                  // Limited values
});

// Track user metrics separately
userRequestsTotal.inc();
Monitoring Best Practices
1. RED Method (Requests, Errors, Duration)
Essential metrics for every service:
# Rate: Requests per second
sum(rate(http_requests_total[5m]))

# Errors: Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Duration: 95th percentile latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
2. USE Method (Utilization, Saturation, Errors)
For resource monitoring:
# Utilization: fraction of CPU time not spent idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: Queue depth
queue_depth

# Errors: Error rate
sum(rate(errors_total[5m]))
3. Golden Signals (Google SRE)
Four key metrics:
# Latency
histogram_quantile(0.95, rate(request_duration_bucket[5m]))

# Traffic
sum(rate(requests_total[5m]))

# Errors
sum(rate(requests_total{status=~"5.."}[5m]))

# Saturation (memory in use as a fraction of total, using node_exporter metrics)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
Alerting on Metrics
Alert Rules
Define alerts in Prometheus:
# prometheus_rules.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up{job="user-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is unreachable"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          /
          node_memory_MemTotal_bytes
          > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"
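The `for:` clause in these rules means an alert must stay true across consecutive evaluations for the whole duration before it fires; until then it is pending, and a single false evaluation resets the timer. The following sketch simulates that state machine in a simplified form. The function name and the evaluation timestamps in the usage note are assumptions for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) for each (timestamp, condition_true) evaluation."""
    states = []
    active_since = None  # when the condition first became true, if it still holds
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append((timestamp, "inactive"))
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append((timestamp, "firing" if held >= for_seconds else "pending"))
    return states
```

With evaluations every 60 seconds and `for_seconds=120`, a condition that is true at 0 s and 60 s, false at 120 s, then true from 180 s onward only reaches "firing" at 300 s, because the dip at 120 s restarted the wait.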
Integration with GitLab
Configure alert notifications in GitLab:
- Settings > Monitor > Alerts
- Add Prometheus endpoint
- Configure notification channels (Slack, PagerDuty, email)
Performance Optimization
1. Recording Rules
Pre-compute expensive queries:
# prometheus_rules.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-compute request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute error rate
      - record: job:http_errors:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute p95 latency
      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
2. Metric Relabeling
Reduce storage overhead:
# prometheus.yml
scrape_configs:
  - job_name: 'user-api'
    static_configs:
      - targets: ['user-api:8080']
    metric_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop
      # Copy old_label into new_label (the default action is replace);
      # add a labeldrop rule for old_label to complete the rename
      - source_labels: [old_label]
        target_label: new_label
      # Drop specific labels
      - regex: 'unnecessary_label'
        action: labeldrop
3. Retention Policies
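Configure data retention with Prometheus command-line flags (retention is not set in the global section of prometheus.yml):
# Keep data for 15 days
prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d

# Or limit by size
prometheus --config.file=prometheus.yml --storage.tsdb.retention.size=50GB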
Visualization with Grafana
Connecting GitLab to Grafana
- Add Prometheus data source
  - Navigate to Grafana > Configuration > Data Sources
  - Add Prometheus
  - URL: http://prometheus:9090
- Create dashboards
  - Import GitLab dashboard templates
  - Build custom dashboards
Example Dashboard Panels
{
  "title": "Request Rate",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m]))",
      "legendFormat": "Requests/sec"
    }
  ],
  "type": "graph"
}
Cost Optimization
Strategies
- Sampling: Scrape less often where fine resolution is not needed
- Aggregation: Use recording rules
- Retention: Keep recent data only
- Cardinality control: Limit label values
- Drop unnecessary metrics: Focus on essentials
Example: Cost-Effective Configuration
# prometheus.yml
global:
  # Longer scrape interval for non-critical services
  scrape_interval: 60s

scrape_configs:
  # Critical services: frequent scraping
  - job_name: 'production-api'
    scrape_interval: 15s
    static_configs:
      - targets: ['api:8080']

  # Non-critical: infrequent scraping
  - job_name: 'batch-jobs'
    scrape_interval: 5m
    static_configs:
      - targets: ['batch:8080']
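The saving from longer scrape intervals is simple arithmetic: each series produces one sample per scrape. The sketch below computes samples per day for the three intervals used in this configuration; the figure of 1,000 series per job in the usage note is an assumed number, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(series, scrape_interval_seconds):
    """Samples ingested per day for a given series count and scrape interval."""
    return series * SECONDS_PER_DAY // scrape_interval_seconds
```

A single series yields 5,760 samples per day at 15s, 1,440 at 60s, and 288 at 5m. For a job exposing 1,000 series, moving from 15s to 5m cuts daily samples from 5,760,000 to 288,000, a 20x reduction, at the cost of coarser resolution.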
Related Documentation
- Tracing - Distributed tracing with OpenTelemetry
- APM - Application performance monitoring
- Dashboards - Metric visualization
- Alerting - Alert configuration
- CI/CD Analytics - Pipeline metrics