Application Performance Monitoring (APM) in GitLab
Overview
GitLab's Application Performance Monitoring (APM) provides integrated observability within the DevOps Platform, enabling teams to monitor application state, understand how changes impact performance, and resolve issues efficiently.
What is APM?
Application Performance Monitoring enables you to:
- Monitor application state: Real-time visibility into system health
- Track performance impact: Understand how changes affect performance
- Identify bottlenecks: Find slow queries, API calls, and operations
- Correlate deployments: Link performance changes to releases
- Optimize resource usage: Track and reduce costs
GitLab APM Architecture
Three-Tier Architecture
- Visualization & Analytics: Charts, tables, drill-down
- Storage & Querying: ClickHouse, Prometheus
- Collection: Logs, traces from production
Integration Points
GitLab APM integrates with:
- GitLab CI/CD: Automatic correlation with deployments
- GitLab Issues: Create issues from performance problems
- GitLab Metrics: Prometheus-based monitoring
- OpenTelemetry: Standard telemetry collection
- Kubernetes: Native container monitoring
Core APM Features
1. Performance Metrics
Track key performance indicators:
- Response time: P50, P95, P99 latencies
- Throughput: Requests per second
- Error rate: Failed request percentage
- Apdex score: User satisfaction metric
- Resource utilization: CPU, memory, disk
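As a rough illustration of how these indicators relate to raw request data, the sketch below derives them from a list of observed requests. It is not how GitLab computes them; the function names, the nearest-rank percentile method, and the 500ms Apdex threshold are assumptions chosen for the example. Apdex follows the standard definition: satisfied requests count fully, tolerating requests (up to four times the threshold) count half.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total, with tolerating up to 4x the threshold."""
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)


def summarize(requests, window_seconds, apdex_threshold_ms=500):
    """requests: list of (latency_ms, succeeded) tuples observed during the window."""
    latencies = [latency for latency, _ in requests]
    failed = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "throughput_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * failed / len(requests),
        "apdex": apdex(latencies, apdex_threshold_ms),
    }
```

Percentiles are reported instead of averages because a small number of very slow requests can hide behind a healthy-looking mean.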
2. Transaction Tracking
Monitor individual transactions:
Transaction: POST /api/orders
Duration: 450ms
  Database Queries: 5 (total 120ms)
    SELECT * FROM users: 30ms
    SELECT * FROM products: 40ms
    INSERT INTO orders: 25ms
    INSERT INTO order_items: 15ms
    UPDATE inventory: 10ms
  External APIs: 2 (total 200ms)
    Payment Gateway: 150ms
    Email Service: 50ms
  Application Logic: 130ms
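The breakdown above can be reproduced from span data with a few lines of code. The following is a minimal sketch, assuming the spans run sequentially (no overlap) and that anything not covered by a database or external span counts as application logic. The `breakdown` function and the category names are hypothetical.

```python
def breakdown(total_ms, spans):
    """Sum span durations per category; the remainder is attributed to application logic.

    spans: list of (category, duration_ms) tuples, assumed not to overlap in time.
    """
    totals = {}
    for category, duration_ms in spans:
        totals[category] = totals.get(category, 0) + duration_ms
    totals["application"] = total_ms - sum(totals.values())
    return totals


# The POST /api/orders transaction shown above
spans = [
    ("database", 30), ("database", 40), ("database", 25),
    ("database", 15), ("database", 10),
    ("external", 150), ("external", 50),
]
print(breakdown(450, spans))
```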
3. Slow Query Detection
Identify database performance issues:
-- Slow query detected: 2.5s
SELECT o.*, u.*, p.*
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at > NOW() - INTERVAL 30 DAY;
-- Missing index on orders.created_at
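Conceptually, slow query detection amounts to timing each statement and recording those that exceed a threshold. The sketch below shows the idea with Python's built-in `sqlite3` module so that it runs without external services. The `SlowQueryLog` class and its threshold are assumptions for illustration; in a real setup the database instrumentation records this timing as span data.

```python
import sqlite3
import time


class SlowQueryLog:
    """Wraps a DB-API connection and records statements slower than a threshold."""

    def __init__(self, connection, threshold_ms):
        self.connection = connection
        self.threshold_ms = threshold_ms
        self.slow_queries = []  # list of (sql, elapsed_ms)

    def execute(self, sql, params=()):
        started = time.perf_counter()
        rows = self.connection.execute(sql, params).fetchall()
        elapsed_ms = (time.perf_counter() - started) * 1000
        if elapsed_ms >= self.threshold_ms:
            self.slow_queries.append((sql, elapsed_ms))
        return rows


connection = sqlite3.connect(":memory:")
log = SlowQueryLog(connection, threshold_ms=100)
log.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
log.execute("SELECT * FROM orders WHERE created_at > ?", ("2024-01-01",))
for sql, elapsed_ms in log.slow_queries:
    print(f"Slow query ({elapsed_ms:.0f}ms): {sql}")
```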
4. Deployment Correlation
Link performance changes to deployments:
Performance Impact: Deploy v2.3.0
Before Deploy (v2.2.9)
P95 Latency: 250ms
Error Rate: 0.2%
Throughput: 1000 req/s
After Deploy (v2.3.0)
P95 Latency: 450ms (+80%)
Error Rate: 1.5% (+650%)
Throughput: 850 req/s (-15%)
Action: Rollback recommended
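The percentages in this comparison are relative changes against the pre-deploy values. A minimal sketch of that calculation follows. The metric names and the rollback thresholds (25% latency increase, 1% error rate) are assumptions for the example, not GitLab defaults.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) * 100 / before


def compare_deploy(before, after, max_latency_increase_pct=25, max_error_rate_pct=1.0):
    """Return per-metric changes and a recommendation based on assumed thresholds."""
    changes = {name: percent_change(before[name], after[name]) for name in before}
    regressed = (
        changes["p95_latency_ms"] > max_latency_increase_pct
        or after["error_rate_pct"] > max_error_rate_pct
    )
    recommendation = "rollback" if regressed else "keep"
    return changes, recommendation


before = {"p95_latency_ms": 250, "error_rate_pct": 0.2, "throughput_rps": 1000}
after = {"p95_latency_ms": 450, "error_rate_pct": 1.5, "throughput_rps": 850}
changes, recommendation = compare_deploy(before, after)
print(changes, recommendation)
```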
Setting Up APM
1. Enable GitLab Observability
Configure in project settings:
- Navigate to Settings > Monitor > Observability
- Enable Application Performance Monitoring
- Note the Observability Endpoint URL
2. Instrument Your Application
Node.js with OpenTelemetry
// instrument.js - Load BEFORE application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
    }),
    exportIntervalMillis: 60000, // Export every 60s
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-http': {
        enabled: true,
        requestHook: (span, request) => {
          // Outgoing requests have no `headers` object, so guard the lookup
          const userAgent = request.headers && request.headers['user-agent'];
          if (userAgent) {
            span.setAttribute('http.user_agent', userAgent);
          }
        },
      },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// app.js - Your application code
require('./instrument'); // Must be first!
const express = require('express');

const app = express();

// `db` is the application's database client (not shown)
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users);
});

app.listen(3000);
Python with OpenTelemetry
# app.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from flask import Flask, jsonify
import os

# Configure resource
resource = Resource.create({
    "service.name": "user-api",
    "service.version": os.getenv("APP_VERSION"),
    "deployment.environment": os.getenv("ENVIRONMENT"),
})

# Setup tracing
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=f"{os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT')}/v1/traces"
        )
    )
)
trace.set_tracer_provider(trace_provider)

# Setup metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(
        endpoint=f"{os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT')}/v1/metrics"
    ),
    export_interval_millis=60000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Create Flask app
app = Flask(__name__)

# Auto-instrument
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()

# `db` is the application's database client (not shown)
@app.route('/api/users')
def get_users():
    users = db.query('SELECT * FROM users')
    return jsonify(users)

if __name__ == '__main__':
    app.run()
3. Configure Environment Variables
# .env
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-gitlab-instance.com/observability
OTEL_SERVICE_NAME=user-api
APP_VERSION=1.2.3
ENVIRONMENT=production
4. Deploy with Configuration
# docker-compose.yml
services:
  api:
    image: user-api:latest
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_ENDPOINT}
      - APP_VERSION=${CI_COMMIT_SHA}
      - ENVIRONMENT=production
Viewing APM Data
Accessing APM Dashboard
- Navigate to Monitor > Application Performance
- Select time range and service
- View performance metrics and traces
Key Metrics Dashboard
Application Performance Dashboard
Throughput: 1,234 req/s
P95 Latency: 245ms
Error Rate: 0.5%
Apdex Score: 0.92
Response Time Over Time (chart: response time between 150ms and 300ms, 12:00 to 15:00)
Slowest Endpoints
/api/reports/generate 1.2s
/api/analytics/dashboard 850ms
/api/exports/csv 720ms
/api/search/advanced 450ms
Performance Analysis
1. Transaction Analysis
Drill into slow transactions:
Transaction Details: POST /api/orders/123
Duration: 1,250ms (P99)
Timeline:
0ms HTTP Request Received
10ms Authentication (JWT verify): 10ms
20ms Authorization check: 10ms
30ms Database: Fetch user: 50ms
80ms Database: Fetch cart items: 80ms (N+1 query!)
160ms Database: Check inventory (x5): 100ms
260ms Payment API: Charge card: 600ms
860ms Database: Create order: 150ms
1010ms Database: Update inventory: 100ms
1110ms Email API: Send confirmation: 120ms
1230ms Cache: Invalidate cart: 10ms
1240ms HTTP Response Sent: 10ms
Issues Identified:
N+1 Query: Cart items fetched individually
Slow External API: Payment gateway timeout
Optimization: Batch fetch + add payment retry
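An N+1 pattern like the one flagged above usually shows up in trace data as the same parameterized statement repeated many times within a single trace. The sketch below illustrates one way to detect that. The span dictionary shape (`trace_id`, `kind`, `statement`) and the repeat threshold are assumptions for the example, not a GitLab or OpenTelemetry data format.

```python
from collections import Counter


def find_n_plus_one(spans, min_repeats=5):
    """Flag database statements repeated at least min_repeats times within one trace.

    spans: list of dicts with 'trace_id', 'kind' and 'statement' keys, where
    'statement' is the parameterized SQL text (values replaced by placeholders).
    """
    counts = Counter(
        (span["trace_id"], span["statement"])
        for span in spans
        if span["kind"] == "db"
    )
    return [
        {"trace_id": trace_id, "statement": statement, "count": count}
        for (trace_id, statement), count in counts.items()
        if count >= min_repeats
    ]
```

Grouping on the parameterized statement matters: if literal values were included, each query would look unique and the repetition would go unnoticed.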
2. Database Performance
Identify slow queries:
-- Top 5 Slowest Queries (Last 24h)

-- 1. Analytics Report Generation: Avg 2.3s, 450 calls
SELECT DATE(created_at) as date, COUNT(*) as orders, SUM(total) as revenue
FROM orders
WHERE created_at > NOW() - INTERVAL 90 DAY
GROUP BY DATE(created_at);
-- Issue: Missing index on created_at
-- Fix: CREATE INDEX idx_orders_created_at ON orders(created_at);

-- 2. User Dashboard: Avg 1.1s, 12,000 calls
SELECT u.*, o.*, p.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LEFT JOIN payments p ON o.id = p.order_id
WHERE u.id = ?;
-- Issue: Wide join with no LIMIT returns every order and payment row for the user
-- Fix: Add pagination, use separate queries

-- 3. Product Search: Avg 850ms, 8,500 calls
SELECT * FROM products WHERE LOWER(name) LIKE LOWER(?);
-- Issue: Full table scan
-- Fix: Use full-text search or Elasticsearch
3. External API Analysis
Track third-party service performance:
External API Performance (Last 24h)
Payment Gateway (Stripe)
Avg Latency: 450ms
P95 Latency: 850ms
P99 Latency: 1.2s
Success Rate: 99.2%
Timeout Rate: 0.5%
Cost: $1,245 (12,450 calls @ $0.10)
Email Service (SendGrid)
Avg Latency: 120ms
P95 Latency: 250ms
P99 Latency: 500ms
Success Rate: 99.8%
Throttle Rate: 0.1%
Cost: $89 (89,000 emails @ $0.001)
Recommendations:
Add circuit breaker for payment gateway
Implement async email sending
Cache payment method validation
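To make the first recommendation concrete, here is a minimal circuit breaker sketch. It is a simplified illustration and not a production implementation: the class names, the failure threshold, and the reset timeout are assumptions, and it is not thread-safe. After a number of consecutive failures it rejects calls immediately for a cool-down period, then lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open state)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the payment gateway call in `breaker.call(...)` would let the order endpoint fail fast during a gateway outage instead of waiting for each request to time out.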
Performance Optimization
1. Database Optimization
Identify and fix slow queries:
// Before: N+1 Query Problem
async function getOrdersWithItems(userId) {
  const orders = await db.query(
    'SELECT * FROM orders WHERE user_id = ?',
    [userId]
  );
  for (const order of orders) {
    order.items = await db.query(
      'SELECT * FROM order_items WHERE order_id = ?',
      [order.id]
    ); // N queries!
  }
  return orders;
}

// After: Single Query with JOIN (json_agg is PostgreSQL-specific)
async function getOrdersWithItems(userId) {
  const result = await db.query(`
    SELECT o.*, json_agg(oi.*) as items
    FROM orders o
    LEFT JOIN order_items oi ON o.id = oi.order_id
    WHERE o.user_id = ?
    GROUP BY o.id
  `, [userId]);
  return result;
}
2. Caching Strategy
Implement intelligent caching:
const redis = require('redis');

const client = redis.createClient();
// node-redis v4+ requires an explicit connection before issuing commands
const ready = client.connect();

async function getUserProfile(userId) {
  await ready;

  // Check cache first
  const cached = await client.get(`user:${userId}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Fetch from database
  const user = await db.query(
    'SELECT * FROM users WHERE id = ?',
    [userId]
  );

  // Cache for 1 hour (EX sets the expiry in seconds)
  await client.set(`user:${userId}`, JSON.stringify(user), { EX: 3600 });
  return user;
}
3. Async Processing
Offload long-running tasks:
// Before: Synchronous email sending (blocks response)
app.post('/api/orders', async (req, res) => {
  const order = await createOrder(req.body);
  await sendOrderConfirmation(order); // 500ms delay!
  res.json(order);
});

// After: Async with job queue
const Queue = require('bull');
const emailQueue = new Queue('email');

app.post('/api/orders', async (req, res) => {
  const order = await createOrder(req.body);
  await emailQueue.add({ orderId: order.id }); // Only enqueues the job
  res.json(order);
});

// Background worker processes emails
emailQueue.process(async (job) => {
  const order = await getOrder(job.data.orderId);
  await sendOrderConfirmation(order);
});
4. Connection Pooling
Optimize database connections:
// Configure connection pool
const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  database: 'mydb',
  max: 20, // Max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Reuse connections efficiently
async function queryDatabase(sql, params) {
  const client = await pool.connect();
  try {
    return await client.query(sql, params);
  } finally {
    client.release(); // Return to pool
  }
}
Cost Attribution
Tracking Costs
Monitor resource usage and costs:
// Custom metrics for cost tracking
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('cost-attribution');

// Track LLM token usage
const tokenCounter = meter.createCounter('llm_tokens_used', {
  description: 'Total LLM tokens consumed',
  unit: 'tokens',
});

async function callLLM(prompt) {
  const response = await openai.complete({ prompt });
  tokenCounter.add(response.usage.total_tokens, {
    model: response.model,
    user: getCurrentUser().id,
    feature: 'chat',
  });
  return response;
}

// Query cost metrics
// SELECT
//   user,
//   feature,
//   SUM(tokens) * 0.002 / 1000 as cost_usd
// FROM llm_tokens_used
// WHERE timestamp > NOW() - INTERVAL 1 DAY
// GROUP BY user, feature
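The commented query above groups token usage by user and feature and converts it to a dollar amount. The same aggregation can be sketched in a few lines of Python, which may help when prototyping cost reports outside the query layer. The record shape is an assumption, and the price of $0.002 per 1,000 tokens is taken from the query, not from any provider's actual pricing.

```python
from collections import defaultdict

PRICE_PER_1K_TOKENS_USD = 0.002  # same rate as in the query above


def cost_by_user_and_feature(records):
    """records: list of dicts with 'user', 'feature' and 'tokens' keys."""
    tokens = defaultdict(int)
    for record in records:
        tokens[(record["user"], record["feature"])] += record["tokens"]
    return {
        key: total * PRICE_PER_1K_TOKENS_USD / 1000
        for key, total in tokens.items()
    }
```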
Cost Optimization
Reduce operational costs:
- Efficient queries: Reduce database load
- Caching: Minimize API calls
- Batching: Group operations
- Sampling: Monitor subset of traffic
- Resource limits: Prevent runaway costs
// Example: Rate limiting
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per window
  message: 'Too many requests, please try again later',
});

app.use('/api/', limiter);
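Sampling, listed above, reduces telemetry volume by keeping only a fraction of traces. One common approach is to decide deterministically from the trace ID, so every service handling the same trace makes the same decision. The sketch below illustrates the idea with a hash of the ID; `should_sample` is a hypothetical helper, not the OpenTelemetry sampler API.

```python
import hashlib


def should_sample(trace_id, sample_rate):
    """Keep roughly sample_rate (0.0 to 1.0) of traces, deterministically per trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < sample_rate * 2**64


# The same trace ID always yields the same decision
print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1))
```

The trade-off is that rare slow requests may be dropped, so error traces are often kept at a higher rate than successful ones.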
Integration with GitLab CI/CD
Automatic Performance Testing
# .gitlab-ci.yml
stages:
  - test
  - deploy
  - monitor

# Run performance tests
performance_test:
  stage: test
  script:
    - npm run test:performance
    - node scripts/analyze-performance.js
  artifacts:
    reports:
      performance: performance-report.json

# Deploy with monitoring
deploy_production:
  stage: deploy
  script:
    - kubectl apply -f k8s/
  environment:
    name: production
    action: start

# Post-deployment monitoring
monitor_deployment:
  stage: monitor
  script:
    - node scripts/check-apm-metrics.js
  when: always
  needs:
    - deploy_production
Performance Budgets
Fail CI if performance degrades:
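// scripts/check-apm-metrics.js
const axios = require('axios');

const PERFORMANCE_BUDGETS = {
  p95_latency_ms: 500,
  error_rate_pct: 1.0,
  throughput_rps: 100,
};

// Hypothetical helper: APM_METRICS_URL is an assumed environment variable pointing at an
// endpoint that returns { p95_latency, error_rate, throughput } as JSON.
async function fetchAPMMetrics() {
  const response = await axios.get(process.env.APM_METRICS_URL);
  return response.data;
}

async function checkPerformance() {
  const metrics = await fetchAPMMetrics();
  const failures = [];

  if (metrics.p95_latency > PERFORMANCE_BUDGETS.p95_latency_ms) {
    failures.push(`P95 latency ${metrics.p95_latency}ms exceeds budget ${PERFORMANCE_BUDGETS.p95_latency_ms}ms`);
  }
  if (metrics.error_rate > PERFORMANCE_BUDGETS.error_rate_pct) {
    failures.push(`Error rate ${metrics.error_rate}% exceeds budget ${PERFORMANCE_BUDGETS.error_rate_pct}%`);
  }
  if (metrics.throughput < PERFORMANCE_BUDGETS.throughput_rps) {
    failures.push(`Throughput ${metrics.throughput} req/s is below budget ${PERFORMANCE_BUDGETS.throughput_rps} req/s`);
  }

  if (failures.length > 0) {
    console.error('Performance budget exceeded:');
    failures.forEach(f => console.error(` - ${f}`));
    process.exit(1);
  }
  console.log('All performance budgets met');
}

// Fail the job if the metrics cannot be fetched at all
checkPerformance().catch((error) => {
  console.error('Performance check failed to run:', error);
  process.exit(1);
});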
Best Practices
1. Instrument Critical Paths
Focus on high-value operations:
- User-facing endpoints
- Payment processing
- Data exports
- Search functionality
2. Set Performance Baselines
Establish acceptable performance:
performance_baselines:
  api_endpoints:
    p50_latency: 100ms
    p95_latency: 500ms
    p99_latency: 1000ms
  database_queries:
    p95_latency: 100ms
    p99_latency: 500ms
  external_apis:
    timeout: 5000ms
3. Monitor User Experience
Track real user metrics:
- Page load time
- Time to interactive
- First contentful paint
- Cumulative layout shift
4. Correlate with Business Metrics
Link performance to business outcomes:
- Conversion rates vs. page load time
- Cart abandonment vs. checkout latency
- User engagement vs. API response time
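A simple first step for relating a performance metric to a business metric is a correlation coefficient over matched observations, for example daily average page load time against daily conversion rate. The sketch below computes the Pearson coefficient by hand; the function name and the choice of Pearson (which only captures linear relationships) are assumptions for illustration, and a strong correlation does not by itself show that latency causes the change.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)
```

A value near -1 for load time against conversion rate would suggest that slower pages coincide with fewer conversions, which is a prompt for a controlled experiment, not a conclusion.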
Related Documentation
- Tracing - Distributed tracing implementation
- Metrics - Prometheus metrics collection
- Logs - Log aggregation and analysis
- Dashboards - Performance visualization
- CI/CD Analytics - Pipeline performance