Application Performance Monitoring (APM) in GitLab

Overview

GitLab's Application Performance Monitoring (APM) provides integrated observability within the DevOps Platform, enabling teams to monitor application state, understand how changes impact performance, and resolve issues efficiently.

What is APM?

Application Performance Monitoring enables you to:

  • Monitor application state: Real-time visibility into system health
  • Track performance impact: Understand how changes affect performance
  • Identify bottlenecks: Find slow queries, API calls, and operations
  • Correlate deployments: Link performance changes to releases
  • Optimize resource usage: Track and reduce costs

GitLab APM Architecture

Three-Tier Architecture


  • Visualization & Analytics: Charts, tables, drill-down
  • Storage & Querying: ClickHouse, Prometheus
  • Collection: Logs, traces from production
Integration Points

GitLab APM integrates with:

  • GitLab CI/CD: Automatic correlation with deployments
  • GitLab Issues: Create issues from performance problems
  • GitLab Metrics: Prometheus-based monitoring
  • OpenTelemetry: Standard telemetry collection
  • Kubernetes: Native container monitoring

Core APM Features

1. Performance Metrics

Track key performance indicators:

  • Response time: P50, P95, P99 latencies
  • Throughput: Requests per second
  • Error rate: Failed request percentage
  • Apdex score: User satisfaction metric
  • Resource utilization: CPU, memory, disk
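As a rough illustration of how the indicators above are derived from raw request samples, here is a minimal Python sketch. It is not GitLab's implementation: the function names, the `(latency_ms, succeeded)` tuple shape, and the 500ms Apdex threshold are assumptions made for the example. Percentiles use the nearest-rank method, and Apdex uses the standard formula (satisfied + tolerating / 2) / total, where "tolerating" means up to four times the threshold.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total."""
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)


def summarize(requests, window_seconds, apdex_threshold_ms=500):
    """requests: list of (latency_ms, succeeded) tuples observed in the window (assumed shape)."""
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "throughput_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * failures / len(requests),
        "apdex": apdex(latencies, apdex_threshold_ms),
    }


# Illustrative data: 90 fast requests, 8 slow ones, 2 failures, over a 10-second window.
sample = [(100, True)] * 90 + [(800, True)] * 8 + [(3000, False)] * 2
print(summarize(sample, window_seconds=10))
```

Note how the P50 hides the slow tail entirely, which is why the list above tracks P95 and P99 alongside it.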

2. Transaction Tracking

Monitor individual transactions:

Transaction: POST /api/orders
 Duration: 450ms
 Database Queries: 5 (total 120ms)
    SELECT * FROM users: 30ms
    SELECT * FROM products: 40ms
    INSERT INTO orders: 25ms
    INSERT INTO order_items: 15ms
    UPDATE inventory: 10ms
 External APIs: 2 (total 200ms)
    Payment Gateway: 150ms
    Email Service: 50ms
 Application Logic: 130ms

3. Slow Query Detection

Identify database performance issues:

-- Slow query detected: 2.5s
SELECT o.*, u.*, p.*
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at > NOW() - INTERVAL 30 DAY
-- Missing index on orders.created_at

4. Deployment Correlation

Link performance changes to deployments:

Performance Impact: Deploy v2.3.0
 Before Deploy (v2.2.9)
    P95 Latency: 250ms
    Error Rate: 0.2%
    Throughput: 1000 req/s
 After Deploy (v2.3.0)
    P95 Latency: 450ms (+80%) 
    Error Rate: 1.5% (+650%) 
    Throughput: 850 req/s (-15%) 
 Action: Rollback recommended
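The percentages in the comparison above are plain relative changes, and the recommendation can be expressed as a threshold check. The sketch below reproduces the numbers from the v2.2.9 and v2.3.0 example; the function names and the two thresholds (20% latency increase, 1% error rate) are assumptions chosen for illustration, not GitLab defaults.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


def compare_deploy(before, after, max_latency_increase_pct=20.0, max_error_rate_pct=1.0):
    """Return per-metric changes and a suggested action (thresholds are assumed values)."""
    changes = {name: percent_change(before[name], after[name]) for name in before}
    regressed = (
        changes["p95_latency_ms"] > max_latency_increase_pct
        or after["error_rate_pct"] > max_error_rate_pct
    )
    return changes, ("rollback recommended" if regressed else "ok")


before = {"p95_latency_ms": 250, "error_rate_pct": 0.2, "throughput_rps": 1000}
after = {"p95_latency_ms": 450, "error_rate_pct": 1.5, "throughput_rps": 850}

changes, action = compare_deploy(before, after)
for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
print(action)
```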

Setting Up APM

1. Enable GitLab Observability

Configure in project settings:

  1. Navigate to Settings > Monitor > Observability
  2. Enable Application Performance Monitoring
  3. Note the Observability Endpoint URL

2. Instrument Your Application

Node.js with OpenTelemetry

// instrument.js - Load BEFORE application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
    }),
    exportIntervalMillis: 60000, // Export every 60s
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-http': {
        enabled: true,
        requestHook: (span, request) => {
          // The hook also fires for outgoing requests, which have no `headers` object
          const userAgent = request.headers && request.headers['user-agent'];
          if (userAgent) {
            span.setAttribute('http.user_agent', userAgent);
          }
        },
      },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// app.js - Your application code
require('./instrument'); // Must be first!

const express = require('express');
const app = express();

// `db` stands for your database client; it is not defined in this snippet
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users);
});

app.listen(3000);

Python with OpenTelemetry

# app.py
import os

from flask import Flask, jsonify
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Configure resource
resource = Resource.create({
    "service.name": "user-api",
    "service.version": os.getenv("APP_VERSION"),
    "deployment.environment": os.getenv("ENVIRONMENT"),
})

# Setup tracing
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint=f"{os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT')}/v1/traces"
        )
    )
)
trace.set_tracer_provider(trace_provider)

# Setup metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(
        endpoint=f"{os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT')}/v1/metrics"
    ),
    export_interval_millis=60000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Create Flask app
app = Flask(__name__)

# Auto-instrument
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()

# `db` stands for your database client; it is not defined in this snippet
@app.route('/api/users')
def get_users():
    users = db.query('SELECT * FROM users')
    return jsonify(users)

if __name__ == '__main__':
    app.run()

3. Configure Environment Variables

# .env
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-gitlab-instance.com/observability
OTEL_SERVICE_NAME=user-api
APP_VERSION=1.2.3
ENVIRONMENT=production

4. Deploy with Configuration

# docker-compose.yml
services:
  api:
    image: user-api:latest
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_ENDPOINT}
      - APP_VERSION=${CI_COMMIT_SHA}
      - ENVIRONMENT=production

Viewing APM Data

Accessing APM Dashboard

  1. Navigate to Monitor > Application Performance
  2. Select time range and service
  3. View performance metrics and traces

Key Metrics Dashboard

Application Performance Dashboard

 Throughput: 1,234 req/s                 
 P95 Latency: 245ms                      
 Error Rate: 0.5%                        
 Apdex Score: 0.92                       



 [Chart: Response Time Over Time; y-axis 150ms to 300ms, x-axis 12:00 to 15:00]



         Slowest Endpoints               
 /api/reports/generate      1.2s         
 /api/analytics/dashboard   850ms        
 /api/exports/csv           720ms        
 /api/search/advanced       450ms        

Performance Analysis

1. Transaction Analysis

Drill into slow transactions:

Transaction Details: POST /api/orders/123
Duration: 1,250ms (P99)

Timeline:
0ms      HTTP Request Received
10ms     Authentication (JWT verify): 10ms
20ms     Authorization check: 10ms
30ms     Database: Fetch user: 50ms
80ms     Database: Fetch cart items: 80ms (N+1 query!) 
160ms    Database: Check inventory (x5): 100ms
260ms    Payment API: Charge card: 600ms 
860ms    Database: Create order: 150ms
1010ms   Database: Update inventory: 100ms
1110ms   Email API: Send confirmation: 120ms
1230ms   Cache: Invalidate cart: 10ms
1240ms   HTTP Response Sent: 10ms

Issues Identified:
 N+1 Query: Cart items fetched individually
 Slow External API: Payment gateway call takes 600ms, about half of the request
 Optimization: Batch fetch + add payment retry
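The two findings above (a repeated query pattern and one dominant external call) can be detected mechanically from the spans of a trace. The following is a minimal sketch under stated assumptions: the span dictionaries with `kind`, `statement`, and `duration_ms` keys are a hypothetical, simplified shape, not the format GitLab or OpenTelemetry exports, and the SQL strings are invented for the example.

```python
from collections import Counter


def find_repeated_queries(spans, min_repeats=3):
    """Return database statements executed at least min_repeats times within one trace."""
    counts = Counter(span["statement"] for span in spans if span["kind"] == "db")
    return {statement: n for statement, n in counts.items() if n >= min_repeats}


def slowest_span(spans):
    """Return the span with the longest duration."""
    return max(spans, key=lambda span: span["duration_ms"])


# Hypothetical trace loosely modeled on the timeline above:
# five identical cart-item lookups (16ms each) and one 600ms payment call.
trace = [
    {"kind": "db", "statement": "SELECT * FROM users WHERE id = ?", "duration_ms": 50},
    *[
        {"kind": "db", "statement": "SELECT * FROM cart_items WHERE id = ?", "duration_ms": 16}
        for _ in range(5)
    ],
    {"kind": "http", "statement": "POST payment-gateway/charge", "duration_ms": 600},
]

print(find_repeated_queries(trace))
print(slowest_span(trace)["statement"])
```

Counting identical statements per trace is a heuristic: it flags candidates for batching, but a human still has to confirm that the repeats come from a loop rather than from legitimately separate lookups.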

2. Database Performance

Identify slow queries:

-- Top 5 Slowest Queries (Last 24h), first three shown

-- 1. Analytics Report Generation: Avg 2.3s, 450 calls
SELECT DATE(created_at) as date, COUNT(*) as orders, SUM(total) as revenue
FROM orders
WHERE created_at > NOW() - INTERVAL 90 DAY
GROUP BY DATE(created_at);
-- Issue: Missing index on created_at
-- Fix: CREATE INDEX idx_orders_created_at ON orders(created_at);

-- 2. User Dashboard: Avg 1.1s, 12,000 calls
SELECT u.*, o.*, p.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LEFT JOIN payments p ON o.id = p.order_id
WHERE u.id = ?;
-- Issue: Unbounded join fan-out (one row per order and payment), no limit
-- Fix: Add pagination, use separate queries

-- 3. Product Search: Avg 850ms, 8,500 calls
SELECT * FROM products WHERE LOWER(name) LIKE LOWER(?);
-- Issue: Full table scan
-- Fix: Use full-text search or Elasticsearch

3. External API Analysis

Track third-party service performance:

External API Performance (Last 24h)

Payment Gateway (Stripe)
 Avg Latency: 450ms
 P95 Latency: 850ms
 P99 Latency: 1.2s 
 Success Rate: 99.2%
 Timeout Rate: 0.5%
 Cost: $1,245 (12,450 calls @ $0.10)

Email Service (SendGrid)
 Avg Latency: 120ms
 P95 Latency: 250ms
 P99 Latency: 500ms
 Success Rate: 99.8%
 Throttle Rate: 0.1%
 Cost: $89 (89,000 emails @ $0.001)

Recommendations:
 Add circuit breaker for payment gateway
 Implement async email sending
 Cache payment method validation
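To make the circuit breaker recommendation concrete, here is a minimal single-threaded sketch. The class name, the thresholds, and the injectable `clock` parameter are assumptions for illustration; a production service would more likely use an established resilience library and add thread safety.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for reset_timeout_s after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: allow one trial call (half-open state).
            # A single failure during the trial reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Usage: wrap the (hypothetical) payment call so repeated failures fail fast.
breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
print(breaker.call(lambda: "charged"))
```

Failing fast while the circuit is open keeps request threads from piling up behind a gateway that is already slow, which is the situation the P99 latency and timeout rate above point to.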

Performance Optimization

1. Database Optimization

Identify and fix slow queries:

// Before: N+1 Query Problem
async function getOrdersWithItems(userId) {
  const orders = await db.query(
    'SELECT * FROM orders WHERE user_id = ?',
    [userId]
  );
  for (const order of orders) {
    order.items = await db.query(
      'SELECT * FROM order_items WHERE order_id = ?',
      [order.id]
    ); // N queries!
  }
  return orders;
}

// After: Single Query with JOIN
// Note: json_agg is PostgreSQL-specific; adjust the placeholder style
// (? vs $1) to match your database driver.
async function getOrdersWithItems(userId) {
  const result = await db.query(`
    SELECT o.*, json_agg(oi.*) as items
    FROM orders o
    LEFT JOIN order_items oi ON o.id = oi.order_id
    WHERE o.user_id = ?
    GROUP BY o.id
  `, [userId]);
  return result;
}

2. Caching Strategy

Implement intelligent caching:

const redis = require('redis');
const client = redis.createClient();
// node-redis v4+: connect once at startup before issuing commands
client.connect().catch(console.error);

async function getUserProfile(userId) {
  // Check cache first
  const cached = await client.get(`user:${userId}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Fetch from database
  const user = await db.query(
    'SELECT * FROM users WHERE id = ?',
    [userId]
  );

  // Cache for 1 hour (setEx is the node-redis v4+ method name)
  await client.setEx(
    `user:${userId}`,
    3600,
    JSON.stringify(user)
  );
  return user;
}

3. Async Processing

Offload long-running tasks:

// Before: Synchronous email sending (blocks response)
app.post('/api/orders', async (req, res) => {
  const order = await createOrder(req.body);
  await sendOrderConfirmation(order); // 500ms delay!
  res.json(order);
});

// After: Async with job queue
const Queue = require('bull');
const emailQueue = new Queue('email');

app.post('/api/orders', async (req, res) => {
  const order = await createOrder(req.body);
  await emailQueue.add({ orderId: order.id }); // Instant!
  res.json(order);
});

// Background worker processes emails
emailQueue.process(async (job) => {
  const order = await getOrder(job.data.orderId);
  await sendOrderConfirmation(order);
});

4. Connection Pooling

Optimize database connections:

// Configure connection pool
const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  database: 'mydb',
  max: 20, // Max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Reuse connections efficiently
async function queryDatabase(sql, params) {
  const client = await pool.connect();
  try {
    return await client.query(sql, params);
  } finally {
    client.release(); // Return to pool
  }
}

Cost Attribution

Tracking Costs

Monitor resource usage and costs:

// Custom metrics for cost tracking
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('cost-attribution');

// Track LLM token usage
const tokenCounter = meter.createCounter('llm_tokens_used', {
  description: 'Total LLM tokens consumed',
  unit: 'tokens',
});

// `openai` and `getCurrentUser` stand for your LLM client and auth context;
// they are not defined in this snippet
async function callLLM(prompt) {
  const response = await openai.complete({ prompt });
  tokenCounter.add(response.usage.total_tokens, {
    model: response.model,
    user: getCurrentUser().id,
    feature: 'chat',
  });
  return response;
}

// Query cost metrics
// SELECT
//   user,
//   feature,
//   SUM(tokens) * 0.002 / 1000 as cost_usd
// FROM llm_tokens_used
// WHERE timestamp > NOW() - INTERVAL 1 DAY
// GROUP BY user, feature
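The commented cost query above groups token counts by user and feature and multiplies by a per-1,000-token price. The same aggregation can be sketched in Python to check the arithmetic offline; the record shape and the sample values are assumptions, and the $0.002 rate is simply the illustrative figure from the query, not a real price list.

```python
from collections import defaultdict

PRICE_PER_1K_TOKENS_USD = 0.002  # illustrative rate taken from the query above


def cost_by_user_and_feature(records):
    """Sum tokens per (user, feature) and convert the totals to USD."""
    totals = defaultdict(int)
    for record in records:
        totals[(record["user"], record["feature"])] += record["tokens"]
    return {
        key: tokens * PRICE_PER_1K_TOKENS_USD / 1000
        for key, tokens in totals.items()
    }


# Hypothetical exported data points
records = [
    {"user": "u1", "feature": "chat", "tokens": 1500},
    {"user": "u1", "feature": "chat", "tokens": 500},
    {"user": "u2", "feature": "search", "tokens": 1000},
]

for (user, feature), cost in cost_by_user_and_feature(records).items():
    print(f"{user} {feature}: ${cost:.4f}")
```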

Cost Optimization

Reduce operational costs:

  1. Efficient queries: Reduce database load
  2. Caching: Minimize API calls
  3. Batching: Group operations
  4. Sampling: Monitor subset of traffic
  5. Resource limits: Prevent runaway costs

// Example: Rate limiting (assumes an existing Express `app`)
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per window
  message: 'Too many requests, please try again later',
});

app.use('/api/', limiter);
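The sampling item in the list above can be sketched as a deterministic, ratio-based decision on the trace ID, so that every service seeing the same trace makes the same keep-or-drop choice. This is a simplified illustration using a SHA-256 hash; it is not the algorithm used by OpenTelemetry's built-in samplers, and the function name and trace ID format are assumptions.

```python
import hashlib


def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding deterministically from the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < ratio * 2**64


# Sample about 10% of 10,000 hypothetical trace IDs
kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Sampling trades completeness for cost: rare errors may be missed at low ratios, so some teams keep all error traces and sample only successful ones.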

Integration with GitLab CI/CD

Automatic Performance Testing

# .gitlab-ci.yml
stages:
  - test
  - deploy
  - monitor

# Run performance tests
performance_test:
  stage: test
  script:
    - npm run test:performance
    - node scripts/analyze-performance.js
  artifacts:
    reports:
      performance: performance-report.json

# Deploy with monitoring
deploy_production:
  stage: deploy
  script:
    - kubectl apply -f k8s/
  environment:
    name: production
    action: start

# Post-deployment monitoring
monitor_deployment:
  stage: monitor
  script:
    - node scripts/check-apm-metrics.js
  when: always
  needs:
    - deploy_production

Performance Budgets

Fail CI if performance degrades:

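// scripts/check-apm-metrics.js
const axios = require('axios');

const PERFORMANCE_BUDGETS = {
  p95_latency_ms: 500,
  error_rate_pct: 1.0,
  throughput_rps: 100,
};

// Placeholder: replace with a call to your metrics source.
// APM_METRICS_URL is a hypothetical variable; the endpoint is assumed to return
// JSON such as { "p95_latency": 245, "error_rate": 0.5, "throughput": 1234 }.
async function fetchAPMMetrics() {
  const response = await axios.get(process.env.APM_METRICS_URL);
  return response.data;
}

async function checkPerformance() {
  const metrics = await fetchAPMMetrics();
  const failures = [];

  if (metrics.p95_latency > PERFORMANCE_BUDGETS.p95_latency_ms) {
    failures.push(`P95 latency ${metrics.p95_latency}ms exceeds budget ${PERFORMANCE_BUDGETS.p95_latency_ms}ms`);
  }
  if (metrics.error_rate > PERFORMANCE_BUDGETS.error_rate_pct) {
    failures.push(`Error rate ${metrics.error_rate}% exceeds budget ${PERFORMANCE_BUDGETS.error_rate_pct}%`);
  }
  // Throughput is a floor, not a ceiling
  if (metrics.throughput < PERFORMANCE_BUDGETS.throughput_rps) {
    failures.push(`Throughput ${metrics.throughput} req/s is below budget ${PERFORMANCE_BUDGETS.throughput_rps} req/s`);
  }

  if (failures.length > 0) {
    console.error('Performance budget exceeded:');
    failures.forEach(f => console.error(` - ${f}`));
    process.exit(1);
  }
  console.log('All performance budgets met');
}

// Fail the job if the metrics cannot be fetched at all
checkPerformance().catch((error) => {
  console.error('Could not check performance budgets:', error.message);
  process.exit(1);
});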

Best Practices

1. Instrument Critical Paths

Focus on high-value operations:

  • User-facing endpoints
  • Payment processing
  • Data exports
  • Search functionality

2. Set Performance Baselines

Establish acceptable performance:

performance_baselines:
  api_endpoints:
    p50_latency: 100ms
    p95_latency: 500ms
    p99_latency: 1000ms
  database_queries:
    p95_latency: 100ms
    p99_latency: 500ms
  external_apis:
    timeout: 5000ms
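Baselines like these are only useful if something compares measurements against them. The sketch below shows one way to do that, assuming the baselines have already been loaded into a dictionary and that measured values arrive in milliseconds; the `to_ms` and `violations` helpers and the measured numbers are hypothetical.

```python
# Baselines copied from the configuration above (loading the YAML file is left out)
BASELINES = {
    "api_endpoints": {"p50_latency": "100ms", "p95_latency": "500ms", "p99_latency": "1000ms"},
    "database_queries": {"p95_latency": "100ms", "p99_latency": "500ms"},
}


def to_ms(value: str) -> float:
    """Convert a duration string such as '500ms' or '2s' to milliseconds."""
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000
    raise ValueError(f"unsupported duration: {value!r}")


def violations(group, measured_ms):
    """List every measured metric in `group` that exceeds its baseline."""
    limits = BASELINES[group]
    return [
        f"{group}.{name}: {measured_ms[name]:g}ms > {limits[name]}"
        for name in limits
        if name in measured_ms and measured_ms[name] > to_ms(limits[name])
    ]


# Hypothetical measurements for the API tier
print(violations("api_endpoints", {"p50_latency": 80, "p95_latency": 620, "p99_latency": 900}))
```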

3. Monitor User Experience

Track real user metrics:

  • Page load time
  • Time to interactive
  • First contentful paint
  • Cumulative layout shift

4. Correlate with Business Metrics

Link performance to business outcomes:

  • Conversion rates vs. page load time
  • Cart abandonment vs. checkout latency
  • User engagement vs. API response time
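One simple way to quantify such a relationship is a Pearson correlation between a performance metric and a business metric measured over the same periods. The sketch below uses invented numbers purely to show the calculation; they are not measurements, and a strong correlation on real data would still not prove causation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Illustrative (made-up) daily averages: page load time vs. conversion rate
page_load_s = [1.0, 1.5, 2.0, 2.5, 3.0]
conversion_pct = [4.0, 3.6, 3.1, 2.7, 2.2]

print(f"correlation: {pearson(page_load_s, conversion_pct):.3f}")
```

A value near -1 would suggest that slower pages coincide with lower conversion, which is the kind of signal that justifies prioritizing performance work.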
