Distributed Tracing in GitLab

Overview

GitLab provides distributed tracing capabilities through OpenTelemetry integration, enabling teams to monitor application performance, troubleshoot issues, and understand how requests flow through different services and systems.

What is Distributed Tracing?

Distributed tracing allows you to:

  • Track request flows across multiple services and systems
  • Measure timing of each operation in the request path
  • Identify bottlenecks and performance issues
  • Correlate errors with specific operations
  • Visualize service dependencies and relationships

OpenTelemetry Integration

GitLab uses OpenTelemetry, an open-source observability framework that supports a wide array of SDKs and libraries across major programming languages and frameworks.

Key Benefits

  1. Vendor-neutral: Standard format for telemetry data
  2. Language support: SDKs for all major programming languages
  3. Framework support: Automatic instrumentation for popular frameworks
  4. Flexibility: Export to any OpenTelemetry-compatible backend

Current Status (2026)

As of GitLab 17.7, distributed tracing is available as an internal beta feature:

  • Available for testing
  • Not ready for production use
  • Actively being developed

CI/CD Pipeline Tracing

Automatic Instrumentation

GitLab Observability automatically instruments your CI/CD pipelines when enabled, providing visibility into:

  • Pipeline execution flow
  • Job dependencies and timing
  • Stage durations
  • Resource usage

Enabling Pipeline Tracing

Set the GITLAB_OBSERVABILITY_EXPORT variable in your CI/CD configuration:

variables:
  GITLAB_OBSERVABILITY_EXPORT: traces

build:
  stage: build
  script:
    - npm install
    - npm run build

test:
  stage: test
  script:
    - npm test

This exports distributed traces showing:

  • Pipeline execution flow
  • Job dependencies
  • Timing for each stage and job
  • Parent-child relationships between jobs

Benefits of Pipeline Tracing

  1. Performance optimization: Identify slow jobs and stages
  2. Dependency analysis: Understand job relationships
  3. Failure investigation: Trace errors back to specific jobs
  4. Resource planning: Understand resource usage patterns

Application Tracing

Instrumenting Your Application

Configure your application to send telemetry data using standard OpenTelemetry libraries.

Example: Node.js Application

// Install dependencies
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // `url` is used as-is, so it must be the full traces endpoint (including /v1/traces)
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Example: Python Application

# Install dependencies
# pip install opentelemetry-distro opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure the OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces"
)

# Add the span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Create spans
with tracer.start_as_current_span("operation_name"):
    # Your code here
    pass

Example: Go Application

// Install dependencies
// go get go.opentelemetry.io/otel
// go get go.opentelemetry.io/otel/sdk
// go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/trace"
)

// initTracer configures a batching OTLP/HTTP exporter and registers
// the resulting tracer provider globally.
func initTracer() (*trace.TracerProvider, error) {
	exporter, err := otlptracehttp.New(context.Background(),
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

Viewing Traces in GitLab

Accessing the Traces Interface

  1. Navigate to your project in GitLab
  2. Go to Monitor > Traces
  3. View successfully exported traces

Trace Visualization

The GitLab traces interface provides:

  • Timeline view: See how operations relate temporally
  • Span details: Inspect individual operations
  • Service map: Visualize service dependencies
  • Performance metrics: Latency, duration, throughput

Trace Data Structure

Core Concepts

Trace: A complete end-to-end request journey

Trace ID: abc123
 Span: HTTP GET /api/users (200ms)
     Span: Database Query (50ms)
     Span: Cache Lookup (10ms)
     Span: External API Call (100ms)

Span: A single operation within a trace

  • Name: Operation identifier (e.g., "HTTP GET /api/users")
  • Duration: Time taken for the operation
  • Attributes: Key-value metadata
  • Events: Point-in-time markers
  • Status: Success, error, or unknown
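
The trace and span concepts above can be modeled as a small tree. The following is a minimal sketch in plain Python, not the OpenTelemetry data model itself: the `Span` class and its `self_time_ms` method are hypothetical names used only for illustration, and the sketch assumes that child spans do not overlap in time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical, simplified span: one operation within a trace."""
    name: str
    duration_ms: int
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time spent in this span that is not covered by its direct children
        # (assumes the children run one after another, without overlap).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


# The example trace shown earlier: one root span with three child spans.
root = Span(
    "HTTP GET /api/users",
    200,
    children=[
        Span("Database Query", 50),
        Span("Cache Lookup", 10),
        Span("External API Call", 100),
    ],
)

print(root.self_time_ms())  # 200 - (50 + 10 + 100) = 40
```

Self time is often more useful than total duration when looking for the operation that actually consumes the time, because a parent span's duration includes everything its children did.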

Span Attributes

Standard attributes to include:

{
  "http.method": "GET",
  "http.url": "/api/users",
  "http.status_code": 200,
  "db.system": "postgresql",
  "db.statement": "SELECT * FROM users",
  "user.id": "12345",
  "environment": "production"
}

Trace Sampling Strategies

Why Sampling?

Sampling reduces storage costs and performance overhead while maintaining visibility:

  • Cost efficiency: Store only relevant traces
  • Performance: Minimize instrumentation overhead
  • Insight: Retain sufficient data for analysis

Sampling Strategies

1. Head-based Sampling

Decision made at trace start:

const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sampler = new TraceIdRatioBasedSampler(0.1); // Sample 10%
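
To make the head-based decision concrete, here is a minimal Python sketch of a ratio-based sampling rule. `should_sample` is a hypothetical helper, and comparing the low 64 bits of the trace ID against a threshold is an assumption modeled on how ratio-based samplers commonly behave, not a copy of any SDK's implementation.

```python
TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


# The decision depends only on the trace ID, so it is repeatable.
example_trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
first = should_sample(example_trace_id, 0.1)
second = should_sample(example_trace_id, 0.1)
print(first == second)  # True
```

Because the decision is a pure function of the trace ID, every service that applies the same ratio reaches the same verdict for a given trace, which avoids traces with missing spans.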

2. Tail-based Sampling

Decision made after trace completion:

# Keep all error traces, all traces slower than 1s, and 1% of everything else.
# A trace is kept if any policy matches.
processors:
  tail_sampling:
    policies:
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      - name: random-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
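
The same tail-based logic can be sketched in a few lines of Python to show what "decision made after trace completion" means in practice. This is an illustration only: `keep_trace` and the dictionary keys (`status`, `start_ms`, `end_ms`) are hypothetical, and a real collector also has to buffer spans until a trace is considered complete.

```python
import random
from typing import Dict, List, Optional


def keep_trace(
    spans: List[Dict[str, object]],
    latency_threshold_ms: float = 1000.0,
    baseline_rate: float = 0.01,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide whether to keep a finished trace (any matching rule keeps it)."""
    rng = rng or random.Random()
    # Rule 1: always keep traces that contain an error span.
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # Rule 2: keep traces whose end-to-end duration exceeds the threshold.
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start > latency_threshold_ms:
        return True
    # Rule 3: keep a small random share of everything else.
    return rng.random() < baseline_rate


failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 20}]
slow = [{"status": "OK", "start_ms": 0, "end_ms": 1500}]
print(keep_trace(failed), keep_trace(slow))  # True True
```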

3. Adaptive Sampling

Dynamically adjust sampling based on traffic:

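// Illustrative pseudocode: AdaptiveSampler is a hypothetical class,
// not part of the OpenTelemetry SDK.
// Sample 100% during low traffic
// Sample 10% during high traffic
const adaptiveSampler = new AdaptiveSampler({
  minRate: 0.1,
  maxRate: 1.0,
  targetTPS: 1000
});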
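
One plausible way to derive the rate is sketched below in Python. The rule is an assumption, not a documented algorithm: it interprets the target as the desired number of sampled traces per second and clamps the result between the minimum and maximum rates. `adaptive_rate` is a hypothetical helper.

```python
def adaptive_rate(
    observed_tps: float,
    target_tps: float = 1000.0,
    min_rate: float = 0.1,
    max_rate: float = 1.0,
) -> float:
    """Return a sampling rate that aims for target_tps sampled traces per second."""
    if observed_tps <= 0:
        # No traffic observed yet: sample everything that arrives.
        return max_rate
    # Clamp the ideal rate so it never leaves the configured range.
    return max(min_rate, min(max_rate, target_tps / observed_tps))


print(adaptive_rate(500))    # low traffic: 1.0
print(adaptive_rate(2000))   # moderate traffic: 0.5
print(adaptive_rate(50000))  # high traffic: clamped to 0.1
```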

Best Practices

  1. Always sample errors: Never drop error traces
  2. Sample slow requests: Capture performance issues
  3. Use stratified sampling: Different rates for different services
  4. Monitor sampling rates: Ensure sufficient coverage

Service Dependency Mapping

Automatic Discovery

OpenTelemetry automatically discovers service dependencies through trace propagation:

User Request -> API Gateway -> Auth Service -> Database
                     |
                     +-> User Service -> Cache
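
Dependency edges can be derived from parent-child span relationships: whenever a span's parent belongs to a different service, the parent's service calls the child's service. The Python sketch below illustrates the idea with hypothetical span records (`span_id`, `parent_id`, `service`) that follow one possible reading of the diagram above; it is not how any particular backend computes its service map.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

SpanRecord = Dict[str, Optional[str]]


def service_edges(spans: List[SpanRecord]) -> Counter:
    """Count caller -> callee edges between different services."""
    by_id = {s["span_id"]: s for s in spans}
    edges: Counter = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only cross-service parent/child pairs are dependencies.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


spans = [
    {"span_id": "a", "parent_id": None, "service": "api-gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth-service"},
    {"span_id": "c", "parent_id": "b", "service": "database"},
    {"span_id": "d", "parent_id": "a", "service": "user-service"},
    {"span_id": "e", "parent_id": "d", "service": "cache"},
]

for (caller, callee), count in sorted(service_edges(spans).items()):
    print(f"{caller} -> {callee} ({count})")
```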

Benefits

  1. Architecture visualization: See how services interact
  2. Dependency analysis: Identify critical dependencies
  3. Performance troubleshooting: Find bottleneck services
  4. Impact assessment: Understand service failure impact

Performance Profiling

Using Traces for Profiling

Identify performance bottlenecks:

// Instrument expensive operations
// `db` and `paymentService` stand in for your application's own clients.
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      // Database query: return the result so the next step can use it
      const order = await tracer.startActiveSpan('db-query', async (dbSpan) => {
        try {
          return await db.query('SELECT * FROM orders WHERE id = ?', orderId);
        } finally {
          dbSpan.end();
        }
      });

      // External API call
      await tracer.startActiveSpan('payment-api', async (apiSpan) => {
        try {
          await paymentService.charge(order);
        } finally {
          apiSpan.end();
        }
      });
    } finally {
      span.end();
    }
  });
}

Analyzing Performance

Use traces to identify:

  • Slow database queries: Look for high-duration DB spans
  • N+1 queries: Multiple sequential DB calls
  • External API latency: Third-party service delays
  • Inefficient algorithms: Unexpectedly long processing times
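
As an illustration of the N+1 check, the Python sketch below counts how often the same `db.statement` appears within one trace. The function name, the span layout, and the threshold of five repeats are assumptions chosen for the example, and the approach only works when statements are parameterized so that repeated queries share identical text.

```python
from collections import Counter
from typing import Dict, List


def find_repeated_queries(spans: List[dict], min_repeats: int = 5) -> Dict[str, int]:
    """Return DB statements that occur at least min_repeats times in a trace."""
    statements = (
        s["attributes"]["db.statement"]
        for s in spans
        if "db.statement" in s.get("attributes", {})
    )
    counts = Counter(statements)
    return {stmt: n for stmt, n in counts.items() if n >= min_repeats}


# Hypothetical trace: one list query followed by one lookup per row.
trace_spans = [
    {"name": "GET /api/orders", "attributes": {"http.method": "GET"}},
    {"name": "db-query", "attributes": {"db.statement": "SELECT * FROM orders"}},
]
trace_spans += [
    {"name": "db-query", "attributes": {"db.statement": "SELECT * FROM users WHERE id = ?"}}
    for _ in range(20)
]

suspects = find_repeated_queries(trace_spans)
print(suspects)  # {'SELECT * FROM users WHERE id = ?': 20}
```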

Production Best Practices

1. Structured Instrumentation

Follow consistent patterns:

// Good: Structured, searchable attributes
span.setAttribute('user.id', userId);
span.setAttribute('order.total', orderTotal);
span.setAttribute('payment.method', 'credit_card');

// Bad: Unstructured string concatenation
span.setAttribute('message', `User ${userId} paid ${orderTotal} with credit card`);

2. Error Handling

Always record errors:

try {
  await riskyOperation();
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
  throw error;
} finally {
  span.end();
}

3. Context Propagation

Ensure trace context flows through your system:

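const { context, propagation } = require('@opentelemetry/api');

// Extract context from incoming request
const incomingContext = propagation.extract(context.active(), req.headers);

// Use context for outgoing request
await context.with(incomingContext, async () => {
  // inject() writes the trace headers into the carrier object; it returns nothing
  const headers = {};
  propagation.inject(context.active(), headers);
  await fetch('http://api.example.com', { headers });
});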
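
What actually travels between services is usually a header. With the W3C Trace Context format, that is `traceparent`, made of a version, a 32-character trace ID, a 16-character parent span ID, and trace flags. The Python sketch below parses and formats version `00` headers; the `SpanContext` type and both function names are hypothetical, and in practice the OpenTelemetry propagators handle this for you.

```python
import re
from typing import NamedTuple, Optional

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header to attach to an outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(ctx.trace_id, ctx.sampled)
```

A downstream service keeps the trace ID, replaces the span ID with its own, and forwards the header, which is what ties spans from different services into one trace.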

4. Resource Attributes

Include service-level metadata:

const resource = Resource.default().merge(
  new Resource({
    'service.name': 'user-api',
    'service.version': '1.2.3',
    'deployment.environment': 'production',
    'host.name': process.env.HOSTNAME
  })
);

5. Span Lifecycle Management

Properly manage span lifecycles:

// Always end spans, even on error
const span = tracer.startSpan('operation');
try {
  await doWork();
} finally {
  span.end(); // Guaranteed to execute
}
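
The same guarantee can be packaged once and reused. The Python sketch below uses a context manager so that every span is finished and errors are recorded even when the wrapped code raises. It is a simplified illustration: `traced` and the dictionary-based span are hypothetical and do not use the OpenTelemetry API.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def traced(name: str, finished: List[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Record a span-like dict in `finished`, whether or not the body raises."""
    span: Dict[str, object] = {"name": name, "status": "OK"}
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Record the failure, then let the caller handle the exception.
        span["status"] = "ERROR"
        span["exception"] = repr(exc)
        raise
    finally:
        # Runs on success and on error, so the span is always ended.
        span["duration_ms"] = (time.perf_counter() - start) * 1000.0
        finished.append(span)


finished_spans: List[Dict[str, object]] = []

try:
    with traced("risky-operation", finished_spans):
        raise ValueError("payment declined")
except ValueError:
    pass

with traced("safe-operation", finished_spans) as span:
    span["order.id"] = "12345"

print([(s["name"], s["status"]) for s in finished_spans])
```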

Troubleshooting with Traces

Common Scenarios

1. Slow Request Investigation

Query: duration > 1000ms
Filter: http.route = "/api/orders"
Result: Identify which span is causing slowness

2. Error Root Cause Analysis

Query: status.code = ERROR
Filter: service.name = "payment-service"
Result: Find common attributes across failing traces
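
Finding common attributes across failing traces amounts to intersecting their key-value pairs. A minimal Python sketch, assuming the failing spans have already been exported as plain dictionaries (the data and the `common_attributes` helper are hypothetical):

```python
from typing import Dict, List


def common_attributes(failing: List[Dict[str, object]]) -> Dict[str, object]:
    """Return the key/value pairs shared by every attribute set."""
    if not failing:
        return {}
    common = dict(failing[0])
    for attrs in failing[1:]:
        # Keep only pairs that this span also has with the same value.
        common = {k: v for k, v in common.items() if k in attrs and attrs[k] == v}
    return common


failing_spans = [
    {"service.name": "payment-service", "payment.method": "credit_card", "region": "eu-west"},
    {"service.name": "payment-service", "payment.method": "credit_card", "region": "us-east"},
    {"service.name": "payment-service", "payment.method": "credit_card", "region": "eu-west"},
]

shared = common_attributes(failing_spans)
print(shared)
```

In this made-up data set the region varies but the payment method does not, which points the investigation at the credit card path rather than at a regional outage.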

3. Dependency Failure Impact

Query: service.name = "database"
Filter: status.code = ERROR
Result: See which upstream services are affected

Integration with Other Observability Tools

Correlation with Logs

Link traces to logs using trace IDs:

logger.info('Processing order', {
  traceId: span.spanContext().traceId,
  spanId: span.spanContext().spanId,
  orderId: orderId
});
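
For Python services, a similar effect can be achieved with the standard `logging` module. The sketch below attaches fixed, hypothetical trace and span IDs through a logging filter. In a real application the IDs would come from the active span, and `TraceContextFilter` is an illustrative name rather than a library class.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace and span IDs (as lowercase hex) to every log record."""

    def __init__(self, trace_id: int, span_id: int) -> None:
        super().__init__()
        self.trace_id = format(trace_id, "032x")
        self.span_id = format(span_id, "016x")

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
# Hypothetical IDs; normally read from the current span context.
logger.addFilter(TraceContextFilter(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))

logger.info("Processing order %s", 42)
print(stream.getvalue(), end="")
```

Emitting the IDs as dedicated fields rather than inside the message text keeps them searchable, so a log line can be matched to its trace with a simple equality filter.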

Correlation with Metrics

Add trace exemplars to metrics:

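// The third argument to add() is an OpenTelemetry Context, not an exemplar object.
// Passing a context that carries the span lets an exemplar-capable SDK and
// backend link the measurement to the trace.
const { context, trace } = require('@opentelemetry/api');

counter.add(
  1,
  { 'http.method': 'GET', 'http.route': '/api/users' },
  trace.setSpan(context.active(), span)
);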

Cost Optimization

Strategies to Reduce Costs

  1. Intelligent sampling: Sample based on value, not randomly
  2. Attribute filtering: Remove high-cardinality attributes
  3. Retention policies: Keep recent traces readily queryable, and archive or delete older traces
  4. Aggregation: Pre-aggregate common queries

Example: Cost-Effective Sampling

# Sample all errors, all slow requests, 10% of medium requests, and 1% of the rest
processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Sample all slow requests (>1s)
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 10% of medium requests (>500ms): both sub-policies must match
      - name: medium-requests
        type: and
        and:
          and_sub_policy:
            - name: medium-latency
              type: latency
              latency: {threshold_ms: 500}
            - name: medium-rate
              type: probabilistic
              probabilistic: {sampling_percentage: 10}
      # Sample 1% of all traffic as a baseline
      - name: fast-requests
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
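
To estimate how much a tiered policy like this saves, multiply each tier's share of traffic by its sampling rate and add the results. The Python sketch below does that for a made-up traffic mix. The shares are assumptions, not measurements, and the tiers are treated as mutually exclusive for simplicity.

```python
from typing import Dict, Tuple


def expected_kept_fraction(tiers: Dict[str, Tuple[float, float]]) -> float:
    """tiers maps a tier name to (share_of_traffic, sampling_rate)."""
    total_share = sum(share for share, _ in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in tiers.values())


# Hypothetical traffic mix: 2% errors, 3% slow, 10% medium, 85% fast.
traffic = {
    "errors": (0.02, 1.00),
    "slow": (0.03, 1.00),
    "medium": (0.10, 0.10),
    "fast": (0.85, 0.01),
}

kept = expected_kept_fraction(traffic)
print(f"{kept:.2%} of traces retained")
```

Under these assumed shares, roughly 7% of traces are stored while every error and every slow request is still retained.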

Future Roadmap

GitLab's tracing capabilities are evolving:

  • OpenTelemetry integration (beta)
  • Production-ready tracing (planned)
  • Advanced trace analytics (planned)
  • Trace-based alerting (planned)
  • Service level objectives (SLOs) based on traces (planned)
