
Distributed Tracing in GitLab

Overview

GitLab provides distributed tracing capabilities through OpenTelemetry integration, enabling teams to monitor application performance, troubleshoot issues, and understand how requests flow through different services and systems.

What is Distributed Tracing?

Distributed tracing allows you to:

Track request flows across multiple services and systems
Measure timing of each operation in the request path
Identify bottlenecks and performance issues
Correlate errors with specific operations
Visualize service dependencies and relationships

OpenTelemetry Integration

GitLab uses OpenTelemetry, an open-source observability framework that supports a wide array of SDKs and libraries across major programming languages and frameworks.

Key Benefits

Vendor-neutral: Standard format for telemetry data
Language support: SDKs for all major programming languages
Framework support: Automatic instrumentation for popular frameworks
Flexibility: Export to any OpenTelemetry-compatible backend

Current Status (2026)

As of GitLab 17.7, distributed tracing is available as an internal beta feature:

Available for testing
Not ready for production use
Actively being developed

CI/CD Pipeline Tracing

Automatic Instrumentation

GitLab Observability automatically instruments your CI/CD pipelines when enabled, providing visibility into:

Pipeline execution flow
Job dependencies and timing
Stage durations
Resource usage

Enabling Pipeline Tracing

Set the GITLAB_OBSERVABILITY_EXPORT variable in your CI/CD configuration:

variables:
  GITLAB_OBSERVABILITY_EXPORT: traces

build:
  stage: build
  script:
    - npm install
    - npm run build

test:
  stage: test
  script:
    - npm test

This exports distributed traces showing:

Pipeline execution flow
Job dependencies
Timing for each stage and job
Parent-child relationships between jobs

Benefits of Pipeline Tracing

Performance optimization: Identify slow jobs and stages
Dependency analysis: Understand job relationships
Failure investigation: Trace errors back to specific jobs
Resource planning: Understand resource usage patterns

Application Tracing

Instrumenting Your Application

Configure your application to send telemetry data using standard OpenTelemetry libraries.

Example: Node.js Application

// Install dependencies
// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // The signal-specific variable holds the full URL, including the /v1/traces path
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Example: Python Application

# Install dependencies
# pip install opentelemetry-distro opentelemetry-exporter-otlp

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure the OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces"
)

# Add the span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Create spans
with tracer.start_as_current_span("operation_name"):
    # Your code here
    pass

Example: Go Application

// Install dependencies
// go get go.opentelemetry.io/otel
// go get go.opentelemetry.io/otel/sdk
// go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() (*trace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(context.Background(),
        otlptracehttp.WithEndpoint("localhost:4318"),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)

    return tp, nil
}

Viewing Traces in GitLab

Accessing the Traces Interface

Navigate to your project in GitLab
Go to Monitor > Traces
View successfully exported traces

Trace Visualization

The GitLab traces interface provides:

Timeline view: See how operations relate temporally
Span details: Inspect individual operations
Service map: Visualize service dependencies
Performance metrics: Latency, duration, throughput

Trace Data Structure

Core Concepts

Trace: A complete end-to-end request journey

Trace ID: abc123
 Span: HTTP GET /api/users (200ms)
     Span: Database Query (50ms)
     Span: Cache Lookup (10ms)
     Span: External API Call (100ms)

Span: A single operation within a trace

Name: Operation identifier (e.g., "HTTP GET /api/users")
Duration: Time taken for the operation
Attributes: Key-value metadata
Events: Point-in-time markers
Status: Unset, Ok, or Error
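To make the trace and span fields above concrete, the following minimal Python sketch models the example trace with a plain dataclass. The `Span` class, its field names, and the `self_time_ms` helper are illustrative assumptions, not types from the OpenTelemetry SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """Illustrative span model (hypothetical, not the OpenTelemetry SDK class)."""
    name: str
    duration_ms: int
    attributes: Dict[str, object] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)
    status: str = "unset"  # OpenTelemetry status codes are Unset, Ok, and Error
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time in this span not covered by its children.

        Assumes the children run one after another, as in the example trace.
        """
        return self.duration_ms - sum(c.duration_ms for c in self.children)


# The example trace: one root span with three child spans.
root = Span(
    name="HTTP GET /api/users",
    duration_ms=200,
    status="ok",
    children=[
        Span("Database Query", 50),
        Span("Cache Lookup", 10),
        Span("External API Call", 100),
    ],
)
```

In this model, the root span's own work is whatever remains after subtracting the child durations, which is often the first number to check when a request is slow but none of its child spans are.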

Span Attributes

Standard attributes to include:

{
  "http.method": "GET",
  "http.target": "/api/users",
  "http.status_code": 200,
  "db.system": "postgresql",
  "db.statement": "SELECT * FROM users",
  "user.id": "12345",
  "deployment.environment": "production"
}

Trace Sampling Strategies

Why Sampling?

Sampling reduces storage costs and performance overhead while maintaining visibility:

Cost efficiency: Store only relevant traces
Performance: Minimize instrumentation overhead
Insight: Retain sufficient data for analysis

Sampling Strategies

1. Head-based Sampling

Decision made at trace start:

const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sampler = new TraceIdRatioBasedSampler(0.1); // Sample 10% of traces
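A ratio-based head sampler can decide from the trace ID alone, so every service that sees the same trace reaches the same verdict without coordination. The Python sketch below is a simplified illustration of that idea, not the SDK's implementation; the function name `should_sample` is an assumption, and real samplers typically also honor the parent span's decision.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head-based decision derived from the trace ID.

    Simplified sketch: compare the low 64 bits of the 128-bit trace ID
    against a threshold proportional to the sampling ratio.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    threshold = round(ratio * (1 << 64))
    low_64_bits = trace_id & ((1 << 64) - 1)
    return low_64_bits < threshold
```

Because the decision is a pure function of the trace ID, repeating it in a downstream service cannot produce a partially sampled trace.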

2. Tail-based Sampling

Decision made after trace completion:

# Sample all errors, all traces slower than 1 second, and 1% of everything else
processors:
  tail_sampling:
    policies:
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      - name: random-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
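The three policies above combine with OR semantics: a completed trace is kept if any policy matches. The following Python sketch mirrors that logic under stated assumptions; the span dictionaries (`start_ms`, `end_ms`, `status`) and the `keep_trace` function are hypothetical and only illustrate the decision, not the collector's internals.

```python
import random
from typing import Callable, Dict, List


def keep_trace(
    spans: List[Dict],
    latency_threshold_ms: float = 1000,
    sampling_percentage: float = 1,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide on a completed trace; any matching policy keeps it."""
    if not spans:
        return False
    # Policy 1: keep every trace that contains an error span.
    if any(s.get("status") == "ERROR" for s in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start > latency_threshold_ms:
        return True
    # Policy 3: keep a random percentage of everything else.
    return rng() * 100 < sampling_percentage
```

The `rng` parameter exists only so the random policy can be made deterministic in tests.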

3. Adaptive Sampling

Dynamically adjust sampling based on traffic:

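// Illustrative pseudocode: AdaptiveSampler is a hypothetical class, not part
// of the OpenTelemetry SDK. Implement it as a custom sampler or use one
// provided by your vendor.
// Sample 100% during low traffic
// Sample 10% during high traffic
const adaptiveSampler = new AdaptiveSampler({
  minRate: 0.1,
  maxRate: 1.0,
  targetTPS: 1000
});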
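One way to read the three parameters is as a clamp: aim for roughly `targetTPS` sampled traces per second, but never drop below `minRate` or exceed `maxRate`. The Python sketch below shows that arithmetic; this interpretation of the parameters and the `adaptive_rate` function are assumptions made for illustration.

```python
def adaptive_rate(
    observed_tps: float,
    target_tps: float = 1000,
    min_rate: float = 0.1,
    max_rate: float = 1.0,
) -> float:
    """Sampling rate that keeps about target_tps traces per second.

    The result is clamped to the [min_rate, max_rate] range.
    """
    if observed_tps <= 0:
        return max_rate  # no traffic observed: sample everything
    return max(min_rate, min(max_rate, target_tps / observed_tps))
```

With these defaults, 500 requests per second are sampled at 100%, 2,000 at 50%, and anything above 10,000 at the 10% floor.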

Best Practices

Always sample errors: Never drop error traces
Sample slow requests: Capture performance issues
Use stratified sampling: Different rates for different services
Monitor sampling rates: Ensure sufficient coverage

Service Dependency Mapping

Automatic Discovery

Because trace context propagates across service boundaries, a tracing backend can derive service dependencies from the parent-child relationships between spans:

User Request → API Gateway → Auth Service → Database
                    ↓
               User Service → Cache
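A dependency map of this kind can be derived from span parent-child links: whenever a child span belongs to a different service than its parent, one service called another. The Python sketch below shows the idea; the span records and the `service_edges` function are made up for illustration and follow one reading of the diagram above.

```python
from typing import Dict, List, Set, Tuple


def service_edges(spans: List[Dict]) -> Set[Tuple[str, str]]:
    """Return (caller, callee) service pairs derived from parent/child spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # An edge exists only where a call crosses a service boundary.
        if parent is not None and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges


# Hypothetical spans from a single trace.
example_spans = [
    {"span_id": "a", "parent_id": None, "service": "api-gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth-service"},
    {"span_id": "c", "parent_id": "b", "service": "database"},
    {"span_id": "d", "parent_id": "a", "service": "user-service"},
    {"span_id": "e", "parent_id": "d", "service": "cache"},
]
```

Aggregating these edges over many traces yields the service map, and counting how often each edge appears gives a rough measure of how critical a dependency is.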

Benefits

Architecture visualization: See how services interact
Dependency analysis: Identify critical dependencies
Performance troubleshooting: Find bottleneck services
Impact assessment: Understand service failure impact

Performance Profiling

Using Traces for Profiling

Identify performance bottlenecks:

// Instrument expensive operations
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

// db and paymentService are application-specific clients (assumed here)
async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      // Database query: return the result so later steps can use it
      const order = await tracer.startActiveSpan('db-query', async (dbSpan) => {
        try {
          return await db.query('SELECT * FROM orders WHERE id = ?', orderId);
        } finally {
          dbSpan.end();
        }
      });

      // External API call
      await tracer.startActiveSpan('payment-api', async (apiSpan) => {
        try {
          await paymentService.charge(order);
        } finally {
          apiSpan.end();
        }
      });

      return order;
    } finally {
      span.end(); // End the parent span even if a step throws
    }
  });
}

Analyzing Performance

Use traces to identify:

Slow database queries: Look for high-duration DB spans
N+1 queries: Multiple sequential DB calls
External API latency: Third-party service delays
Inefficient algorithms: Unexpectedly long processing times
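As one example of the checks listed above, an N+1 pattern shows up as many database spans with the same statement under the same parent span. The Python sketch below flags such groups in exported span data; the span dictionary layout, the `find_n_plus_one` name, and the default threshold of five repeats are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_n_plus_one(spans: List[Dict], min_repeats: int = 5) -> Dict[Tuple[str, str], int]:
    """Flag (parent_id, statement) pairs repeated at least min_repeats times."""
    counts = Counter(
        (s["parent_id"], s["attributes"]["db.statement"])
        for s in spans
        if "db.statement" in s.get("attributes", {})
    )
    return {key: n for key, n in counts.items() if n >= min_repeats}
```

A flagged pair usually points at a loop that issues one query per item and could be replaced by a single batched query.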

Production Best Practices

1. Structured Instrumentation

Follow consistent patterns:

// Good: Structured, searchable attributes
span.setAttribute('user.id', userId);
span.setAttribute('order.total', orderTotal);
span.setAttribute('payment.method', 'credit_card');

// Bad: Unstructured string concatenation
span.setAttribute('message', `User ${userId} paid ${orderTotal} with credit card`);

2. Error Handling

Always record errors:

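const { SpanStatusCode } = require('@opentelemetry/api');

// span is the span created for this operation
try {
  await riskyOperation();
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
  throw error;
} finally {
  span.end();
}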

3. Context Propagation

Ensure trace context flows through your system:

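const { context, propagation } = require('@opentelemetry/api');

// Extract context from incoming request
const incomingContext = propagation.extract(context.active(), req.headers);

// Use context for outgoing request
await context.with(incomingContext, async () => {
  // inject() writes the trace headers into the carrier object in place
  const headers = {};
  propagation.inject(context.active(), headers);
  await fetch('http://api.example.com', { headers });
});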
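With the default W3C Trace Context propagator, the context travels in a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The Python sketch below parses and formats that header using only the standard library; it handles the version `00` layout only, and the `TraceParent`, `parse_traceparent`, and `format_traceparent` names are illustrative rather than part of any SDK.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(tp: TraceParent) -> str:
    """Serialize the context for an outgoing request."""
    flags = "01" if tp.sampled else "00"
    return f"00-{tp.trace_id}-{tp.parent_id}-{flags}"
```

In a real service the propagator does this for you; the sketch is mainly useful for debugging headers captured from a proxy or load balancer.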

4. Resource Attributes

Include service-level metadata:

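// Resource class API from @opentelemetry/resources 1.x; 2.x replaces it with
// the resourceFromAttributes() and defaultResource() functions
const { Resource } = require('@opentelemetry/resources');

const resource = Resource.default().merge(
  new Resource({
    'service.name': 'user-api',
    'service.version': '1.2.3',
    'deployment.environment': 'production',
    'host.name': process.env.HOSTNAME
  })
);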

5. Span Lifecycle Management

Properly manage span lifecycles:

// Always end spans, even on error
const span = tracer.startSpan('operation');
try {
  await doWork();
} finally {
  span.end(); // Guaranteed to execute
}
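The same guarantee can be packaged once and reused. The Python sketch below wraps the try/finally pattern in a context manager that records any exception, marks the span as failed, and always ends it. `RecordedSpan` is a toy stand-in for a real span, and `traced` is a hypothetical helper name.

```python
import time
from contextlib import contextmanager


class RecordedSpan:
    """Toy span used for illustration; not an OpenTelemetry SDK type."""

    def __init__(self, name: str):
        self.name = name
        self.status = "UNSET"
        self.exception = None
        self.start_time = time.monotonic()
        self.end_time = None

    def end(self) -> None:
        if self.end_time is None:  # ending a span twice is a no-op
            self.end_time = time.monotonic()


@contextmanager
def traced(name: str):
    """Yield a span that is ended on exit, whether or not the body raises."""
    span = RecordedSpan(name)
    try:
        yield span
    except Exception as exc:
        span.exception = exc
        span.status = "ERROR"
        raise  # the error still reaches the caller
    finally:
        span.end()
```

A caller writes `with traced("operation") as span:` and cannot forget to end the span, because the cleanup lives in one place.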

Troubleshooting with Traces

Common Scenarios

1. Slow Request Investigation

Query: duration > 1000ms
Filter: http.route = "/api/orders"
Result: Identify which span is causing slowness

2. Error Root Cause Analysis

Query: status.code = ERROR
Filter: service.name = "payment-service"
Result: Find common attributes across failing traces

3. Dependency Failure Impact

Query: service.name = "database"
Filter: status.code = ERROR
Result: See which upstream services are affected
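Each scenario above combines a threshold or status condition with an attribute filter. The Python sketch below applies the same two-step query to exported span data held in memory; the `query_spans` function and the span dictionary layout are assumptions, and the GitLab interface has its own query controls.

```python
from typing import Dict, Iterable, List, Optional


def query_spans(
    spans: Iterable[Dict],
    min_duration_ms: Optional[float] = None,
    filters: Optional[Dict[str, object]] = None,
) -> List[Dict]:
    """Return spans above the duration threshold that match every filter."""
    filters = filters or {}
    results = []
    for span in spans:
        if min_duration_ms is not None and span["duration_ms"] <= min_duration_ms:
            continue
        attrs = span.get("attributes", {})
        if all(attrs.get(key) == value for key, value in filters.items()):
            results.append(span)
    return results
```

For the slow request investigation, this corresponds to `query_spans(spans, min_duration_ms=1000, filters={"http.route": "/api/orders"})`.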

Integration with Other Observability Tools

Correlation with Logs

Link traces to logs using trace IDs:

logger.info('Processing order', {
  traceId: span.spanContext().traceId,
  spanId: span.spanContext().spanId,
  orderId: orderId
});
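The same correlation can be applied to every log line automatically instead of at each call site. The Python sketch below uses a standard `logging.Filter` to stamp records with the current IDs. The `get_ids` callback is a placeholder assumption for however your tracing library exposes the active span, and `build_logger` is a hypothetical helper.

```python
import io
import logging
from typing import Callable, Tuple


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every log record."""

    def __init__(self, get_ids: Callable[[], Tuple[str, str]]):
        super().__init__()
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = self._get_ids()
        return True  # never drop the record, only enrich it


def build_logger(stream, get_ids: Callable[[], Tuple[str, str]]) -> logging.Logger:
    """Create a logger whose output includes the trace and span IDs."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter(get_ids))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Searching the log backend for a trace ID then returns every line written while that trace was active.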

Correlation with Metrics

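Exemplars link a metric data point to the trace that was active when it was recorded. Record the measurement in the span's context so that an SDK and backend with exemplar support can attach the trace and span IDs (support varies by SDK and backend):

const { context, trace } = require('@opentelemetry/api');

// The third argument is a Context, not a plain object of IDs
counter.add(1, {
  'http.method': 'GET',
  'http.route': '/api/users'
}, trace.setSpan(context.active(), span));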

Cost Optimization

Strategies to Reduce Costs

Intelligent sampling: Sample based on value, not randomly
Attribute filtering: Remove high-cardinality attributes
Retention policies: Keep recent traces readily queryable, and archive or delete older ones
Aggregation: Pre-aggregate common queries

Example: Cost-Effective Sampling

# Sample all errors, all slow requests (>1s), 10% of medium requests (>500ms),
# and 1% of everything else. A trace is kept if any policy matches.
processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests (>1s)
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 10% of medium requests (>500ms); an "and" policy is needed
      # to combine a latency condition with a sampling percentage
      - name: medium-requests
        type: and
        and:
          and_sub_policy:
            - name: medium-latency
              type: latency
              latency: {threshold_ms: 500}
            - name: medium-sample
              type: probabilistic
              probabilistic: {sampling_percentage: 10}

      # Sample 1% of all remaining requests
      - name: fast-requests
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
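To estimate what a policy set like this stores, weight each traffic class by its keep rate. The Python sketch below does that arithmetic, assuming the classes do not overlap, that policies combine with OR semantics, and that errors and slow requests are always kept. The `retained_fraction` function and the traffic mix in the usage note are illustrative assumptions, not measured data.

```python
def retained_fraction(
    error_share: float,
    slow_share: float,
    medium_share: float,
    medium_rate: float = 0.10,
    baseline_rate: float = 0.01,
) -> float:
    """Fraction of all traces kept, given each class's share of traffic."""
    rest_share = 1.0 - error_share - slow_share - medium_share
    if rest_share < -1e-12:
        raise ValueError("shares must not sum to more than 1")
    rest_share = max(rest_share, 0.0)
    # Medium traces are kept if either the 10% policy or the baseline fires.
    medium_keep = 1 - (1 - medium_rate) * (1 - baseline_rate)
    return (
        error_share            # errors: always kept
        + slow_share           # slow requests: always kept
        + medium_share * medium_keep
        + rest_share * baseline_rate
    )
```

For a hypothetical mix of 1% errors, 2% slow, and 7% medium requests, `retained_fraction(0.01, 0.02, 0.07)` comes to about 0.047, so under 5% of traces are stored while every error and slow trace is retained.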

Future Roadmap

GitLab's tracing capabilities are evolving:

OpenTelemetry integration (beta)
Production-ready tracing (planned)
Advanced trace analytics (planned)
Trace-based alerting (planned)
Service level objectives (SLOs) based on traces (planned)
