Distributed Tracing in GitLab
Overview
GitLab provides distributed tracing capabilities through OpenTelemetry integration, enabling teams to monitor application performance, troubleshoot issues, and understand how requests flow through different services and systems.
What is Distributed Tracing?
Distributed tracing allows you to:
- Track request flows across multiple services and systems
- Measure timing of each operation in the request path
- Identify bottlenecks and performance issues
- Correlate errors with specific operations
- Visualize service dependencies and relationships
OpenTelemetry Integration
GitLab uses OpenTelemetry, an open-source observability framework that provides SDKs and instrumentation libraries for most major programming languages and frameworks.
Key Benefits
- Vendor-neutral: Standard format for telemetry data
- Language support: SDKs for all major programming languages
- Framework support: Automatic instrumentation for popular frameworks
- Flexibility: Export to any OpenTelemetry-compatible backend
Current Status (2026)
As of GitLab 17.7, distributed tracing is available as an internal beta feature:
- Available for testing
- Not ready for production use
- Actively being developed
CI/CD Pipeline Tracing
Automatic Instrumentation
GitLab Observability automatically instruments your CI/CD pipelines when enabled, providing visibility into:
- Pipeline execution flow
- Job dependencies and timing
- Stage durations
- Resource usage
Enabling Pipeline Tracing
Set the GITLAB_OBSERVABILITY_EXPORT variable in your CI/CD configuration:
    variables:
      GITLAB_OBSERVABILITY_EXPORT: traces

    build:
      stage: build
      script:
        - npm install
        - npm run build

    test:
      stage: test
      script:
        - npm test
This exports distributed traces showing:
- Pipeline execution flow
- Job dependencies
- Timing for each stage and job
- Parent-child relationships between jobs
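To make "timing for each stage and job" concrete, the following minimal Python sketch computes the wall-clock duration of each stage from job timings. The job records and their field names (`stage`, `start_s`, `end_s`) are hypothetical; they are not the format GitLab exports.

```python
def stage_durations(jobs):
    """Return {stage: wall-clock seconds} from per-job start and end times.

    A stage lasts from its earliest job start to its latest job end,
    because jobs in the same stage can run in parallel.
    """
    windows = {}
    for job in jobs:
        start, end = windows.get(job["stage"], (job["start_s"], job["end_s"]))
        windows[job["stage"]] = (min(start, job["start_s"]), max(end, job["end_s"]))
    return {stage: end - start for stage, (start, end) in windows.items()}


# Hypothetical job timings, in seconds since the pipeline started
pipeline_jobs = [
    {"name": "compile", "stage": "build", "start_s": 0, "end_s": 40},
    {"name": "lint", "stage": "build", "start_s": 5, "end_s": 20},
    {"name": "unit", "stage": "test", "start_s": 40, "end_s": 100},
]
durations = stage_durations(pipeline_jobs)
```

Summing job durations instead would overstate a stage whose jobs run concurrently, which is why the sketch uses the earliest start and latest end.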
Benefits of Pipeline Tracing
- Performance optimization: Identify slow jobs and stages
- Dependency analysis: Understand job relationships
- Failure investigation: Trace errors back to specific jobs
- Resource planning: Understand resource usage patterns
Application Tracing
Instrumenting Your Application
Configure your application to send telemetry data using standard OpenTelemetry libraries.
Example: Node.js Application
    // Install dependencies:
    // npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        // A URL passed here is used as-is, so it must include the /v1/traces path.
        // OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is the signal-specific variable that carries the full URL.
        url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();
Example: Python Application
    # Install dependencies:
    # pip install opentelemetry-distro opentelemetry-exporter-otlp

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Configure the tracer provider
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Configure the OTLP exporter
    otlp_exporter = OTLPSpanExporter(
        endpoint="http://localhost:4318/v1/traces"
    )

    # Add the span processor
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(otlp_exporter)
    )

    # Create spans
    with tracer.start_as_current_span("operation_name"):
        # Your code here
        pass
Example: Go Application
    // Install dependencies:
    // go get go.opentelemetry.io/otel
    // go get go.opentelemetry.io/otel/sdk
    // go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
        "go.opentelemetry.io/otel/sdk/trace"
    )

    // initTracer registers a global tracer provider that batches spans to a
    // local OTLP/HTTP endpoint. Call Shutdown on the returned provider before
    // the program exits so buffered spans are flushed.
    func initTracer() (*trace.TracerProvider, error) {
        exporter, err := otlptracehttp.New(context.Background(),
            otlptracehttp.WithEndpoint("localhost:4318"),
            otlptracehttp.WithInsecure(),
        )
        if err != nil {
            return nil, err
        }

        tp := trace.NewTracerProvider(
            trace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        return tp, nil
    }
Viewing Traces in GitLab
Accessing the Traces Interface
- Navigate to your project in GitLab
- Go to Monitor > Traces
- View successfully exported traces
Trace Visualization
The GitLab traces interface provides:
- Timeline view: See how operations relate temporally
- Span details: Inspect individual operations
- Service map: Visualize service dependencies
- Performance metrics: Latency, duration, throughput
Trace Data Structure
Core Concepts
Trace: A complete end-to-end request journey
    Trace ID: abc123
    └── Span: HTTP GET /api/users (200ms)
        ├── Span: Database Query (50ms)
        ├── Span: Cache Lookup (10ms)
        └── Span: External API Call (100ms)
Span: A single operation within a trace
- Name: Operation identifier (e.g., "HTTP GET /api/users")
- Duration: Time taken for the operation
- Attributes: Key-value metadata
- Events: Point-in-time markers
- Status: Unset, Ok, or Error (the three OpenTelemetry status codes)
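The concepts above can be sketched as a small data model. This is a minimal Python illustration, not GitLab's or OpenTelemetry's schema: the class, the field names (`parent_id`, `start_ms`, `end_ms`), the span IDs, and the timestamps are all assumptions chosen to mirror the example trace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; field names are assumptions."""
    trace_id: str
    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None           # None marks the root span
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "unset"                     # "unset", "ok", or "error"

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children_of(spans, parent):
    """Return the direct children of `parent`, ordered by start time."""
    kids = [s for s in spans if s.parent_id == parent.span_id]
    return sorted(kids, key=lambda s: s.start_ms)


# The example trace, with made-up span IDs and start offsets
root = Span("abc123", "s1", "HTTP GET /api/users", 0, 200)
spans = [
    root,
    Span("abc123", "s2", "Database Query", 10, 60, parent_id="s1"),
    Span("abc123", "s3", "Cache Lookup", 60, 70, parent_id="s1"),
    Span("abc123", "s4", "External API Call", 80, 180, parent_id="s1"),
]
child_names = [s.name for s in children_of(spans, root)]
```

Every span carries the same trace ID, and the parent ID is what lets a backend rebuild the tree.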
Span Attributes
Standard attributes to include:
    {
      "http.method": "GET",
      "http.url": "/api/users",
      "http.status_code": 200,
      "db.system": "postgresql",
      "db.statement": "SELECT * FROM users",
      "user.id": "12345",
      "environment": "production"
    }
Trace Sampling Strategies
Why Sampling?
Sampling reduces storage costs and performance overhead while maintaining visibility:
- Cost efficiency: Store only relevant traces
- Performance: Minimize instrumentation overhead
- Insight: Retain sufficient data for analysis
Sampling Strategies
1. Head-based Sampling
Decision made at trace start:
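    const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

    // Keep roughly 10% of traces; the decision is derived from the trace ID
    const sampler = new TraceIdRatioBasedSampler(0.1);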
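A ratio-based head sampler typically derives its decision from the trace ID itself, so every service that sees the same trace makes the same choice. The Python sketch below illustrates that idea under one assumption: it compares the low 64 bits of the trace ID against a threshold. Real SDKs may use different bits or a different comparison, so treat this as an illustration of the technique, not a reimplementation of any SDK.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < int(ratio * (1 << 64))


# The same trace ID always yields the same decision for a given ratio
example_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
kept_at_half = should_sample(example_trace_id, 0.5)
kept_at_seventy = should_sample(example_trace_id, 0.7)
```

Because the decision is a pure function of the ID, no coordination between services is needed to keep or drop a whole trace consistently.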
2. Tail-based Sampling
Decision made after trace completion:
    # Keep all errors, all traces slower than 1s, and 1% of everything else.
    # A trace is kept if any policy matches.
    processors:
      tail_sampling:
        policies:
          - name: error-traces
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-traces
            type: latency
            latency: {threshold_ms: 1000}
          - name: random-sample
            type: probabilistic
            probabilistic: {sampling_percentage: 1}
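The logic such a policy set expresses can be sketched in a few lines of Python. This is a simplified model of the decision, assuming each span is a dict with hypothetical `status`, `start_ms`, and `end_ms` keys; it is not the collector's implementation.

```python
import random


def keep_trace(spans, latency_threshold_ms=1000, sample_pct=1.0, rng=random.random):
    """Decide whether to keep a completed trace.

    The trace is kept if any span failed, if the whole trace took longer
    than the threshold, or if it wins the random draw.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return True
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration > latency_threshold_ms:
        return True
    return rng() * 100 < sample_pct


# Hypothetical completed traces, one span each for brevity
healthy = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 120}]
slow = [{"status": "OK", "start_ms": 0, "end_ms": 1500}]
```

The decision needs the full trace, which is why tail-based sampling runs in a collector that buffers spans rather than in the application.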
3. Adaptive Sampling
Dynamically adjust sampling based on traffic:
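    // Illustrative pseudocode: AdaptiveSampler is a hypothetical class,
    // not part of the OpenTelemetry SDK.
    // Sample 100% during low traffic
    // Sample 10% during high traffic
    const adaptiveSampler = new AdaptiveSampler({
      minRate: 0.1,
      maxRate: 1.0,
      targetTPS: 1000
    });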
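One way to compute such a rate is to aim for a fixed number of sampled traces per second and clamp the result. The Python sketch below assumes that `targetTPS` means "sampled traces per second"; the function name and that interpretation are assumptions, since the configuration above does not define them.

```python
def adaptive_rate(observed_tps, target_tps=1000.0, min_rate=0.1, max_rate=1.0):
    """Return a sampling rate that keeps about target_tps traces per second.

    The rate is clamped to [min_rate, max_rate], so quiet periods are fully
    sampled and busy periods never fall below the minimum rate.
    """
    if observed_tps <= 0:
        return max_rate
    return max(min_rate, min(max_rate, target_tps / observed_tps))


low_traffic_rate = adaptive_rate(500)       # below the target: sample everything
high_traffic_rate = adaptive_rate(5000)     # above the target: scale down
peak_traffic_rate = adaptive_rate(100000)   # far above: clamp to the minimum
```

A production sampler would recompute this over a sliding window of recent traffic rather than from a single reading.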
Best Practices
- Always sample errors: Never drop error traces
- Sample slow requests: Capture performance issues
- Use stratified sampling: Different rates for different services
- Monitor sampling rates: Ensure sufficient coverage
Service Dependency Mapping
Automatic Discovery
Tracing backends infer service dependencies from the parent-child links that trace context propagation creates between spans in different services:
    User Request -> API Gateway -> Auth Service -> Database
                         |
                         v
                    User Service -> Cache
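A dependency map like this can be derived from span data alone: whenever a span's parent belongs to a different service, there is a call edge between the two. The Python sketch below shows the idea, assuming each span is a dict with hypothetical `span_id`, `parent_id`, and `service` keys.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges between different services."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        # Spans whose parent is in the same service are internal work, not a dependency
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges


# Hypothetical spans from a single request
request_spans = [
    {"span_id": "1", "parent_id": None, "service": "api-gateway"},
    {"span_id": "2", "parent_id": "1", "service": "auth-service"},
    {"span_id": "3", "parent_id": "2", "service": "database"},
    {"span_id": "4", "parent_id": "1", "service": "user-service"},
    {"span_id": "5", "parent_id": "4", "service": "cache"},
    {"span_id": "6", "parent_id": "4", "service": "user-service"},
]
edges = service_edges(request_spans)
```

Aggregating these counts over many traces gives the edge weights a service map displays.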
Benefits
- Architecture visualization: See how services interact
- Dependency analysis: Identify critical dependencies
- Performance troubleshooting: Find bottleneck services
- Impact assessment: Understand service failure impact
Performance Profiling
Using Traces for Profiling
Identify performance bottlenecks:
    const { trace } = require('@opentelemetry/api');

    // Instrument expensive operations.
    // `db` and `paymentService` stand in for your own clients.
    const tracer = trace.getTracer('my-service');

    async function processOrder(orderId) {
      return tracer.startActiveSpan('process-order', async (span) => {
        try {
          span.setAttribute('order.id', orderId);

          // Database query: return the row so the next step can use it
          const order = await tracer.startActiveSpan('db-query', async (dbSpan) => {
            try {
              return await db.query('SELECT * FROM orders WHERE id = ?', orderId);
            } finally {
              dbSpan.end();
            }
          });

          // External API call
          await tracer.startActiveSpan('payment-api', async (apiSpan) => {
            try {
              await paymentService.charge(order);
            } finally {
              apiSpan.end();
            }
          });
        } finally {
          span.end();
        }
      });
    }
Analyzing Performance
Use traces to identify:
- Slow database queries: Look for high-duration DB spans
- N+1 queries: Multiple sequential DB calls
- External API latency: Third-party service delays
- Inefficient algorithms: Unexpectedly long processing times
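As an example of this kind of analysis, an N+1 pattern shows up as the same database statement repeated many times under one parent span. The Python sketch below flags such repeats. The span layout (dicts with `parent_id` and an `attributes` map) and the threshold of five repeats are assumptions for illustration.

```python
from collections import Counter


def find_n_plus_one(spans, min_repeats=5):
    """Return (parent_id, statement, count) for statements repeated under one parent."""
    counts = Counter(
        (s["parent_id"], s["attributes"]["db.statement"])
        for s in spans
        if "db.statement" in s.get("attributes", {})
    )
    return [(parent, stmt, n) for (parent, stmt), n in counts.items() if n >= min_repeats]


# Hypothetical trace: one list query followed by one query per row
trace_spans = [
    {"parent_id": "p1", "attributes": {"db.statement": "SELECT * FROM orders"}},
] + [
    {"parent_id": "p1", "attributes": {"db.statement": "SELECT * FROM items WHERE order_id = ?"}}
    for _ in range(6)
]
suspects = find_n_plus_one(trace_spans)
```

Grouping by statement only works when queries are parameterized, which is another reason to record the statement template and not the literal values.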
Production Best Practices
1. Structured Instrumentation
Follow consistent patterns:
    // Good: Structured, searchable attributes
    span.setAttribute('user.id', userId);
    span.setAttribute('order.total', orderTotal);
    span.setAttribute('payment.method', 'credit_card');

    // Bad: Unstructured string concatenation
    span.setAttribute('message', `User ${userId} paid ${orderTotal} with credit card`);
2. Error Handling
Always record errors:
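    const { SpanStatusCode } = require('@opentelemetry/api');

    try {
      await riskyOperation();
    } catch (error) {
      // Attach the exception to the span and mark the span as failed
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }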
3. Context Propagation
Ensure trace context flows through your system:
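    const { context, propagation } = require('@opentelemetry/api');

    // Extract the caller's trace context from the incoming request headers
    const incomingContext = propagation.extract(context.active(), req.headers);

    // Run the outgoing request inside that context
    await context.with(incomingContext, async () => {
      // inject() writes the trace headers into the carrier object and returns nothing
      const headers = {};
      propagation.inject(context.active(), headers);
      await fetch('http://api.example.com', { headers });
    });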
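With the default W3C Trace Context propagator, the context travels in a `traceparent` header of the form `version-traceid-parentid-flags`. The Python sketch below parses and formats that header to show what is actually on the wire. It is a simplified reading of the format that accepts only the four-field layout, and the function names are illustrative.

```python
import re

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def format_traceparent(trace_id, span_id, sampled):
    """Build the header an outgoing request would carry."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = format_traceparent(incoming["trace_id"], "b7ad6b7169203331", incoming["sampled"])
```

The outgoing header keeps the trace ID and replaces the parent ID with the current span's ID, which is how the downstream span gets linked to its caller.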
4. Resource Attributes
Include service-level metadata:
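    // Resource class API as provided by @opentelemetry/resources 1.x
    const { Resource } = require('@opentelemetry/resources');

    const resource = Resource.default().merge(
      new Resource({
        'service.name': 'user-api',
        'service.version': '1.2.3',
        'deployment.environment': 'production',
        'host.name': process.env.HOSTNAME
      })
    );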
5. Span Lifecycle Management
Properly manage span lifecycles:
    // Always end spans, even on error
    const span = tracer.startSpan('operation');
    try {
      await doWork();
    } finally {
      span.end(); // Guaranteed to execute
    }
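In Python the same guarantee is usually expressed with a context manager. The sketch below is a stdlib-only stand-in that shows the pattern: `timed_span` and the record layout are hypothetical and only mimic what a tracing SDK does when a span is used in a `with` block.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_span(name, sink):
    """Record a span-like dict in `sink`, even if the body raises."""
    record = {"name": name, "status": "ok", "start": time.monotonic()}
    try:
        yield record
    except Exception as exc:
        # Mark the failure, then let the exception propagate to the caller
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, like span.end() in a finally block
        record["duration_s"] = time.monotonic() - record["start"]
        sink.append(record)


finished = []
with timed_span("load-config", finished):
    pass

try:
    with timed_span("risky-operation", finished):
        raise ValueError("boom")
except ValueError:
    pass
```

Both spans end up in `finished`, and the second one carries the error status, so a failed operation is still visible in the trace.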
Troubleshooting with Traces
Common Scenarios
1. Slow Request Investigation
Query: duration > 1000ms
Filter: http.route = "/api/orders"
Result: Identify which span is causing slowness
2. Error Root Cause Analysis
Query: status.code = ERROR
Filter: service.name = "payment-service"
Result: Find common attributes across failing traces
3. Dependency Failure Impact
Query: service.name = "database"
Filter: status.code = ERROR
Result: See which upstream services are affected
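The first scenario can be sketched as a filter over trace data held in memory. This Python example assumes traces are available as a dict of span lists with hypothetical `name`, `parent_id`, `duration_ms`, and `attributes` fields; it illustrates the query logic and does not use a GitLab API.

```python
def slow_traces(traces, route, min_duration_ms=1000):
    """Return (trace_id, slowest_child_name, duration_ms) for slow requests on a route."""
    hits = []
    for trace_id, spans in traces.items():
        root = next((s for s in spans if s.get("parent_id") is None), None)
        if root is None:
            continue
        if root["attributes"].get("http.route") != route:
            continue
        if root["duration_ms"] <= min_duration_ms:
            continue
        # The longest non-root span is the first place to look
        children = [s for s in spans if s is not root]
        slowest = max(children, key=lambda s: s["duration_ms"], default=root)
        hits.append((trace_id, slowest["name"], slowest["duration_ms"]))
    return hits


# Hypothetical trace data
sample_traces = {
    "t1": [
        {"name": "GET /api/orders", "parent_id": None, "duration_ms": 1800,
         "attributes": {"http.route": "/api/orders"}},
        {"name": "db-query", "parent_id": "root", "duration_ms": 1500, "attributes": {}},
        {"name": "cache-lookup", "parent_id": "root", "duration_ms": 20, "attributes": {}},
    ],
    "t2": [
        {"name": "GET /api/orders", "parent_id": None, "duration_ms": 90,
         "attributes": {"http.route": "/api/orders"}},
    ],
}
slow = slow_traces(sample_traces, "/api/orders")
```

The other two scenarios follow the same shape with a different predicate: match on status and service name instead of duration and route.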
Integration with Other Observability Tools
Correlation with Logs
Link traces to logs using trace IDs:
    logger.info('Processing order', {
      traceId: span.spanContext().traceId,
      spanId: span.spanContext().spanId,
      orderId: orderId
    });
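Rather than passing the IDs on every call, they can be attached once at the logger level. The Python sketch below does this with a standard `logging.Filter`. The `get_ids` callable is a hypothetical hook: in a real service it would read the IDs from the active span, while here it returns fixed values so the example is self-contained.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every log record."""

    def __init__(self, get_ids):
        super().__init__()
        self._get_ids = get_ids  # callable returning (trace_id, span_id)

    def filter(self, record):
        record.trace_id, record.span_id = self._get_ids()
        return True  # never drop the record, only enrich it


def build_logger(name, get_ids, stream):
    """Create a logger whose output lines include the trace context."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter(get_ids))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(
    "orders",
    lambda: ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"),
    buffer,
)
log.info("Processing order %s", 42)
```

With the IDs in every line, a log search for a trace ID returns all log entries produced while handling that request.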
Correlation with Metrics
Add trace exemplars to metrics:
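    const { context } = require('@opentelemetry/api');

    // The third argument to add() is the active Context, not an exemplar object.
    // An SDK with exemplar support can read the trace and span IDs from it.
    counter.add(1, { 'http.method': 'GET', 'http.route': '/api/users' }, context.active());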
Cost Optimization
Strategies to Reduce Costs
- Intelligent sampling: Sample based on value, not randomly
- Attribute filtering: Remove high-cardinality attributes
- Retention policies: Keep recent traces longer, archive old traces
- Aggregation: Pre-aggregate common queries
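Attribute filtering, for instance, can be as simple as dropping known high-cardinality keys and truncating long values before export. The Python sketch below shows that idea. The default key set and the length limit are assumptions; which attributes are too costly depends on your own data.

```python
def filter_attributes(attributes, drop_keys=frozenset({"user.id", "http.url"}), max_value_len=256):
    """Return a copy of `attributes` without dropped keys and with long strings truncated."""
    cleaned = {}
    for key, value in attributes.items():
        if key in drop_keys:
            continue
        if isinstance(value, str) and len(value) > max_value_len:
            value = value[:max_value_len]
        cleaned[key] = value
    return cleaned


# Hypothetical span attributes before export
raw_attributes = {
    "http.method": "GET",
    "http.url": "/api/users?session=8f3a",
    "http.status_code": 200,
    "user.id": "12345",
    "db.statement": "SELECT * FROM users WHERE " + "x" * 500,
}
slim_attributes = filter_attributes(raw_attributes)
```

Dropping an attribute also removes the ability to search by it, so it is worth checking which attributes are actually used in queries before filtering.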
Example: Cost-Effective Sampling
    # Keep all errors, all requests slower than 1s, 10% of requests slower
    # than 500ms, and 1% of everything else. A trace is kept if any policy matches.
    processors:
      tail_sampling:
        policies:
          # Always sample errors
          - name: errors
            type: status_code
            status_code: {status_codes: [ERROR]}
          # Sample all slow requests (>1s)
          - name: slow-requests
            type: latency
            latency: {threshold_ms: 1000}
          # Sample 10% of medium requests (>500ms); combining a latency
          # condition with a percentage requires an "and" policy
          - name: medium-requests
            type: and
            and:
              and_sub_policy:
                - name: medium-latency
                  type: latency
                  latency: {threshold_ms: 500}
                - name: medium-sample
                  type: probabilistic
                  probabilistic: {sampling_percentage: 10}
          # Sample 1% of all remaining requests
          - name: fast-requests
            type: probabilistic
            probabilistic: {sampling_percentage: 1}
Future Roadmap
GitLab's tracing capabilities are evolving:
- OpenTelemetry integration (beta)
- Production-ready tracing (planned)
- Advanced trace analytics (planned)
- Trace-based alerting (planned)
- Service level objectives (SLOs) based on traces (planned)
Related Documentation
- Error Tracking - Aggregate and track application errors
- Logs - Log aggregation and correlation
- Metrics - Time-series metrics collection
- APM - Application performance monitoring
- Dashboards - Creating visualization dashboards