Log Management in GitLab

Overview

GitLab provides comprehensive log management capabilities, including log aggregation, search, filtering, and correlation with traces and metrics. Effective log management is critical for troubleshooting, monitoring, and understanding application behavior in production.

What is Log Management?

Log management enables you to:

  • Aggregate logs: Collect logs from multiple sources
  • Search and filter: Find relevant log entries quickly
  • Correlate: Link logs with traces and metrics
  • Analyze trends: Identify patterns and anomalies
  • Troubleshoot: Debug production issues efficiently

GitLab Log Architecture

Components

Application Logs → Log Collector    → Storage      → Query Engine → UI
                   (Vector/Fluentd)   (ClickHouse)   (SQL)          (GitLab)

ClickHouse Integration

GitLab leverages ClickHouse for log storage and analytics:

  • Column-oriented: Optimized for analytical queries
  • High compression: Efficient storage of large log volumes
  • Fast queries: Often sub-second for filtered, time-bounded queries
  • JSON support: Native handling of structured logs
  • Time-series optimized: Well suited to append-heavy, time-ordered log data

Structured Logging

Why Structured Logs?

Structured logs are machine-readable and queryable:

    // Bad: Unstructured string
    console.log(`User ${userId} logged in from ${ipAddress}`);

    // Good: Structured JSON
    logger.info('user_login', {
      userId: userId,
      ipAddress: ipAddress,
      timestamp: new Date().toISOString(),
      userAgent: req.headers['user-agent']
    });

Standard Log Format

Use JSON format with consistent fields:

    {
      "timestamp": "2026-01-08T12:34:56.789Z",
      "level": "info",
      "service": "user-api",
      "message": "User logged in",
      "userId": "12345",
      "ipAddress": "192.168.1.1",
      "requestId": "abc-123-def-456",
      "traceId": "xyz-789-uvw-012",
      "environment": "production",
      "version": "0.4.9"
    }
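As a minimal sketch of how an application could emit this format with only the Python standard library, the formatter below renders each record as one JSON object. The class name `JsonFormatter` and the `fields` attribute are assumptions made for this illustration, not part of any GitLab API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of base fields."""

    def __init__(self, service, environment, version):
        super().__init__()
        self.base = {"service": service, "environment": environment, "version": version}

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.base,
            # Hypothetical convention: per-event fields live in record.fields.
            **getattr(record, "fields", {}),
        }
        return json.dumps(entry)


formatter = JsonFormatter(service="user-api", environment="production", version="0.4.9")
record = logging.LogRecord(
    name="user-api", level=logging.INFO, pathname="example.py", lineno=0,
    msg="User logged in", args=(), exc_info=None,
)
record.fields = {"userId": "12345", "requestId": "abc-123-def-456"}
line = formatter.format(record)
print(line)
```

Keeping the base fields in the formatter means every entry carries `service`, `environment`, and `version` without each call site repeating them.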

Log Levels

Use appropriate log levels:

Level   Purpose                      Example
TRACE   Very detailed debugging      Function entry/exit
DEBUG   Debugging information        Variable values, flow control
INFO    Normal operations            User actions, system events
WARN    Potential issues             Deprecated API usage, slow queries
ERROR   Errors requiring attention   Exceptions, failed operations
FATAL   Critical system failures     Database unavailable, OOM
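The ordering of these levels is what makes a threshold such as `LOG_LEVEL=info` work. A small sketch in Python, assuming a custom `TRACE` level (the standard library does not define one) and mapping `FATAL` to `logging.CRITICAL`:

```python
import logging

TRACE = 5  # assumption: custom level below DEBUG (10); not defined by the stdlib
logging.addLevelName(TRACE, "TRACE")

LEVELS = {
    "TRACE": TRACE,
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
}


def is_emitted(message_level, threshold):
    """Return True when a message at message_level passes the threshold."""
    return LEVELS[message_level] >= LEVELS[threshold]


print([name for name in LEVELS if is_emitted(name, "WARN")])
```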

Implementing Structured Logging

Node.js (Winston)

    const winston = require('winston');

    const logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: 'user-api',
        environment: process.env.NODE_ENV,
        version: process.env.APP_VERSION
      },
      transports: [
        new winston.transports.Console(),
        new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
        new winston.transports.File({ filename: 'logs/combined.log' })
      ]
    });

    // Usage
    logger.info('User logged in', {
      userId: user.id,
      email: user.email,
      loginMethod: 'oauth'
    });

    logger.error('Database query failed', {
      query: 'SELECT * FROM users',
      error: error.message,
      stack: error.stack
    });

Python (structlog)

    import structlog

    # Configure structlog
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,  # adds the "level" field
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer()
        ],
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
    )

    logger = structlog.get_logger()

    # Usage
    logger.info(
        "user_logged_in",
        user_id=user.id,
        email=user.email,
        login_method="oauth"
    )

    logger.error(
        "database_query_failed",
        query="SELECT * FROM users",
        error=str(error)
    )

Go (zap)

    import (
        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    // Create logger
    config := zap.NewProductionConfig()
    config.EncoderConfig.TimeKey = "timestamp"
    config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    logger, _ := config.Build()
    defer logger.Sync()

    // Usage
    logger.Info("user_logged_in",
        zap.String("user_id", user.ID),
        zap.String("email", user.Email),
        zap.String("login_method", "oauth"),
    )

    logger.Error("database_query_failed",
        zap.String("query", "SELECT * FROM users"),
        zap.Error(err),
    )

Java (Logback + SLF4J)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import net.logstash.logback.marker.Markers;

    Logger logger = LoggerFactory.getLogger(UserService.class);

    // Usage
    logger.info(Markers.append("user_id", user.getId())
        .and(Markers.append("email", user.getEmail()))
        .and(Markers.append("login_method", "oauth")),
        "User logged in");

    logger.error(Markers.append("query", "SELECT * FROM users")
        .and(Markers.append("error", e.getMessage())),
        "Database query failed", e);

Log Collection

Vector Configuration

Vector is a high-performance log collector:

    # vector.toml
    [sources.application_logs]
    type = "file"
    include = ["/var/log/app/*.log"]
    read_from = "beginning"

    [transforms.parse_json]
    type = "remap"
    inputs = ["application_logs"]
    source = '''
    . = parse_json!(.message)
    .timestamp = parse_timestamp!(.timestamp, "%+")
    '''

    [transforms.enrich]
    type = "remap"
    inputs = ["parse_json"]
    source = '''
    .environment = get_env_var!("ENVIRONMENT")
    .host = get_hostname!()
    .pod_name = get_env_var!("POD_NAME")
    '''

    [sinks.clickhouse]
    type = "clickhouse"
    inputs = ["enrich"]
    endpoint = "http://clickhouse:8123"
    table = "logs"
    database = "observability"
    compression = "gzip"

Fluentd Configuration

Alternative log collector:

    # fluent.conf
    <source>
      @type tail
      path /var/log/app/*.log
      pos_file /var/log/fluentd/app.pos
      tag app.logs
      <parse>
        @type json
        time_key timestamp
        time_format %Y-%m-%dT%H:%M:%S.%L%z
      </parse>
    </source>

    <filter app.logs>
      @type record_transformer
      <record>
        environment "#{ENV['ENVIRONMENT']}"
        host "#{Socket.gethostname}"
        pod_name "#{ENV['POD_NAME']}"
      </record>
    </filter>

    <match app.logs>
      @type clickhouse
      host clickhouse
      port 8123
      database observability
      table logs
      <buffer>
        @type memory
        flush_interval 5s
      </buffer>
    </match>

Log Storage with ClickHouse

Table Schema

Optimized schema for log storage:

    CREATE TABLE logs (
      timestamp DateTime64(3) CODEC(DoubleDelta, LZ4),
      level LowCardinality(String),
      service LowCardinality(String),
      message String,
      trace_id String,
      span_id String,
      user_id String,
      request_id String,
      environment LowCardinality(String),
      host LowCardinality(String),
      attributes Map(String, String),
      INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 1,
      INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 1,
      INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, level, timestamp)
    TTL timestamp + INTERVAL 30 DAY;

Design Decisions:

  • Compression codecs: Can reduce storage by 80-90% on repetitive log data; actual ratios depend on the data
  • LowCardinality: Optimize repeated values (level, service)
  • Bloom filters: Fast lookups by trace_id, user_id
  • Token index: Token Bloom filter on message that lets text searches skip blocks that cannot match
  • Partitioning: By date for efficient pruning
  • TTL: Automatic cleanup of old logs
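To illustrate why a Bloom filter index speeds up lookups by trace_id, here is a toy Python sketch of the idea: one small filter per block of rows, so a lookup reads only the blocks whose filter may contain the value. This is not ClickHouse's implementation; the `BloomFilter` class, its sizes, and the sample IDs are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per block of rows (roughly a "granule" in ClickHouse terms).
granules = [["trace-a", "trace-b"], ["trace-c"], ["trace-d", "trace-e"]]
filters = []
for rows in granules:
    bloom = BloomFilter()
    for trace_id in rows:
        bloom.add(trace_id)
    filters.append(bloom)

# Only granules whose filter may contain the value need to be read.
candidates = [i for i, bloom in enumerate(filters) if bloom.might_contain("trace-c")]
print(candidates)
```

A filter can report a value that is not there, which costs an unnecessary read, but it never misses a value that is present, so skipping a block is always safe.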

Materialized Views

Pre-aggregate common queries:

    -- Error count by service
    CREATE MATERIALIZED VIEW error_count_by_service
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, timestamp)
    AS SELECT
      service,
      toStartOfHour(timestamp) as timestamp,
      count() as error_count
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY service, timestamp;

    -- Log volume by level
    CREATE MATERIALIZED VIEW log_volume_by_level
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (level, timestamp)
    AS SELECT
      level,
      toStartOfMinute(timestamp) as timestamp,
      count() as log_count
    FROM logs
    GROUP BY level, timestamp;
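The point of these views is that aggregation happens at insert time, so a dashboard query reads a few summary rows instead of scanning raw logs. A rough Python analogy, with a `Counter` standing in for the SummingMergeTree target (the function name and sample rows are hypothetical):

```python
from collections import Counter
from datetime import datetime

# Stands in for the aggregated table: (service, hour) -> error count.
error_count_by_service = Counter()


def insert_log(row):
    """Update the aggregate at insert time, the way a materialized view would."""
    if row["level"] == "ERROR":
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)
        error_count_by_service[(row["service"], hour)] += 1


rows = [
    {"service": "user-api", "level": "ERROR", "timestamp": datetime(2026, 1, 8, 12, 5)},
    {"service": "user-api", "level": "ERROR", "timestamp": datetime(2026, 1, 8, 12, 40)},
    {"service": "user-api", "level": "INFO", "timestamp": datetime(2026, 1, 8, 12, 41)},
    {"service": "payment-api", "level": "ERROR", "timestamp": datetime(2026, 1, 8, 13, 1)},
]
for row in rows:
    insert_log(row)

print(dict(error_count_by_service))
```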

Log Search and Filtering

Accessing Logs in GitLab

Navigate to: Monitor > Logs

Query Examples

1. Search by Service

    SELECT *
    FROM logs
    WHERE service = 'user-api'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC
    LIMIT 100;

2. Filter by Log Level

    SELECT *
    FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp DESC;

3. Search Message Content

    SELECT *
    FROM logs
    WHERE message LIKE '%database connection%'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC;

4. Find Logs by User

    SELECT *
    FROM logs
    WHERE user_id = '12345'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp DESC;

5. Trace ID Correlation

    SELECT *
    FROM logs
    WHERE trace_id = 'abc-123-def-456'
    ORDER BY timestamp ASC;

Advanced Filtering

    SELECT *
    FROM logs
    WHERE service = 'payment-api'
      AND level IN ('ERROR', 'WARN')
      AND message LIKE '%timeout%'
      AND timestamp >= now() - INTERVAL 6 HOUR
    ORDER BY timestamp DESC;

Aggregation Queries

    -- Error rate by service
    SELECT
      service,
      countIf(level = 'ERROR') as error_count,
      count() as total_count,
      round(error_count / total_count * 100, 2) as error_rate_pct
    FROM logs
    WHERE timestamp >= now() - INTERVAL 1 HOUR
    GROUP BY service
    ORDER BY error_rate_pct DESC;

Time-based Analysis

    -- Log volume by hour
    SELECT
      toStartOfHour(timestamp) as hour,
      count() as log_count
    FROM logs
    WHERE timestamp >= now() - INTERVAL 1 DAY
    GROUP BY hour
    ORDER BY hour;

Log Correlation

Linking Logs with Traces

Include trace IDs in logs:

    const { trace } = require('@opentelemetry/api');

    function logWithTrace(message, data) {
      const span = trace.getActiveSpan();
      const spanContext = span?.spanContext();

      logger.info(message, {
        ...data,
        traceId: spanContext?.traceId,
        spanId: spanContext?.spanId,
      });
    }

    // Usage
    logWithTrace('Processing payment', {
      orderId: order.id,
      amount: order.total,
    });
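The same pattern can be sketched in Python with only the standard library: a logging filter copies the active trace and span IDs onto every record, so call sites do not pass them explicitly. The context variables here are set by hand as an assumption; in a real service a tracing library would populate them. All names are hypothetical.

```python
import contextvars
import io
import json
import logging

# Assumption: a tracing library would normally set these per request.
current_trace_id = contextvars.ContextVar("trace_id", default=None)
current_span_id = contextvars.ContextVar("span_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class TraceJsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(TraceJsonFormatter())

logger = logging.getLogger("payment-api.trace-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("abc-123-def-456")
current_span_id.set("span-1")
logger.info("Processing payment")

entry = json.loads(stream.getvalue())
print(entry)
```

Emitting the ID under the same name as the storage column (`trace_id` here) avoids a renaming step in the collector.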

Querying Correlated Data

    -- Get all logs for a trace
    SELECT
      timestamp,
      level,
      service,
      message
    FROM logs
    WHERE trace_id = 'abc-123-def-456'
    ORDER BY timestamp ASC;

Benefits of Correlation

  1. End-to-end visibility: See complete request flow
  2. Root cause analysis: Identify where errors originated
  3. Performance debugging: Find slow operations
  4. Context preservation: Maintain request context across services

Log Retention Policies

Storage Tiers

Implement multi-tier storage:

    -- Hot tier: Recent logs (7 days)
    -- The 'warm' disk must be defined in the table's storage policy
    CREATE TABLE logs_hot AS logs
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    TTL timestamp + INTERVAL 7 DAY TO DISK 'warm';

    -- Warm tier: Historical logs (30 days)
    -- Stored on slower, cheaper storage
    -- Automatically moved by TTL

    -- Cold tier: Archive (1 year)
    CREATE TABLE logs_archive AS logs
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    TTL timestamp + INTERVAL 1 YEAR DELETE;

Retention Best Practices

Log Type        Retention Period   Reasoning
Debug logs      1-3 days           High volume, low value
Info logs       7-14 days          Normal operations
Warn logs       30 days            Potential issues
Error logs      90 days            Investigation needed
Security logs   1 year             Compliance requirements
Audit logs      7 years            Legal requirements
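A retention table like this can be encoded as a lookup that a cleanup job consults. The sketch below takes the upper bound of each range and approximates a year as 365 days; both choices, and the names, are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Upper bounds from the retention table, in days (1 year taken as 365 days).
RETENTION_DAYS = {
    "debug": 3,
    "info": 14,
    "warn": 30,
    "error": 90,
    "security": 365,
    "audit": 7 * 365,
}


def is_expired(log_type, logged_at, now):
    """Return True when an entry is older than its retention period."""
    return now - logged_at > timedelta(days=RETENTION_DAYS[log_type])


now = datetime(2026, 1, 8, tzinfo=timezone.utc)
print(is_expired("debug", now - timedelta(days=4), now))
print(is_expired("error", now - timedelta(days=30), now))
```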

Cost Optimization

    -- Sample debug logs (keep 10%)
    CREATE TABLE logs_sampled
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    AS SELECT * FROM logs
    WHERE level = 'DEBUG'
      AND rand() % 10 = 0;  -- 10% sampling

    -- Keep all error/warn logs
    CREATE TABLE logs_important
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    AS SELECT * FROM logs
    WHERE level IN ('ERROR', 'WARN', 'FATAL');

Production Best Practices

1. Include Request Context

Add context to every log:

    const uuid = require('uuid');

    // Middleware to add request context
    app.use((req, res, next) => {
      req.requestId = uuid.v4();
      req.logger = logger.child({
        requestId: req.requestId,
        method: req.method,
        path: req.path,
        userAgent: req.headers['user-agent'],
        clientIp: req.ip,
      });
      next();
    });

    // Use request-scoped logger
    app.get('/api/users', (req, res) => {
      req.logger.info('Fetching users');
      // ... handle request
    });
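In Python, a comparable request-scoped logger can be sketched with `logging.LoggerAdapter`, which attaches a fixed context dictionary to every record it emits. The helper name `logger_for_request` and the output format are assumptions for this example.

```python
import io
import logging
import uuid

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(message)s request_id=%(request_id)s path=%(path)s"
))

base_logger = logging.getLogger("user-api.requests")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False


def logger_for_request(method, path):
    """Return a logger that stamps every entry with this request's context."""
    context = {"request_id": str(uuid.uuid4()), "method": method, "path": path}
    return logging.LoggerAdapter(base_logger, context)


request_logger = logger_for_request("GET", "/api/users")
request_logger.info("Fetching users")

output = stream.getvalue()
print(output, end="")
```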

2. Avoid Logging Sensitive Data

Scrub sensitive information:

    const SENSITIVE_FIELDS = ['password', 'token', 'apiKey', 'ssn', 'creditCard'];

    // Redacts top-level sensitive fields; nested objects are not inspected
    function sanitizeLog(data) {
      const sanitized = { ...data };
      SENSITIVE_FIELDS.forEach(field => {
        if (field in sanitized) {
          sanitized[field] = '***REDACTED***';
        }
      });
      return sanitized;
    }

    // Usage
    logger.info('User created', sanitizeLog({
      userId: user.id,
      email: user.email,
      password: user.password, // Will be redacted
    }));
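Sensitive values often sit inside nested objects or lists, which a top-level check does not reach. A recursive variant, sketched in Python with case-insensitive key matching (an assumption beyond the example above):

```python
SENSITIVE_FIELDS = {"password", "token", "apikey", "ssn", "creditcard"}
REDACTED = "***REDACTED***"


def sanitize(value):
    """Return a copy of value with sensitive keys redacted at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_FIELDS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


payload = {
    "userId": 1,
    "password": "hunter2",
    "profile": {
        "apiKey": "k-123",
        "sessions": [{"token": "t-456", "device": "laptop"}],
    },
}
clean = sanitize(payload)
print(clean)
```

The function builds new containers rather than mutating its input, so the original payload stays usable by the caller.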

3. Use Sampling for High-Volume Logs

Sample verbose logs:

    function shouldLog(level, samplingRate = 1.0) {
      if (level === 'FATAL' || level === 'ERROR' || level === 'WARN') {
        return true; // Always log errors and warnings
      }
      return Math.random() < samplingRate;
    }

    // Sample 10% of debug logs
    if (shouldLog('DEBUG', 0.1)) {
      logger.debug('Detailed debug info', { /* ... */ });
    }
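Sampling can also live in the logging pipeline rather than at each call site. A sketch as a Python `logging.Filter`, with the random source injectable so the behavior can be tested deterministically; the class name and the `make_record` helper are hypothetical.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-above record and a fraction of the rest."""

    def __init__(self, rate, rng=random.random):
        super().__init__()
        self.rate = rate
        self.rng = rng  # injectable for deterministic tests

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self.rng() < self.rate


def make_record(level):
    return logging.LogRecord("demo", level, "example.py", 0, "msg", (), None)


# With a fixed draw of 0.5 and a 10% rate, low-severity records are dropped.
drops_debug = SamplingFilter(rate=0.1, rng=lambda: 0.5)
print(drops_debug.filter(make_record(logging.DEBUG)))
print(drops_debug.filter(make_record(logging.ERROR)))
```

Attaching the filter with `handler.addFilter(SamplingFilter(0.1))` applies the sampling to everything that handler receives.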

4. Implement Log Rotation

Prevent disk space exhaustion:

    // Winston with rotation
    const winston = require('winston');
    require('winston-daily-rotate-file');

    const transport = new winston.transports.DailyRotateFile({
      filename: 'logs/app-%DATE%.log',
      datePattern: 'YYYY-MM-DD',
      maxSize: '20m',
      maxFiles: '14d', // Keep 14 days
      zippedArchive: true, // Gzip rotated files
    });

    const logger = winston.createLogger({
      transports: [transport]
    });
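For comparison, a size-based rotation sketch with Python's standard `RotatingFileHandler`. The tiny `maxBytes` value and the temporary directory are chosen only so the rollover is easy to observe.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Roll over before a file would reach 200 bytes; keep at most 3 old files.
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=200, backupCount=3)
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("rotation-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

for i in range(50):
    logger.info("log line number %d", i)
handler.close()

files = sorted(os.listdir(log_dir))
print(files)
```

For time-based rotation closer to the daily pattern above, the standard library also provides `logging.handlers.TimedRotatingFileHandler`, for example with `when="midnight"` and `backupCount=14`.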

5. Structured Error Logging

Include full error context:

    try {
      await riskyOperation();
    } catch (error) {
      logger.error('Operation failed', {
        error: {
          message: error.message,
          stack: error.stack,
          code: error.code,
        },
        context: {
          userId: user.id,
          operation: 'riskyOperation',
          input: sanitizeLog(input),
        },
      });
      throw error;
    }
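A Python sketch of the same structure, building a JSON-serializable entry from an exception. The function names and the sample error are hypothetical.

```python
import json
import traceback


def error_entry(message, exc, context):
    """Build a structured log entry with the error type, message, and stack."""
    return {
        "level": "error",
        "message": message,
        "error": {
            "type": type(exc).__name__,
            "message": str(exc),
            "stack": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        },
        "context": context,
    }


def risky_operation():
    raise TimeoutError("upstream did not respond")


try:
    risky_operation()
except TimeoutError as exc:
    entry = error_entry("Operation failed", exc, {"operation": "risky_operation"})

print(json.dumps(entry, indent=2))
```

Recording the exception type as its own field makes it possible to group errors without parsing the stack text.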

Alerting Based on Logs

Log-based Alerts

Create alerts from log patterns:

    # Alert on error rate
    alert:
      name: high_error_rate
      query: |
        SELECT
          countIf(level = 'ERROR') / count() as error_rate
        FROM logs
        WHERE timestamp >= now() - INTERVAL 5 MINUTE
      condition: error_rate > 0.05
      notification:
        - slack: "#alerts"
        - pagerduty: oncall

    # Alert on specific error messages
    alert:
      name: database_connection_errors
      query: |
        SELECT count() as error_count
        FROM logs
        WHERE level = 'ERROR'
          AND message LIKE '%database connection%'
          AND timestamp >= now() - INTERVAL 5 MINUTE
      condition: error_count > 10
      notification:
        - slack: "#database-team"
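The first rule reduces to a ratio compared against a threshold. A minimal Python sketch of that evaluation over one window of log levels; the function names are hypothetical, and the guard for an empty window is an addition, since dividing zero errors by zero entries has no meaningful rate.

```python
def error_rate(levels):
    """Share of ERROR entries in a window; 0.0 for an empty window."""
    if not levels:
        return 0.0
    return sum(1 for level in levels if level == "ERROR") / len(levels)


def should_alert(levels, threshold=0.05):
    """Mirror the rule's condition: error_rate > 0.05."""
    return error_rate(levels) > threshold


window = ["ERROR"] * 6 + ["INFO"] * 94
print(error_rate(window), should_alert(window))
```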

Performance Considerations

1. Batch Log Writes

Reduce I/O overhead:

    const logBuffer = [];
    const BATCH_SIZE = 100;
    const FLUSH_INTERVAL = 5000; // 5 seconds

    function bufferedLog(level, message, data) {
      logBuffer.push({ level, message, data, timestamp: Date.now() });

      if (logBuffer.length >= BATCH_SIZE) {
        flushLogs();
      }
    }

    function flushLogs() {
      if (logBuffer.length === 0) return;

      const batch = logBuffer.splice(0, logBuffer.length);
      logger.info('Batch logs', { logs: batch });
    }

    // Flush periodically
    setInterval(flushLogs, FLUSH_INTERVAL);
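Python's standard library ships a comparable buffer, `logging.handlers.MemoryHandler`, which holds records until a capacity is reached or a record at `flushLevel` or above arrives. A sketch follows; unlike the example above it has no periodic flush, so a timer would need to call `flush()` separately.

```python
import io
import logging
import logging.handlers

stream = io.StringIO()
target = logging.StreamHandler(stream)

# Buffer up to 100 records; flush early when an ERROR or above arrives.
buffered = logging.handlers.MemoryHandler(
    capacity=100, flushLevel=logging.ERROR, target=target
)

logger = logging.getLogger("batch-demo")
logger.setLevel(logging.INFO)
logger.addHandler(buffered)
logger.propagate = False

for i in range(99):
    logger.info("event %d", i)
written_before_flush = stream.getvalue()  # nothing has reached the target yet

logger.info("event 99")  # the 100th record fills the buffer and triggers a flush
flushed = stream.getvalue().splitlines()
print(len(flushed))
```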

2. Async Logging

Prevent blocking application:

    const winston = require('winston');
    const { createLogger, transports } = winston;

    const logger = createLogger({
      transports: [
        new transports.Stream({
          // Whether writes to process.stdout block depends on what it is attached to
          stream: process.stdout,
          handleExceptions: false,
          handleRejections: false,
        })
      ]
    });

    // Errors don't block application
    logger.on('error', (error) => {
      console.error('Logging error:', error);
    });
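One way to keep slow log I/O off the request path is to hand records to a queue and let a background thread write them. Python's standard library supports this with `QueueHandler` and `QueueListener`; a minimal sketch, with an in-memory stream standing in for a slow destination:

```python
import io
import logging
import logging.handlers
import queue

stream = io.StringIO()
slow_handler = logging.StreamHandler(stream)  # stands in for a slow sink

log_queue = queue.Queue()
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()  # a background thread now drains the queue

logger = logging.getLogger("async-demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False

logger.info("request handled")  # only enqueues; the write happens elsewhere

listener.stop()  # processes what is left in the queue, then joins the thread
output = stream.getvalue()
print(output, end="")
```

Calling `listener.stop()` at shutdown matters: records still in the queue are otherwise lost when the process exits.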

3. Minimize Log Volume

Log only what's needed:

    // Development: Verbose logging
    if (process.env.NODE_ENV === 'development') {
      logger.level = 'debug';
    }

    // Production: Essential logs only
    if (process.env.NODE_ENV === 'production') {
      logger.level = 'info';
    }

    // Conditional logging
    if (logger.isLevelEnabled('debug')) {
      logger.debug('Complex computation', expensiveToSerialize());
    }
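The guard matters because arguments are evaluated before the logger decides whether to emit. A Python sketch using `isEnabledFor`, with a call counter (hypothetical, for illustration) showing that the expensive function never runs at INFO level:

```python
import logging

logger = logging.getLogger("volume-demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())
logger.propagate = False

calls = {"count": 0}


def expensive_to_serialize():
    """Stand-in for a costly computation; counts how often it runs."""
    calls["count"] += 1
    return {"rows": list(range(1000))}


# At INFO level the guard is False, so the expensive call is skipped entirely.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Complex computation %s", expensive_to_serialize())

print(calls["count"])
```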

Troubleshooting with Logs

Common Scenarios

1. Find Recent Errors

    SELECT *
    FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC
    LIMIT 50;

2. Track User Journey

    SELECT
      timestamp,
      service,
      message
    FROM logs
    WHERE user_id = '12345'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp ASC;

3. Identify Slow Operations

    SELECT *
    FROM logs
    WHERE message LIKE '%took%ms%'
      AND toUInt32OrZero(extract(message, 'took (\\d+)ms')) > 1000
    ORDER BY timestamp DESC;
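The extraction step can be checked offline before running it against the table. A Python sketch of the same pattern match; the sample messages and function name are made up for illustration.

```python
import re

DURATION_RE = re.compile(r"took (\d+)ms")


def slow_operations(messages, threshold_ms=1000):
    """Return (duration_ms, message) pairs above the threshold, slowest first."""
    slow = []
    for message in messages:
        match = DURATION_RE.search(message)
        if match and int(match.group(1)) > threshold_ms:
            slow.append((int(match.group(1)), message))
    return sorted(slow, reverse=True)


messages = [
    "query took 1500ms",
    "query took 20ms",
    "cache warmed",
    "render took 3200ms",
]
print(slow_operations(messages))
```

Parsing durations out of message text is fragile. Logging the duration as its own numeric field, in line with the structured format described earlier, would let the query filter on it directly.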

4. Detect Error Patterns

    SELECT
      message,
      count() as occurrence_count
    FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 DAY
    GROUP BY message
    ORDER BY occurrence_count DESC
    LIMIT 10;
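Grouping by the exact message splits one error into many rows when messages embed IDs or durations. One possible refinement, sketched in Python, replaces digit runs with a placeholder before counting. The normalization rule is an assumption and would need tuning for real messages.

```python
import re
from collections import Counter


def normalize(message):
    """Replace digit runs so variable IDs and durations do not split groups."""
    return re.sub(r"\d+", "<n>", message)


messages = [
    "Timeout after 3000ms calling order 42",
    "Timeout after 3000ms calling order 77",
    "Connection refused",
]
top = Counter(normalize(message) for message in messages).most_common(10)
print(top)
```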
