Log Management in GitLab
Overview
GitLab provides comprehensive log management capabilities, including log aggregation, search, filtering, and correlation with traces and metrics. Effective log management is critical for troubleshooting, monitoring, and understanding application behavior in production.
What is Log Management?
Log management enables you to:
- Aggregate logs: Collect logs from multiple sources
- Search and filter: Find relevant log entries quickly
- Correlate: Link logs with traces and metrics
- Analyze trends: Identify patterns and anomalies
- Troubleshoot: Debug production issues efficiently
GitLab Log Architecture
Components
Application Logs -> Log Collector (Vector/Fluentd) -> Storage (ClickHouse) -> Query Engine (SQL) -> UI (GitLab)
ClickHouse Integration
GitLab leverages ClickHouse for log storage and analytics:
- Column-oriented: Optimized for analytical queries
- High compression: Efficient storage of large log volumes
- Fast queries: Often sub-second for queries that filter on the sort key and a bounded time range
- JSON support: Native handling of structured logs
- Time-series optimized: Well suited to append-only, time-ordered data such as logs
Structured Logging
Why Structured Logs?
Structured logs are machine-readable and queryable:
    // Bad: Unstructured string
    console.log(`User ${userId} logged in from ${ipAddress}`);

    // Good: Structured JSON
    logger.info('user_login', {
      userId: userId,
      ipAddress: ipAddress,
      timestamp: new Date().toISOString(),
      userAgent: req.headers['user-agent']
    });
Standard Log Format
Use JSON format with consistent fields:
    {
      "timestamp": "2026-01-08T12:34:56.789Z",
      "level": "info",
      "service": "user-api",
      "message": "User logged in",
      "userId": "12345",
      "ipAddress": "192.168.1.1",
      "requestId": "abc-123-def-456",
      "traceId": "xyz-789-uvw-012",
      "environment": "production",
      "version": "0.4.9"
    }
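As a rough illustration of producing this format, the sketch below renders records from Python's standard `logging` module as one JSON object per line. The `JsonFormatter` class and the convention of passing per-event fields through `extra={"fields": {...}}` are assumptions made for this example, not part of GitLab or of any logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service, environment, version):
        super().__init__()
        # Fields that are identical for every entry emitted by this process.
        self.static_fields = {
            "service": service,
            "environment": environment,
            "version": version,
        }

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC with millisecond precision, e.g. 2026-01-08T12:34:56.789Z
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.static_fields,
            # Assumed convention: per-event fields arrive via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("user-api", "production", "0.4.9"))
    logger = logging.getLogger("user-api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("User logged in", extra={"fields": {"userId": "12345"}})
```

Keeping the static fields in the formatter means individual call sites only supply what is specific to the event.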
Log Levels
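Use the level that matches the severity of the event, and emit level names in one consistent case. The SQL examples in this document match uppercase values such as 'ERROR', while the JSON example above uses lowercase "info", so normalize levels before storage or adjust the queries accordingly: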
| Level | Purpose | Example |
|---|---|---|
| TRACE | Very detailed debugging | Function entry/exit |
| DEBUG | Debugging information | Variable values, flow control |
| INFO | Normal operations | User actions, system events |
| WARN | Potential issues | Deprecated API usage, slow queries |
| ERROR | Errors requiring attention | Exceptions, failed operations |
| FATAL | Critical system failures | Database unavailable, OOM |
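Different libraries spell levels differently (lowercase in some, `WARNING` and `CRITICAL` in Python's `logging`), so it can help to normalize names before storage. The following is a minimal sketch; the numeric ranks and the alias table are assumptions chosen for illustration.

```python
# Severity order taken from the table above; the numeric ranks are arbitrary.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4, "FATAL": 5}

# Names used by some libraries, mapped onto the table's vocabulary.
ALIASES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


def normalize_level(raw):
    """Return the canonical uppercase level name, or raise ValueError."""
    name = raw.strip().upper()
    name = ALIASES.get(name, name)
    if name not in LEVELS:
        raise ValueError(f"unknown log level: {raw!r}")
    return name


def is_enabled(level, threshold):
    """True when `level` is at least as severe as `threshold`."""
    return LEVELS[normalize_level(level)] >= LEVELS[normalize_level(threshold)]
```

For example, `is_enabled("debug", "INFO")` is false, so a collector configured at INFO would drop that entry.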
Implementing Structured Logging
Node.js (Winston)
    const winston = require('winston');

    const logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: 'user-api',
        environment: process.env.NODE_ENV,
        version: process.env.APP_VERSION
      },
      transports: [
        new winston.transports.Console(),
        new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
        new winston.transports.File({ filename: 'logs/combined.log' })
      ]
    });

    // Usage
    logger.info('User logged in', {
      userId: user.id,
      email: user.email,
      loginMethod: 'oauth'
    });

    logger.error('Database query failed', {
      query: 'SELECT * FROM users',
      error: error.message,
      stack: error.stack
    });
Python (structlog)
    import structlog

    # Configure structlog
    structlog.configure(
        processors=[
            # Without this processor the "level" key is missing from the output
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
    )

    logger = structlog.get_logger()

    # Usage
    logger.info(
        "user_logged_in",
        user_id=user.id,
        email=user.email,
        login_method="oauth",
    )

    logger.error(
        "database_query_failed",
        query="SELECT * FROM users",
        error=str(error),
    )
Go (zap)
    import (
        "log"

        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    // Create logger
    config := zap.NewProductionConfig()
    config.EncoderConfig.TimeKey = "timestamp"
    config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    logger, err := config.Build()
    if err != nil {
        // Do not discard the error: a nil logger would panic on first use
        log.Fatalf("failed to build logger: %v", err)
    }
    defer logger.Sync()

    // Usage
    logger.Info("user_logged_in",
        zap.String("user_id", user.ID),
        zap.String("email", user.Email),
        zap.String("login_method", "oauth"),
    )

    logger.Error("database_query_failed",
        zap.String("query", "SELECT * FROM users"),
        zap.Error(err),
    )
Java (Logback + SLF4J)
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import net.logstash.logback.marker.Markers;

    Logger logger = LoggerFactory.getLogger(UserService.class);

    // Usage
    logger.info(Markers.append("user_id", user.getId())
            .and(Markers.append("email", user.getEmail()))
            .and(Markers.append("login_method", "oauth")),
        "User logged in");

    logger.error(Markers.append("query", "SELECT * FROM users")
            .and(Markers.append("error", e.getMessage())),
        "Database query failed", e);
Log Collection
Vector Configuration
Vector is a high-performance log collector:
    # vector.toml
    [sources.application_logs]
    type = "file"
    include = ["/var/log/app/*.log"]
    read_from = "beginning"

    [transforms.parse_json]
    type = "remap"
    inputs = ["application_logs"]
    source = '''
    . = parse_json!(.message)
    .timestamp = parse_timestamp!(.timestamp, "%+")
    '''

    [transforms.enrich]
    type = "remap"
    inputs = ["parse_json"]
    source = '''
    .environment = get_env_var!("ENVIRONMENT")
    .host = get_hostname!()
    .pod_name = get_env_var!("POD_NAME")
    '''

    [sinks.clickhouse]
    type = "clickhouse"
    inputs = ["enrich"]
    endpoint = "http://clickhouse:8123"
    table = "logs"
    database = "observability"
    compression = "gzip"
Fluentd Configuration
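Fluentd is an alternative log collector. The `clickhouse` output type used below is not part of Fluentd core, so this configuration assumes a ClickHouse output plugin is installed: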
    # fluent.conf
    <source>
      @type tail
      path /var/log/app/*.log
      pos_file /var/log/fluentd/app.pos
      tag app.logs
      <parse>
        @type json
        time_key timestamp
        time_format %Y-%m-%dT%H:%M:%S.%L%z
      </parse>
    </source>

    <filter app.logs>
      @type record_transformer
      <record>
        environment "#{ENV['ENVIRONMENT']}"
        host "#{Socket.gethostname}"
        pod_name "#{ENV['POD_NAME']}"
      </record>
    </filter>

    <match app.logs>
      @type clickhouse
      host clickhouse
      port 8123
      database observability
      table logs
      <buffer>
        @type memory
        flush_interval 5s
      </buffer>
    </match>
Log Storage with ClickHouse
Table Schema
Optimized schema for log storage:
    CREATE TABLE logs (
        timestamp DateTime64(3) CODEC(DoubleDelta, LZ4),
        level LowCardinality(String),
        service LowCardinality(String),
        message String,
        trace_id String,
        span_id String,
        user_id String,
        request_id String,
        environment LowCardinality(String),
        host LowCardinality(String),
        attributes Map(String, String),
        INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 1,
        INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 1,
        INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, level, timestamp)
    -- toDateTime() keeps the TTL expression a DateTime, which older
    -- ClickHouse versions require (they reject DateTime64 in TTL)
    TTL toDateTime(timestamp) + INTERVAL 30 DAY;
Design Decisions:
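- Compression codecs: Can reduce storage by as much as 80-90%, depending on how repetitive the data is
- LowCardinality: Optimize repeated values (level, service)
- Bloom filters: Fast lookups by trace_id, user_id
- Token index: A token bloom filter (tokenbf_v1) that speeds up token-based searches on message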
- Partitioning: By date for efficient pruning
- TTL: Automatic cleanup of old logs
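Note that the JSON examples earlier use camelCase field names (`traceId`, `userId`) while this table uses snake_case columns, so some step has to translate between them. Where that step lives (application, collector, or an ingest job) is left open here; the sketch below only shows the mapping itself, with `COLUMN_FOR_FIELD` and `to_row` as hypothetical names.

```python
# Hypothetical mapping from JSON log fields to columns of the logs table.
COLUMN_FOR_FIELD = {
    "timestamp": "timestamp",
    "level": "level",
    "service": "service",
    "message": "message",
    "traceId": "trace_id",
    "spanId": "span_id",
    "userId": "user_id",
    "requestId": "request_id",
    "environment": "environment",
    "host": "host",
}


def to_row(entry):
    """Convert one parsed JSON log entry into a dict keyed by column name.

    Fields without a dedicated column go into the `attributes` map, which
    the schema declares as Map(String, String), so values are stringified.
    """
    row = {column: "" for column in COLUMN_FOR_FIELD.values()}
    row["attributes"] = {}
    for field, value in entry.items():
        column = COLUMN_FOR_FIELD.get(field)
        if column is not None:
            row[column] = str(value)
        else:
            row["attributes"][field] = str(value)
    return row
```

Routing unknown fields into `attributes` keeps the table schema stable when services add new fields.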
Materialized Views
Pre-aggregate common queries:
    -- Error count by service
    CREATE MATERIALIZED VIEW error_count_by_service
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, timestamp)
    AS SELECT
        service,
        toStartOfHour(timestamp) as timestamp,
        count() as error_count
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY service, timestamp;

    -- Log volume by level
    CREATE MATERIALIZED VIEW log_volume_by_level
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (level, timestamp)
    AS SELECT
        level,
        toStartOfMinute(timestamp) as timestamp,
        count() as log_count
    FROM logs
    GROUP BY level, timestamp;
Log Search and Filtering
Accessing Logs in GitLab
Navigate to: Monitor > Logs
Query Examples
1. Search by Service
    SELECT * FROM logs
    WHERE service = 'user-api'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC
    LIMIT 100;
2. Filter by Log Level
    SELECT * FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp DESC;
3. Search Message Content
    SELECT * FROM logs
    WHERE message LIKE '%database connection%'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC;
4. Find Logs by User
    SELECT * FROM logs
    WHERE user_id = '12345'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp DESC;
5. Trace ID Correlation
    SELECT * FROM logs
    WHERE trace_id = 'abc-123-def-456'
    ORDER BY timestamp ASC;
Advanced Filtering
Multi-field Search
    SELECT * FROM logs
    WHERE service = 'payment-api'
      AND level IN ('ERROR', 'WARN')
      AND message LIKE '%timeout%'
      AND timestamp >= now() - INTERVAL 6 HOUR
    ORDER BY timestamp DESC;
Aggregation Queries
    -- Error rate by service
    SELECT
        service,
        countIf(level = 'ERROR') as error_count,
        count() as total_count,
        round(error_count / total_count * 100, 2) as error_rate_pct
    FROM logs
    WHERE timestamp >= now() - INTERVAL 1 HOUR
    GROUP BY service
    ORDER BY error_rate_pct DESC;
Time-based Analysis
    -- Log volume by hour
    SELECT
        toStartOfHour(timestamp) as hour,
        count() as log_count
    FROM logs
    WHERE timestamp >= now() - INTERVAL 1 DAY
    GROUP BY hour
    ORDER BY hour;
Log Correlation
Linking Logs with Traces
Include trace IDs in logs:
    const { trace } = require('@opentelemetry/api');

    function logWithTrace(message, data) {
      const span = trace.getActiveSpan();
      const spanContext = span?.spanContext();
      logger.info(message, {
        ...data,
        traceId: spanContext?.traceId,
        spanId: spanContext?.spanId,
      });
    }

    // Usage
    logWithTrace('Processing payment', {
      orderId: order.id,
      amount: order.total,
    });
Querying Correlated Data
    -- Get all logs for a trace
    SELECT timestamp, level, service, message
    FROM logs
    WHERE trace_id = 'abc-123-def-456'
    ORDER BY timestamp ASC;
Benefits of Correlation
- End-to-end visibility: See complete request flow
- Root cause analysis: Identify where errors originated
- Performance debugging: Find slow operations
- Context preservation: Maintain request context across services
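To make the root-cause idea concrete, the sketch below groups already-fetched log rows by `trace_id`, orders each group by time, and picks the earliest ERROR entry. It assumes rows are plain dicts with the column names used in this document and that timestamps are ISO 8601 strings in one uniform format, so string order matches time order; `group_by_trace` and `first_error` are hypothetical helper names.

```python
from collections import defaultdict


def group_by_trace(rows):
    """Group log rows by trace_id and sort each group chronologically."""
    traces = defaultdict(list)
    for row in rows:
        if row.get("trace_id"):  # skip rows that carry no trace context
            traces[row["trace_id"]].append(row)
    for entries in traces.values():
        # Assumes uniformly formatted ISO 8601 timestamps (sortable as strings).
        entries.sort(key=lambda r: r["timestamp"])
    return dict(traces)


def first_error(entries):
    """Return the earliest ERROR entry in a sorted trace, or None."""
    return next((r for r in entries if r["level"] == "ERROR"), None)
```

The earliest error in a trace is often, though not always, closest to the origin of a failure that later surfaces in other services.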
Log Retention Policies
Storage Tiers
Implement multi-tier storage:
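    -- Hot tier: Recent logs (7 days), then moved to the 'warm' disk.
    -- Assumes a storage policy (called 'tiered' here, a placeholder name)
    -- that defines a disk named 'warm'.
    CREATE TABLE logs_hot AS logs
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, level, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 7 DAY TO DISK 'warm'
    SETTINGS storage_policy = 'tiered';

    -- Warm tier: Historical logs (30 days)
    -- Stored on slower, cheaper storage
    -- Automatically moved by TTL

    -- Cold tier: Archive (1 year)
    CREATE TABLE logs_archive AS logs
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (service, level, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 1 YEAR DELETE;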
Retention Best Practices
| Log Type | Retention Period | Reasoning |
|---|---|---|
| Debug logs | 1-3 days | High volume, low value |
| Info logs | 7-14 days | Normal operations |
| Warn logs | 30 days | Potential issues |
| Error logs | 90 days | Investigation needed |
| Security logs | 1 year | Compliance requirements |
| Audit logs | 7 years | Legal requirements |
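A retention table like this can be encoded as data and checked in code, for example in a cleanup job. The sketch below uses the upper bound of each range and approximates a year as 365 days; both choices, and the `RETENTION_DAYS` and `is_expired` names, are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Upper bound of each retention range from the table, in days.
RETENTION_DAYS = {
    "debug": 3,
    "info": 14,
    "warn": 30,
    "error": 90,
    "security": 365,
    "audit": 7 * 365,  # approximation: ignores leap days
}


def is_expired(log_type, logged_at, now=None):
    """True when an entry of `log_type` is older than its retention period.

    `logged_at` and `now` must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    return now - logged_at > timedelta(days=RETENTION_DAYS[log_type])
```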
Cost Optimization
    -- Sample debug logs (keep roughly 10%)
    CREATE TABLE logs_sampled
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    AS SELECT * FROM logs
    WHERE level = 'DEBUG' AND rand() % 10 = 0;  -- ~10% sampling

    -- Keep all error/warn logs
    CREATE TABLE logs_important
    ENGINE = MergeTree()
    ORDER BY (service, level, timestamp)
    AS SELECT * FROM logs
    WHERE level IN ('ERROR', 'WARN', 'FATAL');
Production Best Practices
1. Include Request Context
Add context to every log:
    const uuid = require('uuid');

    // Middleware to add request context
    app.use((req, res, next) => {
      req.requestId = uuid.v4();
      req.logger = logger.child({
        requestId: req.requestId,
        method: req.method,
        path: req.path,
        userAgent: req.headers['user-agent'],
        clientIp: req.ip,
      });
      next();
    });

    // Use request-scoped logger
    app.get('/api/users', (req, res) => {
      req.logger.info('Fetching users');
      // ... handle request
    });
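The same idea can be sketched in Python with `contextvars`, which lets any code running within a request pick up the context without passing a logger around. `request_context`, `begin_request`, and `RequestContextFilter` are hypothetical names, and the web framework hook that would call `begin_request` is left out.

```python
import contextvars
import logging
import uuid

# Holds the context of the request currently being handled.
request_context = contextvars.ContextVar("request_context", default={})


def begin_request(method, path):
    """Create and store the context for a new request; returns the context."""
    context = {"requestId": str(uuid.uuid4()), "method": method, "path": path}
    request_context.set(context)
    return context


class RequestContextFilter(logging.Filter):
    """Copy the current request context onto every log record."""

    def filter(self, record):
        for key, value in request_context.get().items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it
```

Attaching the filter to a handler (`handler.addFilter(RequestContextFilter())`) makes the fields available to whatever formatter renders the record.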
2. Avoid Logging Sensitive Data
Scrub sensitive information:
    const SENSITIVE_FIELDS = ['password', 'token', 'apiKey', 'ssn', 'creditCard'];

    // Redacts top-level fields only; nested objects are not inspected
    function sanitizeLog(data) {
      const sanitized = { ...data };
      SENSITIVE_FIELDS.forEach(field => {
        // Check for presence so that empty or falsy values are redacted too
        if (field in sanitized) {
          sanitized[field] = '***REDACTED***';
        }
      });
      return sanitized;
    }

    // Usage
    logger.info('User created', sanitizeLog({
      userId: user.id,
      email: user.email,
      password: user.password, // Will be redacted
    }));
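Sensitive values often sit inside nested objects or lists, which a top-level check does not reach. The following sketch walks the whole structure and compares keys case-insensitively; the field list mirrors the one above, and `sanitize` is a hypothetical helper name.

```python
# Lowercased versions of the sensitive field names used above.
SENSITIVE_FIELDS = {"password", "token", "apikey", "ssn", "creditcard"}
REDACTED = "***REDACTED***"


def sanitize(value):
    """Return a copy of `value` with sensitive dict keys redacted at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_FIELDS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value  # scalars are returned unchanged
```

This matches on key names only; secrets embedded in free-text values (for example a token inside a URL) would need separate pattern-based scrubbing.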
3. Use Sampling for High-Volume Logs
Sample verbose logs:
    function shouldLog(level, samplingRate = 1.0) {
      if (level === 'FATAL' || level === 'ERROR' || level === 'WARN') {
        return true; // Always log warnings and above
      }
      return Math.random() < samplingRate;
    }

    // Sample 10% of debug logs
    if (shouldLog('DEBUG', 0.1)) {
      logger.debug('Detailed debug info', { /* ... */ });
    }
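Random sampling decides per log line, so a sampled trace can end up with only some of its entries. One alternative, sketched below, derives the decision from a hash of the trace ID so that every entry of a given trace is either kept or dropped together. This is a different technique from the random sampling above, and `keep_log` is a hypothetical name.

```python
import hashlib

ALWAYS_KEEP = {"WARN", "ERROR", "FATAL"}


def keep_log(level, trace_id, sampling_rate):
    """Decide deterministically whether to keep a log entry.

    Warnings and above are always kept. For other levels, the trace ID is
    hashed into a bucket in [0, 1) and kept when the bucket falls below
    `sampling_rate`, so the same trace always gets the same decision.
    """
    if level.upper() in ALWAYS_KEEP:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sampling_rate
```

Because the decision depends only on the trace ID, separate services applying the same function keep the same traces without coordinating.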
4. Implement Log Rotation
Prevent disk space exhaustion:
    // Winston with rotation
    const winston = require('winston');
    require('winston-daily-rotate-file');

    const transport = new winston.transports.DailyRotateFile({
      filename: 'logs/app-%DATE%.log',
      datePattern: 'YYYY-MM-DD',
      maxSize: '20m',
      maxFiles: '14d', // Keep 14 days
      zippedArchive: true, // gzip rotated files (the option is not named `compress`)
    });

    const logger = winston.createLogger({
      transports: [transport]
    });
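For comparison, Python's standard library offers time-based rotation through `TimedRotatingFileHandler`. The sketch below rotates at midnight and keeps 14 old files; unlike the Winston setup it has no size limit and does not compress rotated files. `build_rotating_handler` is a hypothetical helper name.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(path):
    """Create a handler that rotates daily and keeps 14 rotated files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",  # rotate once per day
        backupCount=14,   # delete rotated files beyond the newest 14
        utc=True,         # compute rotation times in UTC
        delay=True,       # open the file on first write, not at construction
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler
```

In containerized deployments it is common to log to stdout instead and let the platform handle rotation, in which case no file handler is needed.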
5. Structured Error Logging
Include full error context:
    try {
      await riskyOperation();
    } catch (error) {
      logger.error('Operation failed', {
        error: {
          message: error.message,
          stack: error.stack,
          code: error.code,
        },
        context: {
          userId: user.id,
          operation: 'riskyOperation',
          input: sanitizeLog(input),
        },
      });
      throw error;
    }
Alerting Based on Logs
Log-based Alerts
Create alerts from log patterns:
    # Alert on error rate
    alert:
      name: high_error_rate
      query: |
        SELECT countIf(level = 'ERROR') / count() as error_rate
        FROM logs
        WHERE timestamp >= now() - INTERVAL 5 MINUTE
      condition: error_rate > 0.05
      notification:
        # Quoted: an unquoted # starts a YAML comment
        - slack: "#alerts"
        - pagerduty: oncall
    ---
    # Alert on specific error messages
    alert:
      name: database_connection_errors
      query: |
        SELECT count() as error_count
        FROM logs
        WHERE level = 'ERROR'
          AND message LIKE '%database connection%'
          AND timestamp >= now() - INTERVAL 5 MINUTE
      condition: error_count > 10
      notification:
        - slack: "#database-team"
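The error-rate condition can be sketched as a small function, which also makes one edge case visible: a window with no logs at all, where the ratio is undefined. The code below treats an empty window as a rate of 0.0; that choice and the function names are assumptions for illustration.

```python
def error_rate(levels):
    """Fraction of entries in the window whose level is ERROR."""
    if not levels:
        return 0.0  # assumed policy: an empty window does not count as failing
    return sum(1 for level in levels if level == "ERROR") / len(levels)


def should_alert(levels, threshold=0.05):
    """True when the error rate in the window exceeds the threshold."""
    return error_rate(levels) > threshold
```

Whether silence should itself raise an alert (no logs may mean the service is down) is a separate decision that a rate-based rule does not cover.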
Performance Considerations
1. Batch Log Writes
Reduce I/O overhead:
    const logBuffer = [];
    const BATCH_SIZE = 100;
    const FLUSH_INTERVAL = 5000; // 5 seconds

    function bufferedLog(level, message, data) {
      logBuffer.push({ level, message, data, timestamp: Date.now() });
      if (logBuffer.length >= BATCH_SIZE) {
        flushLogs();
      }
    }

    function flushLogs() {
      if (logBuffer.length === 0) return;
      const batch = logBuffer.splice(0, logBuffer.length);
      logger.info('Batch logs', { logs: batch });
    }

    // Flush periodically
    setInterval(flushLogs, FLUSH_INTERVAL);
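A comparable batching sketch in Python is shown below. It differs from the JavaScript version in one respect: instead of a background timer, the elapsed time is checked whenever an entry is added, which keeps the example single-threaded. `LogBatcher` and its `sink` callback are hypothetical; the sink stands in for whatever actually ships a batch.

```python
import time


class LogBatcher:
    """Buffer log entries and hand them to `sink` in batches."""

    def __init__(self, sink, batch_size=100, flush_interval=5.0, clock=time.monotonic):
        self._sink = sink                      # callable that receives a list of entries
        self._batch_size = batch_size
        self._flush_interval = flush_interval  # seconds
        self._clock = clock                    # injectable for testing
        self._buffer = []
        self._last_flush = clock()

    def add(self, entry):
        self._buffer.append(entry)
        interval_elapsed = self._clock() - self._last_flush >= self._flush_interval
        if len(self._buffer) >= self._batch_size or interval_elapsed:
            self.flush()

    def flush(self):
        """Send buffered entries, if any, and reset the flush timer."""
        self._last_flush = self._clock()
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._sink(batch)
```

Call `flush()` on shutdown as well, otherwise entries still in the buffer are lost when the process exits.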
2. Async Logging
Prevent blocking application:
    const { createLogger, transports } = require('winston');

    const logger = createLogger({
      transports: [
        new transports.Stream({
          // Note: process.stdout may still write synchronously, depending on
          // the destination (file, TTY, or pipe) and the platform
          stream: process.stdout,
          handleExceptions: false,
          handleRejections: false,
        })
      ]
    });

    // Report logging errors instead of letting them crash the application
    logger.on('error', (error) => {
      console.error('Logging error:', error);
    });
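In Python, one standard way to keep slow log output off the request path is the `QueueHandler` and `QueueListener` pair from `logging.handlers`: the application thread only enqueues records, and a background thread forwards them to the real handler. The sketch below wires this up; `build_async_logger` is a hypothetical helper name.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, target_handler):
    """Return (logger, listener); records reach `target_handler` on a background thread."""
    log_queue = queue.Queue(-1)  # unbounded queue
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    logger = logging.getLogger(name)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    listener.start()
    return logger, listener
```

Call `listener.stop()` during shutdown; it processes the records still in the queue before the thread exits. An unbounded queue trades memory for never blocking, so a bounded queue may be preferable under sustained overload.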
3. Minimize Log Volume
Log only what's needed:
    // Development: Verbose logging
    if (process.env.NODE_ENV === 'development') {
      logger.level = 'debug';
    }

    // Production: Essential logs only
    if (process.env.NODE_ENV === 'production') {
      logger.level = 'info';
    }

    // Conditional logging
    if (logger.isLevelEnabled('debug')) {
      logger.debug('Complex computation', expensiveToSerialize());
    }
Troubleshooting with Logs
Common Scenarios
1. Find Recent Errors
    SELECT * FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY timestamp DESC
    LIMIT 50;
2. Track User Journey
    SELECT timestamp, service, message
    FROM logs
    WHERE user_id = '12345'
      AND timestamp >= now() - INTERVAL 1 DAY
    ORDER BY timestamp ASC;
3. Identify Slow Operations
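    SELECT * FROM logs
    WHERE message LIKE '%took%ms%'
      -- extract() returns the first capture group, or '' when nothing matches;
      -- toUInt32OrZero() then yields 0 instead of failing on the cast
      AND toUInt32OrZero(extract(message, 'took (\\d+)ms')) > 1000
    ORDER BY timestamp DESC;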
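The same extraction can be sketched outside the database, for example when post-processing exported logs. The pattern below matches the `took <N>ms` convention assumed by the query above; `duration_ms` and `slow_messages` are hypothetical helper names.

```python
import re

DURATION_PATTERN = re.compile(r"took (\d+)ms")


def duration_ms(message):
    """Return the duration embedded in a message, or None if there is none."""
    match = DURATION_PATTERN.search(message)
    return int(match.group(1)) if match else None


def slow_messages(messages, threshold_ms=1000):
    """Keep only messages whose embedded duration exceeds the threshold."""
    slow = []
    for message in messages:
        duration = duration_ms(message)
        if duration is not None and duration > threshold_ms:
            slow.append(message)
    return slow
```

Parsing durations out of message text is fragile. Logging the duration as its own numeric field, in line with the structured logging advice earlier, avoids the regular expression entirely.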
4. Detect Error Patterns
    SELECT message, count() as occurrence_count
    FROM logs
    WHERE level = 'ERROR'
      AND timestamp >= now() - INTERVAL 1 DAY
    GROUP BY message
    ORDER BY occurrence_count DESC
    LIMIT 10;
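Grouping by the exact message treats "timeout after 30ms" and "timeout after 45ms" as different errors. A common refinement, sketched below under that assumption, replaces variable parts of the message with placeholders before counting. Only digit runs are normalized here; `fingerprint` and `top_patterns` are hypothetical names.

```python
import re
from collections import Counter


def fingerprint(message):
    """Replace every run of digits with a placeholder so similar messages group together."""
    return re.sub(r"\d+", "<n>", message)


def top_patterns(messages, limit=10):
    """Return the most frequent message fingerprints with their counts."""
    return Counter(fingerprint(m) for m in messages).most_common(limit)
```

Real messages usually need more rules (UUIDs, hex IDs, quoted values), so treat the single substitution as a starting point.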
Related Documentation
- Tracing - Distributed tracing and correlation
- Error Tracking - Error management and aggregation
- Metrics - Time-series metrics collection
- APM - Application performance monitoring
- Dashboards - Log visualization and reporting