
Log Management in GitLab

Overview

GitLab provides comprehensive log management capabilities, including log aggregation, search, filtering, and correlation with traces and metrics. Effective log management is critical for troubleshooting, monitoring, and understanding application behavior in production.

What is Log Management?

Log management enables you to:

Aggregate logs: Collect logs from multiple sources
Search and filter: Find relevant log entries quickly
Correlate: Link logs with traces and metrics
Analyze trends: Identify patterns and anomalies
Troubleshoot: Debug production issues efficiently

GitLab Log Architecture

Components

Application Logs → Log Collector    → Storage      → Query Engine → UI
                   (Vector/Fluentd)   (ClickHouse)   (SQL)          (GitLab)

ClickHouse Integration

GitLab leverages ClickHouse for log storage and analytics:

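Column-oriented: Reads only the columns a query touches, which suits analytical queries
High compression: Stores large log volumes compactly
Fast queries: Often answers filtered and aggregated queries in under a second, depending on data volume and schema
JSON support: Handles structured logs natively
Time-series optimized: Sorts and partitions data by time, which matches how logs are written and queried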

Structured Logging

Why Structured Logs?

Structured logs are machine-readable and queryable:

// Bad: Unstructured string
console.log(`User ${userId} logged in from ${ipAddress}`);

// Good: Structured JSON
logger.info('user_login', {
  userId: userId,
  ipAddress: ipAddress,
  timestamp: new Date().toISOString(),
  userAgent: req.headers['user-agent']
});

Standard Log Format

Use JSON format with consistent fields:

{
  "timestamp": "2026-01-08T12:34:56.789Z",
  "level": "info",
  "service": "user-api",
  "message": "User logged in",
  "userId": "12345",
  "ipAddress": "192.168.1.1",
  "requestId": "abc-123-def-456",
  "traceId": "xyz-789-uvw-012",
  "environment": "production",
  "version": "0.4.9"
}
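
A consumer can check that each record carries the shared fields before it is shipped. The following is a minimal Python sketch of such a check. The `REQUIRED_FIELDS` set and the `missing_fields` helper are assumptions made for this illustration, not part of GitLab or of any logging library.

```python
import json

# Assumed minimum set of fields, taken from the example record above.
REQUIRED_FIELDS = {"timestamp", "level", "service", "message"}

def missing_fields(line: str) -> set:
    """Return the required fields that are absent from one JSON log line."""
    record = json.loads(line)
    return REQUIRED_FIELDS - set(record)
```

A record that lacks `service`, for example, yields a set containing only that name, which a pipeline could use to reject or tag the line.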

Log Levels

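Use appropriate log levels. Some loggers, such as Winston, emit lowercase level names, while the SQL queries in this document match uppercase values such as 'ERROR', so normalize the casing at collection time: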

Level	Purpose	Example
`TRACE`	Very detailed debugging	Function entry/exit
`DEBUG`	Debugging information	Variable values, flow control
`INFO`	Normal operations	User actions, system events
`WARN`	Potential issues	Deprecated API usage, slow queries
`ERROR`	Errors requiring attention	Exceptions, failed operations
`FATAL`	Critical system failures	Database unavailable, OOM
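
The ordering in the table can be turned into a threshold check. The sketch below assumes the six levels are ranked exactly as listed. The `is_enabled` helper is a hypothetical name used for illustration, and it ignores casing so that `info` and `INFO` compare equal.

```python
# Severity order taken from the table above, lowest to highest.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
SEVERITY = {name: rank for rank, name in enumerate(LEVELS)}

def is_enabled(level: str, threshold: str = "INFO") -> bool:
    """Return True when `level` is at or above `threshold`; casing is ignored."""
    return SEVERITY[level.upper()] >= SEVERITY[threshold.upper()]
```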

Implementing Structured Logging

Node.js (Winston)

const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'user-api',
    environment: process.env.NODE_ENV,
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'logs/error.log', level: 'error' }),
    new winston.transports.File({ filename: 'logs/combined.log' })
  ]
});

// Usage
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  loginMethod: 'oauth'
});

logger.error('Database query failed', {
  query: 'SELECT * FROM users',
  error: error.message,
  stack: error.stack
});

Python (structlog)

import structlog

# Configure structlog
structlog.configure(
    processors=[
        structlog.processors.add_log_level,  # include "level" in every event
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

logger = structlog.get_logger()

# Usage
logger.info(
    "user_logged_in",
    user_id=user.id,
    email=user.email,
    login_method="oauth"
)

logger.error(
    "database_query_failed",
    query="SELECT * FROM users",
    error=str(error)
)

Go (zap)

import (
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// Create logger
config := zap.NewProductionConfig()
config.EncoderConfig.TimeKey = "timestamp"
config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
logger, err := config.Build()
if err != nil {
    panic(err) // nothing can be logged without a logger
}
defer logger.Sync()

// Usage
logger.Info("user_logged_in",
    zap.String("user_id", user.ID),
    zap.String("email", user.Email),
    zap.String("login_method", "oauth"),
)

logger.Error("database_query_failed",
    zap.String("query", "SELECT * FROM users"),
    zap.Error(err),
)

Java (Logback + SLF4J)

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.marker.Markers;

Logger logger = LoggerFactory.getLogger(UserService.class);

// Usage
logger.info(Markers.append("user_id", user.getId())
    .and(Markers.append("email", user.getEmail()))
    .and(Markers.append("login_method", "oauth")),
    "User logged in");

logger.error(Markers.append("query", "SELECT * FROM users")
    .and(Markers.append("error", e.getMessage())),
    "Database query failed", e);

Log Collection

Vector Configuration

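Vector is a high-performance log collector. The configuration below tails JSON log files, parses each line, adds host metadata, and writes the result to ClickHouse. Event field names should match the ClickHouse column names (for example `trace_id`), so rename camelCase fields such as `traceId` in a remap transform if your application emits them: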

# vector.toml
[sources.application_logs]
type = "file"
include = ["/var/log/app/*.log"]
read_from = "beginning"

[transforms.parse_json]
type = "remap"
inputs = ["application_logs"]
source = '''
  . = parse_json!(.message)
  .timestamp = parse_timestamp!(.timestamp, "%+")
'''

[transforms.enrich]
type = "remap"
inputs = ["parse_json"]
source = '''
  .environment = get_env_var!("ENVIRONMENT")
  .host = get_hostname!()
  .pod_name = get_env_var!("POD_NAME")
'''

[sinks.clickhouse]
type = "clickhouse"
inputs = ["enrich"]
endpoint = "http://clickhouse:8123"
table = "logs"
database = "observability"
compression = "gzip"

Fluentd Configuration

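Fluentd is an alternative log collector. The `clickhouse` output type is not part of Fluentd core, so this configuration assumes a ClickHouse output plugin is installed: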

# fluent.conf
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.pos
  tag app.logs
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%L%z
  </parse>
</source>

<filter app.logs>
  @type record_transformer
  <record>
    environment "#{ENV['ENVIRONMENT']}"
    host "#{Socket.gethostname}"
    pod_name "#{ENV['POD_NAME']}"
  </record>
</filter>

<match app.logs>
  @type clickhouse
  host clickhouse
  port 8123
  database observability
  table logs
  <buffer>
    @type memory
    flush_interval 5s
  </buffer>
</match>

Log Storage with ClickHouse

Table Schema

Optimized schema for log storage:

CREATE TABLE logs (
    timestamp DateTime64(3) CODEC(DoubleDelta, LZ4),
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    trace_id String,
    span_id String,
    user_id String,
    request_id String,
    environment LowCardinality(String),
    host LowCardinality(String),
    attributes Map(String, String),
    INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 1,
    INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 30 DAY;

Design Decisions:

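Compression codecs: Can reduce storage by 80-90% on repetitive log data; the actual ratio depends on the data
LowCardinality: Optimize repeated values (level, service)
Bloom filters: Fast lookups by trace_id, user_id
Token index: Speeds up token matching on message (for example with hasToken or LIKE)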
Partitioning: By date for efficient pruning
TTL: Automatic cleanup of old logs
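
The reason a Bloom filter index helps is that it can say "definitely not here" for a block of rows, so the engine can skip that block without reading it. The toy implementation below illustrates the data structure only. It is not how ClickHouse implements its indexes, and the sizes and hash choice are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting the value with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means the value was never added; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))
```

A lookup by trace_id would consult one such filter per block and read only the blocks where `might_contain` returns True.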

Materialized Views

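Pre-aggregate common queries. SummingMergeTree combines rows in the background, so read these views with sum(...) and GROUP BY rather than assuming one row per key: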

-- Error count by service
CREATE MATERIALIZED VIEW error_count_by_service
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (service, timestamp)
AS SELECT
    service,
    toStartOfHour(timestamp) as timestamp,
    count() as error_count
FROM logs
WHERE level = 'ERROR'
GROUP BY service, timestamp;

-- Log volume by level
CREATE MATERIALIZED VIEW log_volume_by_level
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (level, timestamp)
AS SELECT
    level,
    toStartOfMinute(timestamp) as timestamp,
    count() as log_count
FROM logs
GROUP BY level, timestamp;
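
The first view boils down to bucketing error records by service and hour. The Python sketch below shows the same aggregation over an in-memory list, assuming each record is a dictionary with `service`, `level`, and an ISO 8601 `timestamp`. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

def error_count_by_service(records):
    """Count ERROR records per (service, hour), like the first view above."""
    counts = Counter()
    for record in records:
        if record["level"] != "ERROR":
            continue
        ts = datetime.fromisoformat(record["timestamp"])
        hour = ts.replace(minute=0, second=0, microsecond=0)  # like toStartOfHour
        counts[(record["service"], hour)] += 1
    return counts
```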

Log Search and Filtering

Accessing Logs in GitLab

Navigate to: Monitor > Logs

Query Examples

1. Search by Service

SELECT *
FROM logs
WHERE service = 'user-api'
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;

2. Filter by Log Level

SELECT *
FROM logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 DAY
ORDER BY timestamp DESC;

3. Search Message Content

SELECT *
FROM logs
WHERE message LIKE '%database connection%'
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC;

4. Find Logs by User

SELECT *
FROM logs
WHERE user_id = '12345'
  AND timestamp >= now() - INTERVAL 1 DAY
ORDER BY timestamp DESC;

5. Trace ID Correlation

SELECT *
FROM logs
WHERE trace_id = 'xyz-789-uvw-012'
ORDER BY timestamp ASC;

Advanced Filtering

Multi-field Search

SELECT *
FROM logs
WHERE service = 'payment-api'
  AND level IN ('ERROR', 'WARN')
  AND message LIKE '%timeout%'
  AND timestamp >= now() - INTERVAL 6 HOUR
ORDER BY timestamp DESC;

Aggregation Queries

-- Error rate by service
SELECT
    service,
    countIf(level = 'ERROR') as error_count,
    count() as total_count,
    round(error_count / total_count * 100, 2) as error_rate_pct
FROM logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY error_rate_pct DESC;

Time-based Analysis

-- Log volume by hour
SELECT
    toStartOfHour(timestamp) as hour,
    count() as log_count
FROM logs
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour;

Log Correlation

Linking Logs with Traces

Include trace IDs in logs:

const { trace } = require('@opentelemetry/api');

function logWithTrace(message, data) {
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();

  logger.info(message, {
    ...data,
    traceId: spanContext?.traceId,
    spanId: spanContext?.spanId,
  });
}

// Usage
logWithTrace('Processing payment', {
  orderId: order.id,
  amount: order.total,
});

Querying Correlated Data

-- Get all logs for a trace
SELECT
    timestamp,
    level,
    service,
    message
FROM logs
WHERE trace_id = 'xyz-789-uvw-012'
ORDER BY timestamp ASC;

Benefits of Correlation

End-to-end visibility: See complete request flow
Root cause analysis: Identify where errors originated
Performance debugging: Find slow operations
Context preservation: Maintain request context across services
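
Once every record carries a trace ID, reconstructing a request flow is a group-and-sort operation. The sketch below assumes records are dictionaries with `trace_id` and `timestamp` keys, and that timestamps are UTC ISO 8601 strings in one consistent format. The `group_by_trace` helper is hypothetical.

```python
from collections import defaultdict

def group_by_trace(records):
    """Map each trace_id to its records, ordered by timestamp."""
    traces = defaultdict(list)
    for record in records:
        trace_id = record.get("trace_id")
        if trace_id:  # skip records emitted outside any trace
            traces[trace_id].append(record)
    for entries in traces.values():
        # Same-format ISO 8601 UTC strings sort chronologically as text.
        entries.sort(key=lambda r: r["timestamp"])
    return dict(traces)
```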

Log Retention Policies

Storage Tiers

Implement multi-tier storage:

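-- Hot tier: Recent logs (7 days), then moved to the 'warm' disk by TTL
-- Assumes a storage policy (named 'tiered' here) that defines a 'warm' disk
CREATE TABLE logs_hot AS logs
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 7 DAY TO DISK 'warm'
SETTINGS storage_policy = 'tiered';

-- Warm tier: Historical logs (30 days)
-- Stored on slower, cheaper storage
-- Automatically moved by TTL

-- Cold tier: Archive (1 year)
CREATE TABLE logs_archive AS logs
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 1 YEAR DELETE;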

Retention Best Practices

Log Type	Retention Period	Reasoning
Debug logs	1-3 days	High volume, low value
Info logs	7-14 days	Normal operations
Warn logs	30 days	Potential issues
Error logs	90 days	Investigation needed
Security logs	1 year	Compliance requirements
Audit logs	7 years	Legal requirements
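
A retention table like this one is straightforward to encode as data. The sketch below uses the upper bound of each range and approximates 7 years as 7 * 365 days. The dictionary keys and the `is_expired` helper are names chosen for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Upper bounds from the table above, in days (7 years approximated as 7 * 365).
RETENTION_DAYS = {
    "debug": 3,
    "info": 14,
    "warn": 30,
    "error": 90,
    "security": 365,
    "audit": 7 * 365,
}

def is_expired(log_type: str, written_at: datetime, now: datetime) -> bool:
    """Return True once a record is older than its retention period."""
    return now - written_at > timedelta(days=RETENTION_DAYS[log_type])
```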

Cost Optimization

-- Sample debug logs (keep roughly 10%)
CREATE TABLE logs_sampled
ENGINE = MergeTree()
ORDER BY (service, level, timestamp)
AS SELECT *
FROM logs
WHERE level = 'DEBUG'
  AND rand() % 10 = 0; -- 10% sampling

-- Keep all error/warn logs
CREATE TABLE logs_important
ENGINE = MergeTree()
ORDER BY (service, level, timestamp)
AS SELECT *
FROM logs
WHERE level IN ('ERROR', 'WARN', 'FATAL');

Production Best Practices

1. Include Request Context

Add context to every log:

const uuid = require('uuid');

// Middleware to add request context
app.use((req, res, next) => {
  req.requestId = uuid.v4();
  req.logger = logger.child({
    requestId: req.requestId,
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
    clientIp: req.ip,
  });
  next();
});

// Use request-scoped logger
app.get('/api/users', (req, res) => {
  req.logger.info('Fetching users');
  // ... handle request
});

2. Avoid Logging Sensitive Data

Scrub sensitive information:

const SENSITIVE_FIELDS = ['password', 'token', 'apiKey', 'ssn', 'creditCard'];

function sanitizeLog(data) {
  const sanitized = { ...data };
  SENSITIVE_FIELDS.forEach(field => {
    if (sanitized[field]) {
      sanitized[field] = '***REDACTED***';
    }
  });
  return sanitized;
}

// Usage
logger.info('User created', sanitizeLog({
  userId: user.id,
  email: user.email,
  password: user.password, // Will be redacted
}));
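
The function above inspects only top-level keys, so a password nested inside another object would pass through. A recursive variant is sketched below in Python, reusing the same field list. It is one possible approach, and the `sanitize` name is hypothetical.

```python
SENSITIVE_FIELDS = {"password", "token", "apiKey", "ssn", "creditCard"}

def sanitize(value):
    """Return a copy of value with sensitive keys redacted at any depth."""
    if isinstance(value, dict):
        return {
            key: "***REDACTED***" if key in SENSITIVE_FIELDS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value  # scalars are returned unchanged
```

Matching is by exact key name, so variants such as `user_password` still need to be added to the list explicitly.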

3. Use Sampling for High-Volume Logs

Sample verbose logs:

function shouldLog(level, samplingRate = 1.0) {
  if (level === 'ERROR' || level === 'WARN') {
    return true; // Always log errors and warnings
  }
  return Math.random() < samplingRate;
}

// Sample 10% of debug logs
if (shouldLog('DEBUG', 0.1)) {
  logger.debug('Detailed debug info', { ... });
}
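
Random sampling decides per log line, which can leave a trace with only some of its lines. A different technique, shown here as an alternative and not as what the code above does, is to hash a stable key such as the trace ID so that every line of a trace gets the same decision. The `keep_trace` helper is hypothetical.

```python
import hashlib

def keep_trace(trace_id: str, sampling_rate: float) -> bool:
    """Decide per trace, not per line, so sampled traces stay complete."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the hash to one of 10,000 evenly spaced buckets in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 10_000
    return bucket < sampling_rate
```

Because the decision depends only on the trace ID, independent services reach the same verdict without coordinating.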

4. Implement Log Rotation

Prevent disk space exhaustion:

// Winston with rotation
const winston = require('winston');
require('winston-daily-rotate-file');

const transport = new winston.transports.DailyRotateFile({
  filename: 'logs/app-%DATE%.log',
  datePattern: 'YYYY-MM-DD',
  maxSize: '20m',
  maxFiles: '14d', // Keep 14 days
  zippedArchive: true, // gzip rotated files
});

const logger = winston.createLogger({
  transports: [transport]
});

5. Structured Error Logging

Include full error context:

try {
  await riskyOperation();
} catch (error) {
  logger.error('Operation failed', {
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
    },
    context: {
      userId: user.id,
      operation: 'riskyOperation',
      input: sanitizeLog(input),
    },
  });
  throw error;
}

Alerting Based on Logs

Log-based Alerts

Create alerts from log patterns:

# Alert on error rate
alert:
  name: high_error_rate
  query: |
    SELECT
      countIf(level = 'ERROR') / count() as error_rate
    FROM logs
    WHERE timestamp >= now() - INTERVAL 5 MINUTE
  condition: error_rate > 0.05
  notification:
    - slack: "#alerts"  # quoted so YAML does not treat it as a comment
    - pagerduty: oncall

---
# Alert on specific error messages
alert:
  name: database_connection_errors
  query: |
    SELECT count() as error_count
    FROM logs
    WHERE level = 'ERROR'
      AND message LIKE '%database connection%'
      AND timestamp >= now() - INTERVAL 5 MINUTE
  condition: error_count > 10
  notification:
    - slack: "#database-team"
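
The error-rate condition can be sketched in Python to show the logic an alert evaluator would apply to one time window. The function names are hypothetical, and the 0.05 threshold mirrors the rule above. The empty-window case is handled explicitly so that a quiet period does not divide by zero.

```python
def error_rate(records) -> float:
    """Fraction of records at ERROR level; 0.0 for an empty window."""
    if not records:
        return 0.0
    errors = sum(1 for record in records if record["level"] == "ERROR")
    return errors / len(records)

def should_alert(records, threshold: float = 0.05) -> bool:
    """Return True when the window's error rate exceeds the threshold."""
    return error_rate(records) > threshold
```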

Performance Considerations

1. Batch Log Writes

Reduce I/O overhead:

const logBuffer = [];
const BATCH_SIZE = 100;
const FLUSH_INTERVAL = 5000; // 5 seconds

function bufferedLog(level, message, data) {
  logBuffer.push({ level, message, data, timestamp: Date.now() });

  if (logBuffer.length >= BATCH_SIZE) {
    flushLogs();
  }
}

function flushLogs() {
  if (logBuffer.length === 0) return;

  const batch = logBuffer.splice(0, logBuffer.length);
  logger.info('Batch logs', { logs: batch });
}

// Flush periodically
setInterval(flushLogs, FLUSH_INTERVAL);
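
The same size-triggered batching can be written as a small class, shown here in Python. The `sink` callable stands in for whatever actually ships the batch and is an assumption of this sketch. A real implementation would also flush on a timer and at shutdown so that buffered records are not lost.

```python
class LogBuffer:
    """Collect records and hand them to `sink` in batches."""

    def __init__(self, sink, batch_size: int = 100):
        self.sink = sink  # callable that receives a list of records
        self.batch_size = batch_size
        self.pending = []

    def add(self, record: dict) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        # Swap the list first so records added during sink() start a new batch.
        batch, self.pending = self.pending, []
        self.sink(batch)
```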

2. Async Logging

Prevent blocking application:

const winston = require('winston');
const { createLogger, format, transports } = winston;

const logger = createLogger({
  transports: [
    new transports.Stream({
      stream: process.stdout,
      // Leave uncaught exceptions and rejections to the application
      handleExceptions: false,
      handleRejections: false,
    })
  ]
});

// Report logger errors without crashing the application
logger.on('error', (error) => {
  console.error('Logging error:', error);
});

3. Minimize Log Volume

Log only what's needed:

// Development: Verbose logging
if (process.env.NODE_ENV === 'development') {
  logger.level = 'debug';
}

// Production: Essential logs only
if (process.env.NODE_ENV === 'production') {
  logger.level = 'info';
}

// Conditional logging
if (logger.isLevelEnabled('debug')) {
  logger.debug('Complex computation', expensiveToSerialize());
}

Troubleshooting with Logs

Common Scenarios

1. Find Recent Errors

SELECT *
FROM logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 50;

2. Track User Journey

SELECT timestamp, service, message
FROM logs
WHERE user_id = '12345'
  AND timestamp >= now() - INTERVAL 1 DAY
ORDER BY timestamp ASC;

3. Identify Slow Operations

SELECT *
FROM logs
WHERE message LIKE '%took%ms%'
  AND toUInt32OrZero(extract(message, 'took (\\d+)ms')) > 1000
ORDER BY timestamp DESC;
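
The same extraction can be done client-side on a list of messages. The Python sketch below assumes durations appear as "took <number>ms", as in the query above. The `slow_operations` name is hypothetical. Logging the duration as a numeric field would avoid the parsing step altogether.

```python
import re

DURATION_RE = re.compile(r"took (\d+)ms")

def slow_operations(messages, threshold_ms: int = 1000):
    """Yield (duration_ms, message) for messages slower than the threshold."""
    for message in messages:
        match = DURATION_RE.search(message)
        if match and int(match.group(1)) > threshold_ms:
            yield int(match.group(1)), message
```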

4. Detect Error Patterns

SELECT
    message,
    count() as occurrence_count
FROM logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY message
ORDER BY occurrence_count DESC
LIMIT 10;

References

Tracing - Distributed tracing and correlation
Error Tracking - Error management and aggregation
Metrics - Time-series metrics collection
APM - Application performance monitoring
Dashboards - Log visualization and reporting