Alerting and Notifications in GitLab

Overview

GitLab provides comprehensive alerting capabilities to notify teams about system issues, performance degradations, and security threats. Effective alerting helps teams respond quickly to incidents and maintain system reliability.

What is Alerting?

Alerting enables you to:

  • Detect issues proactively: Be notified before users are affected
  • Respond quickly: Reduce mean time to resolution (MTTR)
  • Prevent alert fatigue: Smart routing and deduplication
  • Track incident response: Integrate with incident management
  • Maintain SLOs: Monitor service level objectives

Alert Sources

1. Prometheus Alerts

Metrics-based alerting through Prometheus:

  • System metrics (CPU, memory, disk)
  • Application metrics (latency, errors, throughput)
  • Custom business metrics

2. Error Tracking Alerts

Application error notifications:

  • New error types
  • Error rate spikes
  • Critical errors

3. Security Alerts

Security scanning notifications:

  • Vulnerability discoveries
  • Secret detection
  • License compliance issues

4. Pipeline Alerts

CI/CD notifications:

  • Pipeline failures
  • Deployment issues
  • Performance regressions

Setting Up Alerts in GitLab

Accessing Alert Settings

Navigate to: Settings > Monitor > Alerts

Enable Alerting

  1. Enable alert management
  2. Configure notification channels:
    • Email
    • Slack
    • PagerDuty
    • Webhooks
  3. Set up alert rules

Prometheus Alert Rules

Alert Rule Structure

    # prometheus_rules.yml
    groups:
      - name: application_alerts
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
              team: backend
            annotations:
              summary: "High error rate detected"
              description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
              runbook: "https://gitlab.com/runbooks/high-error-rate"
              dashboard: "https://gitlab.com/dashboards/errors"

Alert Rule Components

Expression (expr):

  • PromQL query defining the alert condition
  • Fires one alert for each time series the expression returns; an empty result means nothing fires

Duration (for):

  • Time the condition must be true before alerting
  • Prevents false positives from transient spikes

Labels:

  • Metadata for routing and grouping alerts
  • Common labels: severity, team, service

Annotations:

  • Human-readable alert information
  • Can include dynamic values from metrics

Common Alert Patterns

1. High Error Rate

    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Error rate exceeds 5%"
        description: "Current error rate: {{ $value | humanizePercentage }}"

2. High Latency

    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "P95 latency exceeds 2 seconds"
        description: "Current P95 latency: {{ $value }}s"

3. Service Down

    - alert: ServiceDown
      expr: up{job="user-api"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.instance }} is down"
        description: "Service has been unreachable for 1 minute"

4. High Memory Usage

    - alert: HighMemoryUsage
      expr: |
        (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Memory usage exceeds 90%"
        description: "Current usage: {{ $value | humanizePercentage }}"

5. Disk Space Low

    - alert: DiskSpaceLow
      expr: |
        (node_filesystem_avail_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Disk space below 10%"
        description: "Available space: {{ $value | humanizePercentage }}"

6. Certificate Expiring

    - alert: CertificateExpiring
      expr: |
        (ssl_certificate_expiry_seconds - time()) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "SSL certificate expires in {{ $value }} days"
        description: "Certificate for {{ $labels.domain }} expires soon"

7. Pod Restart Loop

    - alert: PodRestartLoop
      # increase() makes $value a restart count; rate() would report restarts per second
      expr: |
        increase(kube_pod_container_status_restarts_total[15m]) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is restart looping"
        description: "Pod has restarted {{ $value | humanize }} times in 15 minutes"

Alert Severity Levels

Severity Classification

    # Critical: Immediate action required
    severity: critical
    # Examples:
    #   - Service completely down
    #   - Data loss occurring
    #   - Security breach
    # Response: Page on-call engineer immediately

    # Warning: Action required soon
    severity: warning
    # Examples:
    #   - Performance degraded
    #   - Resource usage high
    #   - Non-critical service down
    # Response: Create ticket, notify team

    # Info: Awareness only
    severity: info
    # Examples:
    #   - Deployment completed
    #   - Scaling event occurred
    #   - Maintenance window starting
    # Response: Log for reference

Routing by Severity

    # alertmanager.yml
    route:
      receiver: default
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical alerts -> PagerDuty
        - match:
            severity: critical
          receiver: pagerduty
          group_wait: 10s
          repeat_interval: 1h
        # Warning alerts -> Slack
        - match:
            severity: warning
          receiver: slack
          group_wait: 5m
          repeat_interval: 12h
        # Info alerts -> Email (daily digest)
        - match:
            severity: info
          receiver: email
          group_wait: 24h
          repeat_interval: 24h
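The routing tree above can be read as "first child route whose labels match wins, otherwise fall back to the default receiver". The following is a minimal sketch of that lookup, not Alertmanager's implementation; `ROUTES`, `DEFAULT_RECEIVER`, and `pick_receiver` are hypothetical names that mirror only the `match` and `receiver` fields of the config.

```python
from typing import Dict, List

# Hypothetical, simplified mirror of the child routes in alertmanager.yml.
ROUTES: List[dict] = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
    {"match": {"severity": "info"}, "receiver": "email"},
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first route whose match labels all equal the alert's labels."""
    for route in ROUTES:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    # No child route matched, so the top-level receiver handles the alert.
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "HighErrorRate", "severity": "critical"}))
print(pick_receiver({"alertname": "NoSeverityLabel"}))
```

An alert without a `severity` label falls through to `default`, which is a quick way to spot rules that are missing the label.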

Notification Channels

Slack Integration

Setup Slack Integration

  1. Create Slack App:

  2. Configure in GitLab:

    # Settings > Integrations > Slack
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    username: GitLab Alerts
    channel: "#alerts"   # quoted so that YAML does not read it as a comment
    notify_only_broken_pipelines: false

Slack Notification Format

    # alertmanager.yml
    # Alertmanager does not expand ${...} placeholders itself; substitute them
    # (for example with envsubst) before the file is loaded.
    receivers:
      - name: slack
        slack_configs:
          - api_url: ${SLACK_WEBHOOK_URL}
            channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: |
              *Summary:* {{ .CommonAnnotations.summary }}
              *Description:* {{ .CommonAnnotations.description }}
              *Severity:* {{ .CommonLabels.severity }}
              *Runbook:* {{ .CommonAnnotations.runbook }}
            color: '{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}good{{ end }}'

PagerDuty Integration

Setup PagerDuty

  1. Create Integration:

    • Navigate to PagerDuty Services
    • Add integration: Prometheus
    • Copy integration key
  2. Configure in GitLab:

    # Settings > Monitor > Alerts > PagerDuty
    integration_key: ${PAGERDUTY_INTEGRATION_KEY}

PagerDuty Configuration

    # alertmanager.yml
    receivers:
      - name: pagerduty
        pagerduty_configs:
          - service_key: ${PAGERDUTY_INTEGRATION_KEY}
            description: '{{ .CommonAnnotations.summary }}'
            details:
              severity: '{{ .CommonLabels.severity }}'
              service: '{{ .CommonLabels.service }}'
              instance: '{{ .CommonLabels.instance }}'
              runbook: '{{ .CommonAnnotations.runbook }}'
              dashboard: '{{ .CommonAnnotations.dashboard }}'

Email Notifications

Configure Email Alerts

    # alertmanager.yml
    receivers:
      - name: email
        email_configs:
          - to: 'team@example.com'
            from: 'alerts@example.com'
            smarthost: smtp.gmail.com:587
            auth_username: alerts@example.com
            auth_password: ${SMTP_PASSWORD}
            headers:
              Subject: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
            # StartsAt is a per-alert field, so it is read from the first alert in the group
            html: |
              <h2>{{ .CommonAnnotations.summary }}</h2>
              <p><strong>Severity:</strong> {{ .CommonLabels.severity }}</p>
              <p><strong>Description:</strong> {{ .CommonAnnotations.description }}</p>
              <p><strong>Started:</strong> {{ (index .Alerts 0).StartsAt }}</p>
              {{ if .CommonAnnotations.runbook }}
              <p><a href="{{ .CommonAnnotations.runbook }}">View Runbook</a></p>
              {{ end }}

Webhook Integration

Custom Webhook

    # alertmanager.yml
    receivers:
      - name: webhook
        webhook_configs:
          - url: 'https://your-service.com/webhook'
            send_resolved: true
            http_config:
              bearer_token: ${WEBHOOK_TOKEN}

Webhook Payload

    {
      "status": "firing",
      "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical",
        "service": "user-api"
      },
      "annotations": {
        "summary": "High error rate detected",
        "description": "Error rate is 7.5%",
        "runbook": "https://gitlab.com/runbooks/high-error-rate"
      },
      "startsAt": "2026-01-08T12:34:56Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://prometheus/graph?g0.expr=..."
    }
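A receiving service usually needs only a few fields from this payload. The sketch below parses one alert object of the shape shown above and builds a one-line summary; `summarize_alert` is a hypothetical helper, and it assumes the body is a single alert object. Alertmanager's own webhook body carries a list of such objects under an `alerts` key, so a real handler would loop over that list.

```python
import json


def summarize_alert(raw: str) -> str:
    """Turn one alert object (JSON text) into a short human-readable line."""
    alert = json.loads(raw)
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return "[{status}] {name} ({severity}) on {service}: {summary}".format(
        status=alert.get("status", "unknown").upper(),
        name=labels.get("alertname", "unknown"),
        severity=labels.get("severity", "unknown"),
        service=labels.get("service", "unknown"),
        summary=annotations.get("summary", ""),
    )


example = json.dumps({
    "status": "firing",
    "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "user-api"},
    "annotations": {"summary": "High error rate detected"},
    "startsAt": "2026-01-08T12:34:56Z",
})
print(summarize_alert(example))
```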

Alert Management

Viewing Alerts

Navigate to: Monitor > Alerts

Alert List View

Active Alerts

  HighErrorRate | Critical | 5m ago
    Error rate exceeds 5% (current: 7.5%)
    Service: user-api | Instance: api-01
    [Runbook] [Dashboard] [Silence] [Create Issue]

  HighLatency | Warning | 15m ago
    P95 latency exceeds 2 seconds (current: 2.8s)
    Service: payment-api | Instance: api-03
    [Runbook] [Dashboard] [Silence] [Create Issue]

Alert Details

Click on an alert to view:

  • Full description
  • Metric graphs
  • Related alerts
  • Alert history
  • Actions (silence, create issue, assign)

Creating Issues from Alerts

Automatically create GitLab issues:

    # Enable auto-issue creation
    alert_management:
      auto_create_issues: true
      issue_template: |
        ## Alert: {{ .CommonAnnotations.summary }}

        **Severity:** {{ .CommonLabels.severity }}
        **Service:** {{ .CommonLabels.service }}
        **Started:** {{ .StartsAt }}

        ### Description
        {{ .CommonAnnotations.description }}

        ### Investigation
        - [ ] Check recent deployments
        - [ ] Review error logs
        - [ ] Check resource utilization
        - [ ] Verify external dependencies

        ### Runbook
        {{ .CommonAnnotations.runbook }}

        ### Dashboard
        {{ .CommonAnnotations.dashboard }}

        /label ~incident ~{{ .CommonLabels.severity }}
        /assign @oncall

Alert Grouping and Deduplication

Grouping Alerts

Group related alerts to reduce noise:

    # alertmanager.yml
    route:
      group_by: ['alertname', 'service', 'environment']
      group_wait: 30s       # Wait 30s to collect more alerts before the first notification
      group_interval: 5m    # Wait 5m before notifying about new alerts added to an existing group
      repeat_interval: 4h   # Resend after 4h if still firing

Example Grouping

Instead of 10 separate alerts:
 HighLatency on api-01
 HighLatency on api-02
 HighLatency on api-03
... (7 more)

Send 1 grouped alert:
 HighLatency affecting 10 instances:
   api-01, api-02, api-03, api-04, api-05,
   api-06, api-07, api-08, api-09, api-10
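The grouping in this example amounts to bucketing alerts by the values of the `group_by` labels. A minimal sketch of that idea follows; `group_alerts` is a hypothetical helper, and the ten sample alerts are made up to match the example above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[Dict[str, str]], group_by: List[str]) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Bucket alerts by the tuple of their group_by label values."""
    groups: Dict[Tuple[str, ...], List[Dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Ten HighLatency alerts that differ only in their instance label.
alerts = [
    {"alertname": "HighLatency", "service": "api", "instance": f"api-{i:02d}"}
    for i in range(1, 11)
]
groups = group_alerts(alerts, ["alertname", "service"])
for key, members in groups.items():
    instances = ", ".join(m["instance"] for m in members)
    print(f"{key[0]} affecting {len(members)} instances: {instances}")
```

Because `instance` is not part of `group_by`, all ten alerts land in one bucket and produce a single notification line.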

Deduplication

Prevent duplicate alerts:

    # Alerts with the same fingerprint are deduplicated
    # Fingerprint = hash(alertname + labels)

    # These are considered duplicates:
    HighLatency{service="api", instance="api-01"}
    HighLatency{service="api", instance="api-01"}

    # These are different:
    HighLatency{service="api", instance="api-01"}
    HighLatency{service="api", instance="api-02"}
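To make the fingerprint idea concrete, the sketch below hashes the sorted label set, so two alerts with identical labels always collapse to the same key. This is an illustration only: `fingerprint` is a hypothetical function, and SHA-256 is an assumption chosen for convenience, not the hash Alertmanager uses internally.

```python
import hashlib
from typing import Dict


def fingerprint(labels: Dict[str, str]) -> str:
    """Hash the label set in sorted order so that key order does not matter."""
    canonical = "\x00".join(f"{name}\x00{value}" for name, value in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


a = fingerprint({"alertname": "HighLatency", "service": "api", "instance": "api-01"})
b = fingerprint({"instance": "api-01", "service": "api", "alertname": "HighLatency"})
c = fingerprint({"alertname": "HighLatency", "service": "api", "instance": "api-02"})

print(a == b)  # same labels in a different order: duplicates
print(a == c)  # different instance: distinct alerts
```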

Silencing Alerts

Temporary Silence

Silence alerts during maintenance:

    # Via GitLab UI: Monitor > Alerts > Silence
    alertname: HighLatency
    service: user-api
    duration: 2h
    comment: "Database maintenance window"

Via API

    # Create silence (the v2 API requires isRegex on every matcher)
    curl -X POST https://alertmanager/api/v2/silences \
      -H "Content-Type: application/json" \
      -d '{
        "matchers": [
          {"name": "alertname", "value": "HighLatency", "isRegex": false},
          {"name": "service", "value": "user-api", "isRegex": false}
        ],
        "startsAt": "2026-01-08T14:00:00Z",
        "endsAt": "2026-01-08T16:00:00Z",
        "createdBy": "admin",
        "comment": "Database maintenance window"
      }'
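When silences are created from scripts, the start and end timestamps are usually derived from a duration rather than typed by hand. The sketch below builds the request body for the silences endpoint; `build_silence` is a hypothetical helper, and each matcher includes `isRegex` because the v2 silence schema lists that field as required.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Dict

RFC3339_UTC = "%Y-%m-%dT%H:%M:%SZ"


def build_silence(matchers: Dict[str, str], start: datetime, duration: timedelta,
                  created_by: str, comment: str) -> dict:
    """Build a silence body with exact-match matchers and UTC timestamps."""
    end = start + duration
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.astimezone(timezone.utc).strftime(RFC3339_UTC),
        "endsAt": end.astimezone(timezone.utc).strftime(RFC3339_UTC),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence(
    {"alertname": "HighLatency", "service": "user-api"},
    start=datetime(2026, 1, 8, 14, 0, tzinfo=timezone.utc),
    duration=timedelta(hours=2),
    created_by="admin",
    comment="Database maintenance window",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `curl -d` or any HTTP client as the POST body.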

Silence Patterns

    # Silence specific alert
    matchers:
      - alertname: HighLatency

    # Silence all alerts for a service
    matchers:
      - service: user-api

    # Silence by severity
    matchers:
      - severity: warning

    # Multiple conditions (AND)
    matchers:
      - service: user-api
      - severity: warning

Escalation Policies

On-Call Schedule

Configure escalation:

    # escalation_policy.yml
    teams:
      - name: backend
        schedule:
          timezone: America/New_York
          rotations:
            - name: Primary On-Call
              participants:
                - alice@example.com
                - bob@example.com
              rotation_length: 1 week
        escalation_rules:
          - delay: 0m
            notify: primary_oncall
          - delay: 15m
            notify: primary_oncall
            action: escalate
          - delay: 30m
            notify: team_lead
          - delay: 60m
            notify: engineering_manager

Escalation Flow

Alert Fires
 0m: Notify primary on-call (PagerDuty)
 15m: No ack? Page again + escalate
 30m: Still no ack? Notify team lead
 60m: Escalate to engineering manager
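The escalation flow is a lookup from "minutes without acknowledgement" to "who has been notified so far". A minimal sketch under that reading, with `ESCALATION_RULES`, `targets_notified`, and `next_step` as hypothetical names and the delays copied from the flow above:

```python
from typing import List, Optional, Tuple

# (delay in minutes, target), taken from the escalation flow.
ESCALATION_RULES: List[Tuple[int, str]] = [
    (0, "primary_oncall"),
    (15, "primary_oncall"),
    (30, "team_lead"),
    (60, "engineering_manager"),
]


def targets_notified(minutes_unacked: int) -> List[str]:
    """Everyone who has been notified once the alert is this many minutes old and unacknowledged."""
    return [target for delay, target in ESCALATION_RULES if delay <= minutes_unacked]


def next_step(minutes_unacked: int) -> Optional[Tuple[int, str]]:
    """The next pending escalation step, or None when the policy is exhausted."""
    for delay, target in ESCALATION_RULES:
        if delay > minutes_unacked:
            return (delay, target)
    return None


print(targets_notified(20))  # primary on-call paged twice
print(next_step(20))         # team lead is next, at the 30 minute mark
```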

Alert Fatigue Prevention

1. Adjust Alert Thresholds

Tune alerts to reduce false positives:

    # Bad: Too sensitive
    expr: rate(errors[1m]) > 0

    # Better: Allow some errors
    expr: rate(errors[5m]) > 10

    # Best: Contextual threshold
    expr: |
      rate(errors[5m]) / rate(requests[5m]) > 0.05
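A small numeric comparison shows why the ratio is the more useful threshold. The sketch below applies the absolute rule (more than 10 errors per second) and the ratio rule (more than 5% of requests failing) to two made-up services; the function names and traffic figures are illustrative assumptions.

```python
def absolute_alert(errors_per_sec: float, threshold: float = 10.0) -> bool:
    """Fire when the raw error rate exceeds a fixed number of errors per second."""
    return errors_per_sec > threshold


def ratio_alert(errors_per_sec: float, requests_per_sec: float, threshold: float = 0.05) -> bool:
    """Fire when the share of failing requests exceeds the threshold."""
    if requests_per_sec == 0:
        return False  # no traffic, so no meaningful ratio
    return errors_per_sec / requests_per_sec > threshold


# Quiet service: 2 of 10 requests per second fail (20%).
print(absolute_alert(2), ratio_alert(2, 10))
# Busy service: 12 of 10,000 requests per second fail (0.12%).
print(absolute_alert(12), ratio_alert(12, 10_000))
```

The absolute rule misses the quiet service that is failing one request in five and pages for the busy service that is almost entirely healthy; the ratio rule gets both cases right.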

2. Use the for Duration

Require sustained condition:

    # Fires immediately on spike
    - alert: HighCPU
      expr: cpu_usage > 80

    # Fires only if sustained for 10 minutes
    - alert: HighCPU
      expr: cpu_usage > 80
      for: 10m
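The effect of the for clause can be simulated as a small state machine: a rule is pending from the first evaluation where the condition holds and starts firing only once it has held for the full duration. The sketch below is a simplified model, not Prometheus code; `alert_states` and the sample series are hypothetical, with one evaluation per minute.

```python
from typing import List, Optional, Tuple


def alert_states(samples: List[Tuple[int, float]], threshold: float, for_minutes: int) -> List[str]:
    """Return inactive, pending, or firing for each (minute, value) evaluation."""
    states = []
    active_since: Optional[int] = None
    for minute, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = minute  # condition became true at this evaluation
            held_for = minute - active_since
            states.append("firing" if held_for >= for_minutes else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


spike = [(0, 50.0), (1, 95.0), (2, 60.0), (3, 55.0)]
sustained = [(minute, 90.0) for minute in range(12)]

print(alert_states(spike, threshold=80, for_minutes=10))
print(alert_states(spike, threshold=80, for_minutes=0))
print(alert_states(sustained, threshold=80, for_minutes=10)[9:])
```

With a 10 minute duration the one-minute spike never gets past pending, while the sustained series starts firing at minute 10.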

3. Alert During Business Hours

Adjust severity by time:

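    # Label templates cannot read the evaluation time, so the rule is split in two
    # and each copy is gated with PromQL time functions (hour() and day_of_week() use UTC).
    - alert: HighLatency
      expr: |
        p95_latency > 1
          and on() (hour() >= 9 and hour() < 18)
          and on() (day_of_week() >= 1 and day_of_week() <= 5)
      for: 5m
      labels:
        severity: critical
    - alert: HighLatency
      expr: |
        p95_latency > 1
          unless on() ((hour() >= 9 and hour() < 18)
            and (day_of_week() >= 1 and day_of_week() <= 5))
      for: 5m
      labels:
        severity: warning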

4. Maintenance Windows

Automatically silence during deployments:

    # .gitlab-ci.yml
    deploy:production:
      script:
        - ./scripts/create-silence.sh "Deployment in progress" 30m
        - kubectl apply -f k8s/
        - ./scripts/health-check.sh
      after_script:
        - ./scripts/delete-silence.sh

Incident Response Integration

Incident Lifecycle

Alert Fires
 1. Create Incident Issue
    Auto-populated with alert details
    Assigned to on-call engineer
    Labels: ~incident, ~severity::critical
 2. Notify Team
    PagerDuty page
    Slack notification
    Email to team
 3. Investigation
    Follow runbook
    Check dashboards
    Review logs/traces
 4. Resolution
    Deploy fix or rollback
    Verify metrics
    Mark incident resolved
 5. Post-Mortem
     Document root cause
     Identify action items
     Update runbooks

Incident Template

    ## Incident: {{ alert.summary }}

    **Status:** ACTIVE
    **Severity:** {{ alert.severity }}
    **Started:** {{ alert.startsAt }}
    **Service:** {{ alert.service }}

    ### Impact
    - Users affected: [To be determined]
    - Services impacted: {{ alert.service }}
    - Data loss: [Yes/No]

    ### Timeline
    - {{ alert.startsAt }}: Alert fired
    - {{ now }}: Incident created

    ### Investigation
    - [ ] Check recent deployments
    - [ ] Review error logs: {{ alert.logLink }}
    - [ ] Check metrics dashboard: {{ alert.dashboardLink }}
    - [ ] Verify external dependencies
    - [ ] Run diagnostic commands

    ### Resolution Steps
    - [ ] Identify root cause
    - [ ] Implement fix or rollback
    - [ ] Verify recovery
    - [ ] Monitor for recurrence

    ### Runbook
    {{ alert.runbook }}

    ### Communication
    - [ ] Notify stakeholders
    - [ ] Update status page
    - [ ] Post updates in Slack

    /label ~incident ~{{ alert.severity }} ~{{ alert.service }}
    /assign @oncall

Alert Testing

Test Alert Rules

    # Test Prometheus rule syntax
    promtool check rules prometheus_rules.yml

    # Test alert expression
    promtool query instant http://prometheus:9090 \
      'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05'

Simulate Alerts

    # Send test alert to Alertmanager (v2 API; the v1 API has been removed in current releases)
    curl -X POST http://alertmanager:9093/api/v2/alerts \
      -H "Content-Type: application/json" \
      -d '[{
        "labels": {
          "alertname": "TestAlert",
          "severity": "warning"
        },
        "annotations": {
          "summary": "This is a test alert"
        }
      }]'

Alert Validation Checklist

  • Alert fires when condition is met
  • Alert resolves when condition clears
  • Notifications reach correct channels
  • Runbook link is valid
  • Dashboard link is valid
  • Alert description is clear
  • Severity is appropriate
  • Threshold is tuned (no false positives)

Best Practices

1. Actionable Alerts

Every alert should:

  • Be actionable (what should I do?)
  • Include runbook link
  • Include dashboard link
  • Have clear severity
  • Contain context (affected service, instance)

2. Alert Hierarchy

Critical -> Page on-call immediately
  • Service completely down
  • Data loss occurring
  • Security breach

Warning -> Create ticket, notify team
  • Performance degraded
  • Resource usage high
  • Non-critical service down

Info -> Log for reference
  • Deployment completed
  • Scaling event occurred
  • Maintenance window

3. Mean Time to Acknowledge (MTTA)

Track how quickly alerts are acknowledged:

    -- fired_at is used in both places; the original mixed alert_fired_at and fired_at
    SELECT
      AVG(TIMESTAMPDIFF(SECOND, fired_at, acknowledged_at)) / 60 AS mtta_minutes
    FROM alerts
    WHERE acknowledged_at IS NOT NULL
      AND fired_at >= NOW() - INTERVAL 30 DAY;
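The same calculation can be done outside the database, for example on alerts exported from an API. The sketch below assumes each record is a dict with `fired_at` and `acknowledged_at` datetimes, mirroring the columns in the query; `mtta_minutes` and the sample records are hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import List, Optional


def mtta_minutes(alerts: List[dict]) -> Optional[float]:
    """Average minutes from firing to acknowledgement, ignoring unacknowledged alerts."""
    deltas = [
        (a["acknowledged_at"] - a["fired_at"]).total_seconds() / 60
        for a in alerts
        if a.get("acknowledged_at") is not None
    ]
    return mean(deltas) if deltas else None


records = [
    {"fired_at": datetime(2026, 1, 8, 12, 0), "acknowledged_at": datetime(2026, 1, 8, 12, 4)},
    {"fired_at": datetime(2026, 1, 8, 13, 0), "acknowledged_at": datetime(2026, 1, 8, 13, 10)},
    {"fired_at": datetime(2026, 1, 8, 14, 0), "acknowledged_at": None},  # never acknowledged
]
print(mtta_minutes(records))  # 7.0
```

As in the SQL version, unacknowledged alerts are excluded, so a falling MTTA should be read alongside the number of alerts that were never acknowledged.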

4. Alert Quality Metrics

Monitor alert effectiveness:

  • True positive rate: Share of alerts that required action
  • False positive rate: Share of alerts that resolved on their own without any action
  • Time to acknowledge: How fast team responds
  • Time to resolve: How fast issues are fixed
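The first two metrics can be computed from a simple record of whether each alert led to action. A minimal sketch, assuming a hypothetical `action_taken` flag on each alert record (how that flag gets recorded is left open here):

```python
from typing import Dict, List


def alert_quality(alerts: List[dict]) -> Dict[str, float]:
    """Share of alerts that needed action versus those that resolved without any."""
    total = len(alerts)
    if total == 0:
        return {"true_positive_rate": 0.0, "false_positive_rate": 0.0}
    actionable = sum(1 for a in alerts if a["action_taken"])
    return {
        "true_positive_rate": actionable / total,
        "false_positive_rate": (total - actionable) / total,
    }


history = [
    {"alertname": "HighErrorRate", "action_taken": True},
    {"alertname": "HighLatency", "action_taken": True},
    {"alertname": "HighCPU", "action_taken": False},  # cleared before anyone acted
    {"alertname": "DiskSpaceLow", "action_taken": True},
]
print(alert_quality(history))
```

A rule whose alerts are mostly in the second bucket is a candidate for a higher threshold or a longer for duration.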
