Alerting and Notifications in GitLab

Overview

GitLab provides comprehensive alerting capabilities to notify teams about system issues, performance degradations, and security threats. Effective alerting helps teams respond quickly to incidents and maintain system reliability.

What is Alerting?

Alerting enables you to:

Detect issues proactively: Be notified before users are affected
Respond quickly: Reduce mean time to resolution (MTTR)
Prevent alert fatigue: Smart routing and deduplication
Track incident response: Integrate with incident management
Maintain SLOs: Monitor service level objectives

Alert Sources

1. Prometheus Alerts

Metrics-based alerting through Prometheus:

System metrics (CPU, memory, disk)
Application metrics (latency, errors, throughput)
Custom business metrics

2. Error Tracking Alerts

Application error notifications:

New error types
Error rate spikes
Critical errors

3. Security Alerts

Security scanning notifications:

Vulnerability discoveries
Secret detection
License compliance issues

4. Pipeline Alerts

CI/CD notifications:

Pipeline failures
Deployment issues
Performance regressions

Setting Up Alerts in GitLab

Accessing Alert Settings

Navigate to: Settings > Monitor > Alerts

Enable Alerting

1. Enable alert management.
2. Configure notification channels:
   - Email
   - Slack
   - PagerDuty
   - Webhooks
3. Set up alert rules.

Prometheus Alert Rules

Alert Rule Structure

# prometheus_rules.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://gitlab.com/runbooks/high-error-rate"
          dashboard: "https://gitlab.com/dashboards/errors"

Alert Rule Components

Expression (expr):

PromQL query defining the alert condition
Fires one alert for each time series the query returns; an empty result means no alert

Duration (for):

Time the condition must be true before alerting
Prevents false positives from transient spikes

Labels:

Metadata for routing and grouping alerts
Common labels: severity, team, service

Annotations:

Human-readable alert information
Can include dynamic values from metrics
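
The effect of the for duration can be illustrated with a small simulation. This is a minimal sketch, not Prometheus code: the function name alert_states and the sample values are hypothetical, and it assumes one evaluation per interval with an alert that moves from pending to firing once the condition has held for at least the for duration.

```python
from datetime import timedelta

def alert_states(values, threshold, for_duration, interval):
    """Return the alert state after each evaluation: inactive, pending, or firing."""
    states = []
    active_for = None  # how long the condition has held continuously
    for value in values:
        if value > threshold:
            # First breach starts the timer at zero; later breaches extend it.
            active_for = timedelta(0) if active_for is None else active_for + interval
            states.append("firing" if active_for >= for_duration else "pending")
        else:
            active_for = None  # condition cleared, so the timer resets
            states.append("inactive")
    return states

# Hypothetical error ratios sampled once a minute, 5% threshold, for: 5m.
samples = [0.02, 0.07, 0.08, 0.03, 0.06, 0.06, 0.07, 0.08, 0.09, 0.09]
states = alert_states(samples, 0.05, timedelta(minutes=5), timedelta(minutes=1))
print(states)
```

The two-minute spike early in the series only reaches pending and then resets, which is the false-positive protection that for provides.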

Common Alert Patterns

1. High Error Rate

- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds 5%"
    description: "Current error rate: {{ $value | humanizePercentage }}"

2. High Latency

- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency exceeds 2 seconds"
    description: "Current P95 latency: {{ $value }}s"

3. Service Down

- alert: ServiceDown
  expr: up{job="user-api"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down"
    description: "Service has been unreachable for 1 minute"

4. High Memory Usage

- alert: HighMemoryUsage
  expr: |
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    / node_memory_MemTotal_bytes > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage exceeds 90%"
    description: "Current usage: {{ $value | humanizePercentage }}"

5. Disk Space Low

- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} /
     node_filesystem_size_bytes{mountpoint="/"}) < 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk space below 10%"
    description: "Available space: {{ $value | humanizePercentage }}"

6. Certificate Expiring

- alert: CertificateExpiring
  expr: |
    (ssl_certificate_expiry_seconds - time()) / 86400 < 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expires in {{ $value }} days"
    description: "Certificate for {{ $labels.domain }} expires soon"

7. Pod Restart Loop

- alert: PodRestartLoop
  expr: |
    increase(kube_pod_container_status_restarts_total[15m]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} is restart looping"
    description: "Pod has restarted {{ $value }} times in 15 minutes"

Alert Severity Levels

Severity Classification

# Critical: Immediate action required
severity: critical
# Examples:
# - Service completely down
# - Data loss occurring
# - Security breach
# Response: Page on-call engineer immediately

# Warning: Action required soon
severity: warning
# Examples:
# - Performance degraded
# - Resource usage high
# - Non-critical service down
# Response: Create ticket, notify team

# Info: Awareness only
severity: info
# Examples:
# - Deployment completed
# - Scaling event occurred
# - Maintenance window starting
# Response: Log for reference

Routing by Severity

# alertmanager.yml
route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts -> PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts -> Slack
    - match:
        severity: warning
      receiver: slack
      group_wait: 5m
      repeat_interval: 12h

    # Info alerts -> Email (daily digest)
    - match:
        severity: info
      receiver: email
      group_wait: 24h
      repeat_interval: 24h
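
The routing tree above can be read as a first-match lookup. The following is a minimal sketch of that idea, assuming exact-match label comparison and no continue flag; pick_receiver and ROUTES are hypothetical names, not part of Alertmanager.

```python
def pick_receiver(labels, routes, default="default"):
    """Return the receiver of the first route whose match labels all equal the alert's labels."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    # No child route matched, so the alert falls back to the top-level receiver.
    return default

# Mirrors the child routes in the configuration above.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
    {"match": {"severity": "info"}, "receiver": "email"},
]

print(pick_receiver({"alertname": "HighErrorRate", "severity": "critical"}, ROUTES))
print(pick_receiver({"alertname": "Unlabelled"}, ROUTES))
```

An alert without a severity label lands on the default receiver, which is a reason to keep that receiver monitored.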

Notification Channels

Slack Integration

Setup Slack Integration

Create Slack App:
- Visit https://api.slack.com/apps
- Create new app
- Add Incoming Webhook
- Install to workspace

Configure in GitLab:

# Settings > Integrations > Slack
webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
username: GitLab Alerts
channel: #alerts
notify_only_broken_pipelines: false

Slack Notification Format

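# alertmanager.yml
# Alertmanager does not expand ${VAR} placeholders itself; substitute them
# at deploy time (for example with envsubst).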
receivers:
  - name: slack
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Severity:* {{ .CommonLabels.severity }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        color: '{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}good{{ end }}'

PagerDuty Integration

Setup PagerDuty

Create Integration:
- Navigate to PagerDuty Services
- Add integration: Prometheus
- Copy integration key

Configure in GitLab:

# Settings > Monitor > Alerts > PagerDuty
integration_key: ${PAGERDUTY_INTEGRATION_KEY}

PagerDuty Configuration

# alertmanager.yml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_INTEGRATION_KEY}
        description: '{{ .CommonAnnotations.summary }}'
        details:
          severity: '{{ .CommonLabels.severity }}'
          service: '{{ .CommonLabels.service }}'
          instance: '{{ .CommonLabels.instance }}'
          runbook: '{{ .CommonAnnotations.runbook }}'
          dashboard: '{{ .CommonAnnotations.dashboard }}'

Email Notifications

Configure Email Alerts

# alertmanager.yml
receivers:
  - name: email
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: smtp.gmail.com:587
        auth_username: alerts@example.com
        auth_password: ${SMTP_PASSWORD}
        headers:
          Subject: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
        html: |
          <h2>{{ .CommonAnnotations.summary }}</h2>
          <p><strong>Severity:</strong> {{ .CommonLabels.severity }}</p>
          <p><strong>Description:</strong> {{ .CommonAnnotations.description }}</p>
          {{ range .Alerts }}<p><strong>Started:</strong> {{ .StartsAt }}</p>{{ end }}
          {{ if .CommonAnnotations.runbook }}
          <p><a href="{{ .CommonAnnotations.runbook }}">View Runbook</a></p>
          {{ end }}

Webhook Integration

Custom Webhook

# alertmanager.yml
receivers:
  - name: webhook
    webhook_configs:
      - url: 'https://your-service.com/webhook'
        send_resolved: true
        http_config:
          bearer_token: ${WEBHOOK_TOKEN}

Webhook Payload

Alertmanager wraps alerts in an envelope object that contains an alerts array; each entry in that array has this shape:

{
  "status": "firing",
  "labels": {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "service": "user-api"
  },
  "annotations": {
    "summary": "High error rate detected",
    "description": "Error rate is 7.5%",
    "runbook": "https://gitlab.com/runbooks/high-error-rate"
  },
  "startsAt": "2026-01-08T12:34:56Z",
  "endsAt": "0001-01-01T00:00:00Z",
  "generatorURL": "https://prometheus/graph?g0.expr=..."
}
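
A receiving service usually reduces an alert object like the one above to a short message. The sketch below assumes the single-alert shape shown here; summarize_alert is a hypothetical helper, and missing fields fall back to placeholder values instead of raising.

```python
import json

def summarize_alert(alert):
    """Build a one-line summary from an alert object with status, labels, and annotations."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    name = labels.get("alertname", "unnamed")
    severity = labels.get("severity", "none")
    summary = annotations.get("summary", "")
    return f"[{status}] {name} ({severity}): {summary}"

# Trimmed copy of the example payload above.
alert = json.loads("""
{
  "status": "firing",
  "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "user-api"},
  "annotations": {"summary": "High error rate detected"}
}
""")
message = summarize_alert(alert)
print(message)
```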

Alert Management

Viewing Alerts

Navigate to: Monitor > Alerts

Alert List View

Active Alerts

HighErrorRate                          Critical   5m ago
  Error rate exceeds 5% (current: 7.5%)
  Service: user-api | Instance: api-01
  [Runbook] [Dashboard] [Silence] [Create Issue]

HighLatency                            Warning    15m ago
  P95 latency exceeds 2 seconds (current: 2.8s)
  Service: payment-api | Instance: api-03
  [Runbook] [Dashboard] [Silence] [Create Issue]

Alert Details

Click on an alert to view:

Full description
Metric graphs
Related alerts
Alert history
Actions (silence, create issue, assign)

Creating Issues from Alerts

Automatically create GitLab issues:

# Enable auto-issue creation
alert_management:
  auto_create_issues: true
  issue_template: |
    ## Alert: {{ .CommonAnnotations.summary }}

    **Severity:** {{ .CommonLabels.severity }}
    **Service:** {{ .CommonLabels.service }}
    **Started:** {{ .StartsAt }}

    ### Description
    {{ .CommonAnnotations.description }}

    ### Investigation
    - [ ] Check recent deployments
    - [ ] Review error logs
    - [ ] Check resource utilization
    - [ ] Verify external dependencies

    ### Runbook
    {{ .CommonAnnotations.runbook }}

    ### Dashboard
    {{ .CommonAnnotations.dashboard }}

    /label ~incident ~{{ .CommonLabels.severity }}
    /assign @oncall

Alert Grouping and Deduplication

Grouping Alerts

Group related alerts to reduce noise:

# alertmanager.yml
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s       # Wait 30s for more alerts before sending
  group_interval: 5m    # Wait 5m before notifying about new alerts added to an existing group
  repeat_interval: 4h   # Resend after 4h if still firing

Example Grouping

Instead of 10 separate alerts:
- HighLatency on api-01
- HighLatency on api-02
- HighLatency on api-03
- ... (7 more)

Send 1 grouped alert:
- HighLatency affecting 10 instances:
  api-01, api-02, api-03, api-04, api-05,
  api-06, api-07, api-08, api-09, api-10
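
Grouping amounts to bucketing alerts by the values of the group_by labels. Here is a minimal sketch of that bucketing, assuming alerts are plain label dictionaries; group_alerts is a hypothetical name and the timing parameters (group_wait, group_interval) are left out.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alert label sets by the tuple of their group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts missing a group_by label share the empty-string bucket for that label.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)

# Ten HighLatency alerts that differ only in their instance label.
alerts = [
    {"alertname": "HighLatency", "service": "api", "instance": f"api-{i:02d}"}
    for i in range(1, 11)
]
groups = group_alerts(alerts, ["alertname", "service"])
for key, members in groups.items():
    print(key, [m["instance"] for m in members])
```

Because instance is not in group_by, all ten alerts fall into one group and produce a single notification.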

Deduplication

Prevent duplicate alerts:

# Alerts with the same fingerprint are deduplicated
# Fingerprint = hash of the full label set (alertname is itself a label)

# These are considered duplicates:
HighLatency{service="api", instance="api-01"}
HighLatency{service="api", instance="api-01"}

# These are different:
HighLatency{service="api", instance="api-01"}
HighLatency{service="api", instance="api-02"}
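
The fingerprint idea can be sketched as a hash over the sorted label set. This is an illustration only: it uses SHA-256 for convenience, which is an assumption and not the hash function Prometheus or Alertmanager use internally, and fingerprint is a hypothetical name.

```python
import hashlib

def fingerprint(labels):
    """Hash a label set so that label order does not affect the result."""
    canonical = "\n".join(f"{name}={value}" for name, value in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = {"alertname": "HighLatency", "service": "api", "instance": "api-01"}
b = {"instance": "api-01", "service": "api", "alertname": "HighLatency"}  # same labels, new order
c = {"alertname": "HighLatency", "service": "api", "instance": "api-02"}

print(fingerprint(a) == fingerprint(b))  # identical label sets are duplicates
print(fingerprint(a) == fingerprint(c))  # a different instance is a different alert
```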

Silencing Alerts

Temporary Silence

Silence alerts during maintenance:

# Via GitLab UI: Monitor > Alerts > Silence
alertname: HighLatency
service: user-api
duration: 2h
comment: "Database maintenance window"

Via API

# Create silence
curl -X POST https://alertmanager/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighLatency", "isRegex": false},
      {"name": "service", "value": "user-api", "isRegex": false}
    ],
    "startsAt": "2026-01-08T14:00:00Z",
    "endsAt": "2026-01-08T16:00:00Z",
    "createdBy": "admin",
    "comment": "Database maintenance window"
  }'
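
When silences are created from scripts, the request body is easier to build from a start time and a duration than to write by hand. The sketch below only constructs the JSON body and sends nothing; build_silence is a hypothetical helper, and isRegex is set explicitly on each matcher because the v2 silence schema requires that field.

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers, starts_at, duration, created_by, comment):
    """Build a silence request body with exact-match matchers and UTC timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": starts_at.strftime(fmt),
        "endsAt": (starts_at + duration).strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }

silence = build_silence(
    {"alertname": "HighLatency", "service": "user-api"},
    datetime(2026, 1, 8, 14, 0, tzinfo=timezone.utc),
    timedelta(hours=2),
    "admin",
    "Database maintenance window",
)
print(json.dumps(silence, indent=2))
```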

Silence Patterns

# Silence specific alert
matchers:
  - alertname: HighLatency

# Silence all alerts for a service
matchers:
  - service: user-api

# Silence by severity
matchers:
  - severity: warning

# Multiple conditions (AND)
matchers:
  - service: user-api
  - severity: warning
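
The AND semantics of multiple matchers can be checked with a few lines of code. This is a minimal sketch assuming exact-match matchers only (no regular expressions); is_silenced is a hypothetical name.

```python
def is_silenced(labels, matchers):
    """Return True only if every matcher equals the corresponding alert label."""
    return all(labels.get(name) == value for name, value in matchers.items())

matchers = {"service": "user-api", "severity": "warning"}

print(is_silenced({"alertname": "HighLatency", "service": "user-api", "severity": "warning"}, matchers))
print(is_silenced({"alertname": "ServiceDown", "service": "user-api", "severity": "critical"}, matchers))
```

The second alert stays audible: it matches the service but not the severity, so a warning-level silence does not hide a critical outage.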

Escalation Policies

On-Call Schedule

Configure escalation:

# escalation_policy.yml
teams:
  - name: backend
    schedule:
      timezone: America/New_York
      rotations:
        - name: Primary On-Call
          participants:
            - alice@example.com
            - bob@example.com
          rotation_length: 1 week

escalation_rules:
  - delay: 0m
    notify: primary_oncall

  - delay: 15m
    notify: primary_oncall
    action: escalate

  - delay: 30m
    notify: team_lead

  - delay: 60m
    notify: engineering_manager

Escalation Flow

Alert Fires
- 0m: Notify primary on-call (PagerDuty)
- 15m: No ack? Page again + escalate
- 30m: Still no ack? Notify team lead
- 60m: Escalate to engineering manager
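
The escalation flow can be expressed as a lookup over delay thresholds. The sketch below assumes the alert stays unacknowledged the whole time and that delays are measured in minutes from when the alert fired; ESCALATION_RULES and notified_so_far are hypothetical names that mirror the rules above.

```python
# (delay in minutes, who is notified), taken from the escalation rules above.
ESCALATION_RULES = [
    (0, "primary_oncall"),
    (15, "primary_oncall"),
    (30, "team_lead"),
    (60, "engineering_manager"),
]

def notified_so_far(minutes_unacked):
    """List every notification sent for an alert that has gone unacknowledged this long."""
    return [who for delay, who in ESCALATION_RULES if delay <= minutes_unacked]

print(notified_so_far(0))
print(notified_so_far(20))
print(notified_so_far(60))
```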

Alert Fatigue Prevention

1. Adjust Alert Thresholds

Tune alerts to reduce false positives:

# Bad: Too sensitive
expr: rate(errors[1m]) > 0

# Better: Allow some errors
expr: rate(errors[5m]) > 10

# Best: Contextual threshold
expr: |
  rate(errors[5m]) / rate(requests[5m]) > 0.05

2. Use the for Duration

Require sustained condition:

# Fires immediately on spike
- alert: HighCPU
  expr: cpu_usage > 80

# Fires only if sustained for 10 minutes
- alert: HighCPU
  expr: cpu_usage > 80
  for: 10m

3. Alert During Business Hours

Adjust severity by time:

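# Prometheus label templates expose only $labels, $value, and $externalLabels,
# so time-of-day logic belongs in the expression. hour() and day_of_week() use UTC.
- alert: HighLatencyBusinessHours
  expr: |
    p95_latency > 1
    and on() (day_of_week() >= 1 and day_of_week() <= 5)
    and on() (hour() >= 9 and hour() < 18)
  for: 5m
  labels:
    severity: critical

- alert: HighLatencyOffHours
  expr: |
    p95_latency > 1
    unless on() ((day_of_week() >= 1 and day_of_week() <= 5) and (hour() >= 9 and hour() < 18))
  for: 5m
  labels:
    severity: warning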

4. Maintenance Windows

Automatically silence during deployments:

# .gitlab-ci.yml
deploy:production:
  script:
    - ./scripts/create-silence.sh "Deployment in progress" 30m
    - kubectl apply -f k8s/
    - ./scripts/health-check.sh
  after_script:
    - ./scripts/delete-silence.sh

Incident Response Integration

Incident Lifecycle

Alert Fires
1. Create Incident Issue
   - Auto-populated with alert details
   - Assigned to on-call engineer
   - Labels: ~incident, ~severity::critical
2. Notify Team
   - PagerDuty page
   - Slack notification
   - Email to team
3. Investigation
   - Follow runbook
   - Check dashboards
   - Review logs/traces
4. Resolution
   - Deploy fix or rollback
   - Verify metrics
   - Mark incident resolved
5. Post-Mortem
   - Document root cause
   - Identify action items
   - Update runbooks

Incident Template

## Incident: {{ alert.summary }}

**Status:** ACTIVE
**Severity:** {{ alert.severity }}
**Started:** {{ alert.startsAt }}
**Service:** {{ alert.service }}

### Impact
- Users affected: [To be determined]
- Services impacted: {{ alert.service }}
- Data loss: [Yes/No]

### Timeline
- {{ alert.startsAt }}: Alert fired
- {{ now }}: Incident created

### Investigation
- [ ] Check recent deployments
- [ ] Review error logs: {{ alert.logLink }}
- [ ] Check metrics dashboard: {{ alert.dashboardLink }}
- [ ] Verify external dependencies
- [ ] Run diagnostic commands

### Resolution Steps
- [ ] Identify root cause
- [ ] Implement fix or rollback
- [ ] Verify recovery
- [ ] Monitor for recurrence

### Runbook
{{ alert.runbook }}

### Communication
- [ ] Notify stakeholders
- [ ] Update status page
- [ ] Post updates in Slack

/label ~incident ~{{ alert.severity }} ~{{ alert.service }}
/assign @oncall

Alert Testing

Test Alert Rules

# Test Prometheus rule syntax
promtool check rules prometheus_rules.yml

# Test alert expression
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05'

Simulate Alerts

# Send test alert to Alertmanager (the v1 API has been removed from recent releases)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'

Alert Validation Checklist

Alert fires when condition is met
Alert resolves when condition clears
Notifications reach correct channels
Runbook link is valid
Dashboard link is valid
Alert description is clear
Severity is appropriate
Threshold is tuned (no false positives)

Best Practices

1. Actionable Alerts

Every alert should:

Be actionable (what should I do?)
Include runbook link
Include dashboard link
Have clear severity
Contain context (affected service, instance)

2. Alert Hierarchy

Critical -> Page on-call immediately
- Service completely down
- Data loss occurring
- Security breach

Warning -> Create ticket, notify team
- Performance degraded
- Resource usage high
- Non-critical service down

Info -> Log for reference
- Deployment completed
- Scaling event occurred
- Maintenance window

3. Mean Time to Acknowledge (MTTA)

Track how quickly alerts are acknowledged:

SELECT
  AVG(TIMESTAMPDIFF(SECOND, fired_at, acknowledged_at)) / 60 AS mtta_minutes
FROM alerts
WHERE acknowledged_at IS NOT NULL
  AND fired_at >= NOW() - INTERVAL 30 DAY;
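
The same calculation can be done in application code when alert records are already in memory. This is a minimal sketch that mirrors the query above, assuming each record is a dictionary with fired_at and acknowledged_at datetimes; mtta_minutes and the sample records are hypothetical.

```python
from datetime import datetime, timedelta

def mtta_minutes(alerts, now, window=timedelta(days=30)):
    """Average minutes from firing to acknowledgement for acknowledged alerts in the window."""
    deltas = [
        (a["acknowledged_at"] - a["fired_at"]).total_seconds()
        for a in alerts
        if a["acknowledged_at"] is not None and a["fired_at"] >= now - window
    ]
    return sum(deltas) / len(deltas) / 60 if deltas else None

now = datetime(2026, 1, 31)
records = [
    {"fired_at": datetime(2026, 1, 10, 12, 0), "acknowledged_at": datetime(2026, 1, 10, 12, 4)},
    {"fired_at": datetime(2026, 1, 20, 8, 0), "acknowledged_at": datetime(2026, 1, 20, 8, 10)},
    {"fired_at": datetime(2026, 1, 25, 9, 0), "acknowledged_at": None},  # never acknowledged
    {"fired_at": datetime(2025, 11, 1, 9, 0), "acknowledged_at": datetime(2025, 11, 1, 10, 0)},  # too old
]
mtta = mtta_minutes(records, now)
print(mtta)
```

Unacknowledged alerts are excluded here, as in the query, so a rising count of them should be tracked separately.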

4. Alert Quality Metrics

Monitor alert effectiveness:

True positive rate: Share of alerts that required action
False positive rate: Share of alerts that resolved without any action
Time to acknowledge: How fast team responds
Time to resolve: How fast issues are fixed
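
The first two metrics can be computed from a log of alert outcomes. The sketch below assumes each alert has been tagged with one of two outcome strings, "action_taken" or "auto_resolved"; the tags and the alert_quality function are hypothetical and would need to match whatever your incident tooling records.

```python
def alert_quality(outcomes):
    """Return the share of alerts that needed action and the share that resolved on their own."""
    total = len(outcomes)
    if total == 0:
        return {"true_positive_rate": 0.0, "false_positive_rate": 0.0}
    actionable = sum(1 for outcome in outcomes if outcome == "action_taken")
    auto_resolved = sum(1 for outcome in outcomes if outcome == "auto_resolved")
    return {
        "true_positive_rate": actionable / total,
        "false_positive_rate": auto_resolved / total,
    }

# Hypothetical outcomes for eight alerts over a review period.
outcomes = ["action_taken"] * 6 + ["auto_resolved"] * 2
quality = alert_quality(outcomes)
print(quality)
```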

References

Metrics - Prometheus metrics for alerting
Tracing - Distributed tracing for debugging
Logs - Log analysis for incidents
Dashboards - Visualization and monitoring
DORA Metrics - MTTR tracking