Alerting and Notifications in GitLab
Overview
GitLab provides comprehensive alerting capabilities to notify teams about system issues, performance degradations, and security threats. Effective alerting helps teams respond quickly to incidents and maintain system reliability.
What is Alerting?
Alerting enables you to:
- Detect issues proactively: Be notified before users are affected
- Respond quickly: Reduce mean time to resolution (MTTR)
- Prevent alert fatigue: Smart routing and deduplication
- Track incident response: Integrate with incident management
- Maintain SLOs: Monitor service level objectives
Alert Sources
1. Prometheus Alerts
Metrics-based alerting through Prometheus:
- System metrics (CPU, memory, disk)
- Application metrics (latency, errors, throughput)
- Custom business metrics
2. Error Tracking Alerts
Application error notifications:
- New error types
- Error rate spikes
- Critical errors
3. Security Alerts
Security scanning notifications:
- Vulnerability discoveries
- Secret detection
- License compliance issues
4. Pipeline Alerts
CI/CD notifications:
- Pipeline failures
- Deployment issues
- Performance regressions
Setting Up Alerts in GitLab
Accessing Alert Settings
Navigate to: Settings > Monitor > Alerts
Enable Alerting
- Enable alert management
- Configure notification channels:
  - Slack
  - PagerDuty
  - Webhooks
- Set up alert rules
Prometheus Alert Rules
Alert Rule Structure
# prometheus_rules.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://gitlab.com/runbooks/high-error-rate"
          dashboard: "https://gitlab.com/dashboards/errors"
Alert Rule Components
Expression (expr):
- PromQL query defining the alert condition
- Returns a set of time series, not a boolean: the alert fires once for each series the expression returns, and an empty result means no alert
Duration (for):
- Time the condition must be true before alerting
- Prevents false positives from transient spikes
Labels:
- Metadata for routing and grouping alerts
- Common labels: severity, team, service
Annotations:
- Human-readable alert information
- Can include dynamic values from metrics
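The effect of the for duration can be illustrated with a small simulation. This is a simplified sketch of the pending-then-firing behaviour, not Prometheus's actual implementation; the function name first_firing_time and the sample format are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[datetime, bool]],
    for_duration: timedelta,
) -> Optional[datetime]:
    """Return the evaluation time at which the alert starts firing, or None.

    `samples` is a time-ordered series of (evaluation time, condition met).
    """
    pending_since: Optional[datetime] = None
    for ts, condition_met in samples:
        if not condition_met:
            pending_since = None  # condition cleared: the pending state resets
            continue
        if pending_since is None:
            pending_since = ts  # condition just became true: alert is pending
        if ts - pending_since >= for_duration:
            return ts  # condition held for the full duration: alert fires
    return None


if __name__ == "__main__":
    t0 = datetime(2026, 1, 8, 12, 0, 0)
    step = timedelta(seconds=30)
    # A one-minute spike never reaches a 5-minute "for" duration.
    spike = [(t0 + i * step, i in (2, 3)) for i in range(20)]
    print(first_firing_time(spike, timedelta(minutes=5)))
```

A transient spike resets the pending state before the duration elapses, which is why the alert never fires in the usage example.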
Common Alert Patterns
1. High Error Rate
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds 5%"
    description: "Current error rate: {{ $value | humanizePercentage }}"
2. High Latency
- alert: HighLatency
  # Aggregate buckets by le so the quantile covers all instances together
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency exceeds 2 seconds"
    description: "Current P95 latency: {{ $value }}s"
3. Service Down
- alert: ServiceDown
  expr: up{job="user-api"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down"
    description: "Service has been unreachable for 1 minute"
4. High Memory Usage
- alert: HighMemoryUsage
  expr: |
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    / node_memory_MemTotal_bytes > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage exceeds 90%"
    description: "Current usage: {{ $value | humanizePercentage }}"
5. Disk Space Low
- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk space below 10%"
    description: "Available space: {{ $value | humanizePercentage }}"
6. Certificate Expiring
- alert: CertificateExpiring
  expr: |
    (ssl_certificate_expiry_seconds - time()) / 86400 < 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expires in {{ $value }} days"
    description: "Certificate for {{ $labels.domain }} expires soon"
7. Pod Restart Loop
- alert: PodRestartLoop
  # increase() yields the restart count over the window; rate() would yield restarts per second
  expr: |
    increase(kube_pod_container_status_restarts_total[15m]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} is restart looping"
    description: "Pod has restarted {{ $value }} times in 15 minutes"
Alert Severity Levels
Severity Classification
# Critical: Immediate action required
severity: critical
# Examples:
# - Service completely down
# - Data loss occurring
# - Security breach
# Response: Page on-call engineer immediately

# Warning: Action required soon
severity: warning
# Examples:
# - Performance degraded
# - Resource usage high
# - Non-critical service down
# Response: Create ticket, notify team

# Info: Awareness only
severity: info
# Examples:
# - Deployment completed
# - Scaling event occurred
# - Maintenance window starting
# Response: Log for reference
Routing by Severity
# alertmanager.yml
route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts -> PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts -> Slack
    - match:
        severity: warning
      receiver: slack
      group_wait: 5m
      repeat_interval: 12h
    # Info alerts -> Email (daily digest)
    - match:
        severity: info
      receiver: email
      group_wait: 24h
      repeat_interval: 24h
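The routing tree above amounts to a first-match lookup on the severity label with a fallback receiver. The sketch below mirrors that logic in Python for illustration only; ROUTES, DEFAULT_RECEIVER, and pick_receiver are names invented for this example, and real Alertmanager routing also supports nesting, regex matchers, and the continue flag.

```python
from typing import Dict, List, Tuple

# (label matchers, receiver) pairs mirroring the child routes in the config.
ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
    ({"severity": "info"}, "email"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER  # no child route matched: use the root receiver


if __name__ == "__main__":
    print(pick_receiver({"alertname": "ServiceDown", "severity": "critical"}))
    print(pick_receiver({"alertname": "Unlabelled"}))
```

An alert without a severity label falls through to the default receiver, so it is worth making sure that receiver is monitored.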
Notification Channels
Slack Integration
Setup Slack Integration
- Create Slack App:
  - Visit https://api.slack.com/apps
  - Create new app
  - Add Incoming Webhook
  - Install to workspace
- Configure in GitLab:

# Settings > Integrations > Slack
webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
username: GitLab Alerts
# Quote the channel: an unquoted #alerts would be parsed as a YAML comment
channel: '#alerts'
notify_only_broken_pipelines: false
Slack Notification Format
# alertmanager.yml
# Alertmanager does not expand ${...} placeholders itself; substitute them
# (for example with envsubst) before loading the file.
receivers:
  - name: slack
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Severity:* {{ .CommonLabels.severity }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        color: '{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}good{{ end }}'
PagerDuty Integration
Setup PagerDuty
- Create Integration:
  - Navigate to PagerDuty Services
  - Add integration: Prometheus
  - Copy integration key
- Configure in GitLab:

# Settings > Monitor > Alerts > PagerDuty
integration_key: ${PAGERDUTY_INTEGRATION_KEY}
PagerDuty Configuration
# alertmanager.yml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_INTEGRATION_KEY}
        description: '{{ .CommonAnnotations.summary }}'
        details:
          severity: '{{ .CommonLabels.severity }}'
          service: '{{ .CommonLabels.service }}'
          instance: '{{ .CommonLabels.instance }}'
          runbook: '{{ .CommonAnnotations.runbook }}'
          dashboard: '{{ .CommonAnnotations.dashboard }}'
Email Notifications
Configure Email Alerts
# alertmanager.yml
receivers:
  - name: email
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: smtp.gmail.com:587
        auth_username: alerts@example.com
        auth_password: ${SMTP_PASSWORD}
        headers:
          Subject: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
        html: |
          <h2>{{ .CommonAnnotations.summary }}</h2>
          <p><strong>Severity:</strong> {{ .CommonLabels.severity }}</p>
          <p><strong>Description:</strong> {{ .CommonAnnotations.description }}</p>
          <!-- StartsAt is a per-alert field, so read it from the first alert in the group -->
          <p><strong>Started:</strong> {{ (index .Alerts 0).StartsAt }}</p>
          {{ if .CommonAnnotations.runbook }}
          <p><a href="{{ .CommonAnnotations.runbook }}">View Runbook</a></p>
          {{ end }}
Webhook Integration
Custom Webhook
# alertmanager.yml
receivers:
  - name: webhook
    webhook_configs:
      - url: 'https://your-service.com/webhook'
        send_resolved: true
        http_config:
          bearer_token: ${WEBHOOK_TOKEN}
Webhook Payload
Alertmanager delivers alerts as a list inside a larger notification object. Each entry in that alerts list has the following shape:

{
  "status": "firing",
  "labels": {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "service": "user-api"
  },
  "annotations": {
    "summary": "High error rate detected",
    "description": "Error rate is 7.5%",
    "runbook": "https://gitlab.com/runbooks/high-error-rate"
  },
  "startsAt": "2026-01-08T12:34:56Z",
  "endsAt": "0001-01-01T00:00:00Z",
  "generatorURL": "https://prometheus/graph?g0.expr=..."
}
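A receiving service typically turns each alert object into a short, human-readable line. The following is a minimal sketch of that step, assuming the alert shape shown above; summarize_alert is a hypothetical helper, not part of any GitLab or Alertmanager library.

```python
from typing import Any, Dict


def summarize_alert(alert: Dict[str, Any]) -> str:
    """Build a one-line summary, tolerating missing labels or annotations."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return "[{status}/{severity}] {name} on {service}: {summary}".format(
        status=alert.get("status", "unknown"),
        severity=labels.get("severity", "none"),
        name=labels.get("alertname", "unnamed"),
        service=labels.get("service", "unknown"),
        summary=annotations.get("summary", ""),
    )


if __name__ == "__main__":
    example = {
        "status": "firing",
        "labels": {
            "alertname": "HighErrorRate",
            "severity": "critical",
            "service": "user-api",
        },
        "annotations": {"summary": "High error rate detected"},
    }
    # prints "[firing/critical] HighErrorRate on user-api: High error rate detected"
    print(summarize_alert(example))
```

Using .get with defaults keeps the handler from failing on alerts that omit optional labels.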
Alert Management
Viewing Alerts
Navigate to: Monitor > Alerts
Alert List View
Active Alerts
HighErrorRate Critical 5m ago
Error rate exceeds 5% (current: 7.5%)
Service: user-api | Instance: api-01
[Runbook] [Dashboard] [Silence] [Create Issue]
HighLatency Warning 15m ago
P95 latency exceeds 2 seconds (current: 2.8s)
Service: payment-api | Instance: api-03
[Runbook] [Dashboard] [Silence] [Create Issue]
Alert Details
Click on an alert to view:
- Full description
- Metric graphs
- Related alerts
- Alert history
- Actions (silence, create issue, assign)
Creating Issues from Alerts
Automatically create GitLab issues:
# Enable auto-issue creation
alert_management:
  auto_create_issues: true
  issue_template: |
    ## Alert: {{ .CommonAnnotations.summary }}

    **Severity:** {{ .CommonLabels.severity }}
    **Service:** {{ .CommonLabels.service }}
    **Started:** {{ .StartsAt }}

    ### Description
    {{ .CommonAnnotations.description }}

    ### Investigation
    - [ ] Check recent deployments
    - [ ] Review error logs
    - [ ] Check resource utilization
    - [ ] Verify external dependencies

    ### Runbook
    {{ .CommonAnnotations.runbook }}

    ### Dashboard
    {{ .CommonAnnotations.dashboard }}

    /label ~incident ~{{ .CommonLabels.severity }}
    /assign @oncall
Alert Grouping and Deduplication
Grouping Alerts
Group related alerts to reduce noise:
# alertmanager.yml
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s      # Wait 30s to collect more alerts before the first notification for a new group
  group_interval: 5m   # Wait 5m before notifying about new alerts added to an already-notified group
  repeat_interval: 4h  # Resend after 4h if still firing
Example Grouping
Instead of 10 separate alerts:
HighLatency on api-01
HighLatency on api-02
HighLatency on api-03
... (7 more)
Send 1 grouped alert:
HighLatency affecting 10 instances:
api-01, api-02, api-03, api-04, api-05,
api-06, api-07, api-08, api-09, api-10
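Grouping of this kind boils down to bucketing alerts by the values of the group_by labels. The sketch below illustrates the idea under simplifying assumptions (alerts as plain label dictionaries, no timing); group_alerts is a name made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(
    alerts: List[Labels], group_by: List[str]
) -> Dict[Tuple[str, ...], List[Labels]]:
    """Bucket alerts by the values of the group_by labels (missing label = "")."""
    groups: Dict[Tuple[str, ...], List[Labels]] = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighLatency", "service": "api", "instance": f"api-{i:02d}"}
        for i in range(1, 11)
    ]
    for key, members in group_alerts(alerts, ["alertname", "service"]).items():
        instances = ", ".join(a["instance"] for a in members)
        print(f"{key[0]} affecting {len(members)} instances: {instances}")
```

Because instance is not part of group_by, all ten alerts land in a single group and produce one notification.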
Deduplication
Prevent duplicate alerts:
# Alerts with same fingerprint are deduplicated
# Fingerprint = hash(alertname + labels)

# These are considered duplicates:
HighLatency{service="api", instance="api-01"}
HighLatency{service="api", instance="api-01"}

# These are different:
HighLatency{service="api", instance="api-01"}
HighLatency{service="api", instance="api-02"}
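The fingerprint idea can be sketched as a hash over the sorted label set. This example uses SHA-256 as a stand-in and is not the hash Alertmanager actually computes; fingerprint and deduplicate are illustrative names.

```python
import hashlib
from typing import Dict, List

Labels = Dict[str, str]


def fingerprint(labels: Labels) -> str:
    """Hash the sorted label pairs so label order does not affect the result."""
    canonical = "\x1f".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def deduplicate(alerts: List[Labels]) -> List[Labels]:
    """Keep the first alert seen for each fingerprint."""
    seen: Dict[str, Labels] = {}
    for labels in alerts:
        seen.setdefault(fingerprint(labels), labels)
    return list(seen.values())


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighLatency", "service": "api", "instance": "api-01"},
        {"instance": "api-01", "service": "api", "alertname": "HighLatency"},
        {"alertname": "HighLatency", "service": "api", "instance": "api-02"},
    ]
    print(len(deduplicate(alerts)))  # the first two share a fingerprint
```

Sorting the labels before hashing is the important step: two alerts with the same labels in a different order must produce the same fingerprint.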
Silencing Alerts
Temporary Silence
Silence alerts during maintenance:
# Via GitLab UI: Monitor > Alerts > Silence
alertname: HighLatency
service: user-api
duration: 2h
comment: "Database maintenance window"
Via API
# Create silence (the v2 API requires isRegex on every matcher)
curl -X POST https://alertmanager/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighLatency", "isRegex": false},
      {"name": "service", "value": "user-api", "isRegex": false}
    ],
    "startsAt": "2026-01-08T14:00:00Z",
    "endsAt": "2026-01-08T16:00:00Z",
    "createdBy": "admin",
    "comment": "Database maintenance window"
  }'
Silence Patterns
# Silence specific alert
matchers:
  - alertname: HighLatency

# Silence all alerts for a service
matchers:
  - service: user-api

# Silence by severity
matchers:
  - severity: warning

# Multiple conditions (AND)
matchers:
  - service: user-api
  - severity: warning
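The AND semantics of multiple matchers can be expressed in a few lines. This is a sketch limited to equality matchers (no regex or negation), and is_silenced is a hypothetical helper used only for illustration.

```python
from typing import Dict


def is_silenced(labels: Dict[str, str], matchers: Dict[str, str]) -> bool:
    """Return True only if every matcher equals the alert's label value.

    An empty matcher set silences nothing, so a blank silence cannot
    accidentally mute every alert.
    """
    if not matchers:
        return False
    return all(labels.get(name) == value for name, value in matchers.items())


if __name__ == "__main__":
    alert = {"alertname": "HighLatency", "service": "user-api", "severity": "warning"}
    print(is_silenced(alert, {"service": "user-api", "severity": "warning"}))
    print(is_silenced(alert, {"service": "user-api", "severity": "critical"}))
```

The second call returns False because one of the two conditions does not match, which is the behaviour the "Multiple conditions (AND)" pattern relies on.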
Escalation Policies
On-Call Schedule
Configure escalation:
# escalation_policy.yml
teams:
  - name: backend
    schedule:
      timezone: America/New_York
      rotations:
        - name: Primary On-Call
          participants:
            - alice@example.com
            - bob@example.com
          rotation_length: 1 week
    escalation_rules:
      - delay: 0m
        notify: primary_oncall
      - delay: 15m
        notify: primary_oncall
        action: escalate
      - delay: 30m
        notify: team_lead
      - delay: 60m
        notify: engineering_manager
Escalation Flow
Alert Fires
0m: Notify primary on-call (PagerDuty)
15m: No ack? Page again + escalate
30m: Still no ack? Notify team lead
60m: Escalate to engineering manager
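The escalation flow above is a schedule of delays and targets, which can be modelled as a simple lookup. The sketch assumes the four steps from the policy; ESCALATION_STEPS and notified_so_far are names chosen for the example.

```python
from typing import List, Tuple

# (delay in minutes since the alert fired, who gets notified at that point)
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "primary_oncall"),
    (15, "primary_oncall"),
    (30, "team_lead"),
    (60, "engineering_manager"),
]


def notified_so_far(minutes_unacknowledged: int) -> List[str]:
    """List every notification sent for an alert still unacknowledged."""
    return [
        target
        for delay, target in ESCALATION_STEPS
        if delay <= minutes_unacknowledged
    ]


if __name__ == "__main__":
    for minutes in (0, 20, 45, 60):
        print(minutes, notified_so_far(minutes))
```

Acknowledging the alert stops the schedule, so in practice the list is only evaluated while the alert remains unacknowledged.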
Alert Fatigue Prevention
1. Adjust Alert Thresholds
Tune alerts to reduce false positives:
# Bad: Too sensitive
expr: rate(errors[1m]) > 0

# Better: Allow some errors
expr: rate(errors[5m]) > 10

# Best: Contextual threshold
expr: |
  rate(errors[5m]) / rate(requests[5m]) > 0.05
2. Use "for" Duration
Require sustained condition:
# Fires immediately on spike
- alert: HighCPU
  expr: cpu_usage > 80

# Fires only if sustained for 10 minutes
- alert: HighCPU
  expr: cpu_usage > 80
  for: 10m
3. Alert During Business Hours
Adjust severity by time:
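# Rule templates expose labels and the sample value but not the activation
# time, so split the rule and gate each copy with hour() and day_of_week()
# (both evaluated in UTC).
- alert: HighLatencyBusinessHours
  expr: |
    p95_latency > 1
    and on() (hour() >= 9 and hour() < 18)
    and on() (day_of_week() >= 1 and day_of_week() <= 5)
  for: 5m
  labels:
    severity: critical
- alert: HighLatencyOffHours
  expr: |
    p95_latency > 1
    unless on() (
      (hour() >= 9 and hour() < 18)
      and (day_of_week() >= 1 and day_of_week() <= 5)
    )
  for: 5m
  labels:
    severity: warning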
4. Maintenance Windows
Automatically silence during deployments:
# .gitlab-ci.yml
deploy:production:
  script:
    - ./scripts/create-silence.sh "Deployment in progress" 30m
    - kubectl apply -f k8s/
    - ./scripts/health-check.sh
  after_script:
    - ./scripts/delete-silence.sh
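The contents of create-silence.sh are not shown here. As one possible approach, a helper could build the JSON body for the Alertmanager v2 silences endpoint as sketched below; build_silence and its parameters are assumptions for illustration, and sending the request is left out.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def build_silence(
    matchers: Dict[str, str],
    comment: str,
    duration: timedelta,
    created_by: str = "ci",
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a silence body that starts now and ends after `duration`."""
    start = now or datetime.now(timezone.utc)
    end = start + duration
    return {
        # Equality matchers only; every matcher must match (AND).
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    body = build_silence(
        {"service": "user-api"}, "Deployment in progress", timedelta(minutes=30)
    )
    print(json.dumps(body, indent=2))
```

Setting an explicit end time means the silence expires on its own even if the cleanup step in after_script never runs.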
Incident Response Integration
Incident Lifecycle
Alert Fires
1. Create Incident Issue
   - Auto-populated with alert details
   - Assigned to on-call engineer
   - Labels: ~incident, ~severity::critical
2. Notify Team
   - PagerDuty page
   - Slack notification
   - Email to team
3. Investigation
   - Follow runbook
   - Check dashboards
   - Review logs/traces
4. Resolution
   - Deploy fix or rollback
   - Verify metrics
   - Mark incident resolved
5. Post-Mortem
   - Document root cause
   - Identify action items
   - Update runbooks
Incident Template
## Incident: {{ alert.summary }}

**Status:** ACTIVE
**Severity:** {{ alert.severity }}
**Started:** {{ alert.startsAt }}
**Service:** {{ alert.service }}

### Impact
- Users affected: [To be determined]
- Services impacted: {{ alert.service }}
- Data loss: [Yes/No]

### Timeline
- {{ alert.startsAt }}: Alert fired
- {{ now }}: Incident created

### Investigation
- [ ] Check recent deployments
- [ ] Review error logs: {{ alert.logLink }}
- [ ] Check metrics dashboard: {{ alert.dashboardLink }}
- [ ] Verify external dependencies
- [ ] Run diagnostic commands

### Resolution Steps
- [ ] Identify root cause
- [ ] Implement fix or rollback
- [ ] Verify recovery
- [ ] Monitor for recurrence

### Runbook
{{ alert.runbook }}

### Communication
- [ ] Notify stakeholders
- [ ] Update status page
- [ ] Post updates in Slack

/label ~incident ~{{ alert.severity }} ~{{ alert.service }}
/assign @oncall
Alert Testing
Test Alert Rules
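# Test Prometheus rule syntax
promtool check rules prometheus_rules.yml

# Test alert expression (aggregate both sides with sum(), as in the rule;
# without it each 5xx series is divided by itself and the ratio is always 1)
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05'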
Simulate Alerts
# Send test alert to Alertmanager (API v2; the v1 API has been removed from current releases)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
Alert Validation Checklist
- Alert fires when condition is met
- Alert resolves when condition clears
- Notifications reach correct channels
- Runbook link is valid
- Dashboard link is valid
- Alert description is clear
- Severity is appropriate
- Threshold is tuned (no false positives)
Best Practices
1. Actionable Alerts
Every alert should:
- Be actionable (what should I do?)
- Include runbook link
- Include dashboard link
- Have clear severity
- Contain context (affected service, instance)
2. Alert Hierarchy
Critical: Page on-call immediately
- Service completely down
- Data loss occurring
- Security breach
Warning: Create ticket, notify team
- Performance degraded
- Resource usage high
- Non-critical service down
Info: Log for reference
- Deployment completed
- Scaling event occurred
- Maintenance window
3. Mean Time to Acknowledge (MTTA)
Track how quickly alerts are acknowledged:
SELECT
  AVG(TIMESTAMPDIFF(SECOND, fired_at, acknowledged_at)) / 60 AS mtta_minutes
FROM alerts
WHERE acknowledged_at IS NOT NULL
  AND fired_at >= NOW() - INTERVAL 30 DAY;
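Where alert records are available in application code rather than a database, the same average can be computed directly. This is a small sketch assuming each record is a (fired_at, acknowledged_at) pair; mtta_minutes is an illustrative name.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

AlertRecord = Tuple[datetime, Optional[datetime]]  # (fired_at, acknowledged_at)


def mtta_minutes(alerts: List[AlertRecord]) -> Optional[float]:
    """Average minutes from firing to acknowledgement, ignoring unacked alerts."""
    delays = [
        (acknowledged_at - fired_at).total_seconds()
        for fired_at, acknowledged_at in alerts
        if acknowledged_at is not None
    ]
    if not delays:
        return None  # nothing acknowledged yet: MTTA is undefined
    return sum(delays) / len(delays) / 60


if __name__ == "__main__":
    fired = datetime(2026, 1, 8, 12, 0, 0)
    records = [
        (fired, fired + timedelta(minutes=5)),
        (fired, fired + timedelta(minutes=15)),
        (fired, None),  # never acknowledged, excluded from the average
    ]
    print(mtta_minutes(records))
```

Unacknowledged alerts are excluded here, matching the acknowledged_at IS NOT NULL filter in the query, so it is worth tracking their count separately.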
4. Alert Quality Metrics
Monitor alert effectiveness:
- True positive rate: Alerts requiring action
- False positive rate: Alerts that auto-resolve
- Time to acknowledge: How fast team responds
- Time to resolve: How fast issues are fixed
Related Documentation
- Metrics - Prometheus metrics for alerting
- Tracing - Distributed tracing for debugging
- Logs - Log analysis for incidents
- Dashboards - Visualization and monitoring
- DORA Metrics - MTTR tracking