GitLab CI/CD Monitoring and Analytics
Track pipeline performance, costs, and health across all projects.
Table of Contents
- Built-in Analytics
- Cost Monitoring
- Performance Metrics
- DORA Metrics
- Custom Dashboards
- Alerting
- Audit and Compliance
Built-in Analytics
CI/CD Analytics Overview
Location: Project → Analytics → CI/CD Analytics
Group-level: Group → Analytics → CI/CD Analytics
Available metrics:
- Pipeline success rate
- Pipeline duration trends
- Job duration breakdown
- Failure patterns
- Coverage trends
Source: GitLab CI/CD Analytics
Pipeline Charts
Visualizations:
- Pipeline status: Success/failed/canceled over time
- Pipeline duration: Average duration by branch
- Job breakdown: Time spent per job
- Stage duration: Bottleneck identification
How to use:
1. Navigate to Analytics → CI/CD Analytics
2. Select time range (7 days, 30 days, 90 days)
3. Filter by branch (main, development, all)
4. Identify trends and anomalies
Example insights:
- "Test stage taking 70% of pipeline time → parallelize tests"
- "Success rate dropped from 95% to 70% last week → investigate failures"
- "Pipeline duration doubled after dependency update → check caching"
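An insight like the first one (a single stage dominating pipeline time) can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes job data has already been exported as `stage,duration_seconds` lines (a hypothetical format; the `stage` and `duration` fields of the jobs API could feed it), and the durations in the example are made up.

```shell
#!/bin/bash
# Print each stage's share of total job time, largest first.
# Input: CSV lines of "stage,duration_seconds" on stdin (assumed format).
stage_share() {
  awk -F, '
    { total += $2; by_stage[$1] += $2 }
    END {
      if (total == 0) exit          # nothing to report
      for (s in by_stage)
        printf "%s,%.0f%%\n", s, 100 * by_stage[s] / total
    }' | sort -t, -k2,2nr
}

# Example with made-up durations: two test jobs, one build, one deploy
printf 'build,300\ntest,1400\ntest,700\ndeploy,600\n' | stage_share
```

With these sample numbers the test stage accounts for 70% of job time, followed by deploy (20%) and build (10%).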
Value Stream Analytics
Location: Group → Analytics → Value Stream
Tracks:
- Issue → Code → Test → Production (full cycle)
- Time in each stage
- Bottlenecks in delivery flow
- Lead time for changes
Use case: Understand end-to-end delivery speed
Repository Analytics
Location: Project → Analytics → Repository
Metrics:
- Commits per day/week
- Contributors
- Programming languages
- Code coverage trends
Correlation: High commit frequency + low pipeline success = investigate quality
Cost Monitoring
Compute Minutes Tracking
Location: Group → Settings → Usage Quotas → Pipelines
View:
- Current month usage
- Usage by project
- Minute multipliers (runner sizes)
- Quota limits
- Historical trends
Example:
Total usage: 38,542 / 50,000 minutes (77%)
Top consumers:
- project-a: 12,450 minutes (32%)
- project-b: 8,920 minutes (23%)
- project-c: 5,670 minutes (15%)
Actions:
- Projects using >20% of quota → investigate
- Usage trending above quota → optimize or purchase more minutes
- Sudden spikes → check for pipeline loops or inefficiencies
Source: GitLab Compute Minutes
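The ">20% of quota" rule can be applied mechanically to a usage export. This is a rough sketch under stated assumptions: input is `project,minutes` CSV on stdin, the quota is passed as the first argument, and the function name `flag_heavy_consumers` is hypothetical. The sample figures reuse the example usage above with a 50,000-minute quota.

```shell
#!/bin/bash
# Print projects whose usage exceeds a percentage of the group quota.
# Usage: flag_heavy_consumers QUOTA_MINUTES [THRESHOLD_PERCENT]  (default 20)
flag_heavy_consumers() {
  local quota="$1" threshold_pct="${2:-20}"
  awk -F, -v quota="$quota" -v limit="$threshold_pct" '
    {
      pct = 100 * $2 / quota
      if (pct > limit) printf "%s,%.1f%%\n", $1, pct
    }'
}

# Example: usage figures measured against a 50,000-minute quota
printf 'project-a,12450\nproject-b,8920\nproject-c,5670\n' | flag_heavy_consumers 50000
```

Note that the percentages here are shares of the quota, not shares of the minutes already used, so only project-a (24.9% of quota) is flagged.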
Cost Per Project
API Query (get usage per project):

    #!/bin/bash
    GROUP_ID="12345"
    TOKEN="your-gitlab-token"

    # Get all projects in the group (first 100; paginate for larger groups)
    PROJECTS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
      "https://gitlab.com/api/v4/groups/$GROUP_ID/projects?per_page=100" | \
      jq -r '.[] | "\(.id):\(.path_with_namespace)"')

    echo "Project,CI Minutes"

    # Get CI minute usage per project, highest first
    for project in $PROJECTS; do
      PROJECT_ID=$(echo "$project" | cut -d: -f1)
      PROJECT_NAME=$(echo "$project" | cut -d: -f2)

      # statistics=true adds a `statistics` object to the project response.
      # NOTE: `ci_runner_seconds` is an assumed field name; confirm that your
      # GitLab version returns it, otherwise every project reports 0.
      STATS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
        "https://gitlab.com/api/v4/projects/$PROJECT_ID?statistics=true")

      CI_SECONDS=$(echo "$STATS" | jq '.statistics.ci_runner_seconds // 0')
      CI_MINUTES=$(echo "scale=2; $CI_SECONDS / 60" | bc)

      echo "$PROJECT_NAME,$CI_MINUTES"
    done | sort -t, -k2 -n -r

Output:
Project,CI Minutes
my-group/project-a,12450.25
my-group/project-b,8920.50
my-group/project-c,5670.75
...
Schedule: Run weekly, track trends
Cost Attribution
Add labels to track costs:

    .cost-tracking:
      variables:
        COST_CENTER: "team-alpha"
        PROJECT_CODE: "product-x"
        ENVIRONMENT: "production"

    build:
      extends: .cost-tracking
      script:
        - npm run build

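Extract from logs:

    # Parse pipeline jobs for cost attribution.
    # NOTE: the jobs API may not return job variables; "unknown" is printed
    # when `.variables.COST_CENTER` is absent from the response.
    glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
      jq -r '.[] | "\(.name),\(.duration),\(.variables.COST_CENTER // "unknown")"'
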
Budget Alerts
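Scheduled pipeline to check budget:

    # .gitlab-ci.yml in monitoring project
    stages:
      - monitor

    check-budget:
      stage: monitor
      only:
        - schedules
      script:
        - |
          # Get current usage.
          # NOTE: $CI_JOB_TOKEN is not accepted as a PRIVATE-TOKEN, so this reads
          # an access token from a CI/CD variable ($GITLAB_API_TOKEN is a
          # placeholder name). The usage_quotas endpoint and its
          # `.ci_minutes.used` field are assumptions; verify both on your instance.
          USAGE=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
            "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas" | \
            jq '.ci_minutes.used')

          BUDGET=50000
          THRESHOLD_PERCENT=80  # shell arithmetic is integer-only, so use a percentage

          if [ "$USAGE" -gt "$((BUDGET * THRESHOLD_PERCENT / 100))" ]; then
            echo "WARNING: CI minute budget at $((100 * USAGE / BUDGET))%"

            # Send alert (Slack, email, etc.)
            curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
              -d "{\"text\":\"CI budget alert: $USAGE / $BUDGET minutes used\"}"
          fi
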
Schedule: Daily at 9 AM
Minute Multipliers
Track runner sizes to optimize costs:
| Runner Size | vCPU | RAM | Minute Multiplier |
|---|---|---|---|
| Small | 1 | 2GB | 1x |
| Medium | 2 | 4GB | 2x |
| Large | 4 | 8GB | 4x |
| X-Large | 8 | 16GB | 8x |
Cost analysis:

    # Get jobs by runner size.
    # NOTE: `.runner.size` is an assumed field; map runner tags or descriptions
    # to sizes if your jobs API response does not include it.
    glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
      jq -r '.[] | "\(.name),\(.duration),\(.runner.size)"' | \
      awk -F, '{
        minutes = $2 / 60
        if ($3 == "small") mult = 1
        else if ($3 == "medium") mult = 2
        else if ($3 == "large") mult = 4
        else mult = 8
        cost = minutes * mult
        print $1 ": " cost " CI minutes"
      }'

Output:
build: 40 CI minutes (10 min × 4x large runner)
test: 20 CI minutes (10 min × 2x medium runner)
deploy: 5 CI minutes (5 min × 1x small runner)
Total: 65 CI minutes
Optimization: Use smallest runner that meets performance needs
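Because billed minutes are wall-clock minutes times the multiplier, a larger runner is only cheaper when its speedup exceeds its multiplier. The sketch below illustrates that trade-off; the input format (`size,minutes,multiplier`), the function name `cheapest_runner`, and the timings are all assumptions for illustration, and in practice the timings would come from measuring the same job on each runner size.

```shell
#!/bin/bash
# Pick the runner size with the lowest billed compute minutes.
# Input: CSV lines of "size,wall_clock_minutes,multiplier" on stdin.
cheapest_runner() {
  awk -F, '
    {
      cost = $2 * $3                      # billed minutes for this size
      if (NR == 1 || cost < best) { best = cost; name = $1 }
    }
    END { if (NR > 0) printf "%s,%g\n", name, best }'
}

# Example: hypothetical timings for one job measured on three sizes
printf 'small,10,1\nmedium,6,2\nlarge,2,4\n' | cheapest_runner
```

With these made-up timings the job bills 10 minutes on small, 12 on medium, and 8 on large, so the large runner wins; with a more modest speedup the small runner would.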
Performance Metrics
Pipeline Duration Tracking
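API Query:

    # Get the last 100 successful pipelines with their durations.
    # The list endpoint omits `duration`, so each pipeline is fetched by ID.
    glab api "/projects/$PROJECT_ID/pipelines?status=success&per_page=100" | \
      jq -r '.[].id' | \
      while read -r pipeline_id; do
        glab api "/projects/$PROJECT_ID/pipelines/$pipeline_id" | \
          jq -r '"\(.created_at),\((.duration // 0) / 60) minutes"'
      done
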
Track over time:
2026-01-01,12.5 minutes
2026-01-02,11.8 minutes
2026-01-03,15.2 minutes  ← Spike - investigate
2026-01-04,12.1 minutes
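Spotting a spike like the one on 2026-01-03 can be automated by comparing each value with the average of the values before it. This is one possible sketch, not the only way to do it: the function name `detect_spikes` and the 20% default threshold are assumptions, and the input is expected in the `date,N minutes` format shown above.

```shell
#!/bin/bash
# Flag entries that exceed the running average of all earlier entries
# by more than THRESHOLD_PERCENT (default 20).
# Input: lines of "date,N minutes" on stdin.
detect_spikes() {
  local threshold_pct="${1:-20}"
  awk -F, -v limit="$threshold_pct" '
    { value = $2 + 0 }                       # "12.5 minutes" -> 12.5
    NR > 1 && value > (sum / (NR - 1)) * (1 + limit / 100) {
      printf "%s,%s\n", $1, value            # date and offending duration
    }
    { sum += value }                         # update after the comparison
  '
}

# Example: the four data points listed above
printf '2026-01-01,12.5 minutes\n2026-01-02,11.8 minutes\n2026-01-03,15.2 minutes\n2026-01-04,12.1 minutes\n' \
  | detect_spikes
```

For the sample data this prints only `2026-01-03,15.2`, since 15.2 is about 25% above the 12.15-minute average of the two preceding runs.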
Job-Level Performance
Identify slowest jobs:

    # Get the 10 slowest jobs in a pipeline, sorted by duration (seconds)
    glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
      jq -r '.[] | "\(.name),\(.duration),\(.status)"' | \
      sort -t, -k2 -n -r | \
      head -10

Output (top 10 slowest jobs):
e2e-tests,1820,success
integration-tests,1205,success
build-docker,890,success
security-scan,650,success
unit-tests,320,success
...
Actions:
- e2e-tests (30 min): Parallelize with parallel: 10
- build-docker (15 min): Enable Docker layer caching
- security-scan (11 min): Run only on MRs, not every commit
Success Rate Monitoring
Track pipeline success rate:

    # Count the last 100 pipelines by status
    glab api "/projects/$PROJECT_ID/pipelines?per_page=100" | \
      jq '[.[] | .status] | group_by(.) | map({status: .[0], count: length})' | \
      jq -r '.[] | "\(.status): \(.count)"'

Output:
success: 78
failed: 15
canceled: 7
Success rate: 78%
Target: >90% success rate
If below target:
- Review failure logs
- Check for flaky tests
- Improve validation (see validation.md)
- Add retry logic for transient failures
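The success rate and the target check can be computed directly from the `status: count` lines shown in the output above. A minimal sketch, assuming that output format on stdin; the function name `success_rate` is hypothetical, and canceled pipelines are counted in the denominator, as in the 78% figure above.

```shell
#!/bin/bash
# Compute the success rate from "status: count" lines and compare it
# with a target percentage (default 90).
success_rate() {
  local target="${1:-90}"
  awk -F': ' -v target="$target" '
    { total += $2; if ($1 == "success") ok = $2 }
    END {
      if (total == 0) exit                   # no pipelines, nothing to report
      rate = 100 * ok / total
      printf "%.0f%% %s\n", rate, (rate > target ? "meets target" : "below target")
    }'
}

# Example: the status counts shown above
printf 'success: 78\nfailed: 15\ncanceled: 7\n' | success_rate 90
```

For the sample counts this prints `78% below target`.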
Job Failure Patterns
Identify most common failures:

    # Count failed jobs by name across the last 50 failed pipelines
    glab api "/projects/$PROJECT_ID/pipelines?status=failed&per_page=50" | \
      jq -r '.[] | .id' | \
      while read -r pipeline_id; do
        glab api "/projects/$PROJECT_ID/pipelines/$pipeline_id/jobs" | \
          jq -r '.[] | select(.status == "failed") | .name'
      done | \
      sort | uniq -c | sort -n -r

Output:
45 test
12 build-docker
8 deploy-staging
3 lint
Action: test failing 45 times → investigate test flakiness
Cache Hit Rate
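Track cache effectiveness:

    test:
      cache:
        key: npm-cache
        paths:
          - .npm/
      script:
        # npm ci deletes node_modules/, so cache npm's download cache instead
        # and time the install to see how much the cache helps
        - INSTALL_START=$(date +%s)
        - npm ci --cache .npm --prefer-offline
        - INSTALL_END=$(date +%s)
        - echo "Install took $((INSTALL_END - INSTALL_START)) seconds"
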
Expected:
- Cold cache: 120 seconds
- Warm cache: 5 seconds
- Cache hit rate: >80%
If low hit rate:
- Check cache key strategy
- Verify runner tags are consistent
- Review cache size limits
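GitLab does not report a cache hit rate directly, so one rough way to estimate it is to classify timed installs as warm or cold. The sketch below is a heuristic under stated assumptions: input is one install duration in seconds per line (for example, collected from job logs), the 30-second cutoff between warm and cold is an arbitrary illustrative value, and the function name `cache_hit_rate` is hypothetical.

```shell
#!/bin/bash
# Estimate cache hit rate from install durations.
# Input: one duration in seconds per line on stdin.
# Usage: cache_hit_rate [CUTOFF_SECONDS]   (runs faster than the cutoff
# are counted as cache hits; default 30, an assumed value)
cache_hit_rate() {
  local cutoff="${1:-30}"
  awk -v cutoff="$cutoff" '
    { runs++; if ($1 < cutoff) hits++ }
    END { if (runs > 0) printf "%.0f%%\n", 100 * hits / runs }'
}

# Example: five warm runs and one cold run (made-up timings)
printf '5\n6\n120\n4\n5\n7\n' | cache_hit_rate
```

Five of the six sample runs fall under the cutoff, giving an estimated hit rate of 83%.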
DORA Metrics
What are DORA Metrics?
Four key metrics for DevOps performance:
- Deployment Frequency: How often you deploy
- Lead Time for Changes: Time from commit to production
- Mean Time to Recovery (MTTR): Time to recover from failure
- Change Failure Rate: % of deployments causing failures
Source: DORA DevOps Research
Deployment Frequency
Track deployments to production:

    # Count deployments per day
    glab api "/projects/$PROJECT_ID/deployments?environment=production" | \
      jq -r '.[] | .created_at' | \
      cut -d T -f1 | \
      sort | uniq -c

Output:
3 2026-01-05
5 2026-01-06
2 2026-01-07
4 2026-01-08
Benchmarks:
- Elite: Multiple per day
- High: Once per day to once per week
- Medium: Once per week to once per month
- Low: Less than once per month
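The benchmark bands above can be turned into a simple classifier. This is an illustrative sketch only: the function name `deployment_tier` is hypothetical, a month is approximated as 30 days, and the boundaries follow the list above rather than any official DORA tooling.

```shell
#!/bin/bash
# Classify deployment frequency against the benchmark bands above.
# Usage: deployment_tier DEPLOYMENT_COUNT PERIOD_DAYS
deployment_tier() {
  awk -v n="$1" -v days="$2" 'BEGIN {
    if (days <= 0) { print "invalid period" > "/dev/stderr"; exit 1 }
    rate = n / days                          # deployments per day
    if (rate > 1) tier = "Elite"             # multiple per day
    else if (rate >= 1 / 7) tier = "High"    # daily to weekly
    else if (rate >= 1 / 30) tier = "Medium" # weekly to monthly (30-day month assumed)
    else tier = "Low"
    print tier
  }'
}

# Example: 14 deployments over the 4 days in the output above
deployment_tier 14 4
```

The sample output above (3 + 5 + 2 + 4 = 14 deployments in 4 days) averages 3.5 per day, which lands in the Elite band.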
Lead Time for Changes
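Measure commit-to-production time:

    # Record commit time (CI_COMMIT_TIMESTAMP is ISO 8601; `date -d` needs GNU date)
    commit-timestamp:
      stage: .pre
      script:
        - echo "COMMIT_TIME=$(date -d "$CI_COMMIT_TIMESTAMP" +%s)" >> metrics.env
      artifacts:
        reports:
          dotenv: metrics.env

    # Record deploy time and compute lead time
    deploy:
      stage: deploy
      script:
        - DEPLOY_TIME=$(date +%s)
        - LEAD_TIME=$((DEPLOY_TIME - COMMIT_TIME))
        # Quoted so that the ": " inside the message is not parsed as a YAML mapping
        - 'echo "Lead time: $((LEAD_TIME / 3600)) hours"'
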
Benchmarks:
- Elite: < 1 hour
- High: 1 day to 1 week
- Medium: 1 week to 1 month
- Low: > 1 month
Mean Time to Recovery
Track incident resolution time:

    # Find failed production deployments
    FAILED=$(glab api "/projects/$PROJECT_ID/deployments?status=failed&environment=production")

    # Successful production deployments, oldest first
    SUCCEEDED=$(glab api "/projects/$PROJECT_ID/deployments?status=success&environment=production&order_by=created_at&sort=asc")

    # For each failure, find the next successful deployment
    for deploy in $(echo "$FAILED" | jq -r '.[] | .id'); do
      FAILED_TIME=$(echo "$FAILED" | jq -r ".[] | select(.id == $deploy) | .created_at")
      NEXT_SUCCESS=$(echo "$SUCCEEDED" | \
        jq -r ".[] | select(.created_at > \"$FAILED_TIME\") | .created_at" | head -1)

      # Skip failures that have not been followed by a successful deployment
      [ -z "$NEXT_SUCCESS" ] && continue

      # `date -d` requires GNU date
      RECOVERY_TIME=$(( $(date -d "$NEXT_SUCCESS" +%s) - $(date -d "$FAILED_TIME" +%s) ))
      echo "Recovery time: $((RECOVERY_TIME / 3600)) hours"
    done

Benchmarks:
- Elite: < 1 hour
- High: < 1 day
- Medium: 1 day to 1 week
- Low: > 1 week
Change Failure Rate
Calculate % of deployments that fail:

    TOTAL=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&per_page=100" | jq 'length')
    FAILED=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&status=failed&per_page=100" | jq 'length')

    # Guard against division by zero when there are no deployments yet
    if [ "$TOTAL" -gt 0 ]; then
      FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)
      echo "Change failure rate: $FAILURE_RATE%"
    fi

Benchmarks:
- Elite: 0-15%
- High: 16-30%
- Medium: 31-45%
- Low: > 45%
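To compare a measured failure rate against these bands, a small classifier is enough. The following is a sketch under assumptions: the function name `cfr_tier` is hypothetical, and fractional rates between bands (for example 15.5%) are assigned to the next band up, which the list above does not specify.

```shell
#!/bin/bash
# Classify change failure rate against the benchmark bands above.
# Usage: cfr_tier FAILED_DEPLOYMENTS TOTAL_DEPLOYMENTS
cfr_tier() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "no deployments"; exit }
    rate = 100 * failed / total
    if (rate <= 15) tier = "Elite"
    else if (rate <= 30) tier = "High"
    else if (rate <= 45) tier = "Medium"
    else tier = "Low"
    printf "%.1f%% %s\n", rate, tier
  }'
}

# Example: 3 failed deployments out of 40 (made-up counts)
cfr_tier 3 40
```

With 3 failures in 40 deployments this prints `7.5% Elite`.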
DORA Dashboard
Aggregate all metrics:

    dora-metrics:
      stage: report
      only:
        - schedules
      script:
        - |
          cat > dora-report.json <<EOF
          {
            "deployment_frequency": "$(./calculate-deployment-frequency.sh)",
            "lead_time_hours": "$(./calculate-lead-time.sh)",
            "mttr_hours": "$(./calculate-mttr.sh)",
            "change_failure_rate": "$(./calculate-failure-rate.sh)"
          }
          EOF
        - cat dora-report.json
        - ./send-to-dashboard.sh dora-report.json
      artifacts:
        # Metrics reports expect OpenMetrics text, so the JSON file is
        # published as a regular artifact instead
        paths:
          - dora-report.json

Schedule: Daily
Custom Dashboards
Prometheus + Grafana Integration
Export CI metrics to Prometheus:

    export-metrics:
      stage: .post
      only:
        - schedules
      script:
        - |
          # Pipeline duration so far, derived from the predefined
          # CI_PIPELINE_CREATED_AT variable (`date -d` requires GNU date)
          DURATION=$(( $(date +%s) - $(date -d "$CI_PIPELINE_CREATED_AT" +%s) ))

          # Export pipeline duration
          echo "gitlab_pipeline_duration_seconds{project=\"$CI_PROJECT_NAME\"} $DURATION" | \
            curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci

          # Export success marker (this job only runs when earlier stages succeed)
          echo "gitlab_pipeline_success_total{project=\"$CI_PROJECT_NAME\"} 1" | \
            curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci

Grafana Dashboard queries:

    # Average pipeline duration
    avg(gitlab_pipeline_duration_seconds) by (project)

    # Success rate (last 24h)
    rate(gitlab_pipeline_success_total[24h]) / rate(gitlab_pipeline_total[24h])

    # CI minute consumption (estimated)
    sum(gitlab_job_duration_seconds * gitlab_job_runner_multiplier) by (project) / 60

Custom Analytics Script
Weekly report generation:

    #!/bin/bash
    GROUP_ID="12345"
    START_DATE=$(date -d '7 days ago' +%Y-%m-%d)
    END_DATE=$(date +%Y-%m-%d)

    echo "CI/CD Weekly Report: $START_DATE to $END_DATE"
    echo "================================================"

    # Get all projects
    PROJECTS=$(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id')

    for project in $PROJECTS; do
      PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

      # Get pipelines in date range
      PIPELINES=$(glab api "/projects/$project/pipelines?updated_after=$START_DATE&updated_before=$END_DATE&per_page=100")

      TOTAL=$(echo "$PIPELINES" | jq 'length')
      SUCCESS=$(echo "$PIPELINES" | jq '[.[] | select(.status == "success")] | length')
      FAILED=$(echo "$PIPELINES" | jq '[.[] | select(.status == "failed")] | length')

      if [ "$TOTAL" -gt 0 ]; then
        SUCCESS_RATE=$(echo "scale=2; 100 * $SUCCESS / $TOTAL" | bc)

        # The list endpoint omits `duration`, so fetch each pipeline and
        # average the durations (in minutes) of those that have one
        AVG_DURATION=$(echo "$PIPELINES" | jq -r '.[].id' | while read -r id; do
          glab api "/projects/$project/pipelines/$id" | jq '.duration // empty'
        done | jq -s 'if length > 0 then add / length / 60 else 0 end')

        echo ""
        echo "Project: $PROJECT_NAME"
        echo "  Pipelines: $TOTAL"
        echo "  Success rate: $SUCCESS_RATE%"
        echo "  Average duration: $AVG_DURATION minutes"

        if [ "$(echo "$SUCCESS_RATE < 90" | bc)" -eq 1 ]; then
          echo "  WARNING: Success rate below target (90%)"
        fi

        if [ "$(echo "$AVG_DURATION > 15" | bc)" -eq 1 ]; then
          echo "  WARNING: Average duration above target (15 min)"
        fi
      fi
    done

Schedule: Run every Monday, send to team Slack
Alerting
Pipeline Failure Alerts
Slack notification on failure:

    notify-failure:
      stage: .post
      when: on_failure
      script:
        - |
          curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
            -H 'Content-Type: application/json' \
            -d "{
              \"text\": \"Pipeline failed: $CI_PROJECT_NAME\",
              \"attachments\": [{
                \"color\": \"danger\",
                \"fields\": [
                  {\"title\": \"Branch\", \"value\": \"$CI_COMMIT_REF_NAME\", \"short\": true},
                  {\"title\": \"Commit\", \"value\": \"$CI_COMMIT_SHORT_SHA\", \"short\": true},
                  {\"title\": \"Author\", \"value\": \"$GITLAB_USER_NAME\", \"short\": true},
                  {\"title\": \"Pipeline\", \"value\": \"$CI_PIPELINE_URL\", \"short\": false}
                ]
              }]
            }"

Budget Threshold Alerts
Alert when approaching quota:

    check-budget-threshold:
      stage: monitor
      only:
        - schedules
      script:
        - |
          # NOTE: $CI_JOB_TOKEN is not accepted as a PRIVATE-TOKEN, so this reads
          # an access token from a CI/CD variable ($GITLAB_API_TOKEN is a
          # placeholder name). The usage_quotas endpoint and its fields are
          # assumptions; verify them on your instance.
          QUOTAS=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
            "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas")
          USAGE=$(echo "$QUOTAS" | jq '.ci_minutes.used')
          QUOTA=$(echo "$QUOTAS" | jq '.ci_minutes.limit')

          PERCENT=$(echo "scale=2; 100 * $USAGE / $QUOTA" | bc)

          if [ "$(echo "$PERCENT > 80" | bc)" -eq 1 ]; then
            echo "WARNING: CI minute usage at $PERCENT%"
            # Send alert
          fi

Performance Degradation Alerts
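Alert on pipeline slowdown:

    check-performance:
      stage: monitor
      only:
        - schedules
      script:
        - |
          # The pipeline list endpoint omits `duration`, so fetch each
          # pipeline by ID and average the durations that are present
          avg_duration() {
            while read -r id; do
              glab api "/projects/$CI_PROJECT_ID/pipelines/$id" | jq '.duration // empty'
            done | jq -s 'if length > 0 then add / length else 0 end'
          }

          # Last 20 pipeline IDs, newest first
          IDS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=20" | jq -r '.[].id')

          # Average of the 10 most recent vs. the 10 before them
          RECENT=$(echo "$IDS" | head -n 10 | avg_duration)
          PREVIOUS=$(echo "$IDS" | tail -n +11 | avg_duration)

          if [ "$(echo "$PREVIOUS > 0" | bc)" -eq 1 ]; then
            INCREASE=$(echo "scale=2; 100 * ($RECENT - $PREVIOUS) / $PREVIOUS" | bc)

            if [ "$(echo "$INCREASE > 20" | bc)" -eq 1 ]; then
              echo "WARNING: Pipeline duration increased by $INCREASE%"
              # Send alert
            fi
          fi
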
Flaky Test Detection
Alert on tests that fail intermittently:

    detect-flaky-tests:
      stage: monitor
      only:
        - schedules
      script:
        - |
          # Collect the status of the "test" job in the last 50 pipelines
          TESTS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=50" | \
            jq -r '.[] | .id' | \
            while read -r pipeline; do
              glab api "/projects/$CI_PROJECT_ID/pipelines/$pipeline/jobs" | \
                jq -r '.[] | select(.name == "test") | .status'
            done)

          # grep -c exits non-zero when the count is 0; || true keeps the job running
          SUCCESS=$(echo "$TESTS" | grep -c success || true)
          FAILED=$(echo "$TESTS" | grep -c failed || true)
          TOTAL=$((SUCCESS + FAILED))

          # A job that both passes and fails is a flakiness candidate
          if [ "$FAILED" -gt 0 ] && [ "$SUCCESS" -gt 0 ]; then
            FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)

            if [ "$(echo "$FAILURE_RATE > 10 && $FAILURE_RATE < 90" | bc)" -eq 1 ]; then
              echo "WARNING: Flaky test detected: $FAILURE_RATE% failure rate"
              # Send alert
            fi
          fi

Audit and Compliance
Pipeline Configuration Audit
Check for required settings:

    #!/bin/bash
    # Audit all projects for compliance
    GROUP_ID="12345"

    for project in $(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id'); do
      # File paths in the repository files API must be URL-encoded;
      # ref=main assumes the default branch is named main
      CI_CONFIG=$(glab api "/projects/$project/repository/files/%2Egitlab-ci%2Eyml/raw?ref=main" 2>/dev/null)
      PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

      echo "Auditing: $PROJECT_NAME"

      # Check for security scans
      if ! echo "$CI_CONFIG" | grep -q "security-scan\|SAST"; then
        echo "  WARNING: Missing security scans"
      fi

      # Check for test coverage
      if ! echo "$CI_CONFIG" | grep -q "coverage"; then
        echo "  WARNING: No coverage reporting"
      fi

      # Check for caching
      if ! echo "$CI_CONFIG" | grep -q "cache:"; then
        echo "  WARNING: No caching configured"
      fi

      # Check for interruptible jobs
      if ! echo "$CI_CONFIG" | grep -q "interruptible"; then
        echo "  WARNING: No interruptible jobs (cost optimization)"
      fi
    done

Schedule: Weekly compliance check
Change Log Tracking
Track pipeline config changes:

    # Get commits that modified .gitlab-ci.yml
    glab api "/projects/$PROJECT_ID/repository/commits?path=.gitlab-ci.yml&per_page=50" | \
      jq -r '.[] | "\(.created_at) | \(.author_name) | \(.title)"'

Output:
2026-01-08 | John Doe | ci: add Docker caching
2026-01-05 | Jane Smith | ci: parallelize tests
2026-01-02 | John Doe | ci: update security scans
Use case: Correlate pipeline changes with performance/cost trends
Access Audit
Track who can modify pipelines:

    # Get project members with Maintainer/Owner access.
    # /members/all also includes members inherited from parent groups.
    glab api "/projects/$PROJECT_ID/members/all" | \
      jq -r '.[] | select(.access_level >= 40) | "\(.name) - \(.access_level)"'

Access levels:
- 50: Owner
- 40: Maintainer (can edit .gitlab-ci.yml)
- 30: Developer (can trigger pipelines)
- 20: Reporter (view only)
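Audit output is easier to review with role names instead of numeric levels. The sketch below maps the four levels listed above onto `name - level` lines like those the members query prints; the function names `role_name` and `annotate_members` and the sample members are hypothetical.

```shell
#!/bin/bash
# Translate a numeric access level into the role names listed above.
role_name() {
  case "$1" in
    50) echo "Owner" ;;
    40) echo "Maintainer" ;;
    30) echo "Developer" ;;
    20) echo "Reporter" ;;
    *)  echo "Other ($1)" ;;   # levels not covered by the list above
  esac
}

# Rewrite "name - level" lines on stdin as "name - role".
annotate_members() {
  while IFS= read -r line; do
    level="${line##* - }"      # text after the last " - "
    name="${line% - *}"        # text before the last " - "
    echo "$name - $(role_name "$level")"
  done
}

# Example with made-up members
printf 'Jane Smith - 50\nJohn Doe - 40\n' | annotate_members
```

For the sample input this prints `Jane Smith - Owner` and `John Doe - Maintainer`.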
Summary Checklist
Essential Monitoring
- Track CI minute usage weekly
- Monitor pipeline success rate (target: >90%)
- Identify slowest jobs and optimize
- Set up budget alerts (80% threshold)
- Review top CI minute consumers monthly
Performance Tracking
- Monitor average pipeline duration
- Track cache hit rates
- Identify and fix flaky tests
- Measure critical path of pipelines
- Set performance SLOs
DORA Metrics
- Track deployment frequency
- Measure lead time for changes
- Calculate MTTR
- Monitor change failure rate
- Set targets based on benchmarks
Compliance and Governance
- Audit pipeline configs weekly
- Track security scan coverage
- Review access controls
- Document pipeline changes
- Enforce required scans via policy
Additional Resources
- Cost Optimization - Reduce CI minute usage
- Pipeline Efficiency - Performance optimization
- Multi-Project Management - Monitoring at scale
- GitLab CI/CD Analytics Docs
- DORA Metrics
Last Updated: 2026-01-08
Priority: HIGH - Essential for cost control and performance optimization