GitLab CI/CD Monitoring and Analytics

Track pipeline performance, costs, and health across all projects.

Built-in Analytics

CI/CD Analytics Overview

Location: Project → Analytics → CI/CD Analytics
Group-level: Group → Analytics → CI/CD Analytics

Available metrics:

  • Pipeline success rate
  • Pipeline duration trends
  • Job duration breakdown
  • Failure patterns
  • Coverage trends

Source: GitLab CI/CD Analytics

Pipeline Charts

Visualizations:

  1. Pipeline status: Success/failed/canceled over time
  2. Pipeline duration: Average duration by branch
  3. Job breakdown: Time spent per job
  4. Stage duration: Bottleneck identification

How to use:

1. Navigate to Analytics → CI/CD Analytics
2. Select time range (7 days, 30 days, 90 days)
3. Filter by branch (main, development, all)
4. Identify trends and anomalies

Example insights:

  • "Test stage taking 70% of pipeline time → parallelize tests"
  • "Success rate dropped from 95% to 70% last week → investigate failures"
  • "Pipeline duration doubled after dependency update → check caching"
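
The first insight above depends on knowing each stage's share of total pipeline time. A minimal sketch of that calculation, assuming job data has already been exported as `stage,duration_seconds` lines; the function name `stage_share`, the input layout, and the sample numbers are illustrative, not something GitLab provides:

```shell
#!/bin/bash
# stage_share: read "stage,duration_seconds" lines on stdin and print
# "stage,percent_of_total", largest share first.
# The input layout is an assumption made for this sketch.
stage_share() {
  LC_ALL=C awk -F, '
    { total += $2; by_stage[$1] += $2 }
    END {
      if (total == 0) exit
      for (s in by_stage)
        printf "%s,%.1f\n", s, 100 * by_stage[s] / total
    }' | sort -t, -k2,2 -n -r
}

# Sample data (made up for illustration)
printf '%s\n' "build,120" "test,700" "test,350" "deploy,330" | stage_share
# prints:
#   test,70.0
#   deploy,22.0
#   build,8.0
```

A stage that dominates the list is the first candidate for parallelization or caching.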

Value Stream Analytics

Location: Group → Analytics → Value Stream

Tracks:

  • Issue → Code → Test → Production (full cycle)
  • Time in each stage
  • Bottlenecks in delivery flow
  • Lead time for changes

Use case: Understand end-to-end delivery speed

Repository Analytics

Location: Project → Analytics → Repository

Metrics:

  • Commits per day/week
  • Contributors
  • Programming languages
  • Code coverage trends

Correlation: High commit frequency + low pipeline success = investigate quality

Cost Monitoring

Compute Minutes Tracking

Location: Group → Settings → Usage Quotas → Pipelines

View:

  • Current month usage
  • Usage by project
  • Minute multipliers (runner sizes)
  • Quota limits
  • Historical trends

Example:

Total usage: 38,542 / 50,000 minutes (77%)
Top consumers:
  - project-a: 12,450 minutes (32%)
  - project-b: 8,920 minutes (23%)
  - project-c: 5,670 minutes (15%)

Actions:

  • Projects using >20% of quota → investigate
  • Usage trending above quota → optimize or purchase more minutes
  • Sudden spikes → check for pipeline loops or inefficiencies
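
The 20% rule above is easy to automate once per-project usage is available. A minimal sketch, assuming usage has been exported as `project,minutes` lines; the function name `flag_heavy_projects` and the input layout are hypothetical, and the sample figures reuse the example above:

```shell
#!/bin/bash
# flag_heavy_projects TOTAL [THRESHOLD_PERCENT]
# Reads "project,minutes" lines on stdin and prints the projects whose
# share of TOTAL exceeds the threshold (default 20%).
flag_heavy_projects() {
  local total="$1" threshold="${2:-20}"
  LC_ALL=C awk -F, -v total="$total" -v threshold="$threshold" '
    total > 0 {
      share = 100 * $2 / total
      if (share > threshold)
        printf "%s: %d minutes (%.0f%%) -> investigate\n", $1, $2, share
    }'
}

# Figures from the usage example above
printf '%s\n' "project-a,12450" "project-b,8920" "project-c,5670" |
  flag_heavy_projects 38542
# prints:
#   project-a: 12450 minutes (32%) -> investigate
#   project-b: 8920 minutes (23%) -> investigate
```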

Source: GitLab Compute Minutes

Cost Per Project

API Query (get usage per project):

#!/bin/bash
GROUP_ID="12345"
TOKEN="your-gitlab-token"

# Get all projects in group
PROJECTS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.com/api/v4/groups/$GROUP_ID/projects?per_page=100" | \
  jq -r '.[] | "\(.id):\(.path_with_namespace)"')

echo "Project,CI Minutes"

# Get CI minute usage per project
for project in $PROJECTS; do
  PROJECT_ID=$(echo "$project" | cut -d: -f1)
  PROJECT_NAME=$(echo "$project" | cut -d: -f2)

  STATS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "https://gitlab.com/api/v4/projects/$PROJECT_ID/statistics")

  CI_SECONDS=$(echo "$STATS" | jq '.statistics.ci_runner_seconds // 0')
  CI_MINUTES=$(echo "scale=2; $CI_SECONDS / 60" | bc)

  echo "$PROJECT_NAME,$CI_MINUTES"
done | sort -t, -k2 -n -r

Output:

Project,CI Minutes
my-group/project-a,12450.25
my-group/project-b,8920.50
my-group/project-c,5670.75
...

Schedule: Run weekly, track trends

Cost Attribution

Add labels to track costs:

.cost-tracking:
  variables:
    COST_CENTER: "team-alpha"
    PROJECT_CODE: "product-x"
    ENVIRONMENT: "production"

build:
  extends: .cost-tracking
  script: npm run build

Extract from logs:

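# Parse pipeline logs for cost attribution
# Assumes the job payload exposes variables; if it does not,
# .variables.COST_CENTER prints as "null"
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.variables.COST_CENTER)"'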

Budget Alerts

Scheduled pipeline to check budget:

# .gitlab-ci.yml in monitoring project
stages:
  - monitor

check-budget:
  stage: monitor
  only:
    - schedules
  script:
    - |
      # Get current usage
      USAGE=$(curl -s --header "PRIVATE-TOKEN: $CI_JOB_TOKEN" \
        "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas" | \
        jq '.ci_minutes.used')

      BUDGET=50000
      THRESHOLD_PERCENT=80  # integer percent; shell arithmetic has no floats

      if [ "$USAGE" -gt "$((BUDGET * THRESHOLD_PERCENT / 100))" ]; then
        echo "WARNING: CI minute budget at $((100 * USAGE / BUDGET))%"

        # Send alert (Slack, email, etc.)
        curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
          -d "{\"text\":\"CI budget alert: $USAGE / $BUDGET minutes used\"}"
      fi

Schedule: Daily at 9 AM

Minute Multipliers

Track runner sizes to optimize costs:

Runner Size   vCPU   RAM    Minute Multiplier
Small         1      2GB    1x
Medium        2      4GB    2x
Large         4      8GB    4x
X-Large       8      16GB   8x

Cost analysis:

# Get jobs by runner size
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.runner.size)"' | \
  awk -F, '{
    minutes = $2 / 60
    if ($3 == "small") mult = 1
    else if ($3 == "medium") mult = 2
    else if ($3 == "large") mult = 4
    else mult = 8
    cost = minutes * mult
    print $1 ": " cost " CI minutes"
  }'

Output:

build: 40 CI minutes (10 min × 4x large runner)
test: 20 CI minutes (10 min × 2x medium runner)
deploy: 5 CI minutes (5 min × 1x small runner)
Total: 65 CI minutes

Optimization: Use smallest runner that meets performance needs

Performance Metrics

Pipeline Duration Tracking

API Query:

# Get last 100 pipelines with durations
glab api "/projects/$PROJECT_ID/pipelines?per_page=100" | \
  jq -r '.[] | "\(.created_at),\(.duration),\(.status)"' | \
  awk -F, '{
    if ($3 == "success") {
      print $1 "," $2/60 " minutes"
    }
  }'

Track over time:

2026-01-01,12.5 minutes
2026-01-02,11.8 minutes
2026-01-03,15.2 minutes  ← Spike - investigate
2026-01-04,12.1 minutes
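
Spotting a spike like the 2026-01-03 entry by eye does not scale. One simple approach, sketched here under assumptions, is to flag any day whose duration exceeds the series mean by more than a chosen percentage. The function name `flag_spikes`, the `date,minutes` input layout, and the 15% default are illustrative choices, not GitLab features:

```shell
#!/bin/bash
# flag_spikes [PERCENT]
# Reads "date,minutes" lines on stdin and prints the entries whose
# duration is more than PERCENT (default 15) above the series mean.
flag_spikes() {
  local pct="${1:-15}"
  LC_ALL=C awk -F, -v pct="$pct" '
    { day[NR] = $1; mins[NR] = $2 + 0; sum += $2 }
    END {
      if (NR == 0) exit
      limit = (sum / NR) * (1 + pct / 100)
      for (i = 1; i <= NR; i++)
        if (mins[i] > limit)
          printf "%s,%.1f minutes <- spike\n", day[i], mins[i]
    }'
}

# Durations from the series above
printf '%s\n' "2026-01-01,12.5" "2026-01-02,11.8" \
              "2026-01-03,15.2" "2026-01-04,12.1" | flag_spikes
# prints: 2026-01-03,15.2 minutes <- spike
```

A mean-based threshold is sensitive to the spike itself on short series; a median or a rolling baseline may suit noisier data better.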

Job-Level Performance

Identify slowest jobs:

# Get all jobs from recent pipelines
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.status)"' | \
  sort -t, -k2 -n -r | \
  head -10

Output (top 10 slowest jobs):

e2e-tests,1820,success
integration-tests,1205,success
build-docker,890,success
security-scan,650,success
unit-tests,320,success
...

Actions:

  • e2e-tests (30 min): Parallelize with parallel: 10
  • build-docker (15 min): Enable Docker layer caching
  • security-scan (11 min): Run only on MRs, not every commit

Success Rate Monitoring

Track pipeline success rate:

# Last 100 pipelines
glab api "/projects/$PROJECT_ID/pipelines?per_page=100" | \
  jq '[.[] | .status] | group_by(.) | map({status: .[0], count: length})' | \
  jq -r '.[] | "\(.status): \(.count)"'

Output:

success: 78
failed: 15
canceled: 7

Success rate: 78%

Target: >90% success rate

If below target:

  1. Review failure logs
  2. Check for flaky tests
  3. Improve validation (see validation.md)
  4. Add retry logic for transient failures
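
The success rate itself can be derived from the `status: count` lines that the query above prints. A minimal sketch, assuming that output format; the function name `success_rate` is hypothetical:

```shell
#!/bin/bash
# success_rate: read "status: count" lines on stdin and print the
# percentage of pipelines with status "success" (rounded).
success_rate() {
  LC_ALL=C awk -F': ' '
    { total += $2; if ($1 == "success") ok = $2 }
    END { if (total > 0) printf "%.0f\n", 100 * ok / total }'
}

# Counts from the sample output above
rate=$(printf '%s\n' "success: 78" "failed: 15" "canceled: 7" | success_rate)
echo "Success rate: ${rate}%"
if [ "$rate" -lt 90 ]; then
  echo "Below the 90% target"
fi
```

Canceled pipelines count against the rate here, matching the 78% figure above; excluding them would be an equally defensible convention.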

Job Failure Patterns

Identify most common failures:

# Get failed jobs
glab api "/projects/$PROJECT_ID/pipelines?status=failed&per_page=50" | \
  jq -r '.[] | .id' | \
  while read pipeline_id; do
    glab api "/projects/$PROJECT_ID/pipelines/$pipeline_id/jobs" | \
      jq -r '.[] | select(.status == "failed") | .name'
  done | \
  sort | uniq -c | sort -n -r

Output:

45 test
12 build-docker
8 deploy-staging
3 lint

Action: test failing 45 times → investigate test flakiness

Cache Hit Rate

Track cache effectiveness:

test:
  cache:
    key: npm-cache
    paths:
      - node_modules/
  script:
    - CACHE_START=$(date +%s)
    - npm ci
    - CACHE_END=$(date +%s)
    - echo "Cache restore took $((CACHE_END - CACHE_START)) seconds"

Expected:

  • Cold cache: 120 seconds
  • Warm cache: 5 seconds
  • Cache hit rate: >80%

If low hit rate:

  • Check cache key strategy
  • Verify runner tags are consistent
  • Review cache size limits
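
GitLab does not report a cache hit rate directly, so one rough proxy is to classify each run by its install time. The sketch below assumes the timings echoed by the job above have been collected one per line, and treats anything at or under 30 seconds as a warm-cache run. Both the function name `cache_hit_rate` and the 30-second cutoff are assumptions chosen to sit between the 5-second and 120-second figures above:

```shell
#!/bin/bash
# cache_hit_rate [WARM_MAX_SECONDS]
# Reads one install duration (seconds) per line on stdin and prints the
# percentage of runs at or below the cutoff (default 30), rounded.
cache_hit_rate() {
  local warm_max="${1:-30}"
  LC_ALL=C awk -v warm_max="$warm_max" '
    NF { n++; if ($1 + 0 <= warm_max) hits++ }
    END { if (n > 0) printf "%.0f\n", 100 * hits / n }'
}

# Sample timings (made up for illustration): four warm runs, one cold
printf '%s\n' 5 6 118 4 7 | cache_hit_rate
# prints: 80
```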

DORA Metrics

What are DORA Metrics?

Four key metrics for DevOps performance:

  1. Deployment Frequency: How often you deploy
  2. Lead Time for Changes: Time from commit to production
  3. Mean Time to Recovery (MTTR): Time to recover from failure
  4. Change Failure Rate: % of deployments causing failures

Source: DORA DevOps Research

Deployment Frequency

Track deployments to production:

# Count deployments per day
glab api "/projects/$PROJECT_ID/deployments?environment=production" | \
  jq -r '.[] | .created_at' | \
  cut -d T -f1 | \
  sort | uniq -c

Output:

3 2026-01-05
5 2026-01-06
2 2026-01-07
4 2026-01-08

Benchmarks:

  • Elite: Multiple per day
  • High: Once per day to once per week
  • Medium: Once per week to once per month
  • Low: Less than once per month
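
The benchmarks above can be applied mechanically to a deployment count. A minimal sketch, assuming the tiers translate to per-day rates (more than 1 per day, at least 1 per 7 days, at least 1 per 30 days); the function name `deploy_frequency_tier` and that translation are this sketch's interpretation:

```shell
#!/bin/bash
# deploy_frequency_tier DEPLOYS DAYS
# Maps an average deployment rate to the benchmark tiers listed above.
deploy_frequency_tier() {
  local deploys="$1" days="$2"
  LC_ALL=C awk -v d="$deploys" -v n="$days" 'BEGIN {
    if (n <= 0) { print "unknown"; exit }
    per_day = d / n
    if (per_day > 1)            print "Elite"
    else if (per_day >= 1 / 7)  print "High"
    else if (per_day >= 1 / 30) print "Medium"
    else                        print "Low"
  }'
}

# 3 + 5 + 2 + 4 = 14 deployments over the 4 days in the sample output
deploy_frequency_tier 14 4
# prints: Elite
```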

Lead Time for Changes

Measure commit → production time:

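# Tag commit time
# (records pipeline start as a proxy; CI_COMMIT_TIMESTAMP holds the actual commit time)
commit-timestamp:
  stage: .pre
  script:
    - echo "COMMIT_TIME=$(date +%s)" >> metrics.env
  artifacts:
    reports:
      dotenv: metrics.env

# Tag deploy time
deploy:
  stage: deploy
  script:
    - DEPLOY_TIME=$(date +%s)
    - LEAD_TIME=$((DEPLOY_TIME - COMMIT_TIME))
    # Quoted so the ": " inside the command is not parsed as a YAML mapping
    - 'echo "Lead time: $((LEAD_TIME / 3600)) hours"'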

Benchmarks:

  • Elite: < 1 hour
  • High: 1 day to 1 week
  • Medium: 1 week to 1 month
  • Low: > 1 month

Mean Time to Recovery

Track incident resolution time:

# Find failed deployments
FAILED=$(glab api "/projects/$PROJECT_ID/deployments?status=failed&environment=production")

# Find subsequent successful deployment
for deploy in $(echo "$FAILED" | jq -r '.[] | .id'); do
  FAILED_TIME=$(echo "$FAILED" | jq -r ".[] | select(.id == $deploy) | .created_at")
  NEXT_SUCCESS=$(glab api "/projects/$PROJECT_ID/deployments?status=success&environment=production" | \
    jq -r ".[] | select(.created_at > \"$FAILED_TIME\") | .created_at" | head -1)

  # Skip failures not yet followed by a successful deployment
  [ -z "$NEXT_SUCCESS" ] && continue

  RECOVERY_TIME=$(( $(date -d "$NEXT_SUCCESS" +%s) - $(date -d "$FAILED_TIME" +%s) ))
  echo "Recovery time: $((RECOVERY_TIME / 3600)) hours"
done

Benchmarks:

  • Elite: < 1 hour
  • High: < 1 day
  • Medium: 1 day to 1 week
  • Low: > 1 week

Change Failure Rate

Calculate % of deployments that fail:

TOTAL=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&per_page=100" | jq 'length')
FAILED=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&status=failed&per_page=100" | jq 'length')

FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)
echo "Change failure rate: $FAILURE_RATE%"

Benchmarks:

  • Elite: 0-15%
  • High: 16-30%
  • Medium: 31-45%
  • Low: > 45%
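
To turn a failure count into one of the tiers above, the rate and its classification can be computed together. A minimal sketch; the function name `failure_rate_tier` is hypothetical, and rates falling between the listed bands (for example 15.5%) are assigned to the next tier up in the list, which is this sketch's choice:

```shell
#!/bin/bash
# failure_rate_tier FAILED TOTAL
# Prints the change failure rate and the matching benchmark tier.
# Guards against TOTAL being zero, where the rate is undefined.
failure_rate_tier() {
  local failed="$1" total="$2"
  LC_ALL=C awk -v f="$failed" -v t="$total" 'BEGIN {
    if (t <= 0) { print "unknown"; exit }
    rate = 100 * f / t
    if (rate <= 15)      tier = "Elite"
    else if (rate <= 30) tier = "High"
    else if (rate <= 45) tier = "Medium"
    else                 tier = "Low"
    printf "%.1f%% %s\n", rate, tier
  }'
}

# Sample counts (made up for illustration)
failure_rate_tier 12 100
# prints: 12.0% Elite
```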

DORA Dashboard

Aggregate all metrics:

dora-metrics:
  stage: report
  only:
    - schedules
  script:
    - |
      cat > dora-report.json <<EOF
      {
        "deployment_frequency": "$(./calculate-deployment-frequency.sh)",
        "lead_time_hours": "$(./calculate-lead-time.sh)",
        "mttr_hours": "$(./calculate-mttr.sh)",
        "change_failure_rate": "$(./calculate-failure-rate.sh)"
      }
      EOF
    - cat dora-report.json
    - ./send-to-dashboard.sh dora-report.json
  artifacts:
    reports:
      metrics: dora-report.json

Schedule: Daily

Custom Dashboards

Prometheus + Grafana Integration

Export CI metrics to Prometheus:

export-metrics:
  stage: .post
  only:
    - schedules
  script:
    - |
      # Export pipeline duration
      # Assumes CI_PIPELINE_DURATION was computed earlier in the job;
      # it is not one of GitLab's predefined variables
      echo "gitlab_pipeline_duration_seconds{project=\"$CI_PROJECT_NAME\"} $CI_PIPELINE_DURATION" | \
        curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci

      # Export success rate
      echo "gitlab_pipeline_success_total{project=\"$CI_PROJECT_NAME\"} 1" | \
        curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci

Grafana Dashboard queries:

# Average pipeline duration
avg(gitlab_pipeline_duration_seconds) by (project)

# Success rate (last 24h)
rate(gitlab_pipeline_success_total[24h]) / rate(gitlab_pipeline_total[24h])

# CI minute consumption (estimated)
sum(gitlab_job_duration_seconds * gitlab_job_runner_multiplier) by (project) / 60

Custom Analytics Script

Weekly report generation:

#!/bin/bash
GROUP_ID="12345"
START_DATE=$(date -d '7 days ago' +%Y-%m-%d)
END_DATE=$(date +%Y-%m-%d)

echo "CI/CD Weekly Report: $START_DATE to $END_DATE"
echo "================================================"

# Get all projects
PROJECTS=$(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id')

for project in $PROJECTS; do
  PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

  # Get pipelines in date range
  PIPELINES=$(glab api "/projects/$project/pipelines?updated_after=$START_DATE&updated_before=$END_DATE")

  TOTAL=$(echo "$PIPELINES" | jq 'length')
  SUCCESS=$(echo "$PIPELINES" | jq '[.[] | select(.status == "success")] | length')
  FAILED=$(echo "$PIPELINES" | jq '[.[] | select(.status == "failed")] | length')

  if [ "$TOTAL" -gt 0 ]; then
    # Averages are computed only when pipelines exist, to avoid dividing by zero
    AVG_DURATION=$(echo "$PIPELINES" | jq '[.[] | .duration] | add / length / 60')
    SUCCESS_RATE=$(echo "scale=2; 100 * $SUCCESS / $TOTAL" | bc)

    echo ""
    echo "Project: $PROJECT_NAME"
    echo "  Pipelines: $TOTAL"
    echo "  Success rate: $SUCCESS_RATE%"
    echo "  Average duration: $AVG_DURATION minutes"

    if [ $(echo "$SUCCESS_RATE < 90" | bc) -eq 1 ]; then
      echo "  WARNING: Success rate below target (90%)"
    fi

    if [ $(echo "$AVG_DURATION > 15" | bc) -eq 1 ]; then
      echo "  WARNING: Average duration above target (15 min)"
    fi
  fi
done

Schedule: Run every Monday, send to team Slack

Alerting

Pipeline Failure Alerts

Slack notification on failure:

notify-failure:
  stage: .post
  when: on_failure
  script:
    - |
      curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
        -H 'Content-Type: application/json' \
        -d "{
          \"text\": \"Pipeline failed: $CI_PROJECT_NAME\",
          \"attachments\": [{
            \"color\": \"danger\",
            \"fields\": [
              {\"title\": \"Branch\", \"value\": \"$CI_COMMIT_REF_NAME\", \"short\": true},
              {\"title\": \"Commit\", \"value\": \"$CI_COMMIT_SHORT_SHA\", \"short\": true},
              {\"title\": \"Author\", \"value\": \"$GITLAB_USER_NAME\", \"short\": true},
              {\"title\": \"Pipeline\", \"value\": \"$CI_PIPELINE_URL\", \"short\": false}
            ]
          }]
        }"

Budget Threshold Alerts

Alert when approaching quota:

check-budget-threshold:
  stage: monitor
  only:
    - schedules
  script:
    - |
      USAGE=$(curl -s --header "PRIVATE-TOKEN: $CI_JOB_TOKEN" \
        "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas" | jq '.ci_minutes.used')

      QUOTA=$(curl -s --header "PRIVATE-TOKEN: $CI_JOB_TOKEN" \
        "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas" | jq '.ci_minutes.limit')

      PERCENT=$(echo "scale=2; 100 * $USAGE / $QUOTA" | bc)

      if [ $(echo "$PERCENT > 80" | bc) -eq 1 ]; then
        echo "WARNING: CI minute usage at $PERCENT%"
        # Send alert
      fi

Performance Degradation Alerts

Alert on pipeline slowdown:

check-performance:
  stage: monitor
  only:
    - schedules
  script:
    - |
      # Get last 10 pipeline durations
      RECENT=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=10" | \
        jq '[.[] | .duration] | add / length')

      # Get previous 10 pipeline durations
      PREVIOUS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=20" | \
        jq '[.[-10:] | .[] | .duration] | add / length')

      INCREASE=$(echo "scale=2; 100 * ($RECENT - $PREVIOUS) / $PREVIOUS" | bc)

      if [ $(echo "$INCREASE > 20" | bc) -eq 1 ]; then
        echo "WARNING: Pipeline duration increased by $INCREASE%"
        # Send alert
      fi

Flaky Test Detection

Alert on tests that fail intermittently:

detect-flaky-tests:
  stage: monitor
  only:
    - schedules
  script:
    - |
      # Get last 50 test jobs
      TESTS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=50" | \
        jq -r '.[] | .id' | \
        while read pipeline; do
          glab api "/projects/$CI_PROJECT_ID/pipelines/$pipeline/jobs" | \
            jq -r '.[] | select(.name == "test") | .status'
        done)

      # grep -c exits non-zero when the count is 0; "|| true" keeps the job running
      SUCCESS=$(echo "$TESTS" | grep -c success || true)
      FAILED=$(echo "$TESTS" | grep -c failed || true)
      TOTAL=$((SUCCESS + FAILED))

      if [ "$FAILED" -gt 0 ] && [ "$SUCCESS" -gt 0 ]; then
        FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)

        if [ $(echo "$FAILURE_RATE > 10 && $FAILURE_RATE < 90" | bc) -eq 1 ]; then
          echo "WARNING: Flaky test detected: $FAILURE_RATE% failure rate"
          # Send alert
        fi
      fi

Audit and Compliance

Pipeline Configuration Audit

Check for required settings:

#!/bin/bash
# Audit all projects for compliance

for project in $(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id'); do
  CI_CONFIG=$(glab api "/projects/$project/repository/files/.gitlab-ci.yml/raw?ref=main" 2>/dev/null)
  PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

  echo "Auditing: $PROJECT_NAME"

  # Check for security scans
  if ! echo "$CI_CONFIG" | grep -q "security-scan\|SAST"; then
    echo "  Missing security scans"
  fi

  # Check for test coverage
  if ! echo "$CI_CONFIG" | grep -q "coverage"; then
    echo "  No coverage reporting"
  fi

  # Check for caching
  if ! echo "$CI_CONFIG" | grep -q "cache:"; then
    echo "  No caching configured"
  fi

  # Check for interruptible jobs
  if ! echo "$CI_CONFIG" | grep -q "interruptible"; then
    echo "  No interruptible jobs (cost optimization)"
  fi
done

Schedule: Weekly compliance check

Change Log Tracking

Track pipeline config changes:

# Get commits that modified .gitlab-ci.yml
glab api "/projects/$PROJECT_ID/repository/commits?path=.gitlab-ci.yml&per_page=50" | \
  jq -r '.[] | "\(.created_at) | \(.author_name) | \(.title)"'

Output:

2026-01-08 | John Doe | ci: add Docker caching
2026-01-05 | Jane Smith | ci: parallelize tests
2026-01-02 | John Doe | ci: update security scans

Use case: Correlate pipeline changes with performance/cost trends

Access Audit

Track who can modify pipelines:

# Get project members with Maintainer/Owner access
glab api "/projects/$PROJECT_ID/members" | \
  jq -r '.[] | select(.access_level >= 40) | "\(.name) - \(.access_level)"'

Access levels:

  • 50: Owner
  • 40: Maintainer (can edit .gitlab-ci.yml)
  • 30: Developer (can trigger pipelines)
  • 20: Reporter (view only)
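
The member query prints numeric levels, which are harder to scan than role names. A small sketch that applies the mapping above to `name - level` lines; the function names `access_role` and `label_members` and the sample members are hypothetical:

```shell
#!/bin/bash
# access_role LEVEL: translate a numeric access level to its role name,
# using the mapping listed above.
access_role() {
  case "$1" in
    50) echo "Owner" ;;
    40) echo "Maintainer" ;;
    30) echo "Developer" ;;
    20) echo "Reporter" ;;
    *)  echo "Other ($1)" ;;
  esac
}

# label_members: rewrite "name - level" lines on stdin as "name - role".
label_members() {
  local line level name
  while IFS= read -r line; do
    level="${line##* - }"   # text after the last " - "
    name="${line% - *}"     # text before the last " - "
    printf '%s - %s\n' "$name" "$(access_role "$level")"
  done
}

# Sample members (made up for illustration)
printf '%s\n' "Jane Smith - 50" "John Doe - 40" | label_members
# prints:
#   Jane Smith - Owner
#   John Doe - Maintainer
```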

Summary Checklist

Essential Monitoring

  • Track CI minute usage weekly
  • Monitor pipeline success rate (target: >90%)
  • Identify slowest jobs and optimize
  • Set up budget alerts (80% threshold)
  • Review top CI minute consumers monthly

Performance Tracking

  • Monitor average pipeline duration
  • Track cache hit rates
  • Identify and fix flaky tests
  • Measure critical path of pipelines
  • Set performance SLOs

DORA Metrics

  • Track deployment frequency
  • Measure lead time for changes
  • Calculate MTTR
  • Monitor change failure rate
  • Set targets based on benchmarks

Compliance and Governance

  • Audit pipeline configs weekly
  • Track security scan coverage
  • Review access controls
  • Document pipeline changes
  • Enforce required scans via policy

Last Updated: 2026-01-08
Priority: HIGH - Essential for cost control and performance optimization