monitoring

GitLab CI/CD Monitoring and Analytics

Track pipeline performance, costs, and health across all projects.

Built-in Analytics
Cost Monitoring
Performance Metrics
DORA Metrics
Custom Dashboards
Alerting
Audit and Compliance

Built-in Analytics

CI/CD Analytics Overview

Location: Project → Analytics → CI/CD Analytics
Group-level: Group → Analytics → CI/CD Analytics

Available metrics:

Pipeline success rate
Pipeline duration trends
Job duration breakdown
Failure patterns
Coverage trends

Source: GitLab CI/CD Analytics

Pipeline Charts

Visualizations:

Pipeline status: Success/failed/canceled over time
Pipeline duration: Average duration by branch
Job breakdown: Time spent per job
Stage duration: Bottleneck identification

How to use:

1. Navigate to Analytics → CI/CD Analytics
2. Select time range (7 days, 30 days, 90 days)
3. Filter by branch (main, development, all)
4. Identify trends and anomalies
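To make the bottleneck check in step 4 concrete, here is a minimal sketch that turns per-stage durations into each stage's share of total pipeline time. The `stage,seconds` input format and the function name `stage_share` are assumptions for illustration; the numbers are made up.

```bash
#!/usr/bin/env bash
# Print each stage's share of total pipeline time, largest first.
# stdin: one "stage,seconds" line per stage (hypothetical input format).
stage_share() {
  awk -F, '
    { name[NR] = $1; secs[NR] = $2; total += $2 }
    END {
      if (total == 0) exit
      for (i = 1; i <= NR; i++)
        printf "%s,%.0f%%\n", name[i], 100 * secs[i] / total
    }' | sort -t, -k2,2nr
}

# Example with made-up durations: test takes 1400 of 2000 seconds
printf '%s\n' 'build,300' 'test,1400' 'deploy,300' | stage_share
```

With this sample input the first line printed is `test,70%`, which is the kind of result that points toward parallelizing tests.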

Example insights:

"Test stage taking 70% of pipeline time → parallelize tests"
"Success rate dropped from 95% to 70% last week → investigate failures"
"Pipeline duration doubled after dependency update → check caching"

Value Stream Analytics

Location: Group → Analytics → Value Stream

Tracks:

Issue → Code → Test → Production (full cycle)
Time in each stage
Bottlenecks in delivery flow
Lead time for changes

Use case: Understand end-to-end delivery speed

Repository Analytics

Location: Project → Analytics → Repository

Metrics:

Commits per day/week
Contributors
Programming languages
Code coverage trends

Correlation: High commit frequency + low pipeline success = investigate quality

Cost Monitoring

Compute Minutes Tracking

Location: Group → Settings → Usage Quotas → Pipelines

View:

Current month usage
Usage by project
Minute multipliers (runner sizes)
Quota limits
Historical trends

Example:

Total usage: 38,542 / 50,000 minutes (77%)
Top consumers:
  - project-a: 12,450 minutes (32%)
  - project-b: 8,920 minutes (23%)
  - project-c: 5,670 minutes (15%)

Actions:

Projects using >20% of quota → investigate
Usage trending above quota → optimize or purchase more minutes
Sudden spikes → check for pipeline loops or inefficiencies

Source: GitLab Compute Minutes
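The ">20%" action above can be sketched as a small filter over per-project usage. This is an illustration only: the function name `flag_heavy_consumers` and the `project,minutes` input format are assumptions, and the reference total is a parameter because the share can be measured against the quota or against total usage.

```bash
#!/usr/bin/env bash
# Flag projects whose usage exceeds a share of a reference total.
# $1: reference total in minutes (the quota, or total usage)
# $2: threshold in percent (default 20)
# stdin: "project,minutes" lines (hypothetical input format)
flag_heavy_consumers() {
  local total="$1" threshold="${2:-20}"
  awk -F, -v total="$total" -v t="$threshold" '
    total > 0 && 100 * $2 / total > t {
      printf "%s,%.1f%%\n", $1, 100 * $2 / total
    }'
}

# Example using the usage figures shown above, measured against the 50,000-minute quota
printf '%s\n' 'project-a,12450' 'project-b,8920' 'project-c,5670' |
  flag_heavy_consumers 50000 20
```

Measured against the quota, only project-a (24.9%) crosses the threshold. Measured against total usage (38,542), project-b would be flagged as well.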

Cost Per Project

API Query (get usage per project):

#!/bin/bash

GROUP_ID="12345"
TOKEN="your-gitlab-token"

echo "Project,CI Minutes"

# List projects in the group as "id:path" pairs (first 100 only; paginate for more)
curl -s --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.com/api/v4/groups/$GROUP_ID/projects?per_page=100" | \
  jq -r '.[] | "\(.id):\(.path_with_namespace)"' | \
while IFS=: read -r PROJECT_ID PROJECT_NAME; do
  # Get CI usage per project.
  # NOTE: the ci_runner_seconds field is assumed here; confirm that your GitLab
  # version returns it, otherwise the value falls back to 0.
  STATS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "https://gitlab.com/api/v4/projects/$PROJECT_ID/statistics")

  CI_SECONDS=$(echo "$STATS" | jq '.statistics.ci_runner_seconds // 0')
  CI_MINUTES=$(echo "scale=2; $CI_SECONDS / 60" | bc)

  echo "$PROJECT_NAME,$CI_MINUTES"
done | sort -t, -k2 -n -r

Output:

Project,CI Minutes
my-group/project-a,12450.25
my-group/project-b,8920.50
my-group/project-c,5670.75
...

Schedule: Run weekly, track trends

Cost Attribution

Add labels to track costs:

.cost-tracking:
  variables:
    COST_CENTER: "team-alpha"
    PROJECT_CODE: "product-x"
    ENVIRONMENT: "production"

build:
  extends: .cost-tracking
  script: npm run build

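Extract per-job data from the jobs API:

# List job name and duration for cost attribution.
# NOTE: the jobs endpoint does not return job variables in its standard response,
# so the cost center falls back to "unknown" unless your setup exposes it.
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.variables.COST_CENTER // "unknown")"'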

Budget Alerts

Scheduled pipeline to check budget:

# .gitlab-ci.yml in monitoring project
stages:
  - monitor

check-budget:
  stage: monitor
  only:
    - schedules
  script:
    - |
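      # Get current usage.
      # GITLAB_API_TOKEN is an assumed CI/CD variable holding a token with API
      # read access; CI_JOB_TOKEN is sent with the JOB-TOKEN header and reaches
      # only a limited set of endpoints.
      USAGE=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
        "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas" | \
        jq '.ci_minutes.used')

      BUDGET=50000
      THRESHOLD_PERCENT=80  # shell arithmetic is integer-only, so use a percentage

      if [ "$USAGE" -gt "$((BUDGET * THRESHOLD_PERCENT / 100))" ]; then
        echo "WARNING: CI minute budget at $((100 * USAGE / BUDGET))%"
        # Send alert (Slack, email, etc.)
        curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
          -H 'Content-Type: application/json' \
          -d "{\"text\":\"CI budget alert: $USAGE / $BUDGET minutes used\"}"
      fi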

Schedule: Daily at 9 AM

Minute Multipliers

Track runner sizes to optimize costs:

Runner Size	vCPU	RAM	Minute Multiplier
Small	1	2GB	1x
Medium	2	4GB	2x
Large	4	8GB	4x
X-Large	8	16GB	8x

Cost analysis:

# Get jobs by runner size
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.runner.size)"' | \
  awk -F, '{
    minutes = $2 / 60
    if ($3 == "small") mult = 1
    else if ($3 == "medium") mult = 2
    else if ($3 == "large") mult = 4
    else mult = 8
    cost = minutes * mult
    print $1 ": " cost " CI minutes"
  }'

Output:

build: 40 CI minutes (10 min × 4x large runner)
test: 20 CI minutes (10 min × 2x medium runner)
deploy: 5 CI minutes (5 min × 1x small runner)
Total: 65 CI minutes

Optimization: Use smallest runner that meets performance needs

Performance Metrics

Pipeline Duration Tracking

API Query:

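# Get the last 100 successful pipelines with durations.
# The list endpoint omits duration, so each pipeline is fetched individually.
glab api "/projects/$PROJECT_ID/pipelines?status=success&per_page=100" | \
  jq -r '.[].id' | \
  while read -r pipeline_id; do
    glab api "/projects/$PROJECT_ID/pipelines/$pipeline_id" | \
      jq -r 'select(.duration != null) | "\(.created_at[0:10]),\(.duration / 60) minutes"'
  done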

Track over time:

2026-01-01,12.5 minutes
2026-01-02,11.8 minutes
2026-01-03,15.2 minutes   ← Spike - investigate
2026-01-04,12.1 minutes
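Spotting the spike by eye works for four rows. For longer histories, one possible approach is to flag any day that exceeds the median by a set percentage. The sketch below assumes the `date,minutes` format shown above; the function name `flag_spikes` and the 20% default are illustrative choices, not part of any GitLab tooling.

```bash
#!/usr/bin/env bash
# Flag entries whose duration exceeds the median by more than a threshold.
# $1: threshold in percent above the median (default 20)
# stdin: "date,<minutes> minutes" lines, as in the tracking output above
flag_spikes() {
  local threshold="${1:-20}"
  # Sort numerically by duration so the median can be read by position
  LC_ALL=C sort -t, -k2,2n | awk -F, -v t="$threshold" '
    { date[NR] = $1; val[NR] = $2 + 0 }
    END {
      if (NR == 0) exit
      median = (NR % 2) ? val[(NR + 1) / 2] : (val[NR / 2] + val[NR / 2 + 1]) / 2
      limit = median * (1 + t / 100)
      for (i = 1; i <= NR; i++)
        if (val[i] > limit) print date[i] "," val[i]
    }'
}

printf '%s\n' \
  '2026-01-01,12.5 minutes' '2026-01-02,11.8 minutes' \
  '2026-01-03,15.2 minutes' '2026-01-04,12.1 minutes' | flag_spikes 20
```

For the four sample days the median is 12.3 minutes, so the limit is 14.76 and only 2026-01-03 is printed. The median is used rather than the mean so that a single slow day does not raise its own threshold.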

Job-Level Performance

Identify slowest jobs:

# Get the 10 slowest jobs in a pipeline (duration in seconds)
glab api "/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs" | \
  jq -r '.[] | "\(.name),\(.duration),\(.status)"' | \
  sort -t, -k2 -n -r | \
  head -10

Output (top 10 slowest jobs):

e2e-tests,1820,success
integration-tests,1205,success
build-docker,890,success
security-scan,650,success
unit-tests,320,success
...

Actions:

e2e-tests (30 min): Parallelize with parallel: 10
build-docker (15 min): Enable Docker layer caching
security-scan (11 min): Run only on MRs, not every commit

Success Rate Monitoring

Track pipeline success rate:

# Last 100 pipelines
glab api "/projects/$PROJECT_ID/pipelines?per_page=100" | \
  jq '[.[] | .status] | group_by(.) | map({status: .[0], count: length})' | \
  jq -r '.[] | "\(.status): \(.count)"'

Output:

success: 78
failed: 15
canceled: 7

Success rate: 78%

Target: >90% success rate

If below target:

Review failure logs
Check for flaky tests
Improve validation (see validation.md)
Add retry logic for transient failures
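The 78% figure and the 90% target can be checked mechanically. This is a minimal sketch that assumes one pipeline status per line on stdin; the function names `success_rate` and `meets_target` are hypothetical.

```bash
#!/usr/bin/env bash
# Compute the success rate (percent, one decimal) from a list of statuses.
# stdin: one pipeline status per line (success, failed, canceled, ...)
success_rate() {
  awk '
    { total++ }
    $1 == "success" { ok++ }
    END { if (total > 0) printf "%.1f\n", 100 * ok / total }'
}

# Exit status 0 when the rate is strictly above the target percent (default 90).
meets_target() {
  local rate="$1" target="${2:-90}"
  awk -v r="$rate" -v t="$target" 'BEGIN { exit !(r > t) }'
}

# Example: three of four pipelines succeeded
RATE=$(printf '%s\n' success success failed success | success_rate)
if meets_target "$RATE"; then
  echo "Success rate $RATE% meets target"
else
  echo "Success rate $RATE% is below target"
fi
```

The status list could come from the `jq -r '.[] | .status'` output of the pipelines query above. Canceled pipelines count against the rate here, which matches the 78% figure in the sample output.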

Job Failure Patterns

Identify most common failures:

# Get failed jobs
glab api "/projects/$PROJECT_ID/pipelines?status=failed&per_page=50" | \
  jq -r '.[] | .id' | \
  while read pipeline_id; do
    glab api "/projects/$PROJECT_ID/pipelines/$pipeline_id/jobs" | \
      jq -r '.[] | select(.status == "failed") | .name'
  done | \
  sort | uniq -c | sort -n -r

Output:

45 test
12 build-docker
8 deploy-staging
3 lint

Action: test failing 45 times → investigate test flakiness

Cache Hit Rate

Track cache effectiveness:

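test:
  cache:
    key: npm-cache
    paths:
      - .npm/
  script:
    - INSTALL_START=$(date +%s)
    # npm ci deletes node_modules before installing, so cache npm's download cache instead
    - npm ci --cache .npm --prefer-offline
    - INSTALL_END=$(date +%s)
    - echo "Install took $((INSTALL_END - INSTALL_START)) seconds"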

Expected:

Cold cache: 120 seconds
Warm cache: 5 seconds
Cache hit rate: >80%

If low hit rate:

Check cache key strategy
Verify runner tags are consistent
Review cache size limits
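The timing output above does not give a hit rate by itself. One rough way to estimate it, sketched here under stated assumptions, is to collect the install durations from recent jobs and count runs at or below a cutoff as warm-cache runs. The function name `cache_hit_rate` and the 30-second cutoff are hypothetical; pick a cutoff between your own cold and warm timings.

```bash
#!/usr/bin/env bash
# Estimate cache hit rate (percent) from install durations.
# $1: durations at or below this many seconds count as warm-cache runs
#     (hypothetical cutoff; default 30)
# stdin: one duration in seconds per line
cache_hit_rate() {
  local cutoff="${1:-30}"
  awk -v c="$cutoff" '
    { total++; if ($1 + 0 <= c) hits++ }
    END { if (total > 0) printf "%.0f\n", 100 * hits / total }'
}

# Example: four warm runs and one cold run
printf '%s\n' 5 6 118 4 7 | cache_hit_rate 30
```

For the sample durations this prints 80, which sits right at the boundary of the >80% expectation.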

DORA Metrics

What are DORA Metrics?

Four key metrics for DevOps performance:

Deployment Frequency: How often you deploy
Lead Time for Changes: Time from commit to production
Mean Time to Recovery (MTTR): Time to recover from failure
Change Failure Rate: % of deployments causing failures

Source: DORA DevOps Research

Deployment Frequency

Track deployments to production:

# Count deployments per day
glab api "/projects/$PROJECT_ID/deployments?environment=production" | \
  jq -r '.[] | .created_at' | \
  cut -d T -f1 | \
  sort | uniq -c

Output:

3 2026-01-05
5 2026-01-06
2 2026-01-07
4 2026-01-08

Benchmarks:

Elite: Multiple per day
High: Once per day to once per week
Medium: Once per week to once per month
Low: Less than once per month
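The benchmark tiers can be applied to the per-day counts automatically. The sketch below is one possible mapping, not an official DORA tool: the function name `classify_deploy_frequency` is hypothetical, and it assumes a week is 7 days and a month is 30 days.

```bash
#!/usr/bin/env bash
# Classify deployment frequency against the benchmarks above.
# $1: number of production deployments in the window
# $2: length of the window in days
classify_deploy_frequency() {
  awk -v n="$1" -v days="$2" 'BEGIN {
    if (days <= 0) { print "unknown"; exit }
    per_day = n / days
    if      (per_day > 1)       print "Elite"    # multiple per day
    else if (per_day >= 1 / 7)  print "High"     # daily to weekly
    else if (per_day >= 1 / 30) print "Medium"   # weekly to monthly (30-day month assumed)
    else                        print "Low"
  }'
}

# Example: the sample output above shows 3 + 5 + 2 + 4 = 14 deployments in 4 days
classify_deploy_frequency 14 4
```

The sample data averages 3.5 deployments per day, so the example prints Elite.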

Lead Time for Changes

Measure commit → production time:

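# Record commit time (CI_COMMIT_TIMESTAMP is ISO 8601; parsing it with -d assumes GNU date)
commit-timestamp:
  stage: .pre
  script:
    - echo "COMMIT_TIME=$(date -d "$CI_COMMIT_TIMESTAMP" +%s)" >> metrics.env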
  artifacts:
    reports:
      dotenv: metrics.env

# Tag deploy time
deploy:
  stage: deploy
  script:
    - DEPLOY_TIME=$(date +%s)
    - LEAD_TIME=$((DEPLOY_TIME - COMMIT_TIME))
    - echo "Lead time: $((LEAD_TIME / 3600)) hours"

Benchmarks:

Elite: < 1 hour
High: 1 day to 1 week
Medium: 1 week to 1 month
Low: > 1 month

Mean Time to Recovery

Track incident resolution time:

# Find failed and successful production deployments
FAILED=$(glab api "/projects/$PROJECT_ID/deployments?status=failed&environment=production")
SUCCEEDED=$(glab api "/projects/$PROJECT_ID/deployments?status=success&environment=production")

# For each failure, find the next successful deployment (date -d assumes GNU date)
echo "$FAILED" | jq -r '.[].created_at' | while read -r FAILED_TIME; do
  NEXT_SUCCESS=$(echo "$SUCCEEDED" | \
    jq -r --arg t "$FAILED_TIME" '[.[] | select(.created_at > $t) | .created_at] | sort | first // empty')

  # Skip failures that have not been followed by a successful deployment yet
  [ -z "$NEXT_SUCCESS" ] && continue

  RECOVERY_TIME=$(( $(date -d "$NEXT_SUCCESS" +%s) - $(date -d "$FAILED_TIME" +%s) ))
  echo "Recovery time: $((RECOVERY_TIME / 3600)) hours"
done

Benchmarks:

Elite: < 1 hour
High: < 1 day
Medium: 1 day to 1 week
Low: > 1 week

Change Failure Rate

Calculate % of deployments that fail:

TOTAL=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&per_page=100" | jq 'length')
FAILED=$(glab api "/projects/$PROJECT_ID/deployments?environment=production&status=failed&per_page=100" | jq 'length')

# Guard against division by zero when there are no deployments
if [ "$TOTAL" -gt 0 ]; then
  FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)
  echo "Change failure rate: $FAILURE_RATE%"
fi

Benchmarks:

Elite: 0-15%
High: 16-30%
Medium: 31-45%
Low: > 45%
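A small helper can combine the rate calculation with the benchmark tiers. This is a sketch under assumptions: the function name `change_failure_rate` is hypothetical, and because the listed ranges leave gaps between whole percentages (for example 15.5%), the sketch treats each upper bound as inclusive and anything above it as the next tier.

```bash
#!/usr/bin/env bash
# Compute the change failure rate and map it to a benchmark tier.
# $1: failed deployments, $2: total deployments
change_failure_rate() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }   # no deployments, nothing to rate
    rate = 100 * failed / total
    if      (rate <= 15) tier = "Elite"
    else if (rate <= 30) tier = "High"
    else if (rate <= 45) tier = "Medium"
    else                 tier = "Low"
    printf "%.1f%% %s\n", rate, tier
  }'
}

# Example: 3 failed out of 25 production deployments
change_failure_rate 3 25
```

The example prints `12.0% Elite`. The two arguments correspond to the `FAILED` and `TOTAL` counts from the query above.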

DORA Dashboard

Aggregate all metrics:

dora-metrics:
  stage: report
  only:
    - schedules
  script:
    - |
      cat > dora-report.json <<EOF
      {
        "deployment_frequency": "$(./calculate-deployment-frequency.sh)",
        "lead_time_hours": "$(./calculate-lead-time.sh)",
        "mttr_hours": "$(./calculate-mttr.sh)",
        "change_failure_rate": "$(./calculate-failure-rate.sh)"
      }
      EOF
    - cat dora-report.json
    - ./send-to-dashboard.sh dora-report.json
  artifacts:
    # A metrics report expects OpenMetrics text, so keep the JSON as a regular artifact
    paths:
      - dora-report.json

Schedule: Daily

Custom Dashboards

Prometheus + Grafana Integration

Export CI metrics to Prometheus:

export-metrics:
  stage: .post
  only:
    - schedules
  script:
    - |
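      # Export pipeline duration so far (seconds since the pipeline was created).
      # date -d assumes GNU date in the job image.
      DURATION=$(( $(date +%s) - $(date -d "$CI_PIPELINE_CREATED_AT" +%s) ))
      echo "gitlab_pipeline_duration_seconds{project=\"$CI_PROJECT_NAME\"} $DURATION" | \
        curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci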

      # Export success rate
      echo "gitlab_pipeline_success_total{project=\"$CI_PROJECT_NAME\"} 1" | \
        curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/gitlab-ci

Grafana Dashboard queries:

# Average pipeline duration
avg(gitlab_pipeline_duration_seconds) by (project)

# Success rate (last 24h)
rate(gitlab_pipeline_success_total[24h]) / rate(gitlab_pipeline_total[24h])

# CI minute consumption (estimated)
sum(gitlab_job_duration_seconds * gitlab_job_runner_multiplier) by (project) / 60

Custom Analytics Script

Weekly report generation:

#!/bin/bash

GROUP_ID="12345"
START_DATE=$(date -d '7 days ago' +%Y-%m-%d)
END_DATE=$(date +%Y-%m-%d)

echo "CI/CD Weekly Report: $START_DATE to $END_DATE"
echo "================================================"

# Get all projects
PROJECTS=$(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id')

for project in $PROJECTS; do
  PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

  # Get pipelines in date range
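  PIPELINES=$(glab api "/projects/$project/pipelines?updated_after=$START_DATE&updated_before=$END_DATE")

  TOTAL=$(echo "$PIPELINES" | jq 'length')
  SUCCESS=$(echo "$PIPELINES" | jq '[.[] | select(.status == "success")] | length')
  FAILED=$(echo "$PIPELINES" | jq '[.[] | select(.status == "failed")] | length')

  # The list endpoint omits duration, so fetch each pipeline to average it (in minutes)
  AVG_DURATION=$(echo "$PIPELINES" | jq -r '.[].id' | while read -r id; do
      glab api "/projects/$project/pipelines/$id" | jq '.duration // empty'
    done | jq -s 'if length > 0 then (add / length / 60 * 100 | round / 100) else 0 end')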

  if [ "$TOTAL" -gt 0 ]; then
    SUCCESS_RATE=$(echo "scale=2; 100 * $SUCCESS / $TOTAL" | bc)

    echo ""
    echo "Project: $PROJECT_NAME"
    echo "  Pipelines: $TOTAL"
    echo "  Success rate: $SUCCESS_RATE%"
    echo "  Average duration: $AVG_DURATION minutes"

    if [ "$(echo "$SUCCESS_RATE < 90" | bc)" -eq 1 ]; then
      echo "  WARNING: Success rate below target (90%)"
    fi

    if [ "$(echo "$AVG_DURATION > 15" | bc)" -eq 1 ]; then
      echo "  WARNING: Average duration above target (15 min)"
    fi
  fi
done

Schedule: Run every Monday, send to team Slack

Alerting

Pipeline Failure Alerts

Slack notification on failure:

notify-failure:
  stage: .post
  when: on_failure
  script:
    - |
      curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
        -H 'Content-Type: application/json' \
        -d "{
          \"text\": \"Pipeline failed: $CI_PROJECT_NAME\",
          \"attachments\": [{
            \"color\": \"danger\",
            \"fields\": [
              {\"title\": \"Branch\", \"value\": \"$CI_COMMIT_REF_NAME\", \"short\": true},
              {\"title\": \"Commit\", \"value\": \"$CI_COMMIT_SHORT_SHA\", \"short\": true},
              {\"title\": \"Author\", \"value\": \"$GITLAB_USER_NAME\", \"short\": true},
              {\"title\": \"Pipeline\", \"value\": \"$CI_PIPELINE_URL\", \"short\": false}
            ]
          }]
        }"

Budget Threshold Alerts

Alert when approaching quota:

check-budget-threshold:
  stage: monitor
  only:
    - schedules
  script:
    - |
      # GITLAB_API_TOKEN is an assumed CI/CD variable with API read access;
      # CI_JOB_TOKEN cannot be sent as a PRIVATE-TOKEN.
      QUOTAS=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
        "$CI_API_V4_URL/groups/$GROUP_ID/usage_quotas")
      USAGE=$(echo "$QUOTAS" | jq '.ci_minutes.used')
      QUOTA=$(echo "$QUOTAS" | jq '.ci_minutes.limit')

      PERCENT=$(echo "scale=2; 100 * $USAGE / $QUOTA" | bc)

      if [ "$(echo "$PERCENT > 80" | bc)" -eq 1 ]; then
        echo "WARNING: CI minute usage at $PERCENT%"
        # Send alert
      fi

Performance Degradation Alerts

Alert on pipeline slowdown:

check-performance:
  stage: monitor
  only:
    - schedules
  script:
    - |
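      # Average duration of the pipeline IDs read from stdin. The list endpoint
      # omits duration, so each pipeline is fetched individually.
      avg_duration() {
        while read -r id; do
          glab api "/projects/$CI_PROJECT_ID/pipelines/$id" | jq '.duration // empty'
        done | jq -s 'if length > 0 then add / length else 0 end'
      }

      IDS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=20" | jq -r '.[].id')

      # Last 10 pipelines versus the 10 before them
      RECENT=$(echo "$IDS" | head -n 10 | avg_duration)
      PREVIOUS=$(echo "$IDS" | tail -n +11 | avg_duration)

      # Skip the comparison when there is no baseline to divide by
      if [ "$(echo "$PREVIOUS > 0" | bc)" -eq 1 ]; then
        INCREASE=$(echo "scale=2; 100 * ($RECENT - $PREVIOUS) / $PREVIOUS" | bc)

        if [ "$(echo "$INCREASE > 20" | bc)" -eq 1 ]; then
          echo "WARNING: Pipeline duration increased by $INCREASE%"
          # Send alert
        fi
      fi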

Flaky Test Detection

Alert on tests that fail intermittently:

detect-flaky-tests:
  stage: monitor
  only:
    - schedules
  script:
    - |
      # Get last 50 test jobs
      TESTS=$(glab api "/projects/$CI_PROJECT_ID/pipelines?per_page=50" | \
        jq -r '.[] | .id' | \
        while read pipeline; do
          glab api "/projects/$CI_PROJECT_ID/pipelines/$pipeline/jobs" | \
            jq -r '.[] | select(.name == "test") | .status'
        done)

      # grep -c exits non-zero on zero matches; "|| true" keeps the job from aborting
      SUCCESS=$(echo "$TESTS" | grep -c success || true)
      FAILED=$(echo "$TESTS" | grep -c failed || true)
      TOTAL=$((SUCCESS + FAILED))

      if [ "$FAILED" -gt 0 ] && [ "$SUCCESS" -gt 0 ]; then
        FAILURE_RATE=$(echo "scale=2; 100 * $FAILED / $TOTAL" | bc)
        if [ "$(echo "$FAILURE_RATE > 10 && $FAILURE_RATE < 90" | bc)" -eq 1 ]; then
          echo "WARNING: Flaky test detected: $FAILURE_RATE% failure rate"
          # Send alert
        fi
      fi

Audit and Compliance

Pipeline Configuration Audit

Check for required settings:

#!/bin/bash

# Audit all projects for compliance
for project in $(glab api "/groups/$GROUP_ID/projects?per_page=100" | jq -r '.[].id'); do
  CI_CONFIG=$(glab api "/projects/$project/repository/files/.gitlab-ci.yml/raw?ref=main" 2>/dev/null)

  PROJECT_NAME=$(glab api "/projects/$project" | jq -r '.path_with_namespace')

  echo "Auditing: $PROJECT_NAME"

  # Check for security scans
  if ! echo "$CI_CONFIG" | grep -q "security-scan\|SAST"; then
    echo "  WARNING: Missing security scans"
  fi

  # Check for test coverage
  if ! echo "$CI_CONFIG" | grep -q "coverage"; then
    echo "  WARNING: No coverage reporting"
  fi

  # Check for caching
  if ! echo "$CI_CONFIG" | grep -q "cache:"; then
    echo "  WARNING: No caching configured"
  fi

  # Check for interruptible jobs
  if ! echo "$CI_CONFIG" | grep -q "interruptible"; then
    echo "  WARNING: No interruptible jobs (cost optimization)"
  fi
done

Schedule: Weekly compliance check

Change Log Tracking

Track pipeline config changes:

# Get commits that modified .gitlab-ci.yml
glab api "/projects/$PROJECT_ID/repository/commits?path=.gitlab-ci.yml&per_page=50" | \
  jq -r '.[] | "\(.created_at) | \(.author_name) | \(.title)"'

Output:

2026-01-08 | John Doe | ci: add Docker caching
2026-01-05 | Jane Smith | ci: parallelize tests
2026-01-02 | John Doe | ci: update security scans

Use case: Correlate pipeline changes with performance/cost trends

Access Audit

Track who can modify pipelines:

# Get project members with Maintainer/Owner access, including members inherited from groups
glab api "/projects/$PROJECT_ID/members/all" | \
  jq -r '.[] | select(.access_level >= 40) | "\(.name) - \(.access_level)"'

Access levels:

50: Owner
40: Maintainer (can edit .gitlab-ci.yml)
30: Developer (can trigger pipelines)
20: Reporter (view only)
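The audit output above prints raw numbers. A small lookup makes reports easier to read. This sketch uses a hypothetical function name, `access_level_name`, and adds level 10 (Guest) from GitLab's standard access levels, which the list above does not include.

```bash
#!/usr/bin/env bash
# Map a numeric GitLab access level to its role name.
access_level_name() {
  case "$1" in
    50) echo "Owner" ;;
    40) echo "Maintainer" ;;
    30) echo "Developer" ;;
    20) echo "Reporter" ;;
    10) echo "Guest" ;;                # not listed above; standard GitLab level
    *)  echo "Unknown ($1)" ;;         # other or custom levels
  esac
}

# Example: label the level from one line of audit output
access_level_name 40
```

The example prints Maintainer. Any level not in the table is reported as Unknown with the original number preserved.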

Summary Checklist

Essential Monitoring

Track CI minute usage weekly
Monitor pipeline success rate (target: >90%)
Identify slowest jobs and optimize
Set up budget alerts (80% threshold)
Review top CI minute consumers monthly

Performance Tracking

Monitor average pipeline duration
Track cache hit rates
Identify and fix flaky tests
Measure critical path of pipelines
Set performance SLOs

DORA Metrics

Track deployment frequency
Measure lead time for changes
Calculate MTTR
Monitor change failure rate
Set targets based on benchmarks

Compliance and Governance

Audit pipeline configs weekly
Track security scan coverage
Review access controls
Document pipeline changes
Enforce required scans via policy

Additional Resources

Cost Optimization - Reduce CI minute usage
Pipeline Efficiency - Performance optimization
Multi-Project Management - Monitoring at scale
GitLab CI/CD Analytics Docs
DORA Metrics

Last Updated: 2026-01-08
Priority: HIGH - Essential for cost control and performance optimization
