Cost Monitoring and Analytics
Overview
Continuous monitoring ensures optimizations remain effective and identifies new opportunities for cost reduction.
Key Metrics to Track
1. Total Compute Minutes per Month
Target: Consistent or decreasing
Track:
    # Monthly usage
    glab api /namespaces/:id/ci_minutes | jq '.minutes_used'

    # Percentage of quota
    glab api /namespaces/:id/ci_minutes | \
      jq '(.minutes_used / .monthly_minutes_limit) * 100'
Trend Analysis:
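    # Save monthly (automate this)
    echo "$(date +%Y-%m),$(glab api /namespaces/:id/ci_minutes | jq '.minutes_used')" >> ci-usage.csv

    # Summarize with Miller (the file has no header row, so fields are addressed by position);
    # plot the same file with gnuplot or similar
    mlr --csv --implicit-csv-header stats1 -a mean,min,max -f 2 ci-usage.csv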
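A month-over-month delta is often easier to read than summary statistics. The following is a minimal sketch, assuming `ci-usage.csv` holds headerless `YYYY-MM,minutes` rows in chronological order, as written by the snippet above. The function name `usage_trend` is hypothetical.

```shell
#!/usr/bin/env bash
# usage_trend FILE: print "month,minutes,change" for each row, where change is
# the difference from the previous row ("n/a" for the first row).
# Assumes FILE contains headerless "YYYY-MM,minutes" rows in chronological order.
usage_trend() {
  awk -F, '
    NR == 1 { printf "%s,%d,n/a\n", $1, $2 }
    NR > 1  { printf "%s,%d,%+d\n", $1, $2, $2 - prev }
            { prev = $2 }
  ' "$1"
}
```

Run it as `usage_trend ci-usage.csv`. A positive change in the third column means usage grew relative to the previous month.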
2. Top Projects by Usage
Identify cost centers:
    # Get top 10 projects
    glab api /groups/:id/usage_stats | \
      jq -r '.projects[] | "\(.ci_minutes)\t\(.name)"' | \
      sort -rn | head -10
Export to CSV:
    # Write a header row so Miller can address the ci_minutes column by name
    glab api /groups/:id/usage_stats | \
      jq -r '["name","ci_minutes"], (.projects[] | [.name, .ci_minutes]) | @csv' > project-usage.csv

    # Analyze with Miller
    mlr --csv --opprint \
      sort -nr ci_minutes then \
      head -n 20 project-usage.csv
3. Pipeline Duration Trends
Track over time:
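    # Average duration in minutes across the successful pipelines in the last 100
    glab api "/projects/:id/pipelines?per_page=100" | \
      jq '[.[] | select(.status == "success" and .duration != null) | .duration] | add / length / 60'

    # By branch (mean and p95 in minutes; the TSV has no header row)
    glab api "/projects/:id/pipelines?per_page=100" | \
      jq -r '.[] | select(.duration != null) | "\(.ref)\t\(.duration / 60)"' | \
      mlr --tsv --implicit-csv-header stats1 -a mean,p95 -f 2 -g 1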
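Where Miller is not installed, the mean and 95th percentile can be approximated with `sort` and `awk` alone. This is a sketch under the assumption that durations arrive as seconds, one per line (for example from `jq -r '.[].duration'`). The function name `duration_stats` is hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from Miller's interpolation.

```shell
#!/usr/bin/env bash
# duration_stats: read durations in seconds (one per line) from stdin and
# print "mean_minutes p95_minutes". The p95 uses the nearest-rank method.
duration_stats() {
  sort -n | awk '
    NF { v[++n] = $1; sum += $1 }
    END {
      if (n == 0) { print "no data"; exit }
      rank = int((95 * n + 99) / 100)   # integer ceil(0.95 * n)
      printf "%.1f %.1f\n", sum / n / 60, v[rank] / 60
    }'
}
```

Tracking the p95 alongside the mean helps separate a few slow outlier pipelines from a general slowdown.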
4. Job Failure Rate
Wasted minutes from failures:
    # Job counts by status (compare "failed" against the total)
    glab api "/projects/:id/jobs?per_page=100" | \
      jq '[.[] | .status] | group_by(.) | map({status: .[0], count: length})'

    # Cost of failures (minutes spent on failed jobs)
    glab api "/projects/:id/jobs?per_page=100" | \
      jq '[.[] | select(.status == "failed" and .duration != null) | (.duration / 60)] | add // 0'
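The status counts above still need to be turned into a single percentage. A minimal sketch, assuming job statuses are supplied one per line (for example from `jq -r '.[].status'`); the function name `failure_rate` is hypothetical.

```shell
#!/usr/bin/env bash
# failure_rate: read job statuses (one per line) from stdin and print the
# share of "failed" jobs as a percentage with one decimal place.
failure_rate() {
  awk '
    NF { total++ }
    $1 == "failed" { failed++ }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", failed * 100 / total
    }'
}
```

Note that this counts every status in the denominator, including canceled and skipped jobs; filter the input first if only finished jobs should count.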
5. Cache Hit Rate
Effectiveness of caching:
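    # Add to .gitlab-ci.yml
    # tee prints the result to the job log and records it in metrics.env
    test:
      before_script:
        - |
          if [ -d "node_modules" ]; then
            echo "CACHE_HIT=true" | tee -a metrics.env
          else
            echo "CACHE_HIT=false" | tee -a metrics.env
          fi
      artifacts:
        reports:
          dotenv: metrics.env

    # Analyze cache hits: the job list does not include log output,
    # so fetch each job's trace and count the jobs that logged a hit
    glab api "/projects/:id/jobs?per_page=100" | \
      jq -r '.[] | select(.name == "test") | .id' | \
      while read -r job_id; do
        glab api "/projects/:id/jobs/$job_id/trace" | grep -q 'CACHE_HIT=true' && echo "$job_id"
      done | wc -l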
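A count of hits is more useful as a rate. The sketch below assumes the `CACHE_HIT=true` / `CACHE_HIT=false` lines from several jobs have been collected into one stream (for example by concatenating downloaded `metrics.env` files). The function name `cache_hit_rate` is hypothetical.

```shell
#!/usr/bin/env bash
# cache_hit_rate: read KEY=VALUE lines from stdin and print the percentage of
# CACHE_HIT entries that are "true", truncated to a whole number.
# Lines with other keys are ignored.
cache_hit_rate() {
  awk -F= '
    $1 == "CACHE_HIT" { total++; if ($2 == "true") hits++ }
    END {
      if (total == 0) { print "n/a"; exit }
      printf "%d\n", hits * 100 / total
    }'
}
```

For example, `cat */metrics.env | cache_hit_rate` would report the hit rate across all collected files.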
6. Cost per Deploy
Efficiency metric:
    # Minutes consumed per successful deployment
    DEPLOY_COUNT=$(glab api "/projects/:id/deployments?status=success&per_page=100" | jq 'length')
    TOTAL_MINUTES=$(glab api /namespaces/:id/ci_minutes | jq '.minutes_used')
    if [ "$DEPLOY_COUNT" -gt 0 ]; then
      echo "scale=2; $TOTAL_MINUTES / $DEPLOY_COUNT" | bc
    else
      echo "No successful deployments found"
    fi
Automated Monitoring Dashboard
GitLab CI Job for Tracking
Create monitoring job:
    # .gitlab-ci.yml
    monitor:usage:
      image: alpine:latest
      stage: .post
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"  # Daily
      before_script:
        - apk add --no-cache curl jq
      script:
        - |
          echo "CI/CD Cost Report - $(date)"
          echo "================================"

          # Get usage data (one request, reused for both fields)
          DATA=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/namespaces/$CI_PROJECT_NAMESPACE_ID/ci_minutes")
          USAGE=$(echo "$DATA" | jq '.minutes_used')
          QUOTA=$(echo "$DATA" | jq '.monthly_minutes_limit')
          PERCENT=$((100 * USAGE / QUOTA))

          echo "Usage: $USAGE / $QUOTA minutes ($PERCENT%)"

          # Alert if high
          if [ "$PERCENT" -gt 75 ]; then
            echo "WARNING: High usage detected!"
            curl -X POST "$SLACK_WEBHOOK_URL" \
              -H 'Content-Type: application/json' \
              -d "{\"text\":\"CI Minutes at $PERCENT% of quota\"}"
          fi

          # Top projects
          echo ""
          echo "Top 5 Projects by Usage:"
          curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/groups/$CI_PROJECT_NAMESPACE_ID/usage_stats" | \
            jq -r '.projects[] | "\(.name): \(.ci_minutes) min"' | \
            sort -t: -k2 -rn | head -5

          # Save to artifact for trending
          echo "$USAGE,$QUOTA,$PERCENT,$(date +%Y-%m-%d)" >> usage-history.csv
      artifacts:
        paths:
          - usage-history.csv
        expire_in: 1 year
Schedule daily:
- Navigate to: CI/CD → Schedules
- Add schedule: 0 9 * * * (daily at 9 AM)
- Target branch: main
Slack Notifications
Alert on milestones:
    notify:slack:
      image: alpine:latest
      stage: .post
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      before_script:
        # alpine does not ship curl or jq
        - apk add --no-cache curl jq
      script:
        - |
          DATA=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/namespaces/$CI_PROJECT_NAMESPACE_ID/ci_minutes")
          USAGE=$(echo "$DATA" | jq '.minutes_used')
          QUOTA=$(echo "$DATA" | jq '.monthly_minutes_limit')
          PERCENT=$((100 * USAGE / QUOTA))

          MESSAGE=""
          if [ "$PERCENT" -ge 90 ]; then
            MESSAGE="CRITICAL: CI minutes at $PERCENT% ($USAGE/$QUOTA)"
          elif [ "$PERCENT" -ge 75 ]; then
            MESSAGE="WARNING: CI minutes at $PERCENT% ($USAGE/$QUOTA)"
          elif [ "$PERCENT" -ge 50 ]; then
            MESSAGE="INFO: CI minutes at $PERCENT% ($USAGE/$QUOTA)"
          fi

          if [ -n "$MESSAGE" ]; then
            COLOR=$([ "$PERCENT" -ge 90 ] && echo 'danger' || echo 'warning')
            curl -X POST "$SLACK_WEBHOOK_URL" \
              -H 'Content-Type: application/json' \
              -d "{
                \"text\": \"$MESSAGE\",
                \"attachments\": [{
                  \"color\": \"$COLOR\",
                  \"fields\": [
                    {\"title\": \"Used\", \"value\": \"$USAGE min\", \"short\": true},
                    {\"title\": \"Quota\", \"value\": \"$QUOTA min\", \"short\": true}
                  ]
                }]
              }"
          fi
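The threshold logic is easier to test when it is pulled out of the CI job. A minimal sketch of the 50/75/90 milestones used above; the function names `usage_percent` and `usage_severity` are hypothetical, and the zero-quota guard is an added assumption to avoid a division by zero.

```shell
#!/usr/bin/env bash
# usage_percent USED QUOTA: print the whole-number percentage of quota used.
# Prints 0 when QUOTA is zero or negative to avoid dividing by zero.
usage_percent() {
  local used=$1 quota=$2
  if [ "$quota" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( 100 * used / quota ))
}

# usage_severity PERCENT: print CRITICAL (>= 90), WARNING (>= 75), or
# INFO (>= 50); print nothing below 50.
usage_severity() {
  local percent=$1
  if   [ "$percent" -ge 90 ]; then echo "CRITICAL"
  elif [ "$percent" -ge 75 ]; then echo "WARNING"
  elif [ "$percent" -ge 50 ]; then echo "INFO"
  fi
}
```

A job script could then call `usage_severity "$(usage_percent "$USAGE" "$QUOTA")"` and send a notification only when the result is non-empty.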
GitLab Ultimate Analytics
CI/CD Analytics Dashboard
Navigate to: Group → Analytics → CI/CD Analytics
Metrics Available:
- Pipeline duration trends
- Success/failure rates
- DORA metrics (deployment frequency, lead time, MTTR)
- Job duration by stage
Value Stream Analytics
Navigate to: Group → Analytics → Value Stream
Track:
- Issue to deploy time
- Code review time
- Testing time
- Deployment time
Use for: Identifying bottlenecks that waste CI minutes
Usage Quotas Page
Navigate to: Group → Settings → Usage Quotas → Pipelines
Shows:
- Monthly compute usage
- Projects sorted by usage
- Storage usage (artifacts)
- Minutes consumed per day (graph)
Custom Dashboards with Prometheus/Grafana
GitLab Runner Metrics
Enable Prometheus metrics on runners:
config.toml:
listen_address = ":9252"
Prometheus scrape config:
    scrape_configs:
      - job_name: 'gitlab-runner'
        static_configs:
          - targets:
              - 'runner1.example.com:9252'
              - 'runner2.example.com:9252'
Key Metrics to Scrape
Runner metrics:
gitlab_runner_jobs{state="running"}
gitlab_runner_failed_jobs_total
gitlab_runner_job_duration_seconds
Project metrics (via API exporter):
gitlab_project_pipeline_duration_seconds
gitlab_project_pipeline_status
gitlab_ci_minutes_used
Grafana Dashboard
Panels to include:
- CI Minute Usage (Gauge)
    gitlab_ci_minutes_used / gitlab_ci_minutes_quota * 100
- Usage Trend (Graph)
    rate(gitlab_ci_minutes_used[1d])
- Top Projects (Table)
    topk(10, gitlab_ci_minutes_per_project)
- Pipeline Duration (Graph)
    avg(gitlab_project_pipeline_duration_seconds) by (project)
- Failure Rate (Gauge), computed from the runner's counters because rate() is only meaningful on counters, not on the gitlab_runner_jobs gauge
    sum(rate(gitlab_runner_failed_jobs_total[1h])) / sum(rate(gitlab_runner_jobs_total[1h])) * 100
Import dashboard: Grafana.com dashboard #12833 (GitLab Runner)
Cost Attribution Reports
By Team
Tag projects with team labels:
    # .gitlab-ci.yml
    variables:
      TEAM: "platform-engineering"
      COST_CENTER: "infrastructure"
Generate report:
    #!/bin/bash
    # cost-by-team.sh

    {
      # The header goes through the same pipe so Miller can group by column name
      echo "Team,Project,CI Minutes,Estimated Cost (USD)"
      for team in platform-engineering agent-team ossa-team; do
        # Get all projects for team (customize query)
        glab api "/groups/blueflyio/projects?search=$team" | \
          jq -r --arg team "$team" '.[] | "\($team),\(.name),\(.statistics.ci_minutes_used // 0)"' | \
          while IFS=, read -r team_name project minutes; do
            # $10 per 1,000 minutes; multiply first so bc does not truncate early
            cost=$(echo "scale=2; $minutes * 10 / 1000" | bc)
            echo "$team_name,$project,$minutes,$cost"
          done
      done
    } | mlr --csv stats1 -a sum -f 'CI Minutes,Estimated Cost (USD)' -g Team
By Project Type
Categorize projects:
    # Projects by category
    echo "Category,Projects,Total Minutes,Cost"

    for topic in frontend backend infrastructure; do
      # Fetch each category once and reuse the response
      projects=$(glab api "/groups/blueflyio/projects?topic=$topic")
      count=$(echo "$projects" | jq 'length')
      minutes=$(echo "$projects" | jq '[.[] | .statistics.ci_minutes_used // 0] | add // 0')
      # $10 per 1,000 minutes; multiply first so bc does not truncate early
      cost=$(echo "scale=2; $minutes * 10 / 1000" | bc)
      echo "$topic,$count,$minutes,\$$cost"
    done
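The cost estimate in these reports follows one formula: minutes times a rate per 1,000 minutes. With `bc` at `scale=2`, dividing before multiplying truncates early (`1234 / 1000 * 10` yields 12.30 rather than 12.34), so the sketch below multiplies first. The function name `minutes_to_cost` is hypothetical, and the default rate of 10 per 1,000 minutes is taken from the scripts in this section; substitute your actual rate.

```shell
#!/usr/bin/env bash
# minutes_to_cost MINUTES [RATE_PER_1000]: print the estimated cost with two
# decimal places. RATE_PER_1000 defaults to 10 (cost per 1,000 minutes).
minutes_to_cost() {
  local minutes=$1 rate_per_1000=${2:-10}
  awk -v m="$minutes" -v r="$rate_per_1000" \
    'BEGIN { printf "%.2f\n", m * r / 1000 }'
}
```

Keeping the rate in one place means a pricing change only needs to be made once across all reports.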
Optimization Impact Tracking
Before/After Comparison
Create baseline:
    # Save baseline before optimization
    date=$(date +%Y-%m)
    glab api /namespaces/:id/ci_minutes > "baseline-$date.json"

    echo "Baseline saved: baseline-$date.json"
    jq --arg month "$date" '{
      month: $month,
      used: .minutes_used,
      quota: .monthly_minutes_limit,
      percent: (.minutes_used / .monthly_minutes_limit * 100)
    }' "baseline-$date.json"
Track improvement:
    # After optimizations; $date must still hold the baseline month (YYYY-MM),
    # so set it again if this runs in a new shell
    date_new=$(date +%Y-%m)
    glab api /namespaces/:id/ci_minutes > "current-$date_new.json"

    # Compare
    echo "Optimization Impact Report"
    echo "=========================="

    OLD_USAGE=$(jq '.minutes_used' "baseline-$date.json")
    NEW_USAGE=$(jq '.minutes_used' "current-$date_new.json")
    SAVED=$((OLD_USAGE - NEW_USAGE))
    PERCENT_SAVED=$((100 * SAVED / OLD_USAGE))

    echo "Before: $OLD_USAGE minutes"
    echo "After: $NEW_USAGE minutes"
    echo "Saved: $SAVED minutes ($PERCENT_SAVED%)"
    # $10 per 1,000 minutes; multiply first so bc does not truncate early
    echo "Cost savings: \$$(echo "scale=2; $SAVED * 10 / 1000" | bc)"
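The before/after arithmetic can be isolated into a small function that also handles a regression (negative savings) and a missing baseline. A minimal sketch; the function name `savings` is hypothetical.

```shell
#!/usr/bin/env bash
# savings OLD NEW: print "minutes_saved percent_saved" comparing a baseline
# (OLD) with current usage (NEW). Negative values indicate usage increased.
# Returns non-zero when the baseline is not positive.
savings() {
  local old=$1 new=$2
  awk -v o="$old" -v n="$new" 'BEGIN {
    if (o <= 0) { print "baseline must be positive"; exit 1 }
    printf "%d %.1f\n", o - n, (o - n) * 100 / o
  }'
}
```

Because shell integer arithmetic rounds the percentage toward zero, using `awk` here keeps one decimal place, which matters when the improvement is only a few percent.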
A/B Testing Optimizations
Test optimization in one project:
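    # Feature flag for optimization: set ENABLE_OPTIMIZATION to "true" on the project under test.
    # interruptible and cache:when do not accept CI/CD variables, so the flag
    # selects between two variants of the job instead.
    test:
      rules:
        - if: $ENABLE_OPTIMIZATION != "true"
      script:
        - npm test  # placeholder test command

    test:optimized:
      rules:
        - if: $ENABLE_OPTIMIZATION == "true"
      interruptible: true
      cache:
        key: $CI_COMMIT_REF_SLUG
        paths:
          - node_modules/
      script:
        - npm test  # placeholder test command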
Compare metrics:
    # With optimization
    OPTIMIZED=$(glab api "/projects/:id/pipelines?per_page=50&variables[][key]=ENABLE_OPTIMIZATION&variables[][value]=true" | \
      jq '[.[] | select(.duration != null) | .duration] | add / length / 60')

    # Without optimization
    BASELINE=$(glab api "/projects/:id/pipelines?per_page=50&variables[][key]=ENABLE_OPTIMIZATION&variables[][value]=false" | \
      jq '[.[] | select(.duration != null) | .duration] | add / length / 60')

    echo "Baseline: $BASELINE minutes"
    echo "Optimized: $OPTIMIZED minutes"
    # Multiply before dividing so bc does not truncate the ratio to two decimals
    echo "Improvement: $(echo "scale=2; ($BASELINE - $OPTIMIZED) * 100 / $BASELINE" | bc)%"
Alerting and Notifications
Email Alerts
Built-in GitLab alerts:
- 75% quota: Warning email
- 95% quota: Critical email
- 100% quota: Exhausted email
Recipients: Namespace owners and maintainers
Custom Alerts
API-based monitoring:
    alert:high-usage:
      image: alpine:latest
      stage: .post
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      before_script:
        # alpine does not ship curl, jq, or bc
        - apk add --no-cache curl jq bc
      script:
        - |
          PERCENT=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/namespaces/$CI_PROJECT_NAMESPACE_ID/ci_minutes" | \
            jq '(.minutes_used / .monthly_minutes_limit) * 100')

          # bc prints 1 when the comparison holds; [ ] works in alpine's POSIX shell
          if [ "$(echo "$PERCENT > 80" | bc -l)" -eq 1 ]; then
            # Send email
            curl -X POST https://api.sendgrid.com/v3/mail/send \
              -H "Authorization: Bearer $SENDGRID_API_KEY" \
              -H "Content-Type: application/json" \
              -d '{
                "personalizations": [{
                  "to": [{"email": "ops@example.com"}],
                  "subject": "GitLab CI Minutes Alert"
                }],
                "from": {"email": "noreply@example.com"},
                "content": [{
                  "type": "text/plain",
                  "value": "CI minutes at '"$PERCENT"'% of quota"
                }]
              }'
          fi
PagerDuty Integration
Critical usage alert:
    alert:critical:
      image: alpine:latest
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      before_script:
        # alpine does not ship curl, jq, or bc
        - apk add --no-cache curl jq bc
      script:
        - |
          PERCENT=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/namespaces/$CI_PROJECT_NAMESPACE_ID/ci_minutes" | \
            jq '(.minutes_used / .monthly_minutes_limit) * 100')

          # bc prints 1 when the comparison holds; [ ] works in alpine's POSIX shell
          if [ "$(echo "$PERCENT > 95" | bc -l)" -eq 1 ]; then
            curl -X POST https://events.pagerduty.com/v2/enqueue \
              -H "Content-Type: application/json" \
              -d '{
                "routing_key": "'"$PAGERDUTY_KEY"'",
                "event_action": "trigger",
                "payload": {
                  "summary": "GitLab CI minutes at critical level",
                  "severity": "critical",
                  "source": "gitlab-ci-monitor",
                  "custom_details": {
                    "usage_percent": "'"$PERCENT"'",
                    "namespace": "'"$CI_PROJECT_NAMESPACE"'"
                  }
                }
              }'
          fi
Monthly Cost Reports
Automated Report Generation
Scheduled pipeline:
    report:monthly:
      image: alpine:latest
      stage: .post
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule" && $REPORT_TYPE == "monthly"
      before_script:
        - apk add --no-cache curl jq bc
      script:
        - |
          echo "# GitLab CI/CD Monthly Cost Report" > report.md
          echo "Generated: $(date)" >> report.md
          echo "" >> report.md

          # Overall usage
          USAGE=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/namespaces/$CI_PROJECT_NAMESPACE_ID/ci_minutes" | jq '.minutes_used')
          # $10 per 1,000 minutes; multiply first so bc does not truncate early
          COST=$(echo "scale=2; $USAGE * 10 / 1000" | bc)

          echo "## Summary" >> report.md
          echo "- Total Minutes: $USAGE" >> report.md
          echo "- Total Cost: \$$COST" >> report.md
          echo "" >> report.md

          # Top projects
          echo "## Top 10 Projects" >> report.md
          echo "| Project | Minutes | Cost |" >> report.md
          echo "|---------|---------|------|" >> report.md

          curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
            "$CI_API_V4_URL/groups/$CI_PROJECT_NAMESPACE_ID/usage_stats" | \
            jq -r '.projects[] | "\(.name)|\(.ci_minutes)"' | \
            sort -t'|' -k2 -rn | head -10 | \
            while IFS='|' read -r name minutes; do
              cost=$(echo "scale=2; $minutes * 10 / 1000" | bc)
              echo "| $name | $minutes | \$$cost |" >> report.md
            done

          # Recommendations
          echo "" >> report.md
          echo "## Recommendations" >> report.md

          if [ "$USAGE" -gt 40000 ]; then
            echo "- High usage detected. Review top projects for optimization opportunities." >> report.md
          fi

          cat report.md
      artifacts:
        paths:
          - report.md
        expire_in: 1 year
Schedule: 1st of each month at 9 AM, with the schedule variable REPORT_TYPE set to monthly so the job's rule matches
0 9 1 * *
Continuous Improvement
Weekly Review Checklist
    ## Weekly CI/CD Cost Review

    - [ ] Check current usage vs quota (target <70%)
    - [ ] Review top 5 projects by usage
    - [ ] Identify any unusual spikes
    - [ ] Check pipeline failure rate (target <10%)
    - [ ] Review cache hit rate (target >80%)
    - [ ] Look for optimization opportunities
    - [ ] Update team on findings
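The three numeric targets in the checklist (usage below 70%, failure rate below 10%, cache hit rate above 80%) can be checked mechanically once the metrics are collected. A minimal sketch, assuming whole-number percentages as input; the function names `_check` and `weekly_review` are hypothetical.

```shell
#!/usr/bin/env bash
# _check LABEL VALUE OP LIMIT: print PASS or FAIL for the test "VALUE OP LIMIT",
# where OP is a test operator such as -lt or -gt. Returns 1 on FAIL.
_check() {
  if [ "$2" "$3" "$4" ]; then
    echo "PASS $1: $2%"
  else
    echo "FAIL $1: $2%"
    return 1
  fi
}

# weekly_review USAGE_PCT FAILURE_PCT CACHE_HIT_PCT: evaluate the checklist
# targets and return non-zero if any target is missed.
weekly_review() {
  local status=0
  _check "usage vs quota" "$1" -lt 70 || status=1
  _check "pipeline failure rate" "$2" -lt 10 || status=1
  _check "cache hit rate" "$3" -gt 80 || status=1
  return "$status"
}
```

The non-zero exit status makes it possible to fail a scheduled job when a target is missed, so the review does not depend on someone remembering to look.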
Quarterly Optimization Sprint
Every quarter, dedicate time to:
- Deep dive on top 10 projects
  - Profile each pipeline
  - Identify optimization opportunities
  - Implement improvements
- Review pipeline patterns
  - Are best practices being followed?
  - Are components being reused?
  - Is caching configured correctly?
- Update documentation
  - Share learnings
  - Update guidelines
  - Create examples
- Track ROI
  - Measure time invested
  - Calculate minutes saved
  - Document success stories
Tools and Scripts
ci-cost-analyzer (Custom Tool)
Create analysis tool:
    #!/bin/bash
    # ci-cost-analyzer.sh

    NAMESPACE_ID="12345"
    API_URL="https://gitlab.com/api/v4"

    function usage_summary() {
      echo "CI/CD Cost Summary"
      echo "===================="

      USAGE=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "$API_URL/namespaces/$NAMESPACE_ID/ci_minutes" | jq '.minutes_used')
      QUOTA=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "$API_URL/namespaces/$NAMESPACE_ID/ci_minutes" | jq '.monthly_minutes_limit')
      # Multiply before dividing so bc does not truncate the ratio to two decimals
      PERCENT=$(echo "scale=2; $USAGE * 100 / $QUOTA" | bc)
      COST=$(echo "scale=2; $USAGE * 10 / 1000" | bc)

      echo "Used: $USAGE / $QUOTA minutes ($PERCENT%)"
      echo "Cost: \$$COST this month"
    }

    function top_projects() {
      echo ""
      echo "Top 10 Projects"
      echo "=================="

      curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "$API_URL/groups/$NAMESPACE_ID/usage_stats" | \
        jq -r '.projects[] | "\(.name):\t\(.ci_minutes) min"' | \
        sort -t: -k2 -rn | head -10
    }

    function optimization_tips() {
      echo ""
      echo "Optimization Tips"
      echo "==================="

      # Check for common issues: percentage of failed pipelines among those returned
      FAILED_RATE=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "$API_URL/groups/$NAMESPACE_ID/pipelines?per_page=100" | \
        jq 'if length == 0 then 0 else ([.[] | select(.status == "failed")] | length) * 100 / length | floor end')

      if [ "$FAILED_RATE" -gt 20 ]; then
        echo "High failure rate detected ($FAILED_RATE%). Consider:"
        echo "  - Pre-commit hooks"
        echo "  - Better local testing"
      fi

      # PERCENT is set by usage_summary, which must run first
      if (( $(echo "$PERCENT > 75" | bc -l) )); then
        echo "Usage above 75%. Consider:"
        echo "  - Auto-cancel redundant pipelines"
        echo "  - Aggressive caching"
        echo "  - Self-hosted runners"
      fi
    }

    # Main
    usage_summary
    top_projects
    optimization_tips
Usage:
    chmod +x ci-cost-analyzer.sh
    ./ci-cost-analyzer.sh
Next Steps
- Checklist - Daily/weekly cost monitoring tasks
- Tracking - Detailed usage tracking guide
- Strategies - Apply learnings to reduce costs