Pipeline Optimization for Cost Efficiency
Overview
Pipeline optimization combines multiple techniques to create fast, efficient CI/CD workflows that minimize compute minute consumption while maximizing developer productivity.
Rules vs only/except
Problem with only/except (Legacy)
    # OLD WAY - Less efficient
    test:
      only:
        - branches
      except:
        - main
      script:
        - npm test
Issues:
- Limited flexibility
- No complex conditions
- Can't combine multiple conditions
- Deprecated in favor of rules
Modern Approach: rules
    # NEW WAY - More efficient and flexible
    test:
      rules:
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
        - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
      script:
        - npm test
Performance Impact:
- Rules evaluated before job creation
- Prevents unnecessary job scheduling
- Reduces overhead on shared runners
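As a rough sketch of the flexibility described above, a single rule can require both a pipeline source and a set of changed paths. The job name docs:lint, the docs/ path, and the lint:docs npm script are hypothetical and only serve the illustration.

```yaml
# Hypothetical job: added only to MR pipelines that touch files under docs/
docs:lint:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - "docs/**/*"
  script:
    - npm run lint:docs  # assumed npm script, not part of the original examples
```

Both conditions must hold for the rule to match, so the job is never created for unrelated merge requests.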
Complex Rules
    deploy:
      rules:
        # Deploy on main branch (on_success: only if earlier jobs passed;
        # when: always would deploy even after a failed test)
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
          when: on_success
        # Manual deploy on release branches
        - if: $CI_COMMIT_BRANCH =~ /^release\//
          when: manual
        # Never deploy on feature branches
        - when: never
      script:
        - npm run deploy
Interruptible Jobs
Concept
Allow jobs to be canceled when they become obsolete (new commit pushed).
Configuration
Enable globally:
    default:
      interruptible: true
Or per-job:
    test:
      interruptible: true
      script:
        - npm test

    deploy:
      interruptible: false  # Never interrupt deployments!
      script:
        - npm run deploy
Best Practices
Always interruptible:
- Lint/format checks
- Unit tests
- Integration tests
- Build jobs (if no side effects)
Never interruptible:
- Deployments
- Database migrations
- Publishing packages
- Creating releases
Advanced - Selective Cancellation
    workflow:
      auto_cancel:
        on_new_commit: interruptible  # Only cancel interruptible jobs
        on_job_failure: all           # Cancel all on failure
Options:
- interruptible: Only cancel jobs marked interruptible: true
- conservative: Only cancel if NO non-interruptible jobs started
- none: Don't auto-cancel
Example Impact:
10:00 - Push commit A -> Pipeline A starts (20 min)
10:05 - Push commit B -> Pipeline A canceled (saved 15 min)
10:05 - Push commit B -> Pipeline B starts (20 min)
10:08 - Push commit C -> Pipeline B canceled (saved 17 min)
Savings: 32 minutes from 2 canceled pipelines
Fail Fast Patterns
Problem
Waiting for all jobs to complete when early failure makes success impossible.
Pattern 1: Job Dependencies with needs
    stages:
      - validate
      - test
      - build
      - deploy

    # Fast validation first
    lint:
      stage: validate
      script:
        - npm run lint  # If this fails, the jobs that need it are skipped

    type-check:
      stage: validate
      script:
        - npm run type-check

    # Tests depend on validation passing
    test:unit:
      stage: test
      needs: [lint, type-check]
      script:
        - npm test

    test:integration:
      stage: test
      needs: [lint, type-check]
      script:
        - npm run test:integration

    # Build depends on tests
    build:
      stage: build
      needs: ["test:unit", "test:integration"]
      script:
        - npm run build

    # Deploy depends on build
    deploy:
      stage: deploy
      needs: [build]
      script:
        - npm run deploy
Flow:
lint (fail) -> test:unit (blocked)        -> build (blocked) -> deploy (blocked)
            -> test:integration (blocked)
Pipeline stops at 2 minutes instead of running 20 minutes
Pattern 2: Auto-Cancel on Failure
    workflow:
      auto_cancel:
        on_job_failure: all  # Cancel all remaining jobs

    default:
      interruptible: true

    lint:
      script:
        - npm run lint  # If fails, everything else cancels

    test:parallel:
      parallel: 10
      script:
        - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL
Benefit: If lint fails after 30 seconds, the 10 parallel test jobs (5 min each) are canceled, saving up to about 50 compute minutes.
Pattern 3: Early Exit in Scripts
    test:
      script:
        # Run fast checks first, exit immediately on failure
        - npm run lint || exit 1
        - npm run type-check || exit 1
        - npm run security-check || exit 1
        # Only run slow tests if fast checks pass
        - npm run test:unit
        - npm run test:integration
        - npm run test:e2e
Pattern 4: Allow Failure for Non-Critical Jobs
    # Critical jobs
    lint:
      script:
        - npm run lint

    # Non-critical - don't block pipeline
    security-scan:
      allow_failure: true  # Pipeline continues if this fails
      script:
        - npm audit

    deploy:
      needs: [lint]  # Only depends on critical jobs
      script:
        - npm run deploy
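A narrower variant, sketched here under assumptions, tolerates only specific exit codes instead of every failure. The scanner script path and the meaning of exit code 2 are hypothetical; check what your own tool returns before copying this.

```yaml
security-scan:
  script:
    - ./scripts/scan.sh  # hypothetical wrapper; assumed to exit 2 for warnings only
  allow_failure:
    exit_codes:
      - 2  # tolerated: warnings; any other non-zero code still fails the job
```

This keeps genuine scanner errors visible while letting advisory findings pass.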
Pipeline Timeout Settings
Global Timeout
Problem: Jobs that hang consume minutes until global timeout (1 hour default).
Solution - Set Project Timeout:
Navigate to: Project Settings > CI/CD > General pipelines > Timeout
Recommended: 30 minutes (projects rarely need more)
Job-Specific Timeouts
    # Fast jobs - aggressive timeout
    lint:
      timeout: 5m  # Should complete in <1 min
      script:
        - npm run lint

    # Medium jobs
    test:
      timeout: 15m  # Should complete in <10 min
      script:
        - npm test

    # Long jobs
    test:e2e:
      timeout: 30m  # Can take 20+ min
      script:
        - npm run test:e2e

    # Deployments
    deploy:
      timeout: 1h  # May need more time
      script:
        - npm run deploy
Benefits:
- Catch hung jobs faster
- Reduce wasted minutes
- Force optimization of slow jobs
Timeout Strategy
Set timeout to 1.5x normal duration:
Normal: 10 minutes
Timeout: 15 minutes
If job hits 15 min, investigate why
Job Retry Strategies
Problem
Jobs that fail transiently retry multiple times, consuming extra minutes.
Default Behavior
No configuration = No retries
Smart Retry Configuration
    # Retry transient failures only
    test:
      retry:
        max: 2  # Retry up to 2 times
        when:
          - runner_system_failure
          - stuck_or_timeout_failure
          - unknown_failure
          - api_failure
          - runner_unsupported
      script:
        - npm test
When NOT to Retry
    # Don't retry deterministic failures
    lint:
      retry: 0  # Code errors won't fix themselves
      script:
        - npm run lint

    build:
      retry: 0  # Build errors need code changes
      script:
        - npm run build
When TO Retry
    # Network-dependent operations
    deploy:
      retry:
        max: 3
        when:
          - runner_system_failure
          - stuck_or_timeout_failure
          - script_failure  # Network timeouts
      script:
        - npm run deploy

    # Flaky E2E tests
    test:e2e:
      retry:
        max: 1  # One retry for flakes
        when:
          - script_failure
      script:
        - npm run test:e2e
Cost Analysis:
Job duration: 10 minutes
Retries: 2
Failure rate: 30%
Expected cost:
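- Success on first attempt (70%): 10 min
- Fail + Retry + Success (0.3 × 0.7 = 21%): 20 min
- Fail + Retry + Fail + Retry (0.3 × 0.3 = 9%, whether the last attempt passes or fails): 30 min
Average: 10 × 0.7 + 20 × 0.21 + 30 × 0.09 = 13.9 min (assuming each attempt fails independently and runs the full 10 minutes)
vs No retries: 10 min with 30% failure rate
Rule: Only retry when the cost of manual intervention exceeds the expected cost of the retries.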
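The same estimate can be written as a general formula. This is a simplified model, not a GitLab billing rule: it assumes every attempt runs for the full job duration d and fails independently with probability p, with at most r retries. The symbols d, p, and r are introduced here for illustration, and the worked numbers use a 10% failure rate as a second data point.

```latex
E[\text{cost}] = d \sum_{k=0}^{r} p^{k}
\qquad\text{e.g. } d = 10,\; p = 0.1,\; r = 2:\quad
E[\text{cost}] = 10\,(1 + 0.1 + 0.01) = 11.1 \text{ min}
```

The term p^k is the probability that attempt k + 1 runs at all, so the sum is the expected number of attempts. The chance that the job still fails after all retries is p^(r+1), here 0.1%.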
Merge Request Pipelines vs Branch Pipelines
Problem: Duplicate Pipelines
Push to MR branch triggers TWO pipelines:
- Branch pipeline (on push)
- MR pipeline (on MR event)
Result: 2x compute minutes
Solution: MR Pipelines Only
    workflow:
      rules:
        # Only run MR pipelines
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
        # Run on default branch
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
        # Run on tags
        - if: $CI_COMMIT_TAG
        # Skip all other branch pipelines
Alternative - Combined:
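    workflow:
      rules:
        # For MR branches, only run MR pipeline
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
        # Skip the branch pipeline when the branch already has an open MR
        # (without this rule, both pipelines would still run)
        - if: $CI_COMMIT_BRANCH && $CI_OPEN_MERGE_REQUESTS
          when: never
        # For non-MR branches, run branch pipeline
        - if: $CI_COMMIT_BRANCH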
Merged Results Pipelines (Premium/Ultimate)
Test code AS IF already merged to target branch:
Enable: Project Settings > Merge requests > Merged results pipelines
Benefits:
- Catch merge conflicts early
- Test with target branch changes
- No duplicate runs
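    workflow:
      rules:
        # Merged results pipelines report "merge_request_event" as their source;
        # "merged_result_event" is not a pipeline source. To tell them apart,
        # check $CI_MERGE_REQUEST_EVENT_TYPE == "merged_result".
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"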
Savings: 30-50% by preventing duplicates
Parallel Execution Optimization
Pattern 1: Parallel Jobs
Split work across multiple jobs:
    test:
      parallel: 5  # Run 5 instances
      script:
        - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL
Cost Impact:
Sequential: 50 minutes (1 job)
Parallel (5x): 10 minutes duration, 50 minutes cost (5 jobs × 10 min)
Duration: 80% faster
Cost: Same (plus a small per-job startup overhead)
Developer time saved: 40 minutes per pipeline
Pattern 2: Matrix Builds
    test:
      parallel:
        matrix:
          - NODE_VERSION: ["18", "20", "22"]
            OS: ["linux", "windows"]
      image: node:${NODE_VERSION}
      tags:
        - ${OS}
      script:
        - npm test
Result: 6 jobs (3 versions × 2 OSes)
Cost: 6x job duration (but necessary for compatibility)
Pattern 3: Strategic Parallelization
DON'T parallelize short jobs:
    # BAD - Overhead exceeds benefit
    lint:
      parallel: 3  # Job takes 30 seconds, overhead is 20 seconds
      script:
        - npm run lint
DO parallelize long jobs:
    # GOOD - Clear benefit
    test:e2e:
      parallel: 10  # Job takes 30 minutes -> 3 minutes parallel
      script:
        - npm run test:e2e -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL
Rule: Only parallelize jobs >5 minutes
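A back-of-the-envelope model makes the rule concrete. It is an approximation under stated assumptions: T is the useful work of the job, N the number of parallel instances, and o a fixed per-job overhead (runner startup, checkout, dependency install). These symbols are introduced here and do not come from GitLab.

```latex
\text{duration} \approx \frac{T}{N} + o
\qquad
\text{cost} \approx T + N \cdot o
```

For the lint example (T = 30 s, o = 20 s, N = 3), duration drops from 50 s to about 30 s while cost rises from 50 s to about 90 s. For the E2E example (T = 30 min, o = 20 s, N = 10), duration drops to roughly 3 min 20 s for about 3 extra compute minutes, which is a far better trade.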
Conditional Job Execution
Pattern 1: File Changes
    # Only test affected services
    test:api:
      rules:
        - changes:
            - "services/api/**/*"
            - "shared/lib/**/*"
      script:
        - cd services/api && npm test

    test:ui:
      rules:
        - changes:
            - "services/ui/**/*"
            - "shared/components/**/*"
      script:
        - cd services/ui && npm test
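In branch pipelines, changes can match more than expected, for example on the first push of a new branch. One possible refinement, sketched here, pins the comparison to a fixed ref with compare_to. The ref refs/heads/main is an assumption; substitute your default branch.

```yaml
test:api:
  rules:
    - if: $CI_COMMIT_BRANCH
      changes:
        paths:
          - "services/api/**/*"
          - "shared/lib/**/*"
        compare_to: "refs/heads/main"  # assumption: the default branch is main
  script:
    - cd services/api && npm test
```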
Pattern 2: Schedule-Specific Jobs
    # Only run in nightly builds
    test:stress:
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      script:
        - npm run test:stress

    # Don't run expensive tests in MRs
    test:e2e:full:
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
        - if: $CI_PIPELINE_SOURCE == "schedule"
      script:
        - npm run test:e2e:full
Pattern 3: Manual Gates
    # Expensive operation - manual trigger
    performance-test:
      when: manual
      allow_failure: true  # Don't block pipeline
      script:
        - npm run test:performance

    # Auto-deploy to staging, manual to prod
    deploy:staging:
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      script:
        - npm run deploy:staging

    deploy:production:
      when: manual
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      needs: ["deploy:staging"]
      script:
        - npm run deploy:production
Stage Optimization
Problem: Sequential Stages
    stages:
      - build
      - test
      - deploy

    build:
      stage: build
      script: sleep 600  # 10 min

    test:
      stage: test
      script: sleep 600  # 10 min (waits for build)

    deploy:
      stage: deploy
      script: sleep 600  # 10 min (waits for test)

    # Total: 30 minutes
Solution: Directed Acyclic Graph (DAG) with needs
    build:
      script: sleep 600  # 10 min

    test:unit:
      needs: [build]
      script: sleep 600  # Starts immediately after build

    test:integration:
      needs: [build]
      script: sleep 600  # Parallel with test:unit

    deploy:
      needs: ["test:unit", "test:integration"]
      script: sleep 600

    # Total: 30 minutes (build + max(test:unit, test:integration) + deploy),
    # versus 40 minutes if the two test jobs ran in separate sequential stages
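The benefit of needs is largest when a stage contains one slow job that most downstream jobs do not depend on. The following sketch uses hypothetical job names and sleep durations as stand-ins for real work.

```yaml
stages:
  - build
  - test

build:frontend:
  stage: build
  script: sleep 600  # 10 min (stand-in for a slow build)

build:backend:
  stage: build
  script: sleep 120  # 2 min

test:backend:
  stage: test
  needs: ["build:backend"]  # starts at minute 2 instead of minute 10
  script: sleep 480  # 8 min

test:frontend:
  stage: test
  needs: ["build:frontend"]
  script: sleep 300  # 5 min

# Stage ordering only: 10 + max(8, 5) = 18 minutes
# With needs:          max(2 + 8, 10 + 5) = 15 minutes
```

Without needs, test:backend would wait for the whole build stage, so the slow frontend build would delay backend feedback by 8 minutes.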
Best Practice: Minimal Stages
Instead of:
    stages:
      - lint
      - type-check
      - unit-test
      - integration-test
      - build
      - deploy
Use:
    stages:
      - validate
      - test
      - deploy

    # With needs to control order
    lint:
      stage: validate
      script:
        - npm run lint

    test:unit:
      stage: test
      needs: [lint]
      script:
        - npm test

    test:integration:
      stage: test
      needs: [lint]
      script:
        - npm run test:integration

    deploy:
      stage: deploy
      needs: ["test:unit", "test:integration"]
      script:
        - npm run deploy
Resource Efficiency
Pattern 1: Shared Setup Job
Problem: Every job installs dependencies
    # INEFFICIENT
    test:unit:
      script:
        - npm ci    # 5 minutes
        - npm test  # 2 minutes

    test:integration:
      script:
        - npm ci                    # 5 minutes (duplicate!)
        - npm run test:integration  # 3 minutes

    # Total: 15 minutes
Solution:
    # EFFICIENT
    install:
      stage: .pre
      script:
        - npm ci
      artifacts:
        paths:
          - node_modules/
        expire_in: 1 hour

    test:unit:
      script:
        - npm test  # 2 minutes

    test:integration:
      script:
        - npm run test:integration  # 3 minutes

    # Total: 10 compute minutes (5 + 2 + 3); wall-clock 8 minutes (5 + max(2, 3))
Pattern 2: Artifact Expiration
Don't keep artifacts longer than needed:
    build:
      script:
        - npm run build
      artifacts:
        paths:
          - dist/
        expire_in: 1 day  # Not 30 days!
Cost:
- Storage cost (artifact storage quota)
- Download time in downstream jobs
Guideline:
- Build artifacts: 1 day
- Test results: 7 days
- Release artifacts: 30 days or never expire
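The guideline above could translate into configuration along these lines. This is a sketch: the junit.xml report path and the release job are assumptions, and the durations simply mirror the list.

```yaml
test:
  script:
    - npm test
  artifacts:
    when: always        # keep the report even when tests fail
    reports:
      junit: junit.xml  # assumed report path produced by the test runner
    expire_in: 7 days

release:
  rules:
    - if: $CI_COMMIT_TAG
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 30 days
```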
Environment-Specific Optimization
Development Branches
    # Minimal testing on feature branches
    test:dev:
      rules:
        - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
      script:
        - npm run test:unit  # Fast tests only
Main Branch
    # Comprehensive testing on main
    test:main:
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      script:
        - npm test  # Full test suite
        - npm run test:integration
        - npm run test:e2e
Scheduled Pipelines
    # Extensive testing in nightly builds
    test:nightly:
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      script:
        - npm run test:all
        - npm run test:performance
        - npm run test:stress
Complete Optimized Pipeline Example
    # Global configuration
    workflow:
      auto_cancel:
        on_new_commit: interruptible
        on_job_failure: all
      rules:
        # Skip draft MRs
        - if: $CI_MERGE_REQUEST_TITLE =~ /^Draft:/
          when: never
        # Only MR pipelines for MRs
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
        # Branch pipelines for main
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
        # Tag pipelines
        - if: $CI_COMMIT_TAG

    default:
      interruptible: true
      image: node:20-alpine
      cache:
        # One shared key (no per-job prefix) so every job pulls the cache lint pushes
        key:
          files:
            - package-lock.json
        # npm ci deletes node_modules before installing, so cache npm's download cache
        paths:
          - .npm/
        policy: pull
      retry:
        max: 1
        when:
          - runner_system_failure

    stages:
      - validate
      - test
      - build
      - deploy

    # Fast validation
    lint:
      stage: validate
      timeout: 5m
      # A job-level cache replaces the default entirely, so key and paths are repeated
      cache:
        key:
          files:
            - package-lock.json
        paths:
          - .npm/
        policy: pull-push
      script:
        - npm ci --cache .npm --prefer-offline
        - npm run lint

    type-check:
      stage: validate
      timeout: 5m
      needs: []
      script:
        - npm ci --cache .npm --prefer-offline
        - npm run type-check

    # Conditional tests
    test:unit:
      stage: test
      timeout: 10m
      needs: [lint, type-check]
      rules:
        - changes:
            - "src/**/*"
            - "tests/**/*"
            - "package*.json"
      parallel: 3
      script:
        - npm ci --cache .npm --prefer-offline
        - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

    test:integration:
      stage: test
      timeout: 15m
      tags:
        - self-hosted  # Long test on self-hosted runner
      needs: [lint]
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
        - if: $CI_PIPELINE_SOURCE == "schedule"
        - if: $CI_MERGE_REQUEST_IID
          changes:
            - "src/**/*"
          when: manual
          allow_failure: true
      script:
        - npm ci --cache .npm --prefer-offline
        - npm run test:integration

    # Build
    build:
      stage: build
      timeout: 10m
      needs:
        # optional: test:unit is absent when its rules:changes do not match
        - job: test:unit
          optional: true
      script:
        - npm ci --cache .npm --prefer-offline
        - npm run build
      artifacts:
        paths:
          - dist/
        expire_in: 1 day

    # Deploy (not interruptible)
    deploy:staging:
      stage: deploy
      interruptible: false
      timeout: 20m
      needs: [build]
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      script:
        - npm run deploy:staging

    deploy:production:
      stage: deploy
      interruptible: false
      timeout: 20m
      needs: ["deploy:staging"]
      when: manual
      rules:
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      script:
        - npm run deploy:production
Measuring Pipeline Efficiency
Key Metrics
Pipeline Duration:
Duration = End time - Start time
Compute Minutes:
Compute Minutes = Σ(Job Duration × Cost Factor)
Efficiency Ratio:
Efficiency = Pipeline Duration / Compute Minutes
Lower is better (more parallelization)
Example:
Pipeline Duration: 15 minutes
Compute Minutes: 45 minutes
Efficiency: 0.33 (good - 3x parallelization)
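One pipeline shape that would produce these numbers, given purely as an illustration: a 5-minute lint job followed by four 10-minute test shards running in parallel, all on runners with a cost factor of 1. The job mix and the symbols t_j (job duration) and c_j (cost factor) are assumptions introduced here.

```latex
\text{Duration} = 5 + 10 = 15 \text{ min}
\qquad
\text{Compute Minutes} = \sum_{j} t_j\, c_j = 5 \cdot 1 + 4 \times (10 \cdot 1) = 45
\qquad
\text{Efficiency} = \frac{15}{45} \approx 0.33
```

A fully sequential pipeline has a ratio of 1.0, so the ratio shrinks as more of the work overlaps in time.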
Targets
| Metric | Target |
|---|---|
| MR Pipeline Duration | <15 min |
| Main Pipeline Duration | <30 min |
| Efficiency Ratio | <0.5 |
| Failed Job Rate | <5% |
| Cache Hit Rate | >80% |
Next Steps
- Pre-Push Validation - Catch errors before CI
- Monitoring - Track pipeline performance
- Checklist - Daily optimization checks