Pipeline Optimization for Cost Efficiency

Overview

Pipeline optimization combines multiple techniques to create fast, efficient CI/CD workflows that minimize compute minute consumption while maximizing developer productivity.


Rules vs only/except

Problem with only/except (Legacy)

# OLD WAY - Less efficient
test:
  only:
    - branches
  except:
    - main
  script:
    - npm test

Issues:

  • Limited flexibility
  • No complex conditions
  • Can't combine multiple conditions
  • No longer actively developed; rules is the preferred replacement

Modern Approach: rules

# NEW WAY - More efficient and flexible
test:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
  script:
    - npm test

Performance Impact:

  • Rules evaluated before job creation
  • Prevents unnecessary job scheduling
  • Reduces overhead on shared runners

Complex Rules

deploy:
  rules:
    # Deploy on main branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: on_success # Unlike always, this does not deploy after an earlier stage has failed
    # Manual deploy on release branches
    - if: $CI_COMMIT_BRANCH =~ /^release\//
      when: manual
    # Never deploy on feature branches
    - when: never
  script:
    - npm run deploy

Interruptible Jobs

Concept

Allow jobs to be canceled when they become obsolete (new commit pushed).

Configuration

Enable globally:

default:
  interruptible: true

Or per-job:

test:
  interruptible: true
  script:
    - npm test

deploy:
  interruptible: false # Never interrupt deployments!
  script:
    - npm run deploy

Best Practices

Always interruptible:

  • Lint/format checks
  • Unit tests
  • Integration tests
  • Build jobs (if no side effects)

Never interruptible:

  • Deployments
  • Database migrations
  • Publishing packages
  • Creating releases

Advanced - Selective Cancellation

workflow:
  auto_cancel:
    on_new_commit: interruptible # Only cancel interruptible jobs
    on_job_failure: all # Cancel all on failure

Options for on_new_commit:

  • interruptible: Only cancel jobs marked interruptible: true
  • conservative (default): Cancel the pipeline only if no job with interruptible: false has started yet
  • none: Don't auto-cancel

Example Impact:

10:00 - Push commit A → Pipeline A starts (20 min)
10:05 - Push commit B → Pipeline A canceled (saved 15 min)
10:05 - Push commit B → Pipeline B starts (20 min)
10:08 - Push commit C → Pipeline B canceled (saved 17 min)

Savings: 32 minutes from 2 canceled pipelines
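
The arithmetic behind this timeline can be checked with a short Python sketch. It assumes each pipeline is a single 20-minute job, so that wall-clock time and compute minutes coincide; the function name `minutes_saved` is made up for this illustration.

```python
from datetime import datetime


def minutes_saved(started: str, canceled: str, full_duration_min: int) -> int:
    """Minutes a pipeline did not consume because it was canceled early."""
    fmt = "%H:%M"
    elapsed = datetime.strptime(canceled, fmt) - datetime.strptime(started, fmt)
    elapsed_min = int(elapsed.total_seconds() // 60)
    return max(full_duration_min - elapsed_min, 0)


saved_a = minutes_saved("10:00", "10:05", 20)  # pipeline A ran 5 of 20 minutes
saved_b = minutes_saved("10:05", "10:08", 20)  # pipeline B ran 3 of 20 minutes
print(saved_a, saved_b, saved_a + saved_b)  # 15 17 32
```

With several jobs running in parallel, the compute minutes saved would be higher than the wall-clock figure.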

Fail Fast Patterns

Problem

Waiting for all jobs to complete when early failure makes success impossible.

Pattern 1: Job Dependencies with needs

stages:
  - validate
  - test
  - build
  - deploy

# Fast validation first
lint:
  stage: validate
  script:
    - npm run lint # A non-zero exit fails the job and blocks every job that needs it

type-check:
  stage: validate
  script:
    - npm run type-check

# Tests depend on validation passing
test:unit:
  stage: test
  needs: [lint, type-check]
  script:
    - npm test

test:integration:
  stage: test
  needs: [lint, type-check]
  script:
    - npm run test:integration

# Build depends on tests
build:
  stage: build
  needs: [test:unit, test:integration]
  script:
    - npm run build

# Deploy depends on build
deploy:
  stage: deploy
  needs: [build]
  script:
    - npm run deploy

Flow:

lint (fail) → test:unit (blocked)        → build (blocked) → deploy (blocked)
            → test:integration (blocked)

Pipeline stops after about 2 minutes instead of running for 20 minutes

Pattern 2: Auto-Cancel on Failure

workflow:
  auto_cancel:
    on_job_failure: all # Cancel all remaining jobs

default:
  interruptible: true

lint:
  script:
    - npm run lint # If fails, everything else cancels

test:parallel:
  parallel: 10
  script:
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Benefit: If lint fails after 30 seconds, the 10 parallel test jobs (5 min each) are canceled almost immediately, saving close to 50 compute minutes.

Pattern 3: Early Exit in Scripts

test:
  script:
    # Run fast checks first, exit immediately on failure
    - npm run lint || exit 1
    - npm run type-check || exit 1
    - npm run security-check || exit 1
    # Only run slow tests if fast checks pass
    - npm run test:unit
    - npm run test:integration
    - npm run test:e2e

Pattern 4: Allow Failure for Non-Critical Jobs

# Critical jobs
lint:
  script:
    - npm run lint

# Non-critical - don't block pipeline
security-scan:
  allow_failure: true # Pipeline continues if this fails
  script:
    - npm audit

deploy:
  needs: [lint] # Only depends on critical jobs
  script:
    - npm run deploy

Pipeline Timeout Settings

Global Timeout

Problem: Jobs that hang consume minutes until global timeout (1 hour default).

Solution - Set Project Timeout:

Navigate to: Project Settings → CI/CD → General pipelines → Timeout

Recommended: 30 minutes (projects rarely need more)

Job-Specific Timeouts

# Fast jobs - aggressive timeout
lint:
  timeout: 5m # Should complete in <1 min
  script:
    - npm run lint

# Medium jobs
test:
  timeout: 15m # Should complete in <10 min
  script:
    - npm test

# Long jobs
test:e2e:
  timeout: 30m # Can take 20+ min
  script:
    - npm run test:e2e

# Deployments
deploy:
  timeout: 1h # May need more time
  script:
    - npm run deploy

Benefits:

  • Catch hung jobs faster
  • Reduce wasted minutes
  • Force optimization of slow jobs

Timeout Strategy

Set timeout to 1.5x normal duration:

Normal: 10 minutes
Timeout: 15 minutes

If job hits 15 min, investigate why
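
The 1.5x rule is simple enough to apply by hand, but a small Python helper makes it easy to run over a list of measured job durations. This is only a sketch: `suggested_timeout` is a hypothetical name, and rounding up to a whole minute is an assumption of this example rather than something the rule prescribes.

```python
import math


def suggested_timeout(normal_minutes: float, factor: float = 1.5) -> int:
    """Timeout in whole minutes: the normal duration scaled by a safety factor, rounded up."""
    return math.ceil(normal_minutes * factor)


print(suggested_timeout(10))  # 15
print(suggested_timeout(1))   # 2
```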

Job Retry Strategies

Problem

Jobs that fail transiently retry multiple times, consuming extra minutes.

Default Behavior

No configuration = No retries

Smart Retry Configuration

# Retry transient failures only
test:
  retry:
    max: 2 # Retry up to 2 times
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - unknown_failure
      - api_failure
      - runner_unsupported
  script:
    - npm test

When NOT to Retry

# Don't retry deterministic failures
lint:
  retry: 0 # Code errors won't fix themselves
  script:
    - npm run lint

build:
  retry: 0 # Build errors need code changes
  script:
    - npm run build

When TO Retry

# Network-dependent operations
deploy:
  retry:
    max: 2 # GitLab accepts 0, 1, or 2 here
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - script_failure # Network timeouts
  script:
    - npm run deploy

# Flaky E2E tests
test:e2e:
  retry:
    max: 1 # One retry for flakes
    when:
      - script_failure
  script:
    - npm run test:e2e

Cost Analysis:

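Job duration: 10 minutes
Retries: 2
Failure rate: 30% per attempt (attempts assumed independent)

Expected cost:
- Success on first attempt (70%): 10 min
- Fail + Retry + Success (0.3 × 0.7 = 21%): 20 min
- Fail + Retry + Fail + Retry (0.3 × 0.3 = 9%): 30 min, whether or not the last attempt succeeds

Average: 10 × 0.7 + 20 × 0.21 + 30 × 0.09 = 13.9 min, with a 2.7% final failure rate

vs No retries: 10 min with 30% failure rate

Rule: Only retry when the cost of manual intervention exceeds the cost of the retries.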
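
The expected cost can be generalized to any duration, failure rate, and retry limit. The Python sketch below assumes every attempt fails independently with the same probability, which real flaky jobs only approximate; the function names are invented for this illustration.

```python
def expected_minutes(duration: float, failure_rate: float, max_retries: int) -> float:
    """Expected compute minutes per job when each attempt fails independently."""
    expected = 0.0
    reach_probability = 1.0  # probability that a given attempt is executed at all
    for _ in range(max_retries + 1):
        expected += reach_probability * duration
        reach_probability *= failure_rate  # the next attempt runs only after a failure
    return expected


def final_failure_rate(failure_rate: float, max_retries: int) -> float:
    """Probability that the job still fails after all retries are used."""
    return failure_rate ** (max_retries + 1)


print(f"{expected_minutes(10, 0.3, 2):.1f} min")  # 13.9 min
print(f"{final_failure_rate(0.3, 2):.1%}")        # 2.7%
```

Under these assumptions, two retries cost about 39% more compute minutes and cut the final failure rate from 30% to under 3%.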


Merge Request Pipelines vs Branch Pipelines

Problem: Duplicate Pipelines

A push to a branch with an open MR can trigger TWO pipelines when the configuration allows both types:

  1. Branch pipeline (on push)
  2. MR pipeline (on MR event)

Result: 2x compute minutes

Solution: MR Pipelines Only

workflow:
  rules:
    # Only run MR pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Run on default branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Run on tags
    - if: $CI_COMMIT_TAG
    # Skip all other branch pipelines

Alternative - Combined:

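workflow:
  rules:
    # For MR branches, only run MR pipeline
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Skip the branch pipeline when the branch already has an open MR
    - if: $CI_COMMIT_BRANCH && $CI_OPEN_MERGE_REQUESTS
      when: never
    # For non-MR branches, run branch pipeline
    - if: $CI_COMMIT_BRANCH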

Merged Results Pipelines (Premium/Ultimate)

Test code AS IF already merged to target branch:

Enable: Project Settings → Merge requests → Merged results pipelines

Benefits:

  • Catch merge conflicts early
  • Test with target branch changes
  • No duplicate runs

workflow:
  rules:
    # Merged results pipelines also report "merge_request_event" as their source;
    # $CI_MERGE_REQUEST_EVENT_TYPE == "merged_result" tells them apart
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

Savings: 30-50% by preventing duplicates


Parallel Execution Optimization

Pattern 1: Parallel Jobs

Split work across multiple jobs:

test:
  parallel: 5 # Run 5 instances
  script:
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Cost Impact:

Sequential: 50 minutes (1 job)
Parallel (5x): 10 minutes duration, 50 minutes cost (5 jobs × 10 min)

Duration: 80% faster
Cost: Same
Developer time saved: 40 minutes per pipeline

Pattern 2: Matrix Builds

test:
  parallel:
    matrix:
      - NODE_VERSION: ["18", "20", "22"]
        OS: ["linux", "windows"]
  image: node:${NODE_VERSION}
  tags:
    - ${OS}
  script:
    - npm test

Result: 6 jobs (3 versions × 2 OSes)

Cost: 6x job duration (but necessary for compatibility)
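
Because the job count is the product of every list in the matrix, it grows quickly as dimensions are added. The following Python sketch enumerates the combinations for the matrix above; it mirrors how the expansion works conceptually and is not GitLab's own implementation.

```python
from itertools import product

matrix = {
    "NODE_VERSION": ["18", "20", "22"],
    "OS": ["linux", "windows"],
}


def expand(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one variable set per generated job (Cartesian product of all values)."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


jobs = expand(matrix)
print(len(jobs))  # 6
for job in jobs:
    print(job)
```

Adding a third dimension with two values would double the count to 12 jobs, and the compute cost with it.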

Pattern 3: Strategic Parallelization

DON'T parallelize short jobs:

# BAD - Overhead exceeds benefit
lint:
  parallel: 3 # Job takes 30 seconds, overhead is 20 seconds
  script:
    - npm run lint

DO parallelize long jobs:

# GOOD - Clear benefit
test:e2e:
  parallel: 10 # Job takes 30 minutes → 3 minutes parallel
  script:
    - npm run test:e2e -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Rule: Only parallelize jobs >5 minutes
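
The trade-off behind this rule can be put into numbers. The Python sketch below assumes the work splits evenly across jobs and that every job pays the same fixed startup overhead (runner pickup, image pull, dependency install); the 20-second overhead comes from the lint example, and applying it to the E2E job as well is an assumption of this illustration.

```python
def parallel_profile(work_s: float, jobs: int, overhead_s: float) -> tuple[float, float]:
    """Return (wall-clock seconds, total compute seconds) for an even split across jobs."""
    duration = work_s / jobs + overhead_s  # every job pays the overhead once
    return duration, duration * jobs


# Lint: 30 s of work, 20 s overhead per job
print(parallel_profile(30, 1, 20))   # (50.0, 50.0)
print(parallel_profile(30, 3, 20))   # (30.0, 90.0)

# E2E: 30 min of work, 20 s overhead per job
print(parallel_profile(1800, 1, 20))   # (1820.0, 1820.0)
print(parallel_profile(1800, 10, 20))  # (200.0, 2000.0)
```

Splitting lint three ways saves 20 seconds of waiting but nearly doubles its compute cost, while splitting the E2E job ten ways cuts the wait by about 89% for roughly 10% more compute.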


Conditional Job Execution

Pattern 1: File Changes

# Only test affected services
test:api:
  rules:
    - changes:
        - "services/api/**/*"
        - "shared/lib/**/*"
  script:
    - cd services/api && npm test

test:ui:
  rules:
    - changes:
        - "services/ui/**/*"
        - "shared/components/**/*"
  script:
    - cd services/ui && npm test

Pattern 2: Schedule-Specific Jobs

# Only run in nightly builds
test:stress:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:stress

# Don't run expensive tests in MRs
test:e2e:full:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:e2e:full

Pattern 3: Manual Gates

# Expensive operation - manual trigger
performance-test:
  when: manual
  allow_failure: true # Don't block pipeline
  script:
    - npm run test:performance

# Auto-deploy to staging, manual to prod
deploy:staging:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:staging

deploy:production:
  when: manual
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  needs: [deploy:staging]
  script:
    - npm run deploy:production

Stage Optimization

Problem: Sequential Stages

stages:
  - build
  - test
  - deploy

build:
  stage: build
  script: sleep 600 # 10 min

test:
  stage: test
  script: sleep 600 # 10 min (waits for build)

deploy:
  stage: deploy
  script: sleep 600 # 10 min (waits for test)

# Total: 30 minutes

Solution: Directed Acyclic Graph (DAG) with needs

build:
  script: sleep 600 # 10 min

test:unit:
  needs: [build]
  script: sleep 600 # Starts immediately after build

test:integration:
  needs: [build]
  script: sleep 600 # Parallel with test:unit

deploy:
  needs: [test:unit, test:integration]
  script: sleep 600

# Total: 30 minutes (build + max(test:unit, test:integration) + deploy)
# The second test job adds no wall-clock time, and each job starts as soon as its needs finish
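
With needs, the wall-clock duration of a pipeline is the longest chain of dependencies rather than the sum of its stages. The Python sketch below computes that critical path for the four jobs in this example, using the 10-minute durations from the sleep 600 commands; it ignores runner queueing and startup time, and the function name is invented for this illustration.

```python
durations = {"build": 10, "test:unit": 10, "test:integration": 10, "deploy": 10}
needs = {
    "build": [],
    "test:unit": ["build"],
    "test:integration": ["build"],
    "deploy": ["test:unit", "test:integration"],
}


def pipeline_duration(durations: dict[str, int], needs: dict[str, list[str]]) -> int:
    """Wall-clock minutes: the finish time of the job that ends last."""
    finish: dict[str, int] = {}

    def finish_time(job: str) -> int:
        if job not in finish:
            # A job starts once every job it needs has finished
            start = max((finish_time(dep) for dep in needs.get(job, [])), default=0)
            finish[job] = start + durations[job]
        return finish[job]

    return max(finish_time(job) for job in durations)


print(pipeline_duration(durations, needs))  # 30 (wall-clock minutes)
print(sum(durations.values()))              # 40 (compute minutes)
```

The two test jobs overlap, so the pipeline finishes in 30 minutes of wall-clock time while consuming 40 compute minutes.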

Best Practice: Minimal Stages

Instead of:

stages:
  - lint
  - type-check
  - unit-test
  - integration-test
  - build
  - deploy

Use:

stages:
  - validate
  - test
  - deploy

# With needs to control order (scripts omitted for brevity)
lint:
  stage: validate

test:unit:
  stage: test
  needs: [lint]

test:integration:
  stage: test
  needs: [lint]

deploy:
  stage: deploy
  needs: [test:unit, test:integration]

Resource Efficiency

Pattern 1: Shared Setup Job

Problem: Every job installs dependencies

# INEFFICIENT
test:unit:
  script:
    - npm ci # 5 minutes
    - npm test # 2 minutes

test:integration:
  script:
    - npm ci # 5 minutes (duplicate!)
    - npm run test:integration # 3 minutes

# Total: 15 minutes

Solution:

# EFFICIENT
install:
  stage: .pre
  script:
    - npm ci
  artifacts:
    paths:
      - node_modules/
    expire_in: 1 hour

test:unit:
  script:
    - npm test # 2 minutes

test:integration:
  script:
    - npm run test:integration # 3 minutes

# Total: 10 compute minutes (5 + 2 + 3); about 8 minutes wall-clock (5 + max(2, 3))

Pattern 2: Artifact Expiration

Don't keep artifacts longer than needed:

build:
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 day # Not 30 days!

Cost:

  • Storage cost (artifact storage quota)
  • Download time in downstream jobs

Guideline:

  • Build artifacts: 1 day
  • Test results: 7 days
  • Release artifacts: 30 days or never expire

Environment-Specific Optimization

Development Branches

# Minimal testing on feature branches
test:dev:
  rules:
    - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
  script:
    - npm run test:unit # Fast tests only

Main Branch

# Comprehensive testing on main
test:main:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm test # Full test suite
    - npm run test:integration
    - npm run test:e2e

Scheduled Pipelines

# Extensive testing in nightly builds
test:nightly:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:all
    - npm run test:performance
    - npm run test:stress

Complete Optimized Pipeline Example

# Global configuration
workflow:
  auto_cancel:
    on_new_commit: interruptible
    on_job_failure: all
  rules:
    # Skip draft MRs
    - if: $CI_MERGE_REQUEST_TITLE =~ /^Draft:/
      when: never
    # Only MR pipelines for MRs
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Branch pipelines for main
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Tag pipelines
    - if: $CI_COMMIT_TAG

default:
  interruptible: true
  image: node:20-alpine
  # One cache shared by all jobs, keyed on the lockfile.
  # npm ci deletes node_modules/ before installing, so cache npm's download cache instead.
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
    policy: pull
  retry:
    max: 1
    when:
      - runner_system_failure

stages:
  - validate
  - test
  - build
  - deploy

# Fast validation
lint:
  stage: validate
  timeout: 5m
  # A job-level cache replaces the default instead of merging with it,
  # so the key and paths are repeated; only this job uploads the cache
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
    policy: pull-push
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run lint

type-check:
  stage: validate
  timeout: 5m
  needs: []
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run type-check

# Conditional tests
test:unit:
  stage: test
  timeout: 10m
  needs: [lint, type-check]
  rules:
    - changes:
        - "src/**/*"
        - "tests/**/*"
        - "package*.json"
  parallel: 3
  script:
    - npm ci --cache .npm --prefer-offline
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

test:integration:
  stage: test
  timeout: 15m
  tags:
    - self-hosted # Long test on self-hosted runner
  needs: [lint]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_MERGE_REQUEST_IID
      changes:
        - "src/**/*"
      when: manual
      allow_failure: true
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run test:integration

# Build
build:
  stage: build
  timeout: 10m
  needs:
    # test:unit is skipped when no matching files changed; optional avoids a pipeline creation error
    - job: test:unit
      optional: true
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 day

# Deploy (not interruptible)
deploy:staging:
  stage: deploy
  interruptible: false
  timeout: 20m
  needs: [build]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:staging

deploy:production:
  stage: deploy
  interruptible: false
  timeout: 20m
  needs: [deploy:staging]
  when: manual
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:production

Measuring Pipeline Efficiency

Key Metrics

Pipeline Duration:

Duration = End time - Start time

Compute Minutes:

Compute Minutes = Σ(Job Duration × Cost Factor)

Efficiency Ratio:

Efficiency = Pipeline Duration / Compute Minutes

Lower is better (more parallelization)

Example:

Pipeline Duration: 15 minutes
Compute Minutes: 45 minutes
Efficiency: 0.33 (good - 3x parallelization)
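
The same calculation as a Python sketch, taking one (duration, cost factor) pair per job. The three 15-minute jobs with a cost factor of 1.0 are an assumed breakdown that reproduces the figures above; real pipelines will have uneven jobs, and the function name is invented for this illustration.

```python
def efficiency_ratio(pipeline_duration_min: float, jobs: list[tuple[float, float]]) -> float:
    """Pipeline duration divided by compute minutes (sum of duration x cost factor)."""
    compute_minutes = sum(duration * cost_factor for duration, cost_factor in jobs)
    return pipeline_duration_min / compute_minutes


# Three 15-minute jobs running fully in parallel on runners with cost factor 1.0
jobs = [(15, 1.0), (15, 1.0), (15, 1.0)]
print(f"{efficiency_ratio(15, jobs):.2f}")  # 0.33
```

A fully sequential pipeline scores 1.0 at a cost factor of 1.0, since its duration equals its compute minutes.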

Targets

Metric                   Target
MR Pipeline Duration     <15 min
Main Pipeline Duration   <30 min
Efficiency Ratio         <0.5
Failed Job Rate          <5%
Cache Hit Rate           >80%

Next Steps