Pipeline Optimization for Cost Efficiency

Overview

Pipeline optimization combines multiple techniques to create fast, efficient CI/CD workflows that minimize compute minute consumption while maximizing developer productivity.

Rules vs only/except

Problem with only/except (Legacy)

# OLD WAY - Less efficient
test:
  only:
    - branches
  except:
    - main
  script:
    - npm test

Issues:

Limited flexibility
No complex conditions
Can't combine multiple conditions
No longer actively developed; rules is the preferred replacement

Modern Approach: rules

# NEW WAY - More efficient and flexible
test:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
  script:
    - npm test

Performance Impact:

Rules are evaluated when the pipeline is created
Jobs whose rules do not match are never added to the pipeline, so they are never scheduled
Fewer scheduled jobs means less load on shared runners

Complex Rules

deploy:
  rules:
    # Deploy on main branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: on_success  # "always" would deploy even after earlier jobs failed
    # Manual deploy on release branches
    - if: $CI_COMMIT_BRANCH =~ /^release\//
      when: manual
    # Never deploy on feature branches
    - when: never
  script:
    - npm run deploy
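Rules are evaluated top to bottom and the first match decides the job's fate, which is why the catch-all `when: never` goes last. A minimal Python sketch of that first-match behavior follows; `evaluate_rules` is a hypothetical helper, the lambdas stand in for GitLab's `if:` expressions, and the `when` values are illustrative.

```python
import re

def evaluate_rules(rules, variables):
    """Return the `when` of the first matching rule, or None if the job is excluded."""
    for rule in rules:
        condition = rule.get("if")
        # A rule without an "if" clause always matches.
        if condition is None or condition(variables):
            when = rule.get("when", "on_success")
            return None if when == "never" else when
    return None  # No rule matched: the job is not added to the pipeline

# Three rules shaped like the deploy job: default branch, release branches, catch-all
deploy_rules = [
    {"if": lambda v: v.get("CI_COMMIT_BRANCH") == v.get("CI_DEFAULT_BRANCH"),
     "when": "on_success"},
    {"if": lambda v: re.search(r"^release/", v.get("CI_COMMIT_BRANCH") or "") is not None,
     "when": "manual"},
    {"when": "never"},
]

for branch in ["main", "release/1.2", "feature/login"]:
    variables = {"CI_COMMIT_BRANCH": branch, "CI_DEFAULT_BRANCH": "main"}
    print(branch, evaluate_rules(deploy_rules, variables))
```

Moving the catch-all rule to the top of the list would exclude the job on every branch, because evaluation stops at the first match.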

Interruptible Jobs

Concept

Allow jobs to be canceled when they become obsolete (new commit pushed).

Configuration

Enable globally:

default:
  interruptible: true

Or per-job:

test:
  interruptible: true
  script:
    - npm test

deploy:
  interruptible: false  # Never interrupt deployments!
  script:
    - npm run deploy

Best Practices

Always interruptible:

Lint/format checks
Unit tests
Integration tests
Build jobs (if no side effects)

Never interruptible:

Deployments
Database migrations
Publishing packages
Creating releases

Advanced - Selective Cancellation

workflow:
  auto_cancel:
    on_new_commit: interruptible  # Only cancel interruptible jobs
    on_job_failure: all            # Cancel all on failure

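Options for on_new_commit:

conservative: Cancel the pipeline only if no job marked interruptible: false has started yet (default)
interruptible: Cancel only jobs marked interruptible: true
none: Don't auto-cancel

Options for on_job_failure: all (cancel the pipeline and all running jobs as soon as one job fails) or none (default)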

Example Impact:

10:00 - Push commit A → Pipeline A starts (20 min)
10:05 - Push commit B → Pipeline A canceled (saved 15 min)
10:05 - Push commit B → Pipeline B starts (20 min)
10:08 - Push commit C → Pipeline B canceled (saved 17 min)

Savings: 32 minutes from 2 canceled pipelines
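The arithmetic behind this timeline can be sketched in a few lines of Python. The helper name `minutes_saved` and the tuple layout are illustrative assumptions, and the model assumes a canceled pipeline would otherwise have run for its full duration.

```python
from datetime import datetime

def minutes_saved(full_duration_min, canceled_runs):
    """Sum the minutes not spent because pipelines were canceled early.

    canceled_runs: list of (started, canceled_at) times as "HH:MM" strings.
    """
    fmt = "%H:%M"
    saved = 0.0
    for started, canceled_at in canceled_runs:
        delta = datetime.strptime(canceled_at, fmt) - datetime.strptime(started, fmt)
        elapsed_min = delta.total_seconds() / 60
        # A pipeline canceled after its full duration saves nothing.
        saved += max(full_duration_min - elapsed_min, 0)
    return saved

# Pipeline A: started 10:00, canceled 10:05. Pipeline B: started 10:05, canceled 10:08.
print(minutes_saved(20, [("10:00", "10:05"), ("10:05", "10:08")]))  # 32.0
```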

Fail Fast Patterns

Problem

Waiting for all jobs to complete when early failure makes success impossible.

Pattern 1: Job Dependencies with needs

stages:
  - validate
  - test
  - build
  - deploy

# Fast validation first
lint:
  stage: validate
  script:
    - npm run lint  # If this fails, the jobs that need it are skipped

type-check:
  stage: validate
  script:
    - npm run type-check

# Tests depend on validation passing
test:unit:
  stage: test
  needs: [lint, type-check]
  script:
    - npm test

test:integration:
  stage: test
  needs: [lint, type-check]
  script:
    - npm run test:integration

# Build depends on tests
build:
  stage: build
  needs: [test:unit, test:integration]
  script:
    - npm run build

# Deploy depends on build
deploy:
  stage: deploy
  needs: [build]
  script:
    - npm run deploy

Flow:

lint (fail) → test:unit (blocked) → build (blocked) → deploy (blocked)
            → test:integration (blocked)

Pipeline stops at 2 minutes instead of running 20 minutes

Pattern 2: Auto-Cancel on Failure

workflow:
  auto_cancel:
    on_job_failure: all  # Cancel all remaining jobs

default:
  interruptible: true

lint:
  script:
    - npm run lint  # If fails, everything else cancels

test:parallel:
  parallel: 10
  script:
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Benefit: If lint fails after 30 seconds, the 10 parallel test jobs (5 min each) are canceled, saving up to 50 minutes.

Pattern 3: Early Exit in Scripts

test:
  script:
    # Run fast checks first; a non-zero exit code on any line fails the job immediately
    - npm run lint
    - npm run type-check
    - npm run security-check

    # Only run slow tests if fast checks pass
    - npm run test:unit
    - npm run test:integration
    - npm run test:e2e

Pattern 4: Allow Failure for Non-Critical Jobs

# Critical jobs
lint:
  script:
    - npm run lint

# Non-critical - don't block pipeline
security-scan:
  allow_failure: true  # Pipeline continues if this fails
  script:
    - npm audit

deploy:
  needs: [lint]  # Only depends on critical jobs
  script:
    - npm run deploy

Pipeline Timeout Settings

Global Timeout

Problem: Jobs that hang consume minutes until global timeout (1 hour default).

Solution - Set Project Timeout:

Navigate to: Project Settings → CI/CD → General pipelines → Timeout

Recommended: 30 minutes (projects rarely need more)

Job-Specific Timeouts

# Fast jobs - aggressive timeout
lint:
  timeout: 5m  # Should complete in <1 min
  script:
    - npm run lint

# Medium jobs
test:
  timeout: 15m  # Should complete in <10 min
  script:
    - npm test

# Long jobs
test:e2e:
  timeout: 30m  # Can take 20+ min
  script:
    - npm run test:e2e

# Deployments
deploy:
  timeout: 1h  # May need more time
  script:
    - npm run deploy

Benefits:

Catch hung jobs faster
Reduce wasted minutes
Force optimization of slow jobs

Timeout Strategy

Set timeout to 1.5x normal duration:

Normal: 10 minutes
Timeout: 15 minutes

If job hits 15 min, investigate why
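One way to apply the 1.5x guideline is to derive the timeout from recent job durations. The sketch below treats the median of recent runs as the "normal" duration, which is an assumption, and `suggest_timeout_minutes` is a hypothetical helper name.

```python
import math
import statistics

def suggest_timeout_minutes(recent_durations_min, factor=1.5):
    """Suggest a timeout as `factor` times the median recent duration, rounded up."""
    normal = statistics.median(recent_durations_min)
    return math.ceil(normal * factor)

# Five recent runs of a job that normally takes about 10 minutes
print(suggest_timeout_minutes([9, 10, 10, 11, 10]))  # 15
```

Using the median rather than the mean keeps a single hung run from inflating the suggested timeout.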

Job Retry Strategies

Problem

Jobs that fail transiently retry multiple times, consuming extra minutes.

Default Behavior

No configuration = No retries

Smart Retry Configuration

# Retry transient failures only
test:
  retry:
    max: 2  # Retry up to 2 times
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - unknown_failure
      - api_failure
      - runner_unsupported
  script:
    - npm test

When NOT to Retry

# Don't retry deterministic failures
lint:
  retry: 0  # Code errors won't fix themselves
  script:
    - npm run lint

build:
  retry: 0  # Build errors need code changes
  script:
    - npm run build

When TO Retry

# Network-dependent operations
deploy:
  retry:
    max: 2  # GitLab caps retry:max at 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - script_failure  # Network timeouts
  script:
    - npm run deploy

# Flaky E2E tests
test:e2e:
  retry:
    max: 1  # One retry for flakes
    when:
      - script_failure
  script:
    - npm run test:e2e

Cost Analysis:

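Job duration: 10 minutes
Retries: 2
Failure rate: 30% per attempt (attempts assumed independent)

Expected cost:
- Success on first attempt (70%): 10 min
- Fail + Retry + Success (21%): 20 min
- Fail + Retry + Fail + Retry (9%, whatever the final result): 30 min

Average: 10 × 0.70 + 20 × 0.21 + 30 × 0.09 = 13.9 min

vs No retries: 10 min, but 30% of runs fail (2.7% with two retries)

Rule: Only configure retries when the cost of manual intervention (someone noticing the failure and re-running the job) exceeds the cost of the extra retry minutes.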
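The expected cost of a retry policy can be sketched in Python. This assumes every attempt fails independently with the same probability, which real flaky jobs may not do; `expected_minutes` and `final_failure_rate` are hypothetical helper names.

```python
def expected_minutes(job_minutes, failure_rate, max_retries):
    """Expected compute minutes per job when every failed attempt is retried."""
    expected = 0.0
    for attempt in range(1, max_retries + 2):
        # An attempt only runs if all previous attempts failed.
        probability_reached = failure_rate ** (attempt - 1)
        expected += probability_reached * job_minutes
    return expected

def final_failure_rate(failure_rate, max_retries):
    """Probability that the job still fails after all retries are used up."""
    return failure_rate ** (max_retries + 1)

print(round(expected_minutes(10, 0.3, 2), 2))  # 13.9
print(round(final_failure_rate(0.3, 2), 3))    # 0.027
```

Comparing the extra expected minutes against the drop in final failure rate gives a rough basis for deciding whether a retry is worth configuring.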

Merge Request Pipelines vs Branch Pipelines

Problem: Duplicate Pipelines

Push to MR branch triggers TWO pipelines:

Branch pipeline (on push)
MR pipeline (on MR event)

Result: 2x compute minutes

Solution: MR Pipelines Only

workflow:
  rules:
    # Only run MR pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Run on default branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Run on tags
    - if: $CI_COMMIT_TAG
    # Skip all other branch pipelines

Alternative - Combined:

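workflow:
  rules:
    # For MR branches, only run MR pipeline
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Skip the branch pipeline when the branch already has an open MR
    - if: $CI_COMMIT_BRANCH && $CI_OPEN_MERGE_REQUESTS
      when: never
    # For non-MR branches, run branch pipeline
    - if: $CI_COMMIT_BRANCH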

Merged Results Pipelines (Premium/Ultimate)

Test code AS IF already merged to target branch:

Enable: Project Settings → Merge requests → Merged results pipelines

Benefits:

Catch merge conflicts early
Test with target branch changes
No duplicate runs

workflow:
  rules:
    # Merged results pipelines also report "merge_request_event" as their source;
    # $CI_MERGE_REQUEST_EVENT_TYPE is "merged_result" for them
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

Savings: 30-50% by preventing duplicates

Parallel Execution Optimization

Pattern 1: Parallel Jobs

Split work across multiple jobs:

test:
  parallel: 5  # Run 5 instances
  script:
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Cost Impact:

Sequential: 50 minutes (1 job)
Parallel (5x): 10 minutes duration, 50 minutes cost (5 jobs × 10 min)

Duration: 80% faster
Cost: Same
Developer time saved: 40 minutes per pipeline

Pattern 2: Matrix Builds

test:
  parallel:
    matrix:
      - NODE_VERSION: ["18", "20", "22"]
        OS: ["linux", "windows"]
  image: node:${NODE_VERSION}
  tags:
    - ${OS}
  script:
    - npm test

Result: 6 jobs (3 versions × 2 OSes)

Cost: 6x job duration (but necessary for compatibility)

Pattern 3: Strategic Parallelization

DON'T parallelize short jobs:

# BAD - Overhead exceeds benefit
lint:
  parallel: 3  # Job takes 30 seconds, overhead is 20 seconds
  script:
    - npm run lint

DO parallelize long jobs:

# GOOD - Clear benefit
test:e2e:
  parallel: 10  # 30 minutes sequential → about 3 minutes in parallel
  script:
    - npm run test:e2e -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

Rule: Only parallelize jobs >5 minutes
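The trade-off can be made concrete with a small model: each parallel job pays a fixed startup overhead (runner pickup, image pull, dependency install) on top of its share of the work. The sketch below assumes the work splits evenly and reads the figures above as 30 seconds or 30 minutes of useful work plus 20 seconds of overhead per job; `parallel_estimate` is a hypothetical helper name.

```python
def parallel_estimate(work_seconds, jobs, overhead_seconds):
    """Return (wall-clock duration, total compute cost) in seconds."""
    duration = work_seconds / jobs + overhead_seconds
    cost = jobs * duration
    return duration, cost

# Short lint job: 30 s of work, 20 s overhead per job
print(parallel_estimate(30, 1, 20))     # (50.0, 50.0)
print(parallel_estimate(30, 3, 20))     # (30.0, 90.0)

# Long e2e suite: 30 min of work, 20 s overhead per job
print(parallel_estimate(1800, 1, 20))   # (1820.0, 1820.0)
print(parallel_estimate(1800, 10, 20))  # (200.0, 2000.0)
```

Under these assumptions, splitting the lint job saves 20 seconds of waiting for 80% more compute, while splitting the e2e suite saves 27 minutes for about 10% more.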

Conditional Job Execution

Pattern 1: File Changes

# Only test affected services
test:api:
  rules:
    - changes:
        - "services/api/**/*"
        - "shared/lib/**/*"
  script:
    - cd services/api && npm test

test:ui:
  rules:
    - changes:
        - "services/ui/**/*"
        - "shared/components/**/*"
  script:
    - cd services/ui && npm test

Pattern 2: Schedule-Specific Jobs

# Only run in nightly builds
test:stress:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:stress

# Don't run expensive tests in MRs
test:e2e:full:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:e2e:full

Pattern 3: Manual Gates

# Expensive operation - manual trigger
performance-test:
  when: manual
  allow_failure: true  # Don't block pipeline
  script:
    - npm run test:performance

# Auto-deploy to staging, manual to prod
deploy:staging:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:staging

deploy:production:
  when: manual
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  needs: [deploy:staging]
  script:
    - npm run deploy:production

Stage Optimization

Problem: Sequential Stages

stages:
  - build
  - test
  - deploy

build:
  stage: build
  script: sleep 600  # 10 min

test:
  stage: test
  script: sleep 600  # 10 min (waits for build)

deploy:
  stage: deploy
  script: sleep 600  # 10 min (waits for test)

# Total: 30 minutes

Solution: Directed Acyclic Graph (DAG) with needs

build:
  script: sleep 600  # 10 min

test:unit:
  needs: [build]
  script: sleep 600  # Starts immediately after build

test:integration:
  needs: [build]
  script: sleep 600  # Parallel with test:unit

deploy:
  needs: [test:unit, test:integration]
  script: sleep 600

# Total: 30 minutes (build + max(test:unit, test:integration) + deploy)
# The same four jobs in four sequential stages would take 40 minutes
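The wall-clock duration of a DAG pipeline is the longest path through its needs graph. A minimal Python sketch of that calculation follows, assuming a runner is always free when a job becomes ready; `pipeline_duration` is a hypothetical helper, and the job names and 10-minute durations mirror the example.

```python
from functools import lru_cache

def pipeline_duration(durations, needs):
    """Longest path through the needs graph, in the same unit as `durations`."""
    @lru_cache(maxsize=None)
    def finish_time(job):
        # A job starts once the slowest job it needs has finished.
        start = max((finish_time(dep) for dep in needs.get(job, ())), default=0)
        return start + durations[job]

    return max(finish_time(job) for job in durations)

durations = {"build": 10, "test:unit": 10, "test:integration": 10, "deploy": 10}
needs = {
    "test:unit": ["build"],
    "test:integration": ["build"],
    "deploy": ["test:unit", "test:integration"],
}

print(pipeline_duration(durations, needs))  # 30
print(sum(durations.values()))              # 40 if every job ran one after another
```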

Best Practice: Minimal Stages

Instead of:

stages:
  - lint
  - type-check
  - unit-test
  - integration-test
  - build
  - deploy

Use:

stages:
  - validate
  - test
  - deploy

# With needs to control order
lint:
  stage: validate

test:unit:
  stage: test
  needs: [lint]

test:integration:
  stage: test
  needs: [lint]

deploy:
  stage: deploy
  needs: [test:unit, test:integration]

Resource Efficiency

Pattern 1: Shared Setup Job

Problem: Every job installs dependencies

# INEFFICIENT
test:unit:
  script:
    - npm ci  # 5 minutes
    - npm test  # 2 minutes

test:integration:
  script:
    - npm ci  # 5 minutes (duplicate!)
    - npm run test:integration  # 3 minutes

# Total: 15 compute minutes (7 + 8)

Solution:

# EFFICIENT
install:
  stage: .pre
  script:
    - npm ci
  artifacts:
    paths:
      - node_modules/
    expire_in: 1 hour

test:unit:
  script:
    - npm test  # 2 minutes

test:integration:
  script:
    - npm run test:integration  # 3 minutes

# Total: 10 compute minutes (5 + 2 + 3); wall-clock 8 minutes (5 + max(2, 3))

Pattern 2: Artifact Expiration

Don't keep artifacts longer than needed:

build:
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 day  # Not 30 days!

Cost:

Storage cost (artifact storage quota)
Download time in downstream jobs

Guideline:

Build artifacts: 1 day
Test results: 7 days
Release artifacts: 30 days or never expire

Environment-Specific Optimization

Development Branches

# Minimal testing on feature branches
test:dev:
  rules:
    - if: $CI_COMMIT_BRANCH && $CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH
  script:
    - npm run test:unit  # Fast tests only

Main Branch

# Comprehensive testing on main
test:main:
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm test  # Full test suite
    - npm run test:integration
    - npm run test:e2e

Scheduled Pipelines

# Extensive testing in nightly builds
test:nightly:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - npm run test:all
    - npm run test:performance
    - npm run test:stress

Complete Optimized Pipeline Example

# Global configuration
workflow:
  auto_cancel:
    on_new_commit: interruptible
    on_job_failure: all
  rules:
    # Skip draft MRs
    - if: $CI_MERGE_REQUEST_TITLE =~ /^Draft:/
      when: never
    # Only MR pipelines for MRs
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Branch pipelines for main
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Tag pipelines
    - if: $CI_COMMIT_TAG

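variables:
  # npm ci deletes node_modules before installing, so cache npm's download cache instead
  npm_config_cache: "$CI_PROJECT_DIR/.npm"

default:
  interruptible: true
  image: node:20-alpine
  cache: &npm_cache
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
    policy: pull
  retry:
    max: 1
    when:
      - runner_system_failure

stages:
  - validate
  - test
  - build
  - deploy

# Fast validation
lint:
  stage: validate
  timeout: 5m
  cache:
    <<: *npm_cache     # A job-level cache replaces the default, so reuse its settings
    policy: pull-push  # Only this job uploads the cache
  script:
    - npm ci --prefer-offline
    - npm run lint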

type-check:
  stage: validate
  timeout: 5m
  needs: []
  script:
    - npm ci --prefer-offline
    - npm run type-check

# Conditional tests
test:unit:
  stage: test
  timeout: 10m
  needs: [lint, type-check]
  rules:
    - changes:
        - "src/**/*"
        - "tests/**/*"
        - "package*.json"
  parallel: 3
  script:
    - npm ci --prefer-offline
    - npm test -- --shard $CI_NODE_INDEX/$CI_NODE_TOTAL

test:integration:
  stage: test
  timeout: 15m
  tags:
    - self-hosted  # Long test on self-hosted runner
  needs: [lint]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_MERGE_REQUEST_IID
      changes:
        - "src/**/*"
      when: manual
      allow_failure: true
  script:
    - npm ci --prefer-offline
    - npm run test:integration

# Build
build:
  stage: build
  timeout: 10m
  needs:
    - job: test:unit
      optional: true  # test:unit is absent when its changes rule does not match
  script:
    - npm ci --prefer-offline
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 day

# Deploy (not interruptible)
deploy:staging:
  stage: deploy
  interruptible: false
  timeout: 20m
  needs: [build]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:staging

deploy:production:
  stage: deploy
  interruptible: false
  timeout: 20m
  needs: [deploy:staging]
  when: manual
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - npm run deploy:production

Measuring Pipeline Efficiency

Key Metrics

Pipeline Duration:

Duration = End time - Start time

Compute Minutes:

Compute Minutes = Σ(Job Duration × Cost Factor)

Efficiency Ratio:

Efficiency = Pipeline Duration / Compute Minutes

Lower is better (more parallelization)

Example:

Pipeline Duration: 15 minutes
Compute Minutes: 45 minutes
Efficiency: 0.33 (good - 3x parallelization)
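These three metrics can be computed from per-job timings. The sketch below assumes each job is described by a (start, end, cost factor) tuple in minutes relative to pipeline start; that layout and the name `pipeline_metrics` are illustrative, not a GitLab API.

```python
def pipeline_metrics(jobs):
    """Return (pipeline duration, compute minutes, efficiency ratio).

    jobs: list of (start_min, end_min, cost_factor) tuples.
    """
    duration = max(end for _, end, _ in jobs) - min(start for start, _, _ in jobs)
    compute = sum((end - start) * factor for start, end, factor in jobs)
    return duration, compute, duration / compute

# Three 15-minute jobs running side by side on runners with cost factor 1
duration, compute, efficiency = pipeline_metrics([(0, 15, 1), (0, 15, 1), (0, 15, 1)])
print(duration, compute, round(efficiency, 2))  # 15 45 0.33
```

A fully sequential pipeline scores 1.0 on this ratio, so values well below 1 indicate that most compute minutes are being spent in parallel.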

Targets

Metric	Target
MR Pipeline Duration	<15 min
Main Pipeline Duration	<30 min
Efficiency Ratio	<0.5
Failed Job Rate	<5%
Cache Hit Rate	>80%

Next Steps

Pre-Push Validation - Catch errors before CI
Monitoring - Track pipeline performance
Checklist - Daily optimization checks