CI/CD Issues

Troubleshooting guide for GitLab CI/CD pipeline failures and related issues.


Issue: Pipeline Stuck in Pending

Symptoms

  • Jobs show "pending" status indefinitely
  • No runners picking up jobs
  • Pipeline queue growing
  • Timeout after waiting

Cause

  1. No available runners with matching tags
  2. Runner offline or disconnected
  3. Runner at capacity
  4. Protected branch/tag restrictions
  5. Runner executor limits exceeded
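
A runner only picks up a job when it carries every tag the job lists, so a single missing tag leaves the job pending. The sketch below illustrates that rule; `runner_covers_job` is a hypothetical helper name, and it assumes both tag lists are comma-separated with no spaces.

```shell
# Hypothetical helper: exit 0 when every job tag is present in the runner's tag list.
# Both arguments are comma-separated with no spaces, e.g. "docker,linux".
runner_covers_job() (
    runner_tags=",$1,"
    set -f      # disable globbing while the job tag list is split
    IFS=,
    for tag in $2; do
        case "$runner_tags" in
            *",$tag,"*) ;;                              # runner has this tag
            *) echo "missing tag: $tag"; exit 1 ;;      # first missing tag ends the check
        esac
    done
    exit 0
)
```

For example, `runner_covers_job "docker,linux" "docker,gpu"` prints `missing tag: gpu` and returns a non-zero status.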

Solution

    # Check runner status (queries the REST API; run inside the project's repository)
    glab api projects/:id/runners

    # Verify runner tags match job requirements
    # In .gitlab-ci.yml:
    job:
      tags:
        - docker  # Must match registered runner tags

    # Check the runner on its host machine
    sudo gitlab-runner verify
    sudo gitlab-runner status

    # Restart runner
    sudo gitlab-runner restart

    # Check runner logs
    sudo journalctl -u gitlab-runner -f

    # Increase concurrent jobs
    # In /etc/gitlab-runner/config.toml:
    concurrent = 4

Prevention

  • Monitor runner health with alerts
  • Configure multiple runners for redundancy
  • Set appropriate job timeouts
  • Use autoscaling runners for variable load

Issue: Docker Build Failures

Symptoms

  • "Cannot connect to Docker daemon"
  • "No space left on device"
  • Layer caching failures
  • Registry authentication errors

Cause

  1. Docker socket not mounted
  2. Disk space exhausted
  3. Registry credentials expired
  4. Network connectivity issues
  5. Invalid Dockerfile syntax
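
Disk exhaustion is easier to catch before a build starts than after it fails. As a rough sketch, the hypothetical helper `has_free_space` below compares the space available on a filesystem against a threshold in kilobytes. It assumes the `df -P` output has no spaces in the filesystem name.

```shell
# Hypothetical helper: exit 0 when the filesystem holding PATH has at least MIN_KB kilobytes free.
has_free_space() (
    path=$1
    min_kb=$2
    # -P prints one POSIX-format line per filesystem; with -k, column 4 is the available space in KB.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    echo "available: ${avail_kb} KB, required: ${min_kb} KB"
    [ "$avail_kb" -ge "$min_kb" ]
)
```

On a runner host this might be called as `has_free_space /var/lib/docker 5242880` to require roughly 5 GB in Docker's default data directory.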

Solution

    # For a Docker-in-Docker setup, verify the config
    # In .gitlab-ci.yml:
    services:
      - docker:24.0.5-dind
    variables:
      DOCKER_HOST: tcp://docker:2376
      DOCKER_TLS_CERTDIR: "/certs"

    # Clean up Docker on the runner (removes all unused images, containers, and volumes)
    docker system prune -af --volumes

    # Verify registry login (read the token from stdin so it stays out of the process list)
    echo "$CI_JOB_TOKEN" | docker login registry.gitlab.com -u gitlab-ci-token --password-stdin

    # Check Dockerfile syntax (--check requires a recent Buildx release)
    docker build --check -f Dockerfile .

    # Debug build with verbose output
    docker build --progress=plain --no-cache -t test .

Prevention

  • Implement Docker cleanup in post-job scripts
  • Use specific image tags, not latest
  • Cache Docker layers in registry
  • Set up registry mirrors for reliability
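
To act on the advice about specific image tags, a quick scan of the Dockerfile can list base images that are untagged or pinned to `latest`. This is only a sketch: `unpinned_images` is a hypothetical helper, and it will also report references to earlier build stages (for example `FROM build`), which need no tag.

```shell
# Hypothetical helper: print FROM images that have no tag or use :latest.
unpinned_images() {
    awk '
        toupper($1) == "FROM" {
            image = ($2 ~ /^--/) ? $3 : $2      # skip a leading flag such as --platform=...
            if (image == "scratch") next        # scratch is the reserved empty base image
            name = image
            sub(/^.*\//, "", name)              # drop registry and namespace; a registry port also uses ":"
            if (name !~ /[:@]/ || name ~ /:latest$/) print image
        }
    ' "$1"
}
```

Running `unpinned_images Dockerfile` prints one image per line; images pinned by tag or by digest are left out.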

Issue: Test Failures in CI Only

Symptoms

  • Tests pass locally but fail in CI
  • Intermittent test failures (flaky tests)
  • Timeout failures
  • Environment-specific errors

Cause

  1. Different environment configuration
  2. Missing test fixtures
  3. Race conditions in async tests
  4. Database state issues
  5. Timezone or locale differences
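
When the cause is an environment difference, comparing a sorted `env` dump from CI with one taken locally narrows it down quickly. The sketch below lists variable names present in one dump but not the other. `env_only_in` is a hypothetical helper, and it assumes one `NAME=value` pair per line, so multi-line values will produce noise.

```shell
# Hypothetical helper: print variable names found in the first env dump but not the second.
env_only_in() (
    LC_ALL=C
    export LC_ALL                        # keep sort and comm on the same collation
    a=$(mktemp)
    b=$(mktemp)
    cut -d= -f1 "$1" | sort -u > "$a"    # keep only the name left of the first "="
    cut -d= -f1 "$2" | sort -u > "$b"
    comm -23 "$a" "$b"                   # lines unique to the first list
    rm -f "$a" "$b"
)
```

With a CI dump saved as `ci-env.txt` and a local one as `local-env.txt` (both file names are placeholders), `env_only_in ci-env.txt local-env.txt` shows what CI sets that the local shell does not, such as `TZ` or `CI`.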

Solution

    # Compare environments
    # In .gitlab-ci.yml:
    test:
      before_script:
        - node --version
        - npm --version
        - env | sort > /tmp/ci-env.txt

    # Run tests under the same conditions locally
    # (quote the path so directories containing spaces still mount correctly)
    docker run -it --rm \
      -v "$(pwd)":/app \
      -w /app \
      node:20 npm test

    # For flaky tests, add retries
    # In .gitlab-ci.yml:
    test:
      retry:
        max: 2
        when: script_failure

    # Debug with SSH (if enabled)
    # Add to .gitlab-ci.yml:
    debug:
      when: manual
      script:
        - sleep 3600  # Pause for SSH debugging

Prevention

  • Use identical Docker images locally and in CI
  • Implement test isolation (reset state between tests)
  • Add explicit waits for async operations
  • Run tests in parallel to detect race conditions

Issue: Deployment Failures

Symptoms

  • Deployment job fails
  • Container not starting after deploy
  • Health checks failing
  • Rollback triggered

Cause

  1. Missing secrets/environment variables
  2. Container image not found
  3. Resource limits exceeded
  4. Configuration errors
  5. Database migration failures
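
Missing variables are the cheapest of these causes to rule out, and it helps to report all of them in one pass instead of failing on the first. A minimal sketch follows; `require_vars` is a hypothetical helper, not a GitLab feature.

```shell
# Hypothetical helper: exit non-zero and name every listed variable that is unset or empty.
require_vars() (
    missing=0
    for name in "$@"; do
        # Accept only plain identifiers, because the name is passed to eval below.
        case "$name" in
            ''|[0-9]*|*[!A-Za-z0-9_]*) echo "invalid variable name: $name" >&2; exit 2 ;;
        esac
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    exit "$missing"
)
```

A deploy script could start with `require_vars DEPLOY_TOKEN` followed by any other variable names the job depends on.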

Solution

    # Verify secrets are available
    # In .gitlab-ci.yml:
    deploy:
      script:
        - echo "Checking required variables..."
        - test -n "$DEPLOY_TOKEN" || (echo "DEPLOY_TOKEN not set" && exit 1)

    # Check container logs after deploy
    kubectl logs deployment/<app> -n <namespace> --tail=100

    # Verify image exists
    docker pull registry.gitlab.com/<group>/<project>:<tag>

    # Manual rollback if needed
    kubectl rollout undo deployment/<app> -n <namespace>

    # Check deployment events
    kubectl describe deployment/<app> -n <namespace>

Prevention

  • Use deployment previews (Review Apps)
  • Implement health check endpoints
  • Set up deployment notifications
  • Test deployments in staging first

Issue: Cache Not Working

Symptoms

  • Jobs running slower than expected
  • Dependencies downloaded every run
  • "Cache not found" messages
  • Inconsistent cache hits

Cause

  1. Cache key mismatch
  2. Cache expired
  3. Different runners not sharing cache
  4. Cache path incorrect
  5. Cache size exceeded
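
If an oversized cache is suspected, measuring the cached directory inside the job gives a concrete number to compare against whatever limit the runner or storage backend enforces. The sketch below assumes a limit expressed in kilobytes. `cache_within_limit` is a hypothetical helper, and the limit itself is something you supply.

```shell
# Hypothetical helper: exit 0 when DIR occupies at most MAX_KB kilobytes on disk.
cache_within_limit() (
    dir=$1
    max_kb=$2
    size_kb=$(du -sk "$dir" | awk '{ print $1 }')    # -s: total only, -k: kilobytes
    echo "$dir: ${size_kb} KB (limit ${max_kb} KB)"
    [ "$size_kb" -le "$max_kb" ]
)
```

Calling `cache_within_limit node_modules 1048576` at the end of a job would flag a `node_modules/` directory larger than about 1 GB.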

Solution

    # Proper cache configuration
    # In .gitlab-ci.yml:
    cache:
      key:
        files:
          - package-lock.json  # Changes when deps change
      paths:
        - node_modules/
      policy: pull-push  # Default, can use pull or push only

    # Use fallback keys
    cache:
      key: $CI_COMMIT_REF_SLUG
      paths:
        - node_modules/
      when: always
      fallback_keys:
        - main

    # For distributed runners, use S3 cache
    # In config.toml:
    [runners.cache]
      Type = "s3"
      Shared = true

Prevention

  • Use file-based cache keys for dependencies
  • Configure shared cache backend (S3, GCS)
  • Set appropriate cache expiration
  • Monitor cache hit rates

Issue: Artifact Upload Failures

Symptoms

  • "Artifact upload failed"
  • Jobs succeed but artifacts missing
  • Artifact size limit exceeded
  • Artifacts expired

Cause

  1. Path pattern not matching files
  2. Size exceeds limits
  3. Storage quota exhausted
  4. Network timeout during upload

Solution

    # Debug artifact paths
    # In .gitlab-ci.yml:
    build:
      script:
        - npm run build
        - ls -la dist/  # Verify files exist
      artifacts:
        paths:
          - dist/
        expire_in: 1 week
        when: always  # Upload even on failure

    # Increase upload timeout (confirm that your runner version honors this variable)
    variables:
      ARTIFACT_UPLOAD_TIMEOUT: "30m"

    # Compress large artifacts
    artifacts:
      paths:
        - build.tar.gz
        # Instead of:
        # - build/

Prevention

  • Set appropriate artifact expiration
  • Compress artifacts before upload
  • Monitor storage usage
  • Use artifact dependencies wisely

Issue: Secret Detection Blocking Commits

Symptoms

  • Push rejected with "secret detected"
  • False positives on secret detection
  • Cannot push urgent fixes
  • Pre-receive hook failures

Cause

  1. Actual secrets in code
  2. Test fixtures containing secret-like patterns
  3. API keys in documentation
  4. Base64 encoded data triggering rules

Solution

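    # View detected secrets: download the secret_detection job's report artifact
    # (gl-secret-detection-report.json) for the affected branch
    glab ci artifact <branch> secret_detection

    # Exclude paths that produce false positives
    # In .gitlab-ci.yml:
    secret_detection:
      variables:
        SECRET_DETECTION_EXCLUDED_PATHS: "tests/fixtures/*,docs/examples/*"

    # Remove secret from history (if actually exposed)
    # Note: Git's documentation recommends git filter-repo over filter-branch
    git filter-branch --force --index-filter \
      "git rm --cached --ignore-unmatch <file-with-secret>" \
      --prune-empty --tag-name-filter cat -- --all

    # Rotate exposed secrets IMMEDIATELY
    # Then force push cleaned history, including rewritten tags
    git push origin --force --all
    git push origin --force --tags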

Prevention

  • Use environment variables for all secrets
  • Implement pre-commit hooks locally
  • Use .gitignore for .env files
  • Train team on secret handling
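
A local pre-commit hook can be as small as a `grep` over the files about to be committed. The sketch below checks for two well-known token shapes only: the default `glpat-` prefix of GitLab personal access tokens and the `AKIA` prefix of AWS access key IDs. `scan_for_secrets` is a hypothetical helper and is no substitute for a full secret scanner.

```shell
# Hypothetical helper: exit 1 and print file:line matches when a token-like string is found.
scan_for_secrets() (
    # glpat- plus 20 characters: default GitLab personal access token shape.
    # AKIA plus 16 characters: AWS access key ID shape.
    pattern='glpat-[0-9A-Za-z_-]{20}|AKIA[0-9A-Z]{16}'
    if grep -nE -e "$pattern" -- "$@"; then
        echo "possible secret found; remove it before committing" >&2
        exit 1
    fi
    exit 0
)
```

In a hook, this could be fed the staged file list, for example `scan_for_secrets $(git diff --cached --name-only --diff-filter=ACM)`, as long as file names contain no spaces and at least one file is staged.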

Issue: Merge Train Failures

Symptoms

  • Merge train aborted
  • Cascading failures in train
  • "Pipeline for merged results failed"
  • Long wait times in queue

Cause

  1. Failing tests in earlier train cars
  2. Merge conflicts during train
  3. Pipeline timeout in train
  4. Resource contention

Solution

    # View merge train status (lists the project's merge trains through the REST API)
    glab api projects/:id/merge_trains

    # Check merge train configuration
    # In .gitlab-ci.yml:
    workflow:
      rules:
        - if: $CI_MERGE_REQUEST_EVENT_TYPE == "merge_train"

    # Optimize for merge trains
    test:
      interruptible: true  # Allow cancellation
      parallel: 4          # Speed up tests

    # Skip redundant jobs in train
    build:
      rules:
        - if: $CI_MERGE_REQUEST_EVENT_TYPE == "merge_train"
          when: never  # Use cached build from MR pipeline
        - when: on_success

Prevention

  • Keep pipelines fast (< 10 minutes ideal)
  • Use interruptible jobs
  • Limit merge train size
  • Run comprehensive tests before adding to train

Issue: Runner Registration Failures

Symptoms

  • "ERROR: Registering runner... failed"
  • Token authentication errors
  • Runner appears then disappears
  • Connection refused errors

Cause

  1. Invalid registration token
  2. Network/firewall issues
  3. GitLab instance unreachable
  4. TLS certificate problems
  5. Runner already registered with same token

Solution

    # Verify network connectivity (a 401 response still proves the host is reachable)
    curl -v https://gitlab.com/api/v4/runners

    # Register with debug output (--debug is a global flag and goes before the command)
    gitlab-runner --debug register

    # Check runner config
    cat /etc/gitlab-runner/config.toml

    # Unregister and re-register
    gitlab-runner unregister --all-runners
    gitlab-runner register \
      --non-interactive \
      --url https://gitlab.com \
      --token <runner-token> \
      --executor docker \
      --docker-image alpine:latest

    # Fix TLS issues
    gitlab-runner register --tls-ca-file=/path/to/ca.crt

Prevention

  • Use runner groups for organization
  • Implement runner monitoring
  • Document runner setup process
  • Use configuration management for runners

Issue: Pipeline YAML Syntax Errors

Symptoms

  • "Invalid YAML syntax"
  • "Unknown key" errors
  • Pipeline not triggering
  • Jobs not appearing

Cause

  1. YAML indentation errors
  2. Invalid key names
  3. Incorrect variable syntax
  4. Missing required fields
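
One indentation error that is hard to see in an editor is a tab character, which YAML does not allow for indentation. A minimal sketch for spotting such lines follows; `find_tab_indentation` is a hypothetical helper name.

```shell
# Hypothetical helper: print numbered lines whose leading whitespace contains a tab.
# Exits non-zero when no such line exists (grep's usual convention).
find_tab_indentation() (
    tab=$(printf '\t')
    grep -n "^ *$tab" "$1"
)
```

`find_tab_indentation .gitlab-ci.yml` prints nothing for a clean file. Any output points at the lines to re-indent with spaces.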

Solution

    # Validate the pipeline configuration from the command line
    glab ci lint .gitlab-ci.yml

    # Use the project's online CI Lint tool
    # https://gitlab.com/<group>/<project>/-/ci/lint

    # Check with yamllint
    yamllint .gitlab-ci.yml

    # Validate includes
    # In .gitlab-ci.yml:
    include:
      - local: '.gitlab/ci/test.yml'
        rules:
          - exists: ['.gitlab/ci/test.yml']

Common fixes:

    # Variable interpolation in script lines: all three forms work
    script:
      - echo $VAR     # Works
      - echo ${VAR}   # Works
      - echo "$VAR"   # Works

    # Wrong - anchors must be defined before use
    job:
      <<: *undefined_anchor  # Error!

    # Correct - define anchor first
    .template: &template
      script: echo "test"

    job:
      <<: *template  # Works

Prevention

  • Use CI linting in pre-commit hooks
  • Validate YAML in IDE with schema
  • Use includes for reusable configuration
  • Test pipeline changes in MR first

