CI/CD Issues
Troubleshooting guide for GitLab CI/CD pipeline failures and related issues.
Issue: Pipeline Stuck in Pending
Symptoms
- Jobs show "pending" status indefinitely
- No runners picking up jobs
- Pipeline queue growing
- Timeout after waiting
Cause
- No available runners with matching tags
- Runner offline or disconnected
- Runner at capacity
- Protected branch/tag restrictions
- Runner executor limits exceeded
Solution
    # Check runner status
    glab runner list --all

    # Verify runner tags match job requirements
    # In .gitlab-ci.yml:
    job:
      tags:
        - docker  # Must match registered runner tags

    # Check runner on machine
    sudo gitlab-runner verify
    sudo gitlab-runner status

    # Restart runner
    sudo gitlab-runner restart

    # Check runner logs
    sudo journalctl -u gitlab-runner -f

    # Increase concurrent jobs
    # In /etc/gitlab-runner/config.toml:
    concurrent = 4
Prevention
- Monitor runner health with alerts
- Configure multiple runners for redundancy
- Set appropriate job timeouts
- Use autoscaling runners for variable load
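A job stays pending unless some runner carries every tag the job lists. The following is a minimal sketch of that matching rule in POSIX shell. `tags_covered` is a hypothetical helper, not part of gitlab-runner or glab, and it assumes the comma-separated tag lists (no spaces) are copied by hand from the runner settings and the job definition.

```shell
#!/bin/sh
# tags_covered RUNNER_TAGS JOB_TAGS   (hypothetical helper)
# Both arguments are comma-separated tag lists without spaces.
# Succeeds only if every job tag also appears in the runner's tag list.
tags_covered() {
    runner_tags=",$1,"
    for tag in $(printf '%s' "$2" | tr ',' ' '); do
        case "$runner_tags" in
            *",$tag,"*) ;;                      # tag present on the runner
            *) echo "missing tag: $tag"; return 1 ;;
        esac
    done
    return 0
}

# Example:
# tags_covered "docker,linux" "docker,gpu"   # prints "missing tag: gpu"
```

An empty job tag list passes the check; whether a runner picks up untagged jobs depends on its own settings.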
Issue: Docker Build Failures
Symptoms
- "Cannot connect to Docker daemon"
- "No space left on device"
- Layer caching failures
- Registry authentication errors
Cause
- Docker socket not mounted
- Disk space exhausted
- Registry credentials expired
- Network connectivity issues
- Invalid Dockerfile syntax
Solution
    # For Docker-in-Docker setup, verify config:
    # In .gitlab-ci.yml:
    services:
      - docker:24.0.5-dind
    variables:
      DOCKER_HOST: tcp://docker:2376
      DOCKER_TLS_CERTDIR: "/certs"

    # Clean up Docker on runner
    docker system prune -af --volumes

    # Verify registry login (token is passed on stdin so it stays out of the process list)
    echo "$CI_JOB_TOKEN" | docker login registry.gitlab.com -u gitlab-ci-token --password-stdin

    # Check Dockerfile syntax
    docker build --check -f Dockerfile .

    # Debug build with verbose output
    docker build --progress=plain --no-cache -t test .
Prevention
- Implement Docker cleanup in post-job scripts
- Use specific image tags, not latest
- Cache Docker layers in registry
- Set up registry mirrors for reliability
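The cleanup item above can be made conditional so that a runner only prunes when disk usage is high. This is a rough sketch, not a tested production script: `disk_used_percent` and `needs_cleanup` are hypothetical helper names, and the threshold of 80 and the `/var/lib/docker` path are assumptions to adjust per runner.

```shell
#!/bin/sh
# Percentage of disk space used on the filesystem holding PATH (default: /).
disk_used_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# needs_cleanup USED_PERCENT THRESHOLD_PERCENT
# Succeeds when usage is at or above the threshold.
needs_cleanup() {
    [ "$1" -ge "$2" ]
}

# Example post-job step (threshold and path are assumptions):
# if needs_cleanup "$(disk_used_percent /var/lib/docker)" 80; then
#     docker system prune -af --volumes
# fi
```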
Issue: Test Failures in CI Only
Symptoms
- Tests pass locally but fail in CI
- Intermittent test failures (flaky tests)
- Timeout failures
- Environment-specific errors
Cause
- Different environment configuration
- Missing test fixtures
- Race conditions in async tests
- Database state issues
- Timezone or locale differences
Solution
    # Compare environments
    # In .gitlab-ci.yml:
    test:
      before_script:
        - node --version
        - npm --version
        - env | sort > /tmp/ci-env.txt

    # Run tests with same conditions locally
    docker run -it --rm \
      -v $(pwd):/app \
      -w /app \
      node:20 npm test

    # For flaky tests, add retries
    # In .gitlab-ci.yml:
    test:
      retry:
        max: 2
        when: script_failure

    # Debug with SSH (if enabled)
    # Add to .gitlab-ci.yml:
    debug:
      when: manual
      script:
        - sleep 3600  # Pause for SSH debugging
Prevention
- Use identical Docker images locally and in CI
- Implement test isolation (reset state between tests)
- Add explicit waits for async operations
- Run tests in parallel to detect race conditions
Issue: Deployment Failures
Symptoms
- Deployment job fails
- Container not starting after deploy
- Health checks failing
- Rollback triggered
Cause
- Missing secrets/environment variables
- Container image not found
- Resource limits exceeded
- Configuration errors
- Database migration failures
Solution
    # Verify secrets are available
    # In .gitlab-ci.yml:
    deploy:
      script:
        - echo "Checking required variables..."
        - test -n "$DEPLOY_TOKEN" || (echo "DEPLOY_TOKEN not set" && exit 1)

    # Check container logs after deploy
    kubectl logs deployment/<app> -n <namespace> --tail=100

    # Verify image exists
    docker pull registry.gitlab.com/<group>/<project>:<tag>

    # Manual rollback if needed
    kubectl rollout undo deployment/<app> -n <namespace>

    # Check deployment events
    kubectl describe deployment/<app> -n <namespace>
Prevention
- Use deployment previews (Review Apps)
- Implement health check endpoints
- Set up deployment notifications
- Test deployments in staging first
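When a deploy job needs several variables, the single DEPLOY_TOKEN check can be generalized so that all missing names are reported in one run. A minimal sketch: `require_vars` is a hypothetical helper, and `REGISTRY_PASSWORD_EXAMPLE` is a made-up variable name used only for illustration.

```shell
#!/bin/sh
# require_vars NAME...   (hypothetical helper)
# Reports every named variable that is unset or empty, then fails if any
# was missing. Names must be plain shell identifiers because they are
# expanded through eval.
require_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "$name not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example at the top of a deploy script:
# require_vars DEPLOY_TOKEN REGISTRY_PASSWORD_EXAMPLE || exit 1
```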
Issue: Cache Not Working
Symptoms
- Jobs running slower than expected
- Dependencies downloaded every run
- "Cache not found" messages
- Inconsistent cache hits
Cause
- Cache key mismatch
- Cache expired
- Different runners not sharing cache
- Cache path incorrect
- Cache size exceeded
Solution
    # Proper cache configuration
    # In .gitlab-ci.yml:
    cache:
      key:
        files:
          - package-lock.json  # Changes when deps change
      paths:
        - node_modules/
      policy: pull-push  # Default, can use pull or push only

    # Use fallback keys
    cache:
      key: $CI_COMMIT_REF_SLUG
      paths:
        - node_modules/
      when: always
      fallback_keys:
        - main

    # For distributed runners, use S3 cache
    # In config.toml:
    [runners.cache]
      Type = "s3"
      Shared = true
Prevention
- Use file-based cache keys for dependencies
- Configure shared cache backend (S3, GCS)
- Set appropriate cache expiration
- Monitor cache hit rates
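The idea behind file-based cache keys is that the key changes only when the dependency lockfile changes. The sketch below illustrates that principle with a content hash; it is not how GitLab computes `key: files` internally (GitLab derives the key from commit history). `cache_key_for` and the `deps-` prefix are hypothetical, and `sha256sum` is assumed to be installed.

```shell
#!/bin/sh
# cache_key_for FILE   (hypothetical helper)
# Prints "deps-" followed by the first 16 hex characters of the file's
# SHA-256 digest, so identical lockfiles map to the same key.
cache_key_for() {
    digest=$(sha256sum "$1" | cut -c1-16)
    echo "deps-$digest"
}

# Example:
# cache_key_for package-lock.json
```

Comparing the printed key across two branches shows quickly whether they could share a dependency cache.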
Issue: Artifact Upload Failures
Symptoms
- "Artifact upload failed"
- Jobs succeed but artifacts missing
- Artifact size limit exceeded
- Artifacts expired
Cause
- Path pattern not matching files
- Size exceeds limits
- Storage quota exhausted
- Network timeout during upload
Solution
    # Debug artifact paths
    # In .gitlab-ci.yml:
    build:
      script:
        - npm run build
        - ls -la dist/  # Verify files exist
      artifacts:
        paths:
          - dist/
        expire_in: 1 week
        when: always  # Upload even on failure

    # Increase upload timeout
    # (confirm that your runner version supports this variable)
    variables:
      ARTIFACT_UPLOAD_TIMEOUT: "30m"

    # Compress large artifacts
    artifacts:
      paths:
        - build.tar.gz
        # Instead of:
        # - build/
Prevention
- Set appropriate artifact expiration
- Compress artifacts before upload
- Monitor storage usage
- Use artifact dependencies wisely
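Most upload failures listed above can be caught inside the job before the upload starts. A minimal sketch of such a pre-upload check: `check_artifact` is a hypothetical helper, and the size limit passed to it is an assumption that should mirror the limit configured on the GitLab instance.

```shell
#!/bin/sh
# check_artifact PATH MAX_KB   (hypothetical helper)
# Fails when the path is missing, is an empty directory, or is larger
# than MAX_KB kilobytes on disk.
check_artifact() {
    path=$1
    max_kb=$2
    if [ ! -e "$path" ]; then
        echo "artifact path not found: $path" >&2
        return 1
    fi
    if [ -d "$path" ] && [ -z "$(ls -A "$path")" ]; then
        echo "artifact directory is empty: $path" >&2
        return 1
    fi
    size_kb=$(du -sk "$path" | cut -f1)
    if [ "$size_kb" -gt "$max_kb" ]; then
        echo "artifact is ${size_kb} KB, limit is ${max_kb} KB: $path" >&2
        return 1
    fi
    return 0
}

# Example as the last script line of the build job (limit is an assumption):
# check_artifact dist 102400
```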
Issue: Secret Detection Blocking Commits
Symptoms
- Push rejected with "secret detected"
- False positives on secret detection
- Cannot push urgent fixes
- Pre-receive hook failures
Cause
- Actual secrets in code
- Test fixtures containing secret-like patterns
- API keys in documentation
- Base64 encoded data triggering rules
Solution
    # Detected secrets are listed in the secret_detection job log and in its
    # gl-secret-detection-report.json artifact.
    # Validate the CI configuration before changing detection settings
    glab ci lint .gitlab-ci.yml

    # Allowlist false positives
    # In .gitlab-ci.yml:
    secret_detection:
      variables:
        SECRET_DETECTION_EXCLUDED_PATHS: "tests/fixtures/*,docs/examples/*"

    # Remove secret from history (if actually exposed)
    git filter-branch --force --index-filter \
      "git rm --cached --ignore-unmatch <file-with-secret>" \
      --prune-empty --tag-name-filter cat -- --all

    # Rotate exposed secrets IMMEDIATELY
    # Then force push cleaned history
    git push origin --force --all
Prevention
- Use environment variables for all secrets
- Implement pre-commit hooks locally
- Use .gitignore for .env files
- Train team on secret handling
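A local pre-commit hook can catch the most obvious secrets before they reach the server. The sketch below is illustrative only: `scan_for_secrets` is a hypothetical helper, and its three patterns (GitLab personal access token prefix, AWS access key ID prefix, PEM private key header) are assumptions that cover far less than GitLab's own secret detection rules.

```shell
#!/bin/sh
# scan_for_secrets   (hypothetical helper)
# Reads text on stdin and prints numbered lines that look like secrets.
# Exit status 0 means at least one secret-like string was found.
scan_for_secrets() {
    grep -nE -e 'glpat-[0-9A-Za-z_-]{20,}|AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----'
}

# Example .git/hooks/pre-commit body:
# if git diff --cached -U0 | scan_for_secrets; then
#     echo "Possible secret in staged changes; commit aborted." >&2
#     exit 1
# fi
```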
Issue: Merge Train Failures
Symptoms
- Merge train aborted
- Cascading failures in train
- "Pipeline for merged results failed"
- Long wait times in queue
Cause
- Failing tests in earlier train cars
- Merge conflicts during train
- Pipeline timeout in train
- Resource contention
Solution
    # List recently merged MRs
    glab mr list --merged

    # View the merge train for the current project (Merge Trains API)
    glab api projects/:id/merge_trains

    # Check merge train configuration
    # In .gitlab-ci.yml:
    workflow:
      rules:
        - if: $CI_MERGE_REQUEST_EVENT_TYPE == "merge_train"

    # Optimize for merge trains
    test:
      interruptible: true  # Allow cancellation
      parallel: 4          # Speed up tests

    # Skip redundant jobs in train
    build:
      rules:
        - if: $CI_MERGE_REQUEST_EVENT_TYPE == "merge_train"
          when: never  # Use cached build from MR pipeline
        - when: on_success
Prevention
- Keep pipelines fast (< 10 minutes ideal)
- Use interruptible jobs
- Limit merge train size
- Run comprehensive tests before adding to train
Issue: Runner Registration Failures
Symptoms
- "ERROR: Registering runner... failed"
- Token authentication errors
- Runner appears then disappears
- Connection refused errors
Cause
- Invalid registration token
- Network/firewall issues
- GitLab instance unreachable
- TLS certificate problems
- Runner already registered with same token
Solution
    # Verify network connectivity
    curl -v https://gitlab.com/api/v4/runners

    # Register with debug output (--debug is a global flag and goes before the command)
    gitlab-runner --debug register

    # Check runner config
    cat /etc/gitlab-runner/config.toml

    # Unregister and re-register
    gitlab-runner unregister --all-runners
    gitlab-runner register \
      --non-interactive \
      --url https://gitlab.com \
      --token <runner-token> \
      --executor docker \
      --docker-image alpine:latest

    # Fix TLS issues
    gitlab-runner register --tls-ca-file=/path/to/ca.crt
Prevention
- Use runner groups for organization
- Implement runner monitoring
- Document runner setup process
- Use configuration management for runners
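The connectivity check can be scripted for runner monitoring by looking only at the HTTP status code. In this sketch `gitlab_http_status` and `classify_status` are hypothetical helper names. It relies on curl printing 000 for `%{http_code}` when no HTTP response was received, and it treats 401 or 403 as proof that the instance is reachable.

```shell
#!/bin/sh
# Prints the HTTP status code returned by the runners API of BASE_URL,
# or 000 when no HTTP response was received at all.
gitlab_http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1/api/v4/runners"
}

# classify_status CODE   (hypothetical helper)
# Maps a status code to a short diagnosis.
classify_status() {
    case "$1" in
        000)     echo "unreachable: check DNS, firewall, or TLS" ;;
        2??|3??) echo "reachable" ;;
        401|403) echo "reachable, request not authorized" ;;
        *)       echo "reachable, unexpected status $1" ;;
    esac
}

# Example:
# classify_status "$(gitlab_http_status https://gitlab.com)"
```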
Issue: Pipeline YAML Syntax Errors
Symptoms
- "Invalid YAML syntax"
- "Unknown key" errors
- Pipeline not triggering
- Jobs not appearing
Cause
- YAML indentation errors
- Invalid key names
- Incorrect variable syntax
- Missing required fields
Solution
    # Validate YAML locally
    glab ci lint .gitlab-ci.yml

    # Use the project's CI Lint page
    # https://gitlab.com/<group>/<project>/-/ci/lint

    # Check with yamllint
    yamllint .gitlab-ci.yml

    # Validate includes
    # In .gitlab-ci.yml:
    include:
      - local: '.gitlab/ci/test.yml'
        rules:
          - exists: ['.gitlab/ci/test.yml']
Common fixes:
    # Variable interpolation in script lines: all three forms work
    script:
      - echo $VAR      # Works
      - echo ${VAR}    # Works
      - echo "$VAR"    # Works

    # Wrong - anchors must be defined before use
    job:
      <<: *undefined_anchor  # Error!

    # Correct - define anchor first
    .template: &template
      script: echo "test"
    job:
      <<: *template  # Works
Prevention
- Use CI linting in pre-commit hooks
- Validate YAML in IDE with schema
- Use includes for reusable configuration
- Test pipeline changes in MR first
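YAML does not allow tab characters for indentation, and a stray tab is a common source of the indentation errors listed above. A minimal sketch of a check that could run in a pre-commit hook before the linters; `has_tab_indent` is a hypothetical helper name.

```shell
#!/bin/sh
# has_tab_indent FILE   (hypothetical helper)
# Prints numbered lines whose leading whitespace contains a tab.
# Exit status 0 means at least one such line was found.
has_tab_indent() {
    tab=$(printf '\t')
    grep -n "^ *$tab" "$1"
}

# Example pre-commit step:
# if has_tab_indent .gitlab-ci.yml; then
#     echo "Tabs used for indentation in .gitlab-ci.yml" >&2
#     exit 1
# fi
# glab ci lint .gitlab-ci.yml
```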