GitLab Duo Agent Platform - Best Practices

Overview

This document consolidates production-ready patterns, security guidelines, performance optimizations, and operational best practices for deploying and managing the GitLab Duo Agent Platform at scale.

Architecture & Design

1. Agent Design Principles

Single Responsibility Principle

Good: Focused Agent

name: security-vulnerability-scanner
description: Scans code for security vulnerabilities only
capabilities:
  - SAST scanning
  - Dependency vulnerability detection
  - CVE database lookups
  - Risk assessment

Bad: Unfocused Agent

name: do-everything-agent
description: Does security, code review, deployment, and monitoring
# Too many responsibilities, hard to maintain and debug

Clear Boundaries

Each agent should have:

  • Well-defined scope: What it does and doesn't do
  • Explicit inputs: What data it needs
  • Predictable outputs: What results it produces
  • Error boundaries: How it handles failures

Example:

agent:
  name: code-review-agent
  scope:
    includes:
      - Code quality analysis
      - Bug detection
      - Style checking
    excludes:
      - Security vulnerability scanning (use security-agent)
      - Performance profiling (use performance-agent)
  inputs:
    required:
      - merge_request_diff
      - project_style_guide
    optional:
      - previous_review_comments
      - linked_issues
  outputs:
    - review_comments: array of inline comments
    - quality_score: number 0-100
    - approval_recommendation: approve | request_changes | comment

2. Flow Design Patterns

Modular Flows

Break complex workflows into smaller, reusable flows:

Good: Modular

# security-scan-flow.yml
name: security-scan
steps:
  - scan_dependencies
  - scan_code
  - scan_secrets
---
# deployment-flow.yml
name: deployment
steps:
  - trigger_flow: security-scan  # Reuse existing flow
  - run_tests
  - deploy

Bad: Monolithic

name: mega-flow
steps:
  # 50+ steps doing everything
  # Hard to maintain, debug, and reuse

Idempotent Steps

Design steps to be safely retryable:

- name: create_issue
  agent: planning_agent
  action: create_issue_if_not_exists  # Idempotent: safe to retry
  # vs.
  # action: create_issue  # Creates a duplicate issue on retry
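To make the idea concrete, here is a minimal Python sketch of a create-if-not-exists operation keyed by an idempotency key. The `IssueTracker` class and the key format are hypothetical stand-ins for illustration only; they are not part of the GitLab API or the Agent Platform.

```python
from dataclasses import dataclass, field


@dataclass
class IssueTracker:
    """In-memory stand-in for an issue API (hypothetical, for illustration)."""

    issues: dict = field(default_factory=dict)

    def create_issue_if_not_exists(self, idempotency_key: str, title: str) -> dict:
        # If this key was already used, return the existing issue so that
        # a retried step does not create a duplicate.
        existing = self.issues.get(idempotency_key)
        if existing is not None:
            return existing
        issue = {"id": len(self.issues) + 1, "title": title}
        self.issues[idempotency_key] = issue
        return issue


tracker = IssueTracker()
first = tracker.create_issue_if_not_exists("mr-42/missing-tests", "Add missing tests")
retry = tracker.create_issue_if_not_exists("mr-42/missing-tests", "Add missing tests")
```

Calling the operation twice with the same key returns the same issue, which is the property that makes a step safe to retry.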

Fail Fast

Validate early, fail fast:

flow:
  steps:
    - name: validate_inputs
      agent: validation_agent
      action: check_prerequisites
      # Fast validation step first

    - name: expensive_operation
      agent: analysis_agent
      action: deep_analysis
      condition:
        - steps.validate_inputs.outputs.valid == true
      # Only run if validation passed

3. Context and State Management

Minimize State

Keep flows stateless when possible:

Good: Stateless

- name: analyze_code
  inputs:
    code: context.merge_request.diff
    style_guide: project.config.style_guide
  # All inputs explicit, no hidden state

Bad: Hidden State

- name: analyze_code
  # Depends on global variables, previous runs, etc.
  # Hard to test and debug

Context Passing

Be explicit about data flow between steps:

steps:
  - name: step1
    outputs:
      result: result.data

  - name: step2
    inputs:
      data: steps.step1.outputs.result  # Explicit dependency

Security Best Practices

1. Authentication

Use OIDC Tokens

Recommended: OIDC

id_tokens:
  AGENT_TOKEN:
    aud: https://agent-platform.gitlab.com
script:
  - duo-agent auth --token $AGENT_TOKEN

Benefits:

  • Short-lived (1 hour)
  • No stored secrets
  • Granular permissions
  • Full audit trail

Avoid: Long-lived tokens

variables:
  GITLAB_TOKEN: $LONG_LIVED_PAT  # Security risk

Service Account Best Practices

service_account:
  # Minimum permissions only
  permissions:
    read:
      - code
      - issues
    write:
      - comments
    forbidden:
      - merge
      - delete
      - settings

  # Audit all actions
  audit_logging: true

  # Rate limiting
  rate_limits:
    api_calls: 1000/hour
    cost: 100 USD/day

2. Input Validation

Always validate and sanitize inputs:

- name: process_user_input
  agent: agent_name
  action: process
  validation:
    inputs:
      user_comment:
        type: string
        max_length: 10000
        sanitize: true  # Remove potential injection attacks
        # Single quotes: \s is not a valid escape in a double-quoted YAML string
        allowed_patterns: ['^[a-zA-Z0-9\s.,!?-]+$']
      file_path:
        type: string
        allowed_paths:
          - "src/**"
          - "tests/**"
        forbidden_paths:
          - "**/.env"
          - "**/secrets/**"
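The checks described by this configuration can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `fnmatch` is used as an approximation of the glob rules (its `*` also matches `/`, so `src/**` behaves like a prefix match). How the platform itself evaluates these patterns is not specified here.

```python
import posixpath
import re
from fnmatch import fnmatchcase

# Values copied from the configuration above.
COMMENT_PATTERN = re.compile(r"^[a-zA-Z0-9\s.,!?-]+$")
MAX_COMMENT_LENGTH = 10_000
ALLOWED_PATHS = ["src/**", "tests/**"]
FORBIDDEN_PATHS = ["**/.env", "**/secrets/**"]


def is_valid_comment(comment: str) -> bool:
    """Accept only comments within the length limit that match the allow-list pattern."""
    if len(comment) > MAX_COMMENT_LENGTH:
        return False
    return COMMENT_PATTERN.fullmatch(comment) is not None


def is_allowed_path(path: str) -> bool:
    """Deny-list wins over allow-list; anything not explicitly allowed is rejected."""
    # Normalize first so that "src/../.env" cannot slip past the allow-list.
    normalized = posixpath.normpath(path)
    if normalized.startswith(("/", "..")):
        return False
    if any(fnmatchcase(normalized, pattern) for pattern in FORBIDDEN_PATHS):
        return False
    return any(fnmatchcase(normalized, pattern) for pattern in ALLOWED_PATHS)
```

Checking the forbidden patterns before the allowed ones means a path such as `src/config/.env` is rejected even though it sits under an allowed directory.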

3. Secrets Management

Never expose secrets in logs or outputs:

agent:
  secrets:
    # Reference secrets, never hardcode
    api_key:
      vault: gitlab
      path: duo/agent/api_key
    database_password:
      vault: gitlab
      path: duo/agent/db_password

  # Mask secrets in logs
  log_masking:
    - api_key
    - database_password
    - "*password*"
    - "*token*"
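As an illustration of what key-based log masking could look like, the following Python sketch redacts values whose keys match the patterns listed above. The `mask_secrets` helper and the `[MASKED]` placeholder are assumptions made for this example, not platform behavior.

```python
from fnmatch import fnmatchcase

# Patterns copied from the log_masking list above.
MASK_PATTERNS = ["api_key", "database_password", "*password*", "*token*"]


def mask_secrets(record: dict) -> dict:
    """Return a copy of a log record with sensitive values replaced."""
    masked = {}
    for key, value in record.items():
        if any(fnmatchcase(key.lower(), pattern) for pattern in MASK_PATTERNS):
            # Sensitive key: drop the value entirely, whatever its type.
            masked[key] = "[MASKED]"
        elif isinstance(value, dict):
            # Recurse so nested secrets are masked as well.
            masked[key] = mask_secrets(value)
        else:
            masked[key] = value
    return masked
```

Note that this only masks by key name; a secret embedded inside a free-text message would need value-based scanning as well.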

4. Least Privilege

Grant minimum necessary permissions:

agent:
  name: security-scanner
  permissions:
    # Only what's needed
    gitlab:
      read: [code, security_reports]
      write: [comments]
      # No merge, deploy, or admin permissions
    knowledge_graph:
      - query  # Read-only
    external_apis:
      - name: cve_database
        methods: [GET]  # Read-only

5. Audit Logging

Log all agent actions:

monitoring:
  audit:
    enabled: true
    log_level: info
    events:
      - agent_started
      - agent_completed
      - api_call_made
      - permission_checked
      - error_occurred
    include_context:
      - user_id
      - project_id
      - merge_request_iid
      - service_account
    retention: 90 days

Performance Optimization

1. Caching Strategies

Knowledge Graph Cache

agent:
  name: code-review-agent
  cache:
    knowledge_graph:
      enabled: true
      ttl: 5m
      invalidate_on:
        - code_change
        - dependency_update

Response Cache

agent:
  cache:
    responses:
      enabled: true
      ttl: 1h
      cache_key: "${context.merge_request.diff_hash}"
      # Don't cache certain actions
      exclude_actions:
        - create_merge_request
        - approve_merge_request
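The behavior this configuration describes (a TTL cache keyed by the diff hash, with side-effecting actions excluded) can be sketched as follows. The `ResponseCache` class is a hypothetical illustration, and the injectable `clock` parameter exists only to make expiry easy to test.

```python
import hashlib
import time


class ResponseCache:
    """TTL cache keyed by (action, SHA-256 of the diff). Illustrative only."""

    def __init__(self, ttl_seconds, exclude_actions, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.exclude_actions = set(exclude_actions)
        self.clock = clock
        self.entries = {}

    def get_or_compute(self, action, diff, compute):
        # Actions with side effects must always run, never be served from cache.
        if action in self.exclude_actions:
            return compute()
        key = (action, hashlib.sha256(diff.encode("utf-8")).hexdigest())
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self.entries[key] = (now, value)
        return value
```

Keying on the diff hash means any change to the merge request content yields a new key, so stale reviews are never returned for modified code.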

2. Parallel Execution

Run independent steps in parallel:

steps:
  # These run in parallel
  - name: security_scan
    agent: security_agent
    parallel: group_1

  - name: performance_test
    agent: performance_agent
    parallel: group_1

  - name: style_check
    agent: style_agent
    parallel: group_1

  # This waits for all parallel steps
  - name: aggregate
    agent: aggregate_agent
    depends_on: [security_scan, performance_test, style_check]
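The fan-out and fan-in shape of this flow can be illustrated with `asyncio`. This is a sketch, not how the platform schedules steps: `run_step` is a hypothetical placeholder that sleeps instead of doing real agent work.

```python
import asyncio


async def run_step(name: str, seconds: float) -> dict:
    # Placeholder for real agent work; the sleep simulates I/O latency.
    await asyncio.sleep(seconds)
    return {"step": name, "status": "success"}


async def run_flow() -> dict:
    # The three independent steps run concurrently; gather() returns their
    # results in the order the steps were listed, not completion order.
    results = await asyncio.gather(
        run_step("security_scan", 0.03),
        run_step("performance_test", 0.02),
        run_step("style_check", 0.01),
    )
    # The aggregate step starts only after all parallel steps have finished.
    return {
        "aggregate": [r["step"] for r in results],
        "all_passed": all(r["status"] == "success" for r in results),
    }


summary = asyncio.run(run_flow())
```

Total wall-clock time is roughly that of the slowest step rather than the sum of all three, which is the benefit of grouping independent steps.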

3. Lazy Loading

Load data only when needed:

- name: analyze
  agent: code_review_agent
  lazy_load:
    # Don't load full diff upfront
    diff: on_demand
    # Load Knowledge Graph data as needed
    knowledge_graph: on_demand
  # Agent requests specific data during execution

4. Resource Limits

Set appropriate limits to prevent resource exhaustion:

agent:
  limits:
    # Execution time
    max_execution_time: 5m  # Most operations should be fast
    timeout_warning: 3m     # Warn if taking too long

    # API calls
    max_api_calls: 500      # Prevent API abuse
    api_call_timeout: 30s

    # Memory and compute
    max_memory: 2GB
    max_cpu_cores: 2

    # Token usage (LLM)
    max_input_tokens: 128000
    max_output_tokens: 32000

    # Cost control
    max_cost_per_execution: 1 USD
    max_daily_cost: 100 USD

5. Batch Operations

Batch API calls when possible:

Bad: N+1 API calls

- name: process_files
  for_each: context.merge_request.changed_files
  action: process_file
  # One API call per file

Good: Batch processing

- name: process_files
  action: process_files_batch
  inputs:
    files: context.merge_request.changed_files
  # One API call for all files

Reliability & Resilience

1. Error Handling

Graceful Degradation

- name: advanced_analysis
  agent: analysis_agent
  action: deep_analysis
  on_error:
    # Fallback to simpler analysis
    - name: basic_analysis
      action: simple_analysis
    # If that fails too, provide minimal output
    - name: minimal_analysis
      action: minimal_analysis

Retry Logic

- name: external_api_call
  agent: integration_agent
  action: call_api
  retry:
    max_attempts: 3
    delay: 5s
    backoff: exponential  # 5s, 10s, 20s
    retry_on:
      - error_type: timeout
      - error_type: rate_limited
      - status_code: [503, 429]
    dont_retry_on:
      - error_type: authentication_failed
      - status_code: [401, 403, 404]
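A retry policy of this shape can be sketched in Python as below. The exception classes and `call_with_retry` are hypothetical names for illustration, and the sketch assumes `max_attempts` counts the initial call as well as the retries.

```python
import time


class RetryableError(Exception):
    """Transient failure such as a timeout, rate limit, or HTTP 429/503."""


class FatalError(Exception):
    """Permanent failure such as HTTP 401/403/404; retrying will not help."""


def call_with_retry(func, max_attempts=3, delay=5.0, sleep=time.sleep):
    """Call func, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # The wait doubles each time: delay, 2 * delay, 4 * delay, ...
            sleep(delay * 2 ** (attempt - 1))
        # FatalError (and anything else) is not caught, so it propagates
        # immediately without any retry.
```

Under that assumption, three attempts involve at most two waits (5s, then 10s); a 20s wait would only occur if a fourth attempt were allowed.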

Circuit Breaker

agent:
  circuit_breaker:
    enabled: true
    # Open circuit after failures
    failure_threshold: 5  # Open after 5 failures
    failure_window: 1m    # Within 1 minute
    # Try again after timeout
    timeout: 30s          # Wait 30s before retry
    # Close circuit after successes
    success_threshold: 3  # Close after 3 successes
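The state machine behind these settings (closed, open, half-open) can be sketched as a small Python class. This is an illustrative implementation under the usual circuit-breaker conventions, not the platform's own; the class and method names are hypothetical, and the `clock` parameter is injectable only to simplify testing.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, failure_window=60.0,
                 timeout=30.0, success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window
        self.timeout = timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = deque()  # Timestamps of recent failures.
        self.successes = 0       # Consecutive successes while half-open.
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.timeout:
                return False
            # Timeout elapsed: let trial requests through.
            self.state = "half_open"
            self.successes = 0
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "half_open":
            self._open(now)  # A failed trial request reopens the circuit.
            return
        self.failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.failure_window:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.opened_at = now
```

Callers would check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`.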

2. Timeouts

Set timeouts at multiple levels:

flow:
  # Flow-level timeout
  timeout: 30m
  steps:
    - name: step1
      timeout: 5m  # Step-level timeout
      action:
        name: api_call
        timeout: 30s  # Action-level timeout (nested to avoid a duplicate "timeout" key)

agent:
  default_step_timeout: 5m  # Default for all steps

3. Health Checks

Monitor agent health:

agent:
  health_check:
    enabled: true
    interval: 30s
    checks:
      - name: api_connectivity
        action: ping_gitlab_api
        timeout: 5s
      - name: knowledge_graph_connectivity
        action: ping_knowledge_graph
        timeout: 5s
      - name: model_availability
        action: ping_ai_model
        timeout: 10s
    on_unhealthy:
      - action: alert_team
        channel: "#agent-platform-alerts"
      - action: disable_agent
        duration: 5m  # Disable for 5 minutes

Cost Management

1. Token Optimization

Minimize LLM token usage:

agent:
  system_prompt:
    # Concise but complete
    role: "You are a security analyst..."
    guidelines: "..."
    # Don't include unnecessary verbosity
    # role: "You are an incredibly amazing and wonderful..."

  token_optimization:
    # Truncate large inputs
    max_context_size: 50000
    truncation_strategy: "smart"  # Preserve important parts
    # Summarize large outputs
    max_output_size: 10000
    summarize_if_exceeds: true

2. Caching

Cache expensive operations:

agent:
  cache:
    # Cache model responses
    model_responses:
      enabled: true
      ttl: 1h
      cache_key: "${prompt_hash}"
    # Cache Knowledge Graph queries
    knowledge_graph:
      enabled: true
      ttl: 5m

3. Smart Model Selection

Use appropriate models for tasks:

agents:
  # Fast model for simple tasks
  - name: label-classifier
    model: claude-3-haiku  # Fast, cheap

  # Advanced model for complex tasks
  - name: security-analyst
    model: claude-3-opus  # Powerful, expensive

  # Tiered approach
  - name: code-reviewer
    models:
      - primary: claude-3-sonnet      # Default
      - fallback: claude-3-haiku      # If budget exceeded
      - complex_tasks: claude-3-opus  # For difficult reviews
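One way to read the tiered approach is as a small routing function. The sketch below is an assumption about how such a policy might be evaluated (budget check first, then task complexity); the function name, its parameters, and the precedence order are hypothetical.

```python
def select_model(complexity: str, spent_today_usd: float, daily_budget_usd: float) -> str:
    """Pick a model tier for the code-reviewer agent described above."""
    # Budget exhaustion takes precedence: fall back to the cheapest model.
    if spent_today_usd >= daily_budget_usd:
        return "claude-3-haiku"
    # Difficult reviews go to the most capable model.
    if complexity == "high":
        return "claude-3-opus"
    # Everything else uses the default.
    return "claude-3-sonnet"
```

Putting the budget check first means a complex task is downgraded once the budget is spent; a team that prefers quality over cost could reverse the two checks.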

4. Cost Monitoring

Track and alert on costs:

monitoring:
  cost_tracking:
    enabled: true
    granularity: per_agent
    budgets:
      - agent: security-analyst-agent
        daily: 50 USD
        monthly: 1000 USD
      - agent: code-review-agent
        daily: 100 USD
        monthly: 2000 USD
    alerts:
      - threshold: 80%  # of budget
        action: notify_team
        channel: "#agent-platform-costs"
      - threshold: 95%
        action: throttle_agent
        reduction: 50%  # Reduce usage by 50%
      - threshold: 100%
        action: disable_agent
        until: next_budget_period
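The escalating thresholds above amount to picking the most severe rule whose threshold has been reached. A minimal Python sketch, with a hypothetical `budget_action` helper and the thresholds copied from the configuration:

```python
# (fraction of budget, action), ordered from most to least severe.
ALERT_RULES = [
    (1.00, "disable_agent"),
    (0.95, "throttle_agent"),
    (0.80, "notify_team"),
]


def budget_action(spent_usd: float, budget_usd: float):
    """Return the most severe action triggered by current spend, or None."""
    usage = spent_usd / budget_usd
    for threshold, action in ALERT_RULES:
        if usage >= threshold:
            return action
    return None
```

For example, with the 50 USD daily budget of `security-analyst-agent`, spending 45 USD reaches the 80% rule only, while 48 USD reaches the 95% rule.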

Observability

1. Structured Logging

Log with consistent structure:

agent:
  logging:
    format: json
    level: info
    include:
      - timestamp
      - session_id
      - agent_name
      - step_name
      - duration_ms
      - token_count
      - cost_usd
      - status
    exclude:
      - sensitive_data
      - user_passwords
      - api_keys

Example log:

{
  "timestamp": "2026-01-08T10:15:30Z",
  "level": "info",
  "session_id": "sess-abc123",
  "agent": "security-analyst-agent",
  "step": "scan_vulnerabilities",
  "duration_ms": 12453,
  "tokens": 15234,
  "cost_usd": 0.15,
  "status": "success",
  "metadata": {
    "project_id": 12345,
    "mr_iid": 42,
    "vulnerabilities_found": 3
  }
}

2. Metrics

Track key performance indicators:

monitoring:
  metrics:
    enabled: true
    export_to: prometheus
    track:
      # Usage metrics
      - agent_executions_total
      - agent_execution_duration_seconds
      - agent_tokens_consumed_total
      # Quality metrics
      - agent_success_rate
      - agent_user_satisfaction
      - agent_correction_rate  # How often humans override agent
      # Cost metrics
      - agent_cost_usd_total
      - agent_cost_per_execution
      # Performance metrics
      - agent_api_call_duration
      - agent_cache_hit_ratio
      - agent_knowledge_graph_query_duration

3. Distributed Tracing

Trace requests across services:

agent:
  tracing:
    enabled: true
    exporter: opentelemetry
    span_attributes:
      - service.name: agent-platform
      - agent.name: ${agent.name}
      - session.id: ${session.id}
      - project.id: ${context.project_id}
    trace_sampling:
      rate: 0.1  # Sample 10% of requests
      force_sample_on_error: true
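The sampling rule (keep roughly 10% of sessions, but always keep errors) can be sketched with a deterministic hash so that every service makes the same decision for a given session. This is an illustrative approach, not the OpenTelemetry sampler API; `should_sample` is a hypothetical helper.

```python
import hashlib


def should_sample(session_id: str, rate: float = 0.1, had_error: bool = False) -> bool:
    """Decide whether to keep the trace for a session."""
    # Errors are always traced, regardless of the sampling rate.
    if had_error:
        return True
    # Map the session ID to a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the session ID instead of drawing a random number keeps all spans of one session together: either the whole trace is sampled or none of it is.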

4. Alerting

Alert on important events:

monitoring:
  alerts:
    # Performance degradation
    - name: high_latency
      condition: p95_duration > 60s
      severity: warning
      channel: "#agent-platform-alerts"

    # High error rate
    - name: error_rate_high
      condition: error_rate > 0.05
      severity: critical
      channel: "#agent-platform-alerts"
      pagerduty: true

    # Cost overrun
    - name: cost_budget_exceeded
      condition: daily_cost > budget
      severity: warning
      channel: "#agent-platform-costs"

    # Agent stuck
    - name: agent_timeout
      condition: execution_time > max_timeout
      severity: critical
      action: kill_agent

Testing

1. Unit Testing Agents

Test agent logic in isolation:

// agent.test.ts
import { SecurityAnalystAgent } from './security-analyst-agent';

// Placeholder test doubles (hypothetical; replace with your project's mocks).
const mockKnowledgeGraph = {};
const mockGitlabApi = {};

describe('SecurityAnalystAgent', () => {
  let agent: SecurityAnalystAgent;

  beforeEach(() => {
    // Inject mocks so the agent logic is tested in isolation.
    agent = new SecurityAnalystAgent({
      knowledgeGraph: mockKnowledgeGraph,
      gitlabApi: mockGitlabApi
    });
  });

  it('should identify SQL injection vulnerability', async () => {
    // String concatenation into a SQL query is the vulnerable pattern.
    const code = `
      const query = "SELECT * FROM users WHERE id = " + userId;
    `;

    const result = await agent.analyzeCode(code);

    expect(result.vulnerabilities).toHaveLength(1);
    expect(result.vulnerabilities[0].type).toBe('sql_injection');
    expect(result.vulnerabilities[0].severity).toBe('high');
  });

  it('should calculate risk score correctly', async () => {
    const vulnerability = {
      cvss: 8.1,
      epss: 0.15,
      reachable: true
    };

    const risk = await agent.calculateRisk(vulnerability);

    expect(risk.score).toBeGreaterThan(7.0);
    expect(risk.priority).toBe('critical');
  });
});

2. Integration Testing Flows

Test flows end-to-end:

# Test flow with mock context
glab duo flow test security-triage-flow.yml \
  --context test/fixtures/mr-context.json \
  --mock-agents \
  --expect-output test/fixtures/expected-output.json

Mock context:

{
  "project_id": 12345,
  "merge_request": {
    "iid": 42,
    "diff": "...",
    "changed_files": ["src/auth.ts"]
  },
  "security_scan_results": {
    "vulnerabilities": [...]
  }
}

3. Canary Deployments

Gradually roll out agent changes:

agent:
  deployment:
    strategy: canary
    stages:
      - percentage: 10
        duration: 1h
        success_criteria:
          error_rate: < 0.05
          p95_latency: < 30s
      - percentage: 50
        duration: 6h
        success_criteria:
          error_rate: < 0.03
          # Quoted: an unquoted leading ">" would start a YAML folded block scalar
          user_satisfaction: "> 4.0"
      - percentage: 100  # Full rollout
    on_failure:
      action: rollback
      notification: "#agent-platform-alerts"
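Promotion between canary stages comes down to checking observed metrics against each stage's success criteria. The Python sketch below shows one way to evaluate criteria written as strings like `< 0.05`; the `stage_passes` helper and the handling of the `s` unit suffix are assumptions for illustration.

```python
import operator

OPERATORS = {"<": operator.lt, ">": operator.gt}


def stage_passes(criteria: dict, metrics: dict) -> bool:
    """Return True only if every criterion holds for the observed metrics."""
    for name, rule in criteria.items():
        symbol, limit = rule.split()
        # Treat a trailing "s" as a seconds unit, e.g. "30s" becomes 30.0.
        threshold = float(limit.rstrip("s"))
        # A missing metric counts as a failure: never promote on no data.
        if name not in metrics:
            return False
        if not OPERATORS[symbol](metrics[name], threshold):
            return False
    return True


stage_one = {"error_rate": "< 0.05", "p95_latency": "< 30s"}
healthy = stage_passes(stage_one, {"error_rate": 0.02, "p95_latency": 12.0})
degraded = stage_passes(stage_one, {"error_rate": 0.08, "p95_latency": 12.0})
```

A rollout controller would call this at the end of each stage's duration and either advance to the next percentage or trigger the `on_failure` rollback.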

4. Shadow Mode

Test agents without affecting users:

agent:
  mode: shadow
  shadow:
    # Run agent alongside existing system
    run_alongside: manual_review
    # Log results but don't post to MR
    log_only: true
    # Compare with human decisions
    compare_with: manual_review_decisions
    # Track accuracy
    metrics:
      - agreement_rate
      - false_positive_rate
      - false_negative_rate

Documentation

1. Agent Documentation

Document agent capabilities and usage:

agent:
  name: security-analyst-agent
  version: 2.0.0
  documentation:
    description: |
      Analyzes security vulnerabilities and provides risk assessment.
    usage:
      examples:
        - description: Triage vulnerabilities in MR
          trigger: "@security-analyst analyze this MR"
          expected_output: "Risk assessment and prioritized action items"
        - description: Generate security report
          trigger: "@security-analyst generate report for last sprint"
          expected_output: "Comprehensive security report with metrics"
    capabilities:
      - "Vulnerability scanning (SAST, dependency scanning)"
      - "Risk assessment (CVSS, EPSS, reachability)"
      - "Remediation recommendations"
      - "Compliance checking (SOC 2, GDPR)"
    limitations:
      - "Does not perform DAST (dynamic) scanning"
      - "Does not execute code or test exploits"
      - "Recommendations require human validation"
    requirements:
      - "GitLab Ultimate license"
      - "Security scanning enabled in project"
      - "Developer role or higher"
    support:
      contact: "#agent-platform-support"
      documentation: "https://docs.example.com/agents/security-analyst"
      source_code: "https://gitlab.com/agents/security-analyst"

2. Flow Documentation

Document workflow patterns:

flow:
  name: deployment-validation
  version: 1.5.0
  documentation:
    description: |
      Validates deployment readiness before production release.
    when_to_use: |
      Use this flow before deploying to production. It checks:
      - All tests pass
      - No critical security vulnerabilities
      - Performance benchmarks met
      - Documentation updated
      - Monitoring configured
    how_to_trigger: |
      Automatically triggered when:
      - MR targets main branch
      - MR assigned to @deployment-validator

      Manually trigger:
      @deployment-validator check production readiness
    what_it_checks:
      - Test coverage (minimum 80%)
      - Security scan (no high/critical issues)
      - Performance (p95 latency < 500ms)
      - Documentation (README, CHANGELOG updated)
      - Monitoring (alerts configured)
    expected_duration: "2-5 minutes"
    possible_outcomes:
      - "Approved: All checks passed, safe to deploy"
      - "Warning: Minor issues, human review recommended"
      - "Blocked: Critical issues, cannot deploy"
    examples:
      successful: "link-to-example-mr-1"
      failed: "link-to-example-mr-2"

Governance

1. Agent Approval Process

Require approval for new agents:

governance:
  agent_approval:
    required: true
    approval_flow:
      - role: security_team
        required_approvals: 1
        conditions:
          - agent.permissions.includes("sensitive_data")
      - role: platform_team
        required_approvals: 2
        conditions:
          - agent.cost_estimate > 1000 USD/month
    review_criteria:
      - "Security review completed"
      - "Cost estimate provided"
      - "Documentation complete"
      - "Test coverage > 80%"
      - "Monitoring configured"

2. Change Management

Control agent updates:

change_management:
  # Require approval for major changes
  major_version:
    approval_required: true
    notification_period: 7 days
    rollback_plan_required: true
  # Auto-approve minor changes
  minor_version:
    approval_required: false
    notification: true
  # Auto-apply patches
  patch_version:
    auto_apply: true

3. Compliance

Ensure regulatory compliance:

compliance:
  frameworks:
    - sox
    - gdpr
    - hipaa
  requirements:
    sox:
      - "All financial data access logged"
      - "Dual approval for production changes"
      - "Audit trail retained for 7 years"
    gdpr:
      - "PII processing documented"
      - "Data retention policies enforced"
      - "Right to erasure supported"
  audit:
    frequency: quarterly
    report_to:
      - compliance_officer
      - security_team

Migration and Rollback

1. Agent Migration

Safely migrate to new agent versions:

migration:
  from_version: 1.x
  to_version: 2.0
  strategy: blue_green
  steps:
    # Deploy new version alongside old
    - deploy_new_version:
        traffic: 0%
    # Gradually shift traffic
    - increase_traffic:
        increment: 10%
        interval: 1h
        monitor_metrics: true
    # Full cutover
    - complete_migration:
        traffic: 100%
        deprecate_old_version: true
  rollback_trigger:
    - error_rate > 0.05
    - user_satisfaction < 3.5
    - cost_increase > 50%

2. Quick Rollback

Prepare for rapid rollback:

rollback:
  # Keep previous version ready
  keep_previous_versions: 2
  # One-command rollback
  command: "glab duo agent rollback security-analyst --version 1.9.0"
  # Automatic rollback triggers
  automatic:
    enabled: true
    triggers:
      - error_rate > 0.10
      - p95_latency > 120s
      - cost > 150% of baseline
  # Notification
  notify:
    - "#agent-platform-alerts"
    - on_call_engineer

Summary Checklist

Before Production Deployment

  • Security review completed
  • Cost estimates approved
  • Documentation complete
  • Tests passing (> 80% coverage)
  • Monitoring and alerting configured
  • Resource limits set
  • Error handling implemented
  • Audit logging enabled
  • Rollback plan documented
  • Team training completed

Production Operations

  • Monitor error rates daily
  • Review costs weekly
  • Check user satisfaction metrics
  • Analyze performance trends
  • Update documentation as needed
  • Respond to alerts promptly
  • Conduct monthly reviews
  • Plan capacity scaling

Continuous Improvement

  • Collect user feedback
  • Analyze agent decisions
  • Identify optimization opportunities
  • Update system prompts
  • Refine flows based on usage
  • Share learnings with team
  • Document lessons learned
  • Iterate on improvements

Last Updated: January 2026
GitLab Version: 18.7 (Beta), 18.8 GA (Upcoming)