GitLab Duo Agent Platform - Best Practices
Overview
This document consolidates production-ready patterns, security guidelines, performance optimizations, and operational best practices for deploying and managing the GitLab Duo Agent Platform at scale.
Architecture & Design
1. Agent Design Principles
Single Responsibility Principle
Good: Focused Agent
    name: security-vulnerability-scanner
    description: Scans code for security vulnerabilities only
    capabilities:
      - SAST scanning
      - Dependency vulnerability detection
      - CVE database lookups
      - Risk assessment
Bad: Unfocused Agent
    name: do-everything-agent
    description: Does security, code review, deployment, and monitoring
    # Too many responsibilities, hard to maintain and debug
Clear Boundaries
Each agent should have:
- Well-defined scope: What it does and doesn't do
- Explicit inputs: What data it needs
- Predictable outputs: What results it produces
- Error boundaries: How it handles failures
Example:
    agent:
      name: code-review-agent
      scope:
        includes:
          - Code quality analysis
          - Bug detection
          - Style checking
        excludes:
          - Security vulnerability scanning (use security-agent)
          - Performance profiling (use performance-agent)
      inputs:
        required:
          - merge_request_diff
          - project_style_guide
        optional:
          - previous_review_comments
          - linked_issues
      outputs:
        - review_comments: array of inline comments
        - quality_score: number 0-100
        - approval_recommendation: approve | request_changes | comment
2. Flow Design Patterns
Modular Flows
Break complex workflows into smaller, reusable flows:
Good: Modular
    # security-scan-flow.yml
    name: security-scan
    steps:
      - scan_dependencies
      - scan_code
      - scan_secrets
    ---
    # deployment-flow.yml
    name: deployment
    steps:
      - trigger_flow: security-scan  # Reuse existing flow
      - run_tests
      - deploy
Bad: Monolithic
    name: mega-flow
    steps:
      # 50+ steps doing everything
      # Hard to maintain, debug, and reuse
Idempotent Steps
Design steps to be safely retryable:
    - name: create_issue
      agent: planning_agent
      action: create_issue_if_not_exists  # Idempotent
      # vs.
      # action: create_issue  # Creates a duplicate on retry
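The idea behind create_issue_if_not_exists can be sketched as follows. This is a minimal illustration, not the platform's implementation: IssueStore, TrackedIssue, and the idempotency-key format are hypothetical names, and an in-memory map stands in for the real issue tracker.

```typescript
// Hypothetical in-memory issue tracker, used only to illustrate idempotency.
interface TrackedIssue {
  id: number;
  title: string;
}

class IssueStore {
  private readonly issues = new Map<string, TrackedIssue>();
  private nextId = 1;

  // Returns the existing issue when the key was seen before,
  // so a retried step does not create a duplicate.
  createIfNotExists(key: string, title: string): TrackedIssue {
    const existing = this.issues.get(key);
    if (existing !== undefined) {
      return existing;
    }
    const issue: TrackedIssue = { id: this.nextId++, title };
    this.issues.set(key, issue);
    return issue;
  }

  count(): number {
    return this.issues.size;
  }
}

const store = new IssueStore();
const first = store.createIfNotExists("mr-42:missing-tests", "Add missing tests");
// Simulated retry of the same step with the same key.
const retried = store.createIfNotExists("mr-42:missing-tests", "Add missing tests");
```

The key is derived from what the step is trying to achieve rather than from the attempt, which is what makes the retry safe.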
Fail Fast
Validate early, fail fast:
    flow:
      steps:
        # Fast validation step first
        - name: validate_inputs
          agent: validation_agent
          action: check_prerequisites
        # Only run if validation passed
        - name: expensive_operation
          agent: analysis_agent
          action: deep_analysis
          condition:
            - steps.validate_inputs.outputs.valid == true
3. Context and State Management
Minimize State
Keep flows stateless when possible:
Good: Stateless
    - name: analyze_code
      inputs:
        code: context.merge_request.diff
        style_guide: project.config.style_guide
      # All inputs explicit, no hidden state
Bad: Hidden State
    - name: analyze_code
      # Depends on global variables, previous runs, etc.
      # Hard to test and debug
Context Passing
Be explicit about data flow between steps:
    steps:
      - name: step1
        outputs:
          result: result.data
      - name: step2
        inputs:
          data: steps.step1.outputs.result  # Explicit dependency
Security Best Practices
1. Authentication
Use OIDC Tokens
Recommended: OIDC
    id_tokens:
      AGENT_TOKEN:
        aud: https://agent-platform.gitlab.com
    script:
      - duo-agent auth --token $AGENT_TOKEN
Benefits:
- Short-lived (1 hour)
- No stored secrets
- Granular permissions
- Full audit trail
Avoid: Long-lived tokens
    variables:
      GITLAB_TOKEN: $LONG_LIVED_PAT  # Security risk
Service Account Best Practices
    service_account:
      # Minimum permissions only
      permissions:
        read:
          - code
          - issues
        write:
          - comments
        forbidden:
          - merge
          - delete
          - settings
      # Audit all actions
      audit_logging: true
      # Rate limiting
      rate_limits:
        api_calls: 1000/hour
        cost: 100 USD/day
2. Input Validation
Always validate and sanitize inputs:
    - name: process_user_input
      agent: agent_name
      action: process
      validation:
        inputs:
          user_comment:
            type: string
            max_length: 10000
            sanitize: true  # Remove potential injection attacks
            # Single-quoted so that \s stays literal; YAML double quotes reject it as an unknown escape
            allowed_patterns: ['^[a-zA-Z0-9\s.,!?-]+$']
          file_path:
            type: string
            allowed_paths:
              - "src/**"
              - "tests/**"
            forbidden_paths:
              - "**/.env"
              - "**/secrets/**"
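The checks this configuration describes might look like the sketch below. It is an illustration under assumptions: isValidComment and isAllowedPath are hypothetical helpers, the glob rules are approximated with prefix and path-segment checks rather than a real glob matcher, and the traversal check is an addition not present in the config.

```typescript
// Hypothetical validators mirroring the rules in the config above.
const MAX_COMMENT_LENGTH = 10000;
const ALLOWED_COMMENT = /^[a-zA-Z0-9\s.,!?-]+$/;
const ALLOWED_PREFIXES = ["src/", "tests/"];

function isValidComment(comment: string): boolean {
  return comment.length <= MAX_COMMENT_LENGTH && ALLOWED_COMMENT.test(comment);
}

function isAllowedPath(filePath: string): boolean {
  const segments = filePath.split("/");
  const fileName = segments[segments.length - 1];
  const directories = segments.slice(0, -1);

  // Extra guard (not in the config): refuse path traversal outright.
  if (segments.includes("..")) return false;
  // Approximates "**/.env": a file named .env at any depth.
  if (fileName === ".env") return false;
  // Approximates "**/secrets/**": anything below a secrets directory.
  if (directories.includes("secrets")) return false;
  // Approximates "src/**" and "tests/**".
  return ALLOWED_PREFIXES.some((prefix) => filePath.startsWith(prefix));
}
```

Forbidden rules are evaluated before allowed ones, so a path such as src/secrets/key.pem is rejected even though it sits under an allowed prefix.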
3. Secrets Management
Never expose secrets in logs or outputs:
    agent:
      secrets:
        # Reference secrets, never hardcode
        api_key:
          vault: gitlab
          path: duo/agent/api_key
        database_password:
          vault: gitlab
          path: duo/agent/db_password
      # Mask secrets in logs
      log_masking:
        - api_key
        - database_password
        - "*password*"
        - "*token*"
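As a rough sketch of what pattern-based log masking does, the helper below replaces the value of any field whose name matches one of the log_masking patterns. maskSecrets and patternToRegExp are hypothetical names, and treating "*" as a wildcard with case-insensitive matching is an assumption about how such patterns would be interpreted.

```typescript
// Hypothetical helper: converts a "*"-wildcard pattern into an anchored RegExp.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$", "i");
}

// Returns a copy of `fields` with the values of matching keys replaced.
function maskSecrets(
  fields: Record<string, string>,
  patterns: string[],
): Record<string, string> {
  const matchers = patterns.map(patternToRegExp);
  const masked: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    masked[key] = matchers.some((matcher) => matcher.test(key)) ? "[MASKED]" : value;
  }
  return masked;
}

const safe = maskSecrets(
  { api_key: "abc123", db_password: "hunter2", project_id: "12345" },
  ["api_key", "database_password", "*password*", "*token*"],
);
```

Masking by field name only protects structured fields; secrets embedded in free-text messages would need separate handling.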
4. Least Privilege
Grant minimum necessary permissions:
    agent:
      name: security-scanner
      permissions:
        # Only what's needed
        gitlab:
          read: [code, security_reports]
          write: [comments]
          # No merge, deploy, or admin permissions
        knowledge_graph:
          - query  # Read-only
        external_apis:
          - name: cve_database
            methods: [GET]  # Read-only
5. Audit Logging
Log all agent actions:
    monitoring:
      audit:
        enabled: true
        log_level: info
        events:
          - agent_started
          - agent_completed
          - api_call_made
          - permission_checked
          - error_occurred
        include_context:
          - user_id
          - project_id
          - merge_request_iid
          - service_account
        retention: 90 days
Performance Optimization
1. Caching Strategies
Knowledge Graph Cache
    agent:
      name: code-review-agent
      cache:
        knowledge_graph:
          enabled: true
          ttl: 5m
          invalidate_on:
            - code_change
            - dependency_update
Response Cache
    agent:
      cache:
        responses:
          enabled: true
          ttl: 1h
          cache_key: "${context.merge_request.diff_hash}"
          # Don't cache certain actions
          exclude_actions:
            - create_merge_request
            - approve_merge_request
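A TTL-based response cache keyed by the diff hash could behave roughly like this sketch. TtlCache is a hypothetical class, not a platform API, and the clock is injected so that expiry can be shown without waiting.

```typescript
// Hypothetical TTL cache; `now` is injectable so expiry is deterministic in tests.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop the entry and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Explicit invalidation, e.g. on a code_change event.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

let clock = 0;
const ONE_HOUR_MS = 60 * 60 * 1000;
const cache = new TtlCache<string>(ONE_HOUR_MS, () => clock);
cache.set("diff-hash-abc", "cached review");
const hit = cache.get("diff-hash-abc");
clock += ONE_HOUR_MS; // advance the fake clock past the TTL
const miss = cache.get("diff-hash-abc");
```

Keying on the diff hash means an unchanged merge request reuses the earlier response, while any new commit produces a new key and a fresh run.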
2. Parallel Execution
Run independent steps in parallel:
    steps:
      # These run in parallel
      - name: security_scan
        agent: security_agent
        parallel: group_1
      - name: performance_test
        agent: performance_agent
        parallel: group_1
      - name: style_check
        agent: style_agent
        parallel: group_1
      # This waits for all parallel steps
      - name: aggregate
        agent: aggregate_agent
        depends_on: [security_scan, performance_test, style_check]
3. Lazy Loading
Load data only when needed:
    - name: analyze
      agent: code_review_agent
      lazy_load:
        # Don't load full diff upfront
        diff: on_demand
        # Load Knowledge Graph data as needed
        knowledge_graph: on_demand
      # Agent requests specific data during execution
4. Resource Limits
Set appropriate limits to prevent resource exhaustion:
    agent:
      limits:
        # Execution time
        max_execution_time: 5m  # Most operations should be fast
        timeout_warning: 3m     # Warn if taking too long
        # API calls
        max_api_calls: 500      # Prevent API abuse
        api_call_timeout: 30s
        # Memory and compute
        max_memory: 2GB
        max_cpu_cores: 2
        # Token usage (LLM)
        max_input_tokens: 128000
        max_output_tokens: 32000
        # Cost control
        max_cost_per_execution: 1 USD
        max_daily_cost: 100 USD
5. Batch Operations
Batch API calls when possible:
Bad: N+1 API calls
    - name: process_files
      for_each: context.merge_request.changed_files
      action: process_file  # One API call per file
Good: Batch processing
    - name: process_files
      action: process_files_batch
      inputs:
        files: context.merge_request.changed_files
      # One API call for all files
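When the list of changed files is large, a batch action may still need to split it into bounded groups. The sketch below assumes the batch endpoint has a size limit, which the configuration above does not state; chunk is a hypothetical helper.

```typescript
// Splits a list into fixed-size batches so each batch maps to one API call.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const changedFiles = ["a.ts", "b.ts", "c.ts", "d.ts", "e.ts"];
// Five files with a batch size of two give three calls instead of five.
const batches = chunk(changedFiles, 2);
```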
Reliability & Resilience
1. Error Handling
Graceful Degradation
    - name: advanced_analysis
      agent: analysis_agent
      action: deep_analysis
      on_error:
        # Fallback to simpler analysis
        - name: basic_analysis
          action: simple_analysis
        # If that fails too, provide minimal output
        - name: minimal_analysis
          action: minimal_analysis
Retry Logic
    - name: external_api_call
      agent: integration_agent
      action: call_api
      retry:
        max_attempts: 3
        delay: 5s
        backoff: exponential  # 5s, 10s, 20s
        retry_on:
          - error_type: timeout
          - error_type: rate_limited
          - status_code: [503, 429]
        dont_retry_on:
          - error_type: authentication_failed
          - status_code: [401, 403, 404]
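The retry decision and the exponential delay schedule can be sketched as two small functions. RetryPolicy, backoffDelayMs, and shouldRetry are hypothetical names, and the sketch assumes max_attempts counts the first call as well as the retries.

```typescript
// Hypothetical retry policy mirroring the YAML above.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first call (assumption)
  baseDelayMs: number;
  retryableStatus: number[];
}

// Delay before the n-th retry (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(policy: RetryPolicy, retryNumber: number): number {
  return policy.baseDelayMs * 2 ** (retryNumber - 1);
}

// Retry only while attempts remain and the failure looks transient.
function shouldRetry(policy: RetryPolicy, attemptsMade: number, statusCode: number): boolean {
  return attemptsMade < policy.maxAttempts && policy.retryableStatus.includes(statusCode);
}

const policy: RetryPolicy = { maxAttempts: 3, baseDelayMs: 5000, retryableStatus: [503, 429] };
// Waits before the first and second retry, in milliseconds.
const delays = [1, 2].map((n) => backoffDelayMs(policy, n));
```

Under this reading, three total attempts involve only two waits (5s and 10s); a third wait of 20s would occur only if max_attempts counted retries rather than attempts.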
Circuit Breaker
    agent:
      circuit_breaker:
        enabled: true
        # Open circuit after failures
        failure_threshold: 5  # Open after 5 failures
        failure_window: 1m    # Within 1 minute
        # Try again after timeout
        timeout: 30s          # Wait 30s before retry
        # Close circuit after successes
        success_threshold: 3  # Close after 3 successes
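The state machine behind these settings can be sketched as below. CircuitBreaker is a hypothetical class using the thresholds from the config; timestamps are passed in explicitly (in milliseconds) to keep the example deterministic, and re-opening on any failure during the trial period is a common convention assumed here rather than something the config specifies.

```typescript
type BreakerState = "closed" | "open" | "half_open";

// Hypothetical circuit breaker with the thresholds from the config above.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failureTimes: number[] = [];
  private halfOpenSuccesses = 0;
  private openedAt = 0;

  private readonly failureThreshold = 5;
  private readonly failureWindowMs = 60000;
  private readonly openTimeoutMs = 30000;
  private readonly successThreshold = 3;

  getState(): BreakerState {
    return this.state;
  }

  // Returns true when a call may proceed at time `now`.
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.openTimeoutMs) {
      this.state = "half_open"; // timeout elapsed: let trial calls through
      this.halfOpenSuccesses = 0;
    }
    return this.state !== "open";
  }

  recordFailure(now: number): void {
    if (this.state === "half_open") {
      this.trip(now); // a failed trial call re-opens the circuit
      return;
    }
    // Keep only failures inside the sliding window, then add this one.
    this.failureTimes = this.failureTimes.filter((t) => now - t < this.failureWindowMs);
    this.failureTimes.push(now);
    if (this.failureTimes.length >= this.failureThreshold) {
      this.trip(now);
    }
  }

  recordSuccess(): void {
    if (this.state !== "half_open") return;
    this.halfOpenSuccesses += 1;
    if (this.halfOpenSuccesses >= this.successThreshold) {
      this.state = "closed";
      this.failureTimes = [];
    }
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failureTimes = [];
  }
}

const breaker = new CircuitBreaker();
// Five failures one second apart trip the breaker at t = 4000 ms.
for (let second = 0; second < 5; second++) {
  breaker.recordFailure(second * 1000);
}
const blockedWhileOpen = !breaker.allow(10000);
```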
2. Timeouts
Set timeouts at multiple levels:
    flow:
      # Flow-level timeout
      timeout: 30m
      steps:
        - name: step1
          timeout: 5m  # Step-level timeout
          # Nested so the action-level timeout does not collide with the step-level key
          action:
            name: api_call
            timeout: 30s  # Action-level timeout
    agent:
      default_step_timeout: 5m  # Default for all steps
3. Health Checks
Monitor agent health:
    agent:
      health_check:
        enabled: true
        interval: 30s
        checks:
          - name: api_connectivity
            action: ping_gitlab_api
            timeout: 5s
          - name: knowledge_graph_connectivity
            action: ping_knowledge_graph
            timeout: 5s
          - name: model_availability
            action: ping_ai_model
            timeout: 10s
        on_unhealthy:
          - action: alert_team
            channel: "#agent-platform-alerts"
          - action: disable_agent
            duration: 5m  # Disable for 5 minutes
Cost Management
1. Token Optimization
Minimize LLM token usage:
    agent:
      system_prompt:
        # Concise but complete
        role: "You are a security analyst..."
        guidelines: "..."
        # Don't include unnecessary verbosity
        # role: "You are an incredibly amazing and wonderful..."
      token_optimization:
        # Truncate large inputs
        max_context_size: 50000
        truncation_strategy: "smart"  # Preserve important parts
        # Summarize large outputs
        max_output_size: 10000
        summarize_if_exceeds: true
2. Caching
Cache expensive operations:
    agent:
      cache:
        # Cache model responses
        model_responses:
          enabled: true
          ttl: 1h
          cache_key: "${prompt_hash}"
        # Cache Knowledge Graph queries
        knowledge_graph:
          enabled: true
          ttl: 5m
3. Smart Model Selection
Use appropriate models for tasks:
    agents:
      # Fast model for simple tasks
      - name: label-classifier
        model: claude-3-haiku  # Fast, cheap
      # Advanced model for complex tasks
      - name: security-analyst
        model: claude-3-opus  # Powerful, expensive
      # Tiered approach
      - name: code-reviewer
        models:
          - primary: claude-3-sonnet      # Default
          - fallback: claude-3-haiku      # If budget exceeded
          - complex_tasks: claude-3-opus  # For difficult reviews
4. Cost Monitoring
Track and alert on costs:
    monitoring:
      cost_tracking:
        enabled: true
        granularity: per_agent
        budgets:
          - agent: security-analyst-agent
            daily: 50 USD
            monthly: 1000 USD
          - agent: code-review-agent
            daily: 100 USD
            monthly: 2000 USD
        alerts:
          - threshold: 80%  # of budget
            action: notify_team
            channel: "#agent-platform-costs"
          - threshold: 95%
            action: throttle_agent
            reduction: 50%  # Reduce usage by 50%
          - threshold: 100%
            action: disable_agent
            until: next_budget_period
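The escalation ladder in the alerts section amounts to picking the most severe threshold that current spend has crossed. A minimal sketch, with budgetAction as a hypothetical function name:

```typescript
type BudgetAction = "none" | "notify_team" | "throttle_agent" | "disable_agent";

// Maps spend against a budget to the most severe action whose threshold is met.
function budgetAction(spentUsd: number, budgetUsd: number): BudgetAction {
  if (budgetUsd <= 0) {
    throw new RangeError("budget must be positive");
  }
  const used = spentUsd / budgetUsd;
  if (used >= 1.0) return "disable_agent";
  if (used >= 0.95) return "throttle_agent";
  if (used >= 0.8) return "notify_team";
  return "none";
}

// Daily budget of 50 USD, as for security-analyst-agent above.
const atNinetyPercent = budgetAction(45, 50);
```

Checking thresholds from most to least severe ensures a single evaluation never returns a weaker action than the spend warrants.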
Observability
1. Structured Logging
Log with consistent structure:
    agent:
      logging:
        format: json
        level: info
        include:
          - timestamp
          - session_id
          - agent_name
          - step_name
          - duration_ms
          - token_count
          - cost_usd
          - status
        exclude:
          - sensitive_data
          - user_passwords
          - api_keys
Example log:
    {
      "timestamp": "2026-01-08T10:15:30Z",
      "level": "info",
      "session_id": "sess-abc123",
      "agent": "security-analyst-agent",
      "step": "scan_vulnerabilities",
      "duration_ms": 12453,
      "tokens": 15234,
      "cost_usd": 0.15,
      "status": "success",
      "metadata": {
        "project_id": 12345,
        "mr_iid": 42,
        "vulnerabilities_found": 3
      }
    }
2. Metrics
Track key performance indicators:
    monitoring:
      metrics:
        enabled: true
        export_to: prometheus
        track:
          # Usage metrics
          - agent_executions_total
          - agent_execution_duration_seconds
          - agent_tokens_consumed_total
          # Quality metrics
          - agent_success_rate
          - agent_user_satisfaction
          - agent_correction_rate  # How often humans override agent
          # Cost metrics
          - agent_cost_usd_total
          - agent_cost_per_execution
          # Performance metrics
          - agent_api_call_duration
          - agent_cache_hit_ratio
          - agent_knowledge_graph_query_duration
3. Distributed Tracing
Trace requests across services:
    agent:
      tracing:
        enabled: true
        exporter: opentelemetry
        span_attributes:
          - service.name: agent-platform
          - agent.name: ${agent.name}
          - session.id: ${session.id}
          - project.id: ${context.project_id}
        trace_sampling:
          rate: 0.1  # Sample 10% of requests
          force_sample_on_error: true
4. Alerting
Alert on important events:
    monitoring:
      alerts:
        # Performance degradation
        - name: high_latency
          condition: p95_duration > 60s
          severity: warning
          channel: "#agent-platform-alerts"
        # High error rate
        - name: error_rate_high
          condition: error_rate > 0.05
          severity: critical
          channel: "#agent-platform-alerts"
          pagerduty: true
        # Cost overrun
        - name: cost_budget_exceeded
          condition: daily_cost > budget
          severity: warning
          channel: "#agent-platform-costs"
        # Agent stuck
        - name: agent_timeout
          condition: execution_time > max_timeout
          severity: critical
          action: kill_agent
Testing
1. Unit Testing Agents
Test agent logic in isolation:
    // agent.test.ts
    import { SecurityAnalystAgent } from './security-analyst-agent';

    // mockKnowledgeGraph and mockGitlabApi are test doubles assumed to be
    // defined in the project's test setup.
    describe('SecurityAnalystAgent', () => {
      let agent: SecurityAnalystAgent;

      beforeEach(() => {
        // Fresh agent per test so state does not leak between cases
        agent = new SecurityAnalystAgent({
          knowledgeGraph: mockKnowledgeGraph,
          gitlabApi: mockGitlabApi
        });
      });

      it('should identify SQL injection vulnerability', async () => {
        const code = `
          const query = "SELECT * FROM users WHERE id = " + userId;
        `;

        const result = await agent.analyzeCode(code);

        expect(result.vulnerabilities).toHaveLength(1);
        expect(result.vulnerabilities[0].type).toBe('sql_injection');
        expect(result.vulnerabilities[0].severity).toBe('high');
      });

      it('should calculate risk score correctly', async () => {
        const vulnerability = {
          cvss: 8.1,
          epss: 0.15,
          reachable: true
        };

        const risk = await agent.calculateRisk(vulnerability);

        expect(risk.score).toBeGreaterThan(7.0);
        expect(risk.priority).toBe('critical');
      });
    });
2. Integration Testing Flows
Test flows end-to-end:
    # Test flow with mock context
    glab duo flow test security-triage-flow.yml \
      --context test/fixtures/mr-context.json \
      --mock-agents \
      --expect-output test/fixtures/expected-output.json
Mock context:
    {
      "project_id": 12345,
      "merge_request": {
        "iid": 42,
        "diff": "...",
        "changed_files": ["src/auth.ts"]
      },
      "security_scan_results": {
        "vulnerabilities": [...]
      }
    }
3. Canary Deployments
Gradually roll out agent changes:
    agent:
      deployment:
        strategy: canary
        stages:
          # Criteria are quoted: a bare leading > would start a YAML folded scalar
          - percentage: 10
            duration: 1h
            success_criteria:
              error_rate: "< 0.05"
              p95_latency: "< 30s"
          - percentage: 50
            duration: 6h
            success_criteria:
              error_rate: "< 0.03"
              user_satisfaction: "> 4.0"
          - percentage: 100  # Full rollout
        on_failure:
          action: rollback
          notification: "#agent-platform-alerts"
4. Shadow Mode
Test agents without affecting users:
    agent:
      mode: shadow
      shadow:
        # Run agent alongside existing system
        run_alongside: manual_review
        # Log results but don't post to MR
        log_only: true
        # Compare with human decisions
        compare_with: manual_review_decisions
        # Track accuracy
        metrics:
          - agreement_rate
          - false_positive_rate
          - false_negative_rate
Documentation
1. Agent Documentation
Document agent capabilities and usage:
    agent:
      name: security-analyst-agent
      version: 2.0.0
      documentation:
        description: |
          Analyzes security vulnerabilities and provides risk assessment.
        usage:
          examples:
            - description: Triage vulnerabilities in MR
              trigger: "@security-analyst analyze this MR"
              expected_output: "Risk assessment and prioritized action items"
            - description: Generate security report
              trigger: "@security-analyst generate report for last sprint"
              expected_output: "Comprehensive security report with metrics"
        capabilities:
          - "Vulnerability scanning (SAST, dependency scanning)"
          - "Risk assessment (CVSS, EPSS, reachability)"
          - "Remediation recommendations"
          - "Compliance checking (SOC 2, GDPR)"
        limitations:
          - "Does not perform DAST (dynamic) scanning"
          - "Does not execute code or test exploits"
          - "Recommendations require human validation"
        requirements:
          - "GitLab Ultimate license"
          - "Security scanning enabled in project"
          - "Developer role or higher"
        support:
          contact: "#agent-platform-support"
          documentation: "https://docs.example.com/agents/security-analyst"
          source_code: "https://gitlab.com/agents/security-analyst"
2. Flow Documentation
Document workflow patterns:
    flow:
      name: deployment-validation
      version: 1.5.0
      documentation:
        description: |
          Validates deployment readiness before production release.
        when_to_use: |
          Use this flow before deploying to production. It checks:
          - All tests pass
          - No critical security vulnerabilities
          - Performance benchmarks met
          - Documentation updated
          - Monitoring configured
        how_to_trigger: |
          Automatically triggered when:
          - MR targets main branch
          - MR assigned to @deployment-validator

          Manually trigger:
          @deployment-validator check production readiness
        what_it_checks:
          - Test coverage (minimum 80%)
          - Security scan (no high/critical issues)
          - Performance (p95 latency < 500ms)
          - Documentation (README, CHANGELOG updated)
          - Monitoring (alerts configured)
        expected_duration: "2-5 minutes"
        possible_outcomes:
          - "Approved: All checks passed, safe to deploy"
          - "Warning: Minor issues, human review recommended"
          - "Blocked: Critical issues, cannot deploy"
        examples:
          successful: "link-to-example-mr-1"
          failed: "link-to-example-mr-2"
Governance
1. Agent Approval Process
Require approval for new agents:
    governance:
      agent_approval:
        required: true
        approval_flow:
          - role: security_team
            required_approvals: 1
            conditions:
              - agent.permissions.includes("sensitive_data")
          - role: platform_team
            required_approvals: 2
            conditions:
              - agent.cost_estimate > 1000 USD/month
        review_criteria:
          - "Security review completed"
          - "Cost estimate provided"
          - "Documentation complete"
          - "Test coverage > 80%"
          - "Monitoring configured"
2. Change Management
Control agent updates:
    change_management:
      # Require approval for major changes
      major_version:
        approval_required: true
        notification_period: 7 days
        rollback_plan_required: true
      # Auto-approve minor changes
      minor_version:
        approval_required: false
        notification: true
      # Auto-apply patches
      patch_version:
        auto_apply: true
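Applying this policy requires deciding which kind of change a version bump represents. The sketch below assumes agent versions follow a plain MAJOR.MINOR.PATCH scheme, as the 2.0.0 and 1.5.0 examples elsewhere in this document suggest; parseVersion, classifyBump, and approvalRequired are hypothetical helpers.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Parses a strict MAJOR.MINOR.PATCH string; pre-release tags are not handled.
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (match === null) {
    throw new Error("not a MAJOR.MINOR.PATCH version: " + version);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Reports the most significant component that differs between two versions.
function classifyBump(fromVersion: string, toVersion: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = parseVersion(fromVersion);
  const [toMajor, toMinor, toPatch] = parseVersion(toVersion);
  if (toMajor !== fromMajor) return "major";
  if (toMinor !== fromMinor) return "minor";
  if (toPatch !== fromPatch) return "patch";
  return "none";
}

// Only major changes require approval under the policy above.
function approvalRequired(fromVersion: string, toVersion: string): boolean {
  return classifyBump(fromVersion, toVersion) === "major";
}

const needsApproval = approvalRequired("1.9.0", "2.0.0");
```

Because the comparison looks only at which component differs, a downgrade across a major version is also classified as major and would go through approval.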
3. Compliance
Ensure regulatory compliance:
    compliance:
      frameworks:
        - sox
        - gdpr
        - hipaa
      requirements:
        sox:
          - "All financial data access logged"
          - "Dual approval for production changes"
          - "Audit trail retained for 7 years"
        gdpr:
          - "PII processing documented"
          - "Data retention policies enforced"
          - "Right to erasure supported"
      audit:
        frequency: quarterly
        report_to:
          - compliance_officer
          - security_team
Migration and Rollback
1. Agent Migration
Safely migrate to new agent versions:
    migration:
      from_version: 1.x
      to_version: 2.0
      strategy: blue_green
      steps:
        # Deploy new version alongside old
        - deploy_new_version:
            traffic: 0%
        # Gradually shift traffic
        - increase_traffic:
            increment: 10%
            interval: 1h
            monitor_metrics: true
        # Full cutover
        - complete_migration:
            traffic: 100%
            deprecate_old_version: true
      rollback_trigger:
        - error_rate > 0.05
        - user_satisfaction < 3.5
        - cost_increase > 50%
2. Quick Rollback
Prepare for rapid rollback:
    rollback:
      # Keep previous version ready
      keep_previous_versions: 2
      # One-command rollback
      command: "glab duo agent rollback security-analyst --version 1.9.0"
      # Automatic rollback triggers
      automatic:
        enabled: true
        triggers:
          - error_rate > 0.10
          - p95_latency > 120s
          - cost > 150% of baseline
      # Notification
      notify:
        - "#agent-platform-alerts"
        - on_call_engineer
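The automatic triggers reduce to a few comparisons against live metrics. In this sketch, AgentMetrics and rollbackReasons are hypothetical names, and returning the list of tripped triggers (rather than a boolean) is a design choice so that the notification can say why the rollback happened.

```typescript
// Hypothetical snapshot of the metrics the triggers above refer to.
interface AgentMetrics {
  errorRate: number; // fraction of failed executions, 0 to 1
  p95LatencySeconds: number;
  costUsd: number;
  baselineCostUsd: number;
}

// Returns every trigger that fired; an empty list means no rollback.
function rollbackReasons(metrics: AgentMetrics): string[] {
  const reasons: string[] = [];
  if (metrics.errorRate > 0.1) {
    reasons.push("error_rate > 0.10");
  }
  if (metrics.p95LatencySeconds > 120) {
    reasons.push("p95_latency > 120s");
  }
  if (metrics.costUsd > 1.5 * metrics.baselineCostUsd) {
    reasons.push("cost > 150% of baseline");
  }
  return reasons;
}

const degraded = rollbackReasons({
  errorRate: 0.12,
  p95LatencySeconds: 45,
  costUsd: 80,
  baselineCostUsd: 50,
});
```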
Summary Checklist
Before Production Deployment
- Security review completed
- Cost estimates approved
- Documentation complete
- Tests passing (> 80% coverage)
- Monitoring and alerting configured
- Resource limits set
- Error handling implemented
- Audit logging enabled
- Rollback plan documented
- Team training completed
Production Operations
- Monitor error rates daily
- Review costs weekly
- Check user satisfaction metrics
- Analyze performance trends
- Update documentation as needed
- Respond to alerts promptly
- Conduct monthly reviews
- Plan capacity scaling
Continuous Improvement
- Collect user feedback
- Analyze agent decisions
- Identify optimization opportunities
- Update system prompts
- Refine flows based on usage
- Share learnings with team
- Document lessons learned
- Iterate on improvements
Last Updated: January 2026
GitLab Version: 18.7 (Beta), 18.8 GA (Upcoming)