Vector Databases Integration Guide

Overview

Vector databases are specialized data stores optimized for storing, indexing, and searching high-dimensional vector embeddings. In the context of Drupal and AI systems, they enable semantic search, similarity matching, and retrieval-augmented generation (RAG) capabilities by storing embeddings generated from content.

Vector databases differ from traditional relational databases in several ways:

Similarity search: Results are ranked by a distance metric (cosine, Euclidean, dot product) rather than matched exactly
Approximate Nearest Neighbor (ANN) search: Fast retrieval from millions of vectors without a full scan, trading a small amount of recall for speed
Metadata filtering: Vector similarity can be combined with traditional attribute filters
Scalability: Indexes are built to handle high-dimensional data efficiently
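
To make the distance metrics and metadata filtering concrete, here is a minimal sketch in plain PHP. It scans every vector, which is exactly the full scan that ANN indexes avoid, so treat it as an illustration of the scoring rather than a production approach. The function names and the item shape (`id`, `vector`, `metadata`) are assumptions made for this example.

```php
<?php

/** Dot product of two equal-length vectors. */
function dot_product(array $a, array $b): float {
  if (count($a) !== count($b)) {
    throw new InvalidArgumentException('Vectors must have the same dimension.');
  }
  $sum = 0.0;
  foreach ($a as $i => $value) {
    $sum += $value * $b[$i];
  }
  return $sum;
}

/** Cosine similarity: 1.0 means same direction, 0.0 means orthogonal. */
function cosine_similarity(array $a, array $b): float {
  $norm = sqrt(dot_product($a, $a)) * sqrt(dot_product($b, $b));
  return $norm == 0.0 ? 0.0 : dot_product($a, $b) / $norm;
}

/** Euclidean (L2) distance: smaller means closer. */
function euclidean_distance(array $a, array $b): float {
  if (count($a) !== count($b)) {
    throw new InvalidArgumentException('Vectors must have the same dimension.');
  }
  $sum = 0.0;
  foreach ($a as $i => $value) {
    $sum += ($value - $b[$i]) ** 2;
  }
  return sqrt($sum);
}

/**
 * Exhaustive top-k search with an optional exact-match metadata filter.
 * Each item is ['id' => string, 'vector' => float[], 'metadata' => array].
 */
function brute_force_search(array $items, array $query, int $k, array $filter = []): array {
  $scored = [];
  foreach ($items as $item) {
    foreach ($filter as $key => $expected) {
      if (($item['metadata'][$key] ?? null) !== $expected) {
        continue 2; // Skip items that fail the metadata filter.
      }
    }
    $scored[] = [
      'id' => $item['id'],
      'score' => cosine_similarity($item['vector'], $query),
    ];
  }
  // Highest similarity first.
  usort($scored, fn($x, $y) => $y['score'] <=> $x['score']);
  return array_slice($scored, 0, $k);
}
```

A vector database performs the same ranking, but replaces the loop over every item with an index (such as HNSW or IVF) that inspects only a small fraction of the vectors.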

Supported Vector Database Systems

1. ChromaDB

Overview: Open-source vector database designed for AI applications, particularly popular for local development and small-to-medium deployments.

Key Characteristics:

Lightweight and easy to set up
Supports both in-memory and persistent storage
Built-in support for multiple embedding providers
Native Python and JavaScript SDKs
Upsert by ID, so re-indexing a document updates its record instead of duplicating it

Advantages:

Zero external dependencies for basic setup
Great for prototyping and development
Built-in embedding generation
Simple REST API for integration

Limitations:

Limited horizontal scaling capabilities
Not optimized for massive-scale deployments (100M+ vectors)
Smaller community compared to alternatives

Integration with Drupal:

<?php

// Using ChromaDB HTTP Client

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class ChromaDbService {
  protected $httpClient;
  protected $logger;
  protected $chromaUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $chromaUrl = 'http://localhost:8000'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('chromadb');
    $this->chromaUrl = $chromaUrl;
  }

  /**
   * Create or get a collection
   */
  public function getOrCreateCollection(string $collectionName): array {
    try {
      $response = $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections",
        [
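          'json' => [
            'name' => $collectionName,
            'metadata' => ['hnsw:space' => 'cosine'],
            // Return the existing collection instead of failing when the name is taken.
            'get_or_create' => true,
          ],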
        ]
      );

      return json_decode($response->getBody(), true);
    } catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add documents with embeddings
   */
  public function addDocuments(
    string $collectionName,
    array $documents,
    array $embeddings,
    array $metadatas = [],
    array $ids = []
  ): void {
    if (empty($ids)) {
      // Fallback IDs are only unique within one call; pass stable IDs in production.
      $ids = array_map(fn($i) => "doc_$i", array_keys($documents));
    }

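    try {
      // Note: Chroma's REST API addresses collections by the `id` returned from
      // the create call, so pass that ID here rather than the display name.
      $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionName/add",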
        [
          'json' => [
            'ids' => $ids,
            'embeddings' => $embeddings,
            'documents' => $documents,
            'metadatas' => $metadatas,
          ],
        ]
      );
    } catch (\Exception $e) {
      $this->logger->error('Failed to add documents: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query similar documents
   */
  public function query(
    string $collectionName,
    array $queryEmbeddings,
    int $nResults = 10,
    ?array $whereFilter = null
  ): array {
    try {
      $params = [
        'query_embeddings' => $queryEmbeddings,
        'n_results' => $nResults,
      ];

      if ($whereFilter) {
        $params['where'] = $whereFilter;
      }

      $response = $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionName/query",
        ['json' => $params]
      );

      return json_decode($response->getBody(), true);
    } catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete documents
   */
  public function deleteDocuments(string $collectionName, array $ids): void {
    try {
      $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionName/delete",
        ['json' => ['ids' => $ids]]
      );
    } catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}
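
Chroma returns query results in a column-oriented shape, with one inner list per query embedding under keys such as `ids`, `distances`, `documents`, and `metadatas`. The helper below is a small sketch (the function name is hypothetical) that flattens the result for one query into rows that are easier to render. The `similarity` value assumes the collection was created with cosine distance, as in `getOrCreateCollection()` above.

```php
<?php

/**
 * Flatten the column-oriented result for one query into a list of rows.
 * Assumes cosine distance, where similarity = 1 - distance.
 */
function flatten_chroma_results(array $response, int $queryIndex = 0): array {
  $ids = $response['ids'][$queryIndex] ?? [];
  $rows = [];
  foreach ($ids as $i => $id) {
    $distance = $response['distances'][$queryIndex][$i] ?? null;
    $rows[] = [
      'id' => $id,
      'document' => $response['documents'][$queryIndex][$i] ?? null,
      'metadata' => $response['metadatas'][$queryIndex][$i] ?? [],
      'distance' => $distance,
      'similarity' => $distance === null ? null : 1.0 - $distance,
    ];
  }
  return $rows;
}
```

A caller would typically pass the decoded array returned by `query()` straight into this helper.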

Docker Setup:

version: '3.8'

services:
  chromadb:
    image: ghcr.io/chroma-core/chroma:latest
    ports:
      - "8000:8000"
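    environment:
      # Chroma 0.4+ replaced the legacy duckdb+parquet backend; persistence is
      # enabled with IS_PERSISTENT. Verify both settings against your image tag.
      - IS_PERSISTENT=TRUE
      - PERSIST_DIRECTORY=/chroma/data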
    volumes:
      - chromadb_data:/chroma/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  chromadb_data:

2. Pinecone

Overview: Fully managed vector database service with enterprise-grade infrastructure, multi-region deployment, and high availability.

Key Characteristics:

Fully managed SaaS platform
Automatic scaling and high availability
Multi-region deployment options
Advanced filtering and metadata support
Integrated API key management and security

Advantages:

Zero infrastructure management
Exceptional search performance at scale
Enterprise SLAs and support
Automatic backups and disaster recovery
Real-time index updates

Limitations:

Significant cost for large-scale usage
Vendor lock-in concerns
Requires external service dependency
Data residency considerations

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

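// Assumes a third-party PHP client that exposes this class. Pinecone does not
// publish an official PHP SDK, so adjust the namespace and calls to your library.
use Pinecone\Client as PineconeClient;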
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class PineconeService {
  protected $client;
  protected $logger;
  protected $configFactory;
  protected $indexName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('pinecone');
    $this->configFactory = $configFactory;

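    $config = $configFactory->get('my_ai_module.pinecone');
    // Use the key ID stored in configuration (see api_key_key in the schema).
    $apiKey = $this->getSecureApiKey($config->get('api_key_key') ?? 'pinecone_api_key');
    $environment = $config->get('environment');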

    $this->client = PineconeClient::create([
      'api_key' => $apiKey,
      'environment' => $environment,
    ]);

    $this->indexName = $config->get('index_name');
  }

  /**
   * Securely retrieve API key from environment or Drupal key management
   */
  protected function getSecureApiKey(string $keyName): string {
    // First try the environment variable.
    $apiKey = getenv('PINECONE_API_KEY');
    if (is_string($apiKey) && $apiKey !== '') {
      return $apiKey;
    }

    // Fall back to the Key module, looking up the requested key ID.
    if (\Drupal::moduleHandler()->moduleExists('key')) {
      $key = \Drupal::service('key.repository')->getKey($keyName);
      if ($key) {
        return $key->getKeyValue();
      }
    }

    throw new \RuntimeException('Pinecone API key not configured');
  }

  /**
   * Upsert vectors (insert or update)
   */
  public function upsertVectors(
    array $vectors,
    string $namespace = 'default'
  ): void {
    try {
      $index = $this->client->index($this->indexName);

      $upsertData = array_map(function($vector) {
        return [
          'id' => $vector['id'],
          'values' => $vector['values'],
          'metadata' => $vector['metadata'] ?? [],
        ];
      }, $vectors);

      $index->upsert(vectors: $upsertData, namespace: $namespace);

      $this->logger->info('Upserted @count vectors to @namespace', [
        '@count' => count($vectors),
        '@namespace' => $namespace,
      ]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to upsert vectors: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query vectors with metadata filtering
   */
  public function query(
    array $vector,
    int $topK = 10,
    ?array $filter = null,
    string $namespace = 'default'
  ): array {
    try {
      $index = $this->client->index($this->indexName);

      $queryParams = [
        'vector' => $vector,
        'topK' => $topK,
        'namespace' => $namespace,
        'includeMetadata' => true,
      ];

      if ($filter) {
        $queryParams['filter'] = $filter;
      }

      $results = $index->query($queryParams);

      return [
        'matches' => $results['matches'] ?? [],
        'namespace' => $namespace,
      ];
    } catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete vectors by ID
   */
  public function deleteVectors(
    array $ids,
    string $namespace = 'default'
  ): void {
    try {
      $index = $this->client->index($this->indexName);
      $index->delete(ids: $ids, namespace: $namespace);

      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    } catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Get index statistics
   */
  public function getIndexStats(): array {
    try {
      $index = $this->client->index($this->indexName);
      return $index->describeIndexStats();
    } catch (\Exception $e) {
      $this->logger->error('Failed to get index stats: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}
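
Large upserts are usually sent in batches so that a single request stays within the service's payload limits. The sketch below shows the batching step on its own; the helper name is hypothetical, the callback stands in for a call such as `upsertVectors()`, and the default batch size of 100 is an assumption to tune against your plan and vector dimension.

```php
<?php

/**
 * Split vectors into batches and hand each batch to an upsert callback.
 *
 * @param array $vectors Vectors in the ['id', 'values', 'metadata'] shape.
 * @param callable $upsert Hypothetical callback that sends one batch.
 * @param int $batchSize Assumed batch size, not a documented limit.
 * @return int Number of batches sent.
 */
function upsert_in_batches(array $vectors, callable $upsert, int $batchSize = 100): int {
  if ($batchSize < 1) {
    throw new InvalidArgumentException('Batch size must be at least 1.');
  }
  $batches = array_chunk($vectors, $batchSize);
  foreach ($batches as $batch) {
    $upsert($batch);
  }
  return count($batches);
}
```

In the service above, the callback could be `fn(array $batch) => $pinecone->upsertVectors($batch, 'default')`, where `$pinecone` is an instance of `PineconeService`.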

Configuration Schema (my_ai_module.schema.yml):

my_ai_module.pinecone:
  type: config_object
  label: 'Pinecone Configuration'
  mapping:
    api_key_key:
      type: string
      label: 'API Key ID (from Key module)'
      description: 'Reference to stored Pinecone API key'
    environment:
      type: string
      label: 'Pinecone Environment'
      description: 'e.g., gcp-starter, us-east-1-aws'
    index_name:
      type: string
      label: 'Index Name'
      description: 'Name of the Pinecone index'
    dimension:
      type: integer
      label: 'Vector Dimension'
      description: 'Dimension of embeddings (e.g., 1536 for OpenAI)'

3. Milvus

Overview: Open-source vector database designed for high-performance similarity search on massive-scale datasets, supporting both cloud and on-premise deployments.

Key Characteristics:

High performance with HNSW and IVF indexing algorithms
Distributed architecture for horizontal scaling
Supports GPU acceleration
Multiple distance metrics (L2, IP, cosine, Hamming)
Time-based filtering and partitioning

Advantages:

Open-source with no vendor lock-in
Excellent performance at scale (100M+ vectors)
Flexible deployment options (Milvus Lite, standalone, distributed)
GPU acceleration for large-scale operations
Rich filtering capabilities with scalar data

Limitations:

More complex operational overhead than managed services
Steeper learning curve for configuration
Requires infrastructure management
Relies on community support for the open-source edition

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

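// Assumes a third-party PHP client that exposes this class. Milvus does not
// publish an official PHP SDK, so adjust the namespace and calls to your library.
use Milvus\MilvusClient;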
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class MilvusService {
  protected $client;
  protected $logger;
  protected $configFactory;
  protected $collectionName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('milvus');
    $this->configFactory = $configFactory;

    $config = $configFactory->get('my_ai_module.milvus');
    $host = $config->get('host') ?? 'localhost';
    $port = $config->get('port') ?? 19530;

    $this->client = new MilvusClient([
      'host' => $host,
      'port' => $port,
    ]);

    $this->collectionName = $config->get('collection_name');
  }

  /**
   * Create collection with schema
   */
  public function createCollection(
    string $collectionName,
    array $fields,
    string $description = ''
  ): void {
    try {
      $this->client->createCollection([
        'collection_name' => $collectionName,
        'fields' => $fields,
        'description' => $description,
      ]);

      $this->logger->info('Collection @name created', ['@name' => $collectionName]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Insert vectors with auto-ID generation
   */
  public function insertVectors(
    string $collectionName,
    array $vectors,
    array $metadatas = []
  ): array {
    try {
      $insertData = [];
      foreach ($vectors as $i => $vector) {
        $insertData[] = [
          'embedding' => $vector,
          'metadata' => json_encode($metadatas[$i] ?? []),
          'created_at' => time() * 1000, // Milvus timestamp in milliseconds
        ];
      }

      $response = $this->client->insert($collectionName, $insertData);

      $this->logger->info('Inserted @count vectors', ['@count' => count($vectors)]);

      return $response;
    } catch (\Exception $e) {
      $this->logger->error('Insert failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Search with vector similarity and metadata filtering
   */
  public function search(
    string $collectionName,
    array $queryVector,
    int $limit = 10,
    ?array $metadataFilter = null
  ): array {
    try {
      $searchParams = [
        'collection_name' => $collectionName,
        'vectors' => [$queryVector],
        'limit' => $limit,
        'metric_type' => 'COSINE', // or L2, IP
        'vector_field_name' => 'embedding',
      ];

      if ($metadataFilter) {
        $searchParams['expr'] = $this->buildFilterExpression($metadataFilter);
      }

      $response = $this->client->search($searchParams);

      return $response['results'][0] ?? [];
    } catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build Milvus filter expression from conditions
   */
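  protected function buildFilterExpression(array $filter): string {
    $expressions = [];
    // Quote strings and escape backslashes and quotes (assumes backslash
    // escapes in string literals); numbers are left bare for numeric fields.
    $quote = fn($v) => is_int($v) || is_float($v)
      ? (string) $v
      : "'" . str_replace(['\\', "'"], ['\\\\', "\\'"], (string) $v) . "'";

    foreach ($filter as $field => $value) {
      // Field names are interpolated directly, so allow identifiers only.
      if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', (string) $field)) {
        throw new \InvalidArgumentException("Invalid filter field: $field");
      }
      if (is_array($value)) {
        // Handle IN operators
        $values = implode(',', array_map($quote, $value));
        $expressions[] = "$field in [$values]";
      } else {
        // Handle equality
        $expressions[] = "$field == " . $quote($value);
      }
    }

    return implode(' && ', $expressions);
  }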

  /**
   * Delete vectors by IDs
   */
  public function deleteVectors(
    string $collectionName,
    array $ids
  ): void {
    try {
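      // Cast to integers so IDs cannot inject expression syntax
      // (assumes an INT64 primary key named `id`).
      $this->client->delete($collectionName, [
        'expr' => 'id in [' . implode(',', array_map('intval', $ids)) . ']',
      ]);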

      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    } catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Create index for performance
   */
  public function createIndex(
    string $collectionName,
    string $fieldName = 'embedding',
    string $indexType = 'HNSW'
  ): void {
    try {
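      $this->client->createIndex([
        'collection_name' => $collectionName,
        'field_name' => $fieldName,
        'index_name' => "{$collectionName}_{$fieldName}_index",
        'index_type' => $indexType, // HNSW, IVF_FLAT, IVF_SQ8
        // M and efConstruction apply to HNSW only; IVF index types take nlist.
        // The values are starting points to tune, not recommendations.
        'params' => $indexType === 'HNSW'
          ? ['M' => 8, 'efConstruction' => 200]
          : ['nlist' => 1024],
      ]);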

      $this->logger->info('Index created on @field', ['@field' => $fieldName]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to create index: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}

Docker Compose Setup:

version: '3.8'

services:
  minio:
    image: minio/minio:latest
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: minio server /minio_data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5

  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    ports:
      - "2379:2379"
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd_data
    volumes:
      - etcd_data:/etcd_data

  milvus:
    image: milvusdb/milvus:latest
    command: ["milvus", "run", "standalone"]
    depends_on:
      - minio
      - etcd
    environment:
      COMMON_STORAGETYPE: minio
      MINIO_ADDRESS: minio:9000
      ETCD_ENDPOINTS: etcd:2379
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    healthcheck:
      # Port 19530 serves gRPC; the HTTP health endpoint is on port 9091.
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  minio_data:
  etcd_data:
  milvus_data:

4. Weaviate

Overview: Open-source vector database with built-in NLP modules, combining vector search with semantic understanding and graph capabilities.

Key Characteristics:

Built-in multi-language support
GraphQL API for complex queries
Integrated module system (transformers, Q&A, NER)
Hybrid search combining vector and keyword search
Rich filtering with reference properties

Advantages:

Excellent for semantic understanding out-of-the-box
GraphQL support for complex queries
Built-in ML modules reduce external dependencies
Hybrid search for better relevance
Strong documentation and community

Limitations:

Higher resource requirements
Learning curve for GraphQL queries
Memory-intensive for large-scale deployments
Module licensing considerations

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class WeaviateService {
  protected $httpClient;
  protected $logger;
  protected $weaviateUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $weaviateUrl = 'http://localhost:8080'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('weaviate');
    $this->weaviateUrl = $weaviateUrl;
  }

  /**
   * Create a class schema
   */
  public function createClass(
    string $className,
    array $properties,
    ?array $vectorizer = null
  ): void {
    try {
      $schema = [
        'class' => $className,
        'description' => "Class $className",
        'properties' => $properties,
      ];

      if ($vectorizer) {
        $schema['vectorizer'] = $vectorizer;
      }

      $this->httpClient->post(
        "{$this->weaviateUrl}/v1/schema",
        ['json' => $schema]
      );

      $this->logger->info('Class @class created', ['@class' => $className]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to create class: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add objects with vectors
   */
  public function addObjects(
    string $className,
    array $objects,
    array $vectors = []
  ): array {
    try {
      $batchObjects = [];

      foreach ($objects as $i => $obj) {
        $batchObject = [
          'class' => $className,
          'properties' => $obj,
        ];

        if (isset($vectors[$i])) {
          $batchObject['vector'] = $vectors[$i];
        }

        $batchObjects[] = $batchObject;
      }

      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/batch/objects",
        ['json' => ['objects' => $batchObjects]]
      );

      $result = json_decode($response->getBody(), true);

      $this->logger->info('Added @count objects', ['@count' => count($objects)]);

      return $result;
    } catch (\Exception $e) {
      $this->logger->error('Failed to add objects: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * GraphQL query with semantic search
   */
  public function query(
    string $className,
    array $queryVector,
    int $limit = 10,
    array $properties = [],
    ?array $where = null
  ): array {
    try {
      $propertiesStr = empty($properties)
        ? ''
        : implode(' ', $properties);

      $whereClause = '';
      if ($where) {
        $whereClause = $this->buildWhereClause($where);
      }

      // Heredocs interpolate variables but not expressions, so build the
      // vector literal before the query string.
      $vectorStr = implode(',', $queryVector);

      $query = <<<GQL
      {
        Get {
          $className(
            nearVector: {
              vector: [{$vectorStr}]
              distance: 0.8
            }
            limit: $limit
            $whereClause
          ) {
            $propertiesStr
            _additional {
              distance
              vector
            }
          }
        }
      }
      GQL;

      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );

      return json_decode($response->getBody(), true);
    } catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Hybrid search combining vector and keyword search
   */
  public function hybridSearch(
    string $className,
    string $searchText,
    array $queryVector,
    int $limit = 10
  ): array {
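    try {
      // Build literals outside the heredoc: json_encode() quotes and escapes
      // the search text, and the vector must be joined before interpolation.
      $searchTextLiteral = json_encode($searchText);
      $vectorStr = implode(',', $queryVector);

      $query = <<<GQL
      {
        Get {
          $className(
            hybrid: {
              query: $searchTextLiteral
              vector: [{$vectorStr}]
              alpha: 0.5
            }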
            limit: $limit
          ) {
            text
            _additional {
              score
              distance
            }
          }
        }
      }
      GQL;

      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );

      return json_decode($response->getBody(), true);
    } catch (\Exception $e) {
      $this->logger->error('Hybrid search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete objects by filter
   */
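  public function deleteObjects(string $className, array $where): void {
    if (empty($where)) {
      throw new \InvalidArgumentException('A filter is required for batch delete.');
    }

    try {
      // Batch delete takes a JSON filter (not the GraphQL string form),
      // wrapped in a `match` object that names the class.
      $operands = [];
      foreach ($where as $field => $value) {
        $operands[] = [
          'path' => [$field],
          'operator' => 'Equal',
          'valueString' => (string) $value,
        ];
      }
      $filter = count($operands) === 1
        ? $operands[0]
        : ['operator' => 'And', 'operands' => $operands];

      $this->httpClient->delete(
        "{$this->weaviateUrl}/v1/batch/objects",
        [
          'json' => [
            'match' => ['class' => $className, 'where' => $filter],
          ],
        ]
      );

      $this->logger->info('Objects deleted from @class', ['@class' => $className]);
    } catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }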

  /**
   * Helper to build WHERE clause for filtering
   */
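  protected function buildWhereClause(array $conditions): string {
    if (empty($conditions)) {
      return '';
    }

    // Recent Weaviate releases prefer valueText over valueString.
    $clauses = [];
    foreach ($conditions as $field => $value) {
      // json_encode() quotes and escapes both the field name and the value.
      $clauses[] = 'path: [' . json_encode((string) $field) . '] operator: Equal valueString: ' . json_encode((string) $value);
    }

    // A single condition is the filter itself; several are combined with And.
    if (count($clauses) === 1) {
      return 'where: { ' . $clauses[0] . ' }';
    }

    $operands = implode(' ', array_map(fn($c) => "{ $c }", $clauses));
    return "where: { operator: And operands: [ $operands ] }";
  }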

  /**
   * Get vectorizer status
   */
  public function getVectorizerStatus(): array {
    try {
      $response = $this->httpClient->get("{$this->weaviateUrl}/v1/modules");
      return json_decode($response->getBody(), true);
    } catch (\Exception $e) {
      $this->logger->error('Failed to get status: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}
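
The `alpha` value in `hybridSearch()` controls the balance between the two result sets: 1.0 uses only the vector ranking and 0.0 only the keyword ranking. The sketch below illustrates that weighting with a simple normalize-then-blend scheme. It is a simplified illustration, not Weaviate's actual fusion algorithm, and the function names are hypothetical.

```php
<?php

/** Min-max normalize scores into [0, 1]; a constant list maps to all 1.0. */
function normalize_scores(array $scores): array {
  if ($scores === []) {
    return [];
  }
  $min = min($scores);
  $max = max($scores);
  // array_map() preserves keys when given a single array.
  return array_map(
    fn($s) => $max == $min ? 1.0 : ($s - $min) / ($max - $min),
    $scores
  );
}

/**
 * Blend vector and keyword scores keyed by document ID.
 * alpha = 1.0 uses only vector scores, alpha = 0.0 only keyword scores.
 */
function fuse_scores(array $vectorScores, array $keywordScores, float $alpha = 0.5): array {
  $v = normalize_scores($vectorScores);
  $k = normalize_scores($keywordScores);
  $fused = [];
  foreach (array_unique(array_merge(array_keys($v), array_keys($k))) as $id) {
    // A document missing from one result set contributes 0 for that side.
    $fused[$id] = $alpha * ($v[$id] ?? 0.0) + (1 - $alpha) * ($k[$id] ?? 0.0);
  }
  arsort($fused); // Highest fused score first, keys preserved.
  return $fused;
}
```

Raising `alpha` favors semantically similar results even when they share no keywords with the query; lowering it favors exact term matches.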

Docker Setup:

version: '3.8'

services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 100
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      ENABLE_MODULES: 'text2vec-openai,generative-openai'
      OPENAI_APIKEY: ${OPENAI_API_KEY}
      OPENAI_INFERENCE_API: openai
    volumes:
      - weaviate_data:/var/lib/weaviate
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v1/.well-known/ready"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  weaviate_data:

Semantic Search Implementation

Understanding Embeddings

Embeddings are dense vector representations of text that capture semantic meaning. Each word, sentence, or document is converted to a vector of numbers (typically 384 to 3072 dimensions, depending on the model).

Embedding Models:

OpenAI: text-embedding-3-small (1536 dims), text-embedding-3-large (3072 dims)
Sentence Transformers: all-MiniLM-L6-v2 (384 dims), all-mpnet-base-v2 (768 dims)
Cohere: embed-english-v3.0 (1024 dims)
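
Because each model produces vectors of a fixed dimension, an index created for one model will reject or mis-score vectors from another. The sketch below shows two small guards that can run before storing an embedding: a dimension check and L2 normalization, after which the dot product of two vectors equals their cosine similarity. The function names are hypothetical.

```php
<?php

/** Throw if an embedding does not have the dimension the index expects. */
function check_dimension(array $embedding, int $expected): void {
  $actual = count($embedding);
  if ($actual !== $expected) {
    throw new InvalidArgumentException(
      "Embedding has $actual dimensions, expected $expected."
    );
  }
}

/** Scale a vector to unit length. */
function l2_normalize(array $vector): array {
  $norm = sqrt(array_sum(array_map(fn($x) => $x * $x, $vector)));
  if ($norm == 0.0) {
    throw new InvalidArgumentException('Cannot normalize a zero vector.');
  }
  return array_map(fn($x) => $x / $norm, $vector);
}
```

For example, a module configured for a 1536-dimension index could call `check_dimension($embedding, 1536)` before every upsert.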

Complete Semantic Search Implementation

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Entity\EntityTypeManagerInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\node\Entity\Node;

class SemanticSearchService {
  protected $embeddingService;
  protected $vectorDb;
  protected $entityTypeManager;
  protected $logger;

  public function __construct(
    EmbeddingService $embeddingService,
    VectorDatabaseInterface $vectorDb,
    EntityTypeManagerInterface $entityTypeManager,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->embeddingService = $embeddingService;
    $this->vectorDb = $vectorDb;
    $this->entityTypeManager = $entityTypeManager;
    $this->logger = $loggerFactory->get('semantic_search');
  }

  /**
   * Index a node for semantic search
   */
  public function indexNode(Node $node): void {
    try {
      // Extract searchable text
      $text = $this->extractNodeText($node);

      // Generate embedding
      $embedding = $this->embeddingService->embed($text);

      // Prepare metadata
      $metadata = [
        'nid' => $node->id(),
        'node_type' => $node->getType(),
        'title' => $node->getTitle(),
        'author' => $node->getOwner()->getAccountName(),
        'created' => $node->getCreatedTime(),
        'updated' => $node->getChangedTime(),
        'language' => $node->language()->getId(),
      ];

      // Store in vector database
      $this->vectorDb->upsert(
        id: "node_{$node->id()}",
        embedding: $embedding,
        metadata: $metadata,
        document: $text
      );

      $this->logger->info('Indexed node @nid', ['@nid' => $node->id()]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to index node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Extract text from node for embedding
   */
  protected function extractNodeText(Node $node): string {
    $text = $node->getTitle() . "\n\n";

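    // Add body field if available, stripping markup before embedding.
    if ($node->hasField('body')) {
      $text .= strip_tags($node->get('body')->value ?? '');
    }

    // Add other configurable text fields. Base fields (nid, uuid, langcode, ...)
    // and non-text types would only add noise to the embedding.
    $textTypes = ['string', 'string_long', 'text', 'text_long', 'text_with_summary'];
    foreach ($node->getFieldDefinitions() as $fieldName => $field) {
      if ($fieldName === 'body'
        || $field->getFieldStorageDefinition()->isBaseField()
        || !in_array($field->getType(), $textTypes, true)) {
        continue;
      }
      $value = $node->get($fieldName)->value ?? '';
      if (is_string($value) && $value !== '') {
        $text .= "\n" . strip_tags($value);
      }
    }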

    return $text;
  }

  /**
   * Semantic search across indexed content
   */
  public function search(
    string $query,
    int $limit = 10,
    array $filters = []
  ): array {
    try {
      // Generate embedding for query
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $limit,
        filters: $filters
      );

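      // Enrich results with full node data, loading all nodes in one call.
      $nids = array_filter(array_map(fn($r) => $r['metadata']['nid'] ?? null, $results));
      $nodes = $this->entityTypeManager->getStorage('node')->loadMultiple($nids);

      $enrichedResults = [];
      foreach ($results as $result) {
        $node = $nodes[$result['metadata']['nid'] ?? ''] ?? null;
        // Skip stale index entries and nodes the current user may not view.
        if ($node && $node->access('view')) {
          $enrichedResults[] = [
            'score' => $result['score'],
            'node' => $node,
            'similarity' => $result['score'],
            'snippet' => $this->generateSnippet($result['document'] ?? '', $query),
          ];
        }
      }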

      return $enrichedResults;
    } catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate snippet with query context
   */
  protected function generateSnippet(string $text, string $query, int $length = 150): string {
    // Note: str_word_count() is locale-dependent and drops digits and punctuation.
    $words = str_word_count($text, 1);
    $queryWords = array_map('strtolower', str_word_count($query, 1));

    // Find position of first query word match
    $position = 0;
    foreach ($words as $i => $word) {
      if (in_array(strtolower($word), $queryWords, true)) {
        $position = max(0, $i - 5);
        break;
      }
    }

    $snippet = implode(' ', array_slice($words, $position, 30));
    // Only append an ellipsis when the snippet was actually truncated.
    return strlen($snippet) > $length ? substr($snippet, 0, $length) . '...' : $snippet;
  }

  /**
   * Remove node from search index
   */
  public function removeNode(int $nid): void {
    try {
      $this->vectorDb->delete("node_$nid");
      $this->logger->info('Removed node @nid from search', ['@nid' => $nid]);
    } catch (\Exception $e) {
      $this->logger->error('Failed to remove node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}

RAG (Retrieval Augmented Generation) Patterns

Core RAG Architecture

RAG enhances language models by retrieving relevant documents before generation, reducing hallucinations and improving accuracy.

Flow: Query → Embedding → Vector Search → Context Retrieval → LLM Generation
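
The retrieval half of this flow can be sketched end to end in plain PHP. The example below uses a toy bag-of-words embedding over a fixed vocabulary as a stand-in for a real embedding model, and an in-memory scan as a stand-in for the vector database. All function names are hypothetical, and the final prompt would be passed to whatever LLM service the site uses.

```php
<?php

/** Toy embedding: term counts over a fixed vocabulary (not a real model). */
function toy_embed(string $text, array $vocabulary): array {
  $counts = array_count_values(str_word_count(strtolower($text), 1));
  return array_map(fn($term) => (float) ($counts[$term] ?? 0), $vocabulary);
}

/** Cosine similarity, returning 0.0 when either vector is all zeros. */
function toy_cosine(array $a, array $b): float {
  $dot = 0.0;
  $normA = 0.0;
  $normB = 0.0;
  foreach ($a as $i => $value) {
    $dot += $value * $b[$i];
    $normA += $value * $value;
    $normB += $b[$i] * $b[$i];
  }
  return ($normA == 0.0 || $normB == 0.0) ? 0.0 : $dot / sqrt($normA * $normB);
}

/** Embed the query, rank the chunks, and assemble a grounded prompt. */
function build_rag_prompt(string $query, array $chunks, array $vocabulary, int $topK = 2): string {
  $queryVector = toy_embed($query, $vocabulary);
  $scored = [];
  foreach ($chunks as $chunk) {
    $scored[] = [
      'text' => $chunk,
      'score' => toy_cosine($queryVector, toy_embed($chunk, $vocabulary)),
    ];
  }
  usort($scored, fn($x, $y) => $y['score'] <=> $x['score']);
  $context = implode("\n---\n", array_column(array_slice($scored, 0, $topK), 'text'));
  return "Answer using only the context below.\n\nContext:\n$context\n\nQuestion: $query";
}
```

Only the top-ranked chunks reach the prompt, which is what keeps the generation step grounded in retrieved content rather than the model's general knowledge.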

Chunking Strategies

Proper document chunking is critical for RAG effectiveness:

<?php

namespace Drupal\my_ai_module\Services;

class DocumentChunkingService {

  /**
   * Chunk by fixed size with overlap
   */
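  public function chunkBySize(
    string $text,
    int $chunkSize = 512,
    int $overlapSize = 100
  ): array {
    if ($chunkSize < 1 || $overlapSize < 0 || $overlapSize >= $chunkSize) {
      throw new \InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
    }

    // Sizes are measured in words, not model tokens.
    $words = str_word_count($text, 1);
    $total = count($words);
    $chunks = [];

    for ($i = 0; $i < $total; $i += ($chunkSize - $overlapSize)) {
      $chunks[] = implode(' ', array_slice($words, $i, $chunkSize));
      // Stop once a chunk reaches the end, so no trailing chunk duplicates it.
      if ($i + $chunkSize >= $total) {
        break;
      }
    }

    return $chunks;
  }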

  /**
   * Chunk by semantic boundaries (sentences/paragraphs)
   */
  public function chunkBySemantic(
    string $text,
    int $targetSize = 512
  ): array {
    $paragraphs = preg_split('/\n\n+/', $text);
    $chunks = [];
    $currentChunk = '';
    $currentSize = 0;

    foreach ($paragraphs as $paragraph) {
      $paragraphSize = str_word_count($paragraph);

      if ($currentSize + $paragraphSize > $targetSize && !empty($currentChunk)) {
        $chunks[] = $currentChunk;
        $currentChunk = '';
        $currentSize = 0;
      }

      $currentChunk .= $paragraph . "\n\n";
      $currentSize += $paragraphSize;
    }

    if (!empty($currentChunk)) {
      $chunks[] = $currentChunk;
    }

    return $chunks;
  }

  /**
   * Chunk by markdown headers for structured documents
   */
  public function chunkByStructure(string $text): array {
    $chunks = [];
    $lines = explode("\n", $text);
    $currentChunk = '';
    $currentHeader = '';

    foreach ($lines as $line) {
      // Check for headers
      if (preg_match('/^#+\s+(.+)$/', $line, $matches)) {
        if (!empty($currentChunk)) {
          $chunks[] = [
            'header' => $currentHeader,
            'content' => $currentChunk,
          ];
        }
        $currentHeader = $matches[1];
        $currentChunk = $line . "\n";
      } else {
        $currentChunk .= $line . "\n";
      }
    }

    if (!empty($currentChunk)) {
      $chunks[] = [
        'header' => $currentHeader,
        'content' => $currentChunk,
      ];
    }

    return $chunks;
  }
}
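
The arithmetic behind fixed-size chunking with overlap is easy to get wrong, so here is a small worked sketch. With a chunk size of 512 words and an overlap of 100, each chunk starts 412 words after the previous one. The helper below (a hypothetical name, independent of the service above) returns the starting word offset of every chunk.

```php
<?php

/**
 * Start offsets (in words) of each chunk for a fixed-size sliding window.
 */
function chunk_offsets(int $wordCount, int $chunkSize, int $overlap): array {
  if ($chunkSize < 1 || $overlap < 0 || $overlap >= $chunkSize) {
    throw new InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
  }
  $stride = $chunkSize - $overlap;
  $offsets = [];
  for ($start = 0; $start < $wordCount; $start += $stride) {
    $offsets[] = $start;
    if ($start + $chunkSize >= $wordCount) {
      break; // This window already reaches the end of the text.
    }
  }
  return $offsets;
}
```

For a 1000-word document with the defaults, the chunks cover words 0 to 511, 412 to 923, and 824 to 999, so every boundary sentence appears in two chunks.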

Complete RAG Implementation

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class RAGService {
  protected $vectorDb;
  protected $embeddingService;
  protected $llmService;
  protected $chunkingService;
  protected $logger;

  public function __construct(
    VectorDatabaseInterface $vectorDb,
    EmbeddingService $embeddingService,
    LLMService $llmService,
    DocumentChunkingService $chunkingService,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->vectorDb = $vectorDb;
    $this->embeddingService = $embeddingService;
    $this->llmService = $llmService;
    $this->chunkingService = $chunkingService;
    $this->logger = $loggerFactory->get('rag');
  }

  /**
   * Ingest document into RAG system
   */
  public function ingestDocument(
    string $documentId,
    string $content,
    array $metadata = []
  ): void {
    try {
      // Chunk document
      $chunks = $this->chunkingService->chunkBySemantic($content);

      // Create embeddings and store
      $vectorData = [];
      foreach ($chunks as $i => $chunkText) {
        $embedding = $this->embeddingService->embed($chunkText);

        $vectorData[] = [
          'id' => "{$documentId}_chunk_{$i}",
          'embedding' => $embedding,
          // Caller metadata comes first so it cannot overwrite the system keys.
          'metadata' => array_merge($metadata, [
            'document_id' => $documentId,
            'chunk_index' => $i,
            'chunk_count' => count($chunks),
            'word_count' => str_word_count($chunkText),
          ]),
          'content' => $chunkText,
        ];
      }

      // Batch insert into vector database
      $this->vectorDb->batchInsert($vectorData);

      $this->logger->info(
        'Ingested document @doc with @chunks chunks',
        ['@doc' => $documentId, '@chunks' => count($chunks)]
      );
    } catch (\Exception $e) {
      $this->logger->error('Ingestion failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
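   * Retrieve context for a query.
   *
   * Returns at most $topK chunks, and fewer when several hits belong to the
   * same document, because only the best chunk per document is kept.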
   */
  public function retrieveContext(
    string $query,
    int $topK = 5,
    array $filters = []
  ): array {
    try {
      // Embed query
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $topK,
        filters: $filters
      );

      // Rank and deduplicate by document
      return $this->rankAndDeduplicate($results);
    } catch (\Exception $e) {
      $this->logger->error('Context retrieval failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate response using RAG
   */
  public function generateResponse(
    string $query,
    array $retrievalFilters = [],
    array $llmOptions = []
  ): string {
    try {
      // Retrieve context
      $contextChunks = $this->retrieveContext($query, topK: 5, filters: $retrievalFilters);

      // Build prompt with context
      $contextText = $this->buildContextString($contextChunks);
      $systemPrompt = $this->buildSystemPrompt($contextText);

      // Generate response
      $response = $this->llmService->generate(
        messages: [
          ['role' => 'system', 'content' => $systemPrompt],
          ['role' => 'user', 'content' => $query],
        ],
        options: $llmOptions
      );

      $this->logger->info('Generated RAG response for query');

      return $response;
    } catch (\Exception $e) {
      $this->logger->error('Response generation failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build context string from retrieved chunks
   */
  protected function buildContextString(array $chunks): string {
    $context = "# Retrieved Context\n\n";

    foreach ($chunks as $chunk) {
      $context .= "## Source: " . $chunk['metadata']['document_id'] . "\n";
      $context .= "**Relevance Score:** " . round($chunk['score'] * 100) . "%\n\n";
      $context .= $chunk['content'] . "\n\n";
      $context .= "---\n\n";
    }

    return $context;
  }

  /**
   * Build system prompt with instructions
   */
  protected function buildSystemPrompt(string $context): string {
    return <<<PROMPT
You are a helpful assistant with access to the following context information.
Use this context to answer questions accurately and cite your sources.

$context

Instructions:
1. Answer based primarily on the provided context
2. Cite which source document you're referencing
3. If context is insufficient, clearly state the limitation
4. Do not make up information
PROMPT;
  }

  /**
   * Rank and deduplicate results by document
   */
  protected function rankAndDeduplicate(array $results): array {
    $deduped = [];

    foreach ($results as $result) {
      $docId = $result['metadata']['document_id'] ?? 'unknown';

      if (!isset($deduped[$docId])) {
        $deduped[$docId] = $result;
      } else {
        // Keep highest scoring chunk per document
        if ($result['score'] > $deduped[$docId]['score']) {
          $deduped[$docId] = $result;
        }
      }
    }

    // Sort by score
    usort($deduped, fn($a, $b) => $b['score'] <=> $a['score']);

    return $deduped;
  }

  /**
   * Remove document from RAG
   */
  public function removeDocument(string $documentId): void {
    try {
      $this->vectorDb->deleteByMetadata(['document_id' => $documentId]);
      $this->logger->info('Removed document @doc from RAG', ['@doc' => $documentId]);
    } catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }
}
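
The search() call in retrieveContext() is backend-specific, and its real behaviour depends on the vector database in use. As a rough mental model only, the sketch below shows a brute-force cosine-similarity search over an in-memory array. The function names cosineSimilarity and searchInMemory are hypothetical, and the record shape is assumed to mirror the arrays that ingestDocument() builds. Real backends use approximate indexes rather than a full scan.

```php
<?php

/** Cosine similarity between two equal-length numeric vectors. */
function cosineSimilarity(array $a, array $b): float {
  $a = array_values($a);
  $b = array_values($b);
  if ($a === [] || count($a) !== count($b)) {
    throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
  }

  $dot = 0.0;
  $normA = 0.0;
  $normB = 0.0;
  foreach ($a as $i => $value) {
    $dot += $value * $b[$i];
    $normA += $value * $value;
    $normB += $b[$i] * $b[$i];
  }

  // A zero vector has no direction; treat it as unrelated.
  if ($normA == 0.0 || $normB == 0.0) {
    return 0.0;
  }

  return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Hypothetical brute-force search: scores every record and returns the
 * $limit best matches, highest score first.
 */
function searchInMemory(array $records, array $queryEmbedding, int $limit): array {
  $results = [];
  foreach ($records as $record) {
    $results[] = [
      'id' => $record['id'],
      'score' => cosineSimilarity($queryEmbedding, $record['embedding']),
      'metadata' => $record['metadata'] ?? [],
      'content' => $record['content'] ?? '',
    ];
  }

  usort($results, fn($x, $y) => $y['score'] <=> $x['score']);

  return array_slice($results, 0, $limit);
}

$records = [
  ['id' => 'a', 'embedding' => [1.0, 0.0], 'metadata' => ['document_id' => 'doc1'], 'content' => 'A'],
  ['id' => 'b', 'embedding' => [0.0, 1.0], 'metadata' => ['document_id' => 'doc2'], 'content' => 'B'],
  ['id' => 'c', 'embedding' => [1.0, 1.0], 'metadata' => ['document_id' => 'doc1'], 'content' => 'C'],
];

$hits = searchInMemory($records, [1.0, 0.0], 2);
echo implode(', ', array_column($hits, 'id')), "\n";
```

Results in this shape (score, metadata, content) are what rankAndDeduplicate() and buildContextString() expect to receive.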

Configuration and Setup

Module Structure

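my_ai_module/
├── src/
│   ├── Services/
│   │   ├── ChromaDbService.php
│   │   ├── PineconeService.php
│   │   ├── MilvusService.php
│   │   ├── WeaviateService.php
│   │   ├── EmbeddingService.php
│   │   ├── SemanticSearchService.php
│   │   └── RAGService.php
│   └── Form/
│       └── VectorDatabaseSettingsForm.php
├── config/
│   └── schema/
│       └── my_ai_module.schema.yml
├── composer.json
├── my_ai_module.info.yml
├── my_ai_module.module
└── my_ai_module.services.yml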

Drupal Configuration Schema

# my_ai_module.schema.yml

my_ai_module.vector_database:
  type: config_object
  label: 'Vector Database Configuration'
  mapping:
    provider:
      type: string
      label: 'Vector Database Provider'
      description: 'chromadb, pinecone, milvus, or weaviate'
      constraints:
        Choice:
          choices: [chromadb, pinecone, milvus, weaviate]

    embedding_model:
      type: string
      label: 'Embedding Model'
      description: 'Model for generating embeddings'
      # Schema files do not carry defaults. Set 'text-embedding-3-small' in
      # config/install/my_ai_module.vector_database.yml instead.

my_ai_module.embeddings:
  type: config_object
  label: 'Embedding Service Configuration'
  mapping:
    provider:
      type: string
      label: 'Embedding Provider'
      description: 'openai, cohere, huggingface'

    api_key_key:
      type: string
      label: 'API Key Reference'
      description: 'Key module key ID for storing API key'

    model:
      type: string
      label: 'Model Name'

    dimension:
      type: integer
      label: 'Embedding Dimension'

my_ai_module.rag:
  type: config_object
  label: 'RAG Configuration'
  mapping:
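    # Schema files do not carry defaults. Ship the default values noted
    # below in config/install/my_ai_module.rag.yml.
    enabled:
      type: boolean
      label: 'Enable RAG'
      # Default: true

    chunk_size:
      type: integer
      label: 'Chunk Size (words)'
      # Default: 512

    chunk_overlap:
      type: integer
      label: 'Chunk Overlap (words, must be smaller than chunk size)'
      # Default: 100

    retrieval_limit:
      type: integer
      label: 'Context Chunks to Retrieve'
      # Default: 5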
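
The schema only describes types, so relationships between values (for example, overlap smaller than chunk size) have to be checked elsewhere, such as in the settings form's validation step. The following is a minimal sketch of such a check. The function name validateRagSettings is hypothetical, and the fallback values are the defaults listed above.

```php
<?php

/**
 * Hypothetical validator for the my_ai_module.rag settings.
 * Returns a list of human-readable error messages (empty when valid).
 */
function validateRagSettings(array $settings): array {
  $chunkSize = $settings['chunk_size'] ?? 512;
  $overlap = $settings['chunk_overlap'] ?? 100;
  $limit = $settings['retrieval_limit'] ?? 5;
  $errors = [];

  $chunkSizeValid = is_int($chunkSize) && $chunkSize >= 1;
  if (!$chunkSizeValid) {
    $errors[] = 'chunk_size must be a positive integer.';
  }

  if (!is_int($overlap) || $overlap < 0) {
    $errors[] = 'chunk_overlap must be a non-negative integer.';
  }
  elseif ($chunkSizeValid && $overlap >= $chunkSize) {
    // An overlap equal to the chunk size would never advance the window.
    $errors[] = 'chunk_overlap must be smaller than chunk_size.';
  }

  if (!is_int($limit) || $limit < 1) {
    $errors[] = 'retrieval_limit must be a positive integer.';
  }

  return $errors;
}

print_r(validateRagSettings(['chunk_size' => 100, 'chunk_overlap' => 100]));
```

In a Drupal form, each returned message could be passed to the form state's error handling; that wiring is left out here to keep the sketch self-contained.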

Composer Dependencies

{
  "require": {
    "php": ">=8.1",
    "drupal/core": "^10.0",
    "guzzlehttp/guzzle": "^7.0",
    "openai-php/client": "^0.8.0",
    "cohere-ai/cohere-php": "^1.0",
    "milvus/milvus": "^2.0"
  },
  "require-dev": {
    "phpunit/phpunit": "^10.0",
    "drupal/core-dev": "^10.0"
  }
}

Services Registration

# my_ai_module.services.yml

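# The %...% placeholders used below must be defined as container parameters.
parameters:
  my_ai_module.chromadb.url: 'http://localhost:8000'
  my_ai_module.weaviate.url: 'http://localhost:8080'

services: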
  my_ai_module.chromadb:
    class: Drupal\my_ai_module\Services\ChromaDbService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.chromadb.url%'

  my_ai_module.pinecone:
    class: Drupal\my_ai_module\Services\PineconeService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.milvus:
    class: Drupal\my_ai_module\Services\MilvusService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.weaviate:
    class: Drupal\my_ai_module\Services\WeaviateService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.weaviate.url%'

  my_ai_module.embedding:
    class: Drupal\my_ai_module\Services\EmbeddingService
    arguments:
      - '@http_client'
      - '@config.factory'
      - '@logger.factory'

  my_ai_module.semantic_search:
    class: Drupal\my_ai_module\Services\SemanticSearchService
    arguments:
      - '@my_ai_module.embedding'
      - '@my_ai_module.vector_db'
      - '@entity_type.manager'
      - '@logger.factory'

  my_ai_module.rag:
    class: Drupal\my_ai_module\Services\RAGService
    arguments:
      - '@my_ai_module.vector_db'
      - '@my_ai_module.embedding'
      - '@my_ai_module.llm'
      - '@my_ai_module.chunking'
      - '@logger.factory'

  # The services below are referenced above and must also be defined.
  # Point the alias at whichever backend implements VectorDatabaseInterface
  # for this site.
  my_ai_module.vector_db:
    alias: my_ai_module.chromadb

  my_ai_module.chunking:
    class: Drupal\my_ai_module\Services\DocumentChunkingService

  # Add arguments to match the LLMService constructor.
  my_ai_module.llm:
    class: Drupal\my_ai_module\Services\LLMService

Environment Variables

# .env.local

# Vector Database Configuration
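# One of: pinecone, chromadb, milvus, weaviate
# (kept on its own line because some .env parsers treat inline comments as part of the value)
VECTOR_DB_PROVIDER=pinecone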
PINECONE_API_KEY=your_api_key_here
PINECONE_ENVIRONMENT=gcp-starter
PINECONE_INDEX=drupal-content

# Embedding Configuration
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
EMBEDDING_MODEL=text-embedding-3-small

# Vector Database Endpoints
CHROMADB_URL=http://localhost:8000
MILVUS_HOST=localhost
MILVUS_PORT=19530
WEAVIATE_URL=http://localhost:8080
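
As an illustration of how these variables might be consumed, the sketch below resolves the connection settings for the selected provider. The function names envOrDefault and resolveVectorDbEndpoint are hypothetical, the fallback values are the ones listed above, and the code assumes the variables are present in the process environment (whether a .env file is loaded into it depends on your hosting setup).

```php
<?php

/** Returns the environment variable's value, or $default when unset or empty. */
function envOrDefault(string $name, string $default): string {
  $value = getenv($name);
  return ($value === false || $value === '') ? $default : $value;
}

/**
 * Hypothetical resolver: builds the connection settings for the provider
 * named in VECTOR_DB_PROVIDER.
 */
function resolveVectorDbEndpoint(): array {
  $provider = strtolower(envOrDefault('VECTOR_DB_PROVIDER', 'chromadb'));

  return match ($provider) {
    'chromadb' => [
      'provider' => 'chromadb',
      'url' => envOrDefault('CHROMADB_URL', 'http://localhost:8000'),
    ],
    'weaviate' => [
      'provider' => 'weaviate',
      'url' => envOrDefault('WEAVIATE_URL', 'http://localhost:8080'),
    ],
    'milvus' => [
      'provider' => 'milvus',
      'host' => envOrDefault('MILVUS_HOST', 'localhost'),
      'port' => (int) envOrDefault('MILVUS_PORT', '19530'),
    ],
    'pinecone' => [
      'provider' => 'pinecone',
      'environment' => envOrDefault('PINECONE_ENVIRONMENT', 'gcp-starter'),
      'index' => envOrDefault('PINECONE_INDEX', 'drupal-content'),
    ],
    default => throw new InvalidArgumentException("Unsupported VECTOR_DB_PROVIDER: $provider"),
  };
}

var_export(resolveVectorDbEndpoint());
```

API keys are deliberately left out of the returned array; they are better read at the point of use, as described in the next section.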

API Key Management

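Use the contributed Key module to keep API keys out of code. The example below uses the config key provider for brevity; the file or environment provider keeps the secret out of exported configuration.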

<?php

namespace Drupal\my_ai_module\Services;

class SecureKeyManagement {

  /**
   * Store API key securely
   */
  public static function storeKey(
    string $keyId,
    string $keyValue,
    string $description = ''
  ): void {
    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      throw new \RuntimeException('Key module required');
    }

    $key = \Drupal::entityTypeManager()
      ->getStorage('key')
      ->create([
        'id' => $keyId,
        'label' => "API Key: $keyId",
        'description' => $description,
        'key_type' => 'authentication',
        'key_provider' => 'config',
        'key_input' => 'textarea_field',
      ]);

    $key->setKeyValue($keyValue);
    $key->save();
  }

  /**
   * Retrieve API key securely
   */
  public static function getKey(string $keyId): ?string {
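    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      // Fall back to an environment variable such as OPENAI_API_KEY.
      // getenv() returns false when the variable is unset.
      $value = getenv(strtoupper($keyId) . '_API_KEY');
      return $value !== false ? $value : null;
    }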

    try {
      $key = \Drupal::service('key.repository')->getKey($keyId);
      return $key ? $key->getKeyValue() : null;
    } catch (\Exception $e) {
      \Drupal::logger('my_ai_module')->warning('Key not found: @key', ['@key' => $keyId]);
      return null;
    }
  }
}

Performance Tuning

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Cache\CacheBackendInterface;

class VectorDatabaseOptimization {

  /**
   * Batch indexing for performance
   */
  public static function batchIndex(
    VectorDatabaseInterface $db,
    array $documents,
    int $batchSize = 100
  ): void {
    $batches = array_chunk($documents, $batchSize);

    foreach ($batches as $batch) {
      $db->batchInsert($batch);
      // Small delay to avoid overwhelming the database
      usleep(100000); // 100ms
    }
  }

  /**
   * Index pruning - remove old or unused vectors
   */
  public static function pruneIndex(
    VectorDatabaseInterface $db,
    int $daysOld = 90
  ): int {
    $cutoffTime = strtotime("-$daysOld days");
    $deleted = $db->deleteByMetadata([
      'created' => ['$lt' => $cutoffTime],
    ]);

    return $deleted;
  }

  /**
   * Rebuild index with optimal settings
   */
  public static function rebuildIndex(
    VectorDatabaseInterface $db,
    string $collectionName
  ): void {
    // This is database-specific
    // For Milvus:
    // $db->compactCollection($collectionName);

    // For Pinecone:
    // Force recreation of index with optimal parameters
  }

  /**
   * Caching strategy for frequent queries
   */
  public static function cacheQueryResults(
    CacheBackendInterface $cache,
    string $query,
    array $results,
    int $ttl = 3600
  ): void {
    $cacheId = 'vector_search:' . md5($query);
    $cache->set($cacheId, $results, time() + $ttl);
  }

  /**
   * Get cached results if available
   */
  public static function getCachedResults(
    CacheBackendInterface $cache,
    string $query
  ): ?array {
    $cacheId = 'vector_search:' . md5($query);
    $cached = $cache->get($cacheId);
    return $cached ? $cached->data : null;
  }
}
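
The cache ID above is derived from the query text alone, so two searches with the same text but different filters or result limits would share a cache entry. One possible refinement, sketched below under that assumption, folds the limit and filters into the key and normalizes whitespace and case first. The function name buildVectorSearchCacheId is hypothetical.

```php
<?php

/**
 * Hypothetical cache-ID builder: same query text, limit, and filters
 * always map to the same ID, regardless of filter order or spacing.
 */
function buildVectorSearchCacheId(string $query, int $topK = 5, array $filters = []): string {
  // Collapse runs of whitespace and lowercase (ASCII only) the query.
  $normalized = strtolower(trim((string) preg_replace('/\s+/', ' ', $query)));

  // Sort top-level filter keys so ordering does not change the ID.
  ksort($filters);

  $payload = json_encode(
    ['q' => $normalized, 'k' => $topK, 'f' => $filters],
    JSON_THROW_ON_ERROR
  );

  return 'vector_search:' . hash('sha256', $payload);
}

echo buildVectorSearchCacheId('How do I configure  caching?', 5, ['type' => 'article']), "\n";
```

Normalizing the query raises the hit rate for trivially different inputs, at the cost of treating queries that differ only in letter case as identical.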

Comparison Matrix

Feature	ChromaDB	Pinecone	Milvus	Weaviate
Deployment	Local/Cloud	SaaS only	Self-hosted/Cloud	Self-hosted/Cloud
Scaling	Limited	Excellent	Excellent	Good
Cost	Free	$$$	$	Free/$
Setup Complexity	Low	Very Low	Medium	Medium
GraphQL Support	No	No	No	Yes
Hybrid Search	No	Yes (sparse-dense)	Yes (2.4+)	Yes
GPU Acceleration	No	No	Yes	No
Metadata Filtering	Basic	Advanced	Advanced	Good
Community Size	Growing	Large	Large	Growing
Learning Curve	Easy	Easy	Medium	Medium

Troubleshooting

Common Issues

Vector Dimension Mismatch:

Ensure embedding model dimension matches index configuration
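Example: OpenAI text-embedding-3-small returns 1536 dimensions by default (text-embedding-3-large returns 3072), while Sentence Transformers models such as all-MiniLM-L6-v2 return 384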
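
A mismatch is easiest to diagnose when it is caught before the vector reaches the database. The following minimal sketch shows such a guard. The function name assertEmbeddingDimension is hypothetical, and the expected dimension is assumed to come from the module's embedding configuration.

```php
<?php

/**
 * Hypothetical guard: throws when an embedding's length does not match
 * the dimension the index was created with.
 */
function assertEmbeddingDimension(array $embedding, int $expectedDimension): void {
  $actual = count($embedding);
  if ($actual !== $expectedDimension) {
    throw new LengthException(sprintf(
      'Embedding has %d dimensions but the index expects %d.',
      $actual,
      $expectedDimension
    ));
  }
}

// A 384-dimensional vector sent to a 1536-dimensional index is rejected.
try {
  assertEmbeddingDimension(array_fill(0, 384, 0.0), 1536);
}
catch (LengthException $e) {
  echo $e->getMessage(), "\n";
}
```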

Query Performance Issues:

Use an ANN index suited to the workload (for example HNSW or IVF_FLAT in Milvus; Weaviate uses HNSW by default)
Batch insert operations
Implement caching for frequent queries

Memory Issues:

Use streaming/batching for large document sets
Implement vector pruning for old/unused data
Consider metadata filtering to reduce search scope
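
The batching advice can be sketched with a generator, which keeps only one batch in memory at a time instead of materializing the whole document set. The function name batchIterable is hypothetical; each yielded batch could then be handed to a bulk insert such as the batchInsert() call used earlier in this guide.

```php
<?php

/**
 * Hypothetical helper: lazily groups any iterable into arrays of at most
 * $batchSize items. The size check runs when iteration starts.
 */
function batchIterable(iterable $items, int $batchSize): Generator {
  if ($batchSize < 1) {
    throw new InvalidArgumentException('Batch size must be at least 1.');
  }

  $batch = [];
  foreach ($items as $item) {
    $batch[] = $item;
    if (count($batch) === $batchSize) {
      yield $batch;
      $batch = [];
    }
  }

  // Emit the final, possibly smaller, batch.
  if ($batch !== []) {
    yield $batch;
  }
}

foreach (batchIterable([1, 2, 3, 4, 5], 2) as $batch) {
  echo implode(',', $batch), "\n";
}
```

Because the input is any iterable, the source can itself be a generator that loads entities in pages, so neither side holds the full set.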

Code Examples Summary

This guide includes reference implementations for:

Vector database abstraction services
Semantic search with multi-stage ranking
RAG with multiple chunking strategies
Configuration management and API key handling
Performance optimization and caching
Docker deployment configurations

All code follows Drupal coding standards and is wired through Drupal's service container.