Vector Databases Integration Guide

Overview

Vector databases are specialized data stores optimized for storing, indexing, and searching high-dimensional vector embeddings. In the context of Drupal and AI systems, they enable semantic search, similarity matching, and retrieval-augmented generation (RAG) capabilities by storing embeddings generated from content.

Vector databases differ from traditional relational databases in four ways:

  • Similarity search: Queries rank results by a distance metric (cosine, Euclidean, dot product) instead of testing for exact matches
  • Approximate Nearest Neighbor (ANN) search: Indexes retrieve close matches from millions of vectors without a full scan, trading a small amount of recall for speed
  • Metadata filtering: Vector similarity can be combined with traditional attribute filters
  • Scalability: Storage and indexes are built to handle high-dimensional data efficiently
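To make the distance metrics listed above concrete, here is a minimal plain-PHP sketch of dot product, Euclidean distance, and cosine similarity, plus an exact full-scan nearest-neighbour search. The function names and sample vectors are illustrative assumptions, not part of any vector database API; an ANN index approximates what `nearest()` computes exhaustively.

```php
<?php
declare(strict_types=1);

/** Dot product of two equal-length vectors. */
function dotProduct(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

/** Euclidean (L2) distance: smaller means more similar. */
function euclideanDistance(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

/** Cosine similarity in [-1, 1]: larger means more similar. */
function cosineSimilarity(array $a, array $b): float
{
    $norm = sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b));
    if ($norm === 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector.');
    }
    return dotProduct($a, $b) / $norm;
}

/** Exact top-k search by scanning every stored vector (hypothetical helper). */
function nearest(array $query, array $vectors, int $k): array
{
    $scores = [];
    foreach ($vectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($query, $vector);
    }
    arsort($scores); // highest similarity first
    return array_slice($scores, 0, $k, true);
}

// Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
$vectors = [
    'cats' => [0.9, 0.1, 0.0],
    'dogs' => [0.8, 0.2, 0.1],
    'tax'  => [0.0, 0.1, 0.9],
];
$top = nearest([1.0, 0.0, 0.0], $vectors, 2);
print_r(array_keys($top));
```

The full scan costs one similarity computation per stored vector, which is why large collections rely on ANN indexes instead.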

Supported Vector Database Systems

1. ChromaDB

Overview: Open-source vector database designed for AI applications, particularly popular for local development and small-to-medium deployments.

Key Characteristics:

  • Lightweight and easy to set up
  • Supports both in-memory and persistent storage
  • Built-in support for multiple embedding providers
  • Native Python and JavaScript SDKs
  • Automatic deduplication of embeddings

Advantages:

  • Zero external dependencies for basic setup
  • Great for prototyping and development
  • Built-in embedding generation
  • Simple REST API for integration

Limitations:

  • Limited horizontal scaling capabilities
  • Not optimized for massive-scale deployments (100M+ vectors)
  • Smaller community compared to alternatives

Integration with Drupal:

<?php

// Using the ChromaDB HTTP API through Guzzle.

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class ChromaDbService {

  protected $httpClient;
  protected $logger;
  protected $chromaUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $chromaUrl = 'http://localhost:8000'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('chromadb');
    $this->chromaUrl = $chromaUrl;
  }

  /**
   * Create or get a collection.
   *
   * The returned array includes the collection "id". The v1 REST routes used
   * by the methods below address a collection by that id, not by its name.
   */
  public function getOrCreateCollection(string $collectionName): array {
    try {
      $response = $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections",
        [
          'json' => [
            'name' => $collectionName,
            'metadata' => ['hnsw:space' => 'cosine'],
            // Without this flag the request fails when the collection exists.
            'get_or_create' => true,
          ],
        ]
      );
      return json_decode((string) $response->getBody(), true);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add documents with embeddings.
   */
  public function addDocuments(
    string $collectionId,
    array $documents,
    array $embeddings,
    array $metadatas = [],
    array $ids = []
  ): void {
    if (empty($ids)) {
      // Generate one ID per document; array_keys() is safe for an empty list.
      $ids = array_map(fn($i) => "doc_$i", array_keys($documents));
    }
    try {
      $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionId/add",
        [
          'json' => [
            'ids' => $ids,
            'embeddings' => $embeddings,
            'documents' => $documents,
            // Send null rather than an empty list when no metadata is given.
            'metadatas' => $metadatas ?: null,
          ],
        ]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to add documents: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query similar documents.
   */
  public function query(
    string $collectionId,
    array $queryEmbeddings,
    int $nResults = 10,
    ?array $whereFilter = null
  ): array {
    try {
      $params = [
        'query_embeddings' => $queryEmbeddings,
        'n_results' => $nResults,
      ];
      if ($whereFilter) {
        $params['where'] = $whereFilter;
      }
      $response = $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionId/query",
        ['json' => $params]
      );
      return json_decode((string) $response->getBody(), true);
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete documents.
   */
  public function deleteDocuments(string $collectionId, array $ids): void {
    try {
      $this->httpClient->post(
        "{$this->chromaUrl}/api/v1/collections/$collectionId/delete",
        ['json' => ['ids' => $ids]]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
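The `query()` method above accepts a `where` filter but does not show its shape. The sketch below builds a Chroma-style filter from simple field conditions, using the `$eq`, `$in`, and `$and` operators. The helper name `buildChromaWhere` and the sample field names are assumptions made for illustration; check the operator set against the Chroma version you run.

```php
<?php
declare(strict_types=1);

/**
 * Builds a Chroma-style "where" filter (hypothetical helper).
 *
 * Scalars become {"$eq": value}, lists become {"$in": [...]}, and
 * several conditions are combined with "$and". Returns null when
 * there is nothing to filter on.
 */
function buildChromaWhere(array $conditions): ?array
{
    $clauses = [];
    foreach ($conditions as $field => $value) {
        $clauses[] = is_array($value)
            ? [$field => ['$in' => array_values($value)]]
            : [$field => ['$eq' => $value]];
    }
    if ($clauses === []) {
        return null;
    }
    // "$and" needs at least two operands, so a single clause is returned as-is.
    return count($clauses) === 1 ? $clauses[0] : ['$and' => $clauses];
}

$where = buildChromaWhere([
    'node_type' => ['article', 'page'],
    'language' => 'en',
]);
echo json_encode($where), "\n";
```

The resulting array can be passed as the `$whereFilter` argument; Guzzle's `json` option serialises it for the request body.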

Docker Setup:

version: '3.8'

services:
  chromadb:
    image: ghcr.io/chroma-core/chroma:latest
    ports:
      - "8000:8000"
    environment:
      # Legacy storage setting from Chroma releases before 0.4; newer images
      # may reject it, so pin the image tag or drop this line.
      - CHROMA_DB_IMPL=duckdb+parquet
      - PERSIST_DIRECTORY=/chroma/data
    volumes:
      - chromadb_data:/chroma/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  chromadb_data:

2. Pinecone

Overview: Fully managed vector database service with enterprise-grade infrastructure, multi-region deployment, and high availability.

Key Characteristics:

  • Fully managed SaaS platform
  • Automatic scaling and high availability
  • Multi-region deployment options
  • Advanced filtering and metadata support
  • Integrated API key management and security

Advantages:

  • Zero infrastructure management
  • Exceptional search performance at scale
  • Enterprise SLAs and support
  • Automatic backups and disaster recovery
  • Real-time index updates

Limitations:

  • Significant cost for large-scale usage
  • Vendor lock-in concerns
  • Requires external service dependency
  • Data residency considerations

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

// Assumes a Pinecone PHP client that exposes the interface used below;
// adjust the class name and calls to the SDK you install.
use Pinecone\Client as PineconeClient;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class PineconeService {

  protected $client;
  protected $logger;
  protected $configFactory;
  protected $indexName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('pinecone');
    $this->configFactory = $configFactory;

    $config = $configFactory->get('my_ai_module.pinecone');
    // api_key_key holds the Key module ID defined in the configuration schema.
    $apiKey = $this->getSecureApiKey($config->get('api_key_key') ?? 'pinecone_api_key');
    $environment = $config->get('environment');

    $this->client = PineconeClient::create([
      'api_key' => $apiKey,
      'environment' => $environment,
    ]);
    $this->indexName = $config->get('index_name');
  }

  /**
   * Securely retrieve API key from environment or Drupal key management.
   */
  protected function getSecureApiKey(string $keyName): string {
    // First try environment variable.
    if ($apiKey = getenv('PINECONE_API_KEY')) {
      return $apiKey;
    }

    // Fall back to Drupal key management module if available.
    if (\Drupal::moduleHandler()->moduleExists('key')) {
      $key = \Drupal::service('key.repository')->getKey($keyName);
      if ($key) {
        return $key->getKeyValue();
      }
    }

    throw new \RuntimeException('Pinecone API key not configured');
  }

  /**
   * Upsert vectors (insert or update).
   */
  public function upsertVectors(array $vectors, string $namespace = 'default'): void {
    try {
      $index = $this->client->index($this->indexName);
      $upsertData = array_map(function ($vector) {
        return [
          'id' => $vector['id'],
          'values' => $vector['values'],
          'metadata' => $vector['metadata'] ?? [],
        ];
      }, $vectors);

      $index->upsert(vectors: $upsertData, namespace: $namespace);
      $this->logger->info('Upserted @count vectors to @namespace', [
        '@count' => count($vectors),
        '@namespace' => $namespace,
      ]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to upsert vectors: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query vectors with metadata filtering.
   */
  public function query(
    array $vector,
    int $topK = 10,
    ?array $filter = null,
    string $namespace = 'default'
  ): array {
    try {
      $index = $this->client->index($this->indexName);
      $queryParams = [
        'vector' => $vector,
        'topK' => $topK,
        'namespace' => $namespace,
        'includeMetadata' => true,
      ];
      if ($filter) {
        $queryParams['filter'] = $filter;
      }
      $results = $index->query($queryParams);
      return [
        'matches' => $results['matches'] ?? [],
        'namespace' => $namespace,
      ];
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete vectors by ID.
   */
  public function deleteVectors(array $ids, string $namespace = 'default'): void {
    try {
      $index = $this->client->index($this->indexName);
      $index->delete(ids: $ids, namespace: $namespace);
      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Get index statistics.
   */
  public function getIndexStats(): array {
    try {
      $index = $this->client->index($this->indexName);
      return $index->describeIndexStats();
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to get index stats: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
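`upsertVectors()` sends every vector in a single request, which can become large when indexing a whole site. A minimal sketch of splitting the work into batches follows; the helper name `batchVectors` and the default batch size of 100 are assumptions for illustration, so check the current request limits of your Pinecone plan before relying on a particular number.

```php
<?php
declare(strict_types=1);

/**
 * Splits a list of vectors into fixed-size batches (hypothetical helper).
 *
 * @return array[] A list of batches, each a list of vectors.
 */
function batchVectors(array $vectors, int $batchSize = 100): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }
    return array_chunk($vectors, $batchSize);
}

// Build 250 placeholder vectors in the shape upsertVectors() expects.
$vectors = [];
for ($i = 0; $i < 250; $i++) {
    $vectors[] = [
        'id' => "node_$i",
        'values' => [0.1, 0.2, 0.3],
        'metadata' => ['nid' => $i],
    ];
}

$batches = batchVectors($vectors);
foreach ($batches as $n => $batch) {
    // In a module, this is where the service's upsertVectors($batch) would run.
    printf("batch %d: %d vectors\n", $n, count($batch));
}
```

Smaller batches also make retries cheaper: a failed request only needs its own batch to be resent.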

Configuration Schema (my_ai_module.schema.yml):

my_ai_module.pinecone:
  type: config_object
  label: 'Pinecone Configuration'
  mapping:
    api_key_key:
      type: string
      label: 'API Key ID (from Key module)'
      description: 'Reference to stored Pinecone API key'
    environment:
      type: string
      label: 'Pinecone Environment'
      description: 'e.g., gcp-starter, us-east1-aws'
    index_name:
      type: string
      label: 'Index Name'
      description: 'Name of the Pinecone index'
    dimension:
      type: integer
      label: 'Vector Dimension'
      description: 'Dimension of embeddings (e.g., 1536 for OpenAI)'

3. Milvus

Overview: Open-source vector database designed for high-performance similarity search on massive-scale datasets, supporting both cloud and on-premise deployments.

Key Characteristics:

  • High performance with HNSW and IVF indexing algorithms
  • Distributed architecture for horizontal scaling
  • Supports GPU acceleration
  • Multiple distance metrics (L2, IP, cosine, Hamming)
  • Time-based filtering and partitioning

Advantages:

  • Open-source with no vendor lock-in
  • Excellent performance at scale (100M+ vectors)
  • Flexible deployment options (Milvus Lite, standalone, distributed)
  • GPU acceleration for large-scale operations
  • Rich filtering capabilities with scalar data

Limitations:

  • More complex operational overhead than managed services
  • Steeper learning curve for configuration
  • Requires infrastructure management
  • Community support model

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

// Assumes a Milvus PHP client that exposes the interface used below;
// adjust the class name and calls to the SDK you install.
use Milvus\MilvusClient;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class MilvusService {

  protected $client;
  protected $logger;
  protected $configFactory;
  protected $collectionName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('milvus');
    $this->configFactory = $configFactory;

    $config = $configFactory->get('my_ai_module.milvus');
    $host = $config->get('host') ?? 'localhost';
    $port = $config->get('port') ?? 19530;

    $this->client = new MilvusClient([
      'host' => $host,
      'port' => $port,
    ]);
    $this->collectionName = $config->get('collection_name');
  }

  /**
   * Create collection with schema.
   */
  public function createCollection(
    string $collectionName,
    array $fields,
    string $description = ''
  ): void {
    try {
      $this->client->createCollection([
        'collection_name' => $collectionName,
        'fields' => $fields,
        'description' => $description,
      ]);
      $this->logger->info('Collection @name created', ['@name' => $collectionName]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Insert vectors with auto-ID generation.
   */
  public function insertVectors(
    string $collectionName,
    array $vectors,
    array $metadatas = []
  ): array {
    try {
      $insertData = [];
      foreach ($vectors as $i => $vector) {
        $insertData[] = [
          'embedding' => $vector,
          'metadata' => json_encode($metadatas[$i] ?? []),
          // Milvus timestamp in milliseconds.
          'created_at' => time() * 1000,
        ];
      }
      $response = $this->client->insert($collectionName, $insertData);
      $this->logger->info('Inserted @count vectors', ['@count' => count($vectors)]);
      return $response;
    }
    catch (\Exception $e) {
      $this->logger->error('Insert failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Search with vector similarity and metadata filtering.
   */
  public function search(
    string $collectionName,
    array $queryVector,
    int $limit = 10,
    ?array $metadataFilter = null
  ): array {
    try {
      $searchParams = [
        'collection_name' => $collectionName,
        'vectors' => [$queryVector],
        'limit' => $limit,
        // Or L2, IP. Must match the metric the index was built with.
        'metric_type' => 'COSINE',
        'vector_field_name' => 'embedding',
      ];
      if ($metadataFilter) {
        $searchParams['expr'] = $this->buildFilterExpression($metadataFilter);
      }
      $response = $this->client->search($searchParams);
      return $response['results'][0] ?? [];
    }
    catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build Milvus filter expression from conditions.
   */
  protected function buildFilterExpression(array $filter): string {
    // Leave numbers bare; quote strings and escape quotes and backslashes so
    // a value cannot break out of its literal.
    $literal = fn($v) => (is_int($v) || is_float($v))
      ? (string) $v
      : "'" . addcslashes((string) $v, "'\\") . "'";

    $expressions = [];
    foreach ($filter as $field => $value) {
      if (is_array($value)) {
        // Handle IN operators.
        $values = implode(',', array_map($literal, $value));
        $expressions[] = "$field in [$values]";
      }
      else {
        // Handle equality.
        $expressions[] = "$field == " . $literal($value);
      }
    }
    return implode(' && ', $expressions);
  }

  /**
   * Delete vectors by IDs.
   */
  public function deleteVectors(string $collectionName, array $ids): void {
    try {
      // Auto-generated IDs are integers; the cast keeps the expression safe.
      $this->client->delete($collectionName, [
        'expr' => 'id in [' . implode(',', array_map('intval', $ids)) . ']',
      ]);
      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Create index for performance.
   */
  public function createIndex(
    string $collectionName,
    string $fieldName = 'embedding',
    string $indexType = 'HNSW'
  ): void {
    try {
      $this->client->createIndex([
        'collection_name' => $collectionName,
        'field_name' => $fieldName,
        'index_name' => "{$collectionName}_{$fieldName}_index",
        // HNSW, IVF_FLAT, IVF_SQ8.
        'index_type' => $indexType,
        // Same metric as search().
        'metric_type' => 'COSINE',
        // M and efConstruction apply to HNSW only; IVF indexes take nlist.
        'params' => $indexType === 'HNSW'
          ? ['M' => 8, 'efConstruction' => 200]
          : ['nlist' => 1024],
      ]);
      $this->logger->info('Index created on @field', ['@field' => $fieldName]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create index: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}

Docker Compose Setup:

version: '3.8'

services:
  minio:
    image: minio/minio:latest
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: minio server /minio_data
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5

  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    ports:
      - "2379:2379"
    # --data-dir keeps etcd state on the mounted volume.
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd_data
    volumes:
      - etcd_data:/etcd_data

  milvus:
    image: milvusdb/milvus:latest
    # Run the server in standalone mode.
    command: ["milvus", "run", "standalone"]
    depends_on:
      - minio
      - etcd
    environment:
      COMMON_STORAGETYPE: minio
      MINIO_ADDRESS: minio:9000
      ETCD_ENDPOINTS: etcd:2379
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    healthcheck:
      # The health endpoint is served on the metrics port (9091), not on the
      # gRPC port (19530).
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  minio_data:
  etcd_data:
  milvus_data:

4. Weaviate

Overview: Open-source vector database with built-in NLP modules, combining vector search with semantic understanding and graph capabilities.

Key Characteristics:

  • Built-in multi-language support
  • GraphQL API for complex queries
  • Integrated module system (transformers, Q&A, NER)
  • Hybrid search combining vector and keyword search
  • Rich filtering with reference properties

Advantages:

  • Excellent for semantic understanding out-of-the-box
  • GraphQL support for complex queries
  • Built-in ML modules reduce external dependencies
  • Hybrid search for better relevance
  • Strong documentation and community

Limitations:

  • Higher resource requirements
  • Learning curve for GraphQL
  • Memory-intensive for large-scale deployments
  • Module licensing considerations

Integration with Drupal:

<?php

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class WeaviateService {

  protected $httpClient;
  protected $logger;
  protected $weaviateUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $weaviateUrl = 'http://localhost:8080'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('weaviate');
    $this->weaviateUrl = $weaviateUrl;
  }

  /**
   * Create a class schema.
   *
   * $vectorizer is a module name such as 'text2vec-openai', or 'none' when
   * vectors are supplied by the caller.
   */
  public function createClass(
    string $className,
    array $properties,
    ?string $vectorizer = null
  ): void {
    try {
      $schema = [
        'class' => $className,
        'description' => "Class $className",
        'properties' => $properties,
      ];
      if ($vectorizer) {
        $schema['vectorizer'] = $vectorizer;
      }
      $this->httpClient->post("{$this->weaviateUrl}/v1/schema", ['json' => $schema]);
      $this->logger->info('Class @class created', ['@class' => $className]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create class: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add objects with vectors.
   */
  public function addObjects(string $className, array $objects, array $vectors = []): array {
    try {
      $batchObjects = [];
      foreach ($objects as $i => $obj) {
        $batchObject = [
          'class' => $className,
          'properties' => $obj,
        ];
        if (isset($vectors[$i])) {
          $batchObject['vector'] = $vectors[$i];
        }
        $batchObjects[] = $batchObject;
      }
      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/batch/objects",
        ['json' => ['objects' => $batchObjects]]
      );
      $result = json_decode((string) $response->getBody(), true);
      $this->logger->info('Added @count objects', ['@count' => count($objects)]);
      return $result;
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to add objects: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * GraphQL query with semantic search.
   */
  public function query(
    string $className,
    array $queryVector,
    int $limit = 10,
    array $properties = [],
    ?array $where = null
  ): array {
    try {
      $propertiesStr = implode(' ', $properties);
      $whereClause = $where ? $this->buildWhereClause($where) : '';
      // Heredocs interpolate variables but do not evaluate expressions, so
      // the vector is serialised before it is embedded in the query.
      $vectorStr = implode(',', $queryVector);

      $query = <<<GQL
      {
        Get {
          $className(
            nearVector: { vector: [$vectorStr] distance: 0.8 }
            limit: $limit
            $whereClause
          ) {
            $propertiesStr
            _additional { distance vector }
          }
        }
      }
      GQL;

      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );
      return json_decode((string) $response->getBody(), true);
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Hybrid search combining vector and keyword search.
   */
  public function hybridSearch(
    string $className,
    string $searchText,
    array $queryVector,
    int $limit = 10
  ): array {
    try {
      // json_encode() quotes and escapes the user-supplied search text.
      $searchJson = json_encode($searchText);
      $vectorStr = implode(',', $queryVector);

      $query = <<<GQL
      {
        Get {
          $className(
            hybrid: { query: $searchJson vector: [$vectorStr] alpha: 0.5 }
            limit: $limit
          ) {
            text
            _additional { score distance }
          }
        }
      }
      GQL;

      $response = $this->httpClient->post(
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );
      return json_decode((string) $response->getBody(), true);
    }
    catch (\Exception $e) {
      $this->logger->error('Hybrid search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete objects by filter.
   */
  public function deleteObjects(string $className, array $where): void {
    if (empty($where)) {
      throw new \InvalidArgumentException('Refusing to delete without a filter.');
    }
    try {
      // The REST batch-delete endpoint takes the filter as JSON, not GraphQL.
      $operands = [];
      foreach ($where as $field => $value) {
        $operands[] = ['path' => [$field], 'operator' => 'Equal', 'valueString' => (string) $value];
      }
      $filter = count($operands) === 1
        ? $operands[0]
        : ['operator' => 'And', 'operands' => $operands];

      $this->httpClient->delete(
        "{$this->weaviateUrl}/v1/batch/objects",
        ['json' => ['match' => ['class' => $className, 'where' => $filter]]]
      );
      $this->logger->info('Objects deleted from @class', ['@class' => $className]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Helper to build WHERE clause for filtering.
   */
  protected function buildWhereClause(array $conditions): string {
    if (empty($conditions)) {
      return '';
    }
    $operands = [];
    foreach ($conditions as $field => $value) {
      $operands[] = sprintf(
        '{ path: [%s] operator: Equal valueString: %s }',
        json_encode((string) $field),
        json_encode((string) $value)
      );
    }
    // A single condition is the where object itself; several conditions must
    // be wrapped in an And operator.
    if (count($operands) === 1) {
      return 'where: ' . $operands[0];
    }
    return 'where: { operator: And operands: [' . implode(' ', $operands) . '] }';
  }

  /**
   * Get vectorizer status.
   */
  public function getVectorizerStatus(): array {
    try {
      $response = $this->httpClient->get("{$this->weaviateUrl}/v1/modules");
      return json_decode((string) $response->getBody(), true);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to get status: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}

Docker Setup:

version: '3.8'

services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 100
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      ENABLE_MODULES: 'text2vec-openai,generative-openai'
      OPENAI_APIKEY: ${OPENAI_API_KEY}
      OPENAI_INFERENCE_API: openai
    volumes:
      - weaviate_data:/var/lib/weaviate
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v1/.well-known/ready"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  weaviate_data:
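The hybrid search in the Weaviate service passes `alpha: 0.5`, where an alpha of 1 weights only the vector result and an alpha of 0 weights only the keyword result. The sketch below illustrates that weighting with a simple min-max normalisation followed by a weighted sum. It is an illustration of the idea under stated assumptions, not Weaviate's internal fusion algorithm, and the function names and sample scores are made up.

```php
<?php
declare(strict_types=1);

/** Min-max normalises scores to [0, 1] so both result sets are comparable. */
function normalizeScores(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $range = max($scores) - $min;
    // array_map() with a single array preserves the document IDs as keys.
    return array_map(fn($s) => $range > 0 ? ($s - $min) / $range : 1.0, $scores);
}

/**
 * Weighted fusion of vector and keyword scores (hypothetical helper).
 * alpha = 1.0 uses vector scores only; alpha = 0.0 uses keyword scores only.
 */
function fuseScores(array $vectorScores, array $keywordScores, float $alpha = 0.5): array
{
    $v = normalizeScores($vectorScores);
    $k = normalizeScores($keywordScores);
    $fused = [];
    foreach (array_keys($v + $k) as $id) {
        // A document missing from one result set contributes 0 for that side.
        $fused[$id] = $alpha * ($v[$id] ?? 0.0) + (1 - $alpha) * ($k[$id] ?? 0.0);
    }
    arsort($fused); // best combined score first
    return $fused;
}

$vectorScores  = ['a' => 0.9, 'b' => 0.5, 'c' => 0.1];    // e.g. cosine similarity
$keywordScores = ['b' => 12.0, 'c' => 3.0, 'd' => 6.0];   // e.g. BM25
print_r(fuseScores($vectorScores, $keywordScores, 0.5));
```

With these sample scores, document `b` ranks first at alpha 0.5 because it scores reasonably on both sides, while `a` wins when alpha is 1.0.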

Semantic Search Implementation

Understanding Embeddings

Embeddings are dense vector representations of text that capture semantic meaning. Each word, sentence, or document is converted to a vector of numbers whose length is fixed by the model (384 to 3072 dimensions for the models listed below). Texts with similar meaning map to vectors that lie close together.

Embedding Models:

  • OpenAI: text-embedding-3-small (1536 dims), text-embedding-3-large (3072 dims)
  • Sentence Transformers: all-MiniLM-L6-v2 (384 dims), all-mpnet-base-v2 (768 dims)
  • Cohere: embed-english-v3.0 (1024 dims)
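Because each model has a fixed output dimension, a collection created for one model cannot accept vectors from another. The sketch below shows two small guards that are often useful before storing an embedding: a dimension check and L2 normalisation, after which the dot product of two vectors equals their cosine similarity. The function names are assumptions for illustration.

```php
<?php
declare(strict_types=1);

/** Throws when an embedding does not match the index dimension. */
function assertDimension(array $embedding, int $expected): void
{
    $actual = count($embedding);
    if ($actual !== $expected) {
        throw new InvalidArgumentException(
            sprintf('Expected %d dimensions, got %d.', $expected, $actual)
        );
    }
}

/** Scales a vector to unit length (L2 norm of 1). */
function l2Normalize(array $embedding): array
{
    $norm = 0.0;
    foreach ($embedding as $value) {
        $norm += $value * $value;
    }
    $norm = sqrt($norm);
    if ($norm === 0.0) {
        throw new InvalidArgumentException('Cannot normalise a zero vector.');
    }
    return array_map(fn($value) => $value / $norm, $embedding);
}

// A toy 2-dimensional vector; a real embedding would have e.g. 384 entries.
$unit = l2Normalize([3.0, 4.0]);
assertDimension($unit, 2);
print_r($unit);
```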

Complete Semantic Search Implementation

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Entity\EntityTypeManagerInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\node\NodeInterface;

class SemanticSearchService {

  protected $embeddingService;
  protected $vectorDb;
  protected $entityTypeManager;
  protected $logger;

  public function __construct(
    EmbeddingService $embeddingService,
    VectorDatabaseInterface $vectorDb,
    EntityTypeManagerInterface $entityTypeManager,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->embeddingService = $embeddingService;
    $this->vectorDb = $vectorDb;
    $this->entityTypeManager = $entityTypeManager;
    $this->logger = $loggerFactory->get('semantic_search');
  }

  /**
   * Index a node for semantic search.
   */
  public function indexNode(NodeInterface $node): void {
    try {
      // Extract searchable text.
      $text = $this->extractNodeText($node);

      // Generate embedding.
      $embedding = $this->embeddingService->embed($text);

      // Prepare metadata.
      $metadata = [
        'nid' => $node->id(),
        'node_type' => $node->getType(),
        'title' => $node->getTitle(),
        'author' => $node->getOwner()->getAccountName(),
        'created' => $node->getCreatedTime(),
        'updated' => $node->getChangedTime(),
        'language' => $node->language()->getId(),
      ];

      // Store in vector database.
      $this->vectorDb->upsert(
        id: "node_{$node->id()}",
        embedding: $embedding,
        metadata: $metadata,
        document: $text
      );
      $this->logger->info('Indexed node @nid', ['@nid' => $node->id()]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to index node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Extract text from node for embedding.
   */
  protected function extractNodeText(NodeInterface $node): string {
    $text = $node->getTitle() . "\n\n";

    // Add body field if available, without HTML markup.
    if ($node->hasField('body')) {
      $text .= strip_tags($node->get('body')->value ?? '');
    }

    // Add other configurable text fields. Base fields (nid, uuid, status, ...)
    // and non-text fields would only add noise to the embedding.
    $textTypes = ['string', 'string_long', 'text', 'text_long', 'text_with_summary'];
    foreach ($node->getFieldDefinitions() as $fieldName => $field) {
      if ($fieldName === 'body'
        || $field->getFieldStorageDefinition()->isBaseField()
        || !in_array($field->getType(), $textTypes, true)) {
        continue;
      }
      $value = $node->get($fieldName)->value ?? '';
      if (is_string($value) && $value !== '') {
        $text .= "\n" . strip_tags($value);
      }
    }
    return $text;
  }

  /**
   * Semantic search across indexed content.
   */
  public function search(string $query, int $limit = 10, array $filters = []): array {
    try {
      // Generate embedding for query.
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database.
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $limit,
        filters: $filters
      );

      // Load all matching nodes in one call instead of one query per result.
      $nids = array_filter(array_map(fn($r) => $r['metadata']['nid'] ?? null, $results));
      $nodes = $this->entityTypeManager->getStorage('node')->loadMultiple($nids);

      // Enrich results with full node data.
      $enrichedResults = [];
      foreach ($results as $result) {
        $nid = $result['metadata']['nid'] ?? null;
        $node = $nid ? ($nodes[$nid] ?? null) : null;
        // Skip deleted nodes and nodes the current user may not view.
        if ($node && $node->access('view')) {
          $enrichedResults[] = [
            'score' => $result['score'],
            'node' => $node,
            'similarity' => $result['score'],
            'snippet' => $this->generateSnippet($result['document'], $query),
          ];
        }
      }
      return $enrichedResults;
    }
    catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate snippet with query context.
   */
  protected function generateSnippet(string $text, string $query, int $length = 150): string {
    $words = str_word_count($text, 1);
    $queryWords = array_map('strtolower', str_word_count($query, 1));

    // Find position of first query word match, keeping five words of lead-in.
    $position = 0;
    foreach ($words as $i => $word) {
      if (in_array(strtolower($word), $queryWords, true)) {
        $position = max(0, $i - 5);
        break;
      }
    }
    $snippet = implode(' ', array_slice($words, $position, 30));
    return substr($snippet, 0, $length) . '...';
  }

  /**
   * Remove node from search index.
   */
  public function removeNode(int $nid): void {
    try {
      $this->vectorDb->delete("node_$nid");
      $this->logger->info('Removed node @nid from search', ['@nid' => $nid]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to remove node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
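`SemanticSearchService` depends on a `VectorDatabaseInterface` that this guide does not define. The sketch below is one possible shape for it, inferred from the named arguments used above (`id`, `embedding`, `metadata`, `document`, `limit`, `filters`), together with an in-memory implementation that is handy for unit tests. Both are assumptions for illustration; the namespace is omitted so the file runs on its own, and a production adapter would delegate to one of the provider services instead.

```php
<?php
declare(strict_types=1);

/** Hypothetical contract inferred from how the search service calls it. */
interface VectorDatabaseInterface
{
    public function upsert(string $id, array $embedding, array $metadata = [], string $document = ''): void;
    public function search(array $embedding, int $limit = 10, array $filters = []): array;
    public function delete(string $id): void;
}

/** Exact-search implementation backed by a PHP array, for tests only. */
final class InMemoryVectorDatabase implements VectorDatabaseInterface
{
    private array $items = [];

    public function upsert(string $id, array $embedding, array $metadata = [], string $document = ''): void
    {
        $this->items[$id] = ['embedding' => $embedding, 'metadata' => $metadata, 'document' => $document];
    }

    public function search(array $embedding, int $limit = 10, array $filters = []): array
    {
        $results = [];
        foreach ($this->items as $id => $item) {
            // Metadata filter: every filter key must match exactly.
            foreach ($filters as $key => $expected) {
                if (($item['metadata'][$key] ?? null) !== $expected) {
                    continue 2;
                }
            }
            $results[] = [
                'id' => $id,
                'score' => self::cosine($embedding, $item['embedding']),
                'metadata' => $item['metadata'],
                'document' => $item['document'],
            ];
        }
        // Highest similarity first.
        usort($results, fn(array $a, array $b) => $b['score'] <=> $a['score']);
        return array_slice($results, 0, $limit);
    }

    public function delete(string $id): void
    {
        unset($this->items[$id]);
    }

    private static function cosine(array $a, array $b): float
    {
        $dot = $normA = $normB = 0.0;
        foreach ($a as $i => $value) {
            $dot += $value * $b[$i];
            $normA += $value ** 2;
            $normB += $b[$i] ** 2;
        }
        return ($normA > 0 && $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;
    }
}

$db = new InMemoryVectorDatabase();
$db->upsert('node_1', [1.0, 0.0], ['node_type' => 'article'], 'About cats');
$db->upsert('node_2', [0.0, 1.0], ['node_type' => 'page'], 'About taxes');
$db->upsert('node_3', [0.9, 0.1], ['node_type' => 'page'], 'About kittens');

$hits = $db->search([1.0, 0.0], 2);
$pages = $db->search([1.0, 0.0], 10, ['node_type' => 'page']);
```

Coding the search and RAG services against an interface like this keeps them independent of the chosen provider.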

RAG (Retrieval Augmented Generation) Patterns

Core RAG Architecture

RAG enhances language models by retrieving relevant documents before generation, reducing hallucinations and improving accuracy.

Flow: Query → Embedding → Vector Search → Context Retrieval → LLM Generation

Chunking Strategies

Proper document chunking is critical for RAG effectiveness:

<?php

namespace Drupal\my_ai_module\Services;

class DocumentChunkingService {

  /**
   * Chunk by fixed size with overlap.
   */
  public function chunkBySize(
    string $text,
    int $chunkSize = 512,
    int $overlapSize = 100
  ): array {
    // An overlap equal to or larger than the chunk size would make the loop
    // step zero or negative and never terminate.
    if ($chunkSize < 1 || $overlapSize < 0 || $overlapSize >= $chunkSize) {
      throw new \InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
    }

    $words = str_word_count($text, 1);
    $total = count($words);
    $step = $chunkSize - $overlapSize;
    $chunks = [];
    for ($i = 0; $i < $total; $i += $step) {
      $chunks[] = implode(' ', array_slice($words, $i, $chunkSize));
      // Stop once a chunk reaches the end of the text; another step would
      // only repeat the overlap.
      if ($i + $chunkSize >= $total) {
        break;
      }
    }
    return $chunks;
  }

  /**
   * Chunk by semantic boundaries (paragraphs).
   */
  public function chunkBySemantic(string $text, int $targetSize = 512): array {
    $paragraphs = preg_split('/\n\n+/', $text);
    $chunks = [];
    $currentChunk = '';
    $currentSize = 0;

    foreach ($paragraphs as $paragraph) {
      $paragraphSize = str_word_count($paragraph);
      // Start a new chunk when adding this paragraph would exceed the target.
      if ($currentSize + $paragraphSize > $targetSize && !empty($currentChunk)) {
        $chunks[] = rtrim($currentChunk);
        $currentChunk = '';
        $currentSize = 0;
      }
      $currentChunk .= $paragraph . "\n\n";
      $currentSize += $paragraphSize;
    }

    if (!empty($currentChunk)) {
      $chunks[] = rtrim($currentChunk);
    }
    return $chunks;
  }

  /**
   * Chunk by markdown headers for structured documents.
   */
  public function chunkByStructure(string $text): array {
    $chunks = [];
    $lines = explode("\n", $text);
    $currentChunk = '';
    $currentHeader = '';

    foreach ($lines as $line) {
      // Check for headers.
      if (preg_match('/^#+\s+(.+)$/', $line, $matches)) {
        if (!empty($currentChunk)) {
          $chunks[] = [
            'header' => $currentHeader,
            'content' => $currentChunk,
          ];
        }
        $currentHeader = $matches[1];
        $currentChunk = $line . "\n";
      }
      else {
        $currentChunk .= $line . "\n";
      }
    }

    if (!empty($currentChunk)) {
      $chunks[] = [
        'header' => $currentHeader,
        'content' => $currentChunk,
      ];
    }
    return $chunks;
  }

}
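A small worked example makes the effect of the overlap in fixed-size chunking easier to see. The standalone function below mirrors the fixed-size strategy with tiny numbers. It is a sketch: the name `chunkWords` is made up, and it splits on whitespace with `preg_split` rather than `str_word_count` so that digits and non-ASCII words are kept.

```php
<?php
declare(strict_types=1);

/**
 * Splits text into chunks of $chunkSize words, where each chunk repeats the
 * last $overlap words of the previous one (hypothetical helper).
 */
function chunkWords(string $text, int $chunkSize, int $overlap): array
{
    if ($chunkSize < 1 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
    }
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $step = $chunkSize - $overlap; // how far the window advances each time
    $chunks = [];
    for ($i = 0; $i < $total; $i += $step) {
        $chunks[] = implode(' ', array_slice($words, $i, $chunkSize));
        if ($i + $chunkSize >= $total) {
            break; // the window already covers the end of the text
        }
    }
    return $chunks;
}

// Ten words, windows of four words, one word shared between neighbours.
$chunks = chunkWords('a b c d e f g h i j', 4, 1);
print_r($chunks);
// The chunks are "a b c d", "d e f g", "g h i j".
```

The shared word at each boundary is what lets a sentence that straddles two chunks still be retrievable from either one.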

Complete RAG Implementation

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class RAGService {

  protected $vectorDb;
  protected $embeddingService;
  protected $llmService;
  protected $chunkingService;
  protected $logger;

  public function __construct(
    VectorDatabaseInterface $vectorDb,
    EmbeddingService $embeddingService,
    LLMService $llmService,
    DocumentChunkingService $chunkingService,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->vectorDb = $vectorDb;
    $this->embeddingService = $embeddingService;
    $this->llmService = $llmService;
    $this->chunkingService = $chunkingService;
    $this->logger = $loggerFactory->get('rag');
  }

  /**
   * Ingest document into RAG system.
   */
  public function ingestDocument(
    string $documentId,
    string $content,
    array $metadata = []
  ): void {
    try {
      // Chunk document.
      $chunks = $this->chunkingService->chunkBySemantic($content);

      // Create embeddings and store.
      $vectorData = [];
      foreach ($chunks as $i => $chunkText) {
        $embedding = $this->embeddingService->embed($chunkText);
        $vectorData[] = [
          'id' => "{$documentId}_chunk_{$i}",
          'embedding' => $embedding,
          // System keys are merged last so caller metadata cannot overwrite
          // document_id, which removeDocument() relies on.
          'metadata' => array_merge($metadata, [
            'document_id' => $documentId,
            'chunk_index' => $i,
            'chunk_count' => count($chunks),
            'word_count' => str_word_count($chunkText),
          ]),
          'content' => $chunkText,
        ];
      }

      // Batch insert into vector database.
      $this->vectorDb->batchInsert($vectorData);
      $this->logger->info(
        'Ingested document @doc with @chunks chunks',
        ['@doc' => $documentId, '@chunks' => count($chunks)]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Ingestion failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Retrieve context for query.
   */
  public function retrieveContext(
    string $query,
    int $topK = 5,
    array $filters = []
  ): array {
    try {
      // Embed query.
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database.
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $topK,
        filters: $filters
      );

      // Rank and deduplicate by document.
      return $this->rankAndDeduplicate($results);
    }
    catch (\Exception $e) {
      $this->logger->error('Context retrieval failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate response using RAG.
   */
  public function generateResponse(
    string $query,
    array $retrievalFilters = [],
    array $llmOptions = []
  ): string {
    try {
      // Retrieve context.
      $contextChunks = $this->retrieveContext($query, topK: 5, filters: $retrievalFilters);

      // Build prompt with context.
      $contextText = $this->buildContextString($contextChunks);
      $systemPrompt = $this->buildSystemPrompt($contextText);

      // Generate response.
      $response = $this->llmService->generate(
        messages: [
          ['role' => 'system', 'content' => $systemPrompt],
          ['role' => 'user', 'content' => $query],
        ],
        options: $llmOptions
      );
      $this->logger->info('Generated RAG response for query');
      return $response;
    }
    catch (\Exception $e) {
      $this->logger->error('Response generation failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build context string from retrieved chunks.
   */
  protected function buildContextString(array $chunks): string {
    $context = "# Retrieved Context\n\n";
    foreach ($chunks as $chunk) {
      $context .= "## Source: " . $chunk['metadata']['document_id'] . "\n";
      $context .= "**Relevance Score:** " . round($chunk['score'] * 100) . "%\n\n";
      $context .= $chunk['content'] . "\n\n";
      $context .= "---\n\n";
    }
    return $context;
  }

  /**
   * Build system prompt with instructions.
   */
  protected function buildSystemPrompt(string $context): string {
    return <<<PROMPT
    You are a helpful assistant with access to the following context information.
    Use this context to answer questions accurately and cite your sources.

    $context

    Instructions:
    1. Answer based primarily on the provided context
    2. Cite which source document you're referencing
    3. If context is insufficient, clearly state the limitation
    4. Do not make up information
    PROMPT;
  }

  /**
   * Rank and deduplicate results by document.
   */
  protected function rankAndDeduplicate(array $results): array {
    $deduped = [];
    foreach ($results as $result) {
      $docId = $result['metadata']['document_id'] ?? 'unknown';
      // Keep highest scoring chunk per document.
      if (!isset($deduped[$docId]) || $result['score'] > $deduped[$docId]['score']) {
        $deduped[$docId] = $result;
      }
    }

    // Sort by score, highest first.
    usort($deduped, fn($a, $b) => $b['score'] <=> $a['score']);
    return $deduped;
  }

  /**
   * Remove document from RAG.
   */
  public function removeDocument(string $documentId): void {
    try {
      $this->vectorDb->deleteByMetadata(['document_id' => $documentId]);
      $this->logger->info('Removed document @doc from RAG', ['@doc' => $documentId]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
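The RAG service places every retrieved chunk into the system prompt, which can exceed the model's context window when chunks are long. One way to guard against that is to select chunks greedily by score until a size budget is used up. The sketch below does this with a word count as a rough stand-in for tokens; the function name, the budget value, and the use of words instead of a real tokenizer are all assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Picks the highest-scoring chunks whose combined word count fits the budget
 * (hypothetical helper). Each chunk needs 'score' and 'content' keys.
 */
function selectChunksWithinBudget(array $chunks, int $maxWords): array
{
    // Consider the most relevant chunks first.
    usort($chunks, fn(array $a, array $b) => $b['score'] <=> $a['score']);

    $selected = [];
    $used = 0;
    foreach ($chunks as $chunk) {
        $words = str_word_count($chunk['content']);
        if ($used + $words > $maxWords) {
            continue; // too large for the remaining budget; try smaller chunks
        }
        $selected[] = $chunk;
        $used += $words;
    }
    return $selected;
}

$chunks = [
    ['score' => 0.8, 'content' => 'four words right here'],
    ['score' => 0.9, 'content' => 'one two three four five'],
    ['score' => 0.7, 'content' => 'two words'],
];
$selected = selectChunksWithinBudget($chunks, 7);
print_r(array_column($selected, 'score'));
```

In this example the four-word chunk is skipped because it would push the total to nine words, while the two-word chunk still fits.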

Configuration and Setup

Module Structure

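my_ai_module/
├── src/
│   ├── Services/
│   │   ├── ChromaDbService.php
│   │   ├── PineconeService.php
│   │   ├── MilvusService.php
│   │   ├── WeaviateService.php
│   │   ├── EmbeddingService.php
│   │   ├── SemanticSearchService.php
│   │   └── RAGService.php
│   └── Form/
│       └── VectorDatabaseSettingsForm.php
├── config/
│   └── schema/
│       └── my_ai_module.schema.yml
├── composer.json
├── my_ai_module.info.yml
├── my_ai_module.module
└── my_ai_module.services.yml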

Drupal Configuration Schema

# my_ai_module.schema.yml
my_ai_module.vector_database:
  type: config_object
  label: 'Vector Database Configuration'
  mapping:
    provider:
      type: string
      label: 'Vector Database Provider'
      description: 'chromadb, pinecone, milvus, or weaviate'
      constraints:
        - AllowedValues:
            choices: [chromadb, pinecone, milvus, weaviate]
    embedding_model:
      type: string
      label: 'Embedding Model'
      description: 'Model for generating embeddings'
      default: 'text-embedding-3-small'

my_ai_module.embeddings:
  type: config_object
  label: 'Embedding Service Configuration'
  mapping:
    provider:
      type: string
      label: 'Embedding Provider'
      description: 'openai, cohere, huggingface'
    api_key_key:
      type: string
      label: 'API Key Reference'
      description: 'Key module key ID for storing API key'
    model:
      type: string
      label: 'Model Name'
    dimension:
      type: integer
      label: 'Embedding Dimension'

my_ai_module.rag:
  type: config_object
  label: 'RAG Configuration'
  mapping:
    enabled:
      type: boolean
      label: 'Enable RAG'
      default: true
    chunk_size:
      type: integer
      label: 'Chunk Size (words)'
      default: 512
    chunk_overlap:
      type: integer
      label: 'Chunk Overlap'
      default: 100
    retrieval_limit:
      type: integer
      label: 'Context Chunks to Retrieve'
      default: 5
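The chunk_size and chunk_overlap settings interact: each new chunk starts chunk_size minus chunk_overlap words after the previous one. The sketch below illustrates that relationship under some assumptions: chunkByWords is a hypothetical helper, and splitting on whitespace is a simplification, since the module's own chunking service is not shown in this section.

```php
<?php

/**
 * Split text into overlapping word-based chunks.
 *
 * Hypothetical helper (not part of the module): shows how a chunk size
 * and an overlap, both counted in words, determine chunk boundaries.
 */
function chunkByWords(string $text, int $chunkSize = 512, int $overlap = 100): array
{
    if ($chunkSize < 1 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require 0 <= overlap < chunkSize');
    }

    // Whitespace tokenization is an assumption made for this sketch.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $step = $chunkSize - $overlap;
    $chunks = [];

    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));

        // Stop once a chunk reaches the end of the text, so no trailing
        // chunk is emitted that only repeats overlap words.
        if ($start + $chunkSize >= $total) {
            break;
        }
    }

    return $chunks;
}
```

With the defaults above (512 and 100), consecutive chunks start 412 words apart. An overlap equal to or larger than the chunk size would never advance, which is why the helper rejects it.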

Composer Dependencies

{
  "require": {
    "php": ">=8.1",
    "drupal/core": "^10.0",
    "guzzlehttp/guzzle": "^7.0",
    "openai-php/client": "^0.8.0",
    "cohere-ai/cohere-php": "^1.0",
    "milvus/milvus": "^2.0"
  },
  "require-dev": {
    "phpunit/phpunit": "^10.0",
    "drupal/core-dev": "^10.0"
  }
}

Services Registration

# my_ai_module.services.yml
services:
  my_ai_module.chromadb:
    class: Drupal\my_ai_module\Services\ChromaDbService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.chromadb.url%'

  my_ai_module.pinecone:
    class: Drupal\my_ai_module\Services\PineconeService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.milvus:
    class: Drupal\my_ai_module\Services\MilvusService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.weaviate:
    class: Drupal\my_ai_module\Services\WeaviateService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.weaviate.url%'

  my_ai_module.embedding:
    class: Drupal\my_ai_module\Services\EmbeddingService
    arguments:
      - '@http_client'
      - '@config.factory'
      - '@logger.factory'

  my_ai_module.semantic_search:
    class: Drupal\my_ai_module\Services\SemanticSearchService
    arguments:
      - '@my_ai_module.embedding'
      - '@my_ai_module.vector_db'
      - '@entity_type.manager'
      - '@logger.factory'

  my_ai_module.rag:
    class: Drupal\my_ai_module\Services\RAGService
    arguments:
      - '@my_ai_module.vector_db'
      - '@my_ai_module.embedding'
      - '@my_ai_module.llm'
      - '@my_ai_module.chunking'
      - '@logger.factory'

Environment Variables

# .env.local

# Vector Database Configuration
# Allowed values: chromadb, pinecone, milvus, weaviate
VECTOR_DB_PROVIDER=pinecone
PINECONE_API_KEY=your_api_key_here
PINECONE_ENVIRONMENT=gcp-starter
PINECONE_INDEX=drupal-content

# Embedding Configuration
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
EMBEDDING_MODEL=text-embedding-3-small

# Vector Database Endpoints
CHROMADB_URL=http://localhost:8000
MILVUS_HOST=localhost
MILVUS_PORT=19530
WEAVIATE_URL=http://localhost:8080
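One way to turn these variables into connection settings is to resolve them in a single place and reject unknown providers early. The following is a minimal sketch, not the module's actual loader: resolveVectorDbConfig, the returned array shape, and the choice of chromadb as the fallback provider are all assumptions made for illustration.

```php
<?php

/**
 * Resolve vector database settings from environment-style variables.
 *
 * Hypothetical helper: the function name, the returned array shape, and
 * the "chromadb" fallback are assumptions, not part of the module.
 */
function resolveVectorDbConfig(array $env): array
{
    $allowed = ['chromadb', 'pinecone', 'milvus', 'weaviate'];
    $provider = strtolower(trim($env['VECTOR_DB_PROVIDER'] ?? 'chromadb'));

    if (!in_array($provider, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported provider: $provider");
    }

    // Per-provider settings, falling back to the endpoints in the sample file.
    $connections = [
        'chromadb' => ['url' => $env['CHROMADB_URL'] ?? 'http://localhost:8000'],
        'weaviate' => ['url' => $env['WEAVIATE_URL'] ?? 'http://localhost:8080'],
        'milvus' => [
            'host' => $env['MILVUS_HOST'] ?? 'localhost',
            'port' => (int) ($env['MILVUS_PORT'] ?? 19530),
        ],
        'pinecone' => [
            'api_key' => $env['PINECONE_API_KEY'] ?? null,
            'environment' => $env['PINECONE_ENVIRONMENT'] ?? null,
            'index' => $env['PINECONE_INDEX'] ?? null,
        ],
    ];

    return ['provider' => $provider] + $connections[$provider];
}
```

Passing the variables in as an array keeps the function easy to test; in a running site the argument could be the result of getenv() called without arguments, which returns all environment variables.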

API Key Management

Use Drupal's Key module for secure storage:

<?php

namespace Drupal\my_ai_module\Services;

class SecureKeyManagement {

  /**
   * Store API key securely
   */
  public static function storeKey(
    string $keyId,
    string $keyValue,
    string $description = ''
  ): void {
    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      throw new \RuntimeException('Key module required');
    }

    $key = \Drupal::entityTypeManager()
      ->getStorage('key')
      ->create([
        'id' => $keyId,
        'label' => "API Key: $keyId",
        'description' => $description,
        'key_type' => 'authentication',
        'key_provider' => 'config',
        'key_input' => 'textarea',
      ]);

    $key->setKeyValue($keyValue);
    $key->save();
  }

  /**
   * Retrieve API key securely
   */
  public static function getKey(string $keyId): ?string {
    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      // Fall back to an environment variable named <KEY_ID>_API_KEY,
      // for example OPENAI_API_KEY. getenv() returns false when unset.
      $value = getenv(strtoupper($keyId) . '_API_KEY');
      return $value === false ? null : $value;
    }

    try {
      $key = \Drupal::service('key.repository')->getKey($keyId);
      return $key ? $key->getKeyValue() : null;
    }
    catch (\Exception $e) {
      \Drupal::logger('my_ai_module')->warning('Key not found: @key', ['@key' => $keyId]);
      return null;
    }
  }

}
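The environment fallback in getKey() depends on a naming convention that maps a key ID to a variable such as OPENAI_API_KEY. A standalone sketch of that convention is shown below. Both function names are hypothetical, and replacing non-alphanumeric characters with underscores is an added assumption so that IDs like "pinecone-prod" still yield valid variable names.

```php
<?php

/**
 * Derive the fallback environment variable name for a key ID.
 *
 * Hypothetical helper: the <KEY_ID>_API_KEY convention mirrors the sample
 * .env file (OPENAI_API_KEY, PINECONE_API_KEY).
 */
function envVarNameForKey(string $keyId): string
{
    // Uppercase, then collapse anything outside A-Z and 0-9 to an underscore.
    $normalized = preg_replace('/[^A-Z0-9]+/', '_', strtoupper($keyId));

    return trim($normalized, '_') . '_API_KEY';
}

/**
 * Read an API key from the environment.
 *
 * Returns null when the variable is unset or empty, so callers can treat
 * "no key" uniformly instead of handling getenv()'s false return value.
 */
function getKeyFromEnv(string $keyId): ?string
{
    $value = getenv(envVarNameForKey($keyId));

    return ($value === false || $value === '') ? null : $value;
}
```

Returning null for an empty value is a design choice of this sketch: an empty API key is almost never usable, so it is treated the same as a missing one.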

Performance Tuning

<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Cache\CacheBackendInterface;

class VectorDatabaseOptimization {

  /**
   * Batch indexing for performance
   */
  public static function batchIndex(
    VectorDatabaseInterface $db,
    array $documents,
    int $batchSize = 100
  ): void {
    $batches = array_chunk($documents, $batchSize);

    foreach ($batches as $batch) {
      $db->batchInsert($batch);
      // Small delay to avoid overwhelming the database
      usleep(100000); // 100ms
    }
  }

  /**
   * Index pruning - remove old or unused vectors
   */
  public static function pruneIndex(
    VectorDatabaseInterface $db,
    int $daysOld = 90
  ): int {
    $cutoffTime = strtotime("-$daysOld days");

    $deleted = $db->deleteByMetadata([
      'created' => ['$lt' => $cutoffTime],
    ]);

    return $deleted;
  }

  /**
   * Rebuild index with optimal settings
   */
  public static function rebuildIndex(
    VectorDatabaseInterface $db,
    string $collectionName
  ): void {
    // This is database-specific
    // For Milvus:
    // $db->compactCollection($collectionName);
    // For Pinecone:
    // Force recreation of index with optimal parameters
  }

  /**
   * Caching strategy for frequent queries
   */
  public static function cacheQueryResults(
    CacheBackendInterface $cache,
    string $query,
    array $results,
    int $ttl = 3600
  ): void {
    $cacheId = 'vector_search:' . md5($query);
    // The third argument is an absolute expiry timestamp, not a duration.
    $cache->set($cacheId, $results, time() + $ttl);
  }

  /**
   * Get cached results if available
   */
  public static function getCachedResults(
    CacheBackendInterface $cache,
    string $query
  ): ?array {
    $cacheId = 'vector_search:' . md5($query);
    $cached = $cache->get($cacheId);

    return $cached ? $cached->data : null;
  }

}
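The cache ID above is derived from the query text alone, so two searches with the same text but different metadata filters or result limits would share one cache entry. A possible refinement is sketched below. vectorSearchCacheId is a hypothetical helper, and including filters and limit in the key is an assumption about how searches are parameterized.

```php
<?php

/**
 * Build a cache ID that distinguishes queries by text, filters, and limit.
 *
 * Hypothetical helper: the "vector_search:" prefix matches the class
 * above; everything else is an illustrative assumption.
 */
function vectorSearchCacheId(string $query, array $filters = [], int $limit = 10): string
{
    // Collapse runs of whitespace so trivially different inputs share an entry.
    $normalizedQuery = trim(preg_replace('/\s+/', ' ', $query));

    // Sort top-level filter keys so key order does not change the ID.
    // Nested filter arrays are not sorted in this sketch.
    ksort($filters);

    $payload = json_encode([$normalizedQuery, $filters, $limit]);

    return 'vector_search:' . hash('sha256', $payload);
}
```

The same ID must be used for both writing and reading, so a helper like this would replace the md5($query) expression in both cacheQueryResults() and getCachedResults().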

Comparison Matrix

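| Feature            | ChromaDB    | Pinecone  | Milvus            | Weaviate |
|--------------------|-------------|-----------|-------------------|----------|
| Deployment         | Local/Cloud | SaaS only | Self-hosted/Cloud | Both     |
| Scaling            | Limited     | Excellent | Excellent         | Good     |
| Cost               | Free        | $$$       | $                 | Free/$   |
| Setup Complexity   | Low         | Very Low  | Medium            | Medium   |
| GraphQL Support    | No          | No        | No                | Yes      |
| Hybrid Search      | No          | No        | No                | Yes      |
| GPU Acceleration   | No          | No        | Yes               | No       |
| Metadata Filtering | Basic       | Advanced  | Advanced          | Good     |
| Community Size     | Growing     | Large     | Large             | Growing  |
| Learning Curve     | Easy        | Easy      | Medium            | Medium   |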

Troubleshooting

Common Issues

Vector Dimension Mismatch:

  • Ensure embedding model dimension matches index configuration
  • Example: OpenAI's text-embedding-3-small produces 1536-dimensional vectors, while many Sentence Transformers models (such as all-MiniLM-L6-v2) produce 384-dimensional vectors
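A dimension mismatch is easier to diagnose when it is caught before the insert request is sent. The check below is a minimal sketch; assertEmbeddingDimension is a hypothetical helper, and the expected dimension would come from wherever the index configuration is stored.

```php
<?php

/**
 * Check that every embedding has the dimension the index expects.
 *
 * Hypothetical helper for illustration; throws on the first mismatch.
 *
 * @param array<int, array<int, float>> $embeddings
 */
function assertEmbeddingDimension(array $embeddings, int $expectedDimension): void
{
    foreach ($embeddings as $i => $vector) {
        $actual = count($vector);

        if ($actual !== $expectedDimension) {
            throw new LengthException(
                "Embedding $i has $actual dimensions, index expects $expectedDimension"
            );
        }
    }
}
```

Calling this just before an add or upsert operation turns an opaque database error into a message that names the offending vector.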

Query Performance Issues:

  • Build an ANN index suited to the workload (HNSW is available in both Milvus and Weaviate)
  • Batch insert operations
  • Implement caching for frequent queries

Memory Issues:

  • Use streaming/batching for large document sets
  • Implement vector pruning for old/unused data
  • Consider metadata filtering to reduce search scope
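For the streaming and batching advice, a generator keeps only one batch in memory at a time, unlike array_chunk(), which needs the full array up front. The sketch below assumes the documents can be produced by any iterable (for example another generator that loads entities one at a time); batchIterable is a hypothetical helper.

```php
<?php

/**
 * Yield fixed-size batches from any iterable without materializing it.
 *
 * Hypothetical helper. Because this is a generator, the argument check
 * runs when iteration starts, not when the function is called.
 */
function batchIterable(iterable $items, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;

        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the final, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}
```

Each yielded batch could then be passed to a call such as the batchInsert() method used in the Performance Tuning section, with the previous batch eligible for garbage collection as soon as the loop advances.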

Code Examples Summary

This guide includes production-ready implementations for:

  1. Vector database abstraction services
  2. Semantic search with multi-stage ranking
  3. RAG with multiple chunking strategies
  4. Configuration management and API key handling
  5. Performance optimization and caching
  6. Docker deployment configurations

All code follows Drupal coding standards and integrates seamlessly with Drupal's services architecture.