Vector Databases Integration Guide
Overview
Vector databases are specialized data stores optimized for storing, indexing, and searching high-dimensional vector embeddings. In the context of Drupal and AI systems, they enable semantic search, similarity matching, and retrieval-augmented generation (RAG) capabilities by storing embeddings generated from content.
Vector databases differ from traditional relational databases in several ways:
- Similarity search: They rank results by a distance metric (cosine, Euclidean, dot product) instead of matching exact values
- Approximate Nearest Neighbor (ANN) search: They retrieve results from millions of vectors without a full scan, trading a small amount of recall for speed
- Metadata filtering: They combine vector similarity with traditional attribute filters
- Scalability: They index and query high-dimensional data efficiently
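To make the distance metrics concrete, the following is a minimal sketch in plain PHP. The function names are illustrative assumptions, not part of any vector database client. The `brute_force_top_k()` helper performs the exact full scan that ANN indexes are designed to avoid, so it is only practical for small collections.

```php
<?php
declare(strict_types=1);

/** Dot product of two equal-length numeric vectors. */
function dot_product(array $a, array $b): float {
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

/** Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal vectors. */
function cosine_similarity(array $a, array $b): float {
    $norm = sqrt(dot_product($a, $a)) * sqrt(dot_product($b, $b));
    return $norm == 0.0 ? 0.0 : dot_product($a, $b) / $norm;
}

/** Euclidean (L2) distance: 0.0 for identical vectors, larger means farther apart. */
function euclidean_distance(array $a, array $b): float {
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }
    $diff = array_map(fn($x, $y) => $x - $y, $a, $b);
    return sqrt(dot_product($diff, $diff));
}

/**
 * Exact nearest-neighbour search by scanning every stored vector.
 * Returns [id => cosine similarity] for the $k best matches, best first.
 */
function brute_force_top_k(array $query, array $vectors, int $k): array {
    $scores = [];
    foreach ($vectors as $id => $vector) {
        $scores[$id] = cosine_similarity($query, $vector);
    }
    arsort($scores);
    return array_slice($scores, 0, $k, true);
}
```

Calling `brute_force_top_k([1.0, 0.0], $vectors, 2)` scores every entry in `$vectors`, which costs time linear in the collection size. An ANN index answers the same question from a much smaller candidate set.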
Supported Vector Database Systems
1. ChromaDB
Overview: Open-source vector database designed for AI applications, particularly popular for local development and small-to-medium deployments.
Key Characteristics:
- Lightweight and easy to set up
- Supports both in-memory and persistent storage
- Built-in support for multiple embedding providers
- Native Python and JavaScript SDKs
- Automatic deduplication of embeddings
Advantages:
- Zero external dependencies for basic setup
- Great for prototyping and development
- Built-in embedding generation
- Simple REST API for integration
Limitations:
- Limited horizontal scaling capabilities
- Not optimized for massive-scale deployments (100M+ vectors)
- Smaller community compared to alternatives
Integration with Drupal:
<?php

// Using the ChromaDB HTTP client.

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

/**
 * Thin wrapper around the ChromaDB v1 REST API.
 *
 * Note: in Chroma's v1 REST API the add, query, and delete routes are
 * expected to take the collection's ID (the `id` field returned by
 * getOrCreateCollection()) rather than its name. Verify this against the
 * Chroma release you deploy and pass the matching value as $collectionName.
 */
class ChromaDbService {

  protected $httpClient;
  protected $logger;
  protected $chromaUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $chromaUrl = 'http://localhost:8000'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('chromadb');
    $this->chromaUrl = rtrim($chromaUrl, '/');
  }

  /**
   * Create or get a collection.
   */
  public function getOrCreateCollection(string $collectionName): array {
    try {
      $response = $this->httpClient->request(
        'POST',
        "{$this->chromaUrl}/api/v1/collections",
        [
          'json' => [
            'name' => $collectionName,
            'metadata' => ['hnsw:space' => 'cosine'],
            // Return the existing collection instead of failing on a duplicate name.
            'get_or_create' => TRUE,
          ],
        ]
      );
      return json_decode((string) $response->getBody(), TRUE);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add documents with embeddings.
   */
  public function addDocuments(
    string $collectionName,
    array $documents,
    array $embeddings,
    array $metadatas = [],
    array $ids = []
  ): void {
    if (empty($ids)) {
      // Generate positional IDs; array_keys() also handles an empty input.
      $ids = array_map(fn($i) => "doc_$i", array_keys($documents));
    }

    $payload = [
      'ids' => $ids,
      'embeddings' => $embeddings,
      'documents' => $documents,
    ];
    // Only send metadata when the caller supplied it.
    if (!empty($metadatas)) {
      $payload['metadatas'] = $metadatas;
    }

    try {
      $this->httpClient->request(
        'POST',
        "{$this->chromaUrl}/api/v1/collections/$collectionName/add",
        ['json' => $payload]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to add documents: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query similar documents.
   */
  public function query(
    string $collectionName,
    array $queryEmbeddings,
    int $nResults = 10,
    ?array $whereFilter = NULL
  ): array {
    try {
      $params = [
        'query_embeddings' => $queryEmbeddings,
        'n_results' => $nResults,
      ];
      if ($whereFilter) {
        $params['where'] = $whereFilter;
      }

      $response = $this->httpClient->request(
        'POST',
        "{$this->chromaUrl}/api/v1/collections/$collectionName/query",
        ['json' => $params]
      );
      return json_decode((string) $response->getBody(), TRUE);
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete documents.
   */
  public function deleteDocuments(string $collectionName, array $ids): void {
    try {
      $this->httpClient->request(
        'POST',
        "{$this->chromaUrl}/api/v1/collections/$collectionName/delete",
        ['json' => ['ids' => $ids]]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
Docker Setup:
version: '3.8'

services:
  chromadb:
    image: ghcr.io/chroma-core/chroma:latest
    ports:
      - "8000:8000"
    environment:
      # Legacy storage setting from Chroma releases before 0.4. Newer images
      # are expected to reject it, so remove it or pin a matching image tag.
      - CHROMA_DB_IMPL=duckdb+parquet
      - PERSIST_DIRECTORY=/chroma/data
    volumes:
      - chromadb_data:/chroma/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  chromadb_data:
2. Pinecone
Overview: Fully managed vector database service with enterprise-grade infrastructure, multi-region deployment, and high availability.
Key Characteristics:
- Fully managed SaaS platform
- Automatic scaling and high availability
- Multi-region deployment options
- Advanced filtering and metadata support
- Integrated API key management and security
Advantages:
- Zero infrastructure management
- Exceptional search performance at scale
- Enterprise SLAs and support
- Automatic backups and disaster recovery
- Real-time index updates
Limitations:
- Significant cost for large-scale usage
- Vendor lock-in concerns
- Requires external service dependency
- Data residency considerations
Integration with Drupal:
<?php

namespace Drupal\my_ai_module\Services;

// Pinecone\Client stands in for whichever Pinecone PHP client the project
// installs; the method names used below are assumed to match that client.
use Pinecone\Client as PineconeClient;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class PineconeService {

  protected $client;
  protected $logger;
  protected $configFactory;
  protected $indexName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('pinecone');
    $this->configFactory = $configFactory;

    $config = $configFactory->get('my_ai_module.pinecone');
    // Use the key ID from configuration, falling back to the original default.
    $apiKey = $this->getSecureApiKey($config->get('api_key_key') ?: 'pinecone_api_key');
    $environment = $config->get('environment');

    $this->client = PineconeClient::create([
      'api_key' => $apiKey,
      'environment' => $environment,
    ]);
    $this->indexName = $config->get('index_name');
  }

  /**
   * Securely retrieve the API key from the environment or the Key module.
   */
  protected function getSecureApiKey(string $keyName): string {
    // First try the environment variable.
    if ($apiKey = getenv('PINECONE_API_KEY')) {
      return $apiKey;
    }

    // Fall back to the Drupal Key module if it is installed.
    if (\Drupal::moduleHandler()->moduleExists('key')) {
      $key = \Drupal::service('key.repository')->getKey($keyName);
      if ($key) {
        return $key->getKeyValue();
      }
    }

    throw new \RuntimeException('Pinecone API key not configured');
  }

  /**
   * Upsert vectors (insert or update).
   */
  public function upsertVectors(
    array $vectors,
    string $namespace = 'default'
  ): void {
    try {
      $index = $this->client->index($this->indexName);

      // Normalize each vector to the id/values/metadata shape.
      $upsertData = array_map(function ($vector) {
        return [
          'id' => $vector['id'],
          'values' => $vector['values'],
          'metadata' => $vector['metadata'] ?? [],
        ];
      }, $vectors);

      $index->upsert(vectors: $upsertData, namespace: $namespace);

      $this->logger->info('Upserted @count vectors to @namespace', [
        '@count' => count($vectors),
        '@namespace' => $namespace,
      ]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to upsert vectors: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Query vectors with metadata filtering.
   */
  public function query(
    array $vector,
    int $topK = 10,
    ?array $filter = NULL,
    string $namespace = 'default'
  ): array {
    try {
      $index = $this->client->index($this->indexName);

      $queryParams = [
        'vector' => $vector,
        'topK' => $topK,
        'namespace' => $namespace,
        'includeMetadata' => TRUE,
      ];
      if ($filter) {
        $queryParams['filter'] = $filter;
      }

      $results = $index->query($queryParams);

      return [
        'matches' => $results['matches'] ?? [],
        'namespace' => $namespace,
      ];
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete vectors by ID.
   */
  public function deleteVectors(
    array $ids,
    string $namespace = 'default'
  ): void {
    try {
      $index = $this->client->index($this->indexName);
      $index->delete(ids: $ids, namespace: $namespace);
      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Get index statistics.
   */
  public function getIndexStats(): array {
    try {
      $index = $this->client->index($this->indexName);
      return $index->describeIndexStats();
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to get index stats: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
Configuration Schema (my_ai_module.schema.yml):
my_ai_module.pinecone:
  type: config_object
  label: 'Pinecone Configuration'
  mapping:
    api_key_key:
      type: string
      label: 'API Key ID (from Key module)'
      description: 'Reference to stored Pinecone API key'
    environment:
      type: string
      label: 'Pinecone Environment'
      description: 'e.g., gcp-starter, us-east1-aws'
    index_name:
      type: string
      label: 'Index Name'
      description: 'Name of the Pinecone index'
    dimension:
      type: integer
      label: 'Vector Dimension'
      description: 'Dimension of embeddings (e.g., 1536 for OpenAI)'
3. Milvus
Overview: Open-source vector database designed for high-performance similarity search on massive-scale datasets, supporting both cloud and on-premise deployments.
Key Characteristics:
- High performance with HNSW and IVF indexing algorithms
- Distributed architecture for horizontal scaling
- Supports GPU acceleration
- Multiple distance metrics (L2, IP, cosine, Hamming)
- Time-based filtering and partitioning
Advantages:
- Open-source with no vendor lock-in
- Excellent performance at scale (100M+ vectors)
- Flexible deployment options (Milvus Lite, standalone, distributed)
- GPU acceleration for large-scale operations
- Rich filtering capabilities with scalar data
Limitations:
- More complex operational overhead than managed services
- Steeper learning curve for configuration
- Requires infrastructure management
- Community support model
Integration with Drupal:
<?php

namespace Drupal\my_ai_module\Services;

// Milvus\MilvusClient is assumed to be provided by the Milvus PHP package
// listed in composer.json; adjust method names to the client actually used.
use Milvus\MilvusClient;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\Core\Config\ConfigFactoryInterface;

class MilvusService {

  protected $client;
  protected $logger;
  protected $configFactory;
  protected $collectionName;

  public function __construct(
    LoggerChannelFactoryInterface $loggerFactory,
    ConfigFactoryInterface $configFactory
  ) {
    $this->logger = $loggerFactory->get('milvus');
    $this->configFactory = $configFactory;

    $config = $configFactory->get('my_ai_module.milvus');
    $host = $config->get('host') ?? 'localhost';
    $port = $config->get('port') ?? 19530;

    $this->client = new MilvusClient([
      'host' => $host,
      'port' => $port,
    ]);
    $this->collectionName = $config->get('collection_name');
  }

  /**
   * Create collection with schema.
   */
  public function createCollection(
    string $collectionName,
    array $fields,
    string $description = ''
  ): void {
    try {
      $this->client->createCollection([
        'collection_name' => $collectionName,
        'fields' => $fields,
        'description' => $description,
      ]);
      $this->logger->info('Collection @name created', ['@name' => $collectionName]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create collection: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Insert vectors with auto-ID generation.
   */
  public function insertVectors(
    string $collectionName,
    array $vectors,
    array $metadatas = []
  ): array {
    try {
      $insertData = [];
      foreach ($vectors as $i => $vector) {
        $insertData[] = [
          'embedding' => $vector,
          'metadata' => json_encode($metadatas[$i] ?? []),
          // Timestamp stored in milliseconds.
          'created_at' => time() * 1000,
        ];
      }

      $response = $this->client->insert($collectionName, $insertData);
      $this->logger->info('Inserted @count vectors', ['@count' => count($vectors)]);
      return $response;
    }
    catch (\Exception $e) {
      $this->logger->error('Insert failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Search with vector similarity and metadata filtering.
   */
  public function search(
    string $collectionName,
    array $queryVector,
    int $limit = 10,
    ?array $metadataFilter = NULL
  ): array {
    try {
      $searchParams = [
        'collection_name' => $collectionName,
        'vectors' => [$queryVector],
        'limit' => $limit,
        // Alternatives: L2, IP. Must match the metric the index was built with.
        'metric_type' => 'COSINE',
        'vector_field_name' => 'embedding',
      ];
      if ($metadataFilter) {
        $searchParams['expr'] = $this->buildFilterExpression($metadataFilter);
      }

      $response = $this->client->search($searchParams);
      return $response['results'][0] ?? [];
    }
    catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build a Milvus filter expression from field => value conditions.
   */
  protected function buildFilterExpression(array $filter): string {
    $expressions = [];
    foreach ($filter as $field => $value) {
      if (is_array($value)) {
        // Array values become an IN expression.
        $values = implode(', ', array_map([$this, 'formatFilterValue'], $value));
        $expressions[] = "$field in [$values]";
      }
      else {
        // Scalar values become an equality check.
        $expressions[] = "$field == " . $this->formatFilterValue($value);
      }
    }
    return implode(' && ', $expressions);
  }

  /**
   * Format one literal: numbers and booleans unquoted, strings quoted and escaped.
   */
  protected function formatFilterValue($value): string {
    if (is_bool($value)) {
      return $value ? 'true' : 'false';
    }
    if (is_int($value) || is_float($value)) {
      return (string) $value;
    }
    return "'" . addcslashes((string) $value, "'\\") . "'";
  }

  /**
   * Delete vectors by IDs.
   */
  public function deleteVectors(
    string $collectionName,
    array $ids
  ): void {
    try {
      // Auto-generated IDs are integers; casting keeps the expression well-formed.
      $idList = implode(',', array_map('intval', $ids));
      $this->client->delete($collectionName, [
        'expr' => "id in [$idList]",
      ]);
      $this->logger->info('Deleted @count vectors', ['@count' => count($ids)]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Create index for performance.
   */
  public function createIndex(
    string $collectionName,
    string $fieldName = 'embedding',
    string $indexType = 'HNSW'
  ): void {
    try {
      // HNSW takes M and efConstruction; IVF index types take nlist instead.
      $params = $indexType === 'HNSW'
        ? ['M' => 8, 'efConstruction' => 200]
        : ['nlist' => 128];

      $this->client->createIndex([
        'collection_name' => $collectionName,
        'field_name' => $fieldName,
        'index_name' => "{$collectionName}_{$fieldName}_index",
        // HNSW, IVF_FLAT, or IVF_SQ8.
        'index_type' => $indexType,
        'params' => $params,
      ]);
      $this->logger->info('Index created on @field', ['@field' => $fieldName]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create index: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
Docker Compose Setup:
version: '3.8'

services:
  minio:
    image: minio/minio:latest
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: minio server /minio_data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5

  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    ports:
      - "2379:2379"
    # --data-dir points etcd at the mounted volume so its state persists.
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd_data
    volumes:
      - etcd_data:/etcd_data

  milvus:
    image: milvusdb/milvus:latest
    command: ["milvus", "run", "standalone"]
    depends_on:
      - minio
      - etcd
    environment:
      # Variable names follow the upstream standalone compose file.
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    healthcheck:
      # The health endpoint is served on the metrics port (9091), not the gRPC port.
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  minio_data:
  etcd_data:
  milvus_data:
4. Weaviate
Overview: Open-source vector database with built-in NLP modules, combining vector search with semantic understanding and graph capabilities.
Key Characteristics:
- Built-in multi-language support
- GraphQL API for complex queries
- Integrated module system (transformers, Q&A, NER)
- Hybrid search combining vector and keyword search
- Rich filtering with reference properties
Advantages:
- Excellent for semantic understanding out-of-the-box
- GraphQL support for complex queries
- Built-in ML modules reduce external dependencies
- Hybrid search for better relevance
- Strong documentation and community
Limitations:
- Higher resource requirements
- Learning curve for GraphQL queries
- Memory-intensive for large-scale deployments
- Module licensing considerations
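To illustrate what the hybrid search weighting does, here is a simplified sketch of blending vector and keyword scores with an `alpha` parameter, where 1.0 means pure vector ranking and 0.0 means pure keyword ranking. This is not Weaviate's internal fusion algorithm. The function names and the min-max normalization step are assumptions chosen to keep the example small.

```php
<?php
declare(strict_types=1);

/** Rescale scores to the 0..1 range so the two result sets are comparable. */
function min_max_normalize(array $scores): array {
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $range = max($scores) - $min;
    // When all scores are equal, treat every entry as a full match.
    return array_map(fn($s) => $range > 0 ? ($s - $min) / $range : 1.0, $scores);
}

/**
 * Blend two [id => score] maps. A document missing from one list
 * contributes 0.0 for that side. Returns [id => fused score], best first.
 */
function hybrid_fuse(array $vectorScores, array $keywordScores, float $alpha = 0.5): array {
    $v = min_max_normalize($vectorScores);
    $k = min_max_normalize($keywordScores);
    $fused = [];
    foreach (array_keys($v + $k) as $id) {
        $fused[$id] = $alpha * ($v[$id] ?? 0.0) + (1.0 - $alpha) * ($k[$id] ?? 0.0);
    }
    arsort($fused);
    return $fused;
}
```

With `alpha` at 0.5, as in the `hybridSearch()` method of this guide, a document needs to do reasonably well in both rankings to reach the top. Moving `alpha` toward 1.0 favors semantic matches, and moving it toward 0.0 favors exact keyword matches.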
Integration with Drupal:
<?php

namespace Drupal\my_ai_module\Services;

use GuzzleHttp\ClientInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;

class WeaviateService {

  protected $httpClient;
  protected $logger;
  protected $weaviateUrl;

  public function __construct(
    ClientInterface $httpClient,
    LoggerChannelFactoryInterface $loggerFactory,
    string $weaviateUrl = 'http://localhost:8080'
  ) {
    $this->httpClient = $httpClient;
    $this->logger = $loggerFactory->get('weaviate');
    $this->weaviateUrl = rtrim($weaviateUrl, '/');
  }

  /**
   * Create a class schema.
   *
   * $vectorizer is the module name, for example 'text2vec-openai'.
   */
  public function createClass(
    string $className,
    array $properties,
    ?string $vectorizer = NULL
  ): void {
    try {
      $schema = [
        'class' => $className,
        'description' => "Class $className",
        'properties' => $properties,
      ];
      if ($vectorizer) {
        $schema['vectorizer'] = $vectorizer;
      }

      $this->httpClient->request('POST', "{$this->weaviateUrl}/v1/schema", ['json' => $schema]);
      $this->logger->info('Class @class created', ['@class' => $className]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to create class: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Add objects with vectors.
   */
  public function addObjects(
    string $className,
    array $objects,
    array $vectors = []
  ): array {
    try {
      $batchObjects = [];
      foreach ($objects as $i => $obj) {
        $batchObject = [
          'class' => $className,
          'properties' => $obj,
        ];
        if (isset($vectors[$i])) {
          $batchObject['vector'] = $vectors[$i];
        }
        $batchObjects[] = $batchObject;
      }

      $response = $this->httpClient->request(
        'POST',
        "{$this->weaviateUrl}/v1/batch/objects",
        ['json' => ['objects' => $batchObjects]]
      );
      $result = json_decode((string) $response->getBody(), TRUE);
      $this->logger->info('Added @count objects', ['@count' => count($objects)]);
      return $result;
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to add objects: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * GraphQL query with semantic search.
   */
  public function query(
    string $className,
    array $queryVector,
    int $limit = 10,
    array $properties = [],
    ?array $where = NULL
  ): array {
    try {
      $propertiesStr = implode(' ', $properties);
      $whereClause = $where ? $this->buildWhereClause($where) : '';
      // Heredocs interpolate variables but not expressions, so build the
      // vector literal before the query string.
      $vectorStr = implode(',', $queryVector);

      $query = <<<GQL
        {
          Get {
            $className(
              nearVector: { vector: [{$vectorStr}] distance: 0.8 }
              limit: $limit
              $whereClause
            ) {
              $propertiesStr
              _additional { distance vector }
            }
          }
        }
        GQL;

      $response = $this->httpClient->request(
        'POST',
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );
      return json_decode((string) $response->getBody(), TRUE);
    }
    catch (\Exception $e) {
      $this->logger->error('Query failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Hybrid search combining vector and keyword search.
   */
  public function hybridSearch(
    string $className,
    string $searchText,
    array $queryVector,
    int $limit = 10
  ): array {
    try {
      $vectorStr = implode(',', $queryVector);
      // json_encode() quotes and escapes the user-supplied search text.
      $queryText = json_encode($searchText);

      $query = <<<GQL
        {
          Get {
            $className(
              hybrid: { query: {$queryText} vector: [{$vectorStr}] alpha: 0.5 }
              limit: $limit
            ) {
              text
              _additional { score distance }
            }
          }
        }
        GQL;

      $response = $this->httpClient->request(
        'POST',
        "{$this->weaviateUrl}/v1/graphql",
        ['json' => ['query' => $query]]
      );
      return json_decode((string) $response->getBody(), TRUE);
    }
    catch (\Exception $e) {
      $this->logger->error('Hybrid search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Delete objects by filter using the batch delete endpoint.
   */
  public function deleteObjects(string $className, array $where): void {
    try {
      // Translate field => value pairs into Weaviate's JSON filter format.
      $operands = [];
      foreach ($where as $field => $value) {
        $operands[] = [
          'path' => [$field],
          'operator' => 'Equal',
          'valueString' => (string) $value,
        ];
      }
      $filter = count($operands) === 1
        ? $operands[0]
        : ['operator' => 'And', 'operands' => $operands];

      $this->httpClient->request(
        'DELETE',
        "{$this->weaviateUrl}/v1/batch/objects",
        ['json' => ['match' => ['class' => $className, 'where' => $filter]]]
      );
      $this->logger->info('Objects deleted from @class', ['@class' => $className]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Helper to build a GraphQL where clause for filtering.
   */
  protected function buildWhereClause(array $conditions): string {
    if (empty($conditions)) {
      return '';
    }

    $operands = [];
    foreach ($conditions as $field => $value) {
      $operands[] = sprintf(
        '{ path: [%s] operator: Equal valueString: %s }',
        json_encode((string) $field),
        json_encode((string) $value)
      );
    }

    // A single condition is the filter itself; several are joined with And.
    if (count($operands) === 1) {
      return 'where: ' . $operands[0];
    }
    return 'where: { operator: And operands: [' . implode(', ', $operands) . '] }';
  }

  /**
   * Get vectorizer status from the instance metadata, which lists enabled modules.
   */
  public function getVectorizerStatus(): array {
    try {
      $response = $this->httpClient->request('GET', "{$this->weaviateUrl}/v1/meta");
      return json_decode((string) $response->getBody(), TRUE);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to get status: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
Docker Setup:
version: '3.8'

services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 100
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      ENABLE_MODULES: 'text2vec-openai,generative-openai'
      OPENAI_APIKEY: ${OPENAI_API_KEY}
      OPENAI_INFERENCE_API: openai
    volumes:
      - weaviate_data:/var/lib/weaviate
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v1/.well-known/ready"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  weaviate_data:
Semantic Search Implementation
Understanding Embeddings
Embeddings are dense vector representations of text that capture semantic meaning. Each word, sentence, or document is converted to a vector of numbers (typically 384 to 3072 dimensions, depending on the model), and texts with similar meaning map to vectors that lie close together.
Embedding Models:
- OpenAI: text-embedding-3-small (1536 dims), text-embedding-3-large (3072 dims)
- Sentence Transformers: all-MiniLM-L6-v2 (384 dims), all-mpnet-base-v2 (768 dims)
- Cohere: embed-english-v3.0 (1024 dims)
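Because a vector index is created with a fixed dimension, an embedding from the wrong model cannot be stored in it. The following sketch checks the dimension before storage and optionally scales a vector to unit length, in which case dot product and cosine similarity give the same ranking. The constant and function names are illustrative assumptions, and the dimension map simply restates the list above.

```php
<?php
declare(strict_types=1);

/** Output dimensions for the models listed above. */
const EMBEDDING_DIMENSIONS = [
    'text-embedding-3-small' => 1536,
    'text-embedding-3-large' => 3072,
    'all-MiniLM-L6-v2' => 384,
    'all-mpnet-base-v2' => 768,
    'embed-english-v3.0' => 1024,
];

/** Throw if $embedding does not have the dimension expected for $model. */
function assert_embedding_dimension(string $model, array $embedding): void {
    $expected = EMBEDDING_DIMENSIONS[$model] ?? null;
    if ($expected === null) {
        throw new InvalidArgumentException("Unknown embedding model: $model");
    }
    if (count($embedding) !== $expected) {
        throw new LengthException(sprintf(
            'Model %s produces %d dimensions, got %d.',
            $model,
            $expected,
            count($embedding)
        ));
    }
}

/** Scale a vector to unit length; a zero vector is returned unchanged. */
function l2_normalize(array $vector): array {
    $norm = sqrt(array_sum(array_map(fn($x) => $x * $x, $vector)));
    if ($norm == 0.0) {
        return $vector;
    }
    return array_map(fn($x) => $x / $norm, $vector);
}
```

Running the check once at indexing time and once at query time catches the common mistake of switching embedding models without rebuilding the index.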
Complete Semantic Search Implementation
<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Entity\EntityTypeManagerInterface;
use Drupal\Core\Logger\LoggerChannelFactoryInterface;
use Drupal\node\Entity\Node;

/**
 * Indexes and searches nodes through a vector database.
 *
 * EmbeddingService and VectorDatabaseInterface are expected to live in this
 * namespace; their definitions are not shown in this guide.
 */
class SemanticSearchService {

  protected $embeddingService;
  protected $vectorDb;
  protected $entityTypeManager;
  protected $logger;

  public function __construct(
    EmbeddingService $embeddingService,
    VectorDatabaseInterface $vectorDb,
    EntityTypeManagerInterface $entityTypeManager,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->embeddingService = $embeddingService;
    $this->vectorDb = $vectorDb;
    $this->entityTypeManager = $entityTypeManager;
    $this->logger = $loggerFactory->get('semantic_search');
  }

  /**
   * Index a node for semantic search.
   */
  public function indexNode(Node $node): void {
    try {
      // Extract searchable text.
      $text = $this->extractNodeText($node);

      // Generate embedding.
      $embedding = $this->embeddingService->embed($text);

      // Prepare metadata.
      $metadata = [
        'nid' => $node->id(),
        'node_type' => $node->getType(),
        'title' => $node->getTitle(),
        'author' => $node->getOwner()->getAccountName(),
        'created' => $node->getCreatedTime(),
        'updated' => $node->getChangedTime(),
        'language' => $node->language()->getId(),
      ];

      // Store in vector database.
      $this->vectorDb->upsert(
        id: "node_{$node->id()}",
        embedding: $embedding,
        metadata: $metadata,
        document: $text
      );

      $this->logger->info('Indexed node @nid', ['@nid' => $node->id()]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to index node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Extract text from node for embedding.
   */
  protected function extractNodeText(Node $node): string {
    $text = $node->getTitle() . "\n\n";

    // Add the body field if available, without HTML markup.
    if ($node->hasField('body')) {
      $text .= strip_tags($node->get('body')->value ?? '');
    }

    // Add other configurable text fields. Base fields (uuid, langcode, status,
    // and so on) are skipped so identifiers do not pollute the embedding.
    $textTypes = ['string', 'string_long', 'text', 'text_long', 'text_with_summary'];
    foreach ($node->getFieldDefinitions() as $fieldName => $field) {
      if ($fieldName === 'body' || $field->getFieldStorageDefinition()->isBaseField()) {
        continue;
      }
      if (!in_array($field->getType(), $textTypes, TRUE)) {
        continue;
      }
      $value = $node->get($fieldName)->value ?? '';
      if (is_string($value) && $value !== '') {
        $text .= "\n" . strip_tags($value);
      }
    }

    return $text;
  }

  /**
   * Semantic search across indexed content.
   */
  public function search(
    string $query,
    int $limit = 10,
    array $filters = []
  ): array {
    try {
      // Generate embedding for query.
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database.
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $limit,
        filters: $filters
      );

      // Enrich results with full node data, using the injected entity type manager.
      $storage = $this->entityTypeManager->getStorage('node');
      $enrichedResults = [];
      foreach ($results as $result) {
        $nid = $result['metadata']['nid'] ?? NULL;
        if (!$nid) {
          continue;
        }
        $node = $storage->load($nid);
        if ($node) {
          $enrichedResults[] = [
            'score' => $result['score'],
            'node' => $node,
            'similarity' => $result['score'],
            'snippet' => $this->generateSnippet($result['document'] ?? '', $query),
          ];
        }
      }

      return $enrichedResults;
    }
    catch (\Exception $e) {
      $this->logger->error('Search failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate snippet with query context.
   */
  protected function generateSnippet(string $text, string $query, int $length = 150): string {
    $words = str_word_count($text, 1);
    $queryWords = array_map('strtolower', str_word_count($query, 1));

    // Start a few words before the first query word match.
    $position = 0;
    foreach ($words as $i => $word) {
      if (in_array(strtolower($word), $queryWords, TRUE)) {
        $position = max(0, $i - 5);
        break;
      }
    }

    $snippet = implode(' ', array_slice($words, $position, 30));
    // Append an ellipsis only when the snippet was actually cut.
    if (strlen($snippet) > $length) {
      return substr($snippet, 0, $length) . '...';
    }
    return $snippet;
  }

  /**
   * Remove node from search index.
   */
  public function removeNode(int $nid): void {
    try {
      $this->vectorDb->delete("node_$nid");
      $this->logger->info('Removed node @nid from search', ['@nid' => $nid]);
    }
    catch (\Exception $e) {
      $this->logger->error('Failed to remove node: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
RAG (Retrieval Augmented Generation) Patterns
Core RAG Architecture
RAG enhances language models by retrieving relevant documents before generation, reducing hallucinations and improving accuracy.
Flow: Query → Embedding → Vector Search → Context Retrieval → LLM Generation
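The flow can be sketched end to end without any external service. In this toy version, `toy_embed()` counts words from a small fixed vocabulary instead of calling an embedding model, the index is a PHP array instead of a vector database, and `$llm` is any callable that accepts a prompt. All of these names are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/** Toy embedding: one dimension per vocabulary term, valued by term count. */
function toy_embed(string $text, array $vocabulary): array {
    $counts = array_count_values(str_word_count(strtolower($text), 1));
    return array_map(fn(string $term) => (float) ($counts[$term] ?? 0), $vocabulary);
}

/** Cosine similarity between two equal-length vectors. */
function cosine(array $a, array $b): float {
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    $norm = sqrt($normA) * sqrt($normB);
    return $norm == 0.0 ? 0.0 : $dot / $norm;
}

/** Vector search step: IDs of the $topK most similar indexed documents. */
function retrieve(array $queryVector, array $index, int $topK): array {
    $scored = [];
    foreach ($index as $id => $entry) {
        $scored[$id] = cosine($queryVector, $entry['vector']);
    }
    arsort($scored);
    return array_slice(array_keys($scored), 0, $topK);
}

/** Full flow: embed the query, search, assemble context, call the LLM. */
function answer(string $query, array $documents, array $vocabulary, callable $llm, int $topK = 1): string {
    $index = [];
    foreach ($documents as $id => $text) {
        $index[$id] = ['vector' => toy_embed($text, $vocabulary), 'text' => $text];
    }
    $ids = retrieve(toy_embed($query, $vocabulary), $index, $topK);
    $context = implode("\n", array_map(fn($id) => $index[$id]['text'], $ids));
    return $llm("Context:\n$context\n\nQuestion: $query");
}
```

Passing a callable such as `fn(string $prompt) => $prompt` as `$llm` returns the assembled prompt itself, which is a convenient way to inspect what context the retrieval step selected before wiring in a real model.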
Chunking Strategies
Proper document chunking is critical for RAG effectiveness:
<?php

namespace Drupal\my_ai_module\Services;

class DocumentChunkingService {

  /**
   * Chunk by fixed size (in words) with overlap.
   */
  public function chunkBySize(
    string $text,
    int $chunkSize = 512,
    int $overlapSize = 100
  ): array {
    // The step must be positive, otherwise the loop below never advances.
    if ($chunkSize <= 0 || $overlapSize < 0 || $overlapSize >= $chunkSize) {
      throw new \InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
    }

    $words = str_word_count($text, 1);
    $total = count($words);
    $step = $chunkSize - $overlapSize;
    $chunks = [];

    for ($i = 0; $i < $total; $i += $step) {
      $chunks[] = implode(' ', array_slice($words, $i, $chunkSize));
      // Stop once a chunk reaches the end, so no trailing chunk repeats
      // only words already covered by the overlap.
      if ($i + $chunkSize >= $total) {
        break;
      }
    }

    return $chunks;
  }

  /**
   * Chunk by semantic boundaries (paragraphs).
   */
  public function chunkBySemantic(
    string $text,
    int $targetSize = 512
  ): array {
    $paragraphs = preg_split('/\n\n+/', $text);
    $chunks = [];
    $currentChunk = '';
    $currentSize = 0;

    foreach ($paragraphs as $paragraph) {
      $paragraphSize = str_word_count($paragraph);

      // Close the current chunk when adding this paragraph would exceed the target.
      if ($currentSize + $paragraphSize > $targetSize && !empty($currentChunk)) {
        $chunks[] = trim($currentChunk);
        $currentChunk = '';
        $currentSize = 0;
      }

      $currentChunk .= $paragraph . "\n\n";
      $currentSize += $paragraphSize;
    }

    if (trim($currentChunk) !== '') {
      $chunks[] = trim($currentChunk);
    }

    return $chunks;
  }

  /**
   * Chunk by markdown headers for structured documents.
   */
  public function chunkByStructure(string $text): array {
    $chunks = [];
    $lines = explode("\n", $text);
    $currentChunk = '';
    $currentHeader = '';

    foreach ($lines as $line) {
      // A header line closes the previous chunk and starts a new one.
      if (preg_match('/^#+\s+(.+)$/', $line, $matches)) {
        if (!empty($currentChunk)) {
          $chunks[] = [
            'header' => $currentHeader,
            'content' => $currentChunk,
          ];
        }
        $currentHeader = $matches[1];
        $currentChunk = $line . "\n";
      }
      else {
        $currentChunk .= $line . "\n";
      }
    }

    if (!empty($currentChunk)) {
      $chunks[] = [
        'header' => $currentHeader,
        'content' => $currentChunk,
      ];
    }

    return $chunks;
  }

}
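The arithmetic behind fixed-size chunking with overlap is easy to get wrong, so here is a small standalone sketch of it. Each window starts `chunkSize - overlap` words after the previous one. The helper name `chunk_window_starts()` is hypothetical and returns only the starting word offsets, which is enough to see how many chunks a document produces.

```php
<?php
declare(strict_types=1);

/**
 * Starting word offsets of each chunk for a document of $wordCount words.
 * Stops as soon as a window reaches the end of the document.
 */
function chunk_window_starts(int $wordCount, int $chunkSize = 512, int $overlap = 100): array {
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be non-negative and smaller than the chunk size.');
    }
    $stride = $chunkSize - $overlap;
    $starts = [];
    for ($start = 0; $start < $wordCount; $start += $stride) {
        $starts[] = $start;
        if ($start + $chunkSize >= $wordCount) {
            break;
        }
    }
    return $starts;
}
```

With the defaults of 512 and 100 the stride is 412 words, so a 1000-word document yields windows starting at 0, 412, and 824. A 924-word document yields only two windows, because the second one (412 to 923) already reaches the end.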
Complete RAG Implementation
<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Logger\LoggerChannelFactoryInterface;

/**
 * Retrieval-augmented generation on top of a vector database.
 *
 * VectorDatabaseInterface, EmbeddingService, and LLMService are expected to
 * live in this namespace; their definitions are not shown in this guide.
 */
class RAGService {

  protected $vectorDb;
  protected $embeddingService;
  protected $llmService;
  protected $chunkingService;
  protected $logger;

  public function __construct(
    VectorDatabaseInterface $vectorDb,
    EmbeddingService $embeddingService,
    LLMService $llmService,
    DocumentChunkingService $chunkingService,
    LoggerChannelFactoryInterface $loggerFactory
  ) {
    $this->vectorDb = $vectorDb;
    $this->embeddingService = $embeddingService;
    $this->llmService = $llmService;
    $this->chunkingService = $chunkingService;
    $this->logger = $loggerFactory->get('rag');
  }

  /**
   * Ingest document into RAG system.
   */
  public function ingestDocument(
    string $documentId,
    string $content,
    array $metadata = []
  ): void {
    try {
      // Chunk document.
      $chunks = $this->chunkingService->chunkBySemantic($content);

      // Create embeddings and store.
      $vectorData = [];
      foreach ($chunks as $i => $chunkText) {
        $embedding = $this->embeddingService->embed($chunkText);

        $vectorData[] = [
          'id' => "{$documentId}_chunk_{$i}",
          'embedding' => $embedding,
          // System keys are merged last so caller metadata cannot overwrite them.
          'metadata' => array_merge($metadata, [
            'document_id' => $documentId,
            'chunk_index' => $i,
            'chunk_count' => count($chunks),
            'word_count' => str_word_count($chunkText),
          ]),
          'content' => $chunkText,
        ];
      }

      // Batch insert into vector database.
      $this->vectorDb->batchInsert($vectorData);

      $this->logger->info(
        'Ingested document @doc with @chunks chunks',
        ['@doc' => $documentId, '@chunks' => count($chunks)]
      );
    }
    catch (\Exception $e) {
      $this->logger->error('Ingestion failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Retrieve context for query.
   */
  public function retrieveContext(
    string $query,
    int $topK = 5,
    array $filters = []
  ): array {
    try {
      // Embed query.
      $queryEmbedding = $this->embeddingService->embed($query);

      // Search vector database.
      $results = $this->vectorDb->search(
        embedding: $queryEmbedding,
        limit: $topK,
        filters: $filters
      );

      // Rank and deduplicate by document.
      return $this->rankAndDeduplicate($results);
    }
    catch (\Exception $e) {
      $this->logger->error('Context retrieval failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Generate response using RAG.
   */
  public function generateResponse(
    string $query,
    array $retrievalFilters = [],
    array $llmOptions = []
  ): string {
    try {
      // Retrieve context.
      $contextChunks = $this->retrieveContext($query, topK: 5, filters: $retrievalFilters);

      // Build prompt with context.
      $contextText = $this->buildContextString($contextChunks);
      $systemPrompt = $this->buildSystemPrompt($contextText);

      // Generate response.
      $response = $this->llmService->generate(
        messages: [
          ['role' => 'system', 'content' => $systemPrompt],
          ['role' => 'user', 'content' => $query],
        ],
        options: $llmOptions
      );

      $this->logger->info('Generated RAG response for query');
      return $response;
    }
    catch (\Exception $e) {
      $this->logger->error('Response generation failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

  /**
   * Build context string from retrieved chunks.
   */
  protected function buildContextString(array $chunks): string {
    $context = "# Retrieved Context\n\n";
    foreach ($chunks as $chunk) {
      $context .= "## Source: " . ($chunk['metadata']['document_id'] ?? 'unknown') . "\n";
      $context .= "**Relevance Score:** " . round($chunk['score'] * 100) . "%\n\n";
      $context .= ($chunk['content'] ?? '') . "\n\n";
      $context .= "---\n\n";
    }
    return $context;
  }

  /**
   * Build system prompt with instructions.
   */
  protected function buildSystemPrompt(string $context): string {
    return <<<PROMPT
      You are a helpful assistant with access to the following context information.
      Use this context to answer questions accurately and cite your sources.

      $context

      Instructions:
      1. Answer based primarily on the provided context
      2. Cite which source document you're referencing
      3. If context is insufficient, clearly state the limitation
      4. Do not make up information
      PROMPT;
  }

  /**
   * Rank and deduplicate results, keeping the best chunk per document.
   */
  protected function rankAndDeduplicate(array $results): array {
    $deduped = [];
    foreach ($results as $result) {
      $docId = $result['metadata']['document_id'] ?? 'unknown';
      // Keep the highest scoring chunk per document.
      if (!isset($deduped[$docId]) || $result['score'] > $deduped[$docId]['score']) {
        $deduped[$docId] = $result;
      }
    }

    // Sort by score, highest first. usort() also reindexes the array.
    usort($deduped, fn($a, $b) => $b['score'] <=> $a['score']);
    return $deduped;
  }

  /**
   * Remove document from RAG.
   */
  public function removeDocument(string $documentId): void {
    try {
      $this->vectorDb->deleteByMetadata(['document_id' => $documentId]);
      $this->logger->info('Removed document @doc from RAG', ['@doc' => $documentId]);
    }
    catch (\Exception $e) {
      $this->logger->error('Delete failed: @error', ['@error' => $e->getMessage()]);
      throw $e;
    }
  }

}
Configuration and Setup
Module Structure
my_ai_module/
  src/
    Services/
      ChromaDbService.php
      PineconeService.php
      MilvusService.php
      WeaviateService.php
      EmbeddingService.php
      SemanticSearchService.php
      RAGService.php
    Form/
      VectorDatabaseSettingsForm.php
  config/
    schema/
      my_ai_module.schema.yml
  composer.json
  my_ai_module.info.yml
  my_ai_module.module
  my_ai_module.services.yml
Drupal Configuration Schema
# config/schema/my_ai_module.schema.yml
# Note: config schema describes structure and types only. The `default` keys
# below document the intended values; ship the actual defaults in
# config/install/*.yml.
my_ai_module.vector_database:
  type: config_object
  label: 'Vector Database Configuration'
  mapping:
    provider:
      type: string
      label: 'Vector Database Provider'
      description: 'chromadb, pinecone, milvus, or weaviate'
      constraints:
        AllowedValues:
          choices: [chromadb, pinecone, milvus, weaviate]
    embedding_model:
      type: string
      label: 'Embedding Model'
      description: 'Model for generating embeddings'
      default: 'text-embedding-3-small'

my_ai_module.embeddings:
  type: config_object
  label: 'Embedding Service Configuration'
  mapping:
    provider:
      type: string
      label: 'Embedding Provider'
      description: 'openai, cohere, huggingface'
    api_key_key:
      type: string
      label: 'API Key Reference'
      description: 'ID of the Key module entity that stores the API key'
    model:
      type: string
      label: 'Model Name'
    dimension:
      type: integer
      label: 'Embedding Dimension'

my_ai_module.rag:
  type: config_object
  label: 'RAG Configuration'
  mapping:
    enabled:
      type: boolean
      label: 'Enable RAG'
      default: true
    chunk_size:
      type: integer
      label: 'Chunk Size (words)'
      default: 512
    chunk_overlap:
      type: integer
      label: 'Chunk Overlap (words)'
      default: 100
    retrieval_limit:
      type: integer
      label: 'Context Chunks to Retrieve'
      default: 5
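To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of word-based chunking with overlap. The function name chunkByWords is hypothetical and is not the module's chunking service; it only illustrates how a 512-word window with a 100-word overlap advances 412 words per step.

```php
<?php

/**
 * Splits text into overlapping word-based chunks.
 *
 * Hypothetical helper for illustration only. Each chunk holds up to
 * $chunkSize words and repeats the last $overlap words of the previous chunk.
 */
function chunkByWords(string $text, int $chunkSize = 512, int $overlap = 100): array {
  if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
    throw new InvalidArgumentException('Require 0 <= overlap < chunkSize.');
  }

  // Split on any whitespace and drop empty fragments.
  $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
  $total = count($words);
  $step = $chunkSize - $overlap;
  $chunks = [];

  for ($start = 0; $start < $total; $start += $step) {
    $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));
    // Stop once the current window has reached the end of the text.
    if ($start + $chunkSize >= $total) {
      break;
    }
  }

  return $chunks;
}
```

With the defaults above, a 1000-word document yields three chunks starting at words 0, 412, and 824. A larger overlap improves recall across chunk boundaries at the cost of more vectors to store.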
Composer Dependencies
{
  "require": {
    "php": ">=8.1",
    "drupal/core": "^10.0",
    "guzzlehttp/guzzle": "^7.0",
    "openai-php/client": "^0.8.0",
    "cohere-ai/cohere-php": "^1.0",
    "milvus/milvus": "^2.0"
  },
  "require-dev": {
    "phpunit/phpunit": "^10.0",
    "drupal/core-dev": "^10.0"
  }
}
Services Registration
# my_ai_module.services.yml
# The %...% placeholders used below must be declared as container parameters.
parameters:
  my_ai_module.chromadb.url: 'http://localhost:8000'
  my_ai_module.weaviate.url: 'http://localhost:8080'

# Note: my_ai_module.vector_db, my_ai_module.llm, and my_ai_module.chunking are
# referenced below but not defined in this file. They must be registered as
# well (for example, vector_db as an alias of one provider service).
services:
  my_ai_module.chromadb:
    class: Drupal\my_ai_module\Services\ChromaDbService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.chromadb.url%'

  my_ai_module.pinecone:
    class: Drupal\my_ai_module\Services\PineconeService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.milvus:
    class: Drupal\my_ai_module\Services\MilvusService
    arguments:
      - '@logger.factory'
      - '@config.factory'

  my_ai_module.weaviate:
    class: Drupal\my_ai_module\Services\WeaviateService
    arguments:
      - '@http_client'
      - '@logger.factory'
      - '%my_ai_module.weaviate.url%'

  my_ai_module.embedding:
    class: Drupal\my_ai_module\Services\EmbeddingService
    arguments:
      - '@http_client'
      - '@config.factory'
      - '@logger.factory'

  my_ai_module.semantic_search:
    class: Drupal\my_ai_module\Services\SemanticSearchService
    arguments:
      - '@my_ai_module.embedding'
      - '@my_ai_module.vector_db'
      - '@entity_type.manager'
      - '@logger.factory'

  my_ai_module.rag:
    class: Drupal\my_ai_module\Services\RAGService
    arguments:
      - '@my_ai_module.vector_db'
      - '@my_ai_module.embedding'
      - '@my_ai_module.llm'
      - '@my_ai_module.chunking'
      - '@logger.factory'
Environment Variables
# .env.local

# Vector Database Configuration
# VECTOR_DB_PROVIDER may be pinecone, chromadb, milvus, or weaviate
VECTOR_DB_PROVIDER=pinecone
PINECONE_API_KEY=your_api_key_here
PINECONE_ENVIRONMENT=gcp-starter
PINECONE_INDEX=drupal-content

# Embedding Configuration
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
EMBEDDING_MODEL=text-embedding-3-small

# Vector Database Endpoints
CHROMADB_URL=http://localhost:8000
MILVUS_HOST=localhost
MILVUS_PORT=19530
WEAVIATE_URL=http://localhost:8080
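One possible way to consume these variables is to resolve them into a validated settings array before any service is built. The sketch below is an assumption, not part of the module: the function name resolveVectorDbSettings and the fallback to chromadb when no provider is set are both hypothetical. The variable names and default URLs come from the listing above.

```php
<?php

/**
 * Resolves vector database settings from an environment map.
 *
 * Hypothetical helper for illustration. Pass getenv() (no arguments) to read
 * the real process environment, or a plain array in tests.
 */
function resolveVectorDbSettings(array $env): array {
  $allowed = ['chromadb', 'pinecone', 'milvus', 'weaviate'];

  // Assumption: fall back to ChromaDB when no provider is configured.
  $provider = strtolower($env['VECTOR_DB_PROVIDER'] ?? 'chromadb');
  if (!in_array($provider, $allowed, true)) {
    throw new InvalidArgumentException("Unsupported provider: $provider");
  }

  // Pinecone is a hosted service, so fail early when no API key is present.
  if ($provider === 'pinecone' && empty($env['PINECONE_API_KEY'])) {
    throw new RuntimeException('PINECONE_API_KEY is required for Pinecone.');
  }

  return [
    'provider' => $provider,
    'embedding_model' => $env['EMBEDDING_MODEL'] ?? 'text-embedding-3-small',
    'chromadb_url' => $env['CHROMADB_URL'] ?? 'http://localhost:8000',
    'weaviate_url' => $env['WEAVIATE_URL'] ?? 'http://localhost:8080',
    'milvus_host' => $env['MILVUS_HOST'] ?? 'localhost',
    'milvus_port' => (int) ($env['MILVUS_PORT'] ?? 19530),
  ];
}
```

Validating once at startup surfaces a misspelled provider or a missing key immediately instead of at the first query.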
API Key Management
Use Drupal's Key module for secure storage:
<?php

namespace Drupal\my_ai_module\Services;

class SecureKeyManagement {

  /**
   * Store API key securely
   */
  public static function storeKey(
    string $keyId,
    string $keyValue,
    string $description = ''
  ): void {
    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      throw new \RuntimeException('Key module required');
    }

    $key = \Drupal::entityTypeManager()
      ->getStorage('key')
      ->create([
        'id' => $keyId,
        'label' => "API Key: $keyId",
        'description' => $description,
        'key_type' => 'authentication',
        'key_provider' => 'config',
        'key_input' => 'textarea',
      ]);
    $key->setKeyValue($keyValue);
    $key->save();
  }

  /**
   * Retrieve API key securely
   */
  public static function getKey(string $keyId): ?string {
    if (!\Drupal::moduleHandler()->moduleExists('key')) {
      // Fall back to an environment variable, e.g. "openai" -> OPENAI_API_KEY.
      // getenv() returns false when the variable is unset, so map that to null.
      $value = getenv(strtoupper($keyId) . '_API_KEY');
      return $value !== false ? $value : null;
    }

    try {
      $key = \Drupal::service('key.repository')->getKey($keyId);
      return $key ? $key->getKeyValue() : null;
    }
    catch (\Exception $e) {
      \Drupal::logger('my_ai_module')->warning('Key not found: @key', ['@key' => $keyId]);
      return null;
    }
  }

}
Performance Tuning
<?php

namespace Drupal\my_ai_module\Services;

use Drupal\Core\Cache\CacheBackendInterface;

class VectorDatabaseOptimization {

  /**
   * Batch indexing for performance
   */
  public static function batchIndex(
    VectorDatabaseInterface $db,
    array $documents,
    int $batchSize = 100
  ): void {
    $batches = array_chunk($documents, $batchSize);
    foreach ($batches as $batch) {
      $db->batchInsert($batch);
      // Small delay to avoid overwhelming the database
      usleep(100000); // 100ms
    }
  }

  /**
   * Index pruning - remove old or unused vectors
   */
  public static function pruneIndex(
    VectorDatabaseInterface $db,
    int $daysOld = 90
  ): int {
    $cutoffTime = strtotime("-$daysOld days");
    // Assumes deleteByMetadata() returns the number of deleted vectors.
    $deleted = $db->deleteByMetadata([
      'created' => ['$lt' => $cutoffTime],
    ]);
    return $deleted;
  }

  /**
   * Rebuild index with optimal settings
   */
  public static function rebuildIndex(
    VectorDatabaseInterface $db,
    string $collectionName
  ): void {
    // This is database-specific
    // For Milvus:
    // $db->compactCollection($collectionName);
    // For Pinecone:
    // Force recreation of index with optimal parameters
  }

  /**
   * Caching strategy for frequent queries
   */
  public static function cacheQueryResults(
    CacheBackendInterface $cache,
    string $query,
    array $results,
    int $ttl = 3600
  ): void {
    $cacheId = 'vector_search:' . md5($query);
    // The third argument is an absolute expiry timestamp, not a duration.
    $cache->set($cacheId, $results, time() + $ttl);
  }

  /**
   * Get cached results if available
   */
  public static function getCachedResults(
    CacheBackendInterface $cache,
    string $query
  ): ?array {
    $cacheId = 'vector_search:' . md5($query);
    $cached = $cache->get($cacheId);
    return $cached ? $cached->data : null;
  }

}
Comparison Matrix
| Feature | ChromaDB | Pinecone | Milvus | Weaviate |
|---|---|---|---|---|
| Deployment | Local/Cloud | SaaS only | Self-hosted/Cloud | Self-hosted/Cloud |
| Scaling | Limited | Excellent | Excellent | Good |
| Cost | Free | $$$ | $ | Free/$ |
| Setup Complexity | Low | Very Low | Medium | Medium |
| GraphQL Support | No | No | No | Yes |
| Hybrid Search | No | No | No | Yes |
| GPU Acceleration | No | No | Yes | No |
| Metadata Filtering | Basic | Advanced | Advanced | Good |
| Community Size | Growing | Large | Large | Growing |
| Learning Curve | Easy | Easy | Medium | Medium |
Troubleshooting
Common Issues
Vector Dimension Mismatch:
- Ensure embedding model dimension matches index configuration
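- Example: OpenAI's text-embedding-3-small returns 1536-dimensional vectors (text-embedding-3-large returns 3072), while Sentence Transformers models such as all-MiniLM-L6-v2 return 384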
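A dimension mismatch is easiest to diagnose when it is caught before the insert call. The following is a minimal sketch of such a guard; the function name assertEmbeddingDimensions is hypothetical, and the expected dimension would come from wherever the index configuration is stored.

```php
<?php

/**
 * Verifies that every embedding has the dimension the index expects.
 *
 * Hypothetical helper for illustration. Throws on the first mismatch so the
 * caller can log the offending position before anything is written.
 */
function assertEmbeddingDimensions(array $embeddings, int $expected): void {
  foreach ($embeddings as $i => $vector) {
    // Treat anything that is not an array as a zero-length vector.
    $actual = is_array($vector) ? count($vector) : 0;
    if ($actual !== $expected) {
      throw new LengthException(
        "Embedding $i has $actual dimensions; index expects $expected."
      );
    }
  }
}
```

Calling this right after the embedding service returns, and before the vector database client is invoked, turns an opaque backend error into a message that names the embedding and both dimensions.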
Query Performance Issues:
- Use an approximate nearest neighbor index such as HNSW, which Milvus, Weaviate, and ChromaDB all support
- Batch insert operations
- Implement caching for frequent queries
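When caching query results, the cache ID should cover everything that changes the result set, not only the query text. The sketch below shows one way to build such an ID; the function name buildVectorCacheId is hypothetical, and the choice of SHA-256 and of sorting only the top-level filter keys are assumptions made for illustration.

```php
<?php

/**
 * Builds a cache ID from the query text, metadata filters, and result limit.
 *
 * Hypothetical helper for illustration. Two calls share a cache entry only
 * when all three inputs are equivalent.
 */
function buildVectorCacheId(string $query, array $filters = [], int $topK = 10): string {
  // Collapse whitespace and lowercase so trivially different queries match.
  $normalized = trim(preg_replace('/\s+/', ' ', $query));
  $normalized = function_exists('mb_strtolower')
    ? mb_strtolower($normalized)
    : strtolower($normalized);

  // Sort top-level filter keys so their order does not affect the ID.
  ksort($filters);

  $payload = json_encode([$normalized, $filters, $topK]);
  return 'vector_search:' . hash('sha256', $payload);
}
```

Without the filters and limit in the ID, a cached result for one content type could be returned for a query restricted to another.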
Memory Issues:
- Use streaming/batching for large document sets
- Implement vector pruning for old/unused data
- Consider metadata filtering to reduce search scope
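The streaming advice above can be sketched with a PHP generator that yields fixed-size batches from any iterable, so only one batch is held in memory at a time. The function name batchIterable is hypothetical and is not part of the module.

```php
<?php

/**
 * Lazily groups items from any iterable into batches of $batchSize.
 *
 * Hypothetical helper for illustration. Because this is a generator, the
 * argument check runs when iteration starts, not when the function is called.
 */
function batchIterable(iterable $items, int $batchSize = 100): Generator {
  if ($batchSize < 1) {
    throw new InvalidArgumentException('Batch size must be at least 1.');
  }

  $batch = [];
  foreach ($items as $item) {
    $batch[] = $item;
    if (count($batch) === $batchSize) {
      yield $batch;
      $batch = [];
    }
  }

  // Emit the final, possibly smaller, batch.
  if ($batch !== []) {
    yield $batch;
  }
}
```

Unlike array_chunk(), which needs the full array up front, this accepts another generator as input, for example one that loads entities page by page.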
Code Examples Summary
This guide includes production-ready implementations for:
- Vector database abstraction services
- Semantic search with multi-stage ranking
- RAG with multiple chunking strategies
- Configuration management and API key handling
- Performance optimization and caching
- Docker deployment configurations
All code follows Drupal coding standards and integrates seamlessly with Drupal's services architecture.