Vector Databases
Pinecone, Chroma, FAISS: the storage layer for semantic search.
Suppose you need the 5 most similar documents out of 1 million. Naive cosine similarity means 1 million comparisons for every query. At 100 million documents, a brute-force scan is far too slow for interactive use. Vector databases grew out of this problem: they use ANN (Approximate Nearest Neighbor) algorithms to run similarity search over collections of up to billions of vectors in milliseconds. Pinecone, Chroma, FAISS, Qdrant, and pgvector are the storage backbone of RAG.
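To make the cost of the naive approach concrete, here is a minimal pure-Python sketch of brute-force top-k search. The function names and the toy 3-dimensional vectors are illustrative assumptions, not part of any library; real embeddings have hundreds or thousands of dimensions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for zero vectors in this sketch
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k=5):
    """Score every stored vector against the query: O(N * d) work per query."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (similarity, index), best first


# Toy vectors (hypothetical data, not real embeddings).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
top = brute_force_top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print([i for _, i in top])  # indices of the two closest vectors
```

Every query touches every vector, so the work grows linearly with the collection size. That linear scan is what an ANN index avoids.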
A vector database stores vectors and is optimized for fast similarity search through an ANN index. Key algorithms: HNSW (Hierarchical Navigable Small World), IVF (Inverted File), and Annoy. The trade-off is 100% accuracy versus speed: ANN typically keeps about 99% accuracy while running on the order of 1000x faster than an exact scan, though the exact figures depend on the dataset and the index settings. Modern vector databases also support hybrid search (vector plus keyword), metadata filtering, and multi-tenancy.
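The accuracy side of this trade-off is commonly reported as recall@k: the share of the true top-k neighbors that the approximate search actually returned. A minimal sketch, with a hypothetical helper name and made-up document ids:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical ids: the exact top-5 versus what an ANN index returned.
exact_top5 = ["d1", "d2", "d3", "d4", "d5"]
approx_top5 = ["d1", "d2", "d3", "d5", "d9"]
recall = recall_at_k(exact_top5, approx_top5)
print(recall)  # 4 of the 5 true neighbors were found
```

In practice you would compute the exact top-k with a brute-force scan on a sample of queries and compare it against the index output while tuning index parameters.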
Think of a library. Naive search means picking up every single book and comparing it. ANN means the books were arranged into groups of similar titles ahead of time, so when a query arrives only the right group is searched. HNSW builds a graph in which similar vectors are connected to each other, and at query time it walks that graph to reach the query's nearest neighbors quickly.
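The "search only the right group" idea can be sketched in a few lines. This is an IVF-style toy, not HNSW and not how any particular library implements it; the centroids are fixed by hand here, whereas a real IVF index would typically learn them with k-means. All names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_buckets(vectors, centroids):
    """Assign every vector index to its nearest centroid (the 'shelf')."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k=3, n_probe=1):
    """Search only the n_probe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k]


# Toy 2-D data with two obvious groups (hypothetical values).
vectors = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],  # group near (0, 0)
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],  # group near (5, 5)
]
centroids = [[0.1, 0.1], [5.0, 5.0]]  # hand-picked for this sketch
buckets = build_buckets(vectors, centroids)
nearest = ivf_search([4.9, 5.0], vectors, centroids, buckets, k=2)
print(nearest)  # only the second group was scanned
```

Raising `n_probe` scans more groups, which recovers neighbors that sit near a group boundary at the cost of more distance computations. That knob is one concrete form of the accuracy-versus-speed trade-off.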
import os

import chromadb
from openai import OpenAI

# OpenAI-compatible client pointed at the gateway; the API key comes from the environment.
client_ai = OpenAI(
    base_url="https://ai.gateway.lovable.dev/v1",
    api_key=os.environ["LOVABLE_API_KEY"],
)


def embed(texts):
    """Return one embedding vector per input string."""
    res = client_ai.embeddings.create(model="google/gemini-embedding-001", input=texts)
    return [d.embedding for d in res.data]


# In-process Chroma client: nothing to install or run separately.
chroma = chromadb.Client()
collection = chroma.create_collection(name="articles")

docs = [
    "Python is a popular programming language for AI and data science.",
    "FastAPI is a modern Python web framework with automatic OpenAPI docs.",
    "Machine learning models predict outcomes from labeled training data.",
    "Bangladesh cricket team won an exciting match against India yesterday.",
    "The Tokyo Olympics featured swimming, athletics, and gymnastics events.",
]
ids = [f"doc_{i}" for i in range(len(docs))]
# The first three documents are tagged "tech", the rest "sports".
metadatas = [{"topic": "tech" if i < 3 else "sports"} for i in range(len(docs))]
collection.add(documents=docs, embeddings=embed(docs), ids=ids, metadatas=metadatas)

query = "Tell me about AI frameworks"
results = collection.query(
    query_embeddings=embed([query]),
    n_results=3,              # top-k
    where={"topic": "tech"},  # metadata filter
)

# Each result field holds one inner list per query; index 0 is our single query.
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.3f}] {doc}")

This uses Chroma in embedded mode (zero setup, in-process). The code creates a collection and inserts documents, embeddings, and metadata. query() takes the query embedding, a metadata filter (where), and the top-k count (n_results). The result contains the matching documents and their distances, where a smaller distance means a closer match. In production, use persistent mode or a hosted Chroma deployment.
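The result of query() is a dictionary of nested lists, with one inner list per query embedding. The sketch below shows one way to flatten that shape into per-hit records. The helper name is hypothetical, and the sample dictionary is a hand-written stand-in with made-up distances, not real output from the example.

```python
def hits_for_query(results, query_index=0):
    """Turn one query's slice of a Chroma-style result dict into a list of records."""
    return [
        {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        for doc_id, doc, meta, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


# Hand-written stand-in for a query() result; the distances are made up.
sample_results = {
    "ids": [["doc_0", "doc_1"]],
    "documents": [[
        "Python is a popular programming language for AI and data science.",
        "FastAPI is a modern Python web framework with automatic OpenAPI docs.",
    ]],
    "metadatas": [[{"topic": "tech"}, {"topic": "tech"}]],
    "distances": [[0.41, 0.57]],
}
hits = hits_for_query(sample_results)
for hit in hits:
    print(hit["id"], hit["metadata"]["topic"], hit["distance"])
```

Records like these are easier to serialize in an API response than the parallel lists.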
Ingest 1000+ news articles into pgvector. Build a FastAPI endpoint that takes a query, an optional date filter, and an optional category filter, and returns the top-10 relevant articles. The frontend is a simple search box.
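One possible starting point for the optional filters is a small function that builds a parameterized SQL query. This is a sketch under stated assumptions: the table `articles` and its columns (`id`, `title`, `category`, `published_at`, `embedding`) are hypothetical, the placeholders use the `%(name)s` style accepted by psycopg, and `<=>` is pgvector's cosine distance operator. The caller is expected to add `query_vec` to the parameters, for example as a string such as "[0.1, 0.2, 0.3]".

```python
from datetime import date


def build_search_sql(category=None, published_after=None, limit=10):
    """Build a parameterized pgvector query with optional filters.

    Assumes a hypothetical table:
    articles(id, title, category, published_at, embedding vector)
    """
    conditions = []
    params = {"limit": limit}
    if category is not None:
        conditions.append("category = %(category)s")
        params["category"] = category
    if published_after is not None:
        conditions.append("published_at >= %(published_after)s")
        params["published_after"] = published_after
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    sql = (
        "SELECT id, title, embedding <=> %(query_vec)s::vector AS distance "
        "FROM articles "
        f"{where}"
        "ORDER BY embedding <=> %(query_vec)s::vector "
        "LIMIT %(limit)s"
    )
    return sql, params


sql, params = build_search_sql(category="sports", published_after=date(2024, 1, 1))
print(sql)
print(params)
```

Keeping user input in `params` rather than formatting it into the SQL string avoids injection. Only the fixed condition fragments are concatenated.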