A fully working Vector Database and RAG pipeline built from first principles — no Pinecone, no Chroma, no shortcuts. HNSW, KD-Tree, and Brute Force search implemented in pure Python with a FastAPI backend and a live web UI.
- Built HNSW vector search from scratch in Python — same algorithm used by Pinecone, Weaviate, and Chroma
- Implemented KD-Tree and Brute Force search side-by-side for live algorithm benchmarking
- Developed a complete Retrieval-Augmented Generation (RAG) pipeline with streaming token output via SSE
- Integrated local LLMs via Ollama (Mistral / Llama3.2) — no API keys, fully offline
- Added built-in RAG evaluation endpoint scoring context hit rate and retrieval distance
- Implemented PDF/TXT ingestion, overlapping chunking, embedding, and semantic search end-to-end
| Feature | Description |
|---|---|
| 3 Search Algorithms | HNSW (production-grade), KD-Tree, Brute Force — run all three and compare speed |
| 3 Distance Metrics | Cosine similarity, Euclidean distance, Manhattan distance |
| 16D Demo Vectors | 50 pre-loaded semantic vectors across 4 categories (CS, Math, Food, Sports) |
| 2D PCA Scatter Plot | Live visualization of semantic space — watch clusters form |
| Real Document Embedding | Paste any text → Ollama embeds it with nomic-embed-text (768D) |
| RAG Pipeline | Ask questions about your documents → HNSW retrieves context → local LLM answers |
| Full REST API | CRUD endpoints: insert, delete, search, benchmark, hnsw-info |
| PDF & TXT Upload | Upload files directly — text extracted, chunked, and embedded automatically |
| Streaming Responses | /doc/ask streams tokens word-by-word via Server-Sent Events (SSE) |
| Category Filter | Filter search results by category: ?category=cs |
| Similarity Threshold | Control RAG retrieval strictness via max_distance parameter |
| Numpy Optimized | Distance functions (Euclidean, Cosine, Manhattan) use vectorized numpy operations |
| Persistent Storage | Document index survives server restarts via JSON serialization |
| RAG Evaluation | Built-in /doc/evaluate endpoint scores retrieval quality without external tools |
Your Text
│
▼
Ollama (nomic-embed-text) ← converts text to a 768-dimensional vector
│
▼
HNSW Index (Python) ← indexes the vector in a multilayer graph
│
▼
Semantic Search ← finds nearest neighbors in vector space
│
▼
Ollama (mistral) ← reads retrieved chunks, generates an answer
│
▼
Answer (streamed token by token via SSE)
HNSW (Hierarchical Navigable Small World) is the same algorithm used by Pinecone, Weaviate, Chroma, and Milvus. It builds a multilayer graph where each layer is progressively sparser — searches start at the top layer and zoom in, achieving O(log N) complexity instead of O(N) for brute force.
Production vector databases like Pinecone and Weaviate abstract away almost everything — you call an API and results appear. That's great for shipping products, but it makes it nearly impossible to reason about why a search returns what it does, how the index degrades under certain distributions, or what tradeoffs were made in the retrieval layer. I built this project to remove that abstraction entirely: every distance function, every graph traversal, every RAG prompt construction is written from scratch in readable Python. The goal was to be able to look at a bad retrieval result and trace it all the way back to a specific algorithmic decision — and that's now possible here.
| Feature | This Project | Pinecone | Chroma | Weaviate |
|---|---|---|---|---|
| Algorithm | HNSW (hand-coded) + KD-Tree + Brute Force | HNSW (Rust/C++) | HNSW (hnswlib) | HNSW (Go) |
| Embedding | Ollama local models | OpenAI / bring your own | OpenAI / bring your own | OpenAI / bring your own |
| Persistence | JSON file | Managed cloud | SQLite / cloud | RocksDB / cloud |
| Filtering | Category filter (in-memory) | Metadata filters (indexed) | where clause |
GraphQL filters |
| Scalability | Single process, in-memory | Millions of vectors, distributed | Millions of vectors | Billions of vectors |
| Streaming | SSE token streaming | Not applicable | Not applicable | Not applicable |
| Introspection | HNSW graph inspector endpoint | None | None | None |
| Purpose | Education / portfolio | Production SaaS | Local dev / prototyping | Enterprise production |
The key difference: this project exposes internal state (layer counts, edge counts, graph topology) that production systems hide entirely. It also runs a side-by-side benchmark of all three search algorithms in real time — something no production system offers because they've already committed to one algorithm.
| Layer | Technology | Why |
|---|---|---|
| Backend | Python 3.10+, FastAPI | Async-native, minimal boilerplate |
| Search Algorithms | Pure Python (HNSW, KD-Tree, BruteForce) | Built from scratch to understand internals |
| Distance Metrics | NumPy | Vectorized ops — np.linalg.norm, np.dot, np.abs |
| Embeddings | Ollama + nomic-embed-text |
Local, free, 768D semantic embeddings |
| LLM | Ollama + mistral |
Local inference, no API keys required |
| PDF Parsing | PyMuPDF (fitz) |
Fast, reliable text extraction |
| Streaming | Server-Sent Events (SSE) | Token-by-token streaming like ChatGPT |
| Frontend | Vanilla HTML/JS | PCA scatter plot, benchmark chart, chat UI |
| Testing | pytest | 40+ unit tests, no mocking required |
-
HNSW tradeoffs are non-obvious. The
M(max connections) andef_constructionparameters have a direct impact on recall vs. build time. A higheref_constructionfinds better neighbors during indexing but slows inserts. At low values, the graph develops "weak bridges" between clusters that hurt recall on edge cases. -
KD-Trees fall apart in high dimensions. The ball-within-hyperslab pruning that makes KD-Trees fast at 16D is almost entirely useless at 768D — nearly every subtree needs to be visited, making it equivalent to brute force. Implementing both side-by-side made this viscerally clear rather than just theoretically understood.
-
RAG quality is dominated by chunking strategy, not the LLM. The
chunk_textfunction'soverlap_wordsparameter matters more than model choice: too little overlap and a key sentence gets cut across two chunks; too much and the index bloats with redundant embeddings that pollute top-k retrieval. -
SSE streaming requires careful generator design. The event stream sends three event types (
context,token,done) and the generator must handle mid-stream Ollama disconnects gracefully. Getting theStreamingResponseandevent_stream()generator to compose correctly in FastAPI took multiple iterations. -
NumPy vectorization is worth the dependency. Replacing a Python
forloop distance calculation withnp.linalg.norm(np.array(a) - np.array(b))cut benchmark times by 4–6x on 768D vectors. The difference is negligible at 16D but significant when the document index grows.
You need 3 things installed on your machine:
- Python 3.10+ (download from python.org)
- Git
- Ollama (runs the local AI models)
- Go to https://python.org/downloads and download Python 3.10 or newer
- Run the installer — check "Add Python to PATH" before clicking Install
- Verify in PowerShell / terminal:
python --version- Go to https://git-scm.com/download/win and download Git for Windows
- Run the installer with default settings
- Verify in PowerShell:
git --version- Go to https://ollama.com and click Download for Windows
- Run the installer
- Ollama starts automatically in the system tray
- Open PowerShell and pull the two required models:
ollama pull nomic-embed-text(~274 MB — this is the embedding model)
ollama pull mistral(~4 GB — this is the language model)
- Verify Ollama is running:
ollama listYou should see both models listed.
Minimum specs for Ollama: 8GB RAM recommended. The models will use ~3GB total.
Open PowerShell / terminal and run:
git clone https://github.com/rajadityaaa/YourOwnAI.git
cd YourOwnAIInside the project folder, run:
pip install fastapi uvicorn requests numpy pymupdf python-multipartOr use the requirements file:
pip install -r requirements.txtTroubleshooting:
pip: command not found→ Python not in PATH, redo Step 1ModuleNotFoundError: fitz→ runpip install pymupdf422 Unprocessable Entityon file upload → runpip install python-multipart
Terminal 1 — Start Ollama (if not already running):
ollama serve(If Ollama is already in the system tray on Windows, skip this)
Terminal 2 — Start the Python server:
python main.pyYou should see:
=== VectorDB Engine ===
http://localhost:8080
50 demo vectors | 16 dims | HNSW+KD-Tree+BruteForce
Ollama: ONLINE
embed model: nomic-embed-text gen model: mistral
Open your browser and go to:
http://localhost:8080
- Type any concept in the search box:
binary tree,sushi,basketball,calculus - Choose your algorithm: HNSW, KD-Tree, or Brute Force
- Choose distance metric: Cosine, Euclidean, or Manhattan
- Click ⚡ SEARCH — results appear with distances, the matching point glows on the scatter plot
- Click ▶ COMPARE ALL ALGOS to run all 3 algorithms and compare their speed
The scatter plot shows all 50 vectors projected to 2D using PCA. Notice how the 4 semantic categories (CS, Math, Food, Sports) form distinct clusters — this is what "semantic similarity" looks like visually.
Category filter: Add ?category=cs to filter results to one category only.
This uses Ollama to generate real 768-dimensional embeddings from any text or file.
Option A — Paste text:
- Type a title (e.g.,
Operating Systems Notes) - Paste any text — lecture notes, textbook paragraphs, Wikipedia articles
- Click ⚡ EMBED & INSERT
Option B — Upload a file:
- Click the PDF or TXT tab in the Upload File section
- Click or drag-and-drop your file
- Click ⚡ UPLOAD & EMBED
Long documents are automatically split into overlapping 250-word chunks. Each chunk gets its own embedding and is stored in a separate HNSW index.
- Make sure you have inserted some documents in Tab 2 first
- Type a question about your documents
- Click 🤖 ASK AI
What happens behind the scenes:
1. Your question → embedded with nomic-embed-text (768D vector)
2. HNSW search → finds 3 most semantically similar chunks
3. Retrieved chunks → sent as context to mistral
4. mistral → generates an answer based only on your documents
The answer streams in token by token via Server-Sent Events (SSE) — just like ChatGPT. Click the context chips below the answer to see exactly which document chunks were used to generate it.
You can also control retrieval strictness by passing "max_distance": 0.5 (stricter) or 0.9 (looser) in the request body.
The server exposes a full REST API at http://localhost:8080.
| Method | Endpoint | Description |
|---|---|---|
GET |
/search?v=f1,f2,...&k=5&metric=cosine&algo=hnsw |
K-NN search |
POST |
/insert |
Insert a demo vector |
DELETE |
/delete/:id |
Delete by ID |
GET |
/items |
List all demo vectors |
GET |
/benchmark?v=...&k=5&metric=cosine |
Compare all 3 algorithms |
GET |
/hnsw-info |
HNSW graph structure and layer stats |
GET |
/stats |
Database statistics |
| Method | Endpoint | Body | Description |
|---|---|---|---|
POST |
/doc/insert |
{"title":"...","text":"..."} |
Embed and store document |
POST |
/doc/upload-pdf |
multipart/form-data |
Upload and embed a PDF file |
POST |
/doc/upload-txt |
multipart/form-data |
Upload and embed a TXT file |
GET |
/doc/list |
— | List all stored documents |
DELETE |
/doc/delete/:id |
— | Delete document chunk |
POST |
/doc/search |
{"question":"...","max_distance":0.7} |
Semantic search only (no LLM) |
POST |
/doc/ask |
{"question":"...","k":3,"max_distance":0.7} |
RAG: streaming token response |
POST |
/doc/ask-sync |
{"question":"...","k":3} |
RAG: full response (non-streaming) |
POST |
/doc/evaluate |
{"pairs":[{"question":"...","expected_answer":"..."}]} |
Evaluate RAG retrieval quality |
GET |
/status |
— | Ollama status and model info |
# Search all categories
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&metric=cosine&algo=hnsw"
# Filter by category
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&category=cs"# Streaming (default)
curl -X POST http://localhost:8080/doc/ask \
-H "Content-Type: application/json" \
-d '{"question":"What is dynamic programming?","k":3}'
# Non-streaming (sync)
curl -X POST http://localhost:8080/doc/ask-sync \
-H "Content-Type: application/json" \
-d '{"question":"What is dynamic programming?","k":3,"max_distance":0.5}'YourOwnAI/
├── main.py ← Python backend (HNSW, KD-Tree, BruteForce, FastAPI, RAG)
├── index.html ← Frontend (PCA scatter plot, chat UI, benchmark chart)
├── test_search.py ← pytest suite (40+ tests, no mocking, no Ollama required)
├── requirements.txt← Python dependencies
└── README.md ← This file
BruteForce O(N·d) Exact, baseline
KDTree O(log N) Exact, axis-aligned partitioning
HNSW O(log N) Approximate, multilayer small-world graph
VectorDB Unified interface over all 3 (16D demo vectors)
DocumentDB HNSW-only index for real Ollama embeddings (768D+)
OllamaClient HTTP client → /api/embeddings + /api/generate (streaming)
FastAPI REST server with CORS, file upload, SSE streaming
Nodes are inserted into a multilayer graph. Each node randomly gets assigned a maximum layer. Layer 0 has all nodes with many connections; higher layers have fewer nodes (exponentially fewer) with longer-range connections.
Insert: Start at the top layer, greedily find the nearest node, drop a layer, repeat. At each layer from your assigned max down to 0, run a beam search (ef_construction=200) and connect to the M nearest neighbors bidirectionally.
Search: Same greedy descent from top layer. At layer 0, expand to ef nearest candidates using a priority queue.
Why it's fast: The upper layers act like a highway — you quickly get to the right neighborhood, then zoom in at layer 0.
Binary space partitioning. Each node splits space along one dimension (cycling through all dimensions). Search prunes entire subtrees when the closest possible point in that subtree can't beat the current best — the "ball within hyperslab" check.
Weakness: Degrades with high dimensions (curse of dimensionality). Works well for ≤20D, becomes close to brute force at 768D.
KD-Tree pruning relies on axis-aligned distance bounds. In high dimensions, almost all the space is near the boundary of the hypersphere — no subtrees get pruned. HNSW's graph-based approach doesn't have this problem.
| Problem | Fix |
|---|---|
Ollama: OFFLINE in header |
Run ollama serve in a terminal |
| Embedding takes forever | Ollama is downloading the model on first use, wait 2 min |
ModuleNotFoundError |
Run pip install -r requirements.txt |
| Port 8080 already in use | Kill the process: `netstat -ano |
| LLM answer is slow | Normal — mistral takes 10–30s on a laptop CPU. Use a smaller model for faster answers |
If mistral is too slow on your laptop, switch to llama3.2:1b:
ollama pull llama3.2:1bThen edit main.py where gen_model is set in OllamaClient.__init__:
self.gen_model = "llama3.2:1b" # change this lineRestart the server — no recompile needed.
- OpenAI embeddings support — swap
nomic-embed-textfortext-embedding-3-smallvia API key, enabling cloud-hosted deployment without Ollama - Persistent storage — replace the JSON flat file with SQLite for proper ACID transactions, faster startup with large document sets, and concurrent write safety
- Docker deployment —
docker-compose.ymlbundling the FastAPI server + Ollama in a single-command setup for reproducible demos - Fine-tuning integration — experiment with fine-tuned embedding models to show how domain-specific tuning shifts cluster geometry in the PCA scatter plot
- Multi-vector search — support querying by multiple vectors simultaneously (e.g., average embeddings of a multi-sentence query) to improve RAG recall
pip install pytest
pytest test_search.py -vAll 40+ tests run without Ollama or a running server — the test suite imports the algorithm classes directly and tests them in isolation.
MIT — use this however you want.





