YourOwnAI — Vector Database + RAG Pipeline from Scratch in Python

A fully working Vector Database and RAG pipeline built from first principles — no Pinecone, no Chroma, no shortcuts. HNSW, KD-Tree, and Brute Force search implemented in pure Python with a FastAPI backend and a live web UI.

Screenshots

Vector Search Dashboard

Document Upload & Indexing

Documents Indexed

RAG Question Answering

Evaluation Metrics

Key Achievements

Built HNSW vector search from scratch in Python — same algorithm used by Pinecone, Weaviate, and Chroma
Implemented KD-Tree and Brute Force search side-by-side for live algorithm benchmarking
Developed a complete Retrieval-Augmented Generation (RAG) pipeline with streaming token output via SSE
Integrated local LLMs via Ollama (Mistral / Llama3.2) — no API keys, fully offline
Added built-in RAG evaluation endpoint scoring context hit rate and retrieval distance
Implemented PDF/TXT ingestion, overlapping chunking, embedding, and semantic search end-to-end

What This Project Does

Feature	Description
3 Search Algorithms	HNSW (production-grade), KD-Tree, Brute Force — run all three and compare speed
3 Distance Metrics	Cosine similarity, Euclidean distance, Manhattan distance
16D Demo Vectors	50 pre-loaded semantic vectors across 4 categories (CS, Math, Food, Sports)
2D PCA Scatter Plot	Live visualization of semantic space — watch clusters form
Real Document Embedding	Paste any text → Ollama embeds it with `nomic-embed-text` (768D)
RAG Pipeline	Ask questions about your documents → HNSW retrieves context → local LLM answers
Full REST API	CRUD endpoints: insert, delete, search, benchmark, hnsw-info
PDF & TXT Upload	Upload files directly — text extracted, chunked, and embedded automatically
Streaming Responses	`/doc/ask` streams tokens word-by-word via Server-Sent Events (SSE)
Category Filter	Filter search results by category: `?category=cs`
Similarity Threshold	Control RAG retrieval strictness via `max_distance` parameter
Numpy Optimized	Distance functions (Euclidean, Cosine, Manhattan) use vectorized numpy operations
Persistent Storage	Document index survives server restarts via JSON serialization
RAG Evaluation	Built-in `/doc/evaluate` endpoint scores retrieval quality without external tools

How It Works

Your Text
    │
    ▼
Ollama (nomic-embed-text)          ← converts text to a 768-dimensional vector
    │
    ▼
HNSW Index (Python)                ← indexes the vector in a multilayer graph
    │
    ▼
Semantic Search                    ← finds nearest neighbors in vector space
    │
    ▼
Ollama (mistral)                   ← reads retrieved chunks, generates an answer
    │
    ▼
Answer (streamed token by token via SSE)

HNSW (Hierarchical Navigable Small World) is the same algorithm used by Pinecone, Weaviate, Chroma, and Milvus. It builds a multilayer graph where each layer is progressively sparser — searches start at the top layer and zoom in, achieving O(log N) complexity instead of O(N) for brute force.

Why I Built This

Production vector databases like Pinecone and Weaviate abstract away almost everything — you call an API and results appear. That's great for shipping products, but it makes it nearly impossible to reason about why a search returns what it does, how the index degrades under certain distributions, or what tradeoffs were made in the retrieval layer. I built this project to remove that abstraction entirely: every distance function, every graph traversal, every RAG prompt construction is written from scratch in readable Python. The goal was to be able to look at a bad retrieval result and trace it all the way back to a specific algorithmic decision — and that's now possible here.

How It Compares to Production Systems

Feature	This Project	Pinecone	Chroma	Weaviate
Algorithm	HNSW (hand-coded) + KD-Tree + Brute Force	HNSW (Rust/C++)	HNSW (hnswlib)	HNSW (Go)
Embedding	Ollama local models	OpenAI / bring your own	OpenAI / bring your own	OpenAI / bring your own
Persistence	JSON file	Managed cloud	SQLite / cloud	RocksDB / cloud
Filtering	Category filter (in-memory)	Metadata filters (indexed)	`where` clause	GraphQL filters
Scalability	Single process, in-memory	Millions of vectors, distributed	Millions of vectors	Billions of vectors
Streaming	SSE token streaming	Not applicable	Not applicable	Not applicable
Introspection	HNSW graph inspector endpoint	None	None	None
Purpose	Education / portfolio	Production SaaS	Local dev / prototyping	Enterprise production

The key difference: this project exposes internal state (layer counts, edge counts, graph topology) that production systems hide entirely. It also runs a side-by-side benchmark of all three search algorithms in real time — something no production system offers because they've already committed to one algorithm.

Tech Stack

Layer	Technology	Why
Backend	Python 3.10+, FastAPI	Async-native, minimal boilerplate
Search Algorithms	Pure Python (HNSW, KD-Tree, BruteForce)	Built from scratch to understand internals
Distance Metrics	NumPy	Vectorized ops — `np.linalg.norm`, `np.dot`, `np.abs`
Embeddings	Ollama + `nomic-embed-text`	Local, free, 768D semantic embeddings
LLM	Ollama + `mistral`	Local inference, no API keys required
PDF Parsing	PyMuPDF (`fitz`)	Fast, reliable text extraction
Streaming	Server-Sent Events (SSE)	Token-by-token streaming like ChatGPT
Frontend	Vanilla HTML/JS	PCA scatter plot, benchmark chart, chat UI
Testing	pytest	40+ unit tests, no mocking required

What I Learned

HNSW tradeoffs are non-obvious. The M (max connections) and ef_construction parameters have a direct impact on recall vs. build time. A higher ef_construction finds better neighbors during indexing but slows inserts. At low values, the graph develops "weak bridges" between clusters that hurt recall on edge cases.
KD-Trees fall apart in high dimensions. The ball-within-hyperslab pruning that makes KD-Trees fast at 16D is almost entirely useless at 768D — nearly every subtree needs to be visited, making it equivalent to brute force. Implementing both side-by-side made this viscerally clear rather than just theoretically understood.
RAG quality is dominated by chunking strategy, not the LLM. The chunk_text function's overlap_words parameter matters more than model choice: too little overlap and a key sentence gets cut across two chunks; too much and the index bloats with redundant embeddings that pollute top-k retrieval.
SSE streaming requires careful generator design. The event stream sends three event types (context, token, done) and the generator must handle mid-stream Ollama disconnects gracefully. Getting the StreamingResponse and event_stream() generator to compose correctly in FastAPI took multiple iterations.
NumPy vectorization is worth the dependency. Replacing a Python for loop distance calculation with np.linalg.norm(np.array(a) - np.array(b)) cut benchmark times by 4–6x on 768D vectors. The difference is negligible at 16D but significant when the document index grows.

Prerequisites

You need 3 things installed on your machine:

Python 3.10+ (download from python.org)
Git
Ollama (runs the local AI models)

Step-by-Step Setup

Step 1 — Install Python 3.10+

Go to https://python.org/downloads and download Python 3.10 or newer
Run the installer — check "Add Python to PATH" before clicking Install
Verify in PowerShell / terminal:

python --version

Step 2 — Install Git

Go to https://git-scm.com/download/win and download Git for Windows
Run the installer with default settings
Verify in PowerShell:

git --version

Step 3 — Install Ollama (Local AI Models)

Go to https://ollama.com and click Download for Windows
Run the installer
Ollama starts automatically in the system tray
Open PowerShell and pull the two required models:

ollama pull nomic-embed-text

(~274 MB — this is the embedding model)

ollama pull mistral

(~4 GB — this is the language model)

Verify Ollama is running:

ollama list

You should see both models listed.

Minimum specs for Ollama: 8GB RAM recommended. The models will use ~3GB total.

Step 4 — Clone the Repository

Open PowerShell / terminal and run:

git clone https://github.com/rajadityaaa/YourOwnAI.git
cd YourOwnAI

Step 5 — Install Python Dependencies

Inside the project folder, run:

pip install fastapi uvicorn requests numpy pymupdf python-multipart

Or use the requirements file:

pip install -r requirements.txt

Troubleshooting:

pip: command not found → Python not in PATH, redo Step 1

ModuleNotFoundError: fitz → run pip install pymupdf

422 Unprocessable Entity on file upload → run pip install python-multipart

Step 6 — Run Everything

Terminal 1 — Start Ollama (if not already running):

ollama serve

(If Ollama is already in the system tray on Windows, skip this)

Terminal 2 — Start the Python server:

python main.py

You should see:

=== VectorDB Engine ===
http://localhost:8080
50 demo vectors | 16 dims | HNSW+KD-Tree+BruteForce
Ollama: ONLINE
  embed model: nomic-embed-text  gen model: mistral

Open your browser and go to:

http://localhost:8080

Using the Application

Tab 1: Search (Demo Vectors)

Type any concept in the search box: binary tree, sushi, basketball, calculus
Choose your algorithm: HNSW, KD-Tree, or Brute Force
Choose distance metric: Cosine, Euclidean, or Manhattan
Click ⚡ SEARCH — results appear with distances, the matching point glows on the scatter plot
Click ▶ COMPARE ALL ALGOS to run all 3 algorithms and compare their speed

The scatter plot shows all 50 vectors projected to 2D using PCA. Notice how the 4 semantic categories (CS, Math, Food, Sports) form distinct clusters — this is what "semantic similarity" looks like visually.

Category filter: Add ?category=cs to filter results to one category only.

Tab 2: Documents (Real Embeddings)

This uses Ollama to generate real 768-dimensional embeddings from any text or file.

Option A — Paste text:

Type a title (e.g., Operating Systems Notes)
Paste any text — lecture notes, textbook paragraphs, Wikipedia articles
Click ⚡ EMBED & INSERT

Option B — Upload a file:

Click the PDF or TXT tab in the Upload File section
Click or drag-and-drop your file
Click ⚡ UPLOAD & EMBED

Long documents are automatically split into overlapping 250-word chunks. Each chunk gets its own embedding and is stored in a separate HNSW index.

Tab 3: Ask AI (RAG Pipeline)

Make sure you have inserted some documents in Tab 2 first
Type a question about your documents
Click 🤖 ASK AI

What happens behind the scenes:

1. Your question → embedded with nomic-embed-text (768D vector)
2. HNSW search → finds 3 most semantically similar chunks
3. Retrieved chunks → sent as context to mistral
4. mistral → generates an answer based only on your documents

The answer streams in token by token via Server-Sent Events (SSE) — just like ChatGPT. Click the context chips below the answer to see exactly which document chunks were used to generate it.

You can also control retrieval strictness by passing "max_distance": 0.5 (stricter) or 0.9 (looser) in the request body.

REST API Reference

The server exposes a full REST API at http://localhost:8080.

Demo Vector Endpoints

Method	Endpoint	Description
`GET`	`/search?v=f1,f2,...&k=5&metric=cosine&algo=hnsw`	K-NN search
`POST`	`/insert`	Insert a demo vector
`DELETE`	`/delete/:id`	Delete by ID
`GET`	`/items`	List all demo vectors
`GET`	`/benchmark?v=...&k=5&metric=cosine`	Compare all 3 algorithms
`GET`	`/hnsw-info`	HNSW graph structure and layer stats
`GET`	`/stats`	Database statistics

Document & RAG Endpoints

Method	Endpoint	Body	Description
`POST`	`/doc/insert`	`{"title":"...","text":"..."}`	Embed and store document
`POST`	`/doc/upload-pdf`	`multipart/form-data`	Upload and embed a PDF file
`POST`	`/doc/upload-txt`	`multipart/form-data`	Upload and embed a TXT file
`GET`	`/doc/list`	—	List all stored documents
`DELETE`	`/doc/delete/:id`	—	Delete document chunk
`POST`	`/doc/search`	`{"question":"...","max_distance":0.7}`	Semantic search only (no LLM)
`POST`	`/doc/ask`	`{"question":"...","k":3,"max_distance":0.7}`	RAG: streaming token response
`POST`	`/doc/ask-sync`	`{"question":"...","k":3}`	RAG: full response (non-streaming)
`POST`	`/doc/evaluate`	`{"pairs":[{"question":"...","expected_answer":"..."}]}`	Evaluate RAG retrieval quality
`GET`	`/status`	—	Ollama status and model info

Example: Search via curl

# Search all categories
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&metric=cosine&algo=hnsw"

# Filter by category
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&category=cs"

Example: Ask a question via curl

# Streaming (default)
curl -X POST http://localhost:8080/doc/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"What is dynamic programming?","k":3}'

# Non-streaming (sync)
curl -X POST http://localhost:8080/doc/ask-sync \
  -H "Content-Type: application/json" \
  -d '{"question":"What is dynamic programming?","k":3,"max_distance":0.5}'

Project Structure

YourOwnAI/
├── main.py         ← Python backend (HNSW, KD-Tree, BruteForce, FastAPI, RAG)
├── index.html      ← Frontend (PCA scatter plot, chat UI, benchmark chart)
├── test_search.py  ← pytest suite (40+ tests, no mocking, no Ollama required)
├── requirements.txt← Python dependencies
└── README.md       ← This file

Architecture (main.py)

BruteForce          O(N·d)      Exact, baseline
KDTree              O(log N)    Exact, axis-aligned partitioning
HNSW                O(log N)    Approximate, multilayer small-world graph

VectorDB            Unified interface over all 3 (16D demo vectors)
DocumentDB          HNSW-only index for real Ollama embeddings (768D+)
OllamaClient        HTTP client → /api/embeddings + /api/generate (streaming)
FastAPI             REST server with CORS, file upload, SSE streaming

Algorithm Deep Dive

HNSW (Hierarchical Navigable Small World)

Nodes are inserted into a multilayer graph. Each node randomly gets assigned a maximum layer. Layer 0 has all nodes with many connections; higher layers have fewer nodes (exponentially fewer) with longer-range connections.

Insert: Start at the top layer, greedily find the nearest node, drop a layer, repeat. At each layer from your assigned max down to 0, run a beam search (ef_construction=200) and connect to the M nearest neighbors bidirectionally.

Search: Same greedy descent from top layer. At layer 0, expand to ef nearest candidates using a priority queue.

Why it's fast: The upper layers act like a highway — you quickly get to the right neighborhood, then zoom in at layer 0.

KD-Tree (K-Dimensional Tree)

Binary space partitioning. Each node splits space along one dimension (cycling through all dimensions). Search prunes entire subtrees when the closest possible point in that subtree can't beat the current best — the "ball within hyperslab" check.

Weakness: Degrades with high dimensions (curse of dimensionality). Works well for ≤20D, becomes close to brute force at 768D.

Why HNSW Wins at High Dimensions

KD-Tree pruning relies on axis-aligned distance bounds. In high dimensions, almost all the space is near the boundary of the hypersphere — no subtrees get pruned. HNSW's graph-based approach doesn't have this problem.

Common Issues

Problem	Fix
`Ollama: OFFLINE` in header	Run `ollama serve` in a terminal
Embedding takes forever	Ollama is downloading the model on first use, wait 2 min
`ModuleNotFoundError`	Run `pip install -r requirements.txt`
Port 8080 already in use	Kill the process: `netstat -ano
LLM answer is slow	Normal — mistral takes 10–30s on a laptop CPU. Use a smaller model for faster answers

Use a Smaller/Faster LLM

If mistral is too slow on your laptop, switch to llama3.2:1b:

ollama pull llama3.2:1b

Then edit main.py where gen_model is set in OllamaClient.__init__:

self.gen_model = "llama3.2:1b"   # change this line

Restart the server — no recompile needed.

Roadmap

OpenAI embeddings support — swap nomic-embed-text for text-embedding-3-small via API key, enabling cloud-hosted deployment without Ollama
Persistent storage — replace the JSON flat file with SQLite for proper ACID transactions, faster startup with large document sets, and concurrent write safety
Docker deployment — docker-compose.yml bundling the FastAPI server + Ollama in a single-command setup for reproducible demos
Fine-tuning integration — experiment with fine-tuned embedding models to show how domain-specific tuning shifts cluster geometry in the PCA scatter plot
Multi-vector search — support querying by multiple vectors simultaneously (e.g., average embeddings of a multi-sentence query) to improve RAG recall

Running the Tests

pip install pytest
pytest test_search.py -v

All 40+ tests run without Ollama or a running server — the test suite imports the algorithm classes directly and tests them in isolation.

License

MIT — use this however you want.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
screenshots		screenshots
.gitignore		.gitignore
README.md		README.md
index.html		index.html
main.py		main.py
requirements.txt		requirements.txt
test_search.py		test_search.py

Folders and files

Latest commit

History

Repository files navigation