rajadityaaa/YourOwnAI

YourOwnAI — Vector Database + RAG Pipeline from Scratch in Python

A fully working Vector Database and RAG pipeline built from first principles — no Pinecone, no Chroma, no shortcuts. HNSW, KD-Tree, and Brute Force search implemented in pure Python with a FastAPI backend and a live web UI.



Screenshots

[Screenshots: Vector Search Dashboard (homepage), Document Upload & Indexing, Documents Indexed, RAG Question Answering, Evaluation Metrics]

Key Achievements

  • Built HNSW vector search from scratch in Python — same algorithm used by Pinecone, Weaviate, and Chroma
  • Implemented KD-Tree and Brute Force search side-by-side for live algorithm benchmarking
  • Developed a complete Retrieval-Augmented Generation (RAG) pipeline with streaming token output via SSE
  • Integrated local LLMs via Ollama (Mistral / Llama3.2) — no API keys, fully offline
  • Added built-in RAG evaluation endpoint scoring context hit rate and retrieval distance
  • Implemented PDF/TXT ingestion, overlapping chunking, embedding, and semantic search end-to-end

What This Project Does

| Feature | Description |
| --- | --- |
| 3 Search Algorithms | HNSW (production-grade), KD-Tree, Brute Force; run all three and compare speed |
| 3 Distance Metrics | Cosine similarity, Euclidean distance, Manhattan distance |
| 16D Demo Vectors | 50 pre-loaded semantic vectors across 4 categories (CS, Math, Food, Sports) |
| 2D PCA Scatter Plot | Live visualization of semantic space; watch clusters form |
| Real Document Embedding | Paste any text → Ollama embeds it with nomic-embed-text (768D) |
| RAG Pipeline | Ask questions about your documents → HNSW retrieves context → local LLM answers |
| Full REST API | CRUD endpoints: insert, delete, search, benchmark, hnsw-info |
| PDF & TXT Upload | Upload files directly; text is extracted, chunked, and embedded automatically |
| Streaming Responses | /doc/ask streams tokens word-by-word via Server-Sent Events (SSE) |
| Category Filter | Filter search results by category: ?category=cs |
| Similarity Threshold | Control RAG retrieval strictness via max_distance parameter |
| Numpy Optimized | Distance functions (Euclidean, Cosine, Manhattan) use vectorized numpy operations |
| Persistent Storage | Document index survives server restarts via JSON serialization |
| RAG Evaluation | Built-in /doc/evaluate endpoint scores retrieval quality without external tools |

How It Works

Architecture

Your Text
    │
    ▼
Ollama (nomic-embed-text)          ← converts text to a 768-dimensional vector
    │
    ▼
HNSW Index (Python)                ← indexes the vector in a multilayer graph
    │
    ▼
Semantic Search                    ← finds nearest neighbors in vector space
    │
    ▼
Ollama (mistral)                   ← reads retrieved chunks, generates an answer
    │
    ▼
Answer (streamed token by token via SSE)

HNSW (Hierarchical Navigable Small World) is the same algorithm used by Pinecone, Weaviate, Chroma, and Milvus. It builds a multilayer graph in which each layer is progressively sparser. A search starts at the sparse top layer and descends layer by layer toward the query's neighborhood, which gives roughly O(log N) search time in practice, compared with O(N) for brute force.


Why I Built This

Production vector databases like Pinecone and Weaviate abstract away almost everything — you call an API and results appear. That's great for shipping products, but it makes it nearly impossible to reason about why a search returns what it does, how the index degrades under certain distributions, or what tradeoffs were made in the retrieval layer. I built this project to remove that abstraction entirely: every distance function, every graph traversal, every RAG prompt construction is written from scratch in readable Python. The goal was to be able to look at a bad retrieval result and trace it all the way back to a specific algorithmic decision — and that's now possible here.


How It Compares to Production Systems

| Feature | This Project | Pinecone | Chroma | Weaviate |
| --- | --- | --- | --- | --- |
| Algorithm | HNSW (hand-coded) + KD-Tree + Brute Force | HNSW (Rust/C++) | HNSW (hnswlib) | HNSW (Go) |
| Embedding | Ollama local models | OpenAI / bring your own | OpenAI / bring your own | OpenAI / bring your own |
| Persistence | JSON file | Managed cloud | SQLite / cloud | RocksDB / cloud |
| Filtering | Category filter (in-memory) | Metadata filters (indexed) | where clause | GraphQL filters |
| Scalability | Single process, in-memory | Millions of vectors, distributed | Millions of vectors | Billions of vectors |
| Streaming | SSE token streaming | Not applicable | Not applicable | Not applicable |
| Introspection | HNSW graph inspector endpoint | None | None | None |
| Purpose | Education / portfolio | Production SaaS | Local dev / prototyping | Enterprise production |

The key difference: this project exposes internal state (layer counts, edge counts, graph topology) that production systems hide entirely. It also runs a side-by-side benchmark of all three search algorithms in real time — something no production system offers because they've already committed to one algorithm.


Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| Backend | Python 3.10+, FastAPI | Async-native, minimal boilerplate |
| Search Algorithms | Pure Python (HNSW, KD-Tree, BruteForce) | Built from scratch to understand internals |
| Distance Metrics | NumPy | Vectorized ops: np.linalg.norm, np.dot, np.abs |
| Embeddings | Ollama + nomic-embed-text | Local, free, 768D semantic embeddings |
| LLM | Ollama + mistral | Local inference, no API keys required |
| PDF Parsing | PyMuPDF (fitz) | Fast, reliable text extraction |
| Streaming | Server-Sent Events (SSE) | Token-by-token streaming like ChatGPT |
| Frontend | Vanilla HTML/JS | PCA scatter plot, benchmark chart, chat UI |
| Testing | pytest | 40+ unit tests, no mocking required |

What I Learned

  • HNSW tradeoffs are non-obvious. The M (max connections) and ef_construction parameters have a direct impact on recall vs. build time. A higher ef_construction finds better neighbors during indexing but slows inserts. At low values, the graph develops "weak bridges" between clusters that hurt recall on edge cases.

  • KD-Trees fall apart in high dimensions. The ball-within-hyperslab pruning that makes KD-Trees fast at 16D is almost entirely useless at 768D — nearly every subtree needs to be visited, making it equivalent to brute force. Implementing both side-by-side made this viscerally clear rather than just theoretically understood.

  • RAG quality is dominated by chunking strategy, not the LLM. The chunk_text function's overlap_words parameter matters more than model choice: too little overlap and a key sentence gets cut across two chunks; too much and the index bloats with redundant embeddings that pollute top-k retrieval.

  • SSE streaming requires careful generator design. The event stream sends three event types (context, token, done) and the generator must handle mid-stream Ollama disconnects gracefully. Getting the StreamingResponse and event_stream() generator to compose correctly in FastAPI took multiple iterations.

  • NumPy vectorization is worth the dependency. Replacing a Python for loop distance calculation with np.linalg.norm(np.array(a) - np.array(b)) cut benchmark times by 4–6x on 768D vectors. The difference is negligible at 16D but significant when the document index grows.
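
The vectorized distance functions from the last point above could look roughly like this. It is a minimal sketch: the function names are illustrative and not necessarily those used in main.py, and cosine is written as a distance (1 minus similarity) on the assumption that smaller always means closer.

```python
import numpy as np

def euclidean(a, b):
    """L2 distance via a single vectorized norm call."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return float(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).sum())

def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # assumption: a zero vector is treated as unrelated to everything
    return float(1.0 - np.dot(a, b) / denom)
```

Converting with np.asarray once per call keeps the functions usable with plain Python lists, which is how vectors typically arrive from JSON requests.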


Prerequisites

You need 3 things installed on your machine:

  1. Python 3.10+ (download from python.org)
  2. Git
  3. Ollama (runs the local AI models)

Step-by-Step Setup

Step 1 — Install Python 3.10+

  1. Go to https://python.org/downloads and download Python 3.10 or newer
  2. Run the installer — check "Add Python to PATH" before clicking Install
  3. Verify in PowerShell / terminal:
python --version

Step 2 — Install Git

  1. Go to https://git-scm.com/download/win and download Git for Windows
  2. Run the installer with default settings
  3. Verify in PowerShell:
git --version

Step 3 — Install Ollama (Local AI Models)

  1. Go to https://ollama.com and click Download for Windows
  2. Run the installer
  3. Ollama starts automatically in the system tray
  4. Open PowerShell and pull the two required models:
ollama pull nomic-embed-text

(~274 MB — this is the embedding model)

ollama pull mistral

(~4 GB — this is the language model)

  5. Verify Ollama is running:
ollama list

You should see both models listed.

Minimum specs for Ollama: 8 GB RAM recommended. Going by the download sizes above, the two models take roughly 4.3 GB of disk space in total.


Step 4 — Clone the Repository

Open PowerShell / terminal and run:

git clone https://github.com/rajadityaaa/YourOwnAI.git
cd YourOwnAI

Step 5 — Install Python Dependencies

Inside the project folder, run:

pip install fastapi uvicorn requests numpy pymupdf python-multipart

Or use the requirements file:

pip install -r requirements.txt

Troubleshooting:

  • pip: command not found → Python not in PATH, redo Step 1
  • ModuleNotFoundError: fitz → run pip install pymupdf
  • 422 Unprocessable Entity on file upload → run pip install python-multipart

Step 6 — Run Everything

Terminal 1 — Start Ollama (if not already running):

ollama serve

(If Ollama is already in the system tray on Windows, skip this)

Terminal 2 — Start the Python server:

python main.py

You should see:

=== VectorDB Engine ===
http://localhost:8080
50 demo vectors | 16 dims | HNSW+KD-Tree+BruteForce
Ollama: ONLINE
  embed model: nomic-embed-text  gen model: mistral

Open your browser and go to:

http://localhost:8080

Using the Application

Tab 1: Search (Demo Vectors)

  • Type any concept in the search box: binary tree, sushi, basketball, calculus
  • Choose your algorithm: HNSW, KD-Tree, or Brute Force
  • Choose distance metric: Cosine, Euclidean, or Manhattan
  • Click ⚡ SEARCH — results appear with distances, the matching point glows on the scatter plot
  • Click ▶ COMPARE ALL ALGOS to run all 3 algorithms and compare their speed

The scatter plot shows all 50 vectors projected to 2D using PCA. Notice how the 4 semantic categories (CS, Math, Food, Sports) form distinct clusters — this is what "semantic similarity" looks like visually.
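
For readers curious how a 16D to 2D projection like this can be computed, here is a minimal PCA sketch using NumPy's SVD. The function name pca_2d and the random stand-in data are assumptions for illustration; the frontend may compute its projection differently.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their two highest-variance directions."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA requires zero-mean data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T             # shape (n_points, 2)

# Stand-in for the 50 demo vectors (random here, semantic in the real app).
rng = np.random.default_rng(0)
demo = rng.random((50, 16))
points = pca_2d(demo)
print(points.shape)
```

With real semantic vectors, points from the same category tend to land close together in the two output columns, which is what produces the visible clusters.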

Category filter: Append category=cs to the /search query string (for example, &category=cs) to restrict results to a single category.

Tab 2: Documents (Real Embeddings)

This uses Ollama to generate real 768-dimensional embeddings from any text or file.

Option A — Paste text:

  1. Type a title (e.g., Operating Systems Notes)
  2. Paste any text — lecture notes, textbook paragraphs, Wikipedia articles
  3. Click ⚡ EMBED & INSERT

Option B — Upload a file:

  1. Click the PDF or TXT tab in the Upload File section
  2. Click or drag-and-drop your file
  3. Click ⚡ UPLOAD & EMBED

Long documents are automatically split into overlapping 250-word chunks. Each chunk gets its own embedding and is stored in a dedicated HNSW index for documents, separate from the 16D demo index.
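
Overlapping word-based chunking can be sketched as follows. The name chunk_text and the overlap_words parameter appear elsewhere in this README, but the exact signature and the 50-word overlap default are assumptions made for this example.

```python
def chunk_text(text, chunk_words=250, overlap_words=50):
    """Split text into chunks of chunk_words words; consecutive chunks share overlap_words words."""
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be >= 0 and smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words       # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break                            # this window already reached the end
    return chunks

sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_text(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

The early break avoids emitting a trailing chunk that would consist only of words already covered by the previous one.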

Tab 3: Ask AI (RAG Pipeline)

  1. Make sure you have inserted some documents in Tab 2 first
  2. Type a question about your documents
  3. Click 🤖 ASK AI

What happens behind the scenes:

1. Your question → embedded with nomic-embed-text (768D vector)
2. HNSW search → finds 3 most semantically similar chunks
3. Retrieved chunks → sent as context to mistral
4. mistral → generates an answer based only on your documents

The answer streams in token by token via Server-Sent Events (SSE) — just like ChatGPT. Click the context chips below the answer to see exactly which document chunks were used to generate it.
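
The wire format behind that stream can be sketched in a few lines. The three event names (context, token, done) come from this README; the JSON payload shapes and the sse_event helper are assumptions for illustration, and in the server a generator like this would be wrapped in a FastAPI StreamingResponse.

```python
import json

def sse_event(event, data):
    """Format one SSE frame: an 'event:' line, a 'data:' line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def event_stream(chunks, tokens):
    """Yield the retrieved context first, then one frame per token, then a done marker."""
    yield sse_event("context", {"chunks": chunks})
    for tok in tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {})

frames = list(event_stream(["chunk A"], ["Hello", " world"]))
for frame in frames:
    print(frame, end="")
```

Sending the context event before any token lets the UI render the source chips while the model is still generating.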

You can also control retrieval strictness by passing "max_distance": 0.5 (stricter) or 0.9 (looser) in the request body.
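
One plausible way a max_distance threshold feeds into prompt construction is sketched below. The build_prompt helper, the prompt wording, and the (distance, text) hit format are all hypothetical; the actual prompt in main.py may differ.

```python
def build_prompt(question, hits, k=3, max_distance=0.7):
    """hits: list of (distance, chunk_text). Keep the k nearest, then drop any beyond max_distance."""
    nearest = sorted(hits, key=lambda h: h[0])[:k]
    kept = [text for dist, text in nearest if dist <= max_distance]
    context = "\n---\n".join(kept) if kept else "(no relevant context found)"
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return prompt, kept

hits = [(0.9, "chunk-far"), (0.2, "chunk-near"), (0.6, "chunk-mid")]
strict_prompt, strict_kept = build_prompt("What is a heap?", hits, max_distance=0.5)
loose_prompt, loose_kept = build_prompt("What is a heap?", hits, max_distance=0.9)
print(strict_kept, loose_kept)
```

A stricter threshold sends fewer, closer chunks to the model, which trades recall for less off-topic context.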


REST API Reference

The server exposes a full REST API at http://localhost:8080.

Demo Vector Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /search?v=f1,f2,...&k=5&metric=cosine&algo=hnsw | K-NN search |
| POST | /insert | Insert a demo vector |
| DELETE | /delete/:id | Delete by ID |
| GET | /items | List all demo vectors |
| GET | /benchmark?v=...&k=5&metric=cosine | Compare all 3 algorithms |
| GET | /hnsw-info | HNSW graph structure and layer stats |
| GET | /stats | Database statistics |

Document & RAG Endpoints

| Method | Endpoint | Body | Description |
| --- | --- | --- | --- |
| POST | /doc/insert | {"title":"...","text":"..."} | Embed and store document |
| POST | /doc/upload-pdf | multipart/form-data | Upload and embed a PDF file |
| POST | /doc/upload-txt | multipart/form-data | Upload and embed a TXT file |
| GET | /doc/list | | List all stored documents |
| DELETE | /doc/delete/:id | | Delete document chunk |
| POST | /doc/search | {"question":"...","max_distance":0.7} | Semantic search only (no LLM) |
| POST | /doc/ask | {"question":"...","k":3,"max_distance":0.7} | RAG: streaming token response |
| POST | /doc/ask-sync | {"question":"...","k":3} | RAG: full response (non-streaming) |
| POST | /doc/evaluate | {"pairs":[{"question":"...","expected_answer":"..."}]} | Evaluate RAG retrieval quality |
| GET | /status | | Ollama status and model info |
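
To make the idea behind /doc/evaluate concrete, here is a sketch of how a context hit rate and an average retrieval distance could be computed from question/expected-answer pairs. The scoring rule (a hit means the expected answer appears verbatim in a retrieved chunk), the function names, and the output keys are assumptions; the endpoint's actual definition may differ.

```python
def evaluate_retrieval(pairs, retrieve, k=3):
    """pairs: [{'question', 'expected_answer'}]; retrieve(question, k) -> [(distance, chunk_text)]."""
    hits = 0
    best_distances = []
    for pair in pairs:
        results = retrieve(pair["question"], k)
        if not results:
            continue                                  # nothing retrieved counts as a miss
        best_distances.append(min(d for d, _ in results))
        expected = pair["expected_answer"].lower()
        if any(expected in text.lower() for _, text in results):
            hits += 1
    return {
        "context_hit_rate": hits / len(pairs) if pairs else 0.0,
        "avg_best_distance": (sum(best_distances) / len(best_distances)
                              if best_distances else None),
    }

def fake_retrieve(question, k):
    """Stand-in retriever so the example runs without Ollama or the server."""
    corpus = {
        "What is a heap?": [(0.2, "A heap is a tree-based priority queue.")],
        "What is sushi?": [(0.8, "Basketball is played on a court.")],
    }
    return corpus.get(question, [])[:k]

pairs = [
    {"question": "What is a heap?", "expected_answer": "priority queue"},
    {"question": "What is sushi?", "expected_answer": "vinegared rice"},
]
report = evaluate_retrieval(pairs, fake_retrieve)
print(report)
```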

Example: Search via curl

# Search all categories
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&metric=cosine&algo=hnsw"

# Filter by category
curl "http://localhost:8080/search?v=0.9,0.8,0.7,0.6,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1&k=3&category=cs"

Example: Ask a question via curl

# Streaming (default); -N turns off curl's output buffering so tokens print as they arrive
curl -N -X POST http://localhost:8080/doc/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"What is dynamic programming?","k":3}'

# Non-streaming (sync)
curl -X POST http://localhost:8080/doc/ask-sync \
  -H "Content-Type: application/json" \
  -d '{"question":"What is dynamic programming?","k":3,"max_distance":0.5}'
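
The same calls can be made from Python using only the standard library. This sketch builds the request without sending it, since sending requires the server to be running; the build_ask_request helper is hypothetical, and the response format is not shown because it is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"

def build_ask_request(question, k=3, max_distance=None, sync=True):
    """Build a POST request for /doc/ask-sync (or /doc/ask when sync=False)."""
    body = {"question": question, "k": k}
    if max_distance is not None:
        body["max_distance"] = max_distance
    path = "/doc/ask-sync" if sync else "/doc/ask"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ask_request("What is dynamic programming?", max_distance=0.5)
print(req.get_method(), req.full_url)

# To actually send it (server must be running):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```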

Project Structure

YourOwnAI/
├── main.py         ← Python backend (HNSW, KD-Tree, BruteForce, FastAPI, RAG)
├── index.html      ← Frontend (PCA scatter plot, chat UI, benchmark chart)
├── test_search.py  ← pytest suite (40+ tests, no mocking, no Ollama required)
├── requirements.txt ← Python dependencies
└── README.md       ← This file

Architecture (main.py)

BruteForce          O(N·d)          Exact, baseline
KDTree              ~O(log N) avg   Exact, axis-aligned partitioning; degrades toward O(N·d) in high dimensions
HNSW                ~O(log N)       Approximate, multilayer small-world graph

VectorDB            Unified interface over all 3 (16D demo vectors)
DocumentDB          HNSW-only index for real Ollama embeddings (768D+)
OllamaClient        HTTP client → /api/embeddings + /api/generate (streaming)
FastAPI             REST server with CORS, file upload, SSE streaming
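
As a reference point for the O(N·d) baseline listed above, an exact brute-force k-NN fits in a few lines of standard-library Python. The function name and the id-to-vector dictionary layout are assumptions for this sketch, not the BruteForce class from main.py.

```python
import heapq
import math

def brute_force_knn(query, items, k=3):
    """items: dict of id -> vector. Returns the k nearest as [(distance, id)], closest first."""
    # One distance computation per stored vector: N vectors times d dimensions.
    scored = ((math.dist(query, vec), item_id) for item_id, vec in items.items())
    return heapq.nsmallest(k, scored)

items = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
result = brute_force_knn([0.9, 0.0], items, k=2)
print([item_id for _, item_id in result])
```

Because it checks every vector, this version is always exact, which makes it a useful ground truth when measuring the recall of an approximate index.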

Algorithm Deep Dive

HNSW (Hierarchical Navigable Small World)

Nodes are inserted into a multilayer graph. Each node is randomly assigned a maximum layer and appears in every layer from 0 up to that maximum. Layer 0 contains all nodes and is the most densely connected; each higher layer contains exponentially fewer nodes, so its links span longer distances.
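
The random layer assignment is usually drawn from an exponentially decaying distribution, as in the original HNSW formulation: level = floor(-ln(u) · mL) with mL = 1/ln(M). The sketch below assumes M = 16 and a function name of my choosing; main.py may use different values.

```python
import math
import random
from collections import Counter

def random_level(M=16, rng=random):
    """Draw a node's top layer; each layer up is about M times less likely."""
    m_L = 1.0 / math.log(M)                       # level normalization factor
    u = 1.0 - rng.random()                        # in (0, 1], so log(u) is always defined
    return int(-math.log(u) * m_L)

rng = random.Random(0)
levels = [random_level(16, rng) for _ in range(10_000)]
counts = Counter(levels)
print(sorted(counts.items()))
```

With M = 16, about 94% of nodes stay on layer 0 only, which is why the upper layers remain sparse enough to act as express routes.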

Insert: Start at the top layer, greedily find the nearest node, drop a layer, repeat. At each layer from your assigned max down to 0, run a beam search (ef_construction=200) and connect to the M nearest neighbors bidirectionally.

Search: Same greedy descent from top layer. At layer 0, expand to ef nearest candidates using a priority queue.
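
The layer-level beam search can be sketched as below, with a min-heap of candidates to expand and a bounded max-heap holding the ef best results so far. The graph and vector dictionaries are simplified stand-ins (a single layer, Euclidean distance only), not the data structures used in main.py.

```python
import heapq
import math

def search_layer(graph, vectors, query, entry, ef):
    """Beam search on one layer. graph: node -> neighbor ids; vectors: node -> vector."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]            # min-heap: closest unexpanded node first
    best = [(-d0, entry)]                 # max-heap via negation: the ef closest found so far
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -best[0][0]:
            break                         # closest candidate is worse than our worst result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = math.dist(query, vectors[nb])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the current farthest result
    return sorted((-neg_d, n) for neg_d, n in best)

# Toy layer: ten nodes on a line, each linked to its immediate neighbors.
vectors = {i: [float(i), 0.0] for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
found = search_layer(graph, vectors, query=[7.2, 0.0], entry=0, ef=3)
print([node for _, node in found])
```

A larger ef widens the beam, which improves recall at the cost of more distance computations per query.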

Why it's fast: The upper layers act like a highway — you quickly get to the right neighborhood, then zoom in at layer 0.

KD-Tree (K-Dimensional Tree)

Binary space partitioning. Each node splits space along one dimension (cycling through all dimensions). Search prunes entire subtrees when the closest possible point in that subtree can't beat the current best — the "ball within hyperslab" check.

Weakness: Degrades with high dimensions (curse of dimensionality). Works well for ≤20D, becomes close to brute force at 768D.
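
A compact nearest-neighbor KD-Tree showing the pruning check might look like this. It is a sketch under simplifying assumptions (dictionary nodes, Euclidean distance, single nearest neighbor rather than k), not the KDTree class from main.py.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively split on the median along one axis, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest neighbor, pruning subtrees that cannot win."""
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting plane if the best-so-far ball actually reaches it.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

rng = random.Random(1)
pts = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
tree = build_kdtree(pts)
query = (0.5, 0.5, 0.5)
dist, point = nearest(tree, query)
print(round(dist, 4), point)
```

The abs(diff) < best[0] test is the pruning step: in 3D it skips most of the tree, but as the dimension grows the condition is true almost everywhere and the search visits nearly every node.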

Why HNSW Wins at High Dimensions

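KD-Tree pruning relies on axis-aligned distance bounds. In high dimensions, distances concentrate: the nearest and farthest points lie at nearly the same distance from the query, so the best-so-far radius almost always crosses the splitting plane and hardly any subtree gets pruned. HNSW navigates by neighbor links rather than by partitioning space, so it does not depend on such bounds.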
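
A quick experiment makes the effect visible: for uniformly random points, the ratio of the nearest to the farthest distance from a query moves toward 1 as the dimension grows. The function name and the use of uniform random data are assumptions for this sketch; real embeddings are not uniform, so the numbers will differ, but the trend is the same.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query; near 1 means little to prune."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for dim in (2, 16, 768):
    print(dim, round(relative_contrast(dim), 3))
```

When this ratio is close to 1, a bound that rules out "everything farther than the current best" rules out almost nothing, which is the practical reason the KD-Tree ends up scanning nearly all points at 768D.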


Common Issues

| Problem | Fix |
| --- | --- |
| Ollama: OFFLINE in header | Run ollama serve in a terminal |
| Embedding takes forever | Ollama is downloading the model on first use, wait 2 min |
| ModuleNotFoundError | Run pip install -r requirements.txt |
| Port 8080 already in use | Kill the process: `netstat -ano |
| LLM answer is slow | Normal: mistral takes 10–30s on a laptop CPU. Use a smaller model for faster answers |

Use a Smaller/Faster LLM

If mistral is too slow on your laptop, switch to llama3.2:1b:

ollama pull llama3.2:1b

Then edit main.py where gen_model is set in OllamaClient.__init__:

self.gen_model = "llama3.2:1b"   # change this line

Restart the server — no recompile needed.


Roadmap

  • OpenAI embeddings support — swap nomic-embed-text for text-embedding-3-small via API key, enabling cloud-hosted deployment without Ollama
  • Persistent storage — replace the JSON flat file with SQLite for proper ACID transactions, faster startup with large document sets, and concurrent write safety
  • Docker deployment: docker-compose.yml bundling the FastAPI server + Ollama in a single-command setup for reproducible demos
  • Fine-tuning integration — experiment with fine-tuned embedding models to show how domain-specific tuning shifts cluster geometry in the PCA scatter plot
  • Multi-vector search — support querying by multiple vectors simultaneously (e.g., average embeddings of a multi-sentence query) to improve RAG recall

Running the Tests

pip install pytest
pytest test_search.py -v

All 40+ tests run without Ollama or a running server — the test suite imports the algorithm classes directly and tests them in isolation.


License

MIT — use this however you want.
