RAGScope

You can't fix what you can't see.

A local diagnostic tool for RAG pipelines. Scores every query your pipeline processes and tells you exactly what's wrong — before you ship.

  PASS  90/100  █████████░  my-rag-app
  │  "What is RAG?"
  │
  │  ✓  precision    90  █████████░  9/10 chunks used
  │  ✓  efficiency   80  ████████░░  20% tokens wasted
  │  ✓  uniqueness  100  ██████████  chunks are distinct
  │  ✓  coverage    100  ██████████  all chunks scored
  │

  WARN  54/100  █████░░░░░  my-rag-app
  │  "What is dense passage retrieval?"
  │
  │  ✗  precision    40  ████░░░░░░  4/10 chunks used
  │  ~  efficiency   50  █████░░░░░  50% tokens wasted
  │  ~  uniqueness   65  ███████░░░  2 near-duplicate pairs
  │  ✓  coverage    100  ██████████  all chunks scored
  │
  │  → Reduce TOP_K 10→4 (only 4 chunks reached LLM)
  │  → 50% of retrieved tokens never reached the LLM
  │  → 2 near-duplicate chunks — deduplicate at ingest time
  │

  ──────────────────────────────────────────────────
  Session  2 queries  ·  avg 72/100  ↓

The problem

Most RAG pipelines fail silently. You retrieve 10 chunks, the LLM prompt only contains 3, and nothing in your logs tells you the other 7 were dropped. You're retrieving near-duplicates that eat your context window. Your similarity scores are zero because your vector store returns distances and nobody normalized them. The model gives vague answers, and there's nothing to debug.

These are all retrieval mechanics problems. They're fixable. But you can't see them without tracing.

RAGScope makes them visible — scored, labelled, and actionable — in your terminal, before users ever see the output.

What it is

RAGScope is a local OTLP receiver that runs on port 4321 alongside your development server. It receives the telemetry your RAG pipeline already emits, analyzes the full trace end-to-end, and prints a diagnostic score to your terminal the moment each query completes.

It is a dev-time quality gate — the same category as a linter or type checker. You run it locally while building, catch the problems, and ship with confidence. It is not a production monitoring tool.

How it works

Your RAG app emits OpenTelemetry spans as it runs: a retrieval span (with chunk IDs, scores, and content), optionally a reranking span, and an LLM span (with the full prompt text). RAGScope receives these via standard OTLP and extracts what it needs.

Your RAG app  ──(OTLP/JSON)──▶  RAGScope :4321
                                      │
                     ┌────────────────┼────────────────────┐
                     ▼                ▼                     ▼
               parse spans      normalize scores      detect context
               (wire format     (Qdrant/Chroma/        (which chunks
               → typed)         Pinecone → [0,1])      reached the LLM)
                     │                │                     │
                     └────────────────┼─────────────────────┘
                                      ▼
                              score the trace
                        (precision · efficiency ·
                          uniqueness · coverage)
                                      │
                                      ▼
                             print to terminal

Context detection is the key mechanism: RAGScope compares each chunk's content against the actual LLM prompt text. If the chunk appears in the prompt, it was used (inContext: true). If not, it was retrieved and discarded. This is how RAGScope measures precision and efficiency without any special integration — it reads the spans your instrumentation already emits.

All trace data lives in memory for the session. Nothing is written to disk. Nothing leaves your machine.

Quick start

# 1. Start RAGScope (no install needed)
npx ragscope start

# 2. Point your pipeline at it — one environment variable
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4321

# 3. Run your test queries — scores appear the moment each one completes

No config files. No accounts. No data leaving your machine. Requires Node.js ≥ 24.

The four scores

Every trace gets a 0–100 score built from four sub-scores. The weights reflect their practical impact on answer quality.

Retrieval Precision — 40%

What it measures: The fraction of retrieved chunks that actually appeared in the LLM's prompt.

Why it's weighted highest: A chunk that doesn't reach the LLM contributes nothing to the answer. It costs retrieval latency, vector store bandwidth, and context window space — and then gets silently dropped. If your pipeline retrieves 10 chunks and the LLM only sees 3, your TOP_K is more than 3× too high for this query.

What a bad score looks like:

│  ✗  precision    30  ███░░░░░░░  3/10 chunks used
│  → Reduce TOP_K 10→3 (only 3 chunks reached LLM)

Context Efficiency — 30%

What it measures: The fraction of retrieved tokens that the LLM actually consumed.

Why it matters: Every token in a retrieved chunk that doesn't reach the prompt is a token you paid to embed, store, and retrieve — and then threw away. Low efficiency means your context window is being filled with chunks that get cut before the LLM sees them, which also means the chunks that do matter might be getting truncated.

What a bad score looks like:

│  ✗  efficiency   45  ████░░░░░░  55% tokens wasted
│  → 55% of retrieved tokens never reached the LLM

Uniqueness — 20%

What it measures: How distinct your retrieved chunks are from each other. 100 = fully unique, 0 = all near-duplicates. Computed from text overlap between adjacent chunks.

Why it matters: When your chunking strategy creates overlapping segments, the model receives the same information multiple times. This wastes context window space, can bias the model toward repeated facts, and usually indicates a chunking configuration problem rather than a retrieval problem.

What a bad score looks like:

│  ~  uniqueness   60  ██████░░░░  2 near-duplicate pairs
│  → 2 near-duplicate chunks — deduplicate at ingest time

Score Coverage — 10%

What it measures: Whether retrieved chunks carry non-zero similarity scores.

Why it matters: Without similarity scores you can't understand which chunks are the strongest matches, can't tune retrieval thresholds, and can't detect when your vector store is returning results in arbitrary order. This score is a signal flag, not a performance metric.

Common cause of zero scores: Langfuse sometimes omits scores from trace exports. Chroma returns L2 distances — RAGScope normalizes those automatically, so they won't trigger this flag.

Labels

Score	Label	Meaning
≥ 75	PASS	Pipeline is healthy for this query
50–74	WARN	Issues present — review recommendations
< 50	FAIL	Significant retrieval problems

The per-score breakdown with recommendations is shown by default. Use --compact for a single-line summary per query.

Integrations

RAGScope accepts traces via two paths.

Path 1 — OTLP (any OTel-compatible tool)

Set one environment variable. No code changes in most setups.

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4321

Works with any instrumentation library that emits OpenTelemetry spans. Common setups:

Traceloop / OpenLLMetry — auto-instruments LangChain, LlamaIndex, OpenAI, Pinecone, Qdrant, Cohere, and more:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Traceloop } from '@traceloop/node-server-sdk';

Traceloop.init({
  exporter: new OTLPTraceExporter({ url: 'http://localhost:4321/v1/traces' }),
});

Vercel AI SDK:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4321/v1/traces' }),
}).start();

Phoenix (Arize): set PHOENIX_COLLECTOR_ENDPOINT=http://localhost:4321

OpenLLMetry: set TRACELOOP_BASE_URL=http://localhost:4321

Manual instrumentation — the minimum RAGScope needs to score a trace is a retrieval span with two attributes:

span.setAttribute('gen_ai.operation.name', 'retrieve');
span.setAttribute(
  'gen_ai.retrieval.documents',
  JSON.stringify([
    { id: 'chunk-1', score: 0.92, content: 'The actual chunk text...' },
    { id: 'chunk-2', score: 0.81, content: 'Another chunk...' },
  ]),
);

For context detection to work, add an LLM span with the full prompt:

span.setAttribute('gen_ai.operation.name', 'chat');
span.setAttribute('ai.prompt', fullPromptText);

Path 2 — Langfuse (zero code changes)

If you're already logging traces to Langfuse, set two env vars. RAGScope polls your project every 30 seconds and scores any new traces it finds — no code changes, no redeployment.

LANGFUSE_PUBLIC_KEY=pk-lf-... \
LANGFUSE_SECRET_KEY=sk-lf-... \
npx ragscope start

For a self-hosted instance, add LANGFUSE_BASE_URL=https://your-langfuse.com.

Coming soon: LangSmith · Helicone adapters. Open an issue to request or contribute one.

Problems it catches

Symptom	Root cause	RAGScope signal
Answers are vague despite relevant documents existing	TOP_K too high — most retrieved chunks are discarded before the LLM sees them	Low precision + specific TOP_K recommendation
High token costs, slow responses	Retrieving large chunks that mostly get dropped	Low efficiency + wasted token count
Model repeats the same information	Near-duplicate chunks in retrieval output	Low uniqueness score + near-duplicate count
Can't tune retrieval thresholds	Scores missing from trace data	Low coverage + normalization note
Reranker not improving answer quality	Chunks are reordered but the same ones are still dropped	Reranker diff in `/api/traces/:id`

CLI reference

npx ragscope start [options]

  --port <n>     Port to listen on (default: 4321)
  --compact      One-line output per query instead of the full per-score breakdown

Compatibility

	Supported
Node.js	≥ 24.0.0
Ingestion	OTLP/HTTP · Langfuse polling
Languages	Any with OTel support (Node.js, Python, Go, Java…)
Frameworks	LangChain · LlamaIndex · Vercel AI SDK · custom pipelines
Vector stores	Qdrant · Chroma · Pinecone · Weaviate · pgvector · any OTLP source
Rerankers	Cohere Rerank · any span with `gen_ai.operation.name = rerank`
Models	OpenAI · Anthropic · Cohere · Mistral · any OTel-instrumented provider

What it doesn't do

RAGScope is deliberately narrow. Knowing what it won't do matters as much as knowing what it will.

It is not a production monitoring tool. Trace data lives in memory for the process lifetime. Use Langfuse, Phoenix, or Arize for production observability.

It does not evaluate answer quality. RAGScope measures retrieval mechanics — whether the right chunks are reaching the LLM efficiently. It does not judge whether the answers are factually correct or semantically appropriate.

It does not run your reranker. It observes your existing reranker span and reports whether the reordering is helping. It does not add reranking to your pipeline.

It has limited support for agentic patterns. It understands linear CHAIN → RETRIEVER → RERANKER → LLM pipelines well. Complex agent loops with tool use, multi-hop retrieval, or dynamic prompt construction may produce partial or misleading scores.

Why not Langfuse / Phoenix / Arize?

Those are production observability platforms. They're designed to record, store, and analyze what happens after you ship — at scale, over time, with dashboards and alerting.

RAGScope is a development tool. It answers a different question: "Is my pipeline working correctly right now, before I ship?" Zero setup, zero cloud, immediate feedback in the terminal. Different job, different tool.

Think of it as the difference between a linter (runs in your editor while you code) and a production error tracker (records what breaks for real users). You need both. They don't replace each other.

Roadmap

Shipped (v0.3.0)

Planned

LangSmith adapter — poll runs via LangSmith API, zero code changes
Helicone adapter — fetch requests via Helicone API
Langfuse webhooks — real-time instead of 30s polling
Audit report — npx ragscope report exports a Markdown/JSON summary of the session

Later

Compare mode — npx ragscope compare diffs two pipeline versions from separate sessions
Threshold config — .ragscope.json for custom PASS/WARN/FAIL thresholds per project
Trace drill-down — --trace <id> to inspect a single trace in detail in the terminal
Python instrumentation helpers — common patterns for Python RAG stacks

Vote on features or propose new ones — open an issue.

Contributing

See CONTRIBUTING.md for setup instructions, code layout, and how to add a new ingestion adapter.

Good first contributions: LangSmith adapter · Helicone adapter · audit report export · scoring heuristic improvements

Privacy

All trace data stays in memory on your machine. The only outbound network request RAGScope ever makes is the Langfuse poll — and only if you configure it with your own keys. No telemetry, no analytics, no accounts.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github		.github
.husky		.husky
bin		bin
scripts		scripts
src		src
.gitignore		.gitignore
.prettierrc		.prettierrc
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
eslint.config.js		eslint.config.js
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAGScope

The problem

What it is

How it works

Quick start

The four scores

Retrieval Precision — 40%

Context Efficiency — 30%

Uniqueness — 20%

Score Coverage — 10%

Labels

Integrations

Path 1 — OTLP (any OTel-compatible tool)

Path 2 — Langfuse (zero code changes)

Problems it catches

CLI reference

Compatibility

What it doesn't do

Why not Langfuse / Phoenix / Arize?

Roadmap

Shipped (v0.3.0)

Planned

Later

Contributing

Privacy

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RAGScope

The problem

What it is

How it works

Quick start

The four scores

Retrieval Precision — 40%

Context Efficiency — 30%

Uniqueness — 20%

Score Coverage — 10%

Labels

Integrations

Path 1 — OTLP (any OTel-compatible tool)

Path 2 — Langfuse (zero code changes)

Problems it catches

CLI reference

Compatibility

What it doesn't do

Why not Langfuse / Phoenix / Arize?

Roadmap

Shipped (v0.3.0)

Planned

Later

Contributing

Privacy

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages