CDWAgent

An MCP (Model Context Protocol) server that exposes a de-identified Epic Caboodle Clinical Data Warehouse (SQL Server) to BioRouter.

Built for clinical researchers who need natural-language access to EHR data without writing SQL. Designed as a sibling of UCSFOMOPAgent: CDWAgent targets the UF Epic Caboodle schema while OMOPAgent targets the OHDSI/OMOP common data model. Both can be enabled in the same BioRouter session — tool names are namespace-prefixed to prevent collision, and CDWAgent includes a crossmap_patient tool that resolves OMOP person_id values to CDW PatientDurableKey.

Architecture is based on the MedCP template by the UCSF Baranzini Lab, with a modular tool registry, expanded clinical tools, and no knowledge graph dependency.

BioRouter Extension

Download cdwagent.brxt

Drag the .brxt file into BioRouter's Extensions → Add extension dialog. BioRouter will install the virtual environment automatically and prompt for required credentials.

Variable	Required	Default	Description
`CLINICAL_RECORDS_USERNAME`	✅	—	UCSF network username (e.g. `CAMPUS\youruser`)
`CLINICAL_RECORDS_PASSWORD`	✅	—	UCSF network password
`CLINICAL_RECORDS_SERVER`	optional	`QCDIDDWDB001.ucsfmedicalcenter.org`	SQL Server hostname
`CLINICAL_RECORDS_DATABASE`	optional	`CDW_NEW`	Database name
`CDW_NAMESPACE`	optional	`CDW`	Tool namespace prefix
`CDW_SCHEMA`	optional	`deid_uf`	SQL schema name
`CDW_LOG_LEVEL`	optional	`INFO`	Logging level

Authors

Gianmarco Bellucci
Wanjun Gu

Features

21 MCP tools organized into 6 domain modules
3 guided workflow prompts for common research tasks
OMOP → CDW patient crossmapping with birth-date sanity check
Read-only SQL enforcement with comprehensive write-blocking
Schema discovery from a pre-parsed data dictionary (no DB connection needed)
Clinical notes search and retrieval
Cohort building with aggregate demographics
CSV export for large result sets
Configurable tool namespace and database schema

Tools

All tool names are namespace-prefixed with CDW- at runtime so they coexist with sibling agents (e.g., UCSFOMOPAgent) inside a single BioRouter session. The descriptions below are the canonical entry points each tool exposes; for the per-tool flow diagrams see docs/agent-flows/02-tool-flows/.

Schema Discovery (3)

These tools read from the bundled schema_reference.json and require no database connection — they work offline for exploratory research.

Tool	Description
`get_database_overview`	List every CDW table with one-line description, patient/encounter linkage flags, and column counts. The agent uses this as its first move when a research question lacks an obvious target table.
`describe_table`	Return the column list for a named table — names, data types, descriptions, and foreign-key relationships. Used to construct schema-aware SQL after a candidate table has been identified.
`search_schema`	Keyword search across table and column names plus their descriptions. Useful when the user describes a clinical concept (e.g. "lab results") rather than a table name.

Clinical Queries (7)

These tools execute SELECT-only SQL against the de-identified Epic Caboodle warehouse. Every executed statement is validated by ClinicalQueryValidator (read-only enforcement, no semicolon chaining, blocked write verbs) and appended to the SQL audit log at $TMPDIR/cdwagent_sql.log.

Tool	Description
`query`	Execute a read-only SQL `SELECT` query (or `WITH ... SELECT`) and return the rows as CSV. The validator blocks every write verb. The cohort subquery pattern (`WHERE PatientDurableKey IN (...)`) is the recommended composition primitive for cross-fact queries.
`get_patient_demographics`	Return the most recent demographic record for a `PatientDurableKey` from `PatientDim` (filtered by `IsCurrent = 1`). Sex, birth date, race, ethnicity, language, status.
`get_encounters`	Encounter history from `EncounterFact` for one patient, ordered by `DateKey` descending. Includes department specialty, encounter type, and visit type.
`get_medications`	Medication orders from `MedicationOrderFact` for one patient, with `OrderedDateKey`/`StartDateKey`/`EndDateKey` so the agent can reconstruct treatment duration.
`get_diagnoses`	Diagnosis history from `DiagnosisEventFact` for one patient, ordered by `StartDateKey`. Joined to `DiagnosisDim` for human-readable names and to `DiagnosisTerminologyDim` for the originating code system.
`get_labs`	Lab results from `LabComponentResultFact` for one patient. Returns the `Value` string field rather than `NumericValue` (de-identified and unreliable for analysis).
`crossmap_patient`	Resolve an OMOP `person_id` to a CDW `PatientDurableKey` via `OMOP_DEID.dbo.person.person_source_value = CDW_NEW.deid_uf.PatientDim.PatientEpicId` with `IsCurrent = 1`. Returns demographics plus a `birth_date_match` boolean for sanity-checking the join. The bridge tool when a study starts on the OMOP side and needs CDW depth.

Clinical Notes (4)

A two-tier retrieval surface: an NLP concept layer (cTAKES) for fast semantic search, and a verbatim layer for chart review or exact-phrase matching. The cTAKES layer is the preferred entry point for clinical concepts; verbatim retrieval is reserved for cases where the NLP layer would not normalise the phrase (specific provider names, exact dose phrasing, idiosyncratic wording).

Tool	Description
`search_note_concepts`	Search the NLP-extracted concept layer (`note_concepts`, populated by cTAKES) by canonical text or UMLS CUI, optionally restricted to a cohort of one or more `PatientDurableKey` values. Defaults exclude negated mentions and family-history mentions; historical mentions are kept (commonly relevant for retrospective research). Population-mode (no cohort) applies an early-termination optimisation and emits a `[NOTICE: ...]` banner that the agent must surface to the user.
`search_note_sdoh`	Search Social Determinants of Health concepts (`note_concepts_sdoh`, populated by the cTAKES SDOH module) — housing instability, food insecurity, employment, transportation barriers, substance use, social isolation, financial strain. Use for equity and vulnerability research where structured fields rarely capture the signal. Same population-mode notice convention as `search_note_concepts`.
`search_notes`	Verbatim text retrieval over `note_text` and `note_metadata`, scoped to a cohort of one or more `PatientDurableKey` values. Supports an optional keyword filter; without a keyword the call performs a chronological chart review. SQL Server `IN`-clause cap of 2000 patients.
`get_note`	Retrieve the full text of one clinical note by its `deid_note_key`, typically discovered via `search_note_concepts` or `search_notes`.

Concept Search (4)

These tools resolve human-language concept names or terminology codes into the surrogate keys used by fact tables. The agent uses them as the first step in any cohort-building workflow: it finds the relevant *Key values and then composes a ... IN (...) filter on the corresponding fact table.

Tool	Description
`search_diagnoses_by_code`	Resolve ICD/SNOMED codes or diagnosis names against `DiagnosisTerminologyDim` joined to `DiagnosisDim`. Returns `DiagnosisKey` values for use in `DiagnosisEventFact.DiagnosisKey IN (...)`.
`search_medications_by_code`	Resolve NDC/RxNorm codes, brand names, or generic names against `MedicationCodeDim`. Returns `MedicationKey` values for use in `MedicationOrderFact.MedicationKey IN (...)`.
`search_labs_by_code`	Resolve LOINC codes or lab component names (e.g. "hemoglobin a1c", "creatinine") against `LabComponentDim`. Returns `LabComponentKey` values for use in `LabComponentResultFact.LabComponentKey IN (...)`. Note the LOINC column is `LoincCode`, not `Loinc`.
`search_procedures_by_code`	Resolve CPT/HCPCS codes or procedure names against `ProcedureTerminologyDim`. Returns `ProcedureTerminologyKey` values for use in `ProcedureEventFact.ProcedureTerminologyKey IN (...)`.

Data Export (1)

Tool	Description
`export_query_to_csv`	Execute a read-only SQL query and write the rows to a CSV file at a caller-specified path. Validator and audit log apply identically to `query`. The target directory must exist.

Statistics (2)

Tool	Description
`summarize_table`	Per-table descriptive statistics: row count, per-column null rates, and sample value distributions for low-cardinality categorical columns.
`cohort_summary`	Aggregate demographics (age statistics, sex, race, ethnicity) for a cohort defined by a SQL subquery returning `PatientDurableKey`. Used as the closing summary at the end of a cohort-building workflow.

Guided Prompts

The server includes three MCP prompts that guide the LLM through common workflows:

clinical_data_exploration — Step-by-step CDW exploration: schema overview, table discovery, query building
cohort_building — Cohort identification workflow with correct patient identifier patterns and query optimization tips
notes_analysis — Clinical notes investigation from patient identification through note retrieval and summarization

Validation

CDWAgent has been end-to-end validated against the two BAA-covered LLM providers supported at UCSF:

Azure OpenAI GPT-5.2 via the UCSF unified-api endpoint
AWS Bedrock — Sonnet 4.6

The eval suite covers cohort identification by structured codes, multi-criteria intersection, longitudinal lab and medication trajectories, NLP-based phenotype extraction over the cTAKES note_concepts and note_concepts_sdoh layers, OMOP↔CDW patient crossmapping, ambiguity disambiguation, and read-only enforcement. All cases pass against both providers under the v0.4.3 release. The eval harness lives in neuroGB/CDWAgent_testing (private).

Installation

Requirements

Python >= 3.11
Access to a SQL Server Clinical Data Warehouse
uv package manager (recommended)

Quick install (uvx)

uvx --from git+https://github.com/BaranziniLab/CDWAgent cdwagent

From source

git clone https://github.com/BaranziniLab/CDWAgent.git
cd CDWAgent
uv sync
cp .env.example .env
# Edit .env with your database connection details
uv run cdwagent

Run as a module

python -m cdwagent

Configuration

All configuration is via environment variables (see .env.example):

Variable	Required	Description
`CLINICAL_RECORDS_USERNAME`	Yes	SQL Server username
`CLINICAL_RECORDS_PASSWORD`	Yes	SQL Server password
`CLINICAL_RECORDS_SERVER`	No	SQL Server hostname (default: `QCDIDDWDB001.ucsfmedicalcenter.org`)
`CLINICAL_RECORDS_DATABASE`	No	Database name (default: `CDW_NEW`)
`CDW_NAMESPACE`	No	Tool name prefix (default: `CDW`)
`CDW_SCHEMA`	No	Database schema for table qualification (default: `deid_uf`)
`CDW_LOG_LEVEL`	No	Logging level (default: `INFO`)

The server and database default to the UCSF CDW deployment. Set the env vars only to override (e.g. a different host or a development database).

Use with BioRouter

CDWAgent is a standard stdio MCP server, so it registers as a BioRouter Extension exactly like UCSFOMOPAgent does — no BioRouter-specific code needed.

Add this block to ~/.config/biorouter/config.yaml:

extensions:
  cdwagent:
    type: stdio
    name: CDWAgent
    description: UF Epic Caboodle de-identified Clinical Data Warehouse (SQL Server, read-only)
    enabled: true
    cmd: uvx
    args: ["--from", "git+https://github.com/BaranziniLab/CDWAgent", "cdwagent"]
    timeout: 600
    envs:
      CLINICAL_RECORDS_USERNAME: "your-username"
      CLINICAL_RECORDS_PASSWORD: "your-password"
      CDW_SCHEMA: "deid_uf"

Server and database are hard-coded to the UCSF CDW deployment; override with CLINICAL_RECORDS_SERVER / CLINICAL_RECORDS_DATABASE only if needed.

Or via CLI:

biorouter session --with-extension "CLINICAL_RECORDS_USERNAME=... CLINICAL_RECORDS_PASSWORD=... uvx --from git+https://github.com/BaranziniLab/CDWAgent cdwagent"

Tip — pairing with OMOPAgent: enable both extensions to translate between the two clinical data representations. Ask BioRouter "for OMOP person_id 12345, pull lab trends from the CDW side" and it will call CDW-crossmap_patient then CDW-get_labs. See docs/BIOROUTER.md for operational details (timeouts, malware check, tool-name disambiguation).

Context Strategy (LLM dispatch optimization)

CDW Epic Caboodle uses a proprietary schema the LLM does not know from its training data (unlike OMOP CDM, where OHDSI terms are well-known). To minimize roundtrips and context usage, CDWAgent ships schema context at two layers:

MCP server instructions — a concise overview of the 14 most-used tables, patient identifier rules, date-column mapping per fact table, and the cohort subquery pattern. Sent once at session init via InitializeResult.instructions (FastMCP feature). BioRouter and other MCP clients fold this into the LLM's system prompt. Net effect: the LLM knows the schema the moment it picks any CDW tool, without a get_database_overview roundtrip.
Tool descriptions — kept short (~150 words each). Only the single most common failure mode (schema-qualification with deid_uf.) is repeated in the query tool description as a banner, since it is the top error source. Everything else lives in the server instructions.

Long tail: 139 total tables, ~5000 columns. Full listing is available on-demand via get_database_overview and describe_table — not pushed into the system prompt.

This is the generic pattern for MCPs targeting non-standard schemas. Thin tool descriptions + rich server instructions keeps turn-by-turn context small (tool descriptions are sent on every LLM turn; instructions are sent once) while still providing the context the LLM needs up front.

Schema Reference

Schema discovery tools (get_database_overview, describe_table, search_schema) read from a pre-parsed JSON at src/cdwagent/data/schema_reference.json (bundled inside the Python package so uvx installs work out of the box) — no database connection is required for schema exploration. The JSON contains only structural metadata: table names, column names, data types, and descriptions. No patient data, no institutional identifiers.

The source Epic Caboodle data dictionary (.xlsx) is intentionally NOT bundled with this repository. It is a local governance artifact of each institution. The committed JSON is a derived representation — everything CDWAgent needs at runtime — but the original xlsx stays under institutional control.

If you need to regenerate src/cdwagent/data/schema_reference.json from an updated dictionary, obtain the xlsx through your institution's CDW governance channel and run:

uv run python scripts/parse_data_dictionary.py /path/to/deid_uf_data_dictionary.xlsx

Project Structure

src/cdwagent/
├── __init__.py          # Package exports
├── __main__.py          # python -m cdwagent
├── cli.py               # CLI entry point
├── server.py            # FastMCP instance, tool registration, prompts
├── config.py            # Pydantic configuration models
├── db.py                # Per-query pymssql connection management
├── validation.py        # SQL read-only validation
└── tools/
    ├── schema.py        # Schema discovery tools
    ├── queries.py       # Query execution, clinical record retrieval, OMOP→CDW crossmap
    ├── notes.py         # Clinical notes search and retrieval
    ├── export.py        # CSV export
    ├── concepts.py      # Diagnosis/medication/procedure code search
    └── stats.py         # Table and cohort summary statistics

Security Policy

Read-Only Enforcement

All SQL queries are validated before execution by ClinicalQueryValidator:

Only SELECT, WITH, and DECLARE statements are allowed
Write operations (INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, EXEC, MERGE, CREATE) are blocked
Semicolons are rejected to prevent statement chaining
Queries are validated after stripping SQL comments

Credential Handling

Database credentials are passed via environment variables, never hardcoded
BioRouter stores credentials via envs (inline) or env_keys (OS keyring) in its config
No credentials are logged or included in tool responses

Disclaimer

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. The authors (Gianmarco Bellucci and Wanjun Gu) make no representations or warranties regarding the accuracy, completeness, or reliability of the software or its outputs.

Important notices:

This tool is designed for research purposes only and is not intended for clinical decision-making or direct patient care.
The authors are not responsible for any consequences arising from the use or misuse of this software, including but not limited to: incorrect query results, data misinterpretation, security incidents, or regulatory non-compliance.
Users are solely responsible for ensuring their use of this software complies with all applicable institutional policies, data use agreements, IRB protocols, and privacy regulations (including HIPAA where applicable).
The read-only SQL validation provides a defense-in-depth layer but should not be the sole security control. Database-level permissions and network controls should be configured independently.
Clinical data accessed through this tool is de-identified per the source data warehouse configuration. Users must not attempt to re-identify patients.

License

MIT

Acknowledgments

MedCP — architecture template by the UCSF Baranzini Lab.
UCSFOMOPAgent — sibling agent for the OMOP CDM, which CDWAgent is designed to pair with inside BioRouter.
BioRouter — agent framework (a fork of Block's Goose) that coordinates clinical MCP agents.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
docs		docs
scripts		scripts
src/cdwagent		src/cdwagent
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
README.md		README.md
logo.png		logo.png
manifest.json		manifest.json
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

CDWAgent

BioRouter Extension

Authors

Features

Tools

Schema Discovery (3)

Clinical Queries (7)

Clinical Notes (4)

Concept Search (4)

Data Export (1)

Statistics (2)

Guided Prompts

Validation

Installation

Requirements

Quick install (uvx)

From source

Run as a module

Configuration

Use with BioRouter

Context Strategy (LLM dispatch optimization)

Schema Reference

Project Structure

Security Policy

Read-Only Enforcement

Credential Handling

Disclaimer

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages