PubMind is a large language model (LLM)-assisted framework for Publication Mutation and information Discovery, designed to extract variant–disease–pathogenicity relationships directly from biomedical literature.
PubMind combines fine-tuned BERT models for input filtering with instruction-tuned LLMs that extract variant, disease, and functional evidence, covering SNVs, CNVs, SVs, and gene fusions. Extracted variants are normalized to genomic and transcript coordinates and stored in PubMind-DB, a web-accessible knowledgebase. Applied to >41M PubMed abstracts and >5M PMC full texts, PubMind-DB contains ~0.7M consolidated unique variants with rich annotations. Only ~10% of these overlap with ClinVar, yet >80% of the overlapping variants show concordant pathogenicity labels, including full agreement for four-star expert-reviewed variants. PubMind provides a scalable, generalizable, and open-source framework that transforms unstructured text into structured genomic knowledge, supporting variant interpretation and precision medicine.
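To make the overlap and concordance figures above concrete, here is a minimal sketch of how such fractions could be computed from two label tables. It is not the authors' evaluation code: the function name, the variant keys, and the toy labels are all hypothetical, and a real comparison would need normalized variant identifiers and a mapping between label vocabularies.

```python
from typing import Dict, Tuple


def overlap_and_concordance(
    extracted: Dict[str, str], reference: Dict[str, str]
) -> Tuple[float, float]:
    """Return (fraction of extracted variants also in the reference,
    fraction of those shared variants whose labels agree)."""
    shared = extracted.keys() & reference.keys()
    overlap_fraction = len(shared) / len(extracted) if extracted else 0.0
    concordant = sum(1 for key in shared if extracted[key] == reference[key])
    concordance_fraction = concordant / len(shared) if shared else 0.0
    return overlap_fraction, concordance_fraction


# Toy data only (hypothetical keys and labels, not taken from PubMind-DB or ClinVar).
extracted = {
    "var1": "pathogenic",
    "var2": "benign",
    "var3": "pathogenic",
    "var4": "benign",
}
reference = {
    "var1": "pathogenic",
    "var2": "pathogenic",
    "var5": "benign",
}

overlap, concordance = overlap_and_concordance(extracted, reference)
print(f"overlap={overlap:.2f}, concordance={concordance:.2f}")
```

In this toy case two of the four extracted variants appear in the reference, and one of those two carries the same label.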
Please refer to environment.yml and requirements.txt for required environments and packages. For installation, please use the two-step approach below:
conda env create -f environment.yml
conda activate pubmind
pip install -r requirements.txt

Please refer to run_PubMind.ipynb for how to use PubMind. All inputs and outputs from this example PubMind run are in the example folder.
The PubMind framework includes the following modules:
- Filtering Module (fine-tuned BERT model)
  - Wangwpi/PubMind_finetuned_BERT (Hugging Face)
- Inference Module (instruction-tuned LLM)
  - meta-llama/Llama-3.3-70B-Instruct (Hugging Face)
- Normalization Module
  - Quality filter (gene name, pathogenicity)
  - Variant parser (cDNA, protein, RSID)
  - Map to transcript
  - Map to genome coordinates
  - MONDO Disease name
  - HPO term
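
As an illustration of what the variant parser step in the Normalization Module has to deal with, the following sketch classifies a variant string as cDNA, protein, or RSID notation using simplified regular expressions. This is an assumption-laden example rather than PubMind's actual parser: the function name `classify_variant` and the patterns are hypothetical, and they cover only simple substitutions, not the full HGVS syntax (deletions, insertions, frameshifts, and so on).

```python
import re
from typing import Optional

# Simplified, illustrative patterns; real HGVS nomenclature is much richer.
CDNA_SUBSTITUTION = re.compile(r"^c\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$")
PROTEIN_SUBSTITUTION = re.compile(
    r"^p\.(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2}|\*|=)$"
)
RSID = re.compile(r"^rs(?P<id>\d+)$", re.IGNORECASE)


def classify_variant(text: str) -> Optional[dict]:
    """Classify a variant mention; return None if no pattern matches."""
    token = text.strip()

    match = CDNA_SUBSTITUTION.match(token)
    if match:
        return {
            "type": "cdna",
            "position": int(match["pos"]),
            "ref": match["ref"],
            "alt": match["alt"],
        }

    match = PROTEIN_SUBSTITUTION.match(token)
    if match:
        return {
            "type": "protein",
            "position": int(match["pos"]),
            "ref": match["ref"],
            "alt": match["alt"],
        }

    match = RSID.match(token)
    if match:
        # Normalize the prefix to lowercase "rs".
        return {"type": "rsid", "id": f"rs{match['id']}"}

    return None


for mention in ["c.35G>A", "p.Gly12Asp", "RS12345", "not a variant"]:
    print(mention, "->", classify_variant(mention))
```

Returning `None` for unrecognized strings lets a downstream quality filter decide whether to drop the mention or pass it to a more permissive parser.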
PubMind-DB can be accessed at https://pubmind.wglab.org/
Wang, P. and K. Wang (2025). PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models. bioRxiv: 2025.10.13.682183.
PubMind is freely available for academic use. For license details, please refer to this page.
