WGLab/PubMind

PubMind is a large language model (LLM)-assisted framework for Publication Mutation and information Discovery, designed to extract variant–disease–pathogenicity relationships directly from biomedical literature.

PubMind is an AI-driven framework that uses large language models (LLMs) to extract genetic variant–disease–pathogenicity associations directly from biomedical literature. It combines fine-tuned BERT models for input filtering with instruction-tuned LLMs for extracting variant, disease, and functional evidence, covering SNVs, CNVs, SVs, and gene fusions. Extracted variants are normalized to genomic and transcript coordinates and stored in PubMind-DB, a web-accessible knowledgebase. Applied to >41M PubMed abstracts and >5M PMC full texts, PubMind-DB contains ~0.7M consolidated unique variants with rich annotations, of which only ~10% overlap with ClinVar—yet >80% of those show concordant pathogenicity labels, including full agreement for four-star expert-reviewed variants. PubMind provides a scalable, generalizable, and open-source framework that transforms unstructured text into structured genomic knowledge, supporting variant interpretation and precision medicine.

Prerequisites and Installation

Please refer to environment.yml and requirements.txt for required environments and packages. For installation, please use the two-step approach below:

conda env create -f environment.yml
conda activate pubmind
pip install -r requirements.txt

Run PubMind

Please refer to run_PubMind.ipynb for how to use PubMind. All inputs and outputs from this example run are in the example folder.

The PubMind framework includes the following modules:

  1. Filtering Module (fine-tuned BERT model)
    • Wangwpi/PubMind_finetuned_BERT (Hugging Face)
  2. Inference Module (instruction-tuned LLM)
    • meta-llama/Llama-3.3-70B-Instruct (Hugging Face)
  3. Normalization Module
    • Quality filter (gene name, pathogenicity)
    • Variant parser (cDNA, protein, RSID)
    • Map to transcript
    • Map to genome coordinates
    • MONDO Disease name
    • HPO term
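
To illustrate the kind of work the variant parser step does, here is a minimal sketch that sorts a variant mention into cDNA, protein, or RSID notation. It is not PubMind's actual parser: the `PATTERNS` table and the `classify_variant` function are hypothetical names, and the regular expressions cover only a few common HGVS-style forms.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# They are NOT taken from PubMind and cover only common notations.
PATTERNS = {
    # dbSNP identifiers such as rs113488022
    "rsid": re.compile(r"^rs\d+$", re.IGNORECASE),
    # cDNA-level changes such as c.1799T>A, c.35delG, c.76_78del
    "cdna": re.compile(
        r"^c\.[\d_+\-*]+(?:[ACGT]>[ACGT]|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)$"
    ),
    # Protein-level changes in three-letter code such as p.Val600Glu
    "protein": re.compile(
        r"^p\.\(?[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|\*|=|fs\S*)\)?$"
    ),
}


def classify_variant(mention: str) -> str:
    """Return 'rsid', 'cdna', 'protein', or 'unknown' for a variant string."""
    text = mention.strip()
    for label, pattern in PATTERNS.items():
        if pattern.match(text):
            return label
    return "unknown"


if __name__ == "__main__":
    for example in ["rs113488022", "c.1799T>A", "p.Val600Glu", "BRAF"]:
        print(example, "->", classify_variant(example))
```

A real parser would also need to handle one-letter amino acid codes, intronic offsets, and free-text descriptions, which this sketch deliberately leaves out.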

PubMind-DB

PubMind-DB can be accessed at https://pubmind.wglab.org/

Reference (Preprint)

Wang, P. and K. Wang (2025). PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models. bioRxiv: 2025.10.13.682183.

License

PubMind is freely available for academic use. For license details, please refer to this page.
