Scholarly publications play a vital role in developing hypotheses, research projects, reports, theses, and evidence-based policies. However, despite the rapid growth of scientific literature, much of this knowledge remains locked within unstructured formats such as PDFs and lengthy reports, limiting its discoverability, reuse, synthesis, and policy impact. The increasing volume of publications also makes it challenging for researchers to stay updated with emerging evidence. Furthermore, access to literature is often constrained by repository download limits and publisher restrictions on bulk retrieval.
The current scholarly communication model is largely centered on individual papers: users search for a publication, download it, and manually read and extract relevant information. This approach is increasingly insufficient for addressing large-scale research questions that require systematic analysis of thousands of documents.
The semanticClimate approach moves beyond traditional document access by making scholarly content semantically accessible. This enables not only human readers, including those who rely on audio or alternative formats, but also machines to discover, analyze, and connect knowledge automatically. Such semantic enrichment supports the creation of machine-readable corpora, knowledge graphs, and AI-assisted literature review workflows.
To support this vision, semanticClimate promotes a suite of open-source, Python-based toolkits for large-scale literature retrieval and corpus creation (pygetpapers), document processing (amilib), and semantic extraction of entities such as species, locations, chemical compounds, and other climate-relevant concepts through document analysis and named entity recognition (NER) workflows. Together, these tools provide open and reproducible infrastructure for large-scale evidence synthesis, interdisciplinary research, and the transformation of scholarly knowledge beyond static PDF and text formats.
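To make the idea of semantic extraction concrete, here is a minimal sketch of dictionary-based entity tagging using only the Python standard library. It does not use or reproduce the pygetpapers, amilib, or docanalysis APIs. The `DICTIONARIES` term lists, the `tag_entities` function, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries (assumption); real workflows would load
# curated term lists for each entity type.
DICTIONARIES = {
    "species": ["Oryza sativa", "Zea mays"],
    "location": ["Bangladesh", "Sahel"],
    "chemical": ["methane", "carbon dioxide"],
}


def tag_entities(text, dictionaries):
    """Return (label, term, offset) tuples for every dictionary term found in text."""
    hits = []
    for label, terms in dictionaries.items():
        for term in terms:
            # Whole-word, case-insensitive match of the literal term.
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append((label, term, match.start()))
    # Order hits by their position in the text.
    return sorted(hits, key=lambda hit: hit[2])


sample = (
    "Methane is emitted from Oryza sativa paddies in Bangladesh, "
    "alongside carbon dioxide."
)
entities = tag_entities(sample, DICTIONARIES)
counts = Counter(label for label, _, _ in entities)

for label, term, offset in entities:
    print(f"{offset:>3}  {label:<9} {term}")
```

Simple term lists like these are easy to inspect and share, which suits a reproducible workflow. A statistical NER model would likely find variants that a fixed dictionary misses.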
Beyond technology development, the project contributes to building open knowledge infrastructure and strengthening research capacity. By providing accessible tools, workflows, and training resources, semanticClimate supports students, early-career researchers, librarians, and domain experts in developing skills for open, machine-readable, and AI-enabled scholarship.
- Extract knowledge from scholarly publications using semantic tools
- Convert research outputs into machine-readable formats
- Support AI-assisted literature reviews
- Create reusable semantic resources and knowledge graphs
- Promote open scholarship and reproducible research
- Provide training materials and community learning resources
- Foster collaboration between researchers, librarians, data scientists, and students
This project explores and develops workflows using:
- Python
- Jupyter Notebooks
- amilib
- pygetpapers
- docanalysis
- Wikidata
- Knowledge Graph Technologies
- Natural Language Processing (NLP)
- Large Language Models (LLMs)
- GitHub for Open Collaboration
- Extraction of a structured climate ontology as a knowledge graph
- Interrogation of 15,000 pages of the IPCC AR6 reports (and, we hope, emerging releases of AR7) alongside the current open scientific literature
- Enrichment with trusted knowledge (IPCC, publications, Wikipedia)
- Development of scholarly tech: semanticCorpus (2026) holds a searchable collection of articles and links, with metadata management
- Development of an encyclopedia/knowledge_graph: a generally accessible technology that can give non-experts answers within an hour
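As a rough illustration of what representing an ontology as a knowledge graph can mean in practice, the sketch below stores subject-predicate-object triples in memory and walks `subclass_of` links transitively. It is not the project's implementation. The `TripleStore` class, the predicate names, and the example triples are assumptions made for this example.

```python
from collections import defaultdict


class TripleStore:
    """A tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        # Maps (subject, predicate) -> set of objects.
        self._index = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._index[(subject, predicate)].add(obj)

    def objects(self, subject, predicate):
        """Return the objects linked from subject by predicate, sorted."""
        return sorted(self._index.get((subject, predicate), set()))

    def ancestors(self, node, predicate="subclass_of"):
        """Follow predicate edges transitively, breadth-first, skipping repeats."""
        seen, queue = [], [node]
        while queue:
            current = queue.pop(0)
            for parent in self.objects(current, predicate):
                if parent not in seen:
                    seen.append(parent)
                    queue.append(parent)
        return seen


# Illustrative triples only (assumption), not extracted from any corpus.
graph = TripleStore()
graph.add("methane", "subclass_of", "greenhouse gas")
graph.add("carbon dioxide", "subclass_of", "greenhouse gas")
graph.add("greenhouse gas", "subclass_of", "gas")
graph.add("methane", "mentioned_in", "IPCC AR6")

print(graph.ancestors("methane"))
print(graph.objects("methane", "mentioned_in"))
```

A dedicated graph library or a triple store would be the more realistic choice at scale. The point here is only that an ontology reduces to typed links that a machine can traverse.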
- Beginner tutorials
- AI-assisted literature review workflows
- Keyphrase extraction
- Knowledge graph demonstrations
- FORCE11 community resources
- Training materials from the semanticClimate community
If you use materials from this repository, please cite the project and acknowledge the FORCE11 Task Group.
Apache License, Version 2.0, January 2004 (http://www.apache.org/licenses/). License information: LICENSE