Data-Forge

Data-Forge is an asynchronous service for generating reference files (starting with Kerchunk) for large climate datasets. It converts local NetCDF-style inputs into cloud-friendly reference files and writes them to local filesystem or S3 destinations.

Features

  • Asynchronous job handling and monitoring
  • Kerchunk reference file generation (NetCDF to Zarr references)
  • Kerchunk output: local filesystem or S3
  • REST API and CLI for job submission and tracking
  • No internal storage: the service writes directly to user-managed destinations
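
The reference files follow Kerchunk's JSON layout: a `version` field plus a `refs` mapping from Zarr chunk keys to either inline data or `[url, offset, length]` byte ranges into the source files. A minimal sketch of that structure (the helper name, variable key, and S3 path are illustrative, not Data-Forge API):

```python
import json

def make_reference(chunks):
    """Build a minimal Kerchunk-style reference mapping.

    `chunks` maps a Zarr chunk key (e.g. "tas/0.0.0") to a
    (url, offset, length) triple into the original NetCDF file.
    """
    return {
        "version": 1,
        "refs": {
            # Inline Zarr group metadata stored directly in the reference file
            ".zgroup": '{"zarr_format": 2}',
            # Byte-range references into the source files
            **{key: list(loc) for key, loc in chunks.items()},
        },
    }

ref = make_reference(
    {"tas/0.0.0": ("s3://bucket/dataset/file_0001.nc", 20480, 4096)}
)
print(json.dumps(ref, indent=2))
```

Because the result is plain JSON, it can be opened through fsspec's reference filesystem and read as a Zarr store without copying the underlying data.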

Typical Workflow

  1. Submit a job with local NetCDF files and parameters (e.g., chunking).
  2. Monitor job status and progress asynchronously via API/CLI.
  3. Download or access the generated Kerchunk reference files at your storage endpoint.
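
The same workflow can be driven over the REST API. A stdlib-only sketch of submitting a job; the `/jobs` endpoint path and the payload field names are assumptions for illustration, not the service's documented contract:

```python
import json
import urllib.request

def build_job_payload(inputs, concat_dims, metadata=None):
    """Assemble a JSON job-submission body (field names assumed)."""
    return {
        "inputs": list(inputs),
        "concat_dims": list(concat_dims),
        "metadata": metadata or {},
    }

def submit_job(base_url, payload):
    """POST the job to a hypothetical /jobs endpoint and return the reply."""
    req = urllib.request.Request(
        f"{base_url}/jobs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_job_payload(
        ["./data/dataset/file_0001.nc"], ["time"], {"project": "CMIP6"}
    )
    # submit_job("http://localhost:8000", payload)  # needs a running API
```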

Example CLI Usage

# Submit a local NetCDF-to-Kerchunk job
$ data-forge submit \
  --input "./data/dataset/*.nc" \
  --concat-dims time \
  --metadata '{"project": "CMIP6"}'

# Monitor job progress
$ data-forge status <job-id> --watch

# Get reference file URL or download
$ data-forge get-url <job-id>
$ data-forge download <job-id> --output ./local_refs/

uv Setup

# Create a virtual environment
uv venv
# Install all dependency groups
uv sync --all-groups
# Run the test suite
uv run pytest -vvv

High-Level Architecture

  • API: FastAPI (REST endpoints), job monitoring/status, OpenAPI docs
  • Job Queue: Dramatiq + Redis (asynchronous processing)
  • Workers: Process Kerchunk conversion jobs
  • Output: Reference files are written to the local filesystem or S3
  • No Internal Storage: The service retains nothing; outputs go directly to user-managed destinations
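
The no-internal-storage design means a worker's only artifact is the reference file it writes to the caller's destination. A stdlib-only sketch of that final step (the function name and layout are illustrative; a real worker would write through a storage abstraction so S3 destinations work the same way):

```python
import json
from pathlib import Path

def write_reference(refs: dict, destination: str) -> str:
    """Write a finished Kerchunk reference mapping straight to the
    user-managed destination; nothing is kept by the service."""
    out = Path(destination)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(refs, indent=2))
    return str(out)

# e.g. write_reference({"version": 1, "refs": {}}, "./local_refs/dataset.json")
```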

Roadmap

  • Remote input support
  • STAC / ESGF publish integration
  • Globus Auth
  • Dask-based scaling

Deployment

  • Docker Compose for local/single-node deployment
  • Helm chart for Kubernetes (production, scalable)
  • Minimal required services: API, worker(s), Redis

Documentation

See the docs/ directory for:

  • Full user guide and CLI reference
  • API specification (OpenAPI/Swagger)
  • Deployment guides (Docker, Kubernetes)
  • Architecture/design docs
  • Contribution instructions

Data-Forge aims to make FAIR, cloud-optimized data publishing simple and scalable for the global climate data community.
