SiEval

SiEval is a model delivery quality verification system with an asynchronous streaming evaluation engine, iterative feedback loop, and resilient sharded persistence. It verifies the entire model delivery pipeline — training → conversion → inference → evaluation.

Features

Asynchronous streaming — process samples concurrently without waiting for batch completion
Iterative feedback loop — multi-turn evaluation with feedback
Resilient persistence — sharded, append-only storage for crash recovery
11 mainstream benchmarks — AIME 2024/2025, DROP, GPQA-Diamond, HumanEval, IFEval, LiveCodeBench, MATH-500, MMLU, MMLU-Pro, T-Eval (math, code, reasoning, knowledge, instruction-following, tool-use)
Type-safe pipelines — fully typed task stages (preprocess → infer → postprocess → feedback)
YAML-based configuration — batch evaluation with model derivation and quota allocation
Inference orchestration — recipe-driven inference with auto-resolve and backend abstraction (vLLM, SGLang)
Anomaly detection — built-in detection rules for output quality, performance, and correctness
Profiling — stage timing, I/O metrics, and token usage tracking
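
The first and third features above (streaming without batch barriers, and sharded append-only storage for crash recovery) can be pictured with a small, dependency-free sketch. It is an illustration of the general technique, not SiEval's implementation: the shard count, file naming, record layout, and the helpers `shard_path`, `load_done`, `fake_infer`, and `evaluate` are all assumptions made for this example, and it uses the standard-library `asyncio` rather than `anyio` so it runs without extra packages.

```python
import asyncio
import json
import tempfile
from pathlib import Path

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this sketch


def shard_path(result_dir: Path, sample_id: int) -> Path:
    """Map a sample to one of a fixed number of JSONL shard files."""
    return result_dir / f"shard-{sample_id % NUM_SHARDS:02d}.jsonl"


def load_done(result_dir: Path) -> set[int]:
    """Collect ids of samples already persisted, ignoring torn lines."""
    done: set[int] = set()
    for path in result_dir.glob("shard-*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # partial write from a crash; that sample is re-run
    return done


async def fake_infer(sample_id: int) -> str:
    """Stand-in for a model call; latency differs per sample."""
    await asyncio.sleep(0.001 * (sample_id % 3))
    return f"answer-{sample_id}"


async def evaluate(sample_ids: list[int], result_dir: Path, limit: int = 8) -> int:
    """Run pending samples concurrently; return how many were processed."""
    result_dir.mkdir(parents=True, exist_ok=True)
    done = load_done(result_dir)
    pending = [s for s in sample_ids if s not in done]
    semaphore = asyncio.Semaphore(limit)  # caps in-flight model calls

    async def run_one(sample_id: int) -> None:
        async with semaphore:
            output = await fake_infer(sample_id)
        # Append as soon as this sample finishes, not when the batch does.
        with shard_path(result_dir, sample_id).open("a") as fh:
            fh.write(json.dumps({"id": sample_id, "output": output}) + "\n")

    await asyncio.gather(*(run_one(s) for s in pending))
    return len(pending)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        print(asyncio.run(evaluate(list(range(10)), out)))  # first run: 10 samples
        print(asyncio.run(evaluate(list(range(12)), out)))  # resume: 2 new samples
```

Because every finished sample is appended on its own line, a crash loses at most the line being written, and a restart only has to re-run samples whose ids are missing from the shards.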

Installation

Requirements: Unix (Linux, macOS), Python ≥ 3.12, PDM (recommended) or pip

git clone https://github.com/scitix/sieval.git
cd sieval
pdm install          # or: pip install -e .

Optional extras (per-benchmark dependencies):

pip install -e ".[math]"     # AIME 2024/2025, MATH-500 (math-verify)
pip install -e ".[drop]"     # DROP (numpy, scipy)
pip install -e ".[ifeval]"   # IFEval (absl, langdetect, nltk, immutabledict)
pip install -e ".[t-eval]"   # T-Eval (numpy, sentence-transformers)
pip install -e ".[math,drop,ifeval,t-eval]"   # all extras at once

Quick Start

Dataset paths below use HuggingFace repo ids for HF-sourced datasets (e.g. HuggingFaceH4/aime_2024) and ${SIEVAL_DATA_DIR}/<name> for URL-sourced datasets. Set SIEVAL_DATA_DIR (default ~/.sieval/data) before running any command that resolves a URL-sourced dataset.
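
To make the two path styles concrete, here is a minimal sketch of how a `${SIEVAL_DATA_DIR}/<name>` path could be expanded while a HuggingFace repo id is passed through unchanged. The helper `resolve_dataset_path` and the dataset name `drop` in the usage lines are hypothetical; the README does not show how SiEval itself resolves these paths, so treat this as an illustration of the convention only.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

DEFAULT_DATA_DIR = "~/.sieval/data"  # default stated in this README
PREFIX = "${SIEVAL_DATA_DIR}"


def resolve_dataset_path(path: str, env: Mapping[str, str] | None = None) -> str:
    """Expand a ${SIEVAL_DATA_DIR}-prefixed path; leave anything else as-is."""
    env = os.environ if env is None else env
    data_dir = env.get("SIEVAL_DATA_DIR") or DEFAULT_DATA_DIR
    if path.startswith(PREFIX):
        relative = path[len(PREFIX):].lstrip("/")
        return str(Path(data_dir).expanduser() / relative)
    return path  # assumed to be a HuggingFace repo id


if __name__ == "__main__":
    print(resolve_dataset_path("HuggingFaceH4/aime_2024"))
    print(resolve_dataset_path("${SIEVAL_DATA_DIR}/drop", {"SIEVAL_DATA_DIR": "/data"}))
```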

Start from an example — two-step flow:

cp examples/quickstart.yaml eval.yaml
$EDITOR eval.yaml           # set model checkpoint + container image

# Step 1: stage the data
sieval dataset download aime_2024

# Step 2: run eval
sieval eval eval.yaml

See examples/README.md for more scenarios (leaderboard, recipe overrides) and examples/hardware/ for hardware-pinned reference configs.

Discover tasks / datasets:

sieval dataset list                   # registered datasets + licenses + download status
sieval task list --domain Mathematics # filter tasks by domain
sieval dataset show aime_2024         # dataset detail, incl. the `path:` value to paste into your YAML
sieval dataset download aime_2024     # stage data into $SIEVAL_DATA_DIR

All-in-one (launch inference, evaluate, cleanup — recommended entry point):

sieval run config.yaml
sieval run config.yaml --resume

Evaluate against an already-online endpoint:

sieval eval leaderboards/sft_fast_202511.yaml --model gpt-4o
# `sieval eval` is a shortcut for the underlying resource verb:
sieval leaderboard run leaderboards/sft_fast_202511.yaml --model gpt-4o

Inference management:

sieval infer start /path/to/Qwen3-8B          # auto-resolve and launch
sieval infer list                               # show running services
sieval infer logs qwen3-8b -f                   # stream engine logs
sieval infer stop qwen3-8b                      # graceful shutdown

Programmatic usage:

import anyio

from sieval.datasets import MMLUDataset
from sieval.tasks import MMLUZeroShotGenTask
from sieval.core.models import ChatModel
from sieval.core.runners import TaskRunner, TaskRunnerConfig

async def main():
    dataset = MMLUDataset("cais/mmlu")
    model = ChatModel("gpt-4o", max_retries=3, concurrency_limit=128)
    task = MMLUZeroShotGenTask(dataset=dataset, model=model)

    runner = TaskRunner(
        task=task,
        config=TaskRunnerConfig(result_dir="./outputs/mmlu", auto_resume=True),
    )
    results = await runner.arun()
    print(results)

anyio.run(main)
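
The stage order listed under Features (preprocess → infer → postprocess → feedback) and the multi-turn feedback loop can also be pictured without any SiEval imports. The sketch below is a toy model of the idea under stated assumptions: the `Pipeline` class, its field signatures, `max_turns`, and `fake_model` are hypothetical and are not SiEval's actual types or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Sample = TypeVar("Sample")
Prompt = TypeVar("Prompt")
Raw = TypeVar("Raw")
Result = TypeVar("Result")


@dataclass(frozen=True)
class Pipeline(Generic[Sample, Prompt, Raw, Result]):
    """Hypothetical typed pipeline: each stage declares its input and output."""

    preprocess: Callable[[Sample], Prompt]
    infer: Callable[[Prompt], Raw]
    postprocess: Callable[[Sample, Raw], Result]
    # Returns a follow-up prompt, or None when no further turn is needed.
    feedback: Callable[[Prompt, Result], Prompt | None]

    def run(self, sample: Sample, max_turns: int = 3) -> Result:
        prompt = self.preprocess(sample)
        result = self.postprocess(sample, self.infer(prompt))
        for _ in range(max_turns - 1):
            next_prompt = self.feedback(prompt, result)
            if next_prompt is None:
                break  # the feedback stage is satisfied
            prompt = next_prompt
            result = self.postprocess(sample, self.infer(prompt))
        return result


def fake_model(prompt: str) -> str:
    """Stand-in model: answers correctly only after being asked to re-check."""
    return "4" if "re-check" in prompt else "5"


pipeline = Pipeline[tuple[str, str], str, str, bool](
    preprocess=lambda s: f"Q: {s[0]}",
    infer=fake_model,
    postprocess=lambda s, raw: raw.strip() == s[1],
    feedback=lambda prompt, ok: None if ok else prompt + " Please re-check.",
)

if __name__ == "__main__":
    print(pipeline.run(("2 + 2 = ?", "4")))               # second turn fixes the answer
    print(pipeline.run(("2 + 2 = ?", "4"), max_turns=1))  # single turn, no feedback
```

Typing each stage separately lets a type checker catch a postprocess step that expects a different raw output than the infer step produces, before any sample is run.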

Documentation

Configuration Guide — YAML format, task pipeline, model resource pool, anomaly detection
Concurrency Control — four-level concurrency model
Profiling & Observability — stage timing, I/O metrics, token tracking
Inference Management — full infer subcommand reference

Contributing

See CONTRIBUTING.md for development setup, project architecture, code conventions, and the PR process.

License

Apache License 2.0 — see LICENSE for details.

Citation

@software{sieval2026,
  title = {SiEval: Asynchronous Streaming Evaluation Framework},
  author = {{ScitiX}},
  year = {2026},
  url = {https://github.com/scitix/sieval}
}
