SiEval is a model delivery quality verification system with an asynchronous streaming evaluation engine, iterative feedback loop, and resilient sharded persistence. It verifies the entire model delivery pipeline — training → conversion → inference → evaluation.
- Asynchronous streaming — process samples concurrently without waiting for batch completion
- Iterative feedback loop — multi-turn evaluation with feedback
- Resilient persistence — sharded, append-only storage for crash recovery
- 11 mainstream benchmarks — AIME 2024/2025, DROP, GPQA-Diamond, HumanEval, IFEval, LiveCodeBench, MATH-500, MMLU, MMLU-Pro, T-Eval (math, code, reasoning, knowledge, instruction-following, tool-use)
- Type-safe pipelines — fully typed task stages (preprocess → infer → postprocess → feedback)
- YAML-based configuration — batch evaluation with model derivation and quota allocation
- Inference orchestration — recipe-driven inference with auto-resolve and backend abstraction (vLLM, SGLang)
- Anomaly detection — built-in detection rules for output quality, performance, and correctness
- Profiling — stage timing, I/O metrics, and token usage tracking
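
The streaming and persistence ideas in the list above can be illustrated with a minimal, self-contained sketch. It is not SiEval's implementation: `ShardedLog`, `score_sample`, `evaluate`, and the JSONL shard layout are hypothetical names and choices made for illustration, using only the Python standard library. Samples are scored concurrently, each result is appended to a shard file as soon as it finishes, and a rerun skips ids already on disk.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class ShardedLog:
    """Append-only JSONL store split across shard files (hypothetical layout)."""

    def __init__(self, root: Path, num_shards: int = 4) -> None:
        self.root = root
        self.num_shards = num_shards
        root.mkdir(parents=True, exist_ok=True)

    def _shard(self, sample_id: int) -> Path:
        return self.root / f"shard-{sample_id % self.num_shards:03d}.jsonl"

    def append(self, record: dict) -> None:
        # One JSON object per line; existing lines are never rewritten.
        with self._shard(record["id"]).open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def done_ids(self) -> set[int]:
        # Recovery: collect the ids of every record already on disk.
        ids: set[int] = set()
        for path in self.root.glob("shard-*.jsonl"):
            for line in path.read_text(encoding="utf-8").splitlines():
                if not line.strip():
                    continue
                try:
                    ids.add(json.loads(line)["id"])
                except json.JSONDecodeError:
                    continue  # skip a line truncated by a crash
        return ids


async def score_sample(sample_id: int) -> dict:
    """Stand-in for model inference plus scoring (hypothetical)."""
    await asyncio.sleep(0.01 * (sample_id % 3))
    return {"id": sample_id, "score": sample_id % 2}


async def evaluate(sample_ids: list[int], log: ShardedLog, limit: int = 8) -> int:
    """Score pending samples concurrently; return how many were newly processed."""
    done = log.done_ids()
    pending = [i for i in sample_ids if i not in done]
    semaphore = asyncio.Semaphore(limit)  # cap the number of in-flight samples

    async def bounded(sample_id: int) -> dict:
        async with semaphore:
            return await score_sample(sample_id)

    tasks = [asyncio.create_task(bounded(i)) for i in pending]
    for finished in asyncio.as_completed(tasks):
        # Persist each result as soon as it is ready, not at the end of a batch.
        log.append(await finished)
    return len(pending)


def demo_resume() -> tuple[int, int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        log = ShardedLog(Path(tmp))
        first = asyncio.run(evaluate(list(range(10)), log))
        # Second run simulates a resume: only the two new ids are scored.
        second = asyncio.run(evaluate(list(range(12)), log))
        return first, second, len(log.done_ids())
```

Because every finished sample is written immediately, an interrupted run loses at most the samples that were still in flight, which is the property the append-only design is meant to provide.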
Requirements: Unix (Linux, macOS), Python ≥ 3.12, PDM (recommended) or pip
git clone https://github.com/scitix/sieval.git
cd sieval
pdm install # or: pip install -e .

Optional extras (per-benchmark dependencies):
pip install -e ".[math]" # AIME 2024/2025, MATH-500 (math-verify)
pip install -e ".[drop]" # DROP (numpy, scipy)
pip install -e ".[ifeval]" # IFEval (absl, langdetect, nltk, immutabledict)
pip install -e ".[t-eval]" # T-Eval (numpy, sentence-transformers)
pip install -e ".[math,drop,ifeval,t-eval]" # all extras at once

Dataset paths below use HuggingFace repo ids for HF-sourced datasets (e.g. `HuggingFaceH4/aime_2024`) and `${SIEVAL_DATA_DIR}/<name>` for URL-sourced datasets. Set `SIEVAL_DATA_DIR` (default `~/.sieval/data`) before running any command that resolves a URL-sourced dataset.
Start from an example — two-step flow:
cp examples/quickstart.yaml eval.yaml
$EDITOR eval.yaml # set model checkpoint + container image
# Step 1: stage the data
sieval dataset download aime_2024
# Step 2: run eval
sieval eval eval.yaml

See examples/README.md for more scenarios (leaderboard, recipe overrides) and examples/hardware/ for hardware-pinned reference configs.
Discover tasks / datasets:
sieval dataset list # registered datasets + licenses + download status
sieval task list --domain Mathematics # filter tasks by domain
sieval dataset show aime_2024 # dataset detail, incl. the YAML `path:` to paste
sieval dataset download aime_2024 # stage data into $SIEVAL_DATA_DIR

All-in-one (launch inference, evaluate, cleanup; recommended entry point):
sieval run config.yaml
sieval run config.yaml --resume

Evaluate against an already-online endpoint:
sieval eval leaderboards/sft_fast_202511.yaml --model gpt-4o
# `sieval eval` is a shortcut for the underlying resource verb:
sieval leaderboard run leaderboards/sft_fast_202511.yaml --model gpt-4o

Inference management:
sieval infer start /path/to/Qwen3-8B # auto-resolve and launch
sieval infer list # show running services
sieval infer logs qwen3-8b -f # stream engine logs
sieval infer stop qwen3-8b # graceful shutdown

Programmatic usage:
import anyio
from sieval.datasets import MMLUDataset
from sieval.tasks import MMLUZeroShotGenTask
from sieval.core.models import ChatModel
from sieval.core.runners import TaskRunner, TaskRunnerConfig

async def main():
    # HF-sourced dataset, referenced by its HuggingFace repo id
    dataset = MMLUDataset("cais/mmlu")
    model = ChatModel("gpt-4o", max_retries=3, concurrency_limit=128)
    task = MMLUZeroShotGenTask(dataset=dataset, model=model)
    runner = TaskRunner(
        task=task,
        config=TaskRunnerConfig(result_dir="./outputs/mmlu", auto_resume=True),
    )
    results = await runner.arun()
    print(results)

anyio.run(main)

- Configuration Guide — YAML format, task pipeline, model resource pool, anomaly detection
- Concurrency Control — four-level concurrency model
- Profiling & Observability — stage timing, I/O metrics, token tracking
- Inference Management — full infer subcommand reference
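
To make the stage-timing idea from the Profiling & Observability entry concrete, here is a minimal sketch of a per-stage timer built on `time.perf_counter`. `StageTimer` and the stage names are illustrative assumptions, not SiEval's profiling API; the linked guide remains the reference for the real interface.

```python
import time
from collections import defaultdict
from collections.abc import Iterator
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)
        self.counts: dict[str, int] = defaultdict(int)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean(self, name: str) -> float:
        return self.totals[name] / self.counts[name]


def demo_timing() -> StageTimer:
    timer = StageTimer()
    for _ in range(3):
        with timer.stage("preprocess"):
            sum(range(1000))  # stand-in for prompt construction
        with timer.stage("infer"):
            time.sleep(0.01)  # stand-in for a model call
    return timer
```

Calling `demo_timing().mean("infer")` then gives the average time spent in the stand-in inference stage across the three iterations.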
See CONTRIBUTING.md for development setup, project architecture, code conventions, and the PR process.
Apache License 2.0 — see LICENSE for details.
@software{sieval2026,
  title = {SiEval: Asynchronous Streaming Evaluation Framework},
  author = {{ScitiX}},
  year = {2026},
  url = {https://github.com/scitix/sieval}
}