
Update with the new mainstream structure#5

Open
Jeronymous wants to merge 77 commits into main from merge_hf_main

Conversation

@Jeronymous
Member

No description provided.

NathanHB and others added 30 commits October 14, 2025 16:06
* option1

* also debugging the judge

* also debugging the judge

* debug

* eval tracker fix 1

* likely fix for the GSM+ issue

* stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites

* more memory for flow judge
…several combinations (huggingface#1017)

* fix

* added a warning message

* fix unit tests

* fix unit tests 2

* mini fix

* minifix

* test

* update new metrics name

* updated var names
…huggingface#828)

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* homogenize k and n in parametrizable metrics

* updated aime, last metric fixes

* fix

* restore rm import

* restore

* update doc

* gpqa fix

* pass at

* recall

* test
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
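The k/n homogenization above concerns sampling-based metrics such as pass@k. As a point of reference, here is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); this is a sketch, not lighteval's actual implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=2 generations of which c=1 is correct, pass@1 is 0.5, matching the intuition of picking one sample at random.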
* use inspect-ai to evaluate aime25 and gsm8k

* revert file

* working for 3 tasks

* parallel evals of tasks

* adds gpqa diamond to inspect

* move tasks to individual files

* move tasks to individual files

* enable extended tasks as well

* run pre-commit hook

* fix mkqa

* change extended suite to lighteval

* change extended suite to lighteval

* add metadata to tasks

* add metadata to tasks

* remove license notice and put docstring on top of file

* homogenize tags

* add docstring for all multilingual tasks

* add docstring for all multilingual tasks

* add name and dataset to metadata

* use TASKS_TABLE for multilingual tasks

* use TASKS_TABLE for default tasks

* use TASKS_TABLE for default tasks

* loads all tasks correctly

* move community tasks to default tasks and update doc

* move community tasks to default tasks and update doc

* revert unneeded changes

* fix doc build

* fix doc build

* remove custom tasks and let user decide if loading multilingual tasks

* load-tasks multilingual fix

* update doc

* remove unneeded file

* update readme

* update readme

* update readme

* fix test

* add back the custom tasks

* add back the custom tasks

* fix tasks

* fix tasks

* fix tasks

* fix tests

* fix tests
Adds inspect-ai as a backend for lighteval! This offloads backend implementation and maintenance, and allows for:

- better logs
- better parallelization
- easier task addition

Tasks compatible with inspect-ai (eventually all tasks will be):

- gpqa (few-shot compatible)
- ifeval
- hle
- gsm8k (few-shot compatible)
- agieval
- aime24, aime25

### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`:

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

result:

```
|                                Model                                 |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|
```


### compare few-shot differences on gsm8k

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

```
|                                Model                                 |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|
```

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adds mmlu-pro

* adds mmlu-pro

* add mmlu-pro with inspectai
* fix reasoning effort
…#992)

* fix

* revert unneeded changes

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* run all hf-providers

* add example

* remove unneeded params
* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
…uggingface#1051)

* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* Remove suite argument in task config

* Remove suite argument in task config

* fix: try to cache functools.partial function

* fix styling
…gface#1052)

* add a task dump in registry for better documentation of tasks

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix

* remove

* fix aimo

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
set(1,2,3) -> {1,2,3}

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>
Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>
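The `set(1,2,3) -> {1,2,3}` change above is a pure style fix (assuming the original code passed an iterable, e.g. `set([1, 2, 3])`; `set(1, 2, 3)` itself would be a TypeError):

```python
# A set literal is equivalent to calling set() on an iterable,
# but skips allocating the throwaway list/tuple argument.
via_constructor = set([1, 2, 3])
via_literal = {1, 2, 3}
assert via_constructor == via_literal
# Note: {} is an empty dict, so the empty set still needs set().
empty = set()
```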
Even though vLLM exposes an OpenAI-compatible endpoint, to make it work you have to set the provider to hosted_vllm and add a hosted_vllm prefix before the model name.
moves all the prompts from `default_prompts.py` to their respective task file
paulinebm and others added 19 commits December 10, 2025 20:45
Commented out the entire workflow configuration for the PR Style Bot.
Commenting out the entire workflow configuration for the PR Style Bot.
* Refactor PR Style Bot workflow with new inputs

Updated the PR Style Bot workflow to enhance functionality and include new inputs for style commands and Python quality dependencies.

* Refactor PR Style Bot to use reusable action

Updated the PR Style Bot workflow to use a reusable style bot action from the huggingface_hub repository.

* Apply suggestion from @NathanHB
* enable use of data files for custom tasks

* addressing PR comments, create new doc file, update docstring with types

* Update docs/source/offline-evaluation.md

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* add new doc to toctree

* Add offline evaluation section to documentation

---------

Co-authored-by: David Biagioni <dbiagioni@proofpoint.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Dave Biagioni <7086434+davebiagioni@users.noreply.github.com>
* initial commit

* multi challenge impl, ready for review

* docstring fixes

* addressed comments

* addressed comments

* modifs to multiturn api for inspect

---------

Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* initial commit

* fixes for mathvista

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

* make style

---------

Co-authored-by: Omkar Kabde <omkarkabde@gmail.com>
* ready for review

* some fixes

* addressing comments + fixing multi turn

* fit the task in one file

---------

Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* [EVAL] BIG-Bench Extra Hard

* update

* fixing the prompt

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

---------

Co-authored-by: Jigyasu <jigyasu@outlook.in>
* add eval results tip

* Update docs/source/index.mdx
* Upgrade vLLM from 0.10.1.1 to 0.14.1

- Update pyproject.toml to vllm>=0.11.0
- Fix deprecated import: vllm.transformers_utils.tokenizer -> vllm.tokenizers
- Add comprehensive test suite for V1 engine compatibility
- Add smoke tests for quick validation

Changes:
- pyproject.toml: Updated vllm version constraint
- vllm_model.py: Updated get_tokenizer import path
- llm_as_judge.py: Updated get_tokenizer import path
- Added smoke_test_vllm_v11.py: Quick validation tests
- Added test_vllm_v1_compatibility.py: Comprehensive compatibility tests

All tests passing - V1 engine compatible, basic inference working.

* Fix vLLM slow test OOM by reducing GPU memory utilization and improving cleanup

The vLLM slow tests were failing with OOM errors when running after
accelerate tests. The issue was:
1. vLLM V1 engine requires a specific amount of free GPU memory at startup
2. After accelerate tests, only 5.89 GiB was free (out of 14.74 GiB)
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB

Fixes:
- Reduce gpu_memory_utilization from 0.6 to 0.35 in test config (needs 5.16 GiB)
- Add GPU memory cleanup fixture in conftest.py that runs before/after slow tests
- Improve AsyncVLLMModel.cleanup() to properly delete model object

The gpu_memory_utilization parameter only affects KV cache allocation and
does not impact model outputs with temperature=0.0, so this change is safe.
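
The arithmetic in the commit message above can be reproduced directly; `gpu_memory_utilization` is a fraction of total GPU memory that vLLM claims at startup (the helper name here is illustrative, not a vLLM API):

```python
def vllm_startup_claim_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """GiB of GPU memory the vLLM engine asks for at startup:
    a flat fraction of total device memory."""
    return total_gib * gpu_memory_utilization

# Numbers from the commit message: a 14.74 GiB device with 5.89 GiB free.
wanted_at_0_6 = vllm_startup_claim_gib(14.74, 0.6)    # ~8.84 GiB -> OOM
needed_at_0_35 = vllm_startup_claim_gib(14.74, 0.35)  # ~5.16 GiB -> fits
```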

* Fix vLLM CI test by increasing gpu_memory_utilization to 0.4

The CI test was failing with 'ValueError: To serve at least one request
with the model's max seq len (8192), 1.5 GiB KV cache is needed, which
is larger than the available KV cache memory (1.42 GiB).'

Root cause:
- Tesla T4 GPU (15.36 GB) in CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB available for KV cache
- Required 1.5 GiB for max_seq_len=8192
- Shortfall: 80 MB

Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for KV cache (sufficient for 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)

* Fix vLLM CI test and add GPU memory monitoring

This commit addresses two issues:

1. Fix vLLM engine initialization failure in CI
   - Root cause: Triton library requires Python.h headers to compile CUDA utilities
   - Solution: Install python3.10-dev package in CI workflow
   - Error was: 'fatal error: Python.h: No such file or directory'

2. Add comprehensive GPU memory monitoring for slow tests
   - Add _log_gpu_memory() helper function in conftest.py
   - Log GPU memory before/after each slow test (device, total, allocated, reserved, free)
   - Add memory logging to model cleanup methods:
     * VLLMModel.cleanup()
     * AsyncVLLMModel.cleanup()
     * TransformersModel.cleanup()
   - Shows memory freed during cleanup operations

This will help diagnose OOM issues and verify proper memory cleanup between tests.

Changes:
- .github/workflows/slow_tests.yaml: Add python3.10-dev installation step
- tests/conftest.py: Add GPU memory monitoring helper + enhanced fixture
- src/lighteval/models/vllm/vllm_model.py: Add memory logging to cleanup methods
- src/lighteval/models/transformers/transformers_model.py: Add memory logging to cleanup

* Fix vLLM CI: Add CUDA environment setup for FlashInfer JIT compilation

The vLLM test was failing because FlashInfer needs nvcc (CUDA compiler)
for JIT kernel compilation during warmup. The error was:
'RuntimeError: Could not find nvcc and default cuda_home="/usr/local/cuda" doesn't exist'

Fixes:
- Set CUDA_HOME=/usr/local/cuda-12.4 environment variable
- Add /usr/local/cuda-12.4/bin to PATH for nvcc access
- This allows FlashInfer to JIT-compile custom attention kernels

Previous fixes in this PR:
- ✅ Installed python3.10-dev for Python.h headers (Triton compilation)
- ✅ Increased gpu_memory_utilization from 0.35 to 0.4 for KV cache
- ✅ Added comprehensive GPU memory monitoring

GPU memory stats show plenty of free memory (14.71 GiB of 14.74 GiB),
so the issue is purely build-time tooling for JIT compilation.

* Fix vLLM CI: Pass CUDA environment variables to test subprocess

The vLLM v1 engine spawns subprocesses that don't inherit environment
variables set in . The previous fix set CUDA_HOME in the
GitHub Actions environment, but the vLLM EngineCore subprocess couldn't
access it, causing:

'/bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found'

Fix:
- Set CUDA_HOME and PATH directly in the test run command
- This ensures the environment variables are inherited by all subprocesses
- Now nvcc will be found during FlashInfer JIT compilation

The issue was subprocess environment isolation, not the parent environment.

* Install CUDA Toolkit 12.8 in CI for vLLM FlashInfer JIT compilation

- Add CUDA Toolkit 12.8 installation step to match nvidia-cuda-runtime-cu12==12.8.90
- Cache /usr/local/cuda-12.8 to speed up subsequent CI runs
- Add verification step to check nvcc availability
- Update CUDA_HOME and PATH to use CUDA 12.8
- Use export in test run to ensure subprocess inherits environment variables

This fixes the issue where vLLM v0.15.x with FlashInfer backend requires
nvcc at runtime for JIT compilation of CUDA kernels on Tesla T4 (SM 7.5).

Resolves: /bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found

* Fix vLLM v0.15.x API compatibility: use max_model_len instead of max_seq_len_to_capture

- Replace model.llm_engine.model_config.max_seq_len_to_capture with max_model_len
- Replace model.model_config.max_seq_len_to_capture with max_model_len for async model
- This attribute was renamed in vLLM v0.15.x

Fixes: AttributeError: 'ModelConfig' object has no attribute 'max_seq_len_to_capture'
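A version-tolerant way to express the rename above is a getattr fallback; this is a sketch with stub configs, not the actual lighteval code:

```python
def resolve_max_len(model_config) -> int:
    """Return the model's max sequence length across vLLM versions:
    max_seq_len_to_capture existed before v0.15.x, max_model_len after."""
    max_len = getattr(model_config, "max_seq_len_to_capture", None)
    if max_len is None:
        max_len = model_config.max_model_len
    return max_len
```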

* Fix vLLM v0.15.x generate() API: use prompts parameter instead of prompt_token_ids

- Replace prompt_token_ids= with prompts= in LLM.generate() calls
- Update both VLLMModel and AsyncVLLMModel
- Update llm_as_judge.py for VLLM backend

In vLLM v0.15.x, the LLM.generate() method signature changed:
- Old: generate(prompt_token_ids=..., sampling_params=...)
- New: generate(prompts=..., sampling_params=...)

Fixes: TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids'

* Fix vLLM v0.15.x prompt_logprobs API: increase top-k and handle dict structure

In vLLM v0.15.x, the prompt_logprobs structure changed:
- Now returns dict[int, Logprob] at each position (FlatLogprobs class)
- Only contains top-k tokens (default was 1, causing KeyError for continuation tokens)
- Need to access logprobs_at_position[token] instead of direct dict access

Changes:
1. Increase prompt_logprobs from 1 to 20 to ensure continuation tokens are included
2. Add defensive error handling with helpful message if token not found
3. Update variable names for clarity (logprobs -> logprobs_at_position)
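The defensive lookup described in point 2 can be sketched with a plain dict standing in for vLLM's per-position mapping (the real entries are Logprob objects, not bare floats):

```python
def logprob_for_token(logprobs_at_position: dict, token: int) -> float:
    """Fetch a continuation token's logprob from the top-k map,
    failing with an actionable message instead of a bare KeyError."""
    if token not in logprobs_at_position:
        raise KeyError(
            f"token {token} missing from top-{len(logprobs_at_position)} "
            "prompt logprobs; raise the prompt_logprobs setting so "
            "continuation tokens are always included"
        )
    return logprobs_at_position[token]
```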

* Fix vLLM v0.15.x logprobs API compatibility

* working omg

* revert

* revert

* revert

* Fix slow_tests workflow: update Python dev headers from 3.10 to 3.12

The GitHub Actions runner uses Python 3.12.3, so installing python3.10-dev
fails with 'Unable to locate package'. This updates the workflow to install
python3.12-dev to match the runner's Python version.

* lower memory need

* add debug prints

* upgrade ruff

* upgrade ruff

* fix dependencies

* fix dependencies
* workflow test against vllm nightly

* workflow test against vllm nightly
* 🔒 pin slow_tests.yaml actions to commit SHAs

* 🔒 pin quality.yaml actions to commit SHAs

* 🔒 pin doc-build.yml actions to commit SHAs

* 🔒 pin doc-pr-build.yml actions to commit SHAs

* 🔒 pin doc-pr-upload.yml actions to commit SHAs

* 🔒 pin pr_style_bot.yaml actions to commit SHAs

* 🔒 pin trufflehog.yml actions to commit SHAs
* Fix vLLM 0.11 compatibility and restore hellaswag_cf

Co-authored-by: OpenAI Codex <codex@openai.com>

* Support vLLM 0.19 prompt schema

Co-authored-by: OpenAI Codex <codex@openai.com>

* Address vLLM PR review feedback

Co-authored-by: OpenAI Codex <codex@openai.com>

* Remove temporary hellaswag_cf task

Co-authored-by: OpenAI Codex <codex@openai.com>

* Clarify vLLM compatibility branches

Co-authored-by: OpenAI Codex <codex@openai.com>

* Handle tied MCQ logits in slow sample comparisons

Co-authored-by: OpenAI Codex <codex@openai.com>

* Handle flat VLM token outputs in tie checks

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <codex@openai.com>
@Jeronymous Jeronymous requested review from Lduignan1 and Oligou April 22, 2026 09:28
Upstream refactor splits src/lighteval/tasks into per-task files under
src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/,
drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and
removes the suite field from LightevalTaskConfig.

Port our edits to the new structure:
- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric,
  language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt
  uses dynamic letters based on the number of options; add a parallel
  mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the
  generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct

Other conflict resolutions:
- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0,
  new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token
  guard, prefix-cache None-skip in logprob loop, and
  skip_reading_prefix_cache via guarded attribute assignment; adopt
  upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's
  api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset
  while picking up upstream's data_files support; keep partial args
  detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep
  regex that strips function memory addresses for hash determinism
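The cache_management.py point above relies on the fact that repr() of functions and functools.partial objects embeds memory addresses, which change between runs; a sketch of the address-stripping idea (the regex and helper name are illustrative, not the actual code):

```python
import re

# "<function f at 0x7f3a2c1d9e50>" hashes differently on every run;
# drop the " at 0x..." part before feeding the repr into a cache hash.
_ADDR_RE = re.compile(r" at 0x[0-9a-fA-F]+")

def stable_repr(obj) -> str:
    return _ADDR_RE.sub("", repr(obj))
```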