* option1
* also debugging the judge (×2)
* debug
* eval tracker fix 1
* likely fix for the GSM+ issue
* stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites
* more memory for flow judge
…several combinations (huggingface#1017)
* fix
* added a warning message
* fix unit tests
* fix unit tests 2
* mini fix (×2)
* test
* update new metrics name
* updated var names
…huggingface#828) Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* homogenize k and n in parametrizable metrics
* updated aime, last metric fixes
* fix
* restore rm import
* restore
* update doc
* gpqa fix
* pass at
* recall
* test
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* use inspect-ai to evaluate aime25 and gsm8k
* revert file
* working for 3 tasks
* parallel evals of tasks
* adds gpqa diamond to inspect
* move tasks to individual files (×2)
* enable extended tasks as well
* run pre-commit hook
* fix mkqa
* change extended suite to lighteval (×2)
* add metadata to tasks (×2)
* remove license notice and put docstring on top of file
* homogenize tags
* add docstring for all multilingual tasks (×2)
* add name and dataset to metadata
* use TASKS_TABLE for multilingual tasks
* use TASKS_TABLE for default tasks (×2)
* loads all tasks correctly
* move community tasks to default tasks and update doc (×2)
* revert unneeded changes
* fix doc build (×2)
* remove custom tasks and let user decide if loading multilingual tasks
* load-tasks multilingual fix
* update doc
* remove unneeded file
* update readme (×3)
* fix test
* add back the custom tasks (×2)
* fix tasks (×3)
* fix tests (×2)
Adds inspect-ai as a backend for lighteval! Offloading backend implementation and maintenance allows for:
- better logs
- better parallelization
- easier task additions

Tasks compatible with inspect-ai (eventually all tasks will be compatible):
- gpqa (few-shot compatible)
- ifeval
- hle
- gsm8k (few-shot compatible)
- agieval
- aime24, aime25

### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`:

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
  "lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
  --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

result:

```
| Model                                                                |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|
```

### compare few-shot diff on gsm8k

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
  hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
  "lighteval|gsm8k|0,lighteval|gsm8k|3" \
  --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

```
| Model                                                                |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|
```

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai
* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai * fix reasoning effort
…#992) * fix * revert unneeded changes --------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* run all hf-providers * add example * remove unneeded params
* remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests (×7)
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
…uggingface#1051)
* remove suites and make fewshot optional
* fix docs to remove suites and fewshots
* fix tests (×7)
* Remove suite argument in task config (×2)
* fix: try to cache functools.partial function
* fix styling
…gface#1052)
* add a task dump in registry for better documentation of tasks
* Update src/lighteval/tasks/registry.py (three review suggestions, Co-authored-by: Copilot)
* fix
* remove
* fix aimo
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
set(1,2,3) -> {1,2,3}
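As a minimal illustration of why this change matters: `set()` takes a single iterable argument, so the call form with separate elements is not just verbose, it raises; the literal form builds the set directly.

```python
# set() accepts one iterable, so separate positional elements raise TypeError.
try:
    set(1, 2, 3)
except TypeError as exc:
    print("set(1, 2, 3) fails:", exc)

literal = {1, 2, 3}             # set literal, the form adopted by this commit
from_iterable = set((1, 2, 3))  # equivalent construction via an iterable
print(literal == from_iterable)  # → True
```

The literal is also marginally faster, since it avoids a global name lookup and a function call.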
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>
Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>
Even though vLLM exposes an OpenAI-compatible endpoint, to make it work you have to set the provider to hosted_vllm and prefix the model name with hosted_vllm.
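A sketch of what the note above implies for the model identifier, assuming litellm-style provider routing (the helper name is illustrative, not part of the codebase): the provider prefix is joined to the plain model name with a slash.

```python
# Hypothetical helper: build a litellm-style model id for a self-hosted
# vLLM server. The hosted_vllm/ prefix routes the request through the
# hosted_vllm provider instead of treating the name as an OpenAI model.
def hosted_vllm_model_id(model_name: str) -> str:
    return f"hosted_vllm/{model_name}"

print(hosted_vllm_model_id("meta-llama/Llama-3.1-8B-Instruct"))
# → hosted_vllm/meta-llama/Llama-3.1-8B-Instruct
```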
moves all the prompts from `default_prompts.py` to their respective task file
Commented out the entire workflow configuration for the PR Style Bot.
Commenting out the entire workflow configuration for the PR Style Bot.
* Refactor PR Style Bot workflow with new inputs: updated the workflow to enhance functionality and include new inputs for style commands and Python quality dependencies.
* Refactor PR Style Bot to use reusable action: updated the workflow to use a reusable style bot action from the huggingface_hub repository.
* Apply suggestion from @NathanHB
* enable use of data files for custom tasks
* addressing PR comments, create new doc file, update docstring with types
* Update docs/source/offline-evaluation.md (Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>)
* add new doc to toctree
* Add offline evaluation section to documentation
---------
Co-authored-by: David Biagioni <dbiagioni@proofpoint.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Dave Biagioni <7086434+davebiagioni@users.noreply.github.com>
* initial commit
* multi challenge impl, ready for review
* docstring fixes
* addressed comments (×2)
* modifications to the multi-turn API for inspect
---------
Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* ready for review
* some fixes
* addressing comments + fixing multi turn
* fit the task in one file
---------
Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* add eval results tip * Update docs/source/index.mdx
* Upgrade vLLM from 0.10.1.1 to 0.14.1
- Update pyproject.toml to vllm>=0.11.0
- Fix deprecated import: vllm.transformers_utils.tokenizer -> vllm.tokenizers
- Add comprehensive test suite for V1 engine compatibility
- Add smoke tests for quick validation
Changes:
- pyproject.toml: Updated vllm version constraint
- vllm_model.py: Updated get_tokenizer import path
- llm_as_judge.py: Updated get_tokenizer import path
- Added smoke_test_vllm_v11.py: Quick validation tests
- Added test_vllm_v1_compatibility.py: Comprehensive compatibility tests
All tests passing - V1 engine compatible, basic inference working.
* Fix vLLM slow test OOM by reducing GPU memory utilization and improving cleanup
The vLLM slow tests were failing with OOM errors when running after
accelerate tests. The issue was:
1. vLLM V1 engine requires a specific amount of free GPU memory at startup
2. After accelerate tests, only 5.89 GiB was free (out of 14.74 GiB)
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB
Fixes:
- Reduce gpu_memory_utilization from 0.6 to 0.35 in test config (needs 5.16 GiB)
- Add GPU memory cleanup fixture in conftest.py that runs before/after slow tests
- Improve AsyncVLLMModel.cleanup() to properly delete model object
The gpu_memory_utilization parameter only affects KV cache allocation and
does not impact model outputs with temperature=0.0, so this change is safe.
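The budget arithmetic behind these numbers can be checked directly (GiB values taken from the commit message above; vLLM requests roughly `total * gpu_memory_utilization` at engine startup, which has to fit inside the memory that is actually free):

```python
# Sanity-check the GPU memory figures quoted in the commit message.
TOTAL_GIB = 14.74   # total GPU memory reported in the logs
FREE_GIB = 5.89     # free memory after the accelerate tests

def requested(util: float) -> float:
    """Memory vLLM asks for at startup, in GiB."""
    return TOTAL_GIB * util

print(round(requested(0.60), 2))  # → 8.84  (exceeds 5.89 free: OOM)
print(round(requested(0.35), 2))  # → 5.16  (fits in 5.89 free)
```

This confirms why 0.6 failed after the accelerate tests while 0.35 starts up with a small margin.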
* Fix vLLM CI test by increasing gpu_memory_utilization to 0.4
The CI test was failing with 'ValueError: To serve at least one request
with the model's max seq len (8192), 1.5 GiB KV cache is needed, which
is larger than the available KV cache memory (1.42 GiB).'
Root cause:
- Tesla T4 GPU (15.36 GB) in CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB available for KV cache
- Required 1.5 GiB for max_seq_len=8192
- Shortfall: 80 MB
Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for KV cache (sufficient for 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)
* Fix vLLM CI test and add GPU memory monitoring
This commit addresses two issues:
1. Fix vLLM engine initialization failure in CI
- Root cause: Triton library requires Python.h headers to compile CUDA utilities
- Solution: Install python3.10-dev package in CI workflow
- Error was: 'fatal error: Python.h: No such file or directory'
2. Add comprehensive GPU memory monitoring for slow tests
- Add _log_gpu_memory() helper function in conftest.py
- Log GPU memory before/after each slow test (device, total, allocated, reserved, free)
- Add memory logging to model cleanup methods:
* VLLMModel.cleanup()
* AsyncVLLMModel.cleanup()
* TransformersModel.cleanup()
- Shows memory freed during cleanup operations
This will help diagnose OOM issues and verify proper memory cleanup between tests.
Changes:
- .github/workflows/slow_tests.yaml: Add python3.10-dev installation step
- tests/conftest.py: Add GPU memory monitoring helper + enhanced fixture
- src/lighteval/models/vllm/vllm_model.py: Add memory logging to cleanup methods
- src/lighteval/models/transformers/transformers_model.py: Add memory logging to cleanup
* Fix vLLM CI: Add CUDA environment setup for FlashInfer JIT compilation
The vLLM test was failing because FlashInfer needs nvcc (CUDA compiler)
for JIT kernel compilation during warmup. The error was:
'RuntimeError: Could not find nvcc and default cuda_home="/usr/local/cuda" doesn't exist'
Fixes:
- Set CUDA_HOME=/usr/local/cuda-12.4 environment variable
- Add /usr/local/cuda-12.4/bin to PATH for nvcc access
- This allows FlashInfer to JIT-compile custom attention kernels
Previous fixes in this PR:
- ✅ Installed python3.10-dev for Python.h headers (Triton compilation)
- ✅ Increased gpu_memory_utilization from 0.35 to 0.4 for KV cache
- ✅ Added comprehensive GPU memory monitoring
GPU memory stats show plenty of free memory (14.71 GiB of 14.74 GiB),
so the issue is purely build-time tooling for JIT compilation.
* Fix vLLM CI: Pass CUDA environment variables to test subprocess
The vLLM v1 engine spawns subprocesses that don't inherit environment
variables set in . The previous fix set CUDA_HOME in the
GitHub Actions environment, but the vLLM EngineCore subprocess couldn't
access it, causing:
'/bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found'
Fix:
- Set CUDA_HOME and PATH directly in the test run command
- This ensures the environment variables are inherited by all subprocesses
- Now nvcc will be found during FlashInfer JIT compilation
The issue was subprocess environment isolation, not the parent environment.
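The subprocess-isolation point can be demonstrated in miniature: a child process only sees the environment it is launched with, so putting `CUDA_HOME` into the command's own environment (rather than a parent shell's) is what makes it visible to spawned engine processes. This is a generic sketch, not the CI code itself.

```python
# A child shell sees exactly the environment passed to it: extend the
# current environment with CUDA_HOME and verify the child observes it.
import os
import subprocess

env = dict(os.environ, CUDA_HOME="/usr/local/cuda-12.4")  # explicit pass-through
out = subprocess.run(
    ["sh", "-c", "echo $CUDA_HOME"],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)  # → /usr/local/cuda-12.4
```

Any process the child spawns in turn inherits this environment as well, which is why setting the variable at the test command level fixes the nested EngineCore subprocess.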
* Install CUDA Toolkit 12.8 in CI for vLLM FlashInfer JIT compilation
- Add CUDA Toolkit 12.8 installation step to match nvidia-cuda-runtime-cu12==12.8.90
- Cache /usr/local/cuda-12.8 to speed up subsequent CI runs
- Add verification step to check nvcc availability
- Update CUDA_HOME and PATH to use CUDA 12.8
- Use export in test run to ensure subprocess inherits environment variables
This fixes the issue where vLLM v0.15.x with FlashInfer backend requires
nvcc at runtime for JIT compilation of CUDA kernels on Tesla T4 (SM 7.5).
Resolves: /bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found
* Fix vLLM v0.15.x API compatibility: use max_model_len instead of max_seq_len_to_capture
- Replace model.llm_engine.model_config.max_seq_len_to_capture with max_model_len
- Replace model.model_config.max_seq_len_to_capture with max_model_len for async model
- This attribute was renamed in vLLM v0.15.x
Fixes: AttributeError: 'ModelConfig' object has no attribute 'max_seq_len_to_capture'
* Fix vLLM v0.15.x generate() API: use prompts parameter instead of prompt_token_ids
- Replace prompt_token_ids= with prompts= in LLM.generate() calls
- Update both VLLMModel and AsyncVLLMModel
- Update llm_as_judge.py for VLLM backend
In vLLM v0.15.x, the LLM.generate() method signature changed:
- Old: generate(prompt_token_ids=..., sampling_params=...)
- New: generate(prompts=..., sampling_params=...)
Fixes: TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids'
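One way to absorb a keyword rename like this without pinning a vLLM version is to feature-detect the parameter name at call time. This is a hedged sketch of the pattern (not the code in this PR), demonstrated on stub functions standing in for the old and new `generate()` signatures:

```python
# Pick whichever keyword the installed generate() actually accepts.
import inspect

def call_generate(generate, token_ids, sampling_params=None):
    params = inspect.signature(generate).parameters
    kw = "prompts" if "prompts" in params else "prompt_token_ids"
    return generate(**{kw: token_ids}, sampling_params=sampling_params)

def old_generate(prompt_token_ids=None, sampling_params=None):  # pre-0.15 stub
    return f"old:{prompt_token_ids}"

def new_generate(prompts=None, sampling_params=None):  # 0.15.x stub
    return f"new:{prompts}"

print(call_generate(old_generate, [1, 2]))  # → old:[1, 2]
print(call_generate(new_generate, [1, 2]))  # → new:[1, 2]
```

The PR instead migrates directly to the new keyword, which is simpler when the version floor is already raised.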
* Fix vLLM v0.15.x prompt_logprobs API: increase top-k and handle dict structure
In vLLM v0.15.x, the prompt_logprobs structure changed:
- Now returns dict[int, Logprob] at each position (FlatLogprobs class)
- Only contains top-k tokens (default was 1, causing KeyError for continuation tokens)
- Need to access logprobs_at_position[token] instead of direct dict access
Changes:
1. Increase prompt_logprobs from 1 to 20 to ensure continuation tokens are included
2. Add defensive error handling with helpful message if token not found
3. Update variable names for clarity (logprobs -> logprobs_at_position)
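The defensive lookup described in point 2 can be sketched on a toy per-position dict (plain floats stand in for vLLM's `Logprob` objects; the function name is illustrative):

```python
# Per-position prompt logprobs are a dict keyed by token id; a
# continuation token may be absent when the top-k setting is too small.
def continuation_logprob(logprobs_at_position: dict, token: int) -> float:
    if token not in logprobs_at_position:
        raise KeyError(
            f"token {token} not in top-k prompt_logprobs; "
            "increase the prompt_logprobs setting"
        )
    return logprobs_at_position[token]

position = {42: -0.7, 7: -2.1}  # stand-in for dict[int, Logprob]
print(continuation_logprob(position, 42))  # → -0.7
try:
    continuation_logprob(position, 99)
except KeyError as exc:
    print("missing:", exc)
```

Raising with an actionable message is what turns the former silent `KeyError` into something a user can fix by bumping `prompt_logprobs`.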
* Fix vLLM v0.15.x logprobs API compatibility
* working omg
* revert
* revert
* revert
* Fix slow_tests workflow: update Python dev headers from 3.10 to 3.12
The GitHub Actions runner uses Python 3.12.3, so installing python3.10-dev
fails with 'Unable to locate package'. This updates the workflow to install
python3.12-dev to match the runner's Python version.
* lower memory need
* add debug prints
* upgrade ruff
* upgrade ruff
* fix dependencies
* fix dependencies
* workflow test against vllm nightly (×2)
* 🔒 pin slow_tests.yaml actions to commit SHAs
* 🔒 pin quality.yaml actions to commit SHAs
* 🔒 pin doc-build.yml actions to commit SHAs
* 🔒 pin doc-pr-build.yml actions to commit SHAs
* 🔒 pin doc-pr-upload.yml actions to commit SHAs
* 🔒 pin pr_style_bot.yaml actions to commit SHAs
* 🔒 pin trufflehog.yml actions to commit SHAs
* Fix vLLM 0.11 compatibility and restore hellaswag_cf
* Support vLLM 0.19 prompt schema
* Address vLLM PR review feedback
* Remove temporary hellaswag_cf task
* Clarify vLLM compatibility branches
* Handle tied MCQ logits in slow sample comparisons
* Handle flat VLM token outputs in tie checks
---------
Co-authored-by: OpenAI Codex <codex@openai.com>
Upstream refactor splits src/lighteval/tasks into per-task files under src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/, drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and removes the suite field from LightevalTaskConfig. Port our edits to the new structure:

- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric, language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt uses dynamic letters based on the number of options; add a parallel mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct

Other conflict resolutions:

- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0, new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token guard, prefix-cache None-skip in logprob loop, and skip_reading_prefix_cache via guarded attribute assignment; adopt upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset while picking up upstream's data_files support; keep partial args detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep regex that strips function memory addresses for hash determinism
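The "guarded attribute assignment" mentioned for vllm_model.py can be sketched generically: only set a config attribute when the installed version actually defines it, so the same code runs across library versions. The class and attribute names below are illustrative stand-ins, not the real vLLM config objects.

```python
# Version-tolerant attribute setting: skip silently when the attribute
# does not exist on the installed version's config object.
class OldConfig:           # stand-in for a version without the field
    pass

class NewConfig:           # stand-in for a version that has the field
    skip_reading_prefix_cache = False

def maybe_set(obj, name, value):
    if hasattr(obj, name):   # guard: attribute exists on this version
        setattr(obj, name, value)
        return True
    return False             # older version: leave the object untouched

print(maybe_set(NewConfig(), "skip_reading_prefix_cache", True))  # → True
print(maybe_set(OldConfig(), "skip_reading_prefix_cache", True))  # → False
```

Compared with a version-string check, probing the attribute directly keeps working when the field moves between releases.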