
Update with the new mainstream structure#5

Open
Jeronymous wants to merge 77 commits into main from merge_hf_main

Conversation

@Jeronymous
Member

No description provided.

NathanHB and others added 30 commits October 14, 2025 16:06
* option1

* also debugging the judge

* also debugging the judge

* debug

* eval tracker fix 1

* likely fix for the GSM+ issue

* stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites

* more memory for flow judge
…several combinations (huggingface#1017)

* fix

* added a warning message

* fix unit tests

* fix unit tests 2

* mini fix

* minifix

* test

* update new metrics name

* updated var names
…huggingface#828)

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* homogenize k and n in parametrizable metrics

* updated aime, last metric fixes

* fix

* restore rm import

* restore

* update doc

* gpqa fix

* pass at

* recall

* test
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
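The k/n homogenization above concerns sampling-based metrics such as pass@k. As a point of reference, here is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); this is a sketch, not lighteval's actual implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=2 generations of which c=1 is correct, pass@1 is 0.5, matching the intuition of picking one sample at random.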
* use inspect-ai to evaluate aime25 and gsm8k

* revert file

* working for 3 tasks

* parallel evals of tasks

* adds gpqa diamond to inspect

* move tasks to individual files

* move tasks to individual files

* enable extended tasks as well

* run pre-commit hook

* fix mkqa

* change extended suite to lighteval

* change extended suite to lighteval

* add metadata to tasks

* add metadata to tasks

* remove license notice and put docstring on top of file

* homogenize tags

* add docstring for all multilingual tasks

* add docstring for all multilingual tasks

* add name and dataset to metadata

* use TASKS_TABLE for multilingual tasks

* use TASKS_TABLE for default tasks

* use TASKS_TABLE for default tasks

* loads all tasks correctly

* move community tasks to default tasks and update doc

* move community tasks to default tasks and update doc

* revert unneeded changes

* fix doc build

* fix doc build

* remove custom tasks and let user decide if loading multilingual tasks

* load-tasks multilingual fix

* update doc

* remove unneeded file

* update readme

* update readme

* update readme

* fix test

* add back the custom tasks

* add back the custom tasks

* fix tasks

* fix tasks

* fix tasks

* fix tests

* fix tests
Adds inspect-ai as a backend for lighteval! This offloads backend implementation and maintenance, and allows for:

- better logs
- better parallelization
- easier task addition

Tasks compatible with inspect-ai (eventually all tasks will be):

- gpqa (few-shot compatible)
- ifeval
- hle
- gsm8k (few-shot compatible)
- agieval
- aime24, aime25

### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`:

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

result:

```
|                                Model                                 |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|
```


### compare few-shot differences on gsm8k

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
--max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

```
|                                Model                                 |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|
```

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adds mmlu-pro

* adds mmlu-pro

* add mmlu-pro with inspectai
* fix reasoning effort
…#992)

* fix

* revert unneeded changes

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* run all hf-providers

* add example

* remove unneeded params
* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
…uggingface#1051)

* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* Remove suite argument in task config

* Remove suite argument in task config

* fix: try to cache functools.partial function

* fix styling
…gface#1052)

* add a task dump in registry for better documentation of tasks

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix

* remove

* fix aimo

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
set(1,2,3) -> {1,2,3}

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>
Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>
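The `set(1,2,3) -> {1,2,3}` change above is a pure style fix (assuming the original code passed an iterable, e.g. `set([1, 2, 3])`; `set(1, 2, 3)` itself would be a TypeError):

```python
# A set literal is equivalent to calling set() on an iterable,
# but skips allocating the throwaway list/tuple argument.
via_constructor = set([1, 2, 3])
via_literal = {1, 2, 3}
assert via_constructor == via_literal
# Note: {} is an empty dict, so the empty set still needs set().
empty = set()
```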
Even though vLLM exposes an OpenAI-compatible endpoint, to make it work you have to set the provider to hosted_vllm and add a hosted_vllm prefix before the model name.
moves all the prompts from `default_prompts.py` to their respective task file
paulinebm and others added 19 commits December 10, 2025 20:45
Commented out the entire workflow configuration for the PR Style Bot.
Commenting out the entire workflow configuration for the PR Style Bot.
* Refactor PR Style Bot workflow with new inputs

Updated the PR Style Bot workflow to enhance functionality and include new inputs for style commands and Python quality dependencies.

* Refactor PR Style Bot to use reusable action

Updated the PR Style Bot workflow to use a reusable style bot action from the huggingface_hub repository.

* Apply suggestion from @NathanHB
* enable use of data files for custom tasks

* addressing PR comments, create new doc file, update docstring with types

* Update docs/source/offline-evaluation.md

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* add new doc to toctree

* Add offline evaluation section to documentation

---------

Co-authored-by: David Biagioni <dbiagioni@proofpoint.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Dave Biagioni <7086434+davebiagioni@users.noreply.github.com>
* initial commit

* multi challenge impl, ready for review

* docstring fixes

* addressed comments

* addressed comments

* modifs to multiturn api for inspect

---------

Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* initial commit

* fixes for mathvista

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

* make style

---------

Co-authored-by: Omkar Kabde <omkarkabde@gmail.com>
* ready for review

* some fixes

* addressing comments + fixing multi turn

* fit the task in one file

---------

Co-authored-by: Akshath Mangudi <akshathmangudi@gmail.com>
* [EVAL] BIG-Bench Extra Hard

* update

* fixing the prompt

* Apply suggestion from @NathanHB

* Apply suggestion from @NathanHB

---------

Co-authored-by: Jigyasu <jigyasu@outlook.in>
* add eval results tip

* Update docs/source/index.mdx
* Upgrade vLLM from 0.10.1.1 to 0.14.1

- Update pyproject.toml to vllm>=0.11.0
- Fix deprecated import: vllm.transformers_utils.tokenizer -> vllm.tokenizers
- Add comprehensive test suite for V1 engine compatibility
- Add smoke tests for quick validation

Changes:
- pyproject.toml: Updated vllm version constraint
- vllm_model.py: Updated get_tokenizer import path
- llm_as_judge.py: Updated get_tokenizer import path
- Added smoke_test_vllm_v11.py: Quick validation tests
- Added test_vllm_v1_compatibility.py: Comprehensive compatibility tests

All tests passing - V1 engine compatible, basic inference working.

* Fix vLLM slow test OOM by reducing GPU memory utilization and improving cleanup

The vLLM slow tests were failing with OOM errors when running after
accelerate tests. The issue was:
1. vLLM V1 engine requires a specific amount of free GPU memory at startup
2. After accelerate tests, only 5.89 GiB was free (out of 14.74 GiB)
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB

Fixes:
- Reduce gpu_memory_utilization from 0.6 to 0.35 in test config (needs 5.16 GiB)
- Add GPU memory cleanup fixture in conftest.py that runs before/after slow tests
- Improve AsyncVLLMModel.cleanup() to properly delete model object

The gpu_memory_utilization parameter only affects KV cache allocation and
does not impact model outputs with temperature=0.0, so this change is safe.
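
The arithmetic in the commit message above can be reproduced directly; `gpu_memory_utilization` is a fraction of total GPU memory that vLLM claims at startup (the helper name here is illustrative, not a vLLM API):

```python
def vllm_startup_claim_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """GiB of GPU memory the vLLM engine asks for at startup:
    a flat fraction of total device memory."""
    return total_gib * gpu_memory_utilization

# Numbers from the commit message: a 14.74 GiB device with 5.89 GiB free.
wanted_at_0_6 = vllm_startup_claim_gib(14.74, 0.6)    # ~8.84 GiB -> OOM
needed_at_0_35 = vllm_startup_claim_gib(14.74, 0.35)  # ~5.16 GiB -> fits
```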

* Fix vLLM CI test by increasing gpu_memory_utilization to 0.4

The CI test was failing with 'ValueError: To serve at least one request
with the model's max seq len (8192), 1.5 GiB KV cache is needed, which
is larger than the available KV cache memory (1.42 GiB).'

Root cause:
- Tesla T4 GPU (15.36 GB) in CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB available for KV cache
- Required 1.5 GiB for max_seq_len=8192
- Shortfall: 80 MB

Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for KV cache (sufficient for 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)

* Fix vLLM CI test and add GPU memory monitoring

This commit addresses two issues:

1. Fix vLLM engine initialization failure in CI
   - Root cause: Triton library requires Python.h headers to compile CUDA utilities
   - Solution: Install python3.10-dev package in CI workflow
   - Error was: 'fatal error: Python.h: No such file or directory'

2. Add comprehensive GPU memory monitoring for slow tests
   - Add _log_gpu_memory() helper function in conftest.py
   - Log GPU memory before/after each slow test (device, total, allocated, reserved, free)
   - Add memory logging to model cleanup methods:
     * VLLMModel.cleanup()
     * AsyncVLLMModel.cleanup()
     * TransformersModel.cleanup()
   - Shows memory freed during cleanup operations

This will help diagnose OOM issues and verify proper memory cleanup between tests.

Changes:
- .github/workflows/slow_tests.yaml: Add python3.10-dev installation step
- tests/conftest.py: Add GPU memory monitoring helper + enhanced fixture
- src/lighteval/models/vllm/vllm_model.py: Add memory logging to cleanup methods
- src/lighteval/models/transformers/transformers_model.py: Add memory logging to cleanup

* Fix vLLM CI: Add CUDA environment setup for FlashInfer JIT compilation

The vLLM test was failing because FlashInfer needs nvcc (CUDA compiler)
for JIT kernel compilation during warmup. The error was:
'RuntimeError: Could not find nvcc and default cuda_home="/usr/local/cuda" doesn't exist'

Fixes:
- Set CUDA_HOME=/usr/local/cuda-12.4 environment variable
- Add /usr/local/cuda-12.4/bin to PATH for nvcc access
- This allows FlashInfer to JIT-compile custom attention kernels

Previous fixes in this PR:
- ✅ Installed python3.10-dev for Python.h headers (Triton compilation)
- ✅ Increased gpu_memory_utilization from 0.35 to 0.4 for KV cache
- ✅ Added comprehensive GPU memory monitoring

GPU memory stats show plenty of free memory (14.71 GiB of 14.74 GiB),
so the issue is purely build-time tooling for JIT compilation.

* Fix vLLM CI: Pass CUDA environment variables to test subprocess

The vLLM v1 engine spawns subprocesses that don't inherit environment
variables set in . The previous fix set CUDA_HOME in the
GitHub Actions environment, but the vLLM EngineCore subprocess couldn't
access it, causing:

'/bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found'

Fix:
- Set CUDA_HOME and PATH directly in the test run command
- This ensures the environment variables are inherited by all subprocesses
- Now nvcc will be found during FlashInfer JIT compilation

The issue was subprocess environment isolation, not the parent environment.

* Install CUDA Toolkit 12.8 in CI for vLLM FlashInfer JIT compilation

- Add CUDA Toolkit 12.8 installation step to match nvidia-cuda-runtime-cu12==12.8.90
- Cache /usr/local/cuda-12.8 to speed up subsequent CI runs
- Add verification step to check nvcc availability
- Update CUDA_HOME and PATH to use CUDA 12.8
- Use export in test run to ensure subprocess inherits environment variables

This fixes the issue where vLLM v0.15.x with FlashInfer backend requires
nvcc at runtime for JIT compilation of CUDA kernels on Tesla T4 (SM 7.5).

Resolves: /bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found

* Fix vLLM v0.15.x API compatibility: use max_model_len instead of max_seq_len_to_capture

- Replace model.llm_engine.model_config.max_seq_len_to_capture with max_model_len
- Replace model.model_config.max_seq_len_to_capture with max_model_len for async model
- This attribute was renamed in vLLM v0.15.x

Fixes: AttributeError: 'ModelConfig' object has no attribute 'max_seq_len_to_capture'
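A version-tolerant way to express the rename above is a getattr fallback; this is a sketch with stub configs, not the actual lighteval code:

```python
def resolve_max_len(model_config) -> int:
    """Return the model's max sequence length across vLLM versions:
    max_seq_len_to_capture existed before v0.15.x, max_model_len after."""
    max_len = getattr(model_config, "max_seq_len_to_capture", None)
    if max_len is None:
        max_len = model_config.max_model_len
    return max_len
```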

* Fix vLLM v0.15.x generate() API: use prompts parameter instead of prompt_token_ids

- Replace prompt_token_ids= with prompts= in LLM.generate() calls
- Update both VLLMModel and AsyncVLLMModel
- Update llm_as_judge.py for VLLM backend

In vLLM v0.15.x, the LLM.generate() method signature changed:
- Old: generate(prompt_token_ids=..., sampling_params=...)
- New: generate(prompts=..., sampling_params=...)

Fixes: TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids'

* Fix vLLM v0.15.x prompt_logprobs API: increase top-k and handle dict structure

In vLLM v0.15.x, the prompt_logprobs structure changed:
- Now returns dict[int, Logprob] at each position (FlatLogprobs class)
- Only contains top-k tokens (default was 1, causing KeyError for continuation tokens)
- Need to access logprobs_at_position[token] instead of direct dict access

Changes:
1. Increase prompt_logprobs from 1 to 20 to ensure continuation tokens are included
2. Add defensive error handling with helpful message if token not found
3. Update variable names for clarity (logprobs -> logprobs_at_position)
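The defensive lookup described in point 2 can be sketched with a plain dict standing in for vLLM's per-position mapping (the real entries are Logprob objects, not bare floats):

```python
def logprob_for_token(logprobs_at_position: dict, token: int) -> float:
    """Fetch a continuation token's logprob from the top-k map,
    failing with an actionable message instead of a bare KeyError."""
    if token not in logprobs_at_position:
        raise KeyError(
            f"token {token} missing from top-{len(logprobs_at_position)} "
            "prompt logprobs; raise the prompt_logprobs setting so "
            "continuation tokens are always included"
        )
    return logprobs_at_position[token]
```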

* Fix vLLM v0.15.x logprobs API compatibility

* working omg

* revert

* revert

* revert

* Fix slow_tests workflow: update Python dev headers from 3.10 to 3.12

The GitHub Actions runner uses Python 3.12.3, so installing python3.10-dev
fails with 'Unable to locate package'. This updates the workflow to install
python3.12-dev to match the runner's Python version.

* lower memory need

* add debug prints

* upgrade ruff

* upgrade ruff

* fix dependencies

* fix dependencies
* workflow test against vllm nightly

* workflow test against vllm nightly
* 🔒 pin slow_tests.yaml actions to commit SHAs

* 🔒 pin quality.yaml actions to commit SHAs

* 🔒 pin doc-build.yml actions to commit SHAs

* 🔒 pin doc-pr-build.yml actions to commit SHAs

* 🔒 pin doc-pr-upload.yml actions to commit SHAs

* 🔒 pin pr_style_bot.yaml actions to commit SHAs

* 🔒 pin trufflehog.yml actions to commit SHAs
* Fix vLLM 0.11 compatibility and restore hellaswag_cf

Co-authored-by: OpenAI Codex <codex@openai.com>

* Support vLLM 0.19 prompt schema

Co-authored-by: OpenAI Codex <codex@openai.com>

* Address vLLM PR review feedback

Co-authored-by: OpenAI Codex <codex@openai.com>

* Remove temporary hellaswag_cf task

Co-authored-by: OpenAI Codex <codex@openai.com>

* Clarify vLLM compatibility branches

Co-authored-by: OpenAI Codex <codex@openai.com>

* Handle tied MCQ logits in slow sample comparisons

Co-authored-by: OpenAI Codex <codex@openai.com>

* Handle flat VLM token outputs in tie checks

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <codex@openai.com>
@Jeronymous Jeronymous requested review from Lduignan1 and Oligou April 22, 2026 09:28
Upstream refactor splits src/lighteval/tasks into per-task files under
src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/,
drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and
removes the suite field from LightevalTaskConfig.

Port our edits to the new structure:
- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric,
  language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt
  uses dynamic letters based on the number of options; add a parallel
  mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the
  generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct

Other conflict resolutions:
- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0,
  new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token
  guard, prefix-cache None-skip in logprob loop, and
  skip_reading_prefix_cache via guarded attribute assignment; adopt
  upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's
  api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset
  while picking up upstream's data_files support; keep partial args
  detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep
  regex that strips function memory addresses for hash determinism
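The cache_management.py point above relies on the fact that repr() of functions and functools.partial objects embeds memory addresses, which change between runs; a sketch of the address-stripping idea (the regex and helper name are illustrative, not the actual code):

```python
import re

# "<function f at 0x7f3a2c1d9e50>" hashes differently on every run;
# drop the " at 0x..." part before feeding the repr into a cache hash.
_ADDR_RE = re.compile(r" at 0x[0-9a-fA-F]+")

def stable_repr(obj) -> str:
    return _ADDR_RE.sub("", repr(obj))
```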