
WIP watch diff with upstream main branch #6

Open
Jeronymous wants to merge 68 commits into upstream-main from merge_hf_main

Conversation

@Jeronymous
Member

No description provided.

Oligou and others added 30 commits October 14, 2025 11:51
…q len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine"
…and new version of the dataset is different)
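The KV-cache commit above quotes the error where the requested sequence length (131072) exceeds what the allocated cache can hold (130944 tokens). The arithmetic behind that check can be sketched as follows; the 16-token paged-attention block size is an assumption (vLLM's common default), not something read from this repo's config:

```python
# Sketch of the capacity check behind the quoted error.
# BLOCK_SIZE = 16 is an assumption (vLLM's usual paged-attention block size).
BLOCK_SIZE = 16
num_gpu_blocks = 130944 // BLOCK_SIZE       # blocks that fit at the current gpu_memory_utilization
max_kv_tokens = num_gpu_blocks * BLOCK_SIZE  # 130944: total tokens the KV cache can store

requested_max_model_len = 131072
assert requested_max_model_len > max_kv_tokens  # so the engine refuses to start

# The two remedies named in the message: lower max_model_len below the
# capacity, or raise gpu_memory_utilization so more blocks fit.
fixed_max_model_len = max_kv_tokens  # 130944, or any smaller value
```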
Jeronymous and others added 30 commits February 20, 2026 16:17
…it (>= 0.15).

Unfortunately, it currently fails with vLLM 0.15.1 in our environment:
  File ".../vllm/v1/worker/gpu_worker.py", line 412, in initialize_from_config
    self.model_runner.initialize_kv_cache(kv_cache_config)
  File ".../vllm/v1/worker/gpu_model_runner.py", line 5874, in initialize_kv_cache
    self.initialize_attn_backend(kv_cache_config)
  File ".../vllm/v1/worker/gpu_model_runner.py", line 5225, in initialize_attn_backend
    check_attention_cp_compatibility(self.vllm_config)
  File ".../vllm/v1/worker/cp_utils.py", line 39, in check_attention_cp_compatibility
    assert layer_impl.supports_pcp, (
AssertionError: PCP requires attention impls' support, but the impl FlashAttentionImpl does not support PCP.
Fix multi-parallelism (TP+DP or PP+DP)
Add MathAlea French math MCQ community task
Add Red Teaming benchmark based on AvgBench
The upstream refactor splits src/lighteval/tasks into per-task files under
src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/,
drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and
removes the suite field from LightevalTaskConfig.

Port our edits to the new structure:
- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric,
  language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt
  uses dynamic letters based on the number of options; add a parallel
  mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the
  generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct
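The mmlu_pro.py fix replaces upstream's hardcoded ABCD answer letters with letters derived from the actual number of options. A minimal sketch of the idea (the helper name and prompt shape here are illustrative, not lighteval's actual code):

```python
import string

def format_choices(options):
    """Label each option A, B, C, ... based on how many options exist,
    instead of assuming exactly four (ABCD)."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
    return letters, "\n".join(lines)

# MMLU-Pro questions can have up to 10 options, so with 5 options the
# valid answer letters are A-E rather than a fixed A-D.
letters, block = format_choices(["Paris", "Lyon", "Lille", "Nice", "Nantes"])
```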

Other conflict resolutions:
- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0,
  new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token
  guard, prefix-cache None-skip in logprob loop, and
  skip_reading_prefix_cache via guarded attribute assignment; adopt
  upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's
  api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset
  while picking up upstream's data_files support; keep partial args
  detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep
  regex that strips function memory addresses for hash determinism
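The cache_management.py regex keeps cache hashes deterministic by removing the per-process memory address that appears in the repr of functions and bound methods. A hedged sketch of the technique (the exact pattern in the codebase may differ):

```python
import re

# Function reprs look like "<function prompt_fn at 0x7f3a...>"; the hex
# address changes every run, so it must be stripped before hashing.
ADDR_RE = re.compile(r" at 0x[0-9a-fA-F]+")

def stable_repr(obj):
    """repr() with memory addresses removed, safe to feed into a cache hash."""
    return ADDR_RE.sub("", repr(obj))

def prompt_fn(line):  # stand-in for a task's prompt function
    return line

# stable_repr(prompt_fn) -> "<function prompt_fn>" on every run.
```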