You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
detect_task (used by inspect/eval) and _detect_task_and_class_from_config
(used by config/build) are two implementations of the same task decision over
the same MODEL_CLASS_MAPPING data. They have drifted in three places, and the
modality-disambiguation step (D2) reconstructs modality from config field names —
a heuristic that is provably weaker than information the pipeline already holds.
Proposal: extract a single task-detection core with two thin entry points, and
derive modality from the resolved model class's main_input_name.
Internal refactor only — public detect_task / resolve_task_and_model_class
signatures stay unchanged.
Motivation: the D2 false positive (concrete bug)
D2 upgrades feature-extraction → image-feature-extraction when the config has a
top-level image_size/patch_size (OR semantics). patch_size is not exclusive
to vision — spectrogram transformers patchify their mel-spectrogram. Verified via
the real detection path:
declared architecture
inspect task today
correct
Wav2Vec2Model
feature-extraction → text dataset → fails
audio
WhisperModel
feature-extraction → text dataset → fails
audio
ASTModel
image-feature-extraction → image dataset → fails
audio
feature-extraction routes to a text dataset (eval: STS-B; build: TextDataset) and image-feature-extraction to an image dataset, so audio backbones fail in both eval and build regardless of which branch D2 picks.
Root cause
Modality is a property of the model class (known with certainty), but the pipeline
collapses class → task through TasksManager (modality-blind by design), discarding
modality, then D2 reconstructs it from config fields. The class that carries the
authoritative signal is already resolved at that exact point
(_detect_task_from_config → _resolve_model_class_from_config), so D2 pays a
heuristic's fragility to avoid a cost already incurred.
main_input_name (an HF framework convention) is the authoritative, offline,
architecture-agnostic modality signal:
main_input_name
modality
feature-extraction upgrades to
input_ids
text
feature-extraction
pixel_values
image
image-feature-extraction
input_values / input_features
audio
audio-feature-extraction
It also handles the CLIP text/vision split correctly
(CLIPTextModelWithProjection→input_ids, CLIPVisionModelWithProjection→pixel_values),
which the config-field table cannot.
Model-id override: get_default_task_for_model_id (e.g. prajjwal1/bert-tiny)
is applied on the build path only; detect_task skips it, so inspect can
disagree with build today.
Modality signal: config fields (the AST bug) vs the resolved class.
Proposed architecture
_resolve_task_override(model_type, model_id) -> task | None
single place encoding model_type / model_id canonical task overrides
(replaces the short-circuit AND the sentinel AND folds in the model-id override)
_detect(config) -> (task, model_class | None, source)
the one task-detection core: override → wrapped-library → resolve class
→ infer task → fill-mask→seq2seq upgrade
detect_task(config) = _detect → modality-upgrade → drop class
resolve_task_and_model_class C1 = _detect → ensure class → modality-upgrade
Build-specific class resolution (get_model_class_for_task, specialization, arch
fallback) stays in the build entry layer. The short-circuit's "answer without
importing optimum" optimization is preserved when the override hits.
Decisions (proposed)
(a) Remove the D2 config-field table outright. Every path yielding feature-extraction either comes from the override mapping (already
modality-aware) or from TasksManager (class resolved → main_input_name
available). No path holds feature-extraction without a class, so the
heuristic is dead weight. Keep _resolve_task_modality as the single modality
entry point, re-implemented on main_input_name.
(b) Audio = detection-correct + register the name; defer real eval support.
AST/Whisper → audio-feature-extraction; add it to KNOWN_TASKS and TASK_ABBREV. Build quant then falls back to RandomDataset (works), eval errors
cleanly as unsupported. A real audio evaluator/dataset is follow-on work.
Out of scope
Universal "ONNX-input-intersection → RandomDataset" calibration fallback
(defense-in-depth; separate datasets-layer change).
A real audio-feature-extraction evaluator + default dataset.
Testing
Parametrized pytest over representative configs (text / image / audio / CLIP-dual /
SAM / bert-tiny), asserting: detect_task == task from resolve_task_and_model_class
for the same config; AST → audio-feature-extraction; bert-tiny model-id override
fires on both paths; vision backbones still → image-feature-extraction; no
regression on the #841 table (bart-mnli, sam, clip).
Summary
detect_task(used byinspect/eval) and_detect_task_and_class_from_config(used by
config/build) are two implementations of the same task decision overthe same
MODEL_CLASS_MAPPINGdata. They have drifted in three places, and themodality-disambiguation step (D2) reconstructs modality from config field names —
a heuristic that is provably weaker than information the pipeline already holds.
Proposal: extract a single task-detection core with two thin entry points, and
derive modality from the resolved model class's
main_input_name.Internal refactor only — public
detect_task/resolve_task_and_model_classsignatures stay unchanged.
Motivation: the D2 false positive (concrete bug)
D2 upgrades
feature-extraction → image-feature-extractionwhen the config has atop-level
image_size/patch_size(OR semantics).patch_sizeis not exclusiveto vision — spectrogram transformers patchify their mel-spectrogram. Verified via
the real detection path:
inspecttask todayWav2Vec2Modelfeature-extraction→ text dataset → failsWhisperModelfeature-extraction→ text dataset → failsASTModelimage-feature-extraction→ image dataset → failsfeature-extractionroutes to a text dataset (eval: STS-B; build: TextDataset) andimage-feature-extractionto an image dataset, so audio backbones fail in bothevalandbuildregardless of which branch D2 picks.Root cause
Modality is a property of the model class (known with certainty), but the pipeline
collapses
class → taskthrough TasksManager (modality-blind by design), discardingmodality, then D2 reconstructs it from config fields. The class that carries the
authoritative signal is already resolved at that exact point
(
_detect_task_from_config→_resolve_model_class_from_config), so D2 pays aheuristic's fragility to avoid a cost already incurred.
main_input_name(an HF framework convention) is the authoritative, offline,architecture-agnostic modality signal:
main_input_namefeature-extractionupgrades toinput_idsfeature-extractionpixel_valuesimage-feature-extractioninput_values/input_featuresaudio-feature-extractionIt also handles the CLIP text/vision split correctly
(
CLIPTextModelWithProjection→input_ids,CLIPVisionModelWithProjection→pixel_values),which the config-field table cannot.
Three inconsistencies the merge eliminates
detect_task'sdistinct_tasksshort-circuit vs_detect_task_and_class_from_config's(model_type, None)sentinelreverse-lookup — two mechanisms over the same dict (fix(task): make detect_task architecture-aware for multi-task model types #841 had to sync this pair).
get_default_task_for_model_id(e.g.prajjwal1/bert-tiny)is applied on the build path only;
detect_taskskips it, soinspectcandisagree with
buildtoday.Proposed architecture
Build-specific class resolution (
get_model_class_for_task, specialization, archfallback) stays in the build entry layer. The short-circuit's "answer without
importing optimum" optimization is preserved when the override hits.
Decisions (proposed)
feature-extractioneither comes from the override mapping (alreadymodality-aware) or from TasksManager (class resolved →
main_input_nameavailable). No path holds
feature-extractionwithout a class, so theheuristic is dead weight. Keep
_resolve_task_modalityas the single modalityentry point, re-implemented on
main_input_name.AST/Whisper →
audio-feature-extraction; add it toKNOWN_TASKSandTASK_ABBREV. Build quant then falls back to RandomDataset (works), eval errorscleanly as unsupported. A real audio evaluator/dataset is follow-on work.
Out of scope
(defense-in-depth; separate datasets-layer change).
audio-feature-extractionevaluator + default dataset.Testing
Parametrized pytest over representative configs (text / image / audio / CLIP-dual /
SAM / bert-tiny), asserting:
detect_task== task fromresolve_task_and_model_classfor the same config; AST →
audio-feature-extraction; bert-tiny model-id overridefires on both paths; vision backbones still →
image-feature-extraction; noregression on the #841 table (bart-mnli, sam, clip).
References