[quantization] Implement QuantGemma4ForConditionalGeneration PTQ wrapper #798
Open
dvsav wants to merge 1 commit into
Add logit softcapping, export adapter, tests, and example script.
Co-authored-by: Cline
TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
Torrero reviewed Jul 1, 2026
Comment on lines +136 to +148
    # Run the model export adapter to get hidden states (already normed).
    hidden_states = self.model_export(
        inputs_embeds=inputs_embeds,
        per_layer_inputs=per_layer_inputs,
        attention_masks=attention_masks,
        position_embeddings=position_embeddings,
    )

    # Slice hidden states for logits_to_keep.
    slice_indices = slice(-logits_to_keep, None) if logits_to_keep else slice(None)
    logits = self.lm_head(hidden_states[:, slice_indices, :])

    return self._apply_logit_softcapping(logits)
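The slicing expression in this hunk can be tried in isolation. The sketch below is illustrative only: plain Python lists stand in for the sequence axis of the hidden-state tensor, and the `keep_slice` helper and `LOGITS_TO_KEEP_BY_MODE` table are hypothetical names that restate the prefill/decode mapping from the PR description, not code from the PR.

```python
def keep_slice(logits_to_keep: int) -> slice:
    """Mirror the hunk's expression: 0 keeps every position, k > 0 keeps the last k."""
    return slice(-logits_to_keep, None) if logits_to_keep else slice(None)


# Hypothetical table restating the PR description: prefill keeps all positions,
# decode keeps only the last one.
LOGITS_TO_KEEP_BY_MODE = {"prefill": 0, "decode": 1}

# Stand-in for the sequence axis of a (batch, seq_len, hidden) tensor.
positions = ["t0", "t1", "t2", "t3"]

prefill_kept = positions[keep_slice(LOGITS_TO_KEEP_BY_MODE["prefill"])]
decode_kept = positions[keep_slice(LOGITS_TO_KEEP_BY_MODE["decode"])]

print(prefill_kept)  # every position is passed on to the LM head
print(decode_kept)   # only the final position is passed on
```

In the wrapper itself the same slice is applied to the second axis of `hidden_states` before `lm_head`, so decode mode projects a single position to the vocabulary instead of the whole sequence.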
Contributor
If I understand correctly, forward_export should contain only the LM head and logit_softcapping for NPU processing.
What
This PR implements full PTQ support for `Gemma4ForConditionalGeneration`, the top-level multimodal model that wraps `Gemma4Model` (vision + text decoder) and adds an `lm_head` linear layer with logit softcapping. Previously this wrapper was a skeleton stub; now it has a complete calibration path, export path, and Circle conversion support.

Why
The `QuantGemma4ForConditionalGeneration` wrapper existed only as a skeleton: its `forward()` returned raw logits without logit softcapping, `as_export_module()` raised `ValueError` for all modes, and the module was commented out in the registry. As a result, the top-level Gemma4 model could not be quantized or converted to Circle format. This PR completes the wrapper so the full prepare → calibrate → convert → Circle export pipeline works end-to-end, matching the behavior of the original `Gemma4ForConditionalGeneration.forward`, including logit softcapping.

Key Design Decisions
- Three observers for the softcapping path: `obs_logit_softcapping_div` (after `logits / softcap`), `obs_logit_softcapping_tanh` (after `tanh`), and `obs_logits` (final logits). The `div` observer was added after discovering that the exported graph had a `div` node with missing quantization parameters. Each intermediate tensor in the softcapping chain gets its own observer so that every graph node carries qparam metadata.
- Separate `forward_export()` method: Instead of reusing `forward()` for export, a dedicated `forward_export()` method uses `self.model_export` (the export adapter from `self.model.as_export_module()`) rather than `self.model` directly. This mirrors the pattern used by `QuantGemma4Model` and ensures the export path uses precomputed static inputs (`inputs_embeds`, `per_layer_inputs`, `attention_masks`, `position_embeddings`) rather than dynamic token embedding.
- New `Gemma4ForConditionalGenerationExportAdapter`: A dedicated adapter class was created instead of reusing the existing `Gemma4LMHeadExportAdapter`, because the latter applied `norm` twice and did not handle softcapping or `logits_to_keep` slicing. The new adapter simply delegates to `forward_export()` with the stored `logits_to_keep` value.
- `logits_to_keep` mapping: Prefill mode sets `logits_to_keep=0` (all positions) and decode mode sets `logits_to_keep=1` (last position only), matching the production runtime contract.
- Lazy import of the export adapter: `Gemma4ForConditionalGenerationExportAdapter` is imported inside `as_export_module()` to avoid a circular import between `quant_for_conditional_generation.py` and `export_adapters.py`.

Changes
- `tico/quantization/wrapq/wrappers/gemma4/quant_for_conditional_generation.py`: Implemented `forward()` with logit softcapping (div → tanh → mul) and three fake-quantization observers; implemented `forward_export()` using the model export adapter; implemented `as_export_module()` returning `Gemma4ForConditionalGenerationExportAdapter` with mode-aware `logits_to_keep`; updated `_all_observers()` to return the three new observers.
- `tico/quantization/wrapq/wrappers/gemma4/export_adapters.py`: Added the `Gemma4ForConditionalGenerationExportAdapter` class, which delegates to `forward_export()`.
- `tico/quantization/wrapq/wrappers/registry.py`: Enabled the `quant_for_conditional_generation` module in `_CORE_MODULES` (uncommented).
- `test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py`: New file with 10 unit tests covering wrapper type, forward finiteness, output shape, `logits_to_keep`, softcapping bounds, prepare-convert flow, observer calibration, and prefill/decode export adapters.
- `test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py`: New file with 7 smoke tests for the prepare-calibrate-convert flow, including FP parity, image-text calibration, `logits_to_keep`, softcapping bounds, and prefill/decode export module flows.
- `tico/quantization/wrapq/examples/gemma4/quantize_for_conditional_generation.py`: New file with an example script demonstrating the full PTQ flow (prepare → calibrate → convert → PEIR comparison → Circle export).
- `tico/quantization/recipes/debug/wrapper_smoke/cases/gemma4.py`: Added the `Gemma4ForConditionalGenerationCase` smoke case and registered it in `GEMMA4_CASES`.

Tests
All tests were run with `RUN_INTERNAL_TESTS=1`:
- `test_quant_for_conditional_generation.py`: 10 passed (80s). Validates wrapper type, forward finiteness, output shape `(batch, seq_len, vocab_size)`, `logits_to_keep=1` slicing, softcapping bounds (`|logits| ≤ softcap`), full prepare-convert flow, observer calibration after convert, and prefill/decode export adapter `logits_to_keep` values.
- `test_quantize_for_conditional_generation.py`: 7 passed (60s). Validates FP parity before quantization, prepare-convert flow with text-only and image-text calibration data, `logits_to_keep` slicing, softcapping bounds, and prefill/decode export module forward producing finite logits with correct shapes.

Unit Tests
Internal Tests
Smoke Test
Example Script
`tico/quantization/wrapq/examples/gemma4/quantize_for_conditional_generation.py` demonstrates the complete PTQ workflow:
- `Gemma4ForConditionalGeneration` with `final_logit_softcapping=30.0` (no download needed).
- `build_gemma4_e2b_ptq_config`.
- `as_export_module(mode="prefill")` and saves as `gemma4_for_conditional_generation.q.circle`.
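The softcapping chain described under Key Design Decisions (div, then tanh, then mul, with one observer per intermediate tensor) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the wrapper's code: `MinMaxObserver` is a hypothetical stand-in that only tracks a value range, the function name is invented for the example, and Python lists replace tensors. The softcap value of 30.0 is the one the example script uses.

```python
import math


class MinMaxObserver:
    """Hypothetical stand-in for a PTQ observer: records the running value range."""

    def __init__(self) -> None:
        self.min_val = math.inf
        self.max_val = -math.inf

    def __call__(self, values: list) -> list:
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values  # pass-through, so the observer can sit inline in the chain


def softcap_with_observers(logits, softcap, obs_div, obs_tanh, obs_logits):
    """Apply softcap * tanh(logits / softcap), observing each intermediate result."""
    divided = obs_div([x / softcap for x in logits])       # after logits / softcap
    squashed = obs_tanh([math.tanh(x) for x in divided])   # after tanh
    return obs_logits([x * softcap for x in squashed])     # final capped logits


obs_div, obs_tanh, obs_logits = MinMaxObserver(), MinMaxObserver(), MinMaxObserver()
capped = softcap_with_observers(
    [-120.0, -3.0, 0.0, 45.0], 30.0, obs_div, obs_tanh, obs_logits
)
print(capped)
print(obs_div.min_val, obs_div.max_val)
```

Because `tanh` maps into (-1, 1), the capped logits stay within ±softcap, which is the bound the softcapping tests check. Observing the division result separately is what gives the exported `div` node a recorded range of its own.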