
[quantization] Implement QuantGemma4ForConditionalGeneration PTQ wrapper #798

Open
dvsav wants to merge 1 commit into Samsung:main from dvsav:model

@dvsav dvsav commented Jul 1, 2026


What

This PR implements full PTQ support for Gemma4ForConditionalGeneration, the top-level multimodal model that wraps Gemma4Model (vision + text decoder) and adds an lm_head linear layer with logit softcapping. Previously this wrapper was a skeleton stub; now it has a complete calibration path, export path, and Circle conversion support.

Why

The QuantGemma4ForConditionalGeneration wrapper existed only as a skeleton — its forward() returned raw logits without logit softcapping, as_export_module() raised ValueError for all modes, and the module was commented out in the registry. This meant the top-level Gemma4 model could not be quantized or converted to Circle format. This PR completes the wrapper so the full prepare → calibrate → convert → Circle export pipeline works end-to-end, matching the behavior of the original Gemma4ForConditionalGeneration.forward including logit softcapping.

Key Design Decisions

  1. Three observers for the softcapping path: obs_logit_softcapping_div (after logits / softcap), obs_logit_softcapping_tanh (after tanh), and obs_logits (final logits). The div observer was added after discovering that the exported graph had a div node with missing quantization parameters. Each intermediate tensor in the softcapping chain gets its own observer so that every graph node carries qparam metadata.

  2. Separate forward_export() method — Instead of reusing forward() for export, a dedicated forward_export() method uses self.model_export (the export adapter from self.model.as_export_module()) rather than self.model directly. This mirrors the pattern used by QuantGemma4Model and ensures the export path uses precomputed static inputs (inputs_embeds, per_layer_inputs, attention_masks, position_embeddings) rather than dynamic token embedding.

  3. New Gemma4ForConditionalGenerationExportAdapter — A dedicated adapter class was created instead of reusing the existing Gemma4LMHeadExportAdapter, because the latter applied norm twice and did not handle softcapping or logits_to_keep slicing. The new adapter simply delegates to forward_export() with the stored logits_to_keep value.

  4. logits_to_keep mapping — Prefill mode sets logits_to_keep=0 (all positions), decode mode sets logits_to_keep=1 (last position only), matching the production runtime contract.

  5. Lazy import of the export adapter: Gemma4ForConditionalGenerationExportAdapter is imported inside as_export_module() to avoid a circular import between quant_for_conditional_generation.py and export_adapters.py.
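
To make the first design decision concrete, here is a minimal, framework-free sketch of the div → tanh → mul chain with one observer per intermediate result. The `MinMaxObserver` class and the `apply_logit_softcapping` function are hypothetical stand-ins written for illustration; they are not the wrapper's actual observer classes or method, and plain Python lists stand in for tensors.

```python
import math


class MinMaxObserver:
    """Hypothetical stand-in for a PTQ observer: tracks the value range it sees."""

    def __init__(self, name):
        self.name = name
        self.min_val = math.inf
        self.max_val = -math.inf

    def collect(self, values):
        # Record the running min/max and pass the values through unchanged.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values


def apply_logit_softcapping(logits, softcap, obs_div, obs_tanh, obs_logits):
    # div -> tanh -> mul, with one observer after every intermediate result,
    # so no step in the chain is left without a recorded range.
    x = obs_div.collect([v / softcap for v in logits])
    x = obs_tanh.collect([math.tanh(v) for v in x])
    return obs_logits.collect([v * softcap for v in x])


obs_div = MinMaxObserver("obs_logit_softcapping_div")
obs_tanh = MinMaxObserver("obs_logit_softcapping_tanh")
obs_logits = MinMaxObserver("obs_logits")

raw_logits = [-120.0, -3.0, 0.0, 4.5, 250.0]
capped = apply_logit_softcapping(raw_logits, 30.0, obs_div, obs_tanh, obs_logits)

# Softcapping keeps every output within [-softcap, softcap].
assert all(abs(v) <= 30.0 for v in capped)
```

Because tanh maps into [-1, 1], the final multiplication bounds the logits by the softcap value, which is the property the softcapping-bounds tests check.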

Changes

  • tico/quantization/wrapq/wrappers/gemma4/quant_for_conditional_generation.py — Implemented forward() with logit softcapping (div → tanh → mul) and three fake-quantization observers; implemented forward_export() using the model export adapter; implemented as_export_module() returning Gemma4ForConditionalGenerationExportAdapter with mode-aware logits_to_keep; updated _all_observers() to return the three new observers.
  • tico/quantization/wrapq/wrappers/gemma4/export_adapters.py — Added Gemma4ForConditionalGenerationExportAdapter class that delegates to forward_export().
  • tico/quantization/wrapq/wrappers/registry.py — Enabled the quant_for_conditional_generation module in _CORE_MODULES (uncommented).
  • test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py — New file: 10 unit tests covering wrapper type, forward finiteness, output shape, logits_to_keep, softcapping bounds, prepare-convert flow, observer calibration, and prefill/decode export adapters.
  • test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py — New file: 7 smoke tests for the prepare-calibrate-convert flow, including FP parity, image-text calibration, logits_to_keep, softcapping bounds, and prefill/decode export module flows.
  • tico/quantization/wrapq/examples/gemma4/quantize_for_conditional_generation.py — New file: example script demonstrating the full PTQ flow (prepare → calibrate → convert → PEIR comparison → Circle export).
  • tico/quantization/recipes/debug/wrapper_smoke/cases/gemma4.py — Added Gemma4ForConditionalGenerationCase smoke case and registered it in GEMMA4_CASES.
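
The mode-aware logits_to_keep selection mentioned for as_export_module() can be pictured as a small lookup. This is only a sketch: `logits_to_keep_for_mode` is a hypothetical helper name, and rejecting unknown modes with ValueError is an assumption of the example rather than confirmed behavior of the wrapper.

```python
def logits_to_keep_for_mode(mode):
    """Hypothetical helper: map an export mode to a logits_to_keep value."""
    mapping = {
        "prefill": 0,  # 0 means "keep logits for all positions"
        "decode": 1,   # keep logits for the last position only
    }
    if mode not in mapping:
        # Assumption of this sketch: unknown modes are rejected explicitly.
        raise ValueError(f"unsupported export mode: {mode!r}")
    return mapping[mode]


assert logits_to_keep_for_mode("prefill") == 0
assert logits_to_keep_for_mode("decode") == 1
```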

Tests

All tests were run with RUN_INTERNAL_TESTS=1:

  • test_quant_for_conditional_generation.py: 10 passed in 86.51s. Validates wrapper type, forward finiteness, output shape (batch, seq_len, vocab_size), logits_to_keep=1 slicing, softcapping bounds (|logits| ≤ softcap), full prepare-convert flow, observer calibration after convert, and prefill/decode export adapter logits_to_keep values.
  • test_quantize_for_conditional_generation.py: 7 passed in 59.82s. Validates FP parity before quantization, prepare-convert flow with text-only and image-text calibration data, logits_to_keep slicing, softcapping bounds, and prefill/decode export module forward producing finite logits with correct shapes.
  • Wrapper smoke runner: PASS with Mean |diff| = 0.000035, Max |diff| = 0.000540, PEIR = 0.000529 (as printed in the smoke summary, which shows no percent sign).
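
For readers unfamiliar with the reported metrics, the sketch below shows one way they could be computed on flat lists of values. It assumes PEIR stands for the peak error divided by the value interval of the reference output; that definition, and the function names, are assumptions of this example and may differ from what the smoke runner implements.

```python
def mean_abs_diff(reference, candidate):
    # Average absolute element-wise difference.
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


def max_abs_diff(reference, candidate):
    # Largest absolute element-wise difference (the "peak error").
    return max(abs(r - c) for r, c in zip(reference, candidate))


def peir(reference, candidate):
    # Assumed definition: peak error divided by the reference value interval.
    interval = max(reference) - min(reference)
    peak = max_abs_diff(reference, candidate)
    if interval == 0:
        return 0.0 if peak == 0 else float("inf")
    return peak / interval


fp_logits = [-0.5, 0.0, 0.25, 0.5]
quant_logits = [-0.5, 0.0, 0.25, 0.375]

assert max_abs_diff(fp_logits, quant_logits) == 0.125
assert mean_abs_diff(fp_logits, quant_logits) == 0.03125
assert peir(fp_logits, quant_logits) == 0.125  # ratio; multiply by 100 for percent
```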

Unit Tests

$ RUN_INTERNAL_TESTS=1 python -m pytest test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py -v
====================================================== test session starts ======================================================
platform linux -- Python 3.10.12, pytest-9.0.3, pluggy-1.6.0 -- /home/d.savchenkov/myenv/bin/python
cachedir: .pytest_cache
rootdir: /home/d.savchenkov/TICO
configfile: pyproject.toml
plugins: anyio-4.13.0
collected 10 items                                                                                                              

test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_export_forward_produces_logits PASSED [ 10%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_export_module_decode PASSED [ 20%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_export_module_prefill PASSED [ 30%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_forward_output_shape PASSED [ 40%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_logit_softcapping_applied PASSED [ 50%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_logits_to_keep PASSED [ 60%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_observers_calibrated PASSED [ 70%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_prepare_convert_flow PASSED [ 80%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_prepare_returns_correct_wrapper PASSED [ 90%]
test/quantization/wrapq/wrappers/gemma4/test_quant_for_conditional_generation.py::TestQuantGemma4ForConditionalGenerationSmoke::test_text_only_forward_is_finite PASSED [100%]

================================================= 10 passed in 86.51s (0:01:26) =================================================

Internal Tests

$ RUN_INTERNAL_TESTS=1 /home/d.savchenkov/myenv/bin/python -m pytest test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py -v
====================================================== test session starts ======================================================
platform linux -- Python 3.10.12, pytest-9.0.3, pluggy-1.6.0 -- /home/d.savchenkov/myenv/bin/python
cachedir: .pytest_cache
rootdir: /home/d.savchenkov/TICO
configfile: pyproject.toml
plugins: anyio-4.13.0
collected 7 items                                                                                                               

test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_as_export_module_decode_flow PASSED [ 14%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_as_export_module_flow PASSED [ 28%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_logit_softcapping_bounded PASSED [ 42%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_logits_to_keep PASSED [ 57%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_no_quant_model_matches_reference PASSED [ 71%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_prepare_convert_flow PASSED [ 85%]
test/quantization/wrapq/wrappers/gemma4/test_quantize_for_conditional_generation.py::TestGemma4ForConditionalGenerationSmoke::test_prepare_convert_flow_with_image PASSED [100%]

====================================================== 7 passed in 59.82s =======================================================

Smoke Test

$ python -m tico.quantization.examples.inspect \
    --config tico/quantization/examples/configs/wrapper_smoke.yaml \
    --mode wrapper-smoke \
    --case gemma4_for_conditional_generation \
    --export circle \
    --output-dir ./out/wrapper_smoke
┌───────────── Wrapper Smoke Summary ─────────────
│ Case             : gemma4_for_conditional_generation
│ Status           : PASS
│ Mean |diff|      : 0.000035
│ Max |diff|       : 0.000540
│ PEIR             : 0.000529
│ Shape match      : True
│ Quant finite     : True
└─────────────────────────────────────────────────
Artifacts:
  - circle: out/wrapper_smoke/gemma4_for_conditional_generation.q.circle
     ┌───────────────────────────────────────────┐
 0.56┤                                           │
     │                                       ••  │
     │                                     ••    │
 0.37┤                                   ••      │
     │                                 •••       │
     │                               •••         │
     │                             •••           │
 0.18┤                           •••             │
     │                         •••               │
     │                       •••                 │
-0.01┤                     •••                   │
     │                   •••                     │
     │                 •••                       │
     │               •••                         │
-0.19┤             •••                           │
     │           •••                             │
     │         •••                               │
-0.38┤       •••                                 │
     │     •••                                   │
     │   •••                                     │
     │  ••                                       │
-0.57┤                                           │
     └┬──────────┬─────────┬──────────┬─────────┬┘
    -0.57      -0.29     -0.01      0.28     0.56 

Example Script

tico/quantization/wrapq/examples/gemma4/quantize_for_conditional_generation.py — Demonstrates the complete PTQ workflow:

  1. Creates a tiny Gemma4ForConditionalGeneration with final_logit_softcapping=30.0 (no download needed).
  2. Prepares the model with build_gemma4_e2b_ptq_config.
  3. Calibrates with 20 synthetic text-only samples.
  4. Converts to a fake-quantized model.
  5. Compares FP vs. quantized logits (prints PEIR and a plot).
  6. Exports to Circle format via as_export_module(mode="prefill") and saves as gemma4_for_conditional_generation.q.circle.

$ python tico/quantization/wrapq/examples/gemma4/quantize_for_conditional_generation.py
Preparing model for quantization...
Calibrating (text-only)...
Converting to quantized model...

┌───────────── Quantization Error Summary ─────────────
│ FP output shape    : (1, 16, 256)
│ Quant output shape : (1, 16, 256)
│ Mean |diff|        : 0.000020
│ PEIR               : 0.010376 %
└──────────────────────────────────────────────────────
     ┌───────────────────────────────────────────┐
 0.50┤                                           │
     │                                        •  │
     │                                     • •   │
 0.32┤                                   ••      │
     │                                  ••       │
     │                               •••         │
     │                             •••           │
 0.14┤                           •••             │
     │                         •••               │
     │                       •••                 │
-0.05┤                     •••                   │
     │                   •••                     │
     │                 •••                       │
     │               •••                         │
-0.23┤             •••                           │
     │           •••                             │
     │          ••                               │
-0.41┤       ••                                  │
     │      •                                    │
     │                                           │
     │  •                                        │
-0.59┤                                           │
     └┬──────────┬─────────┬──────────┬─────────┬┘
    -0.59      -0.32     -0.05      0.23     0.50 


Exporting to Circle format...
Export output shape: (1, 16, 256)
Converting to Circle format...
Circle model saved as 'gemma4_for_conditional_generation.q.circle'

Add logit softcapping, export adapter, tests, and example script.

Co-authored-by: Cline

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
@dvsav dvsav requested review from Torrero and mhs4670go July 1, 2026 13:58
@dvsav dvsav changed the title [Quantization] Implement QuantGemma4ForConditionalGeneration PTQ wrapper [quantization] Implement QuantGemma4ForConditionalGeneration PTQ wrapper Jul 1, 2026
Comment on lines +136 to +148
# Run the model export adapter to get hidden states (already normed).
hidden_states = self.model_export(
    inputs_embeds=inputs_embeds,
    per_layer_inputs=per_layer_inputs,
    attention_masks=attention_masks,
    position_embeddings=position_embeddings,
)

# Slice hidden states for logits_to_keep.
slice_indices = slice(-logits_to_keep, None) if logits_to_keep else slice(None)
logits = self.lm_head(hidden_states[:, slice_indices, :])

return self._apply_logit_softcapping(logits)
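
The slicing idiom in the quoted lines relies on 0 being falsy: logits_to_keep=0 selects every position, while a positive k keeps only the last k. The sketch below reproduces that behavior on a nested list shaped [batch][seq][hidden]; `select_positions` is a hypothetical name used only for this illustration.

```python
def select_positions(hidden_states, logits_to_keep):
    """Slice the sequence axis of a [batch][seq][hidden] nested list."""
    # 0 is falsy, so it selects all positions; k > 0 keeps the last k positions.
    slice_indices = slice(-logits_to_keep, None) if logits_to_keep else slice(None)
    return [sequence[slice_indices] for sequence in hidden_states]


hidden = [[[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]]  # batch=1, seq=3, hidden=2

assert len(select_positions(hidden, 0)[0]) == 3           # prefill: all positions
assert select_positions(hidden, 1)[0] == [[2.0, 2.1]]      # decode: last position
assert select_positions(hidden, 2)[0] == [[1.0, 1.1], [2.0, 2.1]]
```

Without the falsy check, slice(-0, None) would equal slice(0, None) and still select everything, so the conditional mainly documents the intent that 0 means "all positions".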


If I understand correctly, forward_export should contain only the LM head and logit softcapping for NPU processing.
