Skip to content

docs: add image generation dev note#777

Open
nabinchha wants to merge 3 commits into
mainfrom
nm/image-generation-dev-note
Open

docs: add image generation dev note#777
nabinchha wants to merge 3 commits into
mainfrom
nm/image-generation-dev-note

Conversation

@nabinchha

@nabinchha nabinchha commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

📋 Summary

Adds a Data Designer dev note showing how image generation fits into multimodal dataset pipelines, with a gallery of controlled examples across documents, products, traffic scenes, robotics, safety-review imagery, medical-style imagery, agriculture, drone inspection, and playful pet edits. The PR also adds the corresponding Fern recipe pages, example media, reusable display component, and downloadable recipe scripts under fern/assets.

🔗 Related Issue

N/A

🔄 Changes

  • Adds the image-generation dev note and reusable ImageExample component for paired image/control displays.
  • Adds retained image-generation recipe pages, example media, and recipe-card/navigation entries.
  • Adds downloadable recipe scripts under fern/assets/recipes/image_generation.
  • Keeps removed satellite/aerial and synthetic invoice recipe surfaces out of the dev note and recipe gallery.
  • Compresses the hero image as a JPEG and fixes both the dev note and Dev Notes index references.

🧪 Testing

  • Pre-commit hooks passed during commits
  • git diff --check passed
  • npx -y fern-api@5.41.1 check --warnings passed locally with 0 errors and 1 auth-dependent redirects warning
  • Fern docs preview build-and-deploy passed in CI
  • Verified PR diff only contains Fern docs/assets/component files
  • make test not run (docs-only change)
  • Unit tests added/updated (N/A — docs-only change)
  • E2E tests added/updated (N/A — docs-only change)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A)

@nabinchha nabinchha requested a review from a team as a code owner June 26, 2026 19:06
@nabinchha nabinchha deployed to agentic-ci June 26, 2026 19:06 — with GitHub Actions Active
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Fern preview: https://nvidia-preview-pr-777.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@github-actions

Copy link
Copy Markdown
Contributor

Code Review — PR #777: docs: add image generation dev note

Author: nabinchha · Base: main · Head: nm/image-generation-dev-note
Size: 83 files, +9,405 / −63

Summary

The PR's stated purpose is documentation: it adds an image-generation dev note,
a reusable ImageExample MDX/TSX component, nine recipe pages + downloadable
recipe scripts, example media, and a helper script for building before/after
comparison images. It also relocates recipe downloads from docs/assets/...
to fern/assets/recipes/image_generation/.

However, the PR also bundles a substantive, high-risk engine change to
ExpressionColumnGenerator and AsyncTaskScheduler (plus tests) that has
nothing to do with the dev note. The PR description acknowledges this
("supporting expression/drop-row scheduler fixes ... already present on this
branch"). The engine work is the part that carries real risk and is reviewed in
most depth below.

The diff splits cleanly into two unrelated concerns; that mismatch is the
single biggest issue with this PR (see Findings → Process).

Findings

Process / PR hygiene

  • 🔴 Title/scope mismatch — engine logic shipped under a docs: PR. The
    change to AsyncTaskScheduler (a core abstraction, 147 dependents per the
    structural analysis) and ExpressionColumnGenerator is real behavior change,
    not a doc edit. Per AGENTS.md ("No untested code paths", "declare, don't
    orchestrate" contract) and the conventional-commit checklist in the PR body,
    this belongs in its own fix:/refactor: PR so it gets an accurate title,
    focused review, and a clean changelog entry. Recommend splitting, or at
    minimum re-titling to reflect that engine behavior changed.
  • 🟡 make test was not run (PR checklist: "make test not run"). The
    bundled change touches the async scheduler's row-merge core. Even though new
    targeted tests are added, the full engine suite should pass before merge given
    the blast radius. Recommend running make test and make check-all-fix.
  • 🟡 fern check not run (CLI unavailable locally). The docs add nav
    entries, cross-page card links, and a new component import; a broken-link /
    nav validation pass should gate merge in CI.

Engine correctness — ExpressionColumnGenerator (expression.py)

  • 🟢 Solid, well-tested rewrite. The per-row try/except with a Counter
    drop-breakdown, the all-dropped → UserTemplateError escalation, and the
    index-preserving DataFrame(records, index=retained_indexes) are correct and
    covered by four new unit tests.
  • 🟢 Good edge-case handling of the 0-column DataFrame: to_dict(orient= "records") returns [] for a 0-column frame regardless of row count, and the
    synthesized [{} for _ in row_indexes] fallback (with the explanatory
    comment) correctly lets root/constant expressions render once per row. This
    is exactly what the scheduler's from_scratch path depends on.
  • 🟠 Behavior change worth calling out explicitly: _is_empty_rendered_ expression now drops any row whose rendered value is None or a
    whitespace-only string — including dtype="str". Previously an expression
    that legitimately rendered "" produced an empty-string cell; now that row is
    dropped entirely. This is intentional and tested, but it's a semantic change
    to existing pipelines (a str column that sometimes renders empty will now
    shrink the dataset). It should be documented in the changelog/release notes,
    and ideally the "drop on empty" behavior would be opt-in or at least noted in
    the ExpressionColumnConfig docs.
  • 🟢 Redundant-but-harmless guard: the new early if self.config.dtype not in _VALID_DTYPES: raise ValueError duplicates the else branch already in
    _cast_type. Fine for fail-fast, but note the duplication — if dtype
    validation belongs anywhere long-term it's config validation, not the
    generator hot path.
  • 🟢 Broad except Exception for template render is acceptable here because
    it's narrowly scoped to a single row, logged at debug with exc_info, counted
    as TemplateRenderError, and the all-dropped case still surfaces loudly.
    EmptyTemplateRenderError (a UserTemplateError subclass) is correctly caught
    before the generic handler so it's classified as an empty drop, not an error
    drop — verify ordering stays that way if exceptions are refactored.

Engine correctness — AsyncTaskScheduler (async_scheduler.py)

  • 🟢 The core fix is sound. Routing root expression columns (scheduled as
    from_scratch) through _run_full_column_from_scratch_with_row_drops so they
    honor the same per-row-drop / fatal-when-all-dropped contract as batch
    expression columns is the right call. The regression it fixes (PR fix: drop invalid expression rows #757: root
    {{ "" }} silently dropping the whole row group) is captured by
    test_root_expression_all_dropped_async_fails_loudly.
  • 🟢 Index-based result merge is more robust than the old positional merge.
    Replacing result_idx/result_df.iloc[result_idx] with
    _batch_result_by_row_index (keyed on the DataFrame index) and validating via
    _require_expression_row_drop_result (no duplicate indexes, no indexes outside
    the active set, no more rows than active) is defensive and correct. The
    strict=True zips are safe because the non-drop path still enforces
    expected_rows upstream.
  • 🟡 UserTemplateErrorDatasetGenerationError mapping in
    _raise_if_fatal_worker_error
    preserves the user-facing message
    (str(exc)) and chains the cause — good. Confirm this is the intended public
    error surface (callers depend on DatasetGenerationError per the "errors
    normalize at boundaries" principle), since the sync path raises
    UserTemplateError directly while the async path wraps it. That asymmetry
    (sync raises UserTemplateError, async raises DatasetGenerationError) could
    surprise callers catching one but not the other.
  • 🟢 The two near-identical block comments in _run_from_scratch and
    _run_full_column_from_scratch_with_row_drops are accurate; minor DRY nit but
    the duplication aids local readability.
  • Note on structural analysis: the "import direction violations" flagged
    (engine→interface via .items()) are false positives — they resolve to plain
    dict.items() calls, not cross-package imports. The one genuinely new import,
    ExpressionColumnConfig from data_designer.config into the engine, is in the
    legal direction (engine → config). No real layering violation introduced.

Docs / assets

  • 🟠 Orphaned assets: fern/assets/recipes/image_generation/synthetic_ invoices.py and the examples/synthetic-invoices/ image set (5 files) are
    added, but the PR body says the synthetic-invoice recipe surface was removed,
    and there is no recipe page, nav entry, or card referencing them. grep
    finds no .mdx/.yml/component reference to synthetic-invoices /
    synthetic_invoices. These appear to be dead files — either wire them up or
    drop them to avoid shipping unreferenced media and a downloadable script that
    no page links to.
  • 🟢 Card links look consistent. cards.mdx hrefs use Fern title-derived
    slugs (e.g. /recipes/image-generation/rich-document-image-generation) that
    correspond to the nav page: titles in latest.yml. Worth a fern check
    confirm, but the mapping is internally consistent. The dev note still
    references "invoice extraction" in a prose use-case table (line ~41) — fine as
    prose, just not as a linked recipe.
  • 🟢 ImageExample.tsx is clean: the dangerouslySetInnerHTML is a static
    CSS literal (commented as such, no user input — safe), the basepath fallback
    (withBasepath + onError swap) is a sensible way to handle dev-vs-prod asset
    roots, and the CSS-:target lightbox needs no JS. Minor a11y note: the
    :target lightbox can't be closed with Escape (only via the close/backdrop
    anchors) and toggling #id adds history entries — acceptable for a docs
    component.
  • 🟢 SPDX headers present on all new .py files (helper script + recipe
    scripts), satisfying the license-header requirement.
  • 🟢 .gitignore additions for the serve-local-docs-preview sidecar
    backups are well-commented and scoped.

Test coverage

  • 🟢 Strong. New tests cover: sync empty-drop + warn, row-specific render
    errors, type-cast errors, all-dropped-raises (unit); sync batch row-drop
    shrink, skip-aware batch (dataset builder); async row-group drop, async
    all-dropped fatal, root expression all-dropped fatal (the fix: drop invalid expression rows #757
    regression), and skip + drop interaction (async integration). The async tests
    assert on written batches and actual_num_records, not just non-null — good,
    non-hollow assertions.
  • 🟡 Gap: no test exercises the _run_full_column_from_scratch_with_row_ drops path with buffer_manager is None (the DataFrame(index=range(rg_size))
    branch). Low priority since that path is narrow, but it's the one branch in the
    new method without direct coverage.

Recommendations (priority order)

  1. Split the engine change out of this docs PR, or re-title to reflect that
    AsyncTaskScheduler / expression behavior changed; run make test +
    make check-all-fix before merge.
  2. Resolve the orphaned synthetic_invoices script + images — wire up or
    remove.
  3. Document the empty-string-drop behavior change for dtype="str"
    expression columns in release notes / config docs.
  4. Reconcile the sync-vs-async error type surfaced for all-dropped
    (UserTemplateError vs DatasetGenerationError).
  5. Run fern check in CI to validate nav + cross-page links.

Structural Impact (graphify, 2.2s)

Risk: HIGH (1 core abstraction(s) modified; 10 import direction violation(s))

  • 17 Python files, 131 AST entities, 2/77 clusters

Import Direction Violations (10)

Legal direction: interface -> engine -> config

  • ._log_row_drops() (engine) --calls--> .items() (interface)
  • _string_keyed_counts() (engine) --calls--> .items() (interface)
  • .__init__() (engine) --calls--> .items() (interface)
  • ._record_observed_task_state() (engine) --calls--> .items() (interface)
  • ._request_pressure_diagnostics() (engine) --calls--> .items() (interface)
  • +5 more

Core Abstractions Modified

High-Connectivity Changes

  • AsyncTaskScheduler (147 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • _DispatchOutcome (49 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • _RowGroupState (49 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • Return the scheduler-side async capacity explanation for this run. (47 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • Dependency-aware async task scheduler for the dataset builder. Replaces seq (47 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • Core task execution logic. (47 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • Run pre-batch callbacks for row groups whose seeds just completed. (47 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • Core dispatch loop extracted from ``run()``. (47 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • +85 more

Cross-Package Dependencies

  • ExpressionColumnGenerator (engine) --uses--> ExpressionColumnConfig (config)
  • .generate() (engine) --calls--> info() (config)
  • ._log_row_drops() (engine) --calls--> .items() (interface)
  • _string_keyed_counts() (engine) --calls--> .items() (interface)
  • _RowGroupState (engine) --uses--> ExpressionColumnConfig (config)
  • _RowGroupState (engine) --uses--> GenerationStrategy (config)
  • +85 more

Verdict

Request changes (non-blocking on correctness; blocking on scope/hygiene).

The engine logic is well-implemented and well-tested — the expression
row-drop rewrite and the async-scheduler root-expression fix are correct and
address a real regression (#757). The blocking concerns are process and
completeness
, not bugs:

  1. A high-blast-radius engine change (AsyncTaskScheduler, a core abstraction
    with 147 dependents) is shipped under a docs: title with make test not
    run. Split it out or re-title and run the full suite.
  2. Orphaned synthetic_invoices script + images are added but referenced by
    nothing — wire up or remove.
  3. The empty-string → row-drop semantic change for dtype="str" should be
    documented as it can shrink existing users' datasets.

Once the engine change is run through make test/fern check and the orphaned
assets are resolved, this is mergeable.

Reviewer did not approve or request changes on GitHub; this is an advisory review only.

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Adds a new dev note explaining how image generation fits into multimodal dataset pipelines, illustrated with ~40 gallery examples across nine domains (documents, apparel, traffic, robotics, security, medical, agriculture, drone, and pets), along with the matching Fern recipe pages, nine downloadable Python recipe scripts, and the shared ImageExample React component.

  • New ImageExample TSX component renders a CSS-only lightbox alongside sampler-control chips; styles are injected inline via dangerouslySetInnerHTML on every render instance.
  • Nine recipe scripts (rich_document_images.py, product_image_variations.py, etc.) follow a consistent pattern: categorical samplers → ImageColumnConfig → optional parquet export.
  • Navigation and recipe card index (latest.yml, cards.mdx) are updated to surface all nine new image-generation recipes.

Confidence Score: 5/5

Documentation-only change adding a dev note, example images, recipe pages, and downloadable Python scripts. No production code paths are affected.

All changes are confined to Fern docs, static assets, and standalone recipe scripts. The new ImageExample component and recipe scripts are self-contained and do not touch any data-pipeline or server logic. The two observations noted are a redundant style-injection pattern in the React component and a column-name mismatch in an optional parquet export helper — neither affects the rendered page behavior or the core recipe functionality.

fern/components/ImageExample.tsx and fern/assets/recipes/image_generation/rich_document_images.py are worth a light second look for the style duplication and png_base64 naming, but neither issue blocks the change.

Important Files Changed

Filename Overview
fern/components/ImageExample.tsx New reusable component for dev-note image galleries; implements a CSS-only :target lightbox and injects the full CSS string on every render instance, meaning the same ~2 KB of styles is duplicated once per ImageExample in the final HTML.
fern/assets/recipes/image_generation/rich_document_images.py Recipe script for synthetic business-document generation; the export_seed_parquet helper names its output column png_base64 and describes it as "base64 PNGs" in both the docstring and CLI help, but reads raw image bytes that are likely JPEG for the default model, creating a misleading column name for downstream consumers.
fern/versions/latest/pages/devnotes/posts/image-generation-for-multimodal-data-pipelines.mdx Main dev-note page; correctly references the .jpg hero image and all example images; links to tutorials, recipes, and the PyData workshop. No issues found.
fern/versions/latest/pages/devnotes/index.mdx Dev Notes index; new BlogCard entry correctly uses .jpg extension for the hero image. The previously flagged .png extension issue has been fixed in this PR.
fern/versions/latest.yml Navigation config updated to add the new dev note and all nine image-generation recipe pages; structure is consistent with existing entries.
fern/versions/latest/pages/recipes/cards.mdx Recipe card index updated with nine new Image Generation cards; titles, icons, and hrefs are consistent and match the navigation config.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User
    participant Browser
    participant IE as ImageExample component
    participant LB as CSS target lightbox

    User->>Browser: Page loads
    Browser->>IE: Renders 40 ImageExample instances
    IE->>Browser: Injects style tag with IMAGE_EXAMPLE_CSS on each render
    IE->>Browser: Renders thumbnail linked to hash anchor
    User->>Browser: Clicks thumbnail
    Browser->>LB: URL hash set to image-example-slug
    LB->>Browser: lightbox target selector triggers display flex
    Browser-->>User: Lightbox overlay opens
    User->>Browser: Clicks backdrop or close button with href hash underscore
    Browser->>LB: URL hash set to underscore no matching element
    LB->>Browser: display none restored
    Browser-->>User: Lightbox closes
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant User
    participant Browser
    participant IE as ImageExample component
    participant LB as CSS target lightbox

    User->>Browser: Page loads
    Browser->>IE: Renders 40 ImageExample instances
    IE->>Browser: Injects style tag with IMAGE_EXAMPLE_CSS on each render
    IE->>Browser: Renders thumbnail linked to hash anchor
    User->>Browser: Clicks thumbnail
    Browser->>LB: URL hash set to image-example-slug
    LB->>Browser: lightbox target selector triggers display flex
    Browser-->>User: Lightbox overlay opens
    User->>Browser: Clicks backdrop or close button with href hash underscore
    Browser->>LB: URL hash set to underscore no matching element
    LB->>Browser: display none restored
    Browser-->>User: Lightbox closes
Loading

Reviews (4): Last reviewed commit: "docs: tighten image generation gallery i..." | Re-trigger Greptile

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha force-pushed the nm/image-generation-dev-note branch from ca3f7d4 to fa74188 Compare June 26, 2026 19:18
Comment thread fern/versions/latest/pages/devnotes/index.mdx Outdated
Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant