GUI-DR is part of a collaborative effort on Software Control Agents between Manifold Research and Fig.
GUI-DR is a data augmentation pipeline built on domain randomization principles.
GUI grounding models often rely on visual primitives (shape, position, color) rather than functional semantics, and fixed-scene benchmarks do not reveal how they degrade under distribution shift. Using Mind2Web MHTML archives, GUI-DR varies visual scenes and instructions along controlled axes to generate data to evaluate or finetune models for use cases such as GUI grounding.
- 2026-04 Initial release of GUI-Perturbed, technical report, and data generation pipeline GUI-DR.
Requirements: Python ≥ 3.11. Download Mind2Web data under mm_mind2web/.
git clone https://github.com/ManifoldRG/GUI-DR.git
cd GUI-DR
Install with uv or pip below. Versions are pinned in uv.lock (see pyproject.toml). Playwright browsers are required to run the pipeline.
Install uv, then:
uv sync
uv run playwright install
Use uv run python … from the repo root (or source .venv/bin/activate and run python as usual). Example: uv run python src/main.py --split test_task.
Or install with pip:
python3.11 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .
playwright install
python src/main.py --split test_task
mm_mind2web/ (gitignored) at the repo root:
- mm_mind2web/data/<split>-*.parquet
- mm_mind2web/task/<task_uid>/processed/dom_content.json
- mm_mind2web/task/<task_uid>/processed/snapshots/*.mhtml

- Parquets: Multimodal-Mind2Web on Hugging Face → mm_mind2web/data/ (e.g. train-*.parquet, test_task-*.parquet).
- Raw dump: task trees with processed/dom_content.json and processed/snapshots/*.mhtml per the Mind2Web raw dump. Symlink or copy so each task is at mm_mind2web/task/<task_uid>/…. Optional: scripts/globus_mind2web_downloader.sh for Globus transfer (needs an endpoint and .env).
- Parquet rows and task_uid paths must refer to the same tasks.

(Optional) For debug logging or scripts that use Globus/API keys, copy .env.example to .env and set any variables you need.
The commands below run the test_task split with the default Style variant.
uv (from repo root):
uv run python src/main.py --split test_task
pip (venv activated):
python src/main.py --split test_task
Outputs: outputs/run_<timestamp>_test_task/<task_uid>/ with screenshots/ and trajectory.json. For other variants, see Generating data.
One run produces one variant. Choose flags to match the variant you want. Run from the repo root with the venv active (pip) or prefix with uv run (uv).
# Original (no perturbations)
python src/main.py --split test_task --enable_zoom false --enable_dense_info false --enable_style_variants false
# Precision (viewport zoom 0.7×)
python src/main.py --split test_task --enable_zoom true --zoom_level 0.7 --enable_dense_info false --enable_style_variants false
# Style (colors, fonts, restyling), default
python src/main.py --split test_task --enable_style_variants true --enable_zoom false --enable_dense_info false
# Text Shrink (reduced font size)
python src/main.py --split test_task --enable_dense_info true --enable_style_variants false --enable_zoom false

| Argument | Default | Description |
|---|---|---|
| --split, -s | train | Split: train, test_domain, test_task, test_website. |
| --enable_zoom | False | Enable viewport zoom (Precision). |
| --zoom_level | 0.7 | Zoom level: 0.7, 0.5, or 0.3. |
| --enable_dense_info | False | Enable text shrink. |
| --enable_style_variants | True | Enable style randomization. |
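Since each run produces exactly one variant, building all four means four invocations with different flag combinations. A minimal sketch, assuming only the flags and accepted values documented above; build_command and run_all_variants are illustrative helpers, not part of the repo.

```python
import shlex
import subprocess

# Flag values for each variant, taken from the example commands above.
VARIANT_FLAGS = {
    "original":    {"enable_zoom": False, "enable_dense_info": False, "enable_style_variants": False},
    "precision":   {"enable_zoom": True,  "enable_dense_info": False, "enable_style_variants": False},
    "style":       {"enable_zoom": False, "enable_dense_info": False, "enable_style_variants": True},
    "text_shrink": {"enable_zoom": False, "enable_dense_info": True,  "enable_style_variants": False},
}
SPLITS = ("train", "test_domain", "test_task", "test_website")
ZOOM_LEVELS = (0.7, 0.5, 0.3)


def build_command(variant: str, split: str = "test_task", zoom_level: float = 0.7,
                  use_uv: bool = False) -> list[str]:
    """Return the argv for one src/main.py run that produces a single variant."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant {variant!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}")
    cmd = ["uv", "run"] if use_uv else []
    cmd += ["python", "src/main.py", "--split", split]
    for flag, enabled in VARIANT_FLAGS[variant].items():
        cmd += [f"--{flag}", "true" if enabled else "false"]
    if variant == "precision":
        if zoom_level not in ZOOM_LEVELS:
            raise ValueError(f"zoom_level must be one of {ZOOM_LEVELS}")
        cmd += ["--zoom_level", str(zoom_level)]
    return cmd


def run_all_variants(split: str = "test_task", use_uv: bool = False) -> None:
    """Run src/main.py once per variant; call this from the repo root."""
    for name in VARIANT_FLAGS:
        argv = build_command(name, split=split, use_uv=use_uv)
        print(shlex.join(argv))
        subprocess.run(argv, check=True)
```

Calling run_all_variants("test_task") from the repo root runs the four variants back to back; check=True stops at the first failing run instead of silently producing an incomplete set.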
outputs/run_<timestamp>_<split>/<task_uid>/ contains screenshots/ and trajectory.json. Use one run per variant when building evaluation data or downstream tooling.
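When combining several runs into evaluation data, it helps to index the output folders first. This hypothetical helper walks outputs/ using only the naming pattern stated above. It does not parse trajectory.json, whose schema is not documented here, and it makes no assumption about the timestamp format.

```python
import re
from pathlib import Path

SPLITS = ("train", "test_domain", "test_task", "test_website")
# outputs/run_<timestamp>_<split>/ ; the timestamp is captured as-is.
RUN_DIR = re.compile(r"^run_(?P<timestamp>.+)_(?P<split>" + "|".join(SPLITS) + r")$")


def find_task_outputs(outputs: Path) -> list[dict]:
    """Return one record per outputs/run_<timestamp>_<split>/<task_uid>/ folder."""
    records = []
    for run_dir in sorted(p for p in outputs.iterdir() if p.is_dir()):
        match = RUN_DIR.match(run_dir.name)
        if match is None:
            continue  # not a pipeline run folder
        for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
            trajectory = task_dir / "trajectory.json"
            shots = task_dir / "screenshots"
            records.append({
                "run": run_dir.name,
                "timestamp": match["timestamp"],
                "split": match["split"],
                "task_uid": task_dir.name,
                "has_trajectory": trajectory.is_file(),
                "n_screenshots": sum(1 for f in shots.iterdir() if f.is_file()) if shots.is_dir() else 0,
            })
    return records
```

Records with has_trajectory set to False or zero screenshots usually indicate tasks that failed mid-run.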
Input: Parquet files for the split, plus per-task dom_content.json and MHTML snapshots in mm_mind2web/.
Flow: Load parquet → for each task, load MHTML snapshots in order → per step: optionally inject UI modifications (style / zoom / text shrink) → resolve target element from parquet → capture screenshot and bbox → write trajectory.json and screenshots.
flowchart TB
subgraph input [Input]
Parquet[parquet files]
MHTML[MHTML snapshots]
end
subgraph pipeline [Pipeline]
ActionProc[action_processor]
MHTMLProc[MHTMLProcessor]
Inject[inject UI mods]
Locate[locate element]
Screenshot[screenshot + bbox]
end
subgraph output [Output]
Trajectory[trajectory.json]
Screens[screenshots/]
end
Parquet --> ActionProc
MHTML --> ActionProc
ActionProc --> MHTMLProc
MHTMLProc --> Inject
Inject --> Locate
Locate --> Screenshot
Screenshot --> Trajectory
Screenshot --> Screens
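The per-step flow can be illustrated with a stripped-down Playwright loop. This is a hypothetical sketch, not the repo's action_processor or MHTMLProcessor: it takes a CSS selector per step instead of resolving the target from parquet, assumes Chromium can open each MHTML snapshot from a file:// URL, and writes a made-up trajectory.json layout with an [x1, y1, x2, y2] bbox. It relies only on the Playwright sync API calls page.goto, page.locator(...).first.bounding_box() and page.screenshot.

```python
import json
from pathlib import Path
from typing import Any, Callable, Sequence


def run_task(page: Any, snapshots: Sequence[Path], selectors: Sequence[str],
             out_dir: Path, perturb: Callable[[Any], None] = lambda page: None) -> list[dict]:
    """Hypothetical per-task loop: one MHTML snapshot and one target per step.

    `page` is a Playwright sync-API Page. `selectors[i]` is a CSS selector for the
    step-i target; the real pipeline resolves targets from parquet instead.
    """
    if len(snapshots) != len(selectors):
        raise ValueError("expected one selector per snapshot")
    shots_dir = out_dir / "screenshots"
    shots_dir.mkdir(parents=True, exist_ok=True)
    steps = []
    for i, (snapshot, selector) in enumerate(zip(snapshots, selectors)):
        page.goto(snapshot.resolve().as_uri())  # load this step's snapshot
        perturb(page)  # optional UI modification (style / zoom / text shrink)
        box = page.locator(selector).first.bounding_box()  # None if not visible
        shot = shots_dir / f"step_{i}.png"
        page.screenshot(path=str(shot))
        bbox = None if box is None else [box["x"], box["y"],
                                         box["x"] + box["width"], box["y"] + box["height"]]
        steps.append({"step_index": i, "screenshot": f"screenshots/{shot.name}", "bbox": bbox})
    (out_dir / "trajectory.json").write_text(json.dumps({"steps": steps}, indent=2))
    return steps
```

Injecting the perturbation before locating the element matters: the bbox must be measured on the modified layout, otherwise zoom or text shrink would leave the ground truth pointing at the unperturbed position.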
Perturbations
| Variant | Config | Implementation |
|---|---|---|
| Original | All off | No injection. |
| Style | enable_style_variants=True | randomization, generator, templates. |
| Precision | enable_zoom=True, zoom_level ∈ {0.7, 0.5, 0.3} | zoom. |
| Text Shrink | enable_dense_info=True | dense_info. |
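Each variant boils down to a change injected into the loaded page. The sketch below shows one plausible way to do this with Playwright's page.add_style_tag and page.evaluate; it is an assumption for illustration, not the code in the randomization, zoom or dense_info modules. The font pool, the text_scale factor of 0.8 and the use of the CSS zoom property are all hypothetical choices.

```python
import random
from typing import Any

# Illustrative font pool; the repo's templates are not reproduced here.
FONTS = ("Georgia, serif", "'Courier New', monospace", "Verdana, sans-serif", "'Trebuchet MS', sans-serif")
ZOOM_LEVELS = (0.7, 0.5, 0.3)

# Runs in the page: read every computed font size first, then write scaled
# pixel sizes, so shrinking a parent does not compound on its children.
SHRINK_TEXT_JS = """(scale) => {
  const els = Array.from(document.querySelectorAll('body *'));
  const sizes = els.map((el) => parseFloat(getComputedStyle(el).fontSize));
  els.forEach((el, i) => { el.style.fontSize = (sizes[i] * scale) + 'px'; });
}"""


def random_style_css(rng: random.Random) -> str:
    """Sample a page background, a text color, and a font family."""
    bg = f"#{rng.randrange(0x1000000):06x}"
    fg = f"#{rng.randrange(0x1000000):06x}"
    font = rng.choice(FONTS)
    return (f"body {{ background-color: {bg} !important; }} "
            f"body * {{ color: {fg} !important; font-family: {font} !important; }}")


def apply_variant(page: Any, variant: str, *, zoom_level: float = 0.7,
                  text_scale: float = 0.8, seed: int = 0) -> None:
    """Apply one visual variant to a loaded Playwright (sync API) page."""
    if variant == "original":
        return  # no injection
    if variant == "style":
        page.add_style_tag(content=random_style_css(random.Random(seed)))
    elif variant == "precision":
        if zoom_level not in ZOOM_LEVELS:
            raise ValueError(f"zoom_level must be one of {ZOOM_LEVELS}")
        page.add_style_tag(content=f"html {{ zoom: {zoom_level}; }}")
    elif variant == "text_shrink":
        page.evaluate(SHRINK_TEXT_JS, text_scale)
    else:
        raise ValueError(f"unknown variant {variant!r}")
```

Seeding the generator per screen keeps Style outputs reproducible, which matters when the same screen has to be re-rendered later.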
Instructions are generated per step from parquet target_action_reprs via generate_step_instruction. Config: config; injection: injection.
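To make the direct-instruction step concrete: Mind2Web action reprs are assumed here to look like "[button]  Search -> CLICK" or "[combobox]  Flying from -> TYPE: New York", and the parser below relies on that shape. It is a hypothetical stand-in for generate_step_instruction; the repo's actual wording and rules may differ.

```python
import re

# Assumed repr shape: "[tag]  element text -> OP" or "... -> OP: value".
ACTION_REPR = re.compile(
    r"^\[(?P<tag>[^\]]*)\]\s*(?P<text>.*?)\s*->\s*(?P<op>[A-Z]+)(?::\s*(?P<value>.*))?$"
)


def direct_instruction(action_repr: str) -> str:
    """Turn one target action repr into a direct referring instruction."""
    m = ACTION_REPR.match(action_repr.strip())
    if m is None:
        raise ValueError(f"unrecognized action repr: {action_repr!r}")
    tag, text, op, value = m["tag"], m["text"], m["op"], m["value"]
    # Fall back to the tag alone when the element has no visible text.
    target = f'the "{text}" {tag}' if text else f"the {tag} element"
    if op == "CLICK":
        return f"Click {target}."
    if op in ("TYPE", "SELECT") and value is not None:
        verb, prep = ("Type", "into") if op == "TYPE" else ("Select", "in")
        return f'{verb} "{value}" {prep} {target}.'
    raise ValueError(f"unsupported operation {op!r}")
```

Raising on unknown operations, rather than guessing a verb, keeps malformed steps out of the generated data.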
| Resource | Description |
|---|---|
| GUI-Perturbed | Released evaluation data (screenshots, instructions, ground-truth bboxes). |
| Baseline result viewer | Streamlit Space: baseline 7B GUI grounding predictions on original vs perturbed screenshots. |
Dataset summary
| Aspect | Description |
|---|---|
| Source | Mind2Web MHTML archives (real web pages, DOM preserved). |
| Visual variants | Original, Style, Precision (zoom 0.7), Text Shrink. ~390 screens per variant. |
| Schema | visual_variant, instruction_type, task_id, step_index, instruction, gt_bbox, screenshot. See the dataset card. |
| Instructions | Direct (constructed from target_action_reprs); relational (in released schema). |
Use this repo to reproduce or extend the data; use the Hugging Face dataset for evaluation.
Download the GUI-Perturbed dataset to evaluate your models. An evaluation script will be released soon.
- Perturbation realism: We prioritize diagnostic coverage over photorealism; some variants may look synthetic but still reveal reliance on color, position, or layout.
- Instruction diversity: The pipeline produces direct referring expressions; relational phrasings appear in the released dataset; broader natural-language diversity is future work.
- Web only: Desktop, mobile, and cross-application flows are out of scope.
See the Mind2Web project for data access. Place it under mm_mind2web/ with the structure described in Installation.
We welcome contributions: new perturbation types, bug reports, and improvements. Open an issue or pull request, or reach out on our Discord server.
If you find GUI-Perturbed or this pipeline useful, please cite the dataset and technical report series.
@dataset{gui_perturbed_2026,
title = {GUI-Perturbed: A Domain-Randomized Dataset for GUI Grounding},
author = {Wang, Yangyue and Sikka, Harsh and Mathur, Yash and Zhou, Tony and Nyachhyon, Jinu and Guruprasad, Pranav},
year = {2026},
url = {https://huggingface.co/datasets/figai/GUI-Perturbed},
note = {Built on Mind2Web (Deng et al., 2023)}
}
@software{gui_dr_code_2026,
title = {GUI-DR: GUI Domain-Randomization for generating diagnostic GUI grounding evaluation data},
author = {Wang, Yangyue and Sikka, Harsh and Mathur, Yash and Zhou, Tony and Nyachhyon, Jinu and Guruprasad, Pranav},
year = {2026},
url = {https://github.com/ManifoldRG/GUI-DR},
note = {Data augmentation pipeline for GUI-Perturbed}
}
@online{gui_perturbed_technical_report_2026,
title = {GUI-Perturbed: A Domain Randomization Dataset for GUI Grounding},
author = {Wang, Yangyue and Sikka, Harsh and Mathur, Yash and Zhou, Tony and Nyachhyon, Jinu and Guruprasad, Pranav},
year = {2026},
url = {https://blog.fig.inc/gui-perturbed-a-domain-randomization-dataset-for-gui-grounding},
note = {Part 1: Dataset \& methodology}
}
