Skip to content

refactor: optimize HTML tree building to reuse existing trees#23258

Open
FAMarfuaty wants to merge 16 commits into
feature/misc-performancefrom
209-html-parser---investigate-when-to-build-the-tree
Open

refactor: optimize HTML tree building to reuse existing trees#23258
FAMarfuaty wants to merge 16 commits into
feature/misc-performancefrom
209-html-parser---investigate-when-to-build-the-tree

Conversation

@FAMarfuaty
Copy link
Copy Markdown
Contributor

@FAMarfuaty FAMarfuaty commented May 12, 2026

Context

Every analysis cycle in the editor used to rebuild the HTML tree several times: once in the worker (AnalysisWebWorker.analyze), once in each assessor (Readability, SEO, Inclusive Language) because Assessor.assess always called build() unconditionally, and once per related-keyphrase pass because each related-keyphrase paper was a fresh Paper instance with no tree set. On top of that, every runResearch call from the main thread (word count, reading time, prominent words, …) parsed the HTML again because the worker received a freshly deserialized paper across the postMessage boundary.

For a ~3000-word post with three related keyphrases this meant 6 builds per analyze cycle plus another build per runResearch call — around 2 seconds of redundant parsing per cycle in our harness measurements, scaling roughly linearly with 1 + numRelatedKeyphrases.

The analyze cycle is triggered and rerun every time there is a change in the editor. Hence, this redundant and unnecessary parsing is becoming even more expensive.

This PR teaches the worker to keep one tree alive across calls and reuse it everywhere it's safe to.

Summary

This PR can be summarized in the following changelog entry:

  • Improves the performance of the content analysis by reusing previously built HTML trees across assessor runs, related-keyphrase passes and research calls instead of rebuilding the tree for each.
  • Passes shortcodes to the Insights analysis data for more consistent analysis result across application.
  • [yoastseo] Improves the performance of the content analysis by reusing previously built HTML trees across assessor runs, related-keyphrase passes and research calls instead of rebuilding the tree for each.
  • [yoastseo] Changes Assessor.assess() to skip the HTML tree build when the paper already carries a tree; callers that pass a paper with a pre-set tree are now responsible for ensuring it matches the paper's current content.
  • [yoastseo] Fixes a bug where the keyphrase distribution research left circular parent references on tree sentences, which prevented analysis results from being delivered when the new tree reuse kept those references alive across cycles.
  • [shopify-seo] Improves the performance of the content analysis by reusing previously built HTML trees across assessor runs, related-keyphrase passes and research calls instead of rebuilding the tree for each.
  • [yoast-doc-extension] Improves the performance of the content analysis by reusing previously built HTML trees across assessor runs, related-keyphrase passes and research calls instead of rebuilding the tree for each.

Relevant technical choices:

  1. Assessor.assess becomes a passive consumer of the tree (packages/yoastseo/src/scoring/assessors/assessor.js). It now only builds when paper.getTree() === null. The worker is the single producer of the tree; assessors reuse what the worker already set. This is the smallest possible change to the assessor API — it does not affect callers that pass a paper without a tree (those still get a freshly built one, matching pre-PR behaviour).

  2. assessRelatedKeywords propagates the parent tree onto each related paper (AnalysisWebWorker.js). Related-keyphrase papers differ from the focus paper only in keyword/synonyms, which are not tree-relevant inputs, so the same tree object is safe to share. One tree, N assessor passes.

  3. New Paper.hasSameTreeInputsAs( other ) (packages/yoastseo/src/values/Paper.js). Explicitly enumerates the attributes that do feed the tree builder — text, locale, shortcodes, wpBlocks — and ignores everything else. We deliberately did not reuse Paper.equals here, because equals treats a keyphrase or title change as "different paper" while the tree is still perfectly valid in that case. This is the predicate that lets us reuse the cached tree on the much more common edit pattern of "user adjusted only the keyphrase / title / description without retyping the body".

  4. analyze reuses the cached tree on non-tree attribute changes (AnalysisWebWorker.js). When paperHasChanges is true but this._paper.hasSameTreeInputsAs( paper ) still holds, we propagate the previous tree onto the new paper instead of rebuilding. This is the win when the user changes only metadata.

  5. runResearch reuses the worker's cached tree when content matches (AnalysisWebWorker.js). Each runResearch message from the main thread carries a freshly deserialized paper with no _tree. If this._paper.hasSameTreeInputsAs( paper ), we attach the cached tree; otherwise we fall back to a fresh build(). The fallback path also backfills the cached shortcodes when the caller omitted them, because shortcodes are a site-wide registry that only analyze is told about by the main thread — without the backfill, a runResearch from a different source (e.g. an insights collector) would silently produce a tree that disagreed with analyze on which shortcode markers to filter, and the equality predicate would also miss valid reuse opportunities.

  6. keyphraseDistribution cleans up its tree mutations before returning (packages/yoastseo/src/languageProcessing/researches/keyphraseDistribution.js). The research calls getSentencesFromTree( tree, true ), which sets sentence.sentenceParentNode = parentNode on every sentence. Combined with paragraph.sentences[i] this is a cycle. Pre-PR this was harmless because each call built a fresh tree that went out of scope as soon as the research returned. With the new tree reuse the tree stays alive, the cycle persists, and any later runResearch whose result includes paragraph/sentence references (e.g. getParagraphs) carries the cycle into the payload — where Transporter.serialize walks it via recursive lodash.mapValues and blows the worker's stack with RangeError: Maximum call stack size exceeded. The fix is a single cleanup loop after the research's last consumer of sentenceParentNode (getMarkingsInSentence, called earlier in the same research). Regression tests in keyphraseDistributionSpec.js assert both the absence of sentenceParentNode after the research and that JSON.stringify( paper.getTree() ) no longer throws — the latter is the exact failure mode the live editor exhibited, expressed as an invariant.

  7. collectData now ships shortcodes to the worker so the Insights panel's runResearch calls produce results consistent with the main assessment cycle (packages/js/src/initializers/analysis.js). collectData is the main-thread data source the Insights panel uses for its own analysis (word count, prominent words, reading time, …) — separate from the analyze cycle but expected to surface numbers that agree with what the assessments report. Pre-PR, the Insights runResearch calls omitted shortcodes, so the worker built a tree without the site-wide shortcode filter and could disagree with the analyze tree on which [shortcode] markers count as content (e.g. the Insights word count widget could drift from the Text-length assessment's count). Including shortcodes here closes that consistency gap, and as a side effect lets hasSameTreeInputsAs recognise the Insights paper as compatible with the cached analyze paper — so the worker reuses the cached tree instead of rebuilding.

Test instructions

Test instructions for the acceptance test before the PR gets merged

No Circular dependency error

  1. Enable Yoast SEO Free and Yoast SEO Premium (the original crash only reproduced with Premium because of its extra assessors that exercise the cached-tree path more aggressively).
  2. Open the post editor on a post with at least ~500 words of body content.
  3. Open DevTools → Console.
  4. Set a focus keyphrase and let the analysis run. Confirm no RangeError: Maximum call stack size exceeded in the console.
  5. Add one or more related keyphrases. Confirm the related-keyphrase analysis results render in the sidebar and no console errors appear.
  6. Edit the title and meta description (without touching the body). Confirm the analysis updates without errors and feels responsive even on a long post.
  7. Edit the body. Confirm the analysis re-runs and still produces correct results.
  8. Repeat on Classic editor and Elementor if available.

Functional regression scenarios (essential for this PR)

The tree reuse changes feed many features through the same cached object. Two classes of subtle regression are possible — drifting numbers (different surfaces deriving values from the cached tree at different points) and mis-positioned highlights (the tree's sourceCodeRange / clientId / parentClientId no longer agreeing with the editor's current DOM). Verify both explicitly.

Important

Please also test this scenario in Shopify
When testing in Shopify, you need to link yoastseo package with this change with the app. You can run the following
run yarn build inside yoastseo package
run yalc publish
inside shopify-seo/app, run yalc update yoastseo to link the package

Cross-feature numeric consistency — the same underlying body content surfaces a count in several places; they must still agree after the tree reuse:

  • Text length assessment ↔ Insights word count. Open a post and wait for analysis. The assessment "The text contains X words" should report the same X as the Insights panel's word count widget. Try on short (~150w), medium (~950w), and long (~3000w) posts; add and remove a paragraph and recheck.
  • Keyphrase density count ↔ actual occurrences. Pick a keyphrase that appears 3–4 times in the body; the density assessment's count must match a manual count.

Highlighting integrity — robust position-based highlighting is the primary reason we build an HTML tree in the first place. With the tree now reused across analyze / related-keyphrase / runResearch calls, every eye-icon flow needs a sanity check. Pay extra attention to Keyphrase distribution, because that's the assessment whose research touchedsentenceParentNode (the cycle source the bugfix addressed) and whose marks are produced via the list-item-merge path.

  • Keyphrase distribution (highest priority). Click the eye icon next to "Keyphrase distribution". Verify the highlighted sentences are exactly the ones containing the keyphrase/synonyms. Repeat with the keyphrase appearing in:
    • a plain paragraph
    • a list item (<ul> or <ol> — this is the merge-list-items code path that consumes the sentenceParentNode back-reference; the cleanup in the bugfix must not have broken it)
    • a heading
    • a paragraph that contains both the keyphrase and a synonym
  • Sentence beginnings (readability) — highlights mark the start of the repeated-beginning sentences.
  • Paragraph length — highlight covers the over-long paragraph block exactly (not the surrounding ones).
  • Transition words / Passive voice — highlights land on the right sentences.

Edit-then-highlight — guard against stale-tree positions, the most likely failure mode if the tree reuse predicate is too permissive:

  • Type into the body, then immediately click an eye icon. The highlight should reflect the new content. Repeat after adding a paragraph at the top of the post (this is the worst case for stale offsets).
  • Change only the focus keyphrase (no body edit), then click the Keyphrase distribution eye icon. Highlights should re-target the new keyphrase against the same cached tree. This is the "metadata-only re-analyze" path the dedup is most aggressive on.
  • Change only the title or meta description (no body edit), then click any body-targeted eye icon. Highlights on body-targeted assessments must still land correctly after a metadata-only re-analyze.

Optional: measure the build-count drop yourself

Reviewers who want to see the perf improvement quantitatively can apply the harness patch below into their working tree (it lands in tmp/, which is gitignored). The harness instantiates AnalysisWebWorker directly in Node/Jest, wraps parse/build with a counter, and drives a 3×3 matrix (post length × keyphrase config) plus a representative runResearch sequence — same code path the browser hits, minus the worker boundary. A second harness (transporter-roundtrip.perf.js) round-trips every result through Transporter.serialize/parse to confirm no payload contains a cycle.

# From the repo root:
git checkout 209-html-parser---investigate-when-to-build-the-tree
git apply perf-harness.patch   # patch contents in the <details> block below
cd packages/yoastseo
npx jest --testMatch='**/tmp/tree-build-perf/**/*.perf.js' --collectCoverage=false --testPathIgnorePatterns=
# Results land in: packages/yoastseo/tmp/tree-build-perf/results/<branch>-<sha>-<timestamp>.{md,json}

# To compare against trunk, switch branches and re-run:
git checkout trunk
npx jest --testMatch='**/tmp/tree-build-perf/**/*.perf.js' --collectCoverage=false --testPathIgnorePatterns=

Headline numbers measured on this branch vs trunk (3×3 matrix, single run on the same machine):

Cell analyze#1 builds (changed paper) analyze#2 builds (unchanged paper) runResearch builds (7 calls)
any size × focus only / +1 synonym 3 → 1 0 → 0 7 → 0
any size × focus + 3 related kws 6 → 1 3 → 0 7 → 0

Wall-clock on a 2843-word post (Node/JSDOM, not directly comparable to browser):

  • analyze#1 configC: ~2.0s → ~1.9s
  • analyze#2 configC: ~1.35s → ~1.29s
  • runResearch (7 calls) configC: ~120ms → ~69ms
perf-harness.patch (apply with git apply; creates files under packages/yoastseo/tmp/tree-build-perf/ which is gitignored)
diff --git a/packages/yoastseo/tmp/tree-build-perf/tree-build-perf.perf.js b/packages/yoastseo/tmp/tree-build-perf/tree-build-perf.perf.js
new file mode 100644
index 0000000000..253703c541
--- /dev/null
+++ b/packages/yoastseo/tmp/tree-build-perf/tree-build-perf.perf.js
@@ -0,0 +1,237 @@
+/* eslint-disable */
+/**
+ * Tree-build perf harness.
+ *
+ * Lives under tmp/ (gitignored). Not part of the regular test suite.
+ *
+ * What it does:
+ *   - Drives AnalysisWebWorker.analyze() and AnalysisWebWorker.runResearch()
+ *     against a 3×3 matrix (3 post lengths × 3 keyphrase configurations).
+ *   - Counts how many times the real parse/build/build.js `build()` function
+ *     is invoked per cycle, using a jest.mock wrapper that touches no
+ *     production source.
+ *   - Records wall-clock ms per cycle via performance.now().
+ *
+ * Run with:
+ *   cd packages/yoastseo
+ *   npx jest --testMatch='**\/tmp/tree-build-perf/**\/*.perf.js' --collectCoverage=false --testPathIgnorePatterns=
+ *
+ * Output: a Markdown table written to tmp/tree-build-perf/results/
+ * named `<branch>-<timestamp>.{md,json}`.
+ */
+
+const fs = require( "fs" );
+const path = require( "path" );
+const { execSync } = require( "child_process" );
+
+// jest.mock factory is hoisted; it cannot reference module-scope variables.
+// We expose a counter as an extra named export from the mocked module instead.
+jest.mock( "../../src/parse/build", () => {
+	const actual = jest.requireActual( "../../src/parse/build" );
+	const counter = { count: 0 };
+	return {
+		__esModule: true,
+		...actual,
+		build: ( ...args ) => {
+			counter.count++;
+			return actual.build( ...args );
+		},
+		__buildCounter: counter,
+	};
+} );
+
+// Pull the counter back out. Importing the mocked module yields the wrapper +
+// the counter object we attached.
+const buildModule = require( "../../src/parse/build" );
+const buildCounter = buildModule.__buildCounter;
+
+const AnalysisWebWorker = require( "../../src/worker/AnalysisWebWorker" ).default;
+const Paper = require( "../../src/values/Paper" ).default;
+const EnglishResearcher = require( "../../src/languageProcessing/languages/en/Researcher" ).default;
+
+const { shortPost, mediumPost, longPost, wordCounts } = require( "./fixtures/posts" );
+const { all: keyphraseConfigs } = require( "./fixtures/keyphrases" );
+
+const POSTS = [
+	{ size: "short",  text: shortPost,  words: wordCounts.short  },
+	{ size: "medium", text: mediumPost, words: wordCounts.medium },
+	{ size: "long",   text: longPost,   words: wordCounts.long   },
+];
+
+// runResearch names mirror what the screenshot showed coming from the editor.
+const RESEARCH_NAMES = [
+	"morphology",
+	"readingTime",
+	"getFleschReadingScore",
+	"wordCountInText",
+	"findKeyphraseInFirstParagraph",
+	"getParagraphs",
+	"getProminentWordsForInsights",
+];
+
+function createScope() {
+	return {
+		postMessage: jest.fn(),
+		importScripts: jest.fn(),
+	};
+}
+
+function createMessage( type, payload = {}, id = 0 ) {
+	return { data: { type, id, payload } };
+}
+
+function freshPaper( text, attrs ) {
+	return new Paper( text, { ...attrs, locale: "en_US" } );
+}
+
+function nowMs() {
+	return ( typeof performance !== "undefined" && performance.now )
+		? performance.now()
+		: Date.now();
+}
+
+async function measure( fn ) {
+	const t0 = nowMs();
+	const buildsBefore = buildCounter.count;
+	const result = await fn();
+	const buildsAfter = buildCounter.count;
+	const t1 = nowMs();
+	return {
+		builds: buildsAfter - buildsBefore,
+		ms: +( t1 - t0 ).toFixed( 2 ),
+		result,
+	};
+}
+
+function getCurrentBranch() {
+	try {
+		return execSync( "git rev-parse --abbrev-ref HEAD", {
+			cwd: path.resolve( __dirname, "../../../.." ),
+			encoding: "utf8",
+		} ).trim();
+	} catch ( e ) {
+		return "unknown";
+	}
+}
+
+function getCurrentSha() {
+	try {
+		return execSync( "git rev-parse --short HEAD", {
+			cwd: path.resolve( __dirname, "../../../.." ),
+			encoding: "utf8",
+		} ).trim();
+	} catch ( e ) {
+		return "unknown";
+	}
+}
+
+describe( "tree-build perf harness", () => {
+	jest.setTimeout( 600000 );
+
+	test( "runs the 3x3 matrix and writes results", async() => {
+		const branch = getCurrentBranch();
+		const sha = getCurrentSha();
+		const timestamp = new Date().toISOString().replace( /[:.]/g, "-" );
+		const slug = branch.replace( /[^a-zA-Z0-9._-]/g, "_" );
+
+		const rows = [];
+
+		for ( const post of POSTS ) {
+			for ( const config of keyphraseConfigs ) {
+				// One fresh worker per cell so each cell's "first analyze" is
+				// a clean paperHasChanges=true with no leftover paper state.
+				const scope = createScope();
+				const researcher = new EnglishResearcher();
+				const worker = new AnalysisWebWorker( scope, researcher );
+				worker.register();
+				scope.onmessage( createMessage( "initialize" ) );
+
+				// First analyze: fresh paper, paperHasChanges = true (worker._paper was null).
+				const paper1 = freshPaper( post.text, config.paperAttributes );
+				const firstAnalyze = await measure( () =>
+					worker.analyze( 0, { paper: paper1, relatedKeywords: config.relatedKeywords } ),
+				);
+
+				// Second analyze: a *new* Paper instance with identical content.
+				// Real-plugin flow serializes/deserializes papers across the worker boundary,
+				// so the worker compares by value (Paper.equals) and sees paperHasChanges = false.
+				const paper2 = freshPaper( post.text, config.paperAttributes );
+				const secondAnalyze = await measure( () =>
+					worker.analyze( 1, { paper: paper2, relatedKeywords: config.relatedKeywords } ),
+				);
+
+				// runResearch: each call gets its own fresh Paper (no pre-built tree).
+				// This mirrors what the main thread sends: papers serialized over postMessage
+				// arrive at the worker without a _tree, so each runResearch hits the tree-build path.
+				const researchResults = [];
+				for ( const name of RESEARCH_NAMES ) {
+					const researchPaper = freshPaper( post.text, config.paperAttributes );
+					const r = await measure( () =>
+						Promise.resolve( worker.runResearch( 0, { name, paper: researchPaper } ) ),
+					);
+					researchResults.push( { name, builds: r.builds, ms: r.ms } );
+				}
+				const totalResearchBuilds = researchResults.reduce( ( s, r ) => s + r.builds, 0 );
+				const totalResearchMs = +researchResults.reduce( ( s, r ) => s + r.ms, 0 ).toFixed( 2 );
+
+				rows.push( {
+					post: post.size,
+					words: post.words,
+					config: config.label,
+					analyzeFirstBuilds: firstAnalyze.builds,
+					analyzeFirstMs: firstAnalyze.ms,
+					analyzeSecondBuilds: secondAnalyze.builds,
+					analyzeSecondMs: secondAnalyze.ms,
+					researchTotalBuilds: totalResearchBuilds,
+					researchTotalMs: totalResearchMs,
+					researchPerCall: researchResults,
+				} );
+
+				worker._scheduler.stopPolling();
+			}
+		}
+
+		// Pretty-print Markdown.
+		const header = [
+			"| Post | Words | Config | analyze#1 builds (changed) | analyze#1 ms | analyze#2 builds (unchanged) | analyze#2 ms | runResearch builds (7 calls) | runResearch ms |",
+			"|------|-------|--------|---------------------------:|-------------:|------------------------------:|-------------:|------------------------------:|---------------:|",
+		];
+		const body = rows.map( r =>
+			`| ${ r.post } | ${ r.words } | ${ r.config } | ${ r.analyzeFirstBuilds } | ${ r.analyzeFirstMs } | ${ r.analyzeSecondBuilds } | ${ r.analyzeSecondMs } | ${ r.researchTotalBuilds } | ${ r.researchTotalMs } |`,
+		);
+		const md = [
+			`# Tree-build perf — ${ branch } @ ${ sha }`,
+			"",
+			`Run at: ${ new Date().toISOString() }`,
+			`Word counts: short=${ wordCounts.short }, medium=${ wordCounts.medium }, long=${ wordCounts.long }`,
+			`Research calls per cell: ${ RESEARCH_NAMES.join( ", " ) }`,
+			"",
+			...header,
+			...body,
+			"",
+		].join( "\n" );
+
+		const resultsDir = path.resolve( __dirname, "results" );
+		if ( ! fs.existsSync( resultsDir ) ) {
+			fs.mkdirSync( resultsDir, { recursive: true } );
+		}
+		const baseName = `${ slug }-${ sha }-${ timestamp }`;
+		fs.writeFileSync( path.join( resultsDir, `${ baseName }.md` ), md );
+		fs.writeFileSync(
+			path.join( resultsDir, `${ baseName }.json` ),
+			JSON.stringify( { branch, sha, timestamp, rows }, null, 2 ),
+		);
+
+		// Print to stdout so the run is also self-documenting.
+		// eslint-disable-next-line no-console
+		console.log( "\n" + md );
+		// eslint-disable-next-line no-console
+		console.log( `Saved to ${ baseName }.{md,json}` );
+
+		// Sanity expectation so jest reports failure clearly if the harness wiring breaks.
+		expect( rows ).toHaveLength( POSTS.length * keyphraseConfigs.length );
+		for ( const r of rows ) {
+			expect( typeof r.analyzeFirstBuilds ).toBe( "number" );
+		}
+	} );
+} );
diff --git a/packages/yoastseo/tmp/tree-build-perf/transporter-roundtrip.perf.js b/packages/yoastseo/tmp/tree-build-perf/transporter-roundtrip.perf.js
new file mode 100644
index 0000000000..897e769651
--- /dev/null
+++ b/packages/yoastseo/tmp/tree-build-perf/transporter-roundtrip.perf.js
@@ -0,0 +1,192 @@
+/* eslint-disable */
+/**
+ * Transporter round-trip diagnostic.
+ *
+ * Lives under tmp/ (gitignored). Not part of the regular test suite.
+ *
+ * What it does:
+ *   - For each (post size × keyphrase config), drives analyze + a sequence
+ *     of runResearch calls, then runs Transporter.serialize() on every result.
+ *   - Catches and reports per-cell which call (analyze#1, analyze#2, or any
+ *     of the named researches) throws — including "Maximum call stack size
+ *     exceeded" errors from the recursive mapValues path in serialize.js.
+ *   - Reports the depth of each successful serialization too, so we can spot
+ *     near-overflow cases on smaller posts.
+ *
+ * Run with:
+ *   cd packages/yoastseo
+ *   npx jest --testMatch='**\/tmp/tree-build-perf/transporter-roundtrip.perf.js' --collectCoverage=false --testPathIgnorePatterns=
+ */
+
+const fs = require( "fs" );
+const path = require( "path" );
+const { execSync } = require( "child_process" );
+
+const Transporter = require( "../../src/worker/transporter" ).default;
+const AnalysisWebWorker = require( "../../src/worker/AnalysisWebWorker" ).default;
+const Paper = require( "../../src/values/Paper" ).default;
+const EnglishResearcher = require( "../../src/languageProcessing/languages/en/Researcher" ).default;
+
+const { shortPost, mediumPost, longPost, wordCounts } = require( "./fixtures/posts" );
+const { all: keyphraseConfigs } = require( "./fixtures/keyphrases" );
+
+const POSTS = [
+	{ size: "short",  text: shortPost,  words: wordCounts.short  },
+	{ size: "medium", text: mediumPost, words: wordCounts.medium },
+	{ size: "long",   text: longPost,   words: wordCounts.long   },
+];
+
+const RESEARCH_NAMES = [
+	"morphology",
+	"readingTime",
+	"getFleschReadingScore",
+	"wordCountInText",
+	"findKeyphraseInFirstParagraph",
+	"getParagraphs",
+	"getProminentWordsForInsights",
+];
+
+function createScope() {
+	return { postMessage: () => {}, importScripts: () => {} };
+}
+function createMessage( type, payload = {}, id = 0 ) {
+	return { data: { type, id, payload } };
+}
+function freshPaper( text, attrs ) {
+	return new Paper( text, { ...attrs, locale: "en_US" } );
+}
+
+function tryRoundTrip( value ) {
+	try {
+		const serialized = Transporter.serialize( value );
+		const parsed = Transporter.parse( serialized );
+		return { ok: true, error: null, parsed };
+	} catch ( e ) {
+		return { ok: false, error: String( e.message || e ), parsed: null };
+	}
+}
+
+function getCurrentBranch() {
+	try {
+		return execSync( "git rev-parse --abbrev-ref HEAD", {
+			cwd: path.resolve( __dirname, "../../../.." ),
+			encoding: "utf8",
+		} ).trim();
+	} catch ( e ) {
+		return "unknown";
+	}
+}
+function getCurrentSha() {
+	try {
+		return execSync( "git rev-parse --short HEAD", {
+			cwd: path.resolve( __dirname, "../../../.." ),
+			encoding: "utf8",
+		} ).trim();
+	} catch ( e ) {
+		return "unknown";
+	}
+}
+
+describe( "transporter round-trip diagnostic", () => {
+	jest.setTimeout( 600000 );
+
+	test( "reports which call's payload breaks Transporter.serialize", async() => {
+		const branch = getCurrentBranch();
+		const sha = getCurrentSha();
+		const timestamp = new Date().toISOString().replace( /[:.]/g, "-" );
+		const slug = branch.replace( /[^a-zA-Z0-9._-]/g, "_" );
+
+		const rows = [];
+
+		for ( const post of POSTS ) {
+			for ( const config of keyphraseConfigs ) {
+				const scope = createScope();
+				const researcher = new EnglishResearcher();
+				const worker = new AnalysisWebWorker( scope, researcher );
+				worker.register();
+				scope.onmessage( createMessage( "initialize" ) );
+
+				const cellResults = {
+					post: post.size,
+					words: post.words,
+					config: config.label,
+					calls: [],
+				};
+
+				// --- analyze#1 ---
+				const paper1 = freshPaper( post.text, config.paperAttributes );
+				let analyzeResult1;
+				try {
+					analyzeResult1 = await worker.analyze( 0, { paper: paper1, relatedKeywords: config.relatedKeywords } );
+					const rt = tryRoundTrip( analyzeResult1 );
+					cellResults.calls.push( { name: "analyze#1", ok: rt.ok, error: rt.error } );
+				} catch ( e ) {
+					cellResults.calls.push( { name: "analyze#1", ok: false, error: "analyze threw: " + String( e.message || e ) } );
+				}
+
+				// --- analyze#2 (same content, paperHasChanges = false) ---
+				const paper2 = freshPaper( post.text, config.paperAttributes );
+				try {
+					const analyzeResult2 = await worker.analyze( 1, { paper: paper2, relatedKeywords: config.relatedKeywords } );
+					const rt = tryRoundTrip( analyzeResult2 );
+					cellResults.calls.push( { name: "analyze#2", ok: rt.ok, error: rt.error } );
+				} catch ( e ) {
+					cellResults.calls.push( { name: "analyze#2", ok: false, error: "analyze threw: " + String( e.message || e ) } );
+				}
+
+				// --- runResearch * 7 ---
+				for ( const name of RESEARCH_NAMES ) {
+					const researchPaper = freshPaper( post.text, config.paperAttributes );
+					try {
+						const researchResult = worker.runResearch( 0, { name, paper: researchPaper } );
+						const rt = tryRoundTrip( researchResult );
+						cellResults.calls.push( { name: `runResearch:${ name }`, ok: rt.ok, error: rt.error } );
+					} catch ( e ) {
+						cellResults.calls.push( { name: `runResearch:${ name }`, ok: false, error: "runResearch threw: " + String( e.message || e ) } );
+					}
+				}
+
+				rows.push( cellResults );
+				worker._scheduler.stopPolling();
+			}
+		}
+
+		// Compose a compact failure report.
+		const failures = [];
+		for ( const cell of rows ) {
+			for ( const c of cell.calls ) {
+				if ( ! c.ok ) {
+					failures.push( `  - ${ cell.post } / ${ cell.config } / ${ c.name }: ${ c.error }` );
+				}
+			}
+		}
+
+		const lines = [
+			`# Transporter round-trip — ${ branch } @ ${ sha }`,
+			"",
+			`Run at: ${ new Date().toISOString() }`,
+			`Word counts: short=${ wordCounts.short }, medium=${ wordCounts.medium }, long=${ wordCounts.long }`,
+			"",
+			`Failures: ${ failures.length }`,
+			...( failures.length > 0 ? [ "", ...failures ] : [ "  (none)" ] ),
+		];
+		const md = lines.join( "\n" );
+
+		const resultsDir = path.resolve( __dirname, "results" );
+		if ( ! fs.existsSync( resultsDir ) ) {
+			fs.mkdirSync( resultsDir, { recursive: true } );
+		}
+		const baseName = `roundtrip-${ slug }-${ sha }-${ timestamp }`;
+		fs.writeFileSync( path.join( resultsDir, `${ baseName }.md` ), md );
+		fs.writeFileSync(
+			path.join( resultsDir, `${ baseName }.json` ),
+			JSON.stringify( { branch, sha, timestamp, rows }, null, 2 ),
+		);
+
+		// eslint-disable-next-line no-console
+		console.log( "\n" + md + "\n\nSaved to " + baseName + ".{md,json}" );
+
+		// Always pass — this is a diagnostic, not an assertion of correctness.
+		expect( rows ).toHaveLength( POSTS.length * keyphraseConfigs.length );
+	} );
+} );
diff --git a/packages/yoastseo/tmp/tree-build-perf/fixtures/posts.js b/packages/yoastseo/tmp/tree-build-perf/fixtures/posts.js
new file mode 100644
index 0000000000..5e385a20cc
--- /dev/null
+++ b/packages/yoastseo/tmp/tree-build-perf/fixtures/posts.js
@@ -0,0 +1,125 @@
+/* eslint-disable */
+/**
+ * Fixture HTML posts for the tree-build perf harness.
+ *
+ * Each "section" is a Gutenberg-style block (heading + two paragraphs)
+ * of roughly ~150 words. Section text mentions the focus keyphrase
+ * ("sustainable gardening") and its synonyms ("eco-friendly gardening",
+ * "organic gardening", "green gardening"), plus the related keyphrases
+ * used in configC ("composting at home", "rainwater harvesting",
+ * "native plants").
+ *
+ * Posts of different lengths are produced by concatenating sections.
+ * Lives under tmp/ (gitignored) — not shipped, not committed.
+ */
+
+const SECTION_BODIES = [
+	{
+		heading: "Why sustainable gardening matters",
+		paragraphs: [
+			"Sustainable gardening starts with the soil. By feeding the ground with compost from your own kitchen, you build a living foundation that needs fewer inputs each season. Eco-friendly gardening also means choosing native plants that already belong to your region — they require less water, attract more pollinators, and shrug off local pests with little intervention. Organic gardening, in particular, leans on biodiversity rather than chemistry to keep crops healthy.",
+			"Green gardening is not a single technique, but a mindset. It treats every patch of ground as part of a wider ecosystem, and asks whether each decision — what to plant, how to water, when to harvest — leaves the garden a little richer than it was before. Small choices add up: a shaded bed of indigenous plants, a barrel of rainwater under the gutter, a corner left wild for insects.",
+		],
+	},
+	{
+		heading: "Composting at home",
+		paragraphs: [
+			"Composting at home is the quickest entry point into sustainable gardening. A simple bin in the corner of the yard, fed with vegetable peels, coffee grounds, and dry leaves, turns kitchen waste into the dark, crumbly material your beds crave. Home composting works year-round if you balance greens and browns and keep the pile slightly moist. Kitchen composting in a sealed caddy bridges the gap on rainy days when nobody wants to walk out to the bin.",
+			"Eco-friendly gardening relies on these closed loops. The carrot tops you trimmed in March return as fertility in July, and the lawn clippings you raked in August feed next spring's tomatoes. Organic gardening at this scale never asks you to buy bagged amendments; it asks you to notice what your household already produces and route it back into the soil instead of the bin.",
+		],
+	},
+	{
+		heading: "Rainwater harvesting and native plants",
+		paragraphs: [
+			"Rainwater harvesting reshapes how a green gardening plot meets dry spells. A single barrel under the gutter captures hundreds of liters during a storm, and rainwater collection at scale — connected to a drip system — can carry an entire bed through a hot week. The water is soft, free of chlorine, and at ambient temperature, which makes it gentler on roots than anything from the tap.",
+			"Native plants finish the picture. Indigenous plants evolved alongside local soil microbes, rainfall patterns, and pollinators, so they need a fraction of the care that exotic species demand. Regional flora is also more resistant to local diseases. Pair them with rainwater harvesting and you get a sustainable gardening setup that more or less runs itself once established.",
+		],
+	},
+	{
+		heading: "Tools and timing in eco-friendly gardening",
+		paragraphs: [
+			"You do not need much hardware for sustainable gardening. A sharp spade, a hand fork, a sturdy pair of pruners, and a watering can will carry most jobs. Quality matters more than quantity: a single well-balanced tool lasts decades, while a drawer of cheap ones rusts in a season. Eco-friendly gardening implies durability, because the most sustainable purchase is the one you do not have to repeat.",
+			"Timing is the other lever. Organic gardening rewards observation more than effort: planting just after the last frost, mulching before the summer heat, sowing cover crops as the days shorten. Green gardening calendars are not rigid; they read the season as it actually arrives, not as the catalogue promised. Pay attention to the first cherry blossom and you will know more than the almanac.",
+		],
+	},
+	{
+		heading: "Common mistakes that undo sustainable gardening",
+		paragraphs: [
+			"The most common error in sustainable gardening is overwatering. Beds with native plants and a layer of mulch need far less water than newcomers expect. A daily sprinkle keeps roots near the surface and makes the plot fragile; a deep weekly soak trains roots to chase moisture downward, which is what carries them through droughts. Rainwater harvesting helps here, because the limited supply forces discipline.",
+			"Another quiet mistake is monoculture. Even an eco-friendly gardening setup falters when a single crop dominates a whole bed — pests find it, soil tires, and one bad week can wipe out the season. Organic gardening leans on mixed planting: a tomato next to basil, lettuce under the beans, marigolds at the edges. Green gardening looks messier than a tidy row, and it is far more resilient for it.",
+		],
+	},
+	{
+		heading: "Long-term thinking in green gardening",
+		paragraphs: [
+			"Sustainable gardening is a multi-year project. The first season you amend the soil and learn the light. The second you adjust which native plants thrive in which corner. By the third, the bed mostly tells you what it needs. Eco-friendly gardening is patient by nature; it asks you to invest in fertility, structure, and biodiversity that will only pay off two or three seasons later.",
+			"The reward is a plot that needs less every year, not more. Organic gardening at maturity produces real food with modest effort: tomatoes that ripen on the vine, herbs you cut for dinner, kale that shrugs off frost. Green gardening, at that point, has stopped being a project and become a habit — one walk through the beds in the morning, a basket of vegetables in the evening, and the satisfaction of a garden that quietly works.",
+		],
+	},
+];
+
+/**
+ * Renders one Gutenberg-style section (heading + paragraphs).
+ *
+ * @param {{ heading: string, paragraphs: string[] }} section
+ * @returns {string}
+ */
+function renderSection( section ) {
+	const head = `<!-- wp:heading -->\n<h2>${ section.heading }</h2>\n<!-- /wp:heading -->`;
+	const paras = section.paragraphs
+		.map( p => `<!-- wp:paragraph -->\n<p>${ p }</p>\n<!-- /wp:paragraph -->` )
+		.join( "\n" );
+	return `${ head }\n${ paras }`;
+}
+
+/**
+ * Builds a post by concatenating `count` sections, cycling through the
+ * available base sections. Each repetition is tagged with a small heading
+ * suffix so headings stay unique (real posts rarely repeat headings verbatim).
+ *
+ * @param {number} count
+ * @returns {string}
+ */
+function buildPost( count ) {
+	const parts = [];
+	for ( let i = 0; i < count; i++ ) {
+		const base = SECTION_BODIES[ i % SECTION_BODIES.length ];
+		const cycle = Math.floor( i / SECTION_BODIES.length );
+		const heading = cycle === 0 ? base.heading : `${ base.heading } (part ${ cycle + 1 })`;
+		parts.push( renderSection( { heading, paragraphs: base.paragraphs } ) );
+	}
+	return parts.join( "\n" );
+}
+
+const shortPost = buildPost( 1 );    // ~150 words
+const mediumPost = buildPost( 6 );   // ~900–950 words
+const longPost = buildPost( 20 );    // ~3000 words
+
+/**
+ * Counts words in an HTML string (rough — strips tags and block comments,
+ * splits on whitespace). Good enough for fixture sanity-checking.
+ *
+ * @param {string} html
+ * @returns {number}
+ */
+function countWords( html ) {
+	const stripped = html
+		.replace( /<!--[\s\S]*?-->/g, " " )
+		.replace( /<[^>]+>/g, " " )
+		.trim();
+	if ( ! stripped ) {
+		return 0;
+	}
+	return stripped.split( /\s+/ ).length;
+}
+
+module.exports = {
+	shortPost,
+	mediumPost,
+	longPost,
+	wordCounts: {
+		short: countWords( shortPost ),
+		medium: countWords( mediumPost ),
+		long: countWords( longPost ),
+	},
+};
diff --git a/packages/yoastseo/tmp/tree-build-perf/fixtures/keyphrases.js b/packages/yoastseo/tmp/tree-build-perf/fixtures/keyphrases.js
new file mode 100644
index 0000000000..678b3f1566
--- /dev/null
+++ b/packages/yoastseo/tmp/tree-build-perf/fixtures/keyphrases.js
@@ -0,0 +1,52 @@
+/* eslint-disable */
+/**
+ * Keyphrase configurations for the tree-build perf harness.
+ *
+ *   configA — focus keyphrase only, no synonyms, no related keyphrases.
+ *   configB — focus keyphrase + 1 synonym, no related keyphrases.
+ *   configC — focus keyphrase + 3 synonyms + 3 related keyphrases, each with their own synonyms.
+ *
+ * The shape of `relatedKeywords` mirrors what AnalysisWebWorker.analyze
+ * receives from the editor: an object keyed by id, each value carrying
+ * `keyword` and (optionally) `synonyms`.
+ */
+
+const FOCUS_KEYWORD = "sustainable gardening";
+
+const configA = {
+	label: "A: focus only",
+	paperAttributes: {
+		keyword: FOCUS_KEYWORD,
+		synonyms: "",
+	},
+	relatedKeywords: {},
+};
+
+const configB = {
+	label: "B: focus + 1 synonym",
+	paperAttributes: {
+		keyword: FOCUS_KEYWORD,
+		synonyms: "eco-friendly gardening",
+	},
+	relatedKeywords: {},
+};
+
+const configC = {
+	label: "C: focus + 3 synonyms + 3 related (each with synonyms)",
+	paperAttributes: {
+		keyword: FOCUS_KEYWORD,
+		synonyms: "eco-friendly gardening, organic gardening, green gardening",
+	},
+	relatedKeywords: {
+		a: { keyword: "composting at home", synonyms: "home composting, kitchen composting" },
+		b: { keyword: "rainwater harvesting", synonyms: "rainwater collection" },
+		c: { keyword: "native plants", synonyms: "indigenous plants, regional flora" },
+	},
+};
+
+module.exports = {
+	configA,
+	configB,
+	configC,
+	all: [ configA, configB, configC ],
+};

Relevant test scenarios

  • Changes should be tested with the browser console open
  • Changes should be tested on different posts/pages/taxonomies/custom post types/custom taxonomies
  • Changes should be tested on different editors (Default Block/Gutenberg/Classic/Elementor/other)
  • Changes should be tested on different browsers
  • Changes should be tested on multisite

Test instructions for QA when the code is in the RC

  • QA should use the same steps as above.

QA can test this PR by following these steps:

Impact check

This PR affects the following parts of the plugin, which may require extra testing:

  • Content analysis worker: every analysis cycle, all assessors (focus + related + readability + inclusive language), every runResearch call from the main thread.
  • Tree-consuming researches in particular (keyphraseDistribution, getParagraphs, getSentenceBeginnings, findTransitionWords, etc.) — they now run on a tree that may have been built by a different code path on a previous cycle. Their assumption that the tree is freshly built no longer holds, which is what surfaced the keyphraseDistribution cleanup bug.
  • The assessor.assess() contract: now conditional on paper.getTree() === null. Callers that constructed a paper themselves and called assess directly (outside AnalysisWebWorker) keep working because their paper has no tree, so the new branch falls through to the original build path.

Other environments

  • This PR also affects Shopify. I have added a changelog entry starting with [shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.
  • This PR also affects Yoast SEO for Google Docs. I have added a changelog entry starting with [yoast-doc-extension], added test instructions for Yoast SEO for Google Docs and attached the Google Docs Add-on label to this PR.

Documentation

  • I have written documentation for this change. For example, comments in the Relevant technical choices, comments in the code, documentation on Confluence / shared Google Drive / Yoast developer portal, or other.

Quality assurance

  • I have tested this code to the best of my abilities.
  • During testing, I had activated all plugins that Yoast SEO provides integrations for.
  • I have added unit tests to verify the code works as intended.
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.
  • I have checked that the base branch is correctly set.
  • I have run grunt build:images and committed the results, if my PR introduces or edits images or SVGs.

Innovation

  • No innovation project is applicable for this PR.
  • This PR falls under an innovation project. I have attached the innovation label.
  • I have added my hours to the WBSO document.

Fixes https://github.com/Yoast/lingo-other-tasks/issues/209

@coveralls
Copy link
Copy Markdown

coveralls commented May 12, 2026

Coverage Report for CI Build 0

Coverage increased (+0.06%) to 54.2%

Details

  • Coverage increased (+0.06%) from the base build.
  • Patch coverage: 46 of 46 lines across 4 files are fully covered (100%).
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 68309
Covered Lines: 36798
Line Coverage: 53.87%
Relevant Branches: 17151
Covered Branches: 9521
Branch Coverage: 55.51%
Branches in Coverage %: Yes
Coverage Strength: 44047.91 hits per line

💛 - Coveralls

FAMarfuaty and others added 5 commits May 13, 2026 09:51
Three regression tests guard the cleanup added in 86c0b8e: that tree
sentences carry no sentenceParentNode after the research returns, that the
same holds on the list-item merge path, and that JSON.stringify on the tree
does not throw (the exact failure mode the live editor hit via the worker
transporter once the tree-build dedup kept the cycle alive across calls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@FAMarfuaty FAMarfuaty marked this pull request as ready for review May 19, 2026 10:57
@FAMarfuaty FAMarfuaty changed the base branch from trunk to feature/misc-performance May 19, 2026 10:57
@FAMarfuaty FAMarfuaty added the changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog label May 19, 2026
…-seo into 209-html-parser---investigate-when-to-build-the-tree
@FAMarfuaty FAMarfuaty changed the base branch from feature/misc-performance to trunk May 19, 2026 11:26
@FAMarfuaty FAMarfuaty changed the base branch from trunk to feature/misc-performance May 19, 2026 11:26
@FAMarfuaty FAMarfuaty added Shopify This PR impacts Shopify. Google Docs Add-on If this PR is also relevant or has an impact on the Google Docs Add-on labels May 19, 2026
@FAMarfuaty FAMarfuaty changed the title refactor(assessor): optimize HTML tree building to reuse existing trees refactor: optimize HTML tree building to reuse existing trees May 19, 2026
@FAMarfuaty FAMarfuaty added this to the feature/misc-performance milestone May 20, 2026
…-seo into 209-html-parser---investigate-when-to-build-the-tree
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the yoastseo analysis worker to reduce redundant HTML parsing by caching and reusing the built HTML tree across analyze cycles, assessor runs, related-keyphrase passes, and worker runResearch calls, while adding safeguards for tree-mutation side effects.

Changes:

  • Reuse previously built HTML trees in AnalysisWebWorker.analyze, related-keyphrase assessment, and runResearch based on a new Paper.hasSameTreeInputsAs predicate.
  • Make Assessor.assess conditional on paper.getTree() === null so assessors reuse prebuilt trees when available.
  • Fix keyphraseDistribution to clean up sentenceParentNode back-references to prevent circular structures when trees are reused, and add regression coverage.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/yoastseo/src/worker/AnalysisWebWorker.js Caches/reuses trees across analyze, related-keyphrase assessment, and runResearch (incl. shortcode backfill).
packages/yoastseo/src/values/Paper.js Adds hasSameTreeInputsAs to determine when tree reuse is valid.
packages/yoastseo/src/scoring/assessors/assessor.js Avoids rebuilding the tree when one is already present on the paper.
packages/yoastseo/src/languageProcessing/researches/keyphraseDistribution.js Extracts sentence retrieval and removes sentenceParentNode back-references to prevent cycles on reused trees.
packages/yoastseo/spec/worker/AnalysisWebWorkerSpec.js Adds tests asserting tree reuse behavior for analyze/runResearch/related keywords.
packages/yoastseo/spec/values/PaperSpec.js Adds unit tests for hasSameTreeInputsAs.
packages/yoastseo/spec/scoring/assessors/assessorSpec.js Adds tests ensuring assessors don’t rebuild trees unnecessarily.
packages/yoastseo/spec/languageProcessing/researches/keyphraseDistributionSpec.js Adds regression tests ensuring no sentenceParentNode remains and tree is JSON-stringifiable.
packages/js/src/initializers/analysis.js Sends shortcode tags as part of collected analysis data.

Comment thread packages/yoastseo/src/languageProcessing/researches/keyphraseDistribution.js Outdated
Comment on lines +1277 to +1286
if ( ( ! callerShortcodes || callerShortcodes.length === 0 ) && cachedShortcodes && cachedShortcodes.length > 0 ) {
paper._attributes.shortcodes = cachedShortcodes;
}

if ( this._paper !== null && this._paper.getTree() !== null && this._paper.hasSameTreeInputsAs( paper ) ) {
// The incoming paper has the same tree inputs as the worker's cached paper; reuse that tree instead of rebuilding.
paper.setTree( this._paper.getTree() );
} else {
const languageProcessor = new LanguageProcessor( researcher );
paper.setTree( build( paper, languageProcessor, paper._attributes.shortcodes ) );
Copy link
Copy Markdown
Contributor

@vraja-pro vraja-pro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments on the tree-reuse implementation. See individual inline comments for specifics.

Comment thread packages/yoastseo/src/scoring/assessors/assessor.js Outdated
Comment thread packages/yoastseo/src/worker/AnalysisWebWorker.js
Comment thread packages/yoastseo/src/worker/AnalysisWebWorker.js Outdated
Comment thread packages/yoastseo/src/values/Paper.js
@@ -101,6 +101,9 @@ export function collectData() {
...data,
textTitle: getEditorDataTitle(),
isFrontPage: getIsFrontPage(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change lands in packages/js/ but there's no corresponding test. A couple of things worth verifying:

  1. Is wpseo_shortcode_tags guaranteed to be an array in all contexts (Gutenberg, Classic, Elementor)? If it can be null or a plain object, the fallback to [] is fine but the truthy check on the outer shortcodes object could silently produce undefined for the value.
  2. This means shortcodes are now included in the paper data sent to the worker for analyze calls, which feeds hasSameTreeInputsAs. Was omitting them before a latent inconsistency (i.e. the worker was building trees without the correct shortcode filter)? If so, worth noting in the PR as a secondary bugfix.

Copy link
Copy Markdown
Contributor Author

@FAMarfuaty FAMarfuaty May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these — all three addressed:

  1. Test gap → added packages/js/tests/initializers/analysis.test.js with 5 tests covering: array present, plugins.shortcodes missing, and wpseo_shortcode_tags missing / null / non-array.
  2. Array safety of wpseo_shortcode_tags → good catch. The truthy check on the outer shortcodes object silently yielded undefined when the inner key was missing or non-array. Replaced with an Array.isArray(...) guard so any non-array value (missing, null, plain object) collapses to [].
  3. Latent inconsistency / secondary bugfix → confirmed. Before this PR, collectData() omitted shortcodes, so the Insights runResearch calls reached the worker with paper._attributes.shortcodes === undefined. The fallback build() then ran with no shortcode filter, meaning markers like [gallery] weren't stripped from the Insights view of the content — so Insights word count, prominent words, etc. could disagree with what the analyze tree saw for the same post. The fix here closes that gap; written up as Relevant technical choice item 7 in the PR description so it's visible alongside the perf changes.

Copy link
Copy Markdown
Contributor

@vraja-pro vraja-pro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Assessor API contract change is a breaking change for external yoastseo consumers

assessor.assess() previously always rebuilt the tree. After this PR, it only builds if paper.getTree() === null. Any third-party code that calls assess() with a stale tree set on the paper will silently get wrong results. The changelog entry says "reusing previously built HTML trees" but doesn't signal that the assessor's build-on-every-call contract is gone.

Consider either: (a) adding a deprecated-style note to the JSDoc on assess() flagging the new precondition, or (b) calling this out explicitly in the yoastseo changelog entry so downstream consumers know to audit their usage.

  1. Shared tree object across related papers — other tree-mutating researches

assessRelatedKeywords now does:

  relatedPaper.setTree( paper.getTree() );

keyphraseDistribution is now cleaned up. But are there other researches that mutate tree nodes? The research that creates sourceCodeRange, clientId, or any sentence-level properties would now mutate the shared object. Since this is a worker (single-threaded), there's no race condition, but if any other research writes to sentence/paragraph nodes without cleaning up, a future regression could be hard to trace. A short comment calling out "this tree is shared — mutations must be idempotent or cleaned up" would help maintainers.

  1. paper._attributes.shortcodes direct mutation in runResearch

paper._attributes.shortcodes = cachedShortcodes;

This mutates the paper passed in by the caller. Callers that hold a reference and inspect the paper afterward will see unexpected shortcodes. This is the same pattern used elsewhere in the codebase, but it's worth documenting with a comment explaining why you're mutating the incoming paper rather than creating a local override.

  1. hasSameTreeInputsAs — undefined vs [] shortcodes edge case

isEqual(undefined, []) returns false. If one paper is constructed without shortcodes (leaving _attributes.shortcodes as undefined) and another has [], they'll compare as unequal despite producing identical trees. The backfill in runResearch mitigates this for that code path, but the analyze path isn't backfilled. Is shortcodes always initialized to [] in the Paper constructor? If not, it's worth normalizing to [] in hasSameTreeInputsAs before comparing.

  1. analysis.js — shortcodes added to collectData() without a test

shortcodes: window.wpseoScriptData.analysis.plugins.shortcodes
? window.wpseoScriptData.analysis.plugins.shortcodes.wpseo_shortcode_tags
: [],

This is needed for hasSameTreeInputsAs to work in the analyze path. But it's in packages/js/, and there's no corresponding test for this change. Is wpseo_shortcode_tags guaranteed to be an array (not null or an object keyed differently)? Also — if this data was already available but just not included in the paper, did omitting it previously cause any silent mismatch between what the worker thought the shortcodes were versus what they actually were? This seems like a pre-existing issue being addressed correctly, but a brief note in the PR would clarify.


Minor notes

  • The refactoring of keyphraseDistributionResearcher from function to arrow function (=>) is unrelated to the feature and adds a small diff noise. Not a blocker, just worth noting.
  • The retrieveSentences helper extracted in keyphraseDistribution.js is clean and improves readability.
  • Test quality is excellent — the sentinel tree pattern ({ __sentinel: true }) and the JSON.stringify smoke test are the right tools for this.
  • The perf harness in the PR description is a nice touch for future benchmarking, even if it's not committed.

Not blocking, but worth a follow-up

The analyze#1 wall-clock improvement (~2.0s → ~1.9s) is modest. Most of the gains are in analyze#2 (unchanged paper) and runResearch. That's expected given the paper equality check was already gating re-analysis — the real win is probably more visible in Premium with its additional assessors (as the test instructions note). Would be interesting to see a browser-based measurement after this lands.


Overall: Approve with the above addressed or acknowledged. The circular-reference bugfix alone is worth landing; the performance improvements are solid. The main ask is ensuring the assessor.assess() API change is clearly communicated in the yoastseo changelog.

FAMarfuaty and others added 6 commits May 21, 2026 13:48
…-seo into 209-html-parser---investigate-when-to-build-the-tree
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…f github.com:Yoast/wordpress-seo into 209-html-parser---investigate-when-to-build-the-tree
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog Google Docs Add-on If this PR is also relevant or has an impact on the Google Docs Add-on Shopify This PR impacts Shopify.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants