refactor: optimize HTML tree building to reuse existing trees#23258
refactor: optimize HTML tree building to reuse existing trees#23258FAMarfuaty wants to merge 16 commits into
Conversation
Coverage Report for CI Build 0Coverage increased (+0.06%) to 54.2%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsNo coverage regressions found. Coverage Stats💛 - Coveralls |
…parser---investigate-when-to-build-the-tree
Three regression tests guard the cleanup added in 86c0b8e: that tree sentences carry no sentenceParentNode after the research returns, that the same holds on the list-item merge path, and that JSON.stringify on the tree does not throw (the exact failure mode the live editor hit via the worker transporter once the tree-build dedup kept the cycle alive across calls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-seo into 209-html-parser---investigate-when-to-build-the-tree
…-seo into 209-html-parser---investigate-when-to-build-the-tree
There was a problem hiding this comment.
Pull request overview
Refactors the yoastseo analysis worker to reduce redundant HTML parsing by caching and reusing the built HTML tree across analyze cycles, assessor runs, related-keyphrase passes, and worker runResearch calls, while adding safeguards for tree-mutation side effects.
Changes:
- Reuse previously built HTML trees in
AnalysisWebWorker.analyze, related-keyphrase assessment, andrunResearchbased on a newPaper.hasSameTreeInputsAspredicate. - Make
Assessor.assessconditional onpaper.getTree() === nullso assessors reuse prebuilt trees when available. - Fix
keyphraseDistributionto clean upsentenceParentNodeback-references to prevent circular structures when trees are reused, and add regression coverage.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/yoastseo/src/worker/AnalysisWebWorker.js | Caches/reuses trees across analyze, related-keyphrase assessment, and runResearch (incl. shortcode backfill). |
| packages/yoastseo/src/values/Paper.js | Adds hasSameTreeInputsAs to determine when tree reuse is valid. |
| packages/yoastseo/src/scoring/assessors/assessor.js | Avoids rebuilding the tree when one is already present on the paper. |
| packages/yoastseo/src/languageProcessing/researches/keyphraseDistribution.js | Extracts sentence retrieval and removes sentenceParentNode back-references to prevent cycles on reused trees. |
| packages/yoastseo/spec/worker/AnalysisWebWorkerSpec.js | Adds tests asserting tree reuse behavior for analyze/runResearch/related keywords. |
| packages/yoastseo/spec/values/PaperSpec.js | Adds unit tests for hasSameTreeInputsAs. |
| packages/yoastseo/spec/scoring/assessors/assessorSpec.js | Adds tests ensuring assessors don’t rebuild trees unnecessarily. |
| packages/yoastseo/spec/languageProcessing/researches/keyphraseDistributionSpec.js | Adds regression tests ensuring no sentenceParentNode remains and tree is JSON-stringifiable. |
| packages/js/src/initializers/analysis.js | Sends shortcode tags as part of collected analysis data. |
| if ( ( ! callerShortcodes || callerShortcodes.length === 0 ) && cachedShortcodes && cachedShortcodes.length > 0 ) { | ||
| paper._attributes.shortcodes = cachedShortcodes; | ||
| } | ||
|
|
||
| if ( this._paper !== null && this._paper.getTree() !== null && this._paper.hasSameTreeInputsAs( paper ) ) { | ||
| // The incoming paper has the same tree inputs as the worker's cached paper; reuse that tree instead of rebuilding. | ||
| paper.setTree( this._paper.getTree() ); | ||
| } else { | ||
| const languageProcessor = new LanguageProcessor( researcher ); | ||
| paper.setTree( build( paper, languageProcessor, paper._attributes.shortcodes ) ); |
vraja-pro
left a comment
There was a problem hiding this comment.
Review comments on the tree-reuse implementation. See individual inline comments for specifics.
| @@ -101,6 +101,9 @@ export function collectData() { | |||
| ...data, | |||
| textTitle: getEditorDataTitle(), | |||
| isFrontPage: getIsFrontPage(), | |||
There was a problem hiding this comment.
This change lands in packages/js/ but there's no corresponding test. A couple of things worth verifying:
- Is
wpseo_shortcode_tagsguaranteed to be an array in all contexts (Gutenberg, Classic, Elementor)? If it can benullor a plain object, the fallback to[]is fine but the truthy check on the outershortcodesobject could silently produceundefinedfor the value. - This means shortcodes are now included in the paper data sent to the worker for
analyzecalls, which feedshasSameTreeInputsAs. Was omitting them before a latent inconsistency (i.e. the worker was building trees without the correct shortcode filter)? If so, worth noting in the PR as a secondary bugfix.
There was a problem hiding this comment.
Thanks for these — all three addressed:
- Test gap → added
packages/js/tests/initializers/analysis.test.jswith 5 tests covering: array present,plugins.shortcodesmissing, andwpseo_shortcode_tagsmissing / null / non-array. - Array safety of
wpseo_shortcode_tags→ good catch. The truthy check on the outer shortcodes object silently yielded undefined when the inner key was missing or non-array. Replaced with anArray.isArray(...)guard so any non-array value (missing, null, plain object) collapses to[]. - Latent inconsistency / secondary bugfix → confirmed. Before this PR,
collectData()omitted shortcodes, so the InsightsrunResearchcalls reached the worker withpaper._attributes.shortcodes=== undefined. The fallbackbuild()then ran with no shortcode filter, meaning markers like[gallery]weren't stripped from the Insights view of the content — so Insights word count, prominent words, etc. could disagree with what the analyze tree saw for the same post. The fix here closes that gap; written up as Relevant technical choice item 7 in the PR description so it's visible alongside the perf changes.
vraja-pro
left a comment
There was a problem hiding this comment.
- Assessor API contract change is a breaking change for external
yoastseoconsumers
assessor.assess() previously always rebuilt the tree. After this PR, it only builds if paper.getTree() === null. Any third-party code that calls assess() with a stale tree set on the paper will silently get wrong results. The changelog entry says "reusing previously built HTML trees" but doesn't signal that the assessor's build-on-every-call contract is gone.
Consider either: (a) adding a deprecated-style note to the JSDoc on assess() flagging the new precondition, or (b) calling this out explicitly in the yoastseo changelog entry so downstream consumers know to audit their usage.
- Shared tree object across related papers — other tree-mutating researches
assessRelatedKeywords now does:
relatedPaper.setTree( paper.getTree() );
keyphraseDistribution is now cleaned up. But are there other researches that mutate tree nodes? The research that creates sourceCodeRange, clientId, or any sentence-level properties would now mutate the shared object. Since this is a worker (single-threaded), there's no race condition, but if any other research writes to sentence/paragraph nodes without cleaning up, a future regression could be hard to trace. A short comment calling out "this tree is shared — mutations must be idempotent or cleaned up" would help maintainers.
- paper._attributes.shortcodes direct mutation in runResearch
paper._attributes.shortcodes = cachedShortcodes;
This mutates the paper passed in by the caller. Callers that hold a reference and inspect the paper afterward will see unexpected shortcodes. This is the same pattern used elsewhere in the codebase, but it's worth documenting with a comment explaining why you're mutating the incoming paper rather than creating a local override.
- hasSameTreeInputsAs — undefined vs [] shortcodes edge case
isEqual(undefined, []) returns false. If one paper is constructed without shortcodes (leaving _attributes.shortcodes as undefined) and another has [], they'll compare as unequal despite producing identical trees. The backfill in runResearch mitigates this for that code path, but the analyze path isn't backfilled. Is shortcodes always initialized to [] in the Paper constructor? If not, it's worth normalizing to [] in hasSameTreeInputsAs before comparing.
- analysis.js — shortcodes added to collectData() without a test
shortcodes: window.wpseoScriptData.analysis.plugins.shortcodes
? window.wpseoScriptData.analysis.plugins.shortcodes.wpseo_shortcode_tags
: [],
This is needed for hasSameTreeInputsAs to work in the analyze path. But it's in packages/js/, and there's no corresponding test for this change. Is wpseo_shortcode_tags guaranteed to be an array (not null or an object keyed differently)? Also — if this data was already available but just not included in the paper, did omitting it previously cause any silent mismatch between what the worker thought the shortcodes were versus what they actually were? This seems like a pre-existing issue being addressed correctly, but a brief note in the PR would clarify.
Minor notes
- The refactoring of keyphraseDistributionResearcher from function to arrow function (=>) is unrelated to the feature and adds a small diff noise. Not a blocker, just worth noting.
- The retrieveSentences helper extracted in keyphraseDistribution.js is clean and improves readability.
- Test quality is excellent — the sentinel tree pattern ({ __sentinel: true }) and the JSON.stringify smoke test are the right tools for this.
- The perf harness in the PR description is a nice touch for future benchmarking, even if it's not committed.
Not blocking, but worth a follow-up
The analyze#1 wall-clock improvement (~2.0s → ~1.9s) is modest. Most of the gains are in analyze#2 (unchanged paper) and runResearch. That's expected given the paper equality check was already gating re-analysis — the real win is probably more visible in Premium with its additional assessors (as the test instructions note). Would be interesting to see a browser-based measurement after this lands.
Overall: Approve with the above addressed or acknowledged. The circular-reference bugfix alone is worth landing; the performance improvements are solid. The main ask is ensuring the assessor.assess() API change is clearly communicated in the yoastseo changelog.
…-seo into 209-html-parser---investigate-when-to-build-the-tree
…valent to empty arrays
…f github.com:Yoast/wordpress-seo into 209-html-parser---investigate-when-to-build-the-tree
Context
Every analysis cycle in the editor used to rebuild the HTML tree several times: once in the worker (
AnalysisWebWorker.analyze), once in each assessor (Readability, SEO, Inclusive Language) becauseAssessor.assessalways calledbuild()unconditionally, and once per related-keyphrase pass because each related-keyphrase paper was a freshPaperinstance with no tree set. On top of that, everyrunResearchcall from the main thread (word count, reading time, prominent words, …) parsed the HTML again because the worker received a freshly deserialized paper across thepostMessageboundary.For a ~3000-word post with three related keyphrases this meant 6 builds per
analyzecycle plus another build perrunResearchcall — around 2 seconds of redundant parsing per cycle in our harness measurements, scaling roughly linearly with1 + numRelatedKeyphrases.The
analyzecycle is triggered and rerun every time there is a change in the editor. Hence, this redundant and unnecessary parsing is becoming even more expensive.This PR teaches the worker to keep one tree alive across calls and reuse it everywhere it's safe to.
Summary
This PR can be summarized in the following changelog entry:
Assessor.assess()to skip the HTML tree build when the paper already carries a tree; callers that pass a paper with a pre-set tree are now responsible for ensuring it matches the paper's current content.Relevant technical choices:
Assessor.assessbecomes a passive consumer of the tree (packages/yoastseo/src/scoring/assessors/assessor.js). It now only builds whenpaper.getTree() === null. The worker is the single producer of the tree; assessors reuse what the worker already set. This is the smallest possible change to the assessor API — it does not affect callers that pass a paper without a tree (those still get a freshly built one, matching pre-PR behaviour).assessRelatedKeywordspropagates the parent tree onto each related paper (AnalysisWebWorker.js). Related-keyphrase papers differ from the focus paper only inkeyword/synonyms, which are not tree-relevant inputs, so the same tree object is safe to share. One tree, N assessor passes.New
Paper.hasSameTreeInputsAs( other )(packages/yoastseo/src/values/Paper.js). Explicitly enumerates the attributes that do feed the tree builder —text,locale,shortcodes,wpBlocks— and ignores everything else. We deliberately did not reusePaper.equalshere, becauseequalstreats a keyphrase or title change as "different paper" while the tree is still perfectly valid in that case. This is the predicate that lets us reuse the cached tree on the much more common edit pattern of "user adjusted only the keyphrase / title / description without retyping the body".analyzereuses the cached tree on non-tree attribute changes (AnalysisWebWorker.js). WhenpaperHasChangesis true butthis._paper.hasSameTreeInputsAs( paper )still holds, we propagate the previous tree onto the new paper instead of rebuilding. This is the win when the user changes only metadata.runResearchreuses the worker's cached tree when content matches (AnalysisWebWorker.js). EachrunResearchmessage from the main thread carries a freshly deserialized paper with no_tree. Ifthis._paper.hasSameTreeInputsAs( paper ), we attach the cached tree; otherwise we fall back to a freshbuild(). The fallback path also backfills the cached shortcodes when the caller omitted them, because shortcodes are a site-wide registry that onlyanalyzeis told about by the main thread — without the backfill, arunResearchfrom a different source (e.g. an insights collector) would silently produce a tree that disagreed withanalyzeon which shortcode markers to filter, and the equality predicate would also miss valid reuse opportunities.keyphraseDistributioncleans up its tree mutations before returning (packages/yoastseo/src/languageProcessing/researches/keyphraseDistribution.js). The research callsgetSentencesFromTree( tree, true ), which setssentence.sentenceParentNode = parentNodeon every sentence. Combined withparagraph.sentences[i]this is a cycle. Pre-PR this was harmless because each call built a fresh tree that went out of scope as soon as the research returned. With the new tree reuse the tree stays alive, the cycle persists, and any laterrunResearchwhose result includes paragraph/sentence references (e.g.getParagraphs) carries the cycle into the payload — whereTransporter.serializewalks it via recursivelodash.mapValuesand blows the worker's stack withRangeError: Maximum call stack size exceeded. The fix is a single cleanup loop after the research's last consumer ofsentenceParentNode(getMarkingsInSentence, called earlier in the same research). Regression tests inkeyphraseDistributionSpec.jsassert both the absence ofsentenceParentNodeafter the research and thatJSON.stringify( paper.getTree() )no longer throws — the latter is the exact failure mode the live editor exhibited, expressed as an invariant.collectDatanow ships shortcodes to the worker so the Insights panel'srunResearchcalls produce results consistent with the main assessment cycle (packages/js/src/initializers/analysis.js).collectDatais the main-thread data source the Insights panel uses for its own analysis (word count, prominent words, reading time, …) — separate from the analyze cycle but expected to surface numbers that agree with what the assessments report. Pre-PR, the InsightsrunResearchcalls omittedshortcodes, so the worker built a tree without the site-wide shortcode filter and could disagree with the analyze tree on which[shortcode]markers count as content (e.g. the Insights word count widget could drift from the Text-length assessment's count). Includingshortcodeshere closes that consistency gap, and as a side effect letshasSameTreeInputsAsrecognise the Insights paper as compatible with the cached analyze paper — so the worker reuses the cached tree instead of rebuilding.Test instructions
Test instructions for the acceptance test before the PR gets merged
No Circular dependency error
RangeError: Maximum call stack size exceededin the console.Functional regression scenarios (essential for this PR)
The tree reuse changes feed many features through the same cached object. Two classes of subtle regression are possible — drifting numbers (different surfaces deriving values from the cached tree at different points) and mis-positioned highlights (the tree's
sourceCodeRange/clientId/parentClientIdno longer agreeing with the editor's current DOM). Verify both explicitly.Important
Please also test this scenario in Shopify
When testing in Shopify, you need to link
yoastseopackage with this change with the app. You can run the followingrun
yarn buildinsideyoastseopackagerun
yalc publishinside
shopify-seo/app, runyalc update yoastseoto link the packageCross-feature numeric consistency — the same underlying body content surfaces a count in several places; they must still agree after the tree reuse:
Highlighting integrity — robust position-based highlighting is the primary reason we build an HTML tree in the first place. With the tree now reused across analyze / related-keyphrase / runResearch calls, every eye-icon flow needs a sanity check. Pay extra attention to Keyphrase distribution, because that's the assessment whose research touched
sentenceParentNode(the cycle source the bugfix addressed) and whose marks are produced via the list-item-merge path.<ul>or<ol>— this is the merge-list-items code path that consumes thesentenceParentNodeback-reference; the cleanup in the bugfix must not have broken it)Edit-then-highlight — guard against stale-tree positions, the most likely failure mode if the tree reuse predicate is too permissive:
Optional: measure the build-count drop yourself
Reviewers who want to see the perf improvement quantitatively can apply the harness patch below into their working tree (it lands in
tmp/, which is gitignored). The harness instantiatesAnalysisWebWorkerdirectly in Node/Jest, wrapsparse/buildwith a counter, and drives a 3×3 matrix (post length × keyphrase config) plus a representativerunResearchsequence — same code path the browser hits, minus the worker boundary. A second harness (transporter-roundtrip.perf.js) round-trips every result throughTransporter.serialize/parseto confirm no payload contains a cycle.Headline numbers measured on this branch vs trunk (3×3 matrix, single run on the same machine):
Wall-clock on a 2843-word post (Node/JSDOM, not directly comparable to browser):
analyze#1configC: ~2.0s → ~1.9sanalyze#2configC: ~1.35s → ~1.29srunResearch(7 calls) configC: ~120ms → ~69msperf-harness.patch (apply with
git apply; creates files underpackages/yoastseo/tmp/tree-build-perf/which is gitignored)Relevant test scenarios
Test instructions for QA when the code is in the RC
QA can test this PR by following these steps:
Impact check
This PR affects the following parts of the plugin, which may require extra testing:
runResearchcall from the main thread.keyphraseDistribution,getParagraphs,getSentenceBeginnings,findTransitionWords, etc.) — they now run on a tree that may have been built by a different code path on a previous cycle. Their assumption that the tree is freshly built no longer holds, which is what surfaced thekeyphraseDistributioncleanup bug.assessor.assess()contract: now conditional onpaper.getTree() === null. Callers that constructed a paper themselves and called assess directly (outsideAnalysisWebWorker) keep working because their paper has no tree, so the new branch falls through to the original build path.Other environments
[shopify-seo], added test instructions for Shopify and attached theShopifylabel to this PR.[yoast-doc-extension], added test instructions for Yoast SEO for Google Docs and attached theGoogle Docs Add-onlabel to this PR.Documentation
Quality assurance
grunt build:imagesand committed the results, if my PR introduces or edits images or SVGs.Innovation
innovationlabel.Fixes https://github.com/Yoast/lingo-other-tasks/issues/209