Fix greedy regex in escape_html_characters that eats text between comments by Chessing234 · Pull Request #38388 · openedx/openedx-platform

Chessing234 · 2026-04-20T11:19:49Z

Bug

`escape_html_characters` in `xmodule/util/misc.py` is used to strip HTML
noise before ElasticSearch indexing. Its comment and CDATA strippers
misbehave in two ways:

Greedy: `r"<!--.*-->"` matches from the first `<!--` to the last `-->` in the input. Indexable text between two separate
comments is deleted along with the comments:
```
"<!-- ... --> keep this text <!-- ... -->"
    → ""   (everything between the first <!-- and the last --> is gone)
```
Single-line only: `.` does not match newlines, so any multi-line
comment / CDATA is not stripped at all:
```
"<!--\n...\n-->" → unchanged
```
and those spans end up in the search index.
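
A minimal reproduction of both failure modes, using plain `re.sub` rather than the real helper. The comment pattern `r"<!--.*-->"` and the sample strings are assumptions made for illustration; they are not copied from `xmodule/util/misc.py`.

```python
import re

# Assumed pre-fix comment pattern: greedy .* and no re.DOTALL.
COMMENT_GREEDY = r"<!--.*-->"

# Illustrative inputs (hypothetical, not taken from the test suite).
two_comments = "<!-- a --> keep this text <!-- b -->"
multi_line = "<!--\nhidden\n-->visible"

# Greedy .* runs from the first <!-- to the last -->, so the text
# between the two comments is removed together with them.
greedy_result = re.sub(COMMENT_GREEDY, "", two_comments)

# Without re.DOTALL, . cannot cross a newline, so the pattern never
# matches and the multi-line comment is left in place.
single_line_result = re.sub(COMMENT_GREEDY, "", multi_line)

print(repr(greedy_result))       # ''
print(repr(single_line_result))  # '<!--\nhidden\n-->visible'
```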

Root cause

Both patterns use greedy `.*` and neither is compiled with
`re.DOTALL`. The helper was written assuming single-line, single-comment
input.

Why the fix is correct

- Switching to non-greedy `.*?` makes each match inside `re.sub` cover exactly one
  comment (or one CDATA span) and stop at its first closing delimiter, so text
  between separate comments survives.
- Passing `flags=re.DOTALL` lets `.` also span newlines, so multi-line
  comments and CDATA blocks are stripped too.
- No other behaviour changes: the nested whitespace-normalisation
  `re.sub` is untouched, and the function's public signature is the
  same.
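
The two points above can be checked in isolation with a short sketch. The patterns below are what the description implies (`.*?` plus `re.DOTALL`); the comment pattern and all sample strings are assumptions for illustration, not code from the PR diff.

```python
import re

# Assumed post-fix patterns: non-greedy, with re.DOTALL passed to re.sub.
COMMENT = r"<!--.*?-->"
CDATA = r"<!\[CDATA\[.*?\]\]>"

# Each match now stops at the first closing delimiter, so the text
# between two comments is kept.
fixed_two = re.sub(COMMENT, "", "<!-- a --> keep this text <!-- b -->", flags=re.DOTALL)

# re.DOTALL lets . cross newlines, so a multi-line comment is removed.
fixed_multi = re.sub(COMMENT, "", "<!--\nhidden\n-->visible", flags=re.DOTALL)

# The same two changes apply to CDATA sections.
fixed_cdata = re.sub(CDATA, "", "<![CDATA[x\ny]]> keep <![CDATA[z]]>", flags=re.DOTALL)

print(repr(fixed_two))    # ' keep this text '
print(repr(fixed_multi))  # 'visible'
print(repr(fixed_cdata))  # ' keep '
```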

Change

`xmodule/util/misc.py`: make the two outer `re.sub` calls non-greedy
(`.*` → `.*?`) and add `flags=re.DOTALL` so multi-line comments and
CDATA are actually stripped.
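
A reviewer might want regression cases along these lines. This is only a sketch: `strip_comments_and_cdata` is a hypothetical stand-in for the two patched calls, the order of the two substitutions is an assumption, and the whitespace normalisation and HTML stripping done by the real helper are deliberately left out.

```python
import re


def strip_comments_and_cdata(text: str) -> str:
    """Hypothetical stand-in for the two patched re.sub calls."""
    # Assumed order: CDATA sections first, then HTML comments.
    without_cdata = re.sub(r"<!\[CDATA\[.*?\]\]>", "", text, flags=re.DOTALL)
    return re.sub(r"<!--.*?-->", "", without_cdata, flags=re.DOTALL)


# Illustrative regression cases: input -> expected output.
REGRESSION_CASES = {
    "<!-- a --> keep this text <!-- b -->": " keep this text ",
    "<!--\nmulti\nline\n-->": "",
    "<![CDATA[\nraw\n]]>after": "after",
    "no markup at all": "no markup at all",
}

results = {raw: strip_comments_and_cdata(raw) for raw in REGRESSION_CASES}

for raw, expected in REGRESSION_CASES.items():
    assert results[raw] == expected, (raw, results[raw])
```

Keeping the cases in a plain dict makes it easy to port them to whatever parametrised test style the repository already uses.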

escape_html_characters uses re.sub with r"<!--.*-->" and r"<!\[CDATA\[.*\]\]>" to strip HTML comments and CDATA sections before ElasticSearch indexing. Two issues with those patterns:

1. `.*` is greedy, so a string containing two comments like "<!-- ... --> keep this <!-- ... -->" collapses to "": everything between the first "<!--" and the last "-->" is eaten, including legitimate indexable text.
2. The default `.` does not match newlines, so any multi-line comment or CDATA block slips through completely and ends up in the search index.

Switch to non-greedy `.*?` and pass `flags=re.DOTALL` so each pattern matches exactly one comment/CDATA span, including multi-line ones, without swallowing surrounding text.

openedx-webhooks · 2026-04-20T11:19:55Z

Thanks for the pull request, @Chessing234!

This repository is currently maintained by @openedx/wg-maintenance-openedx-platform.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

- Dependencies: This PR must be merged before / after / at the same time as ...
- Blockers: This PR is waiting for OEP-1234 to be accepted.
- Timeline information: This PR must be merged by XX date because ...
- Partner information: This is for a course on edx.org.
- Supporting documentation
- Relevant Open edX discussion forum threads

🔘 Submit a signed contributor agreement (CLA)

⚠️ We ask all contributors to the Open edX project to submit a signed contributor agreement or indicate their institutional affiliation.
Please see the CONTRIBUTING file for more information.

If you've signed an agreement in the past, you may need to re-sign.
See The New Home of the Open edX Codebase for details.

Once you've signed the CLA, please allow 1 business day for it to be processed.
After this time, you can re-run the CLA check by adding a comment below that you have signed it.
If the CLA check continues to fail, you can tag the @openedx/cla-problems team in a comment for further assistance.

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

- The size and impact of the changes that it introduces
- The need for product review
- Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

mphilbrick211 · 2026-04-22T21:11:19Z

Hi @Chessing234! Welcome, and thank you for this contribution! In order for your CLA check to turn green, you'll need to submit a CLA form. If you are contributing as an individual, please fill out the individual CLA form here.

If you are contributing on behalf of an organization, please have your manager reach out to oscm@axim.org so you may be added to your org's existing entity agreement.

Please let me know if you have any questions. Thanks!

Chessing234 requested review from farhan, irtazaakram and salman2013 as code owners April 20, 2026 11:19

openedx-webhooks added the open-source-contribution label (PR author is not from Axim or 2U) Apr 20, 2026

openedx-webhooks added this to Contributions Apr 20, 2026

github-project-automation Bot moved this to Needs Triage in Contributions Apr 20, 2026

mphilbrick211 added the needs test run label (Author's first PR to this repository, awaiting test authorization from Axim) Apr 22, 2026

mphilbrick211 moved this from Needs Triage to Needs Tests Run or CLA Signed in Contributions Apr 22, 2026

Chessing234 wants to merge 1 commit into openedx:master from Chessing234:fix/escape-html-characters-nongreedy-dotall

3 participants
