Fix greedy regex in escape_html_characters that eats text between comments#38388
Fix greedy regex in escape_html_characters that eats text between comments#38388Chessing234 wants to merge 1 commit intoopenedx:masterfrom
Conversation
escape_html_characters uses re.sub with r"<!--.*-->" and
r"<!\[CDATA\[.*\]\]>" to strip HTML comments and CDATA sections before
ElasticSearch indexing. Two issues with those patterns:
1. `.*` is greedy, so a string containing two comments like
"<!-- a --> keep this <!-- b -->"
collapses to "" - everything between the first "<!--" and the last
"-->" is eaten, including legitimate indexable text.
2. The default `.` does not match newlines, so any multi-line comment
or CDATA block
<!--
notes
-->
slips through completely and ends up in the search index.
Switch to non-greedy `.*?` and pass `flags=re.DOTALL` so each pattern
matches exactly one comment/CDATA span, including multi-line ones,
without swallowing surrounding text.
|
Thanks for the pull request, @Chessing234! This repository is currently maintained by Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review. 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Submit a signed contributor agreement (CLA)
If you've signed an agreement in the past, you may need to re-sign. Once you've signed the CLA, please allow 1 business day for it to be processed. 🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. DetailsWhere can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources: When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
|
Hi @Chessing234! Welcome, and thank you for this contribution! In order for your CLA check to turn green, you'll need to submit a CLA form. If you are contributing as an individual, please fill out the individual CLA form here. If you are contributing on behalf of an organization, please have your manager reach out to oscm@axim.org so you may be added to your org's existing entity agreement. Please let me know if you have any questions. Thanks! |
Bug
escape_html_charactersinxmodule/util/misc.pyis used to strip HTMLnoise before ElasticSearch indexing. Its comment and CDATA strippers
misbehave in two ways:
Greedy:
r\"<!--.*-->\"matches from the first<!--to thelast
-->in the input. Indexable text between two separatecomments is deleted along with the comments:
Single-line only:
.does not match newlines, so any multi-linecomment / CDATA is not stripped at all:
and those spans end up in the search index.
Root cause
Both patterns use greedy
.*and neither is compiled withre.DOTALL. The helper was written assuming single-line, single-commentinput.
Why the fix is correct
.*?makes eachre.submatch exactly onecomment (or one CDATA span) and stop, so text between separate
comments survives.
flags=re.DOTALLlets.also span newlines, so multi-linecomments and CDATA blocks are stripped too.
re.subis untouched, and the function's public signature is thesame.
Change
xmodule/util/misc.py: make the two outerre.subcalls non-greedy(
.*→.*?) and addflags=re.DOTALLso multi-line comments andCDATA are actually stripped.