Skip to content

Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search#868

Open
snyaggarwal wants to merge 8 commits into
masterfrom
issues#2505
Open

Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search#868
snyaggarwal wants to merge 8 commits into
masterfrom
issues#2505

Conversation

@snyaggarwal
Copy link
Copy Markdown
Contributor

@snyaggarwal snyaggarwal commented May 7, 2026

Refs OpenConceptLab/ocl_issues#2505 (Problem B only — Problem A still open)
Refs OpenConceptLab/ocl_issues#2511 (Lexical Variant Dictionary infrastructure)

Heads up — approach changed mid-flight

Sunny — apologies for the redirect. After deeper review of #2505 with @paynejd, we landed on a substantially different architecture and felt it was faster to push the new approach directly than to iterate via comments. Your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept — only the variant source changed from inline regex to a dictionary-driven lookup. Happy to walk through the diff on a call if useful.

Three reasons for the redirect:

  1. False positives in the original regex list. hem/haem matches themselves, anthem, hemisphere, hemp, hemlock, remember — hard to fix defensively in regex without enumerating exclusions for every English word.
  2. Reusability. The fix should be available beyond ConceptFuzzySearch. Standard /concepts/?q=… has the same gap, and upcoming abbreviation-expansion work needs the same primitive. A dictionary-as-Source approach makes that reuse trivial.
  3. Eat our own dog food. Storing dictionaries as OCL Sources gives versioning, release management, locale support, editability through the OCL UI, and discoverability for free — and aligns with how UMLS / SNOMED / OBO model lexical variants.

Tradeoff worth flagging

This approach puts pressure on oclapi2 to do caching well. The MVP scale is trivial (~40 entries, in-memory dict), but as dictionary types and sizes grow — especially abbreviation dictionaries (UMLS LRABR is ~250k rows) and once variants are enabled by default on high-QPS endpoints — the per-token DB lookup will need to graduate beyond the MVP's module-level cache. Likely Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones. Plan to revisit when (a) a dictionary exceeds ~5k entries, or (b) variants are enabled by default on a high-QPS endpoint. Captured in #2511.

What this PR does

  • New helper: core/common/lexical_variants.pyget_lexical_variants(text, source_uri) and get_variant_terms(text, source_uri). Tokenizes input (lowercase + whitespace + ASCII punctuation strip), looks each token up against the dictionary's Names, returns sibling Names from the matching Concept. Module-level cache keyed on (source_uri, version). Returns [] and never raises if the dictionary is missing — graceful degradation.
  • Seed data: core/common/data/lexical-variants-en.json — OCL bulk import file creating OCL/lexical-variants-en (Source + 43 vetted Concepts + Source Version v1.0). Each Concept has en-US and en-GB Names with name_type=Fully Specified. False-positive-prone substring entries dropped.
  • Wired into $match: MetadataToConceptsListView reads variants from request body, passes the resolved URI (or None) to ConceptFuzzySearch.search(..., variants_repo=...). The kNN sub-query construction and rescore expansion are unchanged in shape; only the variant source changed.
  • Empty-synonyms rescore crash fix: preserved from your initial commit.
  • Removed: get_spelling_variant() from core/common/utils.py.

API: variants request body field

Lexical variant expansion is OFF by default. Clients opt in. Same shape will apply to standard concept search when that wiring lands as a follow-up.

Value Behavior
absent / null / false / "false" / "0" disabled (default)
true / "true" / "1" enabled, default dictionary (OCL/lexical-variants-en)
URI string (e.g. "/orgs/myorg/sources/my-dictionary/") enabled, that dictionary

Tests

core/common/tests.py LexicalVariantsTest covers:

  • Tokenization
  • Multi-token expansion
  • Cache lifecycle (per-source-version) + invalidation
  • Missing-source graceful degradation
  • Regression: themselves, anthem, hemisphere, hemp, hemlock, remember return no variants

Deployment

The dictionary Source has already been seeded on prod (OCL/lexical-variants-en v1.0, 43 concepts, public read) so the wiring will work as soon as this PR merges and deploys. For dev / qa / staging environments, run:

ocl --server ocl-dev import file --wait core/common/data/lexical-variants-en.json
ocl --server ocl-qa import file --wait core/common/data/lexical-variants-en.json
ocl --server ocl-staging import file --wait core/common/data/lexical-variants-en.json

Verify with: ocl repo get OCL lexical-variants-en (should show 43 concepts, source_type: "Lexical Variants", extras.dictionary_kind: "lexical_variant", version v1.0 released).

The helper degrades gracefully (returns [], never raises) if the Source doesn't exist yet — so a deploy gap on any environment won't break search.

Out of scope (deferred follow-ups)

  • ?variants=... on standard concept search (ConceptListView) — touches CustomESSearch / get_raw_search_string layer, smaller blast radius as a separate PR
  • $lexical-variants HTTP operation
  • Abbreviation dictionary type (Mapping-based)
  • Multi-dictionary composition
  • Problem A from #2505 (CIEL bridge exact-match scoring via apply_score() highlight) — entirely separate work; this PR no longer auto-closes that part of the ticket

Architecture / design references

  • Full design proposal: Lexical Variant Dictionaries — UMLS / SNOMED / OBO precedents, dictionary-type modeling decisions, tokenization-first reasoning, multi-dictionary roadmap
  • Infrastructure tracking ticket: #2511 — Phase 2/3 work (standard search wiring, $lexical-variants operation, abbreviation dictionary type, caching strategy, etc.)

Adds get_spelling_variant() to generate US<->British spelling alternatives
(leukemia/leukaemia, haem/hem, paed/ped, oedema/edema, -our/-or, -ise/-ize,
etc.). In semantic search, additional kNN sub-queries are fired using the
variant's embedding so that e.g. querying "leukemia" still retrieves
"leukaemia" concepts when the index is large enough to push them out of the
default top-50 candidates. The rescore query is also expanded to boost exact
matches of either spelling, and a pre-existing crash risk (empty should-clause
when no synonyms are provided) is fixed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@snyaggarwal snyaggarwal requested a review from paynejd May 7, 2026 15:55
paynejd and others added 2 commits May 9, 2026 10:31
…structure

Models lexical variants as an OCL Source so they get versioning, release
management, locale handling, editability, and discoverability through
existing OCL infrastructure — replacing the hardcoded regex approach with
a data-driven lookup that aligns with UMLS / SNOMED / OBO conventions.

* core/common/lexical_variants.py — get_lexical_variants() and
  get_variant_terms() helpers with token-level lookup, dataclass result,
  and per-(source_uri, version) cache. Tokenization-first design avoids
  the regex false-positive problems (e.g. "themselves" no longer matches
  "hem", "hemisphere" no longer matches "haem").

* core/common/data/lexical-variants-en.json — OCL bulk import file
  seeding ocl/lexical-variants-en with 43 vetted whole-word spelling
  pairs across en-US and en-GB, plus a v1.0 Source Version. Load via
  the standard OCL bulk import path.

* core/common/tests.py — LexicalVariantsTest covering tokenization,
  cache lifecycle, false-positive regressions (themselves, anthem,
  hemisphere, hemp, hemlock, remember), missing-source graceful
  degradation, and multi-token expansion.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
… $match

ConceptFuzzySearch.search() and MetadataToConceptsListView (the $match
endpoint) now consume the dictionary helper instead of the inline regex.

Variant expansion is OFF by default — clients opt in via the request body
`variants` field. Same shape will be used for standard concept search
when that wiring lands as a follow-up.

Variants param accepted forms:
* missing / null / false / "false" / "0" → disabled (default)
* true / "true" / "1" → use DEFAULT_LEXICAL_VARIANTS_REPO
* non-empty URI string → use that dictionary Source

Sunny's kNN sub-query construction and rescore expansion are kept;
only the variant source changes from hardcoded regex to dictionary
lookup. The empty-synonyms rescore crash fix is preserved.

* core/common/utils.py — remove get_spelling_variant() (replaced by
  the dictionary helper).

* core/concepts/search.py — ConceptFuzzySearch.search() accepts
  variants_repo; gates variant kNN and rescore expansion on it being
  truthy. None means skip entirely.

* core/concepts/views.py — MetadataToConceptsListView reads
  request.data.variants and normalizes it to a URI or None via
  _resolve_variants_repo before passing to ConceptFuzzySearch.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
@paynejd
Copy link
Copy Markdown
Member

paynejd commented May 9, 2026

@snyaggarwal — heads up, I pushed two commits on top of yours that redirect this PR to a dictionary-driven approach (full rationale in the updated description above + the proposal doc).

TL;DR — your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept. The variant source changed from inline regex to a dictionary lookup against ocl/lexical-variants-en (a new OCL Source seeded by bulk import). Variant expansion is now off by default; clients opt in via variants: true (or a custom dictionary URI) in the request body. We'll need follow up work to utilize this in Mapper (expose in project config?).

Apologies for the in-flight redirect — the false-positive cases in the regex list (themselves, anthem, hemisphere, hemp, etc. all matching hem/haem) plus the realization that the same primitive will be needed for upcoming synonym/abbreviation/keyword-expansion evaluation that we're starting pushed me to the dictionary-as-Source approach. Phase 2/3 work tracked in #2511.

Happy to walk through the diff or rework anything that doesn't sit right.

…riants-en

The OCL organization on prod has mnemonic OCL (uppercase), not ocl. Updates
the bulk import file owner field, the DEFAULT_LEXICAL_VARIANTS_REPO
constant, and test mock URIs to match.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
Comment thread core/common/lexical_variants.py Outdated
@snyaggarwal
Copy link
Copy Markdown
Contributor Author

snyaggarwal commented May 12, 2026

Two missing things:

  1. Project config to enable it ("use_lexical_variants") and for core users only
  2. Use Redis caching with TTL

@snyaggarwal to take on this

Copy link
Copy Markdown
Member

@paynejd paynejd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing the latest revision after the redis-cache refactor and use_lexical_variants field addition. Four things to address before merge:

1. LEXICAL_VARIANTS_CACHE_TIMEOUT env override is brokenos.environ.get returns a string when set, but django-redis's cache.set(timeout=...) expects an int. Suggestion inline.

2. use_lexical_variants MapProject flag is persistence-only. The field exists, the serializer surfaces it, but MetadataToConceptsListView.filter_queryset still only reads request.data.get('variants') — so the flag has no effect at $match time until oclmap is updated to read the project config and pass variants: true per request. That's consistent with how encoder_model flows (client passes it per call), so the field is correct as persistence; just call out the dependency in the PR body and file an oclmap follow-up ticket. Otherwise reviewers / Mapper users will reasonably assume flipping the flag does something on its own.

3. Fixture path changed. Sunny moved core/common/data/lexical-variants-en.jsoncore/lookup_fixtures/lexical-variants-en.json, but (a) import_lookup_values.py has a hardcoded list and does not auto-load this file, so the lookup_fixtures/ placement falsely implies auto-load, and (b) the PR body's deploy commands still reference the old core/common/data/... path — anyone running them on dev/qa/staging gets a 404. Either move the file back, or wire it into import_lookup_values and update the PR body.

4. Rebase needed. PR is CONFLICTING (8 commits behind master). Conflicts are in core/map_projects/{models,serializers,tests/tests}.py from master's prompt_output_locale config addition (c883bd0). Should be a clean resolve — just append use_lexical_variants after prompt_output_locale in each CONFIGURATION_FIELDS / serializer field list / prepare_object loop. Verify migration 0036_mapproject_use_lexical_variants.py doesn't collide with master's migration numbering after rebase.

The redis-cache refactor itself looks good — keying on (source.uri, version) cleanly invalidates released-version edits. HEAD edits will be stale up to the TTL (4 days default), which is fine for a 43-entry vetted dictionary; worth a one-line comment so future maintainers know. The is_core_user gate matches the encoder_model precedent.

Comment thread core/settings.py

DEFAULT_LEXICAL_VARIANTS_REPO = os.environ.get(
'DEFAULT_LEXICAL_VARIANTS_REPO', '/orgs/OCL/sources/lexical-variants-en/')
LEXICAL_VARIANTS_CACHE_TIMEOUT = os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Env override returns a string but cache.set(timeout=...) wants an int. Bug if anyone sets LEXICAL_VARIANTS_CACHE_TIMEOUT in env.

Suggested change
LEXICAL_VARIANTS_CACHE_TIMEOUT = os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4)
LEXICAL_VARIANTS_CACHE_TIMEOUT = int(os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4))


class LexicalVariantDictionary:
CACHE_KEY_PREFIX = 'lexical_variants'
CACHE_TIMEOUT = settings.LEXICAL_VARIANTS_CACHE_TIMEOUT
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small notes:

  1. Cache key includes version, so released versions invalidate cleanly when bumped. But for HEAD edits the key stays HEAD and stale data persists up to CACHE_TIMEOUT (4 days default). For a vetted 43-entry dictionary that's fine — but a future maintainer who edits the dictionary in-place will be surprised. Worth a one-line note here. Optionally, hook invalidate_cache(source.uri) into Source / SourceVersion save() post-save so HEAD edits flush.

  2. CACHE_TIMEOUT = settings.LEXICAL_VARIANTS_CACHE_TIMEOUT is evaluated at class-definition time. Fine in practice (env reads on process boot), but if anyone ever wants @override_settings to take effect in tests, this won't pick it up — they'd have to use cls.CACHE_TIMEOUT via a classmethod instead. Not worth changing unless tests need it.

Comment thread core/concepts/views.py

map_config = self.request.data.get('map_config', [])
filters = self.request.data.get('filter', {})
variants_repo = self._resolve_variants_repo(self.request.data.get('variants')) if is_core_user else None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a one-line debug log when variants is supplied but dropped because the caller isn't core — silent drops will confuse anyone reading API docs and trying it from a non-core token (worse results with no explanation).

Also — this is where the new MapProject.use_lexical_variants flag would need to feed in if we wanted server-side default behavior (rather than relying on oclmap to pass variants: true per request). Current design pushes that responsibility to the client, which is consistent with encoder_model. Just calling out: this PR makes the flag visible via the API but its only effect right now is that oclmap can persist user intent — it won't change $match behavior until oclmap reads the project config and passes variants. Recommend an explicit follow-up ticket and a line in the PR body so reviewers don't assume merging this enables variants in the Mapper.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correcting myself — I missed that Sunny opened the companion oclmap PR (OpenConceptLab/oclmap#20) earlier today, which does exactly the wiring I was asking about: reads use_lexical_variants from project config, renders a core-user-gated checkbox in AdvancedSettings, persists via FormData on save, and sends variants: useLexicalVariants in the $match request body. So the flag is not dormant — it's a sibling-PR situation, not a missing-wiring situation.

Two PRs land safely in either order: if oclapi2 lands first, oclmap master doesn't yet send variants so default-disabled holds; if oclmap lands first, oclapi2 master ignores the unknown variants field. No race risk.

Retracting the "follow-up ticket" ask. Still worth a one-liner in this PR's body cross-linking oclmap#20 (and vice versa) so reviewers see the pair. Debug-log suggestion for silently-dropped non-core variants still applies.

@@ -0,0 +1,45 @@
{"type": "Source", "id": "lexical-variants-en", "short_code": "lexical-variants-en", "name": "English Lexical Variants", "full_name": "OCL English Lexical Variants Dictionary", "owner_type": "Organization", "owner": "OCL", "description": "Dictionary of English lexical variants (US/British/Commonwealth spelling pairs) used by OCL search and matching pipelines for query expansion. Each concept represents a spelling equivalence class. Names are tagged with locale (en-US, en-GB) and name_type=Fully Specified for the canonical form per locale. See: https://github.com/OpenConceptLab/ocl_issues (Lexical Dictionary infrastructure).", "default_locale": "en", "source_type": "Lexical Variants", "public_access": "View", "supported_locales": "en,en-US,en-GB", "custom_validation_schema": "None", "extras": {"dictionary_kind": "lexical_variant"}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rename from core/common/data/core/lookup_fixtures/ is misleading: core/common/management/commands/import_lookup_values.py has a hardcoded list of fixture files and does not pick this one up — so placing it here implies an auto-load that doesn't happen. The file is also a full OCL bulk-import payload (Source line + Concept lines + version line), not the JSONL Concept-only format the other fixtures here use, so wiring it into import_lookup_values isn't a one-line change either.

Two options:

  • Move back to core/common/data/lexical-variants-en.json (matches the bulk-import-via-CLI deploy path the PR body describes), or
  • Keep here and add a real loader path in import_lookup_values for bulk-format fixtures.

Either way, the PR body's deploy commands still reference the old path — update those so dev/qa/staging deploys don't 404.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants