Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search#868
Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search#868snyaggarwal wants to merge 8 commits into
Conversation
Adds get_spelling_variant() to generate US<->British spelling alternatives (leukemia/leukaemia, haem/hem, paed/ped, oedema/edema, -our/-or, -ise/-ize, etc.). In semantic search, additional kNN sub-queries are fired using the variant's embedding so that e.g. querying "leukemia" still retrieves "leukaemia" concepts when the index is large enough to push them out of the default top-50 candidates. The rescore query is also expanded to boost exact matches of either spelling, and a pre-existing crash risk (empty should-clause when no synonyms are provided) is fixed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…structure Models lexical variants as an OCL Source so they get versioning, release management, locale handling, editability, and discoverability through existing OCL infrastructure — replacing the hardcoded regex approach with a data-driven lookup that aligns with UMLS / SNOMED / OBO conventions. * core/common/lexical_variants.py — get_lexical_variants() and get_variant_terms() helpers with token-level lookup, dataclass result, and per-(source_uri, version) cache. Tokenization-first design avoids the regex false-positive problems (e.g. "themselves" no longer matches "hem", "hemisphere" no longer matches "haem"). * core/common/data/lexical-variants-en.json — OCL bulk import file seeding ocl/lexical-variants-en with 43 vetted whole-word spelling pairs across en-US and en-GB, plus a v1.0 Source Version. Load via the standard OCL bulk import path. * core/common/tests.py — LexicalVariantsTest covering tokenization, cache lifecycle, false-positive regressions (themselves, anthem, hemisphere, hemp, hemlock, remember), missing-source graceful degradation, and multi-token expansion. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
… $match ConceptFuzzySearch.search() and MetadataToConceptsListView (the $match endpoint) now consume the dictionary helper instead of the inline regex. Variant expansion is OFF by default — clients opt in via the request body `variants` field. Same shape will be used for standard concept search when that wiring lands as a follow-up. Variants param accepted forms: * missing / null / false / "false" / "0" → disabled (default) * true / "true" / "1" → use DEFAULT_LEXICAL_VARIANTS_REPO * non-empty URI string → use that dictionary Source Sunny's kNN sub-query construction and rescore expansion are kept; only the variant source changes from hardcoded regex to dictionary lookup. The empty-synonyms rescore crash fix is preserved. * core/common/utils.py — remove get_spelling_variant() (replaced by the dictionary helper). * core/concepts/search.py — ConceptFuzzySearch.search() accepts variants_repo; gates variant kNN and rescore expansion on it being truthy. None means skip entirely. * core/concepts/views.py — MetadataToConceptsListView reads request.data.variants and normalizes it to a URI or None via _resolve_variants_repo before passing to ConceptFuzzySearch. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
|
@snyaggarwal — heads up, I pushed two commits on top of yours that redirect this PR to a dictionary-driven approach (full rationale in the updated description above + the proposal doc). TL;DR — your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept. The variant source changed from inline regex to a dictionary lookup against Apologies for the in-flight redirect — the false-positive cases in the regex list ( Happy to walk through the diff or rework anything that doesn't sit right. |
…riants-en The OCL organization on prod has mnemonic OCL (uppercase), not ocl. Updates the bulk import file owner field, the DEFAULT_LEXICAL_VARIANTS_REPO constant, and test mock URIs to match. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
|
Two missing things:
@snyaggarwal to take on this |
…redis cache and making things class based from module based
paynejd
left a comment
There was a problem hiding this comment.
Reviewing the latest revision after the redis-cache refactor and use_lexical_variants field addition. Four things to address before merge:
1. LEXICAL_VARIANTS_CACHE_TIMEOUT env override is broken — os.environ.get returns a string when set, but django-redis's cache.set(timeout=...) expects an int. Suggestion inline.
2. use_lexical_variants MapProject flag is persistence-only. The field exists, the serializer surfaces it, but MetadataToConceptsListView.filter_queryset still only reads request.data.get('variants') — so the flag has no effect at $match time until oclmap is updated to read the project config and pass variants: true per request. That's consistent with how encoder_model flows (client passes it per call), so the field is correct as persistence; just call out the dependency in the PR body and file an oclmap follow-up ticket. Otherwise reviewers / Mapper users will reasonably assume flipping the flag does something on its own.
3. Fixture path changed. Sunny moved core/common/data/lexical-variants-en.json → core/lookup_fixtures/lexical-variants-en.json, but (a) import_lookup_values.py has a hardcoded list and does not auto-load this file, so the lookup_fixtures/ placement falsely implies auto-load, and (b) the PR body's deploy commands still reference the old core/common/data/... path — anyone running them on dev/qa/staging gets a 404. Either move the file back, or wire it into import_lookup_values and update the PR body.
4. Rebase needed. PR is CONFLICTING (8 commits behind master). Conflicts are in core/map_projects/{models,serializers,tests/tests}.py from master's prompt_output_locale config addition (c883bd0). Should be a clean resolve — just append use_lexical_variants after prompt_output_locale in each CONFIGURATION_FIELDS / serializer field list / prepare_object loop. Verify migration 0036_mapproject_use_lexical_variants.py doesn't collide with master's migration numbering after rebase.
The redis-cache refactor itself looks good — keying on (source.uri, version) cleanly invalidates released-version edits. HEAD edits will be stale up to the TTL (4 days default), which is fine for a 43-entry vetted dictionary; worth a one-line comment so future maintainers know. The is_core_user gate matches the encoder_model precedent.
|
|
||
| DEFAULT_LEXICAL_VARIANTS_REPO = os.environ.get( | ||
| 'DEFAULT_LEXICAL_VARIANTS_REPO', '/orgs/OCL/sources/lexical-variants-en/') | ||
| LEXICAL_VARIANTS_CACHE_TIMEOUT = os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4) |
There was a problem hiding this comment.
Env override returns a string but cache.set(timeout=...) wants an int. Bug if anyone sets LEXICAL_VARIANTS_CACHE_TIMEOUT in env.
| LEXICAL_VARIANTS_CACHE_TIMEOUT = os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4) | |
| LEXICAL_VARIANTS_CACHE_TIMEOUT = int(os.environ.get('LEXICAL_VARIANTS_CACHE_TIMEOUT', 60 * 60 * 24 * 4)) |
|
|
||
| class LexicalVariantDictionary: | ||
| CACHE_KEY_PREFIX = 'lexical_variants' | ||
| CACHE_TIMEOUT = settings.LEXICAL_VARIANTS_CACHE_TIMEOUT |
There was a problem hiding this comment.
Two small notes:
-
Cache key includes
version, so released versions invalidate cleanly when bumped. But for HEAD edits the key staysHEADand stale data persists up toCACHE_TIMEOUT(4 days default). For a vetted 43-entry dictionary that's fine — but a future maintainer who edits the dictionary in-place will be surprised. Worth a one-line note here. Optionally, hookinvalidate_cache(source.uri)into Source / SourceVersionsave()post-save so HEAD edits flush. -
CACHE_TIMEOUT = settings.LEXICAL_VARIANTS_CACHE_TIMEOUTis evaluated at class-definition time. Fine in practice (env reads on process boot), but if anyone ever wants@override_settingsto take effect in tests, this won't pick it up — they'd have to usecls.CACHE_TIMEOUTvia aclassmethodinstead. Not worth changing unless tests need it.
|
|
||
| map_config = self.request.data.get('map_config', []) | ||
| filters = self.request.data.get('filter', {}) | ||
| variants_repo = self._resolve_variants_repo(self.request.data.get('variants')) if is_core_user else None |
There was a problem hiding this comment.
Worth a one-line debug log when variants is supplied but dropped because the caller isn't core — silent drops will confuse anyone reading API docs and trying it from a non-core token (worse results with no explanation).
Also — this is where the new MapProject.use_lexical_variants flag would need to feed in if we wanted server-side default behavior (rather than relying on oclmap to pass variants: true per request). Current design pushes that responsibility to the client, which is consistent with encoder_model. Just calling out: this PR makes the flag visible via the API but its only effect right now is that oclmap can persist user intent — it won't change $match behavior until oclmap reads the project config and passes variants. Recommend an explicit follow-up ticket and a line in the PR body so reviewers don't assume merging this enables variants in the Mapper.
There was a problem hiding this comment.
Correcting myself — I missed that Sunny opened the companion oclmap PR (OpenConceptLab/oclmap#20) earlier today, which does exactly the wiring I was asking about: reads use_lexical_variants from project config, renders a core-user-gated checkbox in AdvancedSettings, persists via FormData on save, and sends variants: useLexicalVariants in the $match request body. So the flag is not dormant — it's a sibling-PR situation, not a missing-wiring situation.
Two PRs land safely in either order: if oclapi2 lands first, oclmap master doesn't yet send variants so default-disabled holds; if oclmap lands first, oclapi2 master ignores the unknown variants field. No race risk.
Retracting the "follow-up ticket" ask. Still worth a one-liner in this PR's body cross-linking oclmap#20 (and vice versa) so reviewers see the pair. Debug-log suggestion for silently-dropped non-core variants still applies.
| @@ -0,0 +1,45 @@ | |||
| {"type": "Source", "id": "lexical-variants-en", "short_code": "lexical-variants-en", "name": "English Lexical Variants", "full_name": "OCL English Lexical Variants Dictionary", "owner_type": "Organization", "owner": "OCL", "description": "Dictionary of English lexical variants (US/British/Commonwealth spelling pairs) used by OCL search and matching pipelines for query expansion. Each concept represents a spelling equivalence class. Names are tagged with locale (en-US, en-GB) and name_type=Fully Specified for the canonical form per locale. See: https://github.com/OpenConceptLab/ocl_issues (Lexical Dictionary infrastructure).", "default_locale": "en", "source_type": "Lexical Variants", "public_access": "View", "supported_locales": "en,en-US,en-GB", "custom_validation_schema": "None", "extras": {"dictionary_kind": "lexical_variant"}} | |||
There was a problem hiding this comment.
The rename from core/common/data/ → core/lookup_fixtures/ is misleading: core/common/management/commands/import_lookup_values.py has a hardcoded list of fixture files and does not pick this one up — so placing it here implies an auto-load that doesn't happen. The file is also a full OCL bulk-import payload (Source line + Concept lines + version line), not the JSONL Concept-only format the other fixtures here use, so wiring it into import_lookup_values isn't a one-line change either.
Two options:
- Move back to
core/common/data/lexical-variants-en.json(matches the bulk-import-via-CLI deploy path the PR body describes), or - Keep here and add a real loader path in
import_lookup_valuesfor bulk-format fixtures.
Either way, the PR body's deploy commands still reference the old path — update those so dev/qa/staging deploys don't 404.
Refs OpenConceptLab/ocl_issues#2505 (Problem B only — Problem A still open)
Refs OpenConceptLab/ocl_issues#2511 (Lexical Variant Dictionary infrastructure)
Heads up — approach changed mid-flight
Sunny — apologies for the redirect. After deeper review of #2505 with @paynejd, we landed on a substantially different architecture and felt it was faster to push the new approach directly than to iterate via comments. Your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept — only the variant source changed from inline regex to a dictionary-driven lookup. Happy to walk through the diff on a call if useful.
Three reasons for the redirect:
hem/haemmatchesthemselves,anthem,hemisphere,hemp,hemlock,remember— hard to fix defensively in regex without enumerating exclusions for every English word.ConceptFuzzySearch. Standard/concepts/?q=…has the same gap, and upcoming abbreviation-expansion work needs the same primitive. A dictionary-as-Source approach makes that reuse trivial.Tradeoff worth flagging
This approach puts pressure on oclapi2 to do caching well. The MVP scale is trivial (~40 entries, in-memory dict), but as dictionary types and sizes grow — especially abbreviation dictionaries (UMLS LRABR is ~250k rows) and once variants are enabled by default on high-QPS endpoints — the per-token DB lookup will need to graduate beyond the MVP's module-level cache. Likely Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones. Plan to revisit when (a) a dictionary exceeds ~5k entries, or (b) variants are enabled by default on a high-QPS endpoint. Captured in #2511.
What this PR does
core/common/lexical_variants.py—get_lexical_variants(text, source_uri)andget_variant_terms(text, source_uri). Tokenizes input (lowercase + whitespace + ASCII punctuation strip), looks each token up against the dictionary's Names, returns sibling Names from the matching Concept. Module-level cache keyed on(source_uri, version). Returns[]and never raises if the dictionary is missing — graceful degradation.core/common/data/lexical-variants-en.json— OCL bulk import file creatingOCL/lexical-variants-en(Source + 43 vetted Concepts + Source Versionv1.0). Each Concept has en-US and en-GB Names withname_type=Fully Specified. False-positive-prone substring entries dropped.$match:MetadataToConceptsListViewreadsvariantsfrom request body, passes the resolved URI (orNone) toConceptFuzzySearch.search(..., variants_repo=...). The kNN sub-query construction and rescore expansion are unchanged in shape; only the variant source changed.get_spelling_variant()fromcore/common/utils.py.API:
variantsrequest body fieldLexical variant expansion is OFF by default. Clients opt in. Same shape will apply to standard concept search when that wiring lands as a follow-up.
null/false/"false"/"0"true/"true"/"1"OCL/lexical-variants-en)"/orgs/myorg/sources/my-dictionary/")Tests
core/common/tests.pyLexicalVariantsTestcovers:themselves,anthem,hemisphere,hemp,hemlock,rememberreturn no variantsDeployment
The dictionary Source has already been seeded on prod (
OCL/lexical-variants-env1.0, 43 concepts, public read) so the wiring will work as soon as this PR merges and deploys. For dev / qa / staging environments, run:Verify with:
ocl repo get OCL lexical-variants-en(should show 43 concepts,source_type: "Lexical Variants",extras.dictionary_kind: "lexical_variant", versionv1.0released).The helper degrades gracefully (returns
[], never raises) if the Source doesn't exist yet — so a deploy gap on any environment won't break search.Out of scope (deferred follow-ups)
?variants=...on standard concept search (ConceptListView) — touchesCustomESSearch/get_raw_search_stringlayer, smaller blast radius as a separate PR$lexical-variantsHTTP operationapply_score()highlight) — entirely separate work; this PR no longer auto-closes that part of the ticketArchitecture / design references
$lexical-variantsoperation, abbreviation dictionary type, caching strategy, etc.)