
RFC 089: Identifiers API #156

Open
kenoir wants to merge 22 commits into main from rk/identifiers-api-rfc

kenoir commented Jun 18, 2026


Preview

View the rendered RFC on this branch:


What does this change?

The proposal is a small, read-only Identifiers API. Wellcome Collection gives every catalogue entity (a work, an image, an item) a stable public "canonical" id, and keeps a registry recording which underlying source ids each canonical id was built from. This API does one job: given a canonical id, it returns the source id(s) behind it; given a source id, it returns the canonical id (optionally with its siblings). It only ever reads that registry; it never creates or changes ids.
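To make the two lookup directions concrete, here is a minimal in-memory sketch. It is not the prototype's implementation: the registry layout, the function names, the response keys (`canonicalId`, `identifiers`) and every id value are assumptions made up for illustration. Only the source-system names and the `isAlias` concept come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceIdentifier:
    source_system: str  # e.g. "sierra-system-number"
    source_id: str
    is_alias: bool      # False for the original, True for an inherited alias


# Hypothetical registry rows keyed by canonical id. All values are invented.
REGISTRY = {
    "a2b3c4d5": [
        SourceIdentifier("sierra-system-number", "b12345678", is_alias=False),
        SourceIdentifier("folio-instance", "00000000-0000-4000-8000-000000000001", is_alias=True),
    ],
}


def forward_lookup(canonical_id):
    """Canonical id -> the source identifiers behind it, or None if unknown."""
    return REGISTRY.get(canonical_id)


def reverse_lookup(source_system, source_id, include_siblings=False):
    """(sourceSystem, source id) -> canonical id, optionally with its siblings."""
    for canonical_id, rows in REGISTRY.items():
        if any(r.source_system == source_system and r.source_id == source_id for r in rows):
            if include_siblings:
                return {"canonicalId": canonical_id, "identifiers": rows}
            return {"canonicalId": canonical_id}
    return None  # nothing is ever created on a miss: the API is read-only
```

Both functions only read `REGISTRY`, mirroring the read-only constraint described above.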

It exists because of the Sierra/CALM to FOLIO/Axiell migration. As records move between systems, a single canonical id accumulates several source ids over time (an original plus inherited "predecessor" aliases). A couple of internal services sit right at the edges where that translation has to happen: the IIIF viewer needs to turn old b-numbers and CALM refs into the canonical id it presents under, and requesting needs to turn a canonical item id into the FOLIO UUID a hold is placed on, and back again. The guiding principle is that everything public speaks canonical, and source ids only appear at those two edges (ingest and the FOLIO boundary). Rather than have each consumer re-derive the mapping or query the catalogue by source id, this API is the single shared place where that translation lives. Because the main running cost is database queries, it is also the natural place to cache aggressively (at the edge, to keep requests off the database) and to attribute that database cost to the consumers driving it.
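The caching and cost-attribution idea can be sketched as a read-through cache that counts which consumer triggered each database query. This is an in-process illustration only: the RFC proposes an edge cache, and the class, method and consumer names here are hypothetical. Expiry and invalidation are deliberately left out.

```python
from collections import Counter


class CachedLookup:
    """Read-through cache in front of a (hypothetical) database query function."""

    def __init__(self, query_db):
        self._query_db = query_db
        self._cache = {}
        self.hits = 0
        # Database queries attributed to the consumer whose request missed the cache.
        self.db_queries_by_consumer = Counter()

    def get(self, key, consumer):
        if key in self._cache:
            self.hits += 1  # served without touching the database
            return self._cache[key]
        self.db_queries_by_consumer[consumer] += 1
        value = self._query_db(key)
        self._cache[key] = value
        return value
```

A consumer that repeats requests for the same id mostly produces hits, while one that fetches unique ids produces a database query per request, so the achievable hit ratio depends on who is calling.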

How it relates to the other RFCs:

  • It reads the ID Registry defined in RFC 083 (stable identifiers). RFC 083 owns the data and all the writes; this API is just a read-only window onto it.
  • It serves the IIIF/DDS lookup that RFC 085 (IIIF identities, open PR #143) describes, where the canonical Work id becomes the IIIF URI and older identifier forms redirect to it.
  • It is the concrete "service" answer to the open question in RFC 088 (Migrating identity, requesting and items APIs from Sierra to FOLIO, open PR #153) about how requesting should translate item ids. RFC 088 left that access mechanism open (a direct database read, a sync, or a service); this RFC proposes the service and is for RFC 088 to ratify.

The RFC is written to stand on its own and carries the API contract alongside it, so it can be reviewed without access to the closed discovery/prototype repository where the working prototype lives.

Files added:

  • rfcs/089-identifiers-api/README.md: the RFC document.
  • rfcs/README.md: refreshed RFC listing table (RFC 089 row added).
  • rfcs/089-identifiers-api/openapi.yaml: the OpenAPI 3.0 spec for the two lookup operations (the source of truth).
  • rfcs/089-identifiers-api/openapi.md: generated human-readable rendering of the spec, browsable on GitHub without a Swagger/Redoc renderer.
  • render_docs.py, pyproject.toml, .python-version, .gitignore, uv.lock: a self-contained uv project that validates openapi.yaml and regenerates openapi.md.

How to test

  • Read rfcs/089-identifiers-api/README.md and review the contract, architecture, caching strategy and open questions.
  • Confirm the RFC passes repo validation: .scripts/validate_rfc.py.
  • Confirm the listing table is in sync: .scripts/create_table_summary.py --check-readme.
  • (Optional) Regenerate the rendered contract and confirm there is no diff: from the RFC directory, run uv run python render_docs.py, which validates openapi.yaml against the OpenAPI spec validator and rewrites openapi.md.

How can we measure success?

No measurable runtime success criteria; this is a documentation RFC. Success is the RFC being reviewed and providing a clear, self-contained contract and architecture that the team can align on, and a decision on whether this service is the access mechanism for identifier translation in RFC 088.

Have we considered potential risks?

  • Documentation only; no production code or infrastructure is changed by this PR, so there is no runtime or deployment risk.
  • The design risks themselves (the caching strategy and the database cost it controls, the unmet FOLIO-item ingestion dependency, item canonical-id stability) are enumerated in the RFC's Open questions section and are intended to be the subject of review.
  • The OpenAPI spec is validated and the rendered Markdown is generated from it, reducing the chance of the contract and its human-readable rendering drifting apart.

kenoir and others added 5 commits June 18, 2026 13:00
Read-only canonical <-> source identifier translation over the RFC 083 ID
Registry, for the IIIF/DDS (RFC 085) and requesting (RFC 088) consumers.

Carries the OpenAPI contract alongside the RFC (openapi.yaml + a rendered
openapi.md via a small uv project, following the RFC 088 pattern) so the
proposal stands alone without the private prototype repository. Covers the
contract, AWS architecture, API-key auth + usage-plan metering, the caching
topology, and the live-data findings (folio-instance aliases present;
folio-item-id absent, so the requesting translation has no data yet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The concern is the cost of database (Aurora) queries, not policing a per-consumer
billing quota. That inverts the caching strategy: an edge (CloudFront) cache that
serves hits without touching the database is now preferred, rather than rejected
for breaking metering. Recasts the per-consumer story as API keys for identity /
cost attribution plus a throttle as a database safety valve, and reorients the
caching open question toward hit-ratio, throttle sizing and cost attribution.

Also removes the detailed real-data-findings list (kept in the prototype docs),
leaving a one-line pointer. Contract unchanged (openapi.yaml/openapi.md untouched).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces em dashes with plain punctuation throughout and tones down a few
flourishes ("single translation membrane", "exactly the win", "evaporates").
Directional notation (the migration and lookup arrows) is kept. No change to the
contract, the decisions, or the meaning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes the decision-log section and its table of contents entry, and renames the
contract-summary heading to "API Contract" (the separate "API contract (OpenAPI)"
section is unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Avoids two similarly-named sections after "The contract" became "API Contract".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kenoir and others added 16 commits June 22, 2026 15:32
The reverse-lookup 200 response is oneOf [CanonicalIdRef, IdentifierSet]
but render_docs.py only handled single $ref bodies, so the generated
table showed 'n/a'. Render oneOf/anyOf as the alternatives joined by an
escaped pipe (allOf as an intersection) and regenerate openapi.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
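The rendering rule in the commit above can be sketched as a small recursive function. This is not the actual render_docs.py code: the function name is hypothetical and the " & " separator for allOf is an assumption, since the commit only says allOf is rendered "as an intersection".

```python
def render_schema(schema):
    """Describe a response schema in a form that is safe inside a Markdown table cell."""
    if "$ref" in schema:
        # "#/components/schemas/IdentifierSet" -> "IdentifierSet"
        return schema["$ref"].rsplit("/", 1)[-1]
    # A bare "|" would end the table cell, so alternatives are joined with "\|".
    for keyword, separator in (("oneOf", " \\| "), ("anyOf", " \\| "), ("allOf", " & ")):
        if keyword in schema:
            return separator.join(render_schema(sub) for sub in schema[keyword])
    return schema.get("type", "n/a")
```

With this shape, the reverse-lookup 200 body (oneOf CanonicalIdRef and IdentifierSet) renders as both names joined by an escaped pipe rather than falling through to "n/a".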
A reviewer asked whether '400 ... unsupported enum value' contradicts the
rule that an unknown open-set sourceSystem yields 404. It does not: type is
the only enum-constrained parameter. Name it explicitly to remove the
ambiguity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
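The distinction between the two error cases can be illustrated as follows. It is a sketch under assumptions: the function name, the registry shape and the idea that type arrives as an optional filter are hypothetical. The three type values and the 400/404 split are taken from this PR.

```python
SUPPORTED_TYPES = {"Work", "Image", "Item"}  # the enum-constrained parameter


def reverse_lookup_status(registry, source_system, source_id, type_=None):
    """Return the HTTP status a reverse lookup would produce (illustrative only)."""
    # type is a closed enum: a value outside it is a malformed request.
    if type_ is not None and type_ not in SUPPORTED_TYPES:
        return 400
    # sourceSystem is an open set: an unknown system is simply "not found".
    if (source_system, source_id) not in registry:
        return 404
    return 200
```

An unknown sourceSystem never reaches the 400 branch, because it is not validated against an enum at all.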
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… UUID)

The Sierra item number is the predecessor that lets the new FOLIO item UUID
inherit the existing canonical id, matching the work-level pattern. The text
had the direction reversed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…and)

Record the decision rather than leaving it open: SourceIdentifier.type is
scoped to the three catalogue-entity types the API needs (Work/Image/Item)
and the enum is extended on demand, rather than modelling the full registry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…question

Records that digitisation metadata ingestion fetches mostly unique ids while
the Items API is more likely to repeat requests, grounding the hit-ratio
sub-question in the concrete clients paul-butcher raised in review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…en question

Records paul-butcher's specific-sibling include idea (?include=sierra-system-number)
as a related projection to settle alongside the bare-value reverse lookup: more
cacheable, but only in the immutable new-to-old direction, returns a filtered set
given the one-to-many registry, and must return canonical with a 200 when the
requested sibling is absent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
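The specific-sibling include described in the commit above might behave like the sketch below. The function name, row shape and response keys are hypothetical, and the RFC later defers this projection, so treat it purely as an illustration of the two properties named: a filtered set, and a 200 with the canonical id when the requested sibling is absent.

```python
def reverse_lookup_with_include(canonical_id, rows, include):
    """Return (status, body) with only the siblings from the requested source system."""
    # The registry is one-to-many, so the filter can match zero, one or several rows.
    siblings = [row for row in rows if row["sourceSystem"] == include]
    # An absent sibling is not an error: the canonical id was still resolved.
    return 200, {"canonicalId": canonical_id, "identifiers": siblings}
```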
Remove the Lambda arch row from the decisions table and the two inline ARM64
references; we don't pin Lambda architecture elsewhere in the estate, so it
doesn't belong as a decision here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Records agnesgaroux's review point that a bare value is not only expensive to
index but can be genuinely ambiguous: the same SourceId can appear under
different source systems and resolve to different canonical ids, so a bare-value
query may have to return multiple candidates or force disambiguation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
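The ambiguity can be shown with two invented registry rows that share a source id value under different source systems. The row layout, function names and the "other-system" name are assumptions for illustration only.

```python
# (sourceSystem, source id, canonical id). All values are made up.
ROWS = [
    ("sierra-system-number", "1234567", "canon001"),
    ("other-system", "1234567", "canon002"),
]


def lookup_qualified(rows, source_system, source_id):
    """Lookup keyed on (sourceSystem, source id): at most one canonical id here."""
    for system, value, canonical_id in rows:
        if system == source_system and value == source_id:
            return canonical_id
    return None


def lookup_bare(rows, source_id):
    """Lookup on the bare value alone: may return several candidates."""
    return sorted({canonical_id for _, value, canonical_id in rows if value == source_id})
```

The qualified form resolves to a single canonical id, while the bare form returns both candidates and leaves the caller to disambiguate.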
Records agnesgaroux's review point: FOLIO records carry both a UUID and an HRID,
so confirm which the OAI-PMH feed delivers. The registry can hold both forms per
item, so the Minter could record both and this API would serve HRID <-> UUID
translation, but okapi resolves the two natively so storing both is an
optimisation, not a requirement. Decision sits with the catalogue-pipeline
workstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lift the two parallel Context concepts (canonical-first principle, the two
consumers) to subheadings, and rephrase the remaining standalone bold
sentence-starters (service boundary, schema finding, edge caching, freshness)
into formal prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Decide the isAlias vs obsolete question: the API exposes the full sibling set
with isAlias and does not model an obsolete flag. Records that the two are
different axes (in the Sierra->FOLIO migration the isAlias=false original is the
retired id and the isAlias=true alias is the live one), so this is a decision not
to model retired-ness rather than a claim that isAlias encodes it. Drops the now
settled item from the RFC 085 contract-edges next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…demand)

Reword Q6 from an open question into a recorded decision, matching Q5. Soften
the Open-questions intro now that two items are settled, and drop the type enum
from the RFC 085 contract-edges next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Settle Q7: do not hoist a convenience top-level type. A single top-level value
would have to pick one row's type for a mixed-type canonical id and could
contradict the others, so the per-row representation is kept and the mixed-type
ambiguity is left to consumers rather than resolved here. Drop Q7 from the RFC
085 next step, leaving only the bare-value lookup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Narrow Q4 against RFC 085's actual requirement (WorkID-level identity lookup,
sourceSystem as an optional qualifier, full sibling-set response, no new shape)
and decide not to add the unqualified bare form now: sourceSystem stays a
required key component, added only if a consumer explicitly requires it, the cost
(secondary SourceId index plus cross-system ambiguity) being the reason. The
related specific-sibling include is deferred on the same basis. Drop the now
resolved bare-value item from the next steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reverse open question 7: add a convenience top-level `type` to IdentifierSet,
populated from the original (the single isAlias=false) row, so consumers can read
a canonical id's type without scanning the set. Exactly one row is isAlias=false,
so the source is unambiguous; with cross-type predecessors the top-level value
reflects the original and may differ from a later alias. Update the spec (add the
property, required, enum), regenerate openapi.md, and add the field to the README
example and field docs.

Also a formatting pass on the open questions: prefix the resolved items (4-7)
with Decided and reword the intro, so it is clear at a glance which questions
carry a decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
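Deriving the top-level type from the original row can be sketched like this. The function name and the canonicalId/identifiers keys are assumptions. The rule itself (take the type of the single isAlias=false row) is the one stated in the commit above.

```python
def build_identifier_set(canonical_id, rows):
    """Assemble an IdentifierSet-like dict with a convenience top-level type."""
    originals = [row for row in rows if not row["isAlias"]]
    if len(originals) != 1:
        # The commit states exactly one row is isAlias=false; anything else is bad data.
        raise ValueError(f"expected exactly one original row, found {len(originals)}")
    return {
        "canonicalId": canonical_id,
        "type": originals[0]["type"],  # may differ from the type of a later alias
        "identifiers": rows,
    }
```

Because the value comes from the original row only, a cross-type predecessor alias does not change what consumers read at the top level.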
@kenoir kenoir marked this pull request as ready for review June 24, 2026 13:18
@kenoir kenoir requested review from a team as code owners June 24, 2026 13:18
