
RFC 090: CMS and LMS Sync#157

Open
anubhavijay wants to merge 5 commits into main from av/cms-lms-sync-rfc
@anubhavijay anubhavijay commented Jun 23, 2026

RFC 090: CMS to LMS Sync

What does this change?

Adds RFC 090: CMS to LMS Sync, proposing a new data pipeline to synchronize library location and item holdings from Axiell Collections (Content Management System) into FOLIO (Library Management System) with strict requirements for idempotency, auditability, and graceful error isolation.

The RFC describes an AWS-native architecture using EventBridge, Step Functions, and Lambda to upsert FOLIO inventory records on every Axiell adapter run (15-minute cadence, ~10–500 records per run).

Key components:

  • Axiell Data Feed: Existing OAI-PMH loader writes raw MARCXML changesets to Apache Iceberg tables on S3 (15-minute windows, 7-day lookback).
  • FOLIO Upserter Step: New Lambda-based transformer that:
    • Authenticates to FOLIO using cached credentials
    • Loads reference data cache (locations, material types, etc.) for field mapping
    • Reads changeset rows from Iceberg
    • Applies a declarative YAML mapper to convert MARCXML → FOLIO JSON (instance/holdings/item schemas)
    • Upserts records to FOLIO Inventory API with idempotency keys (create, update, or suppress)
    • Isolates errors per record so one failed item doesn't block the rest of the batch
  • Transformation Pipeline: Raw MARCXML → YAML Mapper → Schema Validation → FOLIO API Payload Builder
  • Design Rationale: Covers orchestration (Step Functions vs. alternatives), storage (S3 NDJSON vs. alternatives), invocation patterns (sync vs. async), and per-record error handling.
  • Upsert Strategy: Uses (sourceSystemId, barcode) tuples as idempotency keys to enable safe re-runs and manual recovery.
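
The per-record error isolation listed under the FOLIO Upserter Step could look roughly like the sketch below. It is illustrative only: `process_batch`, `BatchResult`, and `demo_upsert` are hypothetical names, and `demo_upsert` stands in for the real mapping and FOLIO call.

```python
from dataclasses import dataclass, field


class MappingError(Exception):
    """Raised when a record cannot be mapped to a FOLIO payload."""


@dataclass
class BatchResult:
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (record_id, reason) pairs


def process_batch(rows, upsert_one):
    """Apply upsert_one to every row, recording failures instead of aborting."""
    result = BatchResult()
    for row in rows:
        try:
            upsert_one(row)
            result.succeeded.append(row["id"])
        except Exception as exc:  # isolate any per-record failure
            result.failed.append((row["id"], f"{type(exc).__name__}: {exc}"))
    return result


def demo_upsert(row):
    # Hypothetical stand-in for the real mapping + FOLIO upsert.
    if not row.get("title"):
        raise MappingError("missing required field: title")


rows = [{"id": "a", "title": "One"}, {"id": "b"}, {"id": "c", "title": "Three"}]
result = process_batch(rows, demo_upsert)
```

Here record `b` ends up in `result.failed` with its reason, while `a` and `c` are still processed, which is the shape a failures manifest would need.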

Files added:

  • rfcs/090-axiell-folio-sync/README.md: the RFC document.
  • rfcs/README.md: refreshed RFC listing table (RFC 090 row added).

How to test

  • Read rfcs/090-axiell-folio-sync/README.md and review the architecture, transformation design, AWS infrastructure, design rationale, assumptions, and open questions.
  • Confirm the RFC passes repo validation: .scripts/validate_rfc.py.
  • Confirm the listing table is in sync: .scripts/create_table_summary.py --check-readme.
  • Review the Transformation Pipeline diagram (Mermaid flowchart) and confirm the stages are clear: YAML Mapper → Schema Validation → FOLIO API Payload Builder.

How can we measure success?

  • No measurable runtime success criteria; this is a documentation RFC. Success is the RFC being reviewed and providing a clear, self-contained contract and architecture that the team can align on.

Have we considered potential risks?

  • Documentation only; no production code or infrastructure is changed by this PR, so there is no runtime or deployment risk.

- Proposes Step Functions + Lambda + S3 architecture
- Details orchestration, storage, invocation design choices
- Compares alternatives with trade-off analysis
- Includes implementation details from working prototype
- Outlines future pipelining strategy for scale
- Cost estimates for current and future volumes
@anubhavijay anubhavijay requested review from a team as code owners June 23, 2026 08:05
@kenoir kenoir changed the title RFC 090 - CMS and LMS Sync RFC 090: CMS and LMS Sync Jun 23, 2026

@kenoir kenoir left a comment

Contributor

Thanks for this. The architecture, cost reasoning, and field-mapping detail give a solid base to work from. My inline comments cluster around a few correctness threads worth resolving before this is adopted as the agreed design:

  • Identity and deletions: the deletion path relies on OAI tombstones, but those are unreliable, which is why the reconciler exists, and the sync as drawn cannot see the reconciler's signal. Identity should be GUID-based throughout, since collectIds are reused.
  • Idempotency and replay: only the instance layer actually upserts; holdings and items are unconditional creates.
  • Ordering under concurrent runs needs a data-driven rule (a source-timestamp watermark) rather than the "adapter waits" framing.
  • A few mapping.yaml placeholders and internal inconsistencies (MARCXML versus JSON, volume figures) to tidy.

Inline comments cover each of these. A few editorial notes below as well.

Style

  • Avoid em-dashes as punctuation.
  • Remove emojis (for example the stars in the rationale tables).
  • Several sections (Background, Key Characteristics, Design Rationale) would read better as prose than as bullet lists.
  • All diagrams should be mermaid; the two ASCII box diagrams in System Architecture and Fan-out could be converted.

Framing and baseline

  • The current architecture isn't clearly described. It would help to draw a clear line between what the Axiell adapter does today and what this RFC adds. "Integration Gap" (line 80) reads oddly; stating it plainly would be clearer.
  • Consider clarifying whether the FOLIO upserter is conceptually another transformer, since that affects how a reader models the fan-out.
  • "Key Characteristics" would benefit from more context rather than a bare list.

Decisions that would benefit from justification

  • ref_cache: line 227 says it reloads on every invocation with no cross-invocation caching. Worth justifying, or reconsidering whether the FOLIO reference data can be cached across runs.
  • S3 Select (line 720) isn't used elsewhere in the org, so it needs more context or could be dropped as a justification.
  • NDJSON manifests are the existing pipeline pattern. If that's the reason for the choice, it would help to say so explicitly.
  • Worth confirming that the YAML mapper is a requirement, and whether "non-programmer-friendly" is a genuine goal here.
  • Errors to alerts: it would help to spell out the path, for example CloudWatch metric alarms to Slack via Amazon Q, following the existing pattern.
  • Consider sharing the FOLIO API client with existing catalogue-pipeline work rather than adding a new one under prototypes.

for row in iceberg_records:
    try:
        record = json.loads(row['content'])

Contributor

json.loads(row['content']) here doesn't match the rest of the document. The loader writes raw MARCXML to Iceberg, and mapping.yaml extracts fields with from_marc, so the content is MARCXML, not JSON. This step should parse the MARCXML (with pymarc, as the ES transformer already does) and then build the FOLIO JSON payload from the parsed record.

It would also help to record the provenance: the OIM-to-MARC mapping is an XSL transform configured inside Axiell Collections (the source API), which is what the "OIM Field" column in folio-axc-fields-mapping.md refers to. This pipeline doesn't run that transform itself.
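
As a rough illustration of parsing the row content as MARCXML rather than JSON, here is a minimal sketch. It deliberately uses the standard library's `xml.etree.ElementTree` instead of pymarc so that it runs without extra dependencies; the real step would presumably use pymarc as suggested above. The helper names (`subfield`, `parse_row`) and the sample record are hypothetical, and the sketch assumes `row['content']` holds a single `<record>` element in the MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield(record, tag, code):
    """Return the first value of tag$code in a MARCXML record element, or None."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text.strip() if node is not None and node.text else None


def parse_row(row):
    """Parse the MARCXML held in row['content'] and pull out two mapped fields."""
    record = ET.fromstring(row["content"])
    return {
        "title": subfield(record, "245", "a"),
        "location_code": subfield(record, "852", "b"),
    }


# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a"> An example title </subfield>
  </datafield>
  <datafield tag="852" ind1=" " ind2=" ">
    <subfield code="b">stores</subfield>
  </datafield>
</record>"""

parsed = parse_row({"content": SAMPLE})
```

The parsed values would then feed the mapper, which builds the FOLIO JSON payload.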

Comment on lines +532 to +538
Holdings (if Instance succeeded):
  POST /holdings-storage/holdings (create linked to Instance)
  Action: "create"

Item (if Holdings succeeded):
  POST /item-storage/items (create linked to Holdings)
  Action: "create"

Contributor

The document states that upserts are idempotent and replays are safe, but only the Instance does a GET-by-hrid before a PUT or POST. Holdings and Item are unconditional POST .../create. Replaying a changeset would then create duplicate holdings and items, or fail on hrid uniqueness. All three entities should use the same GET-by-hrid then PUT/POST pattern. The holdings and item hrids already provide the match key.

Separately, the delete semantics are an open question this RFC should raise rather than settle. The document uses "suppress" and "remove" interchangeably (line 448), and these are different operations in FOLIO. One option is discoverySuppress=true, which preserves the records and their links for audit, and we'd suggest that as a starting point, but it may not be the right choice. The RFC should flag this as needing an answer, including whether the action cascades from instance to holdings to item.
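
To make the suggested GET-by-hrid then PUT/POST pattern concrete for all three entities, here is a minimal sketch. `InMemoryStore` is a hypothetical stand-in for a FOLIO storage endpoint, not the real client; only the control flow is the point. The `instanceId` and `holdingsRecordId` link fields are assumed to be how holdings and items reference their parents.

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in for one FOLIO storage endpoint, keyed by record id."""

    def __init__(self):
        self.records = {}

    def get_by_hrid(self, hrid):
        return next((r for r in self.records.values() if r["hrid"] == hrid), None)

    def post(self, payload):
        record = {"id": str(uuid.uuid4()), **payload}
        self.records[record["id"]] = record
        return record

    def put(self, record_id, payload):
        self.records[record_id] = {"id": record_id, **payload}
        return self.records[record_id]


def upsert(store, payload):
    """GET by hrid, then PUT if found or POST if not. Returns (action, record)."""
    existing = store.get_by_hrid(payload["hrid"])
    if existing is None:
        return "create", store.post(payload)
    return "update", store.put(existing["id"], payload)


def sync_record(stores, instance, holdings, item):
    """Upsert instance, holdings and item in order, linking each to its parent."""
    actions = {}
    actions["instance"], inst = upsert(stores["instance"], instance)
    actions["holdings"], hold = upsert(
        stores["holdings"], {**holdings, "instanceId": inst["id"]}
    )
    actions["item"], _ = upsert(
        stores["item"], {**item, "holdingsRecordId": hold["id"]}
    )
    return actions


stores = {name: InMemoryStore() for name in ("instance", "holdings", "item")}
payloads = (
    {"hrid": "AxC-instance:1"},
    {"hrid": "AxC:1-holding-stores"},
    {"hrid": "AxC-item:1"},
)
first = sync_record(stores, *payloads)
second = sync_record(stores, *payloads)  # replay of the same changeset
```

With this shape, the replay reports `update` for all three entities and leaves one record per store, instead of creating duplicates.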

Comment on lines +741 to +763
### Invocation Pattern: Synchronous vs. Asynchronous

| Approach | Pros | Cons |
|----------|------|------|
| **A: Async** (EventBridge → Lambda, fire-and-forget) | Decoupled, adapter doesn't wait, low latency | No backpressure; if adapter emits 10 events in quick succession, Lambda might queue up (can overwhelm FOLIO); no failure signal |
| **B: Synchronous** ⭐ **CHOSEN** | Backpressure (adapter waits), clear success/failure, enables replay on failure | Adapter must wait ~45 sec for sync to complete before next event; if FOLIO is slow, adapter is blocked |

**Why we chose B (Synchronous):**

1. **Natural backpressure**: If FOLIO is slow or unavailable, Step Function times out → adapter pauses before emitting next event. No need for queue management.

2. **Failure visibility**: If sync fails, EventBridge knows → can retry the same changeset. No silent failures.

3. **Replay support**: If sync fails partway (e.g., 100 of 200 records succeeded before timeout), we can:
   - Query S3 manifest to see which records failed
   - Fix the issue (e.g., restart FOLIO, fix mapping rule)
   - Re-trigger sync with same changeset_ids
   - Idempotent upserts (via FOLIO HRID lookup) prevent duplication

4. **Adapter is designed for it**: Axiell adapter runs on 15-min cycles. Waiting 45 sec for sync fits within the cadence.

**Why not A (Async)?**
- Queue buildup: If adapter emits 10 changesets in quick succession, and each sync takes 45 sec, we'd have a 7-minute backlog. Async doesn't naturally handle this (we'd need manual queue monitoring).

Contributor

This section makes backpressure the main reason for the synchronous choice and describes the adapter as waiting for the sync to finish. That doesn't match the architecture elsewhere: the trigger is EventBridge to Step Function StartExecution (an asynchronous call), and the fan-out is described as decoupled (lines 149, 242). The adapter does not block, so there is no backpressure on it. The only synchronous part is Step Function to Lambda.

I'd suggest reframing this as a fire-and-forget asynchronous trigger, with the Step Function providing retry and execution visibility. The question this section should answer instead is ordering under concurrent or retried executions, since neither EventBridge delivery nor the fan-out guarantees order. One approach: enforce ordering per record by comparing the incoming source last_modified against a last-applied watermark stored on the FOLIO record, and apply only when the incoming value is strictly newer. That also makes replays a no-op. A Step Function concurrency limit of 1 would be a secondary guard against the gap between the GET and the PUT.
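
A minimal sketch of the watermark rule described above, assuming each change carries a timezone-aware `last_modified` and that the last-applied value can be read back per record. The `watermarks` dict is a hypothetical stand-in for whatever field on the FOLIO record would hold it.

```python
from datetime import datetime, timezone


def should_apply(incoming_last_modified, applied_watermark):
    """Apply only if the incoming change is strictly newer than the last one applied."""
    if applied_watermark is None:  # nothing applied yet for this record
        return True
    return incoming_last_modified > applied_watermark


def apply_changes(changes, watermarks):
    """Process changes in arrival order; return the (id, timestamp) pairs applied."""
    applied = []
    for change in changes:
        key, stamp = change["id"], change["last_modified"]
        if should_apply(stamp, watermarks.get(key)):
            watermarks[key] = stamp  # in FOLIO this would be written with the record
            applied.append((key, stamp))
    return applied


t1 = datetime(2026, 6, 23, 8, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 6, 23, 8, 15, tzinfo=timezone.utc)

# The newer change arrives first, then the older one, then a replay of the newer.
changes = [
    {"id": "g1", "last_modified": t2},
    {"id": "g1", "last_modified": t1},
    {"id": "g1", "last_modified": t2},
]
watermarks = {}
applied = apply_changes(changes, watermarks)
```

Only the first change is applied: the stale one is rejected and the replay is a no-op because the comparison is strict.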

Comment on lines +58 to +82
    permanentLocationId:
      from_marc: "852$b"
      default: "History of Medicine"
      lookup: location
      required: true

item:
  fields:
    hrid:
      template: "AxC:{source_id}"
    materialType.id:
      from_marc: "949$c"
      map: material_type
      default: "Books"
      lookup: materialType
      required: true
    permanentLoanType.id:
      from_marc: "949$l"
      default: "Can Circulate"
      lookup: loanType
      required: true
    permanentLocation.id:
      from_marc: "852$b"
      default: "History of Medicine"
      lookup: location

Contributor

These default values mean the required: true rule can never trigger, because the field is never null after a default is applied. So a record with no 852$b is filed under "History of Medicine", and an unmapped format becomes "Books", rather than being skipped and recorded in the failures manifest. That works against the auditability goal, since a missing location is a data problem worth surfacing rather than papering over. I'd suggest removing the defaults on these data-derived fields so MappingError is raised as the prose describes. The genuine constants (source: "AxC", status.name: "Available", holdings.sourceId: "MARC") can stay.
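
The interaction between `default` and `required` can be shown with a small sketch. It assumes the mapper resolves `from_marc` first, then falls back to `default`, and only then checks `required`. That evaluation order is an assumption about the mapper, and `resolve_field` is a hypothetical helper rather than the prototype's actual code.

```python
class MappingError(Exception):
    """Raised when a required field resolves to nothing."""


def resolve_field(name, rule, marc_values):
    """Resolve one mapping rule: from_marc first, then default, then the required check."""
    value = marc_values.get(rule.get("from_marc"))
    if value is None:
        value = rule.get("default")
    if value is None and rule.get("required"):
        raise MappingError(f"{name}: no value for {rule.get('from_marc')}")
    return value


with_default = {"from_marc": "852$b", "default": "History of Medicine", "required": True}
without_default = {"from_marc": "852$b", "required": True}

# A record with no 852$b: the default hides the missing data.
silently_defaulted = resolve_field("permanentLocationId", with_default, {})

# Same record, no default: the problem is surfaced as a MappingError.
try:
    resolve_field("permanentLocationId", without_default, {})
    raised = False
except MappingError:
    raised = True
```

With the default in place the required check is unreachable for that field; without it, the record can be skipped and written to the failures manifest.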

Author

Yes. At the moment these mappings are still being worked out, and the defaults are a workaround for populating the mandatory fields, since we don't have the mappings yet. For the purposes of the RFC, I will remove the defaults on fields that use a lookup and are marked required.

Comment on lines +815 to +822
| Service | Operation | Qty/mo | Rate | Cost/mo |
|---------|-----------|--------|------|---------|
| Lambda | 80 invocations × 300s × 1 GB | 80 invocations | $0.0000167/GB-s | ~$1.20 |
| Step Functions | 80 state transitions | 80 transitions | $0.000025/transition | ~$0.02 |
| EventBridge | 80 events | 80 events | $1/M events | ~$0.00 |
| S3 (manifests) | ~80 objects written, 90-day retention | 80 objects + storage | $0.023/K objects + $0.023/GB/mo | ~$1.50 |
| CloudWatch Logs | ~80 × 5 KB = 400 KB/month | 400 KB ingested | $0.50/GB ingested | ~$0.20 |
| **Total** | | | | **~$3–5** |

Contributor

The volume figures don't agree across the document. The Background says about 1,000 records per day across 80 syncs (though a 15-minute cadence is 96 runs per day, not 80). The storage comparison uses 5,000 records per day (lines 716, 731). This cost table uses 80 invocations per month, which at roughly 80 to 96 syncs per day should be about 2,400 to 2,900 per month. It would be clearer to pick one figure, label it as an estimate, and derive every section from it. The S3 versus DynamoDB conclusion still holds at the lower volume.
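
For reference, the run counts can be derived from the cadence directly. This is a small worked example assuming a 30-day month; the Lambda figure reuses the table's own 300 s × 1 GB duration and rate, so it should be read as an upper bound for illustration rather than a revised estimate.

```python
MINUTES_PER_DAY = 24 * 60
CADENCE_MINUTES = 15
DAYS_PER_MONTH = 30  # assumption, to get a round monthly figure

runs_per_day = MINUTES_PER_DAY // CADENCE_MINUTES  # 96
runs_per_month = runs_per_day * DAYS_PER_MONTH     # 2880


def lambda_cost(invocations, seconds=300, gb=1.0, rate_per_gb_s=0.0000167):
    """Monthly Lambda compute cost in USD, using the table's duration, memory and rate."""
    return invocations * seconds * gb * rate_per_gb_s


upper_bound = lambda_cost(runs_per_month)  # about 14.43 if every run used the full 300 s
```

Deriving every section from one named constant like `runs_per_day` would keep the Background, storage comparison, and cost table consistent.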

Comment on lines +37 to +67
    hrid:
      template: "AxC:{source_id}"
    title:
      from_marc: "245$a"
      transforms: [trim]
      required: true
    source:
      default: "AxC"
    instanceType.id:
      lookup: instanceType
      required: true

holdings:
  fields:
    hrid:
      template: "AxC:{source_id}-holding-{location_slug}"
    # FOLIO requires holdings sourceId; resolve it by name via RefCache.
    sourceId:
      default: "MARC"
      lookup: holdingsSource
      required: true
    permanentLocationId:
      from_marc: "852$b"
      default: "History of Medicine"
      lookup: location
      required: true

item:
  fields:
    hrid:
      template: "AxC:{source_id}"

Contributor

The Instance and Item are assigned the same hrid (AxC:{source_id}), which is ambiguous in logs and manifests. Adding a type prefix (for example AxC-instance: and AxC-item:) would disambiguate them, as the holdings hrid already does.

A related point on identity: this needs to be keyed on the GUID rather than the object number, because Axiell reuses collectIds and a reused id would otherwise overwrite the wrong FOLIO record (this connects to the reconciler comment). The document and the field-mapping table are currently inconsistent about whether MARC 001 holds the GUID or the object number, and that should be resolved in favour of the GUID.

Because matching is by hrid throughout, the "Upsert Key Strategy" ladder in README.md (lines 547 to 561), which lists GUID then barcode then a composite fallback, conflicts with the rest of the design. It would be clearer to simplify it to "match by hrid". It would also help to state the 1:1 instance-to-item assumption (which holds for AxC) under Assumptions, since the item hrid depends on it.
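
One way the type-prefixed, GUID-keyed hrids could be built is sketched below. The `AxC-instance:` and `AxC-item:` prefixes are the examples from this comment, the holdings template is the one already in mapping.yaml, and `build_hrids` is a hypothetical helper.

```python
def build_hrids(guid, location_slug):
    """Build type-prefixed hrids for one Axiell record, keyed on its GUID."""
    if not guid:
        raise ValueError("a GUID is required to build hrids")
    return {
        "instance": f"AxC-instance:{guid}",
        "holdings": f"AxC:{guid}-holding-{location_slug}",
        "item": f"AxC-item:{guid}",
    }


# Made-up GUID and location slug, for illustration only.
hrids = build_hrids("3f2a9c1e", "stores")
```

Because the item hrid carries only the GUID, this relies on the 1:1 instance-to-item assumption mentioned above.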

Comment on lines +436 to +448

| Signal | Source | Meaning |
|--------|--------|----------|
| Record in changeset | OAI-PMH datestamp window | Record was created or modified in source |
| `deleted=true` | OAI tombstone | Record was removed from source |
| Payload hash mismatch | XSL output comparison (optional) | FOLIO-relevant fields actually changed |
| Reconciler GUID remap | Axiell reconciler step | Old identity superseded, emit delete for old |

### Every Record in Changeset Is Either:

- **New** (first time this `id` appears in Iceberg) → create in FOLIO
- **Updated** (existing `id`, newer `last_modified`) → update in FOLIO
- **Deleted** (`deleted=true`) → suppress/remove in FOLIO

Contributor

The delete path here fires on OAI deleted=true, but OAI tombstones from Axiell are unreliable, which is the reason the existing reconciler was built. The reconciler detects deletions by tracking the collectId to guid mapping in its own Iceberg store, and emits a DeletedSourceWork for the superseded GUID when a collectId is remapped to a different work.

Two consequences for this design. First, the FOLIO upserter consumes the loader's changeset, so it would not see reconciler-detected deletes, which are produced by a separate, later transformer step. Second, the delete is keyed by the old GUID rather than by the changeset row being processed.

Suggested approach: have the reconciler step fan out a FOLIO suppression path that mirrors the loader's fan-out to the upsert path, reusing the existing reconciler rather than building a parallel mapping store. On a reconciler delete, suppress AxC:{old-guid} and cascade to holdings and item, in line with the delete-semantics question raised earlier. Separately, the line stating that FOLIO exposes an OAI-PMH feed (line 112) is not used anywhere in the design and could be removed.
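
As a sketch of what the suppression path might do on a reconciler delete, assuming `discoverySuppress=true` is the chosen semantics (still an open question) and that the action cascades through the `instanceId` and `holdingsRecordId` links. The plain lists stand in for FOLIO lookups, and `suppress_for_deleted_guid` is a hypothetical name.

```python
def suppress_for_deleted_guid(old_guid, instances, holdings, items):
    """Mark the instance for old_guid, and its holdings and items, as suppressed."""
    hrid = f"AxC:{old_guid}"
    instance = next((i for i in instances if i["hrid"] == hrid), None)
    if instance is None:
        return []  # nothing to suppress; treat as a no-op
    touched = [instance]
    linked_holdings = [h for h in holdings if h["instanceId"] == instance["id"]]
    holding_ids = {h["id"] for h in linked_holdings}
    touched += linked_holdings
    touched += [i for i in items if i["holdingsRecordId"] in holding_ids]
    for record in touched:
        record["discoverySuppress"] = True  # records and links are preserved
    return [r["hrid"] for r in touched]


# Made-up records, for illustration only.
instances = [{"id": "i1", "hrid": "AxC:old-guid"}]
holdings = [{"id": "h1", "hrid": "AxC:old-guid-holding-stores", "instanceId": "i1"}]
items = [
    {"id": "t1", "hrid": "AxC-item:old-guid", "holdingsRecordId": "h1"},
    {"id": "t2", "hrid": "AxC-item:other", "holdingsRecordId": "h9"},
]
suppressed = suppress_for_deleted_guid("old-guid", instances, holdings, items)
```

Records not linked to the superseded GUID are left untouched, and an unknown GUID is a no-op, so the path is safe to replay.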

Comment on lines +528 to +531
If exists: PUT (update)
Else: POST (create)
Action: "create" or "update"

Contributor

A consideration worth adding for the update path. Some FOLIO records will be backed by Source Record Storage (SRS) as MARC instances, for example anything migrated or loaded as MARC. For those records the bibliographic fields are controlled by the underlying SRS MARC record and cannot be updated through the mod-inventory instance API. Only administrative data is editable there, and MARC edits go through quickMARC, which writes to SRS and syncs the Inventory record. So a PUT to mod-inventory may fail or be silently ignored for SRS-backed instances, which the current update logic assumes will work.

This leads to a consistency decision the RFC should make explicit: which storage do we create records in, Inventory-native (FOLIO source) or MARC/SRS? Records created the two ways behave differently on update, so a mixed estate is harder to reason about and maintain.

It also affects the catalogue pipeline. The pipeline harvests FOLIO over OAI-PMH using the marc21_withholdings prefix (see catalogue_graph/src/adapters/extractors/oai_pmh/folio/config.py). Under that prefix the instance bib comes from SRS when an SRS record is present, or is generated on the fly from Inventory depending on the mod-oai-pmh record-source setting, while holdings and items come from Inventory. So whether records this sync creates appear in that feed, and in what form, depends on the storage type we choose together with the mod-oai-pmh configuration. Worth confirming the two line up before building.

If this can't be resolved quickly, it should be captured in the Open Questions section rather than left implicit, since it affects both the update path and what the catalogue pipeline sees.
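
Until that decision is made, the update path could at least guard against SRS-backed instances rather than assume the PUT will work. The sketch below assumes such instances can be recognised by a `source` value of `"MARC"`, which should be confirmed against the FOLIO deployment; `plan_instance_update` is a hypothetical helper.

```python
def plan_instance_update(existing, payload):
    """Decide how to handle an existing instance based on its `source` field."""
    if existing.get("source") == "MARC":
        # Assumed SRS-backed: bibliographic fields are controlled by the MARC record.
        return {
            "action": "skip",
            "reason": "SRS-backed instance; bibliographic fields not editable here",
        }
    return {"action": "update", "id": existing["id"], "body": {**existing, **payload}}


native = {"id": "i1", "hrid": "AxC:1", "source": "AxC", "title": "Old"}
srs_backed = {"id": "i2", "hrid": "AxC:2", "source": "MARC", "title": "Old"}

native_plan = plan_instance_update(native, {"title": "New"})
srs_plan = plan_instance_update(srs_backed, {"title": "New"})
```

A skipped update would be recorded in the manifest so that mixed-storage records are visible rather than silently ignored.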
