RFC 090: CMS and LMS Sync #157
- Proposes Step Functions + Lambda + S3 architecture
- Details orchestration, storage, invocation design choices
- Compares alternatives with trade-off analysis
- Includes implementation details from working prototype
- Outlines future pipelining strategy for scale
- Cost estimates for current and future volumes
kenoir
left a comment
Thanks for this. The architecture, cost reasoning, and field-mapping detail give a solid base to work from. My inline comments cluster around a few correctness threads worth resolving before this is adopted as the agreed design:
- Identity and deletions: the deletion path relies on OAI tombstones, but those are unreliable, which is why the reconciler exists, and the sync as drawn cannot see the reconciler's signal. Identity should be GUID-based throughout, since collectIds are reused.
- Idempotency and replay: only the instance layer actually upserts; holdings and items are unconditional creates.
- Ordering under concurrent runs needs a data-driven rule (a source-timestamp watermark) rather than the "adapter waits" framing.
- A few mapping.yaml placeholders and internal inconsistencies (MARCXML versus JSON, volume figures) to tidy.
Inline comments cover each of these. A few editorial notes below as well.
Style
- Avoid em-dashes as punctuation.
- Remove emojis (for example the stars in the rationale tables).
- Several sections (Background, Key Characteristics, Design Rationale) would read better as prose than as bullet lists.
- All diagrams should be mermaid; the two ASCII box diagrams in System Architecture and Fan-out could be converted.
Framing and baseline
- The current architecture isn't clearly described. It would help to draw a clear line between what the Axiell adapter does today and what this RFC adds. "Integration Gap" (line 80) reads oddly; stating it plainly would be clearer.
- Consider clarifying whether the FOLIO upserter is conceptually another transformer, since that affects how a reader models the fan-out.
- "Key Characteristics" would benefit from more context rather than a bare list.
Decisions that would benefit from justification
- ref_cache: line 227 says it reloads on every invocation with no cross-invocation caching. Worth justifying, or reconsidering whether the FOLIO reference data can be cached across runs.
- S3 Select (line 720) isn't used elsewhere in the org, so it needs more context or could be dropped as a justification.
- NDJSON manifests are the existing pipeline pattern. If that's the reason for the choice, it would help to say so explicitly.
- Worth confirming that the YAML mapper is a requirement, and whether "non-programmer-friendly" is a genuine goal here.
- Errors to alerts: it would help to spell out the path, for example CloudWatch metric alarms to Slack via Amazon Q, following the existing pattern.
- Consider sharing the FOLIO API client with existing catalogue-pipeline work rather than adding a new one under prototypes.
| for row in iceberg_records: | ||
| try: | ||
| record = json.loads(row['content']) | ||
|
|
json.loads(row['content']) here doesn't match the rest of the document. The loader writes raw MARCXML to Iceberg, and mapping.yaml extracts fields with from_marc, so the content is MARCXML, not JSON. This step should parse the MARCXML (with pymarc, as the ES transformer already does) and then build the FOLIO JSON payload from the parsed record.
It would also help to record the provenance: the OIM-to-MARC mapping is an XSL transform configured inside Axiell Collections (the source API), which is what the "OIM Field" column in folio-axc-fields-mapping.md refers to. This pipeline doesn't run that transform itself.
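To illustrate the shape of the fix, here is a minimal sketch that parses MARCXML and then builds the payload. It uses the standard library's ElementTree rather than pymarc so that it runs without dependencies, and it assumes each row holds a single `<record>` element in the MARC21 slim namespace. The helper names and the `source_id` key are hypothetical:

```python
import xml.etree.ElementTree as ET
from typing import Optional

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield(record: ET.Element, tag: str, code: str) -> Optional[str]:
    """Return the first value of a MARC subfield such as 245$a, or None."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text.strip() if node is not None and node.text else None


def build_instance_payload(marcxml: str) -> dict:
    """Parse raw MARCXML and build a partial instance payload from it."""
    record = ET.fromstring(marcxml)
    control_001 = record.find("marc:controlfield[@tag='001']", MARC_NS)
    return {
        "source_id": control_001.text if control_001 is not None else None,
        "title": subfield(record, "245", "a"),
    }
```

In the real step the parsed values would feed the mapping.yaml rules rather than a hand-built dict.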
| Holdings (if Instance succeeded): | ||
| POST /holdings-storage/holdings (create linked to Instance) | ||
| Action: "create" | ||
|
|
||
| Item (if Holdings succeeded): | ||
| POST /item-storage/items (create linked to Holdings) | ||
| Action: "create" |
The document states that upserts are idempotent and replays are safe, but only the Instance does a GET-by-hrid before a PUT or POST. Holdings and Item are unconditional POST .../create. Replaying a changeset would then create duplicate holdings and items, or fail on hrid uniqueness. All three entities should use the same GET-by-hrid then PUT/POST pattern. The holdings and item hrids already provide the match key.
Separately, the delete semantics are an open question this RFC should raise rather than settle. The document uses "suppress" and "remove" interchangeably (line 448), and these are different operations in FOLIO. One option is discoverySuppress=true, which preserves the records and their links for audit, and we'd suggest that as a starting point, but it may not be the right choice. The RFC should flag this as needing an answer, including whether the action cascades from instance to holdings to item.
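For the upsert point, a sketch of the same GET-by-hrid then PUT/POST pattern applied to any of the three entities. `FakeFolio` is an in-memory stand-in invented for illustration; it is not the FOLIO client and its method names are assumptions:

```python
from typing import Dict, Optional


class FakeFolio:
    """In-memory stand-in for the FOLIO storage APIs, for illustration only."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, dict]] = {}

    def get_by_hrid(self, entity: str, hrid: str) -> Optional[dict]:
        return self.records.get(entity, {}).get(hrid)

    def post(self, entity: str, payload: dict) -> None:
        self.records.setdefault(entity, {})[payload["hrid"]] = dict(payload)

    def put(self, entity: str, payload: dict) -> None:
        self.records[entity][payload["hrid"]] = dict(payload)


def upsert(client: FakeFolio, entity: str, payload: dict) -> str:
    """GET by hrid, then PUT if the record exists or POST if it does not."""
    if client.get_by_hrid(entity, payload["hrid"]) is not None:
        client.put(entity, payload)
        return "update"
    client.post(entity, payload)
    return "create"
```

Calling `upsert` twice with the same holdings or item payload returns "create" and then "update", so a replayed changeset leaves one record per hrid rather than a duplicate.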
| ### Invocation Pattern: Synchronous vs. Asynchronous | ||
|
|
||
| | Approach | Pros | Cons | | ||
| |----------|------|------| | ||
| | **A: Async** (EventBridge → Lambda, fire-and-forget) | Decoupled, adapter doesn't wait, low latency | No backpressure; if adapter emits 10 events in quick succession, Lambda might queue up (can overwhelm FOLIO); no failure signal | | ||
| | **B: Synchronous** ⭐ **CHOSEN** | Backpressure (adapter waits), clear success/failure, enables replay on failure | Adapter must wait ~45 sec for sync to complete before next event; if FOLIO is slow, adapter is blocked | | ||
|
|
||
| **Why we chose B (Synchronous):** | ||
|
|
||
| 1. **Natural backpressure**: If FOLIO is slow or unavailable, Step Function times out → adapter pauses before emitting next event. No need for queue management. | ||
|
|
||
| 2. **Failure visibility**: If sync fails, EventBridge knows → can retry the same changeset. No silent failures. | ||
|
|
||
| 3. **Replay support**: If sync fails partway (e.g., 100 of 200 records succeeded before timeout), we can: | ||
| - Query S3 manifest to see which records failed | ||
| - Fix the issue (e.g., restart FOLIO, fix mapping rule) | ||
| - Re-trigger sync with same changeset_ids | ||
| - Idempotent upserts (via FOLIO HRID lookup) prevent duplication | ||
|
|
||
| 4. **Adapter is designed for it**: Axiell adapter runs on 15-min cycles. Waiting 45 sec for sync fits within the cadence. | ||
|
|
||
| **Why not A (Async)?** | ||
| - Queue buildup: If adapter emits 10 changesets in quick succession, and each sync takes 45 sec, we'd have a 7-minute backlog. Async doesn't naturally handle this (we'd need manual queue monitoring). |
This section makes backpressure the main reason for the synchronous choice and describes the adapter as waiting for the sync to finish. That doesn't match the architecture elsewhere: the trigger is EventBridge to Step Function StartExecution (an asynchronous call), and the fan-out is described as decoupled (lines 149, 242). The adapter does not block, so there is no backpressure on it. The only synchronous part is Step Function to Lambda.
I'd suggest reframing this as a fire-and-forget asynchronous trigger, with the Step Function providing retry and execution visibility. The question this section should answer instead is ordering under concurrent or retried executions, since neither EventBridge delivery nor the fan-out guarantees order. One approach: enforce ordering per record by comparing the incoming source last_modified against a last-applied watermark stored on the FOLIO record, and apply only when the incoming value is strictly newer. That also makes replays a no-op. A Step Function concurrency limit of 1 would be a secondary guard against the gap between the GET and the PUT.
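The watermark rule can be sketched in a few lines. This assumes both timestamps are timezone-aware and in the same zone; where the watermark is stored on the FOLIO record is left open, and `should_apply` is a hypothetical name:

```python
from datetime import datetime
from typing import Optional


def should_apply(incoming: datetime, last_applied: Optional[datetime]) -> bool:
    """Apply a change only when it is strictly newer than the stored watermark.

    A record with no watermark has never been synced, so the change is applied.
    An equal timestamp is treated as a replay and skipped.
    """
    return last_applied is None or incoming > last_applied
```

Because the comparison is strict, replaying a changeset or receiving events out of order results in a skip rather than an overwrite with older data.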
| permanentLocationId: | ||
| from_marc: "852$b" | ||
| default: "History of Medicine" | ||
| lookup: location | ||
| required: true | ||
|
|
||
| item: | ||
| fields: | ||
| hrid: | ||
| template: "AxC:{source_id}" | ||
| materialType.id: | ||
| from_marc: "949$c" | ||
| map: material_type | ||
| default: "Books" | ||
| lookup: materialType | ||
| required: true | ||
| permanentLoanType.id: | ||
| from_marc: "949$l" | ||
| default: "Can Circulate" | ||
| lookup: loanType | ||
| required: true | ||
| permanentLocation.id: | ||
| from_marc: "852$b" | ||
| default: "History of Medicine" | ||
| lookup: location |
These default values mean the required: true rule can never trigger, because the field is never null after a default is applied. So a record with no 852$b is filed under "History of Medicine", and an unmapped format becomes "Books", rather than being skipped and recorded in the failures manifest. That works against the auditability goal, since a missing location is a data problem worth surfacing rather than papering over. I'd suggest removing the defaults on these data-derived fields so MappingError is raised as the prose describes. The genuine constants (source: "AxC", status.name: "Available", holdings.sourceId: "MARC") can stay.
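A small sketch of why the two keys interact this way, assuming the mapper applies `default` before it checks `required`. `resolve_field` is a hypothetical helper, not the prototype's mapper:

```python
from typing import Optional


class MappingError(Exception):
    """Raised when a required field cannot be resolved from the record."""


def resolve_field(name: str, value: Optional[str], rule: dict) -> Optional[str]:
    """Apply `default` first, then enforce `required`, using the YAML rule keys."""
    if value is None:
        value = rule.get("default")
    if value is None and rule.get("required", False):
        raise MappingError(f"{name}: required field has no value")
    return value
```

With `{"default": "History of Medicine", "required": True}` a missing 852$b resolves silently to the default; with `{"required": True}` alone it raises `MappingError`, which is the behaviour the prose describes.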
There was a problem hiding this comment.
Yes. At the moment these mappings are still being worked out, and the defaults are a workaround to populate the mandatory fields while we don't yet have the mappings. For the purposes of the RFC, I will remove the defaults from fields that rely on a map or lookup and are marked as required.
| | Service | Operation | Qty/mo | Rate | Cost/mo | | ||
| |---------|-----------|--------|------|---------| | ||
| | Lambda | 80 invocations × 300s × 1 GB | 80 invocations | $0.0000167/GB-s | ~$1.20 | | ||
| | Step Functions | 80 state transitions | 80 transitions | $0.000025/transition | ~$0.02 | | ||
| | EventBridge | 80 events | 80 events | $1/M events | ~$0.00 | | ||
| | S3 (manifests) | ~80 objects written, 90-day retention | 80 objects + storage | $0.023/K objects + $0.023/GB/mo | ~$1.50 | | ||
| | CloudWatch Logs | ~80 × 5 KB = 400 KB/month | 400 KB ingested | $0.50/GB ingested | ~$0.20 | | ||
| | **Total** | | | | **~$3–5** | |
The volume figures don't agree across the document. The Background says about 1,000 records per day across 80 syncs (though a 15-minute cadence is 96 runs per day, not 80). The storage comparison uses 5,000 records per day (lines 716, 731). This cost table uses 80 invocations per month, which at roughly 80 to 96 syncs per day should be about 2,400 to 2,900 per month (30 days at 80 or 96 syncs per day). It would be clearer to pick one figure, label it as an estimate, and derive every section from it. The S3 versus DynamoDB conclusion still holds at the lower volume.
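For reference, the arithmetic behind those run counts, with the Lambda line re-derived from the table's own per-invocation figures. The 30-day month is an assumption made only to get a round number:

```python
MINUTES_PER_DAY = 24 * 60
CADENCE_MINUTES = 15
DAYS_PER_MONTH = 30  # assumption for a round monthly figure

runs_per_day = MINUTES_PER_DAY // CADENCE_MINUTES
runs_per_month = runs_per_day * DAYS_PER_MONTH

# Lambda line of the cost table, using the table's stated duration (300 s),
# memory (1 GB) and rate ($0.0000167 per GB-second) per invocation.
lambda_cost = runs_per_month * 300 * 1 * 0.0000167

print(runs_per_day, runs_per_month, round(lambda_cost, 2))
```

This gives 96 runs per day and 2,880 per month. Deriving every row of the table from one such figure would keep the sections consistent.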
| hrid: | ||
| template: "AxC:{source_id}" | ||
| title: | ||
| from_marc: "245$a" | ||
| transforms: [trim] | ||
| required: true | ||
| source: | ||
| default: "AxC" | ||
| instanceType.id: | ||
| lookup: instanceType | ||
| required: true | ||
|
|
||
| holdings: | ||
| fields: | ||
| hrid: | ||
| template: "AxC:{source_id}-holding-{location_slug}" | ||
| # FOLIO requires holdings sourceId; resolve it by name via RefCache. | ||
| sourceId: | ||
| default: "MARC" | ||
| lookup: holdingsSource | ||
| required: true | ||
| permanentLocationId: | ||
| from_marc: "852$b" | ||
| default: "History of Medicine" | ||
| lookup: location | ||
| required: true | ||
|
|
||
| item: | ||
| fields: | ||
| hrid: | ||
| template: "AxC:{source_id}" |
The Instance and Item are assigned the same hrid (AxC:{source_id}), which is ambiguous in logs and manifests. Adding a type prefix (for example AxC-instance: and AxC-item:) would disambiguate them, as the holdings hrid already does.
A related point on identity: this needs to be keyed on the GUID rather than the object number, because Axiell reuses collectIds and a reused id would otherwise overwrite the wrong FOLIO record (this connects to the reconciler comment). The document and the field-mapping table are currently inconsistent about whether MARC 001 holds the GUID or the object number, and that should be resolved in favour of the GUID.
Because matching is by hrid throughout, the "Upsert Key Strategy" ladder in README.md (lines 547 to 561), which lists GUID then barcode then a composite fallback, conflicts with the rest of the design. It would be clearer to simplify it to "match by hrid". It would also help to state the 1:1 instance-to-item assumption (which holds for AxC) under Assumptions, since the item hrid depends on it.
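A sketch of the hrid scheme suggested above, keyed on the GUID. The function names are hypothetical; the instance and item prefixes are the ones proposed in this comment, and the holdings format follows the existing template:

```python
def instance_hrid(guid: str) -> str:
    """Type-prefixed instance hrid, keyed on the Axiell GUID."""
    return f"AxC-instance:{guid}"


def item_hrid(guid: str) -> str:
    """Type-prefixed item hrid; relies on the 1:1 instance-to-item assumption."""
    return f"AxC-item:{guid}"


def holdings_hrid(guid: str, location_slug: str) -> str:
    """Holdings hrid in the existing template's format, keyed on the GUID."""
    return f"AxC:{guid}-holding-{location_slug}"
```

With these, the three hrids for one source record are distinct strings, so a log line or manifest row identifies the entity type without extra context.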
|
|
||
| | Signal | Source | Meaning | | ||
| |--------|--------|----------| | ||
| | Record in changeset | OAI-PMH datestamp window | Record was created or modified in source | | ||
| | `deleted=true` | OAI tombstone | Record was removed from source | | ||
| | Payload hash mismatch | XSL output comparison (optional) | FOLIO-relevant fields actually changed | | ||
| | Reconciler GUID remap | Axiell reconciler step | Old identity superseded, emit delete for old | | ||
|
|
||
| ### Every Record in Changeset Is Either: | ||
|
|
||
| - **New** (first time this `id` appears in Iceberg) → create in FOLIO | ||
| - **Updated** (existing `id`, newer `last_modified`) → update in FOLIO | ||
| - **Deleted** (`deleted=true`) → suppress/remove in FOLIO |
The delete path here fires on OAI deleted=true, but OAI tombstones from Axiell are unreliable, which is the reason the existing reconciler was built. The reconciler detects deletions by tracking the collectId to guid mapping in its own Iceberg store, and emits a DeletedSourceWork for the superseded GUID when a collectId is remapped to a different work.
Two consequences for this design. First, the FOLIO upserter consumes the loader's changeset, so it would not see reconciler-detected deletes, which are produced by a separate, later transformer step. Second, the delete is keyed by the old GUID rather than by the changeset row being processed.
Suggested approach: have the reconciler step fan out a FOLIO suppression path that mirrors the loader's fan-out to the upsert path, reusing the existing reconciler rather than building a parallel mapping store. On a reconciler delete, suppress AxC:{old-guid} and cascade to holdings and item, in line with the delete-semantics question raised earlier. Separately, the line stating that FOLIO exposes an OAI-PMH feed (line 112) is not used anywhere in the design and could be removed.
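As an illustration of the cascade, here is a sketch over in-memory lists standing in for FOLIO. It assumes suppression means setting `discoverySuppress` (still an open question, as noted earlier) and that holdings link to instances via `instanceId` and items to holdings via `holdingsRecordId`. The function name is hypothetical:

```python
from typing import List


def suppress_cascade(
    instances: List[dict],
    holdings: List[dict],
    items: List[dict],
    instance_hrid: str,
) -> List[str]:
    """Suppress an instance and every holdings record and item linked to it.

    Returns the hrids that were suppressed, in cascade order.
    """
    suppressed: List[str] = []
    instance = next((r for r in instances if r["hrid"] == instance_hrid), None)
    if instance is None:
        return suppressed  # nothing to do if the old GUID was never synced
    instance["discoverySuppress"] = True
    suppressed.append(instance["hrid"])
    for holding in holdings:
        if holding["instanceId"] != instance["id"]:
            continue
        holding["discoverySuppress"] = True
        suppressed.append(holding["hrid"])
        for item in items:
            if item["holdingsRecordId"] == holding["id"]:
                item["discoverySuppress"] = True
                suppressed.append(item["hrid"])
    return suppressed
```

The returned list could be written to the manifest so that a reconciler-driven delete is as auditable as an upsert.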
| If exists: PUT (update) | ||
| Else: POST (create) | ||
| Action: "create" or "update" | ||
|
|
A consideration worth adding for the update path. Some FOLIO records will be backed by Source Record Storage (SRS) as MARC instances, for example anything migrated or loaded as MARC. For those records the bibliographic fields are controlled by the underlying SRS MARC record and cannot be updated through the mod-inventory instance API. Only administrative data is editable there, and MARC edits go through quickMARC, which writes to SRS and syncs the Inventory record. So a PUT to mod-inventory may fail or be silently ignored for SRS-backed instances, which the current update logic assumes will work.
This leads to a consistency decision the RFC should make explicit: which storage do we create records in, Inventory-native (FOLIO source) or MARC/SRS? Records created the two ways behave differently on update, so a mixed estate is harder to reason about and maintain.
It also affects the catalogue pipeline. The pipeline harvests FOLIO over OAI-PMH using the marc21_withholdings prefix (see catalogue_graph/src/adapters/extractors/oai_pmh/folio/config.py). Under that prefix the instance bib comes from SRS when an SRS record is present, or is generated on the fly from Inventory depending on the mod-oai-pmh record-source setting, while holdings and items come from Inventory. So whether records this sync creates appear in that feed, and in what form, depends on the storage type we choose together with the mod-oai-pmh configuration. Worth confirming the two line up before building.
If this can't be resolved quickly, it should be captured in the Open Questions section rather than left implicit, since it affects both the update path and what the catalogue pipeline sees.
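If a mixed estate is unavoidable, the update logic would need a branch along these lines. This is only a sketch: it assumes SRS-backed instances can be recognised by a `source` value of "MARC" on the Inventory record, which should be verified against our FOLIO instance, and `update_route` is a hypothetical name:

```python
def update_route(instance: dict) -> str:
    """Choose an update path from the existing instance's `source` value."""
    if instance.get("source") == "MARC":
        # Bibliographic fields are owned by the SRS MARC record, so a plain
        # PUT to the instance API should not be relied on for them.
        return "srs"
    # Inventory-native record: the instance API can update it directly.
    return "inventory"
```

Note that mapping.yaml currently sets `source` to "AxC" for new records, so records this sync creates would take the "inventory" route under this assumption.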
RFC 090: CMS to LMS Sync
What does this change?
Adds RFC 090: CMS to LMS Sync, proposing a new data pipeline to synchronize library location and item holdings from Axiell Collections (Content Management System) into FOLIO (Library Management System) with strict requirements for idempotency, auditability, and graceful error isolation.
The RFC describes an AWS-native architecture using EventBridge, Step Functions, and Lambda to upsert FOLIO inventory records on every Axiell adapter run (15-minute cadence, ~10–500 records per run).
Key components:
- (sourceSystemId, barcode) tuples as idempotency keys to enable safe re-runs and manual recovery.
Files added:
- rfcs/090-axiell-folio-sync/README.md: the RFC document.
- rfcs/README.md: refreshed RFC listing table (RFC 090 row added).
How to test
- rfcs/090-axiell-folio-sync/README.md and review the architecture, transformation design, AWS infrastructure, design rationale, assumptions, and open questions.
- .scripts/validate_rfc.py
- .scripts/create_table_summary.py --check-readme
How can we measure success?
Have we considered potential risks?