Proxy Updates #1

Open
A-Tarraf wants to merge 30 commits into besnardjb:master from A-Tarraf:main

A-Tarraf commented Jan 8, 2026

The new version of FTIO creates a prediction server that can be accessed via ZMQ and MessagePack.

Tim Deringer (@Tim-Dieringer) just finished his thesis, which covers several improvements to the proxy and its integration with FTIO. His changes are:

  • Added integration with FTIO, including configuration and visualization of FTIO on the web server.
  • Requests to FTIO are sent using ZMQ and MessagePack.
  • Added reconstruction of the original signal based on FTIO's output.
  • Added support for custom parameters for a single metric.
  • Implemented quality-of-life changes on the trace web page, such as different time formats and zooming/panning.
  • Added an argument to the root proxy for different topologies, as well as an argument for basic instrumentation to display the aggregation overhead from scraping child proxies.
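
As a rough illustration of the signal-reconstruction item above, a periodic signal can be rebuilt from a list of frequency components as a sum of cosines. This is a minimal sketch under stated assumptions, not FTIO's or the proxy's actual code: the `Component` tuple, its field names, the `reconstruct` function, and the `offset` parameter are all hypothetical, and FTIO's real output format may differ.

```python
import math
from typing import NamedTuple, Sequence


class Component(NamedTuple):
    """One frequency component (hypothetical layout, not FTIO's actual schema)."""
    freq_hz: float    # frequency in Hz
    amplitude: float  # peak amplitude
    phase_rad: float  # phase offset in radians


def reconstruct(components: Sequence[Component],
                times: Sequence[float],
                offset: float = 0.0) -> list[float]:
    """Evaluate offset + sum(A * cos(2*pi*f*t + phi)) at each time in `times`."""
    return [
        offset + sum(
            c.amplitude * math.cos(2.0 * math.pi * c.freq_hz * t + c.phase_rad)
            for c in components
        )
        for t in times
    ]


# Example: a single 0.5 Hz wave sampled every 0.5 s over one 2 s period.
wave = [Component(freq_hz=0.5, amplitude=2.0, phase_rad=0.0)]
samples = reconstruct(wave, [0.0, 0.5, 1.0, 1.5, 2.0], offset=10.0)
```

With these inputs the samples oscillate around the offset of 10.0 with a peak deviation of 2.0, which is the kind of curve the trace page could overlay on the measured metric.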

The changes to FTIO's integration are complementary to https://github.com/tuda-parallel/FTIO/tree/feature/metric_proxy_bindings

A-Tarraf and others added 28 commits April 25, 2025 16:06
…o, moved ftio visualization to mean value and changed ftio signal name to proper wave name
@A-Tarraf (Author)

We tested this version and it works great.

… expand

- New --auto-root / --root-url-dir flags: child proxies discover the root
  URL from a shared filesystem file (root.url) written by the root at startup.
  Root URL can also be injected via PROXY_ROOT_URL env var.
- Graceful leave: SIGTERM handler sends /leave?from=<url> to the root before
  exiting, triggering immediate TBON repair without waiting for a missed scrape.
- /leave HTTP endpoint on the root: removes the departing node from the topology
  and calls the existing self-repair logic to rewire the TBON.
- Fix NaN/null serialization: Gauge min/max/total fields that are NaN were
  serialized as JSON null, crashing the root when deserializing child scrapes.
  Fixed with a serialize_with helper that maps NaN/infinite to 0.0; the binary
  UNIX socket protocol (which does not support deserialize_any) is unaffected.
- Add experiment/run_malleability_test.sh: end-to-end test against a 4-node
  Docker cluster exercising graceful leave, shrink self-repair, and expand
  auto-join. Script checks prerequisites and auto-builds the binary if needed.
- Add experiment/README.md with quick-start, cluster setup link, expected
  output, and a table of what each step tests.
- Update README.md with a Malleability Support section documenting all new
  flags, endpoints, DMR integration guidance, and a pointer to the experiment.
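
The NaN/null fix described in this commit can be illustrated outside the proxy. The following is a minimal Python sketch of the same idea, not the proxy's Rust `serialize_with` helper: the `finite_or_zero` and `gauge_to_json` names and the gauge dictionary layout are hypothetical. Non-finite values are replaced with 0.0 before encoding, and `allow_nan=False` makes the standard `json` module reject any that slip through.

```python
import json
import math


def finite_or_zero(value: float) -> float:
    """Map NaN and +/-infinity to 0.0; pass finite values through unchanged."""
    return value if math.isfinite(value) else 0.0


def gauge_to_json(gauge: dict) -> str:
    """Serialize a gauge's numeric fields (hypothetical layout) as strict JSON."""
    cleaned = {name: finite_or_zero(value) for name, value in gauge.items()}
    # allow_nan=False raises ValueError instead of emitting the
    # non-standard NaN/Infinity tokens.
    return json.dumps(cleaned, allow_nan=False)


payload = gauge_to_json({"min": float("nan"), "max": float("inf"), "total": 3.5})
decoded = json.loads(payload)
```

One consequence of this approach is that a sanitized 0.0 cannot be told apart from a genuine zero on the receiving side, so it suits fields where a missing value and zero can be treated alike.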
Root cause: {{fnall}} in mpi_wrappers.w generates Fortran wrappers for
ALL MPI functions. OpenMPI 5.x added the MPI-4 Session API
(MPI_Session_*), whose Fortran wrappers use MPI_Session_f2c() — but that
returns an int handle while the C functions expect an MPI_Session*
pointer. GCC 15 treats this as an error.

Fixes:
1. exporters/mpi/mpi_wrappers.w — Added 9 problematic MPI-4 Session
   functions to the {{fnall}} exclusion list (they handle process-set
   management, not data transfer, so excluding them doesn't affect proxy
   measurement)
2. install.sh — Changed the mpicc step to use || error_out instead of
   checking file existence, so a real compilation failure is caught
   and reported