feat: monthly global spend litellm database materialized view migration #154

Open
noahpodgurski wants to merge 2 commits into main from monthly-global-spend-materialized-view

Collaborator

@noahpodgurski noahpodgurski commented May 21, 2026

What's new

Results:

$ python scripts/benchmark_spendlogs_table.py --skip-load --benchmark-target monthly-view
[setup] connecting to postgresql://litellm:litellm@localhost:5432/litellm
[summary] current size=1.00 GiB rows=470,000 target=1.00 GiB
[benchmark] server-side EXPLAIN ANALYZE:
  Aggregate  (cost=124348.77..124348.78 rows=1 width=8) (actual time=92.098..94.902 rows=1 loops=1)
    Buffers: shared hit=14642 read=102874 written=54
    ->  Finalize GroupAggregate  (cost=123592.56..124313.54 rows=2818 width=12) (actual time=92.079..94.897 rows=30 loops=1)
          Group Key: (date("LiteLLM_SpendLogs"."startTime"))
          Buffers: shared hit=14642 read=102874 written=54
          ->  Gather Merge  (cost=123592.56..124250.14 rows=5636 width=12) (actual time=92.072..94.884 rows=90 loops=1)
                Workers Planned: 2
                Workers Launched: 2
                Buffers: shared hit=14642 read=102874 written=54
                ->  Sort  (cost=122592.53..122599.58 rows=2818 width=12) (actual time=83.697..83.698 rows=30 loops=3)
                      Sort Key: (date("LiteLLM_SpendLogs"."startTime"))
                      Sort Method: quicksort  Memory: 26kB
                      Buffers: shared hit=14642 read=102874 written=54
                      Worker 0:  Sort Method: quicksort  Memory: 26kB
                      Worker 1:  Sort Method: quicksort  Memory: 26kB
                      ->  Partial HashAggregate  (cost=122395.83..122431.06 rows=2818 width=12) (actual time=83.662..83.672 rows=30 loops=3)
                            Group Key: date("LiteLLM_SpendLogs"."startTime")
                            Batches: 1  Memory Usage: 121kB
                            Buffers: shared hit=14626 read=102874 written=54
                            Worker 0:  Batches: 1  Memory Usage: 121kB
                            Worker 1:  Batches: 1  Memory Usage: 121kB
                            ->  Parallel Seq Scan on "LiteLLM_SpendLogs"  (cost=0.00..121416.67 rows=195833 width=12) (actual time=3.765..71.998 rows=156667 loops=3)
                                  Filter: ("startTime" >= (CURRENT_DATE - '30 days'::interval))
                                  Buffers: shared hit=14626 read=102874 written=54
  Planning:
    Buffers: shared hit=199 read=1
  Planning Time: 0.486 ms
  JIT:
    Functions: 29
    Options: Inlining false, Optimization false, Expressions true, Deforming true
    Timing: Generation 0.947 ms, Inlining 0.000 ms, Optimization 0.570 ms, Emission 10.660 ms, Total 12.177 ms
  Execution Time: 104.687 ms
[benchmark] monthly view query elapsed=0.081291s total_spend=23514392.764500007
$ sh scripts/migrate-litellm-database.sh
...
$ python scripts/benchmark_spendlogs_table.py --skip-load --benchmark-target monthly-view
[setup] connecting to postgresql://litellm:litellm@localhost:5432/litellm
[summary] current size=1.00 GiB rows=470,000 target=1.00 GiB
[benchmark] server-side EXPLAIN ANALYZE:
  Aggregate  (cost=1.01..1.02 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=1)
    Buffers: shared hit=1
    ->  Seq Scan on monthly_global_spend_cache  (cost=0.00..1.01 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=1)
          Buffers: shared hit=1
  Planning:
    Buffers: shared hit=61 read=1
  Planning Time: 0.261 ms
  Execution Time: 0.035 ms
[benchmark] monthly view query elapsed=0.000405s total_spend=23514392.764500115

Summary (with 1 GiB of data in the table):

Before migration, execution time: 104.687 ms
After migration, execution time: 0.035 ms
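
For a quick sense of scale, the figures above can be compared directly. This is only a sketch: the values are copied from the benchmark output, and the variable names are made up for illustration. The two total_spend values differ only in their last few digits, which is the kind of gap a different floating-point summation order can produce (an assumption; the cause was not investigated here).

```python
# Server-side execution times reported by EXPLAIN ANALYZE, in milliseconds.
before_ms = 104.687
after_ms = 0.035
server_speedup = before_ms / after_ms  # roughly 3,000x

# Client-side elapsed times reported by the benchmark script, in seconds.
before_elapsed_s = 0.081291
after_elapsed_s = 0.000405
client_speedup = before_elapsed_s / after_elapsed_s  # roughly 200x

# total_spend before and after the migration.
total_before = 23514392.764500007
total_after = 23514392.764500115
relative_diff = abs(total_after - total_before) / total_before

print(f"server-side speedup: {server_speedup:.0f}x")
print(f"client-side speedup: {client_speedup:.0f}x")
print(f"relative difference in total_spend: {relative_diff:.2e}")
```

The client-side ratio is smaller because connection and round-trip overhead dominate once the query itself takes microseconds.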

@noahpodgurski noahpodgurski requested a review from a team as a code owner May 21, 2026 15:05
Contributor

ti3x commented May 21, 2026

[Critical] Don't DROP/CREATE the wrapper view on every refresh

The CronJob in dataservices-infra#1968 invokes this script with the refresh action every 5 minutes, so the current logic re-runs DROP VIEW IF EXISTS "MonthlyGlobalSpend" ... CREATE VIEW ... on every run. That causes two problems:

  1. Every refresh takes an ACCESS EXCLUSIVE lock on MonthlyGlobalSpend and leaves a brief window where queries against it fail with relation does not exist.
  2. DROP VIEW IF EXISTS (without CASCADE) will error if anything else depends on MonthlyGlobalSpend. Adding CASCADE would silently drop those dependents, which is worse. Worth running this once in each env first:
    SELECT DISTINCT dependent.relname, dependent.relkind
    FROM pg_depend d
    JOIN pg_rewrite r        ON r.oid = d.objid
    JOIN pg_class  dependent ON dependent.oid = r.ev_class
    WHERE d.refobjid = 'public."MonthlyGlobalSpend"'::regclass
      AND dependent.relname <> 'MonthlyGlobalSpend';

Both go away if the refresh path skips DDL entirely:

ACTION="${1:-migrate}"

if [[ "${ACTION}" == "refresh" ]]; then
  if ! psql -d "${LiteLLM_DB_NAME}" -tAc \
      "SELECT to_regclass('public.monthly_global_spend_cache') IS NOT NULL" \
      | grep -qx t; then
    echo "[mlpa-litellm-migrate] cache missing; run migrate first. Skipping."
    exit 0
  fi
  psql -v ON_ERROR_STOP=1 -d "${LiteLLM_DB_NAME}" -c \
    'REFRESH MATERIALIZED VIEW CONCURRENTLY public.monthly_global_spend_cache;'
  exit 0
fi

# existing schema-setup block runs only on the migrate path

The DROP/CREATE then runs once at deploy time instead of 8,640 times a month, and the dependency risk is limited to the controlled migrate window.
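
The 8,640 figure follows from the 5-minute schedule, assuming a 30-day month (the names below are illustrative only):

```python
# Refresh cadence from the CronJob: one run every 5 minutes.
interval_minutes = 5
runs_per_hour = 60 // interval_minutes   # 12
runs_per_day = runs_per_hour * 24        # 288
runs_per_month = runs_per_day * 30       # assumes a 30-day month

print(runs_per_month)
```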

Contributor

ti3x commented May 21, 2026

[Medium] IF NOT EXISTS will silently keep the old MV definition

Postgres has no CREATE OR REPLACE MATERIALIZED VIEW. If the SELECT inside CREATE MATERIALIZED VIEW IF NOT EXISTS public.monthly_global_spend_cache AS ... ever changes (new column, different window, etc.), this script will quietly keep the existing definition and the change won't take effect.

Suggest a comment in the script so the next person knows what's needed:

-- NOTE: Postgres has no CREATE OR REPLACE MATERIALIZED VIEW. To change the
-- SELECT below, add a one-shot migration step that runs
--   DROP MATERIALIZED VIEW IF EXISTS public.monthly_global_spend_cache;
-- before this block, otherwise the new definition will be ignored.
CREATE MATERIALIZED VIEW IF NOT EXISTS public.monthly_global_spend_cache AS
  ...

Pure documentation — no behavior change today, just saves a future debugging session.
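
To make the failure mode concrete, here is a minimal sketch of the IF NOT EXISTS behavior. It uses SQLite from the Python standard library as a stand-in, since SQLite has no materialized views; a plain view named spend_cache (a made-up name) plays the role of the MV. The point is only that a second CREATE ... IF NOT EXISTS with a different SELECT is a no-op, which is the same semantics as in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First deploy: the view is created with a 30-day window.
conn.execute("CREATE VIEW IF NOT EXISTS spend_cache AS SELECT 30 AS window_days")

# Later deploy: the SELECT changed, but the object already exists,
# so this statement does nothing and raises no error.
conn.execute("CREATE VIEW IF NOT EXISTS spend_cache AS SELECT 7 AS window_days")
after_rerun = conn.execute("SELECT window_days FROM spend_cache").fetchone()[0]

# One-shot migration step: drop first, then the new definition takes effect.
conn.execute("DROP VIEW IF EXISTS spend_cache")
conn.execute("CREATE VIEW IF NOT EXISTS spend_cache AS SELECT 7 AS window_days")
after_drop = conn.execute("SELECT window_days FROM spend_cache").fetchone()[0]

print(after_rerun, after_drop)
```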

Contributor

ti3x commented May 22, 2026

[Worth flagging — likely not an issue] Wrapper view column/row shape change

The new wrapper view exposes a single spend column with one row, while the original (per the Group Key: date("startTime") and ~30 rows in the EXPLAIN) returns (date, spend) × ~30 rows.

SELECT SUM(spend) FROM "MonthlyGlobalSpend" keeps working — that's why the benchmark numbers match. But any caller doing SELECT date, spend FROM "MonthlyGlobalSpend" (time-series UI, daily breakdown chart, etc.) would break.

Probably fine if LiteLLM only uses the SUM path — but worth a quick check before this lands:

# from the LiteLLM source (and any internal callers)
git grep -i 'MonthlyGlobalSpend'

If everything is summing, no action needed. If anything is selecting the daily rows, the MV would need to keep GROUP BY date("startTime") and use date as the unique-index column.
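
A small sketch of the shape difference, again using SQLite from the Python standard library as a stand-in for Postgres. The table, view names, and sample rows are all hypothetical; only the two result shapes mirror the ones described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend_logs (start_time TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO spend_logs VALUES (?, ?)",
    [
        ("2026-05-20 09:00:00", 1.5),
        ("2026-05-20 17:30:00", 2.5),
        ("2026-05-21 08:15:00", 4.0),
    ],
)

# Original shape: one (date, spend) row per day.
conn.execute(
    "CREATE VIEW daily_shape AS "
    "SELECT date(start_time) AS date, SUM(spend) AS spend "
    "FROM spend_logs GROUP BY date(start_time)"
)
# New shape: a single row with only a spend column.
conn.execute("CREATE VIEW total_shape AS SELECT SUM(spend) AS spend FROM spend_logs")

# The SUM path works against both shapes and gives the same total.
daily_total = conn.execute("SELECT SUM(spend) FROM daily_shape").fetchone()[0]
single_total = conn.execute("SELECT SUM(spend) FROM total_shape").fetchone()[0]

# A caller that selects the daily rows breaks against the new shape.
try:
    conn.execute("SELECT date, spend FROM total_shape").fetchall()
    daily_query_ok = True
except sqlite3.OperationalError:
    daily_query_ok = False

print(daily_total, single_total, daily_query_ok)
```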
