34 changes: 33 additions & 1 deletion docs/getting-started/advanced-topics/scaling.md
@@ -28,7 +28,11 @@ This is perfect for personal use, small teams, or evaluation. The scaling journey

## Step 1 — Switch to PostgreSQL

**When:** You plan to run more than one Open WebUI instance, or you want better performance and reliability for your database. **You should also switch if your SQLite file lives on anything other than a locally-attached SSD/NVMe** — see the callout below.

:::tip You don't need this step if you're running a single replica on local disk
**Staying on SQLite is fine for:** single-replica deployments, personal use, evaluation, home lab setups, and small teams — **as long as the database file lives on a locally-attached SSD/NVMe and you're not running multiple replicas or workers.** The 0.8 → 0.9 async-backend story only bites when `webui.db` is on network storage; on local disk, SQLite is fast, supported, and a perfectly reasonable default. No migration needed. Skip this step and move on to whichever later step you actually need.
:::

SQLite stores everything in a single file and doesn't handle concurrent writes from multiple processes well. PostgreSQL is a production-grade database that supports many simultaneous connections.

@@ -51,6 +55,34 @@ DATABASE_URL=postgresql://user:password@db-host:5432/openwebui
A good starting point for tuning is `DATABASE_POOL_SIZE=15` and `DATABASE_POOL_MAX_OVERFLOW=20`. Keep the combined total per instance well below your PostgreSQL `max_connections` limit (default is 100).
:::
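As a sanity check on that guidance, the arithmetic can be sketched directly. The replica count below is hypothetical; plug in your own deployment's numbers:

```python
# Back-of-envelope connection budget; all counts are illustrative, not a recommendation.
pool_size = 15        # DATABASE_POOL_SIZE
max_overflow = 20     # DATABASE_POOL_MAX_OVERFLOW
replicas = 2          # hypothetical number of Open WebUI instances

per_instance = pool_size + max_overflow  # worst case one instance may open
total = per_instance * replicas

postgres_max_connections = 100  # PostgreSQL's default limit

print(f"{per_instance=} {total=}")  # per_instance=35 total=70
assert total < postgres_max_connections, "shrink the pool or raise max_connections"
```

If the assertion fails for your replica count, lower the pool settings or raise `max_connections` on the PostgreSQL side before scaling out.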

### Why SQLite on network storage fails the moment you scale (or upgrade)

Since 0.9.0 the backend data layer is **fully async** (async SQLAlchemy + `aiosqlite`). That change made Open WebUI dramatically more concurrent — and, as a side effect, made every pre-existing "SQLite is slow on NFS/CephFS/Azure Files" problem go from *tolerable* to *fatal* overnight. Many operators hit this right after upgrading from 0.8.x without changing anything else in their deployment.

The mechanism in one paragraph: SQLite's durability guarantee is `fsync()` on every commit. On local SSD that's ~100 μs. On NFS / CephFS / Azure Files / Kubernetes PVCs backed by network storage that's 50–500 ms, sometimes seconds. In the old sync backend, FastAPI's ~40-thread worker pool acted as a natural throttle, so slow storage meant "slow app." In the async backend there's no thread-pool ceiling — the asyncio loop schedules thousands of DB coroutines in parallel, every slow `fsync` keeps a connection checked out for the full duration, and the SQLAlchemy async pool (default `pool_size=5` + `max_overflow=10` = 15 connections) saturates almost instantly. You then see:

```
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00
```

Making the pool bigger just moves the breaking point. More connections means more concurrent slow `fsync`s hitting the same slow storage; the filesystem is still the bottleneck.
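Little's law makes the saturation arithmetic concrete: the average number of connections held open equals the commit rate times the per-commit latency. A sketch with illustrative numbers (the rates and latencies are assumptions, not measurements):

```python
# In-flight DB connections ≈ commit rate × per-commit latency (Little's law).
# All numbers below are illustrative assumptions, not measurements.
def connections_in_flight(commits_per_sec: float, fsync_latency_sec: float) -> float:
    return commits_per_sec * fsync_latency_sec

pool_limit = 5 + 10  # default pool_size + max_overflow

local_ssd = connections_in_flight(200, 0.0001)  # ~100 µs fsync
nfs = connections_in_flight(200, 0.2)           # ~200 ms fsync

print(f"local SSD: {local_ssd:.2f} in flight; NFS: {nfs:.0f} in flight")
assert local_ssd < pool_limit  # pool barely touched
assert nfs > pool_limit        # ~40 > 15: every slot stays busy, requests queue
```

The same workload that keeps a fraction of one connection busy on local disk needs roughly 40 concurrent connections on NFS, which is why the default 15-slot pool saturates the moment traffic arrives.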

On top of that, SQLite's WAL mode relies on a memory-mapped `-shm` file for cross-process coordination, and `mmap` over NFS is [officially unreliable per SQLite upstream](https://www.sqlite.org/faq.html#q5) — with high async concurrency it can produce actual locking pathologies (deadlocks, `PRAGMA journal_mode=WAL` that starts but never completes, multi-minute stalls on trivial queries).

**There is no setting that fixes this while SQLite stays on network storage.** The three options are:

1. **Best — switch to PostgreSQL (this step).** The DB server manages its own I/O against its own local storage. Your app reaches it over a network socket, but that hop is orders of magnitude cheaper than NFS `fsync`, and Postgres was designed from day one for concurrent writers. This is the only supported configuration for multi-replica, multi-user, or Kubernetes/Swarm deployments.
2. **Move `webui.db` off network storage onto a local SSD/NVMe.** Only appropriate for single-node, low-user deployments. Your RAG files and uploads on NFS are fine — SQLite specifically is the problem, not the shared filesystem in general.
3. **Temporary workaround if you cannot do either yet:**
```bash
DATABASE_POOL_SIZE=1
DATABASE_SQLITE_PRAGMA_BUSY_TIMEOUT=30000
```
Serializes to a single async connection, trading concurrency for stability. **Not supported long-term** — plan the real migration.

The short version: sync backends throttled concurrency through thread pools, so slow storage just made things *slow*. Async backends allow massive concurrency, which means slow `fsync`s stack up, connections stay checked out longer, the pool saturates, and the whole thing wedges. The same storage was tolerable before because the app wasn't asking it to do 20 concurrent `fsync`s.

---

## Step 2 — Add Redis
Expand Down
31 changes: 31 additions & 0 deletions docs/reference/env-configuration.mdx
@@ -6359,6 +6359,26 @@ For configuration using individual parameters or encrypted SQLite, see the relev

:::

:::tip Single-user / single-replica on local disk? The default is fine.
The default `sqlite:///${DATA_DIR}/webui.db` is a perfectly good choice for personal use, evaluation, home-lab setups, and small single-replica deployments — **as long as `DATA_DIR` lives on a locally-attached SSD/NVMe**. No migration needed. The rest of this callout is for operators running on network storage or scaling out.
:::

:::danger SQLite is only supported on locally-attached SSD/NVMe
Running SQLite on **NFS, CIFS/SMB, Azure Files, GlusterFS, CephFS, or any Kubernetes PersistentVolumeClaim backed by network storage is not supported** — this is SQLite upstream's own position ([SQLite FAQ #5](https://www.sqlite.org/faq.html#q5)), and under the async backend (0.9.0+) it will produce pool-timeout errors (`QueuePool limit of size N overflow M reached`), multi-minute stalls, WAL deadlocks, and potential database corruption.

**For multi-replica, multi-worker, Kubernetes, Docker Swarm, or any deployment where `DATA_DIR` is on shared/network storage, set `DATABASE_URL` to a PostgreSQL URL:**

```bash
DATABASE_URL=postgresql+asyncpg://user:password@host:5432/openwebui
```

See [Performance → Disk I/O Latency](/troubleshooting/performance#disk-io-latency-sqlite--storage) for the full mechanism (fsync + async pool saturation) and [Scaling → Step 1](/getting-started/advanced-topics/scaling#step-1--switch-to-postgresql) for migration guidance.
:::

:::warning Multi-replica / multi-worker deployments REQUIRE PostgreSQL
For multi-replica or high-availability deployments (Kubernetes, Docker Swarm), you **must** use an external database (PostgreSQL) instead of SQLite. SQLite does not support concurrent writes from multiple instances and will result in database corruption or data inconsistency. A shared SQLite file on NFS does not count as "supported" — it will still corrupt or deadlock. See the scaling guide linked above.
:::

#### `ENABLE_DB_MIGRATIONS`

- Type: `bool`
@@ -6522,6 +6542,17 @@ When calculating pool settings, always account for this multiplier to avoid exhausting

:::

:::warning SQLite on NFS / network storage — increasing the pool does not help
If you are seeing `QueuePool limit of size N overflow M reached, connection timed out` errors on SQLite and the database file lives on NFS, CephFS, Azure Files, an SMB mount, or any network-backed Kubernetes PVC, **increasing this value will not fix it**. The root cause is slow `fsync` on the network filesystem, which keeps each connection checked out far longer than on local disk; more pool slots just means more concurrent slow `fsync`s against the same slow storage. The async backend saturates any size pool you give it.

The only real fixes are:
1. Migrate to PostgreSQL (`DATABASE_URL=postgresql+asyncpg://...`) — strongly recommended.
2. Move `webui.db` to a locally-attached SSD/NVMe.
3. Temporary workaround: `DATABASE_POOL_SIZE=1` + `DATABASE_SQLITE_PRAGMA_BUSY_TIMEOUT=30000` — serializes DB access, not a supported long-term config.

See [Performance → Disk I/O Latency](/troubleshooting/performance#disk-io-latency-sqlite--storage) for the mechanism, and [Scaling → Step 1](/getting-started/advanced-topics/scaling#step-1--switch-to-postgresql) for migration guidance.
:::

#### `DATABASE_POOL_MAX_OVERFLOW`

- Type: `int`
30 changes: 24 additions & 6 deletions docs/troubleshooting/multi-replica.mdx
@@ -95,16 +95,34 @@ REDIS_URL=redis://your-redis-host:6379/0
**Symptoms:**
- Logs show `database is locked` or severe SQL errors.
- Data saved on one instance disappears on another.
- `sqlalchemy.exc.TimeoutError: QueuePool limit of size N overflow M reached, connection timed out, timeout 30.00` on **every** request after a short warm-up — not just at peak load.
- `/api/config`, `/api/v1/chats/?page=1`, OIDC callbacks all stall for 10 s to multiple minutes.
- `PRAGMA journal_mode=WAL` logged as starting but never completing.
- Problems appeared suddenly after the 0.8.x → 0.9.x upgrade without any other change in the deployment.

**Cause:**
Using **SQLite** with multiple replicas — or with a single replica whose `webui.db` lives on a network filesystem. SQLite is a file-based database; its file locking [does not work reliably over NFS / CIFS / CephFS / Azure Files / any network PVC](https://www.sqlite.org/faq.html#q5) (this is SQLite upstream's own position, not an Open WebUI policy). With the async backend introduced in 0.9.0, this turns from "slow and occasionally locked" into a hard failure mode because slow network `fsync`s hold each pool connection hostage long enough to saturate the connection pool on every request.

For the full mechanism (fsync latency + async concurrency + pool saturation + WAL-on-mmap-on-NFS), see [Performance → Disk I/O Latency](/troubleshooting/performance#disk-io-latency-sqlite--storage).

**Solution — in order of correctness:**

1. **Migrate to PostgreSQL (strongly recommended, required for multi-replica):**
```bash
DATABASE_URL=postgresql+asyncpg://user:password@postgres-host:5432/openwebui
```
For Kubernetes / Docker Swarm this is effectively mandatory. Postgres manages its own I/O against its own local storage, so the network-`fsync` pathology disappears entirely. See [Scaling → Step 1](/getting-started/advanced-topics/scaling#step-1--switch-to-postgresql) for the full migration steps.

2. **If you're on a single instance and can move the DB:** put `webui.db` on a locally-attached SSD/NVMe (host bind mount, node-local volume, ephemeral disk) — not on the same NFS/Ceph/EFS mount you use for uploads and RAG files. Your shared storage for `/app/backend/data` is fine; SQLite specifically is the problem.

3. **Do NOT just increase `DATABASE_POOL_SIZE`.** A bigger pool doesn't fix slow `fsync`; it just schedules more concurrent slow `fsync`s against the same slow storage and moves the breaking point by a few seconds. It treats the symptom, not the cause.

4. **Temporary damage-control only** (for a deployment you're about to migrate):
```bash
DATABASE_POOL_SIZE=1
DATABASE_SQLITE_PRAGMA_BUSY_TIMEOUT=30000
```
Serializes to a single async connection, trading concurrency for stability. **Not supported long-term.** Also consider `ENABLE_AUTOMATIONS=false` if the background scheduler's periodic poll is the specific thing tipping you over the edge.
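One way option 1 can look in Docker Compose (a sketch only: the service names, credentials, and volume layout are assumptions to adapt, not a drop-in file):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: openwebui
      POSTGRES_PASSWORD: change-me
      POSTGRES_DB: openwebui
    volumes:
      - pgdata:/var/lib/postgresql/data   # Postgres manages its own local I/O

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgresql+asyncpg://openwebui:change-me@postgres:5432/openwebui
    volumes:
      - webui-data:/app/backend/data      # uploads/RAG files; shared storage is fine here

volumes:
  pgdata:
  webui-data:
```

Note the split: the database lives on the Postgres service's own volume, while `/app/backend/data` can stay on whatever storage you already use for files.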

### 5. Uploaded Files or RAG Knowledge Inaccessible

33 changes: 33 additions & 0 deletions docs/troubleshooting/performance.md
@@ -315,6 +315,39 @@ Typical symptoms after upgrading to releases that use the async SQLite driver:
Older synchronous SQLAlchemy releases (≤ 0.8.12) serialized contention in-process, which masked slow storage. The async driver opens connections across threads and hammers the filesystem, so network-attached storage degradation becomes immediately visible.
:::

#### Why the async backend makes network-storage SQLite fail suddenly

If you upgraded from 0.8.x to 0.9.x and nothing else changed in your deployment, the mechanism below is why things broke. It is worth understanding, because `DATABASE_POOL_SIZE` and related settings only treat symptoms; they are not cures.

**The core issue is `fsync()`.** SQLite's durability guarantee is a synchronous flush on every commit. `fsync` latency depends entirely on where the file lives:

| Storage | Typical `fsync` latency |
| :--- | :--- |
| Local NVMe | ~100 μs |
| Local SATA SSD | 100 μs – a few ms |
| Local HDD | ~10 ms |
| NFS / CephFS / Azure Files (SSD-backed) | 50–500 ms |
| NFS (HDD-backed or high-latency) | hundreds of ms to multiple seconds |
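You can measure this table's numbers directly on any candidate mount. A minimal probe sketch (the sample count and the use of the system temp directory are placeholders; point it at the directory that holds `webui.db`):

```python
import os
import tempfile
import time

def median_fsync_ms(directory: str, samples: int = 25) -> float:
    """Write-and-fsync a small file repeatedly; return the median latency in ms."""
    path = os.path.join(directory, "fsync_probe.tmp")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    timings = []
    try:
        for _ in range(samples):
            os.write(fd, b"x" * 4096)   # one SQLite-page-sized write
            t0 = time.perf_counter()
            os.fsync(fd)                # the call SQLite issues on every commit
            timings.append((time.perf_counter() - t0) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    timings.sort()
    return timings[len(timings) // 2]

# Replace with the directory containing webui.db, e.g. your DATA_DIR:
print(f"median fsync: {median_fsync_ms(tempfile.gettempdir()):.3f} ms")
```

If the probe reports tens of milliseconds or worse on the mount that holds `webui.db`, you are in the failure regime this section describes.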

The latency is identical in sync and async code. What changes is **how many concurrent `fsync`s are in flight at once**.

**Old world — sync SQLAlchemy (0.8.x):** DB calls ran on FastAPI's ~40-thread worker pool. That pool was a natural throttle — you could never have more than ~40 concurrent SQLite operations. Slow storage made individual requests slow, but the thread pool created backpressure before anything collapsed. Users saw "the app is slow," not "the app is dead."

**New world — async `aiosqlite` (0.9.x):** No thread-pool ceiling. The asyncio loop schedules thousands of DB coroutines in parallel, each trying to check out a connection from the **SQLAlchemy async pool** (default `pool_size=5` + `max_overflow=10` = 15 connections). On local SSD, a connection checks out, `fsync`s in ~1 ms, returns to the pool — churn is fast, 15 slots is plenty. On NFS/CephFS, the same connection blocks for hundreds of ms on `fsync`, stays checked out the whole time, and the pool saturates almost instantly. Every subsequent request waits `pool_timeout` (30 s) and then fails with:

```
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00
```

Increasing `DATABASE_POOL_SIZE` just moves the breaking point. More connections means more concurrent slow `fsync`s against the same slow storage — the filesystem is still the bottleneck, and you can't pool your way past it.
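The reason extra slots cannot help: SQLite permits only one writer at a time, so commits serialize on the write lock and total write throughput is capped near `1 / fsync latency` no matter how large the pool is. Illustrative arithmetic, with an assumed NFS latency:

```python
# SQLite allows a single writer, so commits serialize on the write lock:
# max write throughput ≈ 1 / fsync latency, independent of pool size.
fsync_latency_sec = 0.2  # assumed NFS fsync latency

max_commits_per_sec = 1 / fsync_latency_sec  # ~5 commits/s

for pool_size in (15, 50, 150):
    # Extra pool slots only let more requests *wait* for the same ~5 commits/s.
    print(f"pool={pool_size:>3}: ~{max_commits_per_sec:.0f} commits/s ceiling")
```

A tenfold bigger pool changes nothing in that loop's output: the storage-imposed ceiling is the same, only the queue behind it grows.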

**And WAL over NFS is specifically broken.** SQLite's WAL mode uses an `mmap`-backed `-shm` file for cross-process coordination. [SQLite upstream says plainly](https://www.sqlite.org/faq.html#q5) that `mmap` on NFS is unreliable — some NFS versions don't support it at all. Under low concurrency it was merely slow; under async concurrency you can hit actual locking pathologies (deadlocks, `PRAGMA journal_mode=WAL` that starts and never completes, multi-minute stalls on trivial queries).

**Why Postgres is the fix, not a bigger pool:** the Postgres server manages its own I/O concurrency against its own local storage. Your app hits it over a network socket, but that hop is orders of magnitude cheaper than NFS `fsync`, and Postgres was designed from day one for concurrent writers — no file-level locking, no cross-process `mmap` coordination, no WAL-on-network-FS caveats. A dedicated async driver (`asyncpg`) talks to it directly. That's the only database shape that actually composes with async concurrency when the storage isn't guaranteed-fast-local.

The one-line summary: sync backends throttled concurrency through thread pools, so slow storage just made things *slow*. Async backends allow massive concurrency, which means slow `fsync`s stack up, connections stay checked out longer, the pool saturates, and the whole thing wedges. The same storage was tolerable before because the app wasn't asking it to do 20 concurrent `fsync`s.

SQLite is particularly sensitive to disk performance because it performs synchronous writes. Moving from local SSDs to a network share can increase latency by 10x or more per operation.

**Symptoms:**