Skip to content

[scd] Add hash lock option #1520

Open
the-glu wants to merge 1 commit into
interuss:masterfrom
Orbitalize:spread_lock
Open

[scd] Add hash lock option #1520
the-glu wants to merge 1 commit into
interuss:masterfrom
Orbitalize:spread_lock

Conversation

@the-glu

@the-glu the-glu commented Jun 10, 2026

Copy link
Copy Markdown
Member

This PR follow #1518 as it increase performance of that PR.

Add a new option to lock cells using an hash, as an intermediate between the global lock option and the subscriptions lock.

Using FlightsInSubs test, with 8 clusters, on a single-local CRDB node we can see an improvement others lock methods.

image

(Don't include option in terraform/helm/tanka yet)

if err != nil {
return stacktrace.Propagate(err, "Failed to upsert OperationalIntent in repo")
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been ordered as it has no impact and slightly reduce transaction restarts

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in a dedicated PR, please split it out.

@BenjaminPelletier

Copy link
Copy Markdown
Member

I think we should check performance under conditions at least somewhat similar to real operations, which I think means at least multiple nodes, probably multiple instances, and latency between nodes. I first measured v0.22.0 with:

monitoring$ export NUM_USS=7
monitoring$ export NUM_NODES=3
monitoring$ export INTRA_USS_NETEM_CONF="delay 600us 40us 25% distribution normal loss 0.0005%"
monitoring$ export INTER_USS_NETEM_CONF="delay 36ms 40ms 50% distribution paretonormal loss 0.25% 15%"
monitoring$ make start-locally
monitoring$ AUTH_SPEC="DummyOAuth(http://localhost:8085/token,locust)" uv run --index https://pypi.org/simple locust -f ./monitoring/loadtest/locust_files/FlightsInSub.py -H http://localhost:8001 --uss-base-url http://testdummy.interuss.org/uss --cluster-count 11 --base-lat 34 --base-lng -118 --area-radius 15182 --max-flight-distance 10000

With 10 users, I got:

Screenshot 2026-06-16 at 11 36 37 PM

I then measured with this branch:

dss$ ./build/build.sh
dss$ docker image tag interuss-local/dss interuss/dss:v0.22.0
monitoring$ make down-locally
<edit monitoring/build/dev/startup/core_service.sh to specify -enable_scd_hash_lock>
monitoring$ make start-locally
monitoring$ AUTH_SPEC="DummyOAuth(http://localhost:8085/token,locust)" uv run --index https://pypi.org/simple locust -f ./monitoring/loadtest/locust_files/FlightsInSub.py -H http://localhost:8001 --uss-base-url http://testdummy.interuss.org/uss --cluster-count 11 --base-lat 34 --base-lng -118 --area-radius 15182 --max-flight-distance 10000
Screenshot 2026-06-16 at 11 57 13 PM

Just to make sure my retag worked:

monitoring$ docker container logs local_infra_1-1-dss-1 2>&1 | head -n 30
Using CockroachDB
Waiting for datastore db1.uss1.localutm:26257...
Datastore db1.uss1.localutm:26257 is not available yet, sleeping...
Datastore db1.uss1.localutm:26257 is not available yet, sleeping...
Datastore db1.uss1.localutm:26257 is not available yet, sleeping...
Datastore db1.uss1.localutm:26257 is not available yet, sleeping...
Datastore db1.uss1.localutm:26257 is not available yet, sleeping...
Datastore db1.uss1.localutm:26257 is online!
Debug Mode: off
2026-06-17T06:52:24.128Z	info	core-service/main.go:255	version	{"address": ":80", "version": "0.0.0+49c67230-dirty"}

I see some latency improvement versus v0.22.0 and a tiny throughput improvement, so looks reasonable to me. I will leave final approval to @barroco with his greater practical experience with database operations than me.

@the-glu the-glu force-pushed the spread_lock branch 6 times, most recently from 86539e0 to 6a5d9fe Compare June 17, 2026 20:32

@barroco barroco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing the results.

if err != nil {
return stacktrace.Propagate(err, "Failed to upsert OperationalIntent in repo")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in a dedicated PR, please split it out.

!!! danger
All DSS instances in a DSS pool must use the same value for this option. Mixing will result in dramatically lower performance.

You can use the `/aux/v1/configuration/scd_lock_mode` endpoint to retrive the current values of locks for a specifc DSS instance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
You can use the `/aux/v1/configuration/scd_lock_mode` endpoint to retrive the current values of locks for a specifc DSS instance.
You can use the `/aux/v1/configuration/scd_lock_mode` endpoint to retrieve the current values of locks for a specifc DSS instance.

!!! danger
This option requires migration 3.4.1 for CockroachDB and 1.1.1 for Yugabyte to be effective

Similar to the previous option, but use lock based on hashed cells (with 65k locks).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Similar to the previous option, but use lock based on hashed cells (with 65k locks).
Similar to the SCD global lock option, but use lock based on hashed cells (with 65k locks).

for _, cell := range cells {
// Fibonacci hash: level-13 cell ids only differ in their high bits,
// so a plain modulo would map them all to the same stripe.
k := int64((uint64(cell) * 0x9e3779b97f4a7c15) >> 48 % lockStripes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move the value into a constant.


Similar to the previous option, but use lock based on hashed cells (with 65k locks).

This should be better than global lock, as long as cells used don't colide and limit number of entries in the database.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This should be better than global lock, as long as cells used don't colide and limit number of entries in the database.
This should be better than global lock, as long as cells used don't collide and limit number of entries in the database.

zap.Bool("global_lock", false),
zap.Duration("duration", duration),
zap.Int("cell_count", len(cells)),
zap.Int("explicit_subscription_id_count", len(subscriptionIds)),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add the number of locks acquired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants