Make the inner Jaccard loop cheaper #43

@hittjwX

Description

  • See TidyObsidian/find-duplicate-blocks.py

For each candidate, you recompute jaccard(tokens, cb["tokens"]) with & and | on Python sets, which allocates a fresh intersection set and a fresh union set on every comparison.

Change:

  • Pre-store len(tokens) alongside tokens inside canonical_blocks, so you compute union size as len_a + len_b - inter and avoid building a full union set.

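  • If you keep tokens in a sorted, deduplicated list instead of a set, you can compute the intersection size with a two-pointer walk, which avoids hashing and is more cache-friendly. Benchmark before switching: CPython's set intersection runs in C, so a pure-Python walk is not guaranteed to be faster at this scale.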

Both reduce per-comparison overhead without changing behavior.
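
A minimal sketch of both ideas, not the actual code in find-duplicate-blocks.py. The function names, the "n_tokens" key, and the choice to return 1.0 for two empty token collections are assumptions made for illustration; the empty-input case should match whatever the existing jaccard() returns.

```python
from typing import Sequence, Set


def jaccard_cached(a: Set[str], len_a: int, b: Set[str], len_b: int) -> float:
    """Jaccard similarity from an intersection plus pre-stored sizes.

    The union size comes from inclusion-exclusion, so no union set is built.
    """
    inter = len(a & b)
    union = len_a + len_b - inter
    # Assumption: two empty inputs count as identical.
    return inter / union if union else 1.0


def intersection_size_sorted(a: Sequence[str], b: Sequence[str]) -> int:
    """Count common items of two sorted, duplicate-free sequences."""
    i = j = inter = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            inter += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return inter


def jaccard_sorted(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity for sorted, duplicate-free token lists."""
    inter = intersection_size_sorted(a, b)
    union = len(a) + len(b) - inter
    # Assumption: two empty inputs count as identical.
    return inter / union if union else 1.0


# Hypothetical shape of one canonical_blocks entry with the size cached.
tokens_b = {"the", "lazy", "brown", "dog"}
cb = {"tokens": tokens_b, "n_tokens": len(tokens_b)}

tokens = {"the", "quick", "brown", "fox"}
score_cached = jaccard_cached(tokens, len(tokens), cb["tokens"], cb["n_tokens"])
score_sorted = jaccard_sorted(sorted(tokens), sorted(cb["tokens"]))
```

Both variants should give the same score as len(a & b) / len(a | b) for non-empty inputs, so they can be checked against the current implementation before swapping it out. The sorted-list variant only holds if each list is sorted once, when the block is stored, and contains no duplicates.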

Metadata

Labels: enhancement
