
Parallelize block comparison itself #42

@hittjwX

Description

  • See TidyObsidian/find-duplicate-blocks.py

This script takes 30–40 minutes to process 22K files. Currently, only file reading and block extraction run in parallel; step 5 (the block comparison) runs in a single process.

A simple structural improvement:

  • Split all_blocks into chunks and run the “for each block, find candidate indices, check Jaccard, append or merge” logic in worker processes.

  • Have each worker build its own local canonical_blocks/token_index, then merge the results at the end (e.g., by re-running a cheaper global dedup on the worker outputs).
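The two bullets above could be sketched roughly as follows. This is a minimal illustration, not the code in find-duplicate-blocks.py: the block representation (a `(tokens, sources)` pair), the function names `dedup` and `parallel_dedup`, the `executor_factory` parameter, and the 0.8 threshold are all assumptions made for the example. Because `dedup` returns entries in the same shape it accepts, the final merge is just one more `dedup` pass over the concatenated worker outputs.

```python
from concurrent.futures import ProcessPoolExecutor


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(blocks, threshold=0.8):
    """Greedy dedup over (tokens, sources) pairs.

    Builds a local canonical list and token index, as a worker would.
    Returns a list of (tokens, sources) pairs in the same shape as the input.
    """
    canonical = []      # list of [tokens, sources]
    token_index = {}    # token -> set of indices into canonical
    for tokens, sources in blocks:
        # Candidates are canonical blocks sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(token_index.get(tok, ()))
        match = next(
            (i for i in sorted(candidates)
             if jaccard(tokens, canonical[i][0]) >= threshold),
            None,
        )
        if match is None:
            # No near-duplicate found: append as a new canonical block.
            canonical.append([tokens, list(sources)])
            for tok in tokens:
                token_index.setdefault(tok, set()).add(len(canonical) - 1)
        else:
            # Near-duplicate found: merge the sources into the existing block.
            canonical[match][1].extend(sources)
    return [(tokens, sources) for tokens, sources in canonical]


def parallel_dedup(all_blocks, workers=4, threshold=0.8,
                   executor_factory=ProcessPoolExecutor):
    """Dedup chunks in workers, then run a global dedup on their outputs."""
    chunk = max(1, -(-len(all_blocks) // workers))  # ceiling division
    chunks = [all_blocks[i:i + chunk]
              for i in range(0, len(all_blocks), chunk)]
    with executor_factory(max_workers=workers) as pool:
        partials = pool.map(dedup, chunks, [threshold] * len(chunks))
        merged_input = [entry for part in partials for entry in part]
    # The global pass only sees the workers' already-reduced outputs.
    return dedup(merged_input, threshold)
```

With the default `ProcessPoolExecutor`, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Since the greedy "first match wins" rule depends on processing order, the chunked result may differ slightly from a single-pass run for blocks close to the threshold.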

This is more work to implement, but if the 30–40 minutes is mostly CPU-bound, using all cores in step 5 could cut the runtime substantially.
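One way to check the "mostly CPU" assumption before investing in the change is to compare CPU time with wall-clock time for the single-process step. The helper below is a hypothetical sketch (`cpu_fraction` is not part of the script); a ratio close to 1.0 would suggest the step is CPU-bound, while a much lower ratio would point to I/O or waiting instead.

```python
import time


def cpu_fraction(fn, *args, **kwargs):
    """Run fn and return (result, cpu_seconds / wall_seconds).

    time.process_time() counts CPU time of the current process only,
    so this is meant for a step that runs in a single process.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, (cpu / wall if wall > 0 else 0.0)
```

Usage would be along the lines of `result, frac = cpu_fraction(run_step_5, all_blocks)`, where `run_step_5` stands in for whatever function currently performs the comparison.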

Metadata

Labels: enhancement (New feature or request)

Status: Todo