Update make_fossa_deps_conan.py with smarter dependency classification#1707

Open
jtrinidad-fossa wants to merge 1 commit into master from update-conan-parser

Conversation

@jtrinidad-fossa jtrinidad-fossa commented May 11, 2026

Overview

Merges improvements from an updated version of the script:

  • Replaces vendored-dependencies output with three-tier classification: referenced-dependencies (git), remote-dependencies (archive tarball), and custom-dependencies (fallback), matching fossa-deps.yml spec more closely
  • Adds git URL extraction from conandata.sources before falling back to homepage, supporting GitHub, GitLab, Bitbucket, SourceHut, and Codeberg
  • Adds git tag extraction from source URLs to handle packages that use non-standard tag formats (e.g. curl-8_17_0 vs 8.17.0)
  • Adds archive URL fallback for packages like GNU libraries that have no Git repo
  • Skips private-channel packages without conandata to avoid resolving internal version strings as git tags
  • Skips header-only libraries with binary == "Skip"
  • Adds Option B usage: accept a pre-generated conan graph info JSON file as an argument instead of always running conan via subprocess
  • Removes package_id from version string (was opaque to FOSSA's git resolver)
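
The three-tier classification described above can be sketched roughly as follows. This is a minimal illustration, not the code from make_fossa_deps_conan.py: the helper name `classify_dependency`, its signature, and the `"unknown"` license placeholder are all assumptions made for the example.

```python
from typing import Optional, Tuple


def classify_dependency(name: str, version: str,
                        git_url: Optional[str],
                        archive_url: Optional[str]) -> Tuple[str, dict]:
    """Return (fossa-deps.yml section, entry) for one dependency.

    Hypothetical helper: the name, signature, and entry fields are
    illustrative and are not taken from make_fossa_deps_conan.py.
    """
    if git_url:
        # Tier 1: a git repository is known, so reference it by URL and tag.
        return "referenced-dependencies", {
            "type": "git", "name": git_url, "version": version}
    if archive_url:
        # Tier 2: no git repository, but a source archive URL is available.
        return "remote-dependencies", {
            "name": name, "version": version, "url": archive_url}
    # Tier 3: nothing resolvable, so fall back to a custom dependency.
    # "unknown" is a placeholder license chosen for this sketch only.
    return "custom-dependencies", {
        "name": name, "version": version, "license": "unknown"}


section, entry = classify_dependency(
    "curl", "curl-8_17_0", "github.com/curl/curl", None)
print(section, entry)
```

The ordering matters: a git reference is preferred over an archive URL, and the custom entry is used only when neither is available.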

Acceptance criteria

Customers will be able to use this updated script as a starting point for scanning their Conan projects. The current script had not been updated in a while and needed a refresh.

Testing plan

Took a conan graph info JSON file supplied by a customer and used it to generate a fossa-deps.yml.

This section should list concrete steps that a reviewer can sanity check and repeat on their own machine (and provide any needed test cases).

Risks

Would like someone more familiar with this script to make sure it is correct.

Metrics

Is this change something that can or should be tracked? If so, can we do it today? And how? If it's easy, do it.

References

Add links to any referenced GitHub issues, Zendesk tickets, Jira tickets, Slack threads, etc.

Example:

Checklist

  • I added tests for this PR's change (or explained in the PR description why tests don't make sense).
  • If this PR introduced a user-visible change, I added documentation into docs/.
  • If this PR added docs, I added links as appropriate to the user manual's ToC in docs/README.ms and gave consideration to how discoverable or not my documentation is.
  • If this change is externally visible, I updated Changelog.md. If this PR did not mark a release, I added my changes into an ## Unreleased section at the top.
  • If I made changes to .fossa.yml or fossa-deps.{json.yml}, I updated docs/references/files/*.schema.json AND I have updated example files used by fossa init command. You may also need to update these if you have added/removed new dependency type (e.g. pip) or analysis target type (e.g. poetry).
  • If I made changes to a subcommand's options, I updated docs/references/subcommands/<subcommand>.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jtrinidad-fossa jtrinidad-fossa requested a review from a team as a code owner May 11, 2026 20:35
@jtrinidad-fossa jtrinidad-fossa requested a review from csasarak May 11, 2026 20:35
@coderabbitai

coderabbitai Bot commented May 11, 2026

Walkthrough

This pull request refactors a Python script that processes Conan v2 dependency graphs and generates FOSSA dependency manifests. The dataclass-based YAML model was replaced with five focused extraction helpers that derive normalized Git URLs, archive URLs, versions, and embedded tags from Conan node metadata. The core generator now filters ineligible nodes (root, build-context, test, header-only), classifies remaining dependencies as Git-referenced, remote-archived, or custom, and writes fossa-deps.yml. Input handling was updated to support both JSON file input and live Conan execution. The script's documentation was expanded to describe both modes, and imports were augmented with file and pattern utilities.
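
The node filtering summarized in the walkthrough could look roughly like the sketch below. It is not the PR's code: the function name `is_eligible` is hypothetical, and the node keys `"context"`, `"test"`, and `"binary"` as well as the root node id `"0"` are assumptions about the JSON produced by `conan graph info` that should be checked against a real graph file.

```python
def is_eligible(node_id: str, node: dict) -> bool:
    """Decide whether a graph node should appear in fossa-deps.yml.

    Hypothetical helper. The keys "context", "test", and "binary" and
    the root id "0" are assumptions about the Conan graph JSON layout.
    """
    if node_id == "0":                    # assumed id of the root (consumer) node
        return False
    if node.get("context") == "build":    # build-context tooling
        return False
    if node.get("test"):                  # test-only requirements
        return False
    if node.get("binary") == "Skip":      # per the PR: header-only libraries
        return False
    return True


# Made-up sample nodes, shaped like the assumed graph layout.
nodes = {
    "0": {"name": "app", "context": "host"},
    "1": {"name": "zlib", "context": "host", "binary": "Cache"},
    "2": {"name": "cmake", "context": "build", "binary": "Cache"},
    "3": {"name": "gtest", "context": "host", "test": True},
    "4": {"name": "hdronly", "context": "host", "binary": "Skip"},
}
eligible = {k: v for k, v in nodes.items() if is_eligible(k, v)}
print(sorted(eligible))
```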

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The PR description covers most required sections (Overview, Acceptance criteria, Testing plan, Risks, Checklist) with good detail on technical changes, but several sections are incomplete or use placeholder text. | Please complete the Metrics section, add GitHub/Zendesk references if applicable, and clarify the testing steps to enable reviewer validation of the changes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing vendored-dependencies with smarter three-tier dependency classification. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/walkthroughs/make_fossa_deps_conan.py`:
- Around line 76-81: The current normalization truncates multi-segment GitLab
project paths by taking only the first two path parts; update the logic around
clean_url/host_name/parts/path_parts/git_url to preserve the full project path
for gitlab.com by joining all remaining path_parts (git_url = f"{host_name}/" +
"/".join(path_parts)), while still keeping the existing behavior of using only
owner/repo for other hosts (and preserving the return tuple (host_name ==
"github.com")). Apply the same change to the duplicate block referenced at the
other location (lines 108-111).
- Around line 167-169: The function extract_git_tag_from_source_url currently
strips a leading "v" from the captured archive tag (m.group(1).lstrip("v")),
which loses the exact git ref for repos that publish tags like "v1.2.3"; update
the function to return the raw captured tag (m.group(1)) without removing a
leading "v" so the exact archive tag is preserved.
- Around line 346-350: The current branch silently falls back to
get_graph_from_conan when a single JSON filename is passed but the file does not
exist; change the logic so that if len(args) == 1 and args[0].endswith(".json")
but os.path.isfile(args[0]) is False, the script fails fast with a clear error
(e.g., print/log an error and exit or raise SystemExit) rather than calling
get_graph_from_conan; update the conditional around
get_graph_from_file/get_graph_from_conan and use args, os.path.isfile,
get_graph_from_file, and get_graph_from_conan to implement the check and exit
behavior.
- Around line 251-293: The current emitter appends raw interpolated strings into
yaml_lines (see yaml_lines, the git_deps/archive_deps/custom_deps loops and the
final f.write) which can produce invalid YAML for values containing colons,
hashes, quotes or newlines; instead build proper Python data structures for
referenced-dependencies/remote-dependencies/custom-dependencies from
git_deps/archive_deps/custom_deps and serialize them with a YAML library (e.g.,
yaml.safe_dump(..., sort_keys=False) from PyYAML) when writing fossa-deps.yml so
all scalars are correctly quoted/escaped and the output is valid YAML.
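
To illustrate the YAML finding above, here is a small sketch of why raw interpolation is fragile. It deliberately does not use PyYAML, so that it stays standard-library only; instead it leans on the fact that a JSON string is also a valid YAML 1.2 double-quoted scalar. The helper names `yaml_scalar` and `render_custom_dependency` are hypothetical, and this is a stand-in for the `yaml.safe_dump` approach the finding recommends, not a replacement for it.

```python
import json


def yaml_scalar(value: str) -> str:
    """Quote a string as a double-quoted YAML scalar.

    Stand-in for a real YAML emitter: JSON string syntax is accepted by
    YAML 1.2 parsers as a double-quoted scalar, so json.dumps is used
    here only to keep the sketch free of third-party imports.
    """
    return json.dumps(value)


def render_custom_dependency(name: str, version: str, license_name: str) -> str:
    """Render one custom-dependencies list item with every value quoted."""
    return "\n".join([
        f"- name: {yaml_scalar(name)}",
        f"  version: {yaml_scalar(version)}",
        f"  license: {yaml_scalar(license_name)}",
    ])


# A value with a colon and a hash would be misread if interpolated raw,
# because ": " starts a mapping value and " #" starts a comment.
risky_name = "libfoo: core #1"
print(f"- name: {risky_name}")          # raw interpolation, ambiguous YAML
print(render_custom_dependency(risky_name, "1.0", "MIT"))
```

Building plain dictionaries and lists and handing them to a YAML library remains the more robust choice, since it also takes care of indentation and key ordering.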
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b94c08b-bd0b-4e72-9d3e-e1b6af9b101e

📥 Commits

Reviewing files that changed from the base of the PR and between 77a146d and 1d3e9b8.

📒 Files selected for processing (1)
  • docs/walkthroughs/make_fossa_deps_conan.py

Comment on lines +76 to +81
parts = clean_url.split(host_name + "/")
if len(parts) > 1:
    path_parts = parts[1].split("/")
    if len(path_parts) >= 2:
        git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
        return git_url, (host_name == "github.com")
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the full GitLab project path when normalizing forge URLs.

Both helpers currently assume every supported forge uses exactly owner/repo. That truncates valid GitLab subgroup repos like gitlab.com/group/subgroup/project down to gitlab.com/group/subgroup, so FOSSA will resolve the wrong repository.

Suggested fix
-                        if len(path_parts) >= 2:
-                            git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
+                        if host_name == "gitlab.com":
+                            repo_path = re.split(r"/-/|/releases/|/archive/", parts[1], maxsplit=1)[0].rstrip("/")
+                            if repo_path:
+                                git_url = f"{host_name}/{repo_path}"
+                                return git_url, False
+                        elif len(path_parts) >= 2:
+                            git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
                             return git_url, (host_name == "github.com")

Also applies to: 108-111
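
As a standalone illustration of the idea behind this suggestion, the sketch below keeps the full project path for gitlab.com and only the first two path segments for other forges. It is a simplified assumption-laden example, not the script's helper: `normalize_forge_url` and `TWO_SEGMENT_HOSTS` are hypothetical names, the host list is incomplete, and SSH-style URLs such as `git@host:owner/repo` are not handled.

```python
import re
from typing import Optional

# Hosts assumed (for this sketch) to address repositories as owner/repo.
TWO_SEGMENT_HOSTS = ("github.com", "bitbucket.org", "codeberg.org")


def _strip_git_suffix(path: str) -> str:
    """Drop a trailing ".git" if present."""
    return path[:-4] if path.endswith(".git") else path


def normalize_forge_url(url: str) -> Optional[str]:
    """Return "host/path" for a recognized forge URL, else None."""
    clean = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url)   # drop the scheme
    clean = clean.split("?")[0].split("#")[0]           # drop query/fragment
    host, _, path = clean.partition("/")
    if host == "gitlab.com":
        # Keep every segment up to the first GitLab route marker so that
        # subgroup projects such as group/subgroup/project survive.
        repo_path = re.split(r"/-/|/releases/|/archive/", path, maxsplit=1)[0]
        repo_path = _strip_git_suffix(repo_path.strip("/"))
        return f"{host}/{repo_path}" if "/" in repo_path else None
    if host in TWO_SEGMENT_HOSTS:
        parts = [p for p in path.split("/") if p]
        if len(parts) >= 2:
            return f"{host}/{parts[0]}/{_strip_git_suffix(parts[1])}"
    return None


print(normalize_forge_url(
    "https://gitlab.com/group/subgroup/project/-/archive/v1.0/project-v1.0.tar.gz"))
print(normalize_forge_url(
    "https://github.com/curl/curl/releases/download/curl-8_17_0/curl-8.17.0.tar.xz"))
```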

Comment on lines +167 to +169
m = re.search(r'/archive/([^/]+)\.(?:tar\.gz|tar\.xz|tar\.bz2|zip)$', url)
if m:
    return m.group(1).lstrip("v")
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return the exact archive tag instead of stripping v.

extract_git_tag_from_source_url() is supposed to recover the real git ref. For URLs like /archive/v1.2.3.tar.gz, converting that to 1.2.3 will miss the tag on repos that actually publish v1.2.3.

Suggested fix
-            return m.group(1).lstrip("v")
+            return m.group(1)
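
A small self-contained check of the behavior under discussion is sketched below. The regular expression is the one quoted in this comment; the wrapper name `extract_archive_tag` and the example URL are hypothetical and used only for illustration.

```python
import re
from typing import Optional

# Same pattern as the snippet under review.
ARCHIVE_TAG = re.compile(r"/archive/([^/]+)\.(?:tar\.gz|tar\.xz|tar\.bz2|zip)$")


def extract_archive_tag(url: str) -> Optional[str]:
    """Return the tag embedded in an archive URL exactly as published."""
    m = ARCHIVE_TAG.search(url)
    return m.group(1) if m else None


raw = extract_archive_tag("https://github.com/owner/repo/archive/v1.2.3.tar.gz")
print(raw)                 # the tag as it appears in the URL
print(raw.lstrip("v"))     # what the current code would return instead
```

Note also that `str.lstrip("v")` removes every leading `v` character rather than a single prefix, so a tag such as `version-1` would come back as `ersion-1`.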

Comment thread docs/walkthroughs/make_fossa_deps_conan.py
Comment on lines +346 to +350
# If the sole argument is a JSON file, read it directly; otherwise run conan.
if len(args) == 1 and args[0].endswith(".json") and os.path.isfile(args[0]):
    data = get_graph_from_file(args[0])
else:
    data = get_graph_from_conan(args)
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when the requested JSON graph file does not exist.

If the user passes a single .json path with a typo, the script silently switches to Option A and invokes Conan instead. That makes the documented file-input mode very confusing to debug.

Suggested fix
-    if len(args) == 1 and args[0].endswith(".json") and os.path.isfile(args[0]):
-        data = get_graph_from_file(args[0])
+    if len(args) == 1 and args[0].endswith(".json"):
+        if not os.path.isfile(args[0]):
+            logging.error("JSON graph file not found: %s", args[0])
+            sys.exit(1)
+        data = get_graph_from_file(args[0])
     else:
         data = get_graph_from_conan(args)
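
The same fail-fast decision can be isolated into a small function that is easy to exercise without Conan installed. This is a sketch under assumptions: `select_input_mode` is a hypothetical name, and it returns a mode label instead of calling `get_graph_from_file` or `get_graph_from_conan`, which are not reproduced here.

```python
import os


def select_input_mode(args):
    """Return ("file", path) or ("conan", args).

    Hypothetical helper: exits with an error message when a single
    *.json argument names a file that does not exist, rather than
    silently falling back to running conan.
    """
    if len(args) == 1 and args[0].endswith(".json"):
        if not os.path.isfile(args[0]):
            # SystemExit with a string prints it to stderr and exits with status 1.
            raise SystemExit(f"JSON graph file not found: {args[0]}")
        return "file", args[0]
    return "conan", list(args)


print(select_input_mode([".", "-s", "build_type=Release"]))
```

Separating the decision from the side effects keeps the typo case testable: a mistyped path raises `SystemExit` before any subprocess is started.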

@zlav zlav requested review from zlav and removed request for csasarak May 13, 2026 18:58