Update make_fossa_deps_conan.py with smarter dependency classification #1707
jtrinidad-fossa wants to merge 1 commit into
Merges improvements from an updated version of the script:
- Replaces vendored-dependencies output with three-tier classification: referenced-dependencies (git), remote-dependencies (archive tarball), and custom-dependencies (fallback), matching the fossa-deps.yml spec more closely
- Adds git URL extraction from conandata.sources before falling back to homepage, supporting GitHub, GitLab, Bitbucket, SourceHut, and Codeberg
- Adds git tag extraction from source URLs to handle packages that use non-standard tag formats (e.g. curl-8_17_0 vs 8.17.0)
- Adds archive URL fallback for packages like GNU libraries that have no Git repo
- Skips private-channel packages without conandata to avoid resolving internal version strings as git tags
- Skips header-only libraries with binary == "Skip"
- Adds Option B usage: accept a pre-generated conan graph info JSON file as an argument instead of always running conan via subprocess
- Removes package_id from version string (was opaque to FOSSA's git resolver)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
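A minimal sketch of the three-tier classification described above, not the script's actual code. It assumes each dependency has already been reduced to a dict with the hypothetical keys `name`, `version`, `git_url`, and `archive_url`; the sample entries and the example.org URL are placeholders.

```python
def classify(dep):
    """Return the fossa-deps.yml section a dependency would fall into.

    The key names are illustrative assumptions, not the script's own.
    """
    if dep.get("git_url"):
        # A resolvable Git repository wins: referenced by URL and tag.
        return "referenced-dependencies"
    if dep.get("archive_url"):
        # No Git repository, but a source tarball is known.
        return "remote-dependencies"
    # Nothing resolvable: fall back to a custom entry.
    return "custom-dependencies"


deps = [
    {"name": "curl", "version": "curl-8_17_0", "git_url": "github.com/curl/curl"},
    {"name": "libfoo", "version": "1.0", "archive_url": "https://example.org/libfoo-1.0.tar.gz"},
    {"name": "internal-lib", "version": "1.0"},
]

sections = {dep["name"]: classify(dep) for dep in deps}
```

The order of the checks encodes the priority: Git first, archive second, custom last.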
Walkthrough
This pull request refactors a Python script that processes Conan v2 dependency graphs and generates FOSSA dependency manifests. The dataclass-based YAML model was replaced with five focused extraction helpers that derive normalized Git URLs, archive URLs, versions, and embedded tags from Conan node metadata. The core generator now filters ineligible nodes (root, build-context, test, header-only), classifies remaining dependencies as Git-referenced, remote-archived, or custom, and writes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/walkthroughs/make_fossa_deps_conan.py`:
- Around line 76-81: The current normalization truncates multi-segment GitLab
project paths by taking only the first two path parts; update the logic around
clean_url/host_name/parts/path_parts/git_url to preserve the full project path
for gitlab.com by joining all remaining path_parts (git_url = f"{host_name}/" +
"/".join(path_parts)), while still keeping the existing behavior of using only
owner/repo for other hosts (and preserving the return tuple (host_name ==
"github.com")). Apply the same change to the duplicate block referenced at the
other location (lines 108-111).
- Around line 167-169: The function extract_git_tag_from_source_url currently
strips a leading "v" from the captured archive tag (m.group(1).lstrip("v")),
which loses the exact git ref for repos that publish tags like "v1.2.3"; update
the function to return the raw captured tag (m.group(1)) without removing a
leading "v" so the exact archive tag is preserved.
- Around line 346-350: The current branch silently falls back to
get_graph_from_conan when a single JSON filename is passed but the file does not
exist; change the logic so that if len(args) == 1 and args[0].endswith(".json")
but os.path.isfile(args[0]) is False, the script fails fast with a clear error
(e.g., print/log an error and exit or raise SystemExit) rather than calling
get_graph_from_conan; update the conditional around
get_graph_from_file/get_graph_from_conan and use args, os.path.isfile,
get_graph_from_file, and get_graph_from_conan to implement the check and exit
behavior.
- Around line 251-293: The current emitter appends raw interpolated strings into
yaml_lines (see yaml_lines, the git_deps/archive_deps/custom_deps loops and the
final f.write) which can produce invalid YAML for values containing colons,
hashes, quotes or newlines; instead build proper Python data structures for
referenced-dependencies/remote-dependencies/custom-dependencies from
git_deps/archive_deps/custom_deps and serialize them with a YAML library (e.g.,
yaml.safe_dump(..., sort_keys=False) from PyYAML) when writing fossa-deps.yml so
all scalars are correctly quoted/escaped and the output is valid YAML.
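For the last item above (YAML emission), the review recommends PyYAML's `yaml.safe_dump`. As a standard-library-only alternative, the sketch below quotes each scalar with `json.dumps`, relying on the fact that a JSON string literal is also a valid YAML double-quoted scalar for typical text. The helper names and the `type`/`name`/`version` entry layout are assumptions made for illustration, not the script's confirmed output format.

```python
import json


def yaml_scalar(value):
    """Quote a value as a YAML double-quoted scalar via JSON string syntax.

    This escapes quotes, backslashes, and newlines, and the surrounding
    double quotes keep colons and hashes from being read as YAML syntax.
    """
    return json.dumps(str(value), ensure_ascii=False)


def render_referenced(git_deps):
    """Render hypothetical (name, version) pairs as a referenced-dependencies block."""
    lines = ["referenced-dependencies:"]
    for name, version in git_deps:
        lines.append("- type: git")
        lines.append(f"  name: {yaml_scalar(name)}")
        lines.append(f"  version: {yaml_scalar(version)}")
    return "\n".join(lines) + "\n"


rendered = render_referenced([("github.com/curl/curl", "curl-8_17_0")])
```

Building a plain dict and calling `yaml.safe_dump(..., sort_keys=False)` is the more general fix when PyYAML is available; the quoting approach only avoids adding a dependency.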
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7b94c08b-bd0b-4e72-9d3e-e1b6af9b101e
📒 Files selected for processing (1)
docs/walkthroughs/make_fossa_deps_conan.py
    parts = clean_url.split(host_name + "/")
    if len(parts) > 1:
        path_parts = parts[1].split("/")
        if len(path_parts) >= 2:
            git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
            return git_url, (host_name == "github.com")
Preserve the full GitLab project path when normalizing forge URLs.
Both helpers currently assume every supported forge uses exactly owner/repo. That truncates valid GitLab subgroup repos like gitlab.com/group/subgroup/project down to gitlab.com/group/subgroup, so FOSSA will resolve the wrong repository.
Suggested fix

    -    if len(path_parts) >= 2:
    -        git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
    +    if host_name == "gitlab.com":
    +        repo_path = re.split(r"/-/|/releases/|/archive/", parts[1], maxsplit=1)[0].rstrip("/")
    +        if repo_path:
    +            git_url = f"{host_name}/{repo_path}"
    +            return git_url, False
    +    elif len(path_parts) >= 2:
    +        git_url = f"{host_name}/{path_parts[0]}/{path_parts[1]}"
             return git_url, (host_name == "github.com")

Also applies to: 108-111
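To make the truncation concrete, here is a self-contained sketch of the normalization idea. The function name `normalize_forge_path` and its signature are hypothetical, and unlike the script's helper it returns only the URL rather than a tuple; it is an illustration of the suggested behavior, not a drop-in replacement.

```python
import re


def normalize_forge_path(host_name, path):
    """Return 'host/project-path' from the part of a URL after 'host/'.

    On gitlab.com the full (possibly nested) project path is kept; on
    other hosts only the first two segments (owner/repo) are used.
    """
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    if host_name == "gitlab.com":
        # Cut at the first route marker so subgroups survive intact.
        repo_path = re.split(r"/-/|/releases/|/archive/", path, maxsplit=1)[0].rstrip("/")
        return f"{host_name}/{repo_path}" if repo_path else None
    parts = path.split("/")
    if len(parts) >= 2:
        return f"{host_name}/{parts[0]}/{parts[1]}"
    return None


nested = normalize_forge_path(
    "gitlab.com", "group/subgroup/project/-/archive/v1.0/project-v1.0.tar.gz"
)
```

With the original two-segment logic the same input would have produced `gitlab.com/group/subgroup`, which is the wrong repository.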
    m = re.search(r'/archive/([^/]+)\.(?:tar\.gz|tar\.xz|tar\.bz2|zip)$', url)
    if m:
        return m.group(1).lstrip("v")
Return the exact archive tag instead of stripping v.
extract_git_tag_from_source_url() is supposed to recover the real git ref. For URLs like /archive/v1.2.3.tar.gz, converting that to 1.2.3 will miss the tag on repos that actually publish v1.2.3.
Suggested fix

    -    return m.group(1).lstrip("v")
    +    return m.group(1)
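A small standalone sketch of the suggested behavior, reusing the regular expression quoted above. The function name `extract_archive_tag` and the sample URL are hypothetical. Note that `str.lstrip("v")` removes every leading `v` character, not just a single version prefix, which is a second reason to return the raw capture.

```python
import re

# Same pattern as in the reviewed code: the tag is the archive file stem.
ARCHIVE_TAG_RE = re.compile(r"/archive/([^/]+)\.(?:tar\.gz|tar\.xz|tar\.bz2|zip)$")


def extract_archive_tag(url):
    """Return the exact tag embedded in an archive URL, or None if absent."""
    m = ARCHIVE_TAG_RE.search(url)
    return m.group(1) if m else None


tag = extract_archive_tag("https://github.com/owner/repo/archive/v1.2.3.tar.gz")

# lstrip("v") strips all leading "v" characters, mangling tags like this one.
mangled = "vendor-1.0".lstrip("v")
```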
    # If the sole argument is a JSON file, read it directly; otherwise run conan.
    if len(args) == 1 and args[0].endswith(".json") and os.path.isfile(args[0]):
        data = get_graph_from_file(args[0])
    else:
        data = get_graph_from_conan(args)
Fail fast when the requested JSON graph file does not exist.
If the user passes a single .json path with a typo, the script silently switches to Option A and invokes Conan instead. That makes the documented file-input mode very confusing to debug.
Suggested fix

    -if len(args) == 1 and args[0].endswith(".json") and os.path.isfile(args[0]):
    -    data = get_graph_from_file(args[0])
    +if len(args) == 1 and args[0].endswith(".json"):
    +    if not os.path.isfile(args[0]):
    +        logging.error("JSON graph file not found: %s", args[0])
    +        sys.exit(1)
    +    data = get_graph_from_file(args[0])
     else:
         data = get_graph_from_conan(args)
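The dispatch logic can be exercised in isolation with a sketch like the one below. The function `select_graph_source` is hypothetical: it only decides which input mode applies and leaves the actual loading to the script's own `get_graph_from_file` and `get_graph_from_conan`. It raises `SystemExit` with a message instead of logging and calling `sys.exit(1)`, which has the same fail-fast effect.

```python
import os


def select_graph_source(args):
    """Decide between file input and running Conan.

    Returns ("file", path) or ("conan", args). A single argument ending
    in .json is always treated as a file request, so a missing file is
    an error rather than a silent fallback to Conan.
    """
    if len(args) == 1 and args[0].endswith(".json"):
        if not os.path.isfile(args[0]):
            raise SystemExit(f"JSON graph file not found: {args[0]}")
        return ("file", args[0])
    return ("conan", args)
```

A typo in the file name now stops the run with a clear message instead of invoking Conan with an unexpected argument.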
Overview
Merges improvements from an updated version of the script, as itemized in the commit description above.
Acceptance criteria
Customers will be able to use this updated script as a starting point for scanning their Conan projects. The current script had not been updated in a while and needed a refresh.
Testing plan
Took a conan graph info JSON file supplied by a customer and used it to generate a fossa-deps.yml.
This section should list concrete steps that a reviewer can sanity check and repeat on their own machine (and provide any needed test cases).
Risks
Would like someone more familiar with this script to make sure it is correct.
Metrics
Is this change something that can or should be tracked? If so, can we do it today? And how? If it's easy, do it.
References
Add links to any referenced GitHub issues, Zendesk tickets, Jira tickets, Slack threads, etc.
Example:
Checklist
- `docs/`.
- `docs/README.ms` and gave consideration to how discoverable or not my documentation is.
- `Changelog.md`. If this PR did not mark a release, I added my changes into an `## Unreleased` section at the top.
- `.fossa.yml` or `fossa-deps.{json.yml}`, I updated `docs/references/files/*.schema.json` AND I have updated example files used by `fossa init` command. You may also need to update these if you have added/removed new dependency type (e.g. `pip`) or analysis target type (e.g. `poetry`).
- `docs/references/subcommands/<subcommand>.md`.