feat: add AI skill to find and improve the Pythonic interface to functions#1484
feat: add AI skill to find and improve the Pythonic interface to functions#1484timsaucer wants to merge 4 commits intoapache:mainfrom
Conversation
…uiring lit()
Update 47 functions in functions.py to accept native Python types (int, float,
str) for arguments that are contextually literals, eliminating verbose lit()
wrapping. For example, users can now write split_part(col("a"), ",", 2) instead
of split_part(col("a"), lit(","), lit(2)). All changes are backward compatible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions Update instr and position (aliases of strpos) to accept Expr | str for the substring parameter, matching the updated primary function signature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alias functions that delegate to a primary function must have their type hints updated to match, even though coercion logic is only added to the primary. Added a new Step 3 to the implementation workflow for this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Since @kevinjqliu asked about it regarding the last skill I wrote, here is an export of the chat history: One of the important things to note in the chat history is that I intentionally exited the session and started a new session so that the skill would be applied without prior context. Then as I reviewed the code it generated I gave it feedback on the fact that it missed the aliases. And so then I had the agent update the skill it was using. I've found that this has to be an iterative process. The next step I'm going to take is to start a fresh session and have it review this PR, both the skill and the updates it makes. I'll keep iterating on the process of having the agent and myself review the code suggestions and update the skill. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a new AI “make-pythonic” skill and applies its recommendations to make datafusion-python function wrappers accept native Python literals (e.g., str, int, float) in places where callers previously had to wrap values with lit().
Changes:
- Added a reusable AI skill definition documenting how to audit and “pythonicize” function signatures.
- Updated many
python/datafusion/functions.pyAPIs to accept native types and internally coerce them toExpr.literal(...). - Updated doctest examples to demonstrate the simplified calling convention.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| python/datafusion/functions.py | Broad signature + coercion updates to accept native Python types; doctest examples updated accordingly. |
| .ai/skills/make-pythonic/SKILL.md | New skill documentation describing how to identify and implement pythonic argument coercions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
|
|
||
| def log(base: Expr, num: Expr) -> Expr: | ||
| def log(base: Expr | float, num: Expr) -> Expr: |
There was a problem hiding this comment.
The new type hints for log, power, and pow only allow float for base/exponent, but the coercion (Expr.literal(...)) will also accept int cleanly and the upstream signatures typically coerce “numeric”/“integer” for these parameters. To avoid incorrect/overly-restrictive typing for users and type-checkers, widen these hints to Expr | int | float (or at least include int) for the numeric literal parameters.
| def log(base: Expr | float, num: Expr) -> Expr: | |
| def log(base: Expr | int | float, num: Expr) -> Expr: |
| if not isinstance(base, Expr): | ||
| base = Expr.literal(base) | ||
| return Expr(f.log(base.expr, num.expr)) |
There was a problem hiding this comment.
The new type hints for log, power, and pow only allow float for base/exponent, but the coercion (Expr.literal(...)) will also accept int cleanly and the upstream signatures typically coerce “numeric”/“integer” for these parameters. To avoid incorrect/overly-restrictive typing for users and type-checkers, widen these hints to Expr | int | float (or at least include int) for the numeric literal parameters.
|
|
||
|
|
||
| def power(base: Expr, exponent: Expr) -> Expr: | ||
| def power(base: Expr, exponent: Expr | float) -> Expr: |
There was a problem hiding this comment.
The new type hints for log, power, and pow only allow float for base/exponent, but the coercion (Expr.literal(...)) will also accept int cleanly and the upstream signatures typically coerce “numeric”/“integer” for these parameters. To avoid incorrect/overly-restrictive typing for users and type-checkers, widen these hints to Expr | int | float (or at least include int) for the numeric literal parameters.
| if not isinstance(exponent, Expr): | ||
| exponent = Expr.literal(exponent) | ||
| return Expr(f.power(base.expr, exponent.expr)) | ||
|
|
||
|
|
||
| def pow(base: Expr, exponent: Expr) -> Expr: | ||
| def pow(base: Expr, exponent: Expr | float) -> Expr: |
There was a problem hiding this comment.
The new type hints for log, power, and pow only allow float for base/exponent, but the coercion (Expr.literal(...)) will also accept int cleanly and the upstream signatures typically coerce “numeric”/“integer” for these parameters. To avoid incorrect/overly-restrictive typing for users and type-checkers, widen these hints to Expr | int | float (or at least include int) for the numeric literal parameters.
| if not isinstance(regex, Expr): | ||
| regex = Expr.literal(regex) | ||
|
|
||
| def _to_raw(val: Any) -> Any: | ||
| if val is None: | ||
| return None | ||
| if not isinstance(val, Expr): | ||
| val = Expr.literal(val) | ||
| return val.expr | ||
|
|
||
| return Expr( | ||
| f.regexp_instr( | ||
| values.expr, | ||
| regex.expr, | ||
| start, | ||
| n, | ||
| flags, | ||
| sub_expr, | ||
| _to_raw(start), | ||
| _to_raw(n), | ||
| _to_raw(flags), | ||
| _to_raw(sub_expr), | ||
| ) | ||
| ) |
There was a problem hiding this comment.
regexp_instr introduces a nested _to_raw helper, which is inconsistent with the inline coercion approach used throughout the rest of this file (and conflicts with the newly-added skill guidance that explicitly says not to introduce helper functions for coercion). Recommend expanding this to explicit inline conversions per-argument (similar to regexp_count / string_to_array) so the coercion is uniform and easier to scan/review. Also, the use of Any here requires Any to be in scope for annotation evaluation; avoiding Any (or ensuring it’s imported / annotations are deferred) prevents potential runtime issues.
| def date_part(part: Expr | str, date: Expr) -> Expr: | ||
| """Extracts a subfield from the date. | ||
|
|
||
| Args: | ||
| part: The part of the date to extract. Must be one of ``"year"``, | ||
| ``"month"``, ``"day"``, ``"hour"``, ``"minute"``, ``"second"``, etc. |
There was a problem hiding this comment.
The docstrings describe part as an enum-like literal (e.g., "year", "month", …), but the signatures still accept Expr | str. If Expr is supported only for backward compatibility (and/or the Rust implementation requires a literal scalar), please document that expectation explicitly (e.g., “must be a literal scalar string; column expressions are not supported”) to avoid misleading users. Alternatively (more breaking), consider narrowing the type to str if non-literal expressions are not meaningful here.
| if not isinstance(part, Expr): | ||
| part = Expr.literal(part) | ||
| return Expr(f.date_part(part.expr, date.expr)) |
There was a problem hiding this comment.
The docstrings describe part as an enum-like literal (e.g., "year", "month", …), but the signatures still accept Expr | str. If Expr is supported only for backward compatibility (and/or the Rust implementation requires a literal scalar), please document that expectation explicitly (e.g., “must be a literal scalar string; column expressions are not supported”) to avoid misleading users. Alternatively (more breaking), consider narrowing the type to str if non-literal expressions are not meaningful here.
|
|
||
|
|
||
| def date_trunc(part: Expr, date: Expr) -> Expr: | ||
| def date_trunc(part: Expr | str, date: Expr) -> Expr: |
There was a problem hiding this comment.
The docstrings describe part as an enum-like literal (e.g., "year", "month", …), but the signatures still accept Expr | str. If Expr is supported only for backward compatibility (and/or the Rust implementation requires a literal scalar), please document that expectation explicitly (e.g., “must be a literal scalar string; column expressions are not supported”) to avoid misleading users. Alternatively (more breaking), consider narrowing the type to str if non-literal expressions are not meaningful here.
| if not isinstance(part, Expr): | ||
| part = Expr.literal(part) | ||
| return Expr(f.date_trunc(part.expr, date.expr)) |
There was a problem hiding this comment.
The docstrings describe part as an enum-like literal (e.g., "year", "month", …), but the signatures still accept Expr | str. If Expr is supported only for backward compatibility (and/or the Rust implementation requires a literal scalar), please document that expectation explicitly (e.g., “must be a literal scalar string; column expressions are not supported”) to avoid misleading users. Alternatively (more breaking), consider narrowing the type to str if non-literal expressions are not meaningful here.
Update SKILL.md to prevent three classes of issues: clarify that float already accepts int per PEP 484 (avoiding redundant int | float that fails ruff PYI041), add backward-compat rule for Category B so existing Expr params aren't removed, and add guidance for inline coercion with many optional nullable params instead of local helpers. Replace regexp_instr's _to_raw() helper with inline coercion matching the pattern used throughout the rest of the file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
None
Rationale for this change
This adds an AI agent skill that can be used to search the repository and identify cases where we can make our interface more intuitive to users. Attached is also the diff recommended when using this skill in coordination with our existing agent directives about how to write functions.
What changes are included in this PR?
Add skill for searching repository for functions, investigating their upstream equivalent, and update the function inputs where appropriate.
I ran the skill and updated many function signatures.
Are there any user-facing changes?
Improved type hints and inputs allowed in Python.