
[Question] Best practices for token-efficient incremental code modifications via SDK sessions? #1024

@moti-malka

Description


Use Case

Building a web-based code generation platform using the Copilot SDK (Python, v0.2.0). Users create and iteratively modify single-file projects (500–5,000+ lines of HTML/CSS/JS). A typical session involves 5–15 modification requests like "make the character bigger" or "change the background color" on an existing project.

Current Approach (Pseudocode)

For each modification request:
  1. Reset session (disconnect + create fresh session)  ← clears history
  2. Build prompt:
     - System message (~1K tokens, same every time)
     - Full current project code (2K–30K tokens)
     - User's modification request (~50 tokens)
  3. Send prompt via session.send()
  4. Parse response for line-based patches (REPLACE_LINES / INSERT_AFTER / DELETE_LINES)
  5. If patches fail → retry with "return full updated code" (~doubles tokens)
  6. Apply patches/extract code → validate → finalize
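Steps 4 and 6 can be sketched as a small line-based patch applier. The directive syntax below (`REPLACE_LINES start-end` followed by replacement lines, etc., 1-indexed inclusive ranges) is an assumption for illustration — the issue doesn't pin down our exact wire format:

```python
import re

DIRECTIVE = re.compile(r"^(REPLACE_LINES|INSERT_AFTER|DELETE_LINES)\s+(\d+)(?:-(\d+))?$")

def apply_patches(source: str, patch_text: str) -> str:
    """Apply line-based patches (1-indexed, inclusive ranges) to source.
    Directive format is an illustrative assumption, not the SDK's."""
    lines = source.splitlines()
    ops = []  # (op, start, end, body_lines)
    current = None
    for raw in patch_text.splitlines():
        m = DIRECTIVE.match(raw.strip())
        if m:
            op, start, end = m.group(1), int(m.group(2)), m.group(3)
            current = (op, start, int(end) if end else start, [])
            ops.append(current)
        elif current is not None:
            current[3].append(raw)
    # Apply bottom-up so earlier line numbers stay valid after each edit.
    for op, start, end, body in sorted(ops, key=lambda o: o[1], reverse=True):
        if op == "REPLACE_LINES":
            lines[start - 1:end] = body
        elif op == "DELETE_LINES":
            del lines[start - 1:end]
        elif op == "INSERT_AFTER":
            lines[start:start] = body
    return "\n".join(lines)
```

Applying bottom-up means the model's line numbers (which refer to the original file) never need renumbering between patches.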

Optional thinking pre-pass: For complex requests, a separate model call analyzes the code first (GPT-5.2 with reasoning), then the plan is prepended to the main prompt → effectively 2× input tokens.

The Problem: Token Waste

| Scenario | Est. Input Tokens | Est. Output Tokens | Notes |
| --- | --- | --- | --- |
| Small project modification | ~4K | ~2K | 500-line project |
| Medium project modification | ~12K | ~8K | 2,000-line project |
| Large project modification | ~35K | ~30K | 5,000+ lines |
| + Thinking pre-pass | 2× input | same | Two full model calls |
| + Patch failure retry | 3× total | 2× output | Re-sends full code asking for complete output |

Key observations:

  • cache_read_tokens is consistently 0 in our session.usage events — even though we track it
  • Every modification resets the session → no context reuse between turns
  • For a "change the button color" request on a 3000-line project, we're sending ~15K tokens of unchanged code
  • With 10 modifications per session, that's ~150K+ input tokens for what should be incremental edits
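The figures above follow from simple arithmetic. Assuming roughly 5 input tokens per code line (which reproduces the ~15K-for-3000-lines observation; the constants are illustrative, not measured):

```python
def estimate_session_input_tokens(project_lines, modifications,
                                  tokens_per_line=5, system_tokens=1000,
                                  request_tokens=50):
    """Rough per-session input-token cost when the full file is re-sent
    every turn. tokens_per_line=5 is an assumption chosen to match the
    ~15K-tokens-for-3000-lines figure observed above."""
    per_turn = system_tokens + project_lines * tokens_per_line + request_tokens
    return per_turn * modifications

# 3000-line project, 10 modifications: ~160K input tokens,
# almost all of it unchanged code re-sent on every turn.
total = estimate_session_input_tokens(3000, 10)
```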

Specific Questions

1. Prompt Caching

I noticed from other issues (e.g. #1005 session logs) that cacheReadTokens can be non-zero. Our setup always shows 0.

  • Is prompt caching automatic when the system message + prefix stays the same?
  • Does resetting the session (disconnect + create_session) break cache eligibility?
  • Would keeping the session alive across turns (instead of resetting) enable caching of the static system message portion?
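For what it's worth, provider-side prompt caches generally key on a byte-identical prefix, so our prompt builder already tries to keep the static parts first and the varying part last. Whether the SDK forwards cache hits into `cache_read_tokens` is exactly the open question; this sketch only shows the prefix discipline (the delimiter strings are our own, not SDK conventions):

```python
def build_prompt(system_message: str, project_code: str, user_request: str) -> str:
    """Keep the static parts (system message + current code) as a
    byte-identical prefix across turns and put the only varying part
    (the user's request) at the end. Any whitespace or ordering drift
    in the prefix defeats exact-prefix caching."""
    prefix = f"{system_message}\n\n--- CURRENT PROJECT CODE ---\n{project_code}\n"
    return prefix + f"\n--- REQUEST ---\n{user_request}\n"
```

Two consecutive turns on an unmodified project then share everything up to the request, which is the portion a prefix cache could reuse.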

2. Session Continuity vs. Reset

We currently reset the session before each modification to avoid accumulating conversation history (since we always embed the full current code in the prompt anyway). But this may be preventing prompt caching.

Trade-off question: Is it better to:

  • (A) Keep the session alive, let history accumulate, and rely on compaction (ref #1012, "Expose used token information after compaction") when context grows too large?
  • (B) Reset each time but find a way to enable prompt caching?
  • (C) Some hybrid — e.g., keep session alive for N turns, then reset?

What's the recommended pattern for repeated modifications to the same large context?
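Option (C) is what we'd prototype if nothing better exists. A minimal sketch of the policy side, decoupled from the SDK (the thresholds are illustrative, not SDK recommendations):

```python
class SessionPolicy:
    """Sketch of hybrid option (C): reuse the live session (preserving any
    prefix-cache benefit) until either N turns have passed or the estimated
    accumulated context crosses a token budget, then reset."""

    def __init__(self, max_turns=8, max_context_tokens=100_000):
        self.max_turns = max_turns
        self.max_context_tokens = max_context_tokens
        self.turns = 0
        self.context_tokens = 0

    def should_reset(self, next_turn_tokens: int) -> bool:
        # Reset before the turn that would blow past either budget.
        return (self.turns >= self.max_turns
                or self.context_tokens + next_turn_tokens > self.max_context_tokens)

    def record_turn(self, turn_tokens: int) -> None:
        self.turns += 1
        self.context_tokens += turn_tokens

    def reset(self) -> None:
        self.turns = 0
        self.context_tokens = 0
```

The caller would check `should_reset()` before each `session.send()` and only then disconnect and recreate the session.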

3. Reducing Output Tokens

The model frequently ignores patch/diff instructions and returns the entire file instead of targeted changes. This wastes output tokens proportional to project size.

  • Are there SDK-level mechanisms to constrain output format?
  • Has anyone found reliable prompting strategies that consistently produce diffs rather than full rewrites?
  • Would reasoning_effort: "low" help for simple modifications while keeping output focused?
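Absent an SDK-level constraint, our current plan is a cheap client-side guard: detect when the model ignored the diff instruction before paying to apply anything. A sketch (the 0.8 size ratio is an illustrative threshold, not an SDK constant):

```python
def looks_like_full_rewrite(response: str, original: str,
                            directives=("REPLACE_LINES", "INSERT_AFTER", "DELETE_LINES"),
                            size_ratio=0.8) -> bool:
    """Heuristic guard for the patch-failure path: treat the response as a
    full-file rewrite (rather than targeted patches) when it contains no
    patch directive, or when it is nearly as long as the original file."""
    if not any(d in response for d in directives):
        return True
    return len(response) >= size_ratio * len(original)
```

This at least lets us branch into the "extract full code" path immediately instead of attempting a patch parse that will fail and trigger a costly retry.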

4. Thinking Pre-Pass Overhead

For complex requests, we run a separate "thinking" model call (GPT-5.2 with reasoning) to produce a plan, then feed that plan + full code to the main model. This doubles the input token cost.

  • With reasoning_effort on the main model, is a separate thinking pre-pass still justified?
  • Any patterns for "think-then-act" that avoid sending the full context twice?
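One variant we're considering: feed the planning call a structural outline instead of the full file, so the pre-pass costs a fraction of the main call. The regex set below is a rough heuristic for single-file HTML/CSS/JS projects, not a parser:

```python
import re

def code_outline(source: str) -> str:
    """Cheaper think-then-act input: a line-numbered outline of
    declarations, selectors, and key tags instead of the full file.
    The patterns are illustrative heuristics, not a real parser."""
    interesting = re.compile(
        r"^\s*(function\s+\w+|const\s+\w+\s*=|class\s+\w+"
        r"|<(?:div|canvas|script|style|body|section)\b"
        r"|[.#][\w-]+\s*\{)")
    out = []
    for n, line in enumerate(source.splitlines(), 1):
        if interesting.match(line):
            out.append(f"{n}: {line.strip()}")
    return "\n".join(out)
```

The line numbers in the outline also give the planning model a way to reference regions that the main call can then target with patches.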

5. Large File Strategies

For projects >5000 lines, we currently force a full rewrite (no patches).

  • Are there recommended patterns for chunked/windowed modifications — e.g., only sending the relevant portion of the file + surrounding context?
  • Does the SDK's file handling (edit tool) use any internal optimization we could leverage instead of manual prompt construction?
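The windowed approach we have in mind looks roughly like this: locate the region the request touches (e.g. via a fuzzy text match, not shown), send only that window plus surrounding context, then splice the edited window back. A minimal sketch:

```python
def extract_window(source: str, center_line: int, context: int = 60):
    """Take only the lines around a 1-indexed center_line plus `context`
    lines on each side. Returns the window text and its 0-indexed
    (start, end) line span so the edited window can be spliced back."""
    lines = source.splitlines()
    start = max(0, center_line - 1 - context)
    end = min(len(lines), center_line - 1 + context + 1)
    return "\n".join(lines[start:end]), (start, end)

def splice_window(source: str, edited_window: str, span) -> str:
    """Replace the original window span with the model-edited window.
    The window may grow or shrink; the splice handles either."""
    lines = source.splitlines()
    start, end = span
    lines[start:end] = edited_window.splitlines()
    return "\n".join(lines)
```

The open question is whether the SDK's edit tool already does something equivalent internally, which would make this manual plumbing redundant.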

Environment

  • SDK: github-copilot-sdk==0.2.0 (Python)
  • Models: GPT-5.2 (thinking), Claude Sonnet 4 / Claude Opus (main generation)
  • Session config: streaming=True, available_tools=["ask_user"], session reset per modification

Related

Would love to hear how others in the community handle similar "iterative code modification" workflows with the SDK. Any insights on which of these optimizations yield the biggest token savings would be greatly appreciated! 🙏

Labels

question (Further information is requested)