AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.
Updated Mar 18, 2026 - Rust
SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution that exchanges information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. It achieves state-of-the-art performance on SWE-bench Verified.
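The three trajectory-level operators can be illustrated with a toy search loop. Everything here — the list-of-strings trajectory representation, the `score` function, and the operator bodies — is a hypothetical sketch of the general idea, not SE-Agent's actual implementation:

```python
import random

# Toy stand-ins for trajectory-level evolution. A "trajectory" is a list
# of step strings; score() is a made-up fitness function that rewards
# steps containing "fix".

def score(traj):
    return sum(1 for step in traj if "fix" in step)

def revise(traj):
    # Revision: locally edit one step of a single trajectory.
    i = random.randrange(len(traj))
    return traj[:i] + [traj[i] + "+fix"] + traj[i + 1:]

def recombine(a, b):
    # Recombination: splice a prefix of one path onto a suffix of another.
    cut = len(a) // 2
    return a[:cut] + b[cut:]

def refine(traj):
    # Refinement: drop steps that contribute nothing to the score.
    kept = [s for s in traj if "fix" in s]
    return kept or traj

def evolve(population, generations=20, seed=0):
    random.seed(seed)
    for _ in range(generations):
        a, b = random.sample(population, 2)
        candidates = [revise(a), recombine(a, b), refine(a)]
        population = sorted(population + candidates,
                            key=score, reverse=True)[:len(population)]
    return max(population, key=score)

best = evolve([["plan", "edit", "test"], ["read", "edit", "test"]])
```

The key point the sketch captures is that operators act on whole trajectories rather than on single next-step choices, so information from one reasoning path can flow into another.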
An LLM council that reviews your coding agent's every move
Lean orchestration platform for enterprise AI, where each decision costs hundreds. State-machine core, human-in-the-loop (HITL) as a first-class state, and corrections that accumulate. The first use case is a coding agent. Open research, early stage.
Model Context Protocol Benchmark Runner
Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.
Do MCP tools serialize in Claude Code? Empirical study: readOnlyHint controls parallelism, IPC overhead is ~5ms/call. Reproduces #14353.
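The `readOnlyHint` finding can be sketched as a partition over tool annotations. The tool names and the scheduler policy below are illustrative assumptions; only the `annotations.readOnlyHint` field comes from the MCP tool schema:

```python
# Hypothetical scheduler sketch: entries mimic an MCP tools/list result.
# A host may run tools hinted as read-only in parallel and serialize the
# rest; a missing hint is conservatively treated as not read-only.
tools = [
    {"name": "read_file",  "annotations": {"readOnlyHint": True}},
    {"name": "grep",       "annotations": {"readOnlyHint": True}},
    {"name": "write_file", "annotations": {"readOnlyHint": False}},
    {"name": "run_tests",  "annotations": {}},  # hint absent
]

def is_read_only(tool):
    return bool(tool.get("annotations", {}).get("readOnlyHint"))

parallel = [t["name"] for t in tools if is_read_only(t)]
serial   = [t["name"] for t in tools if not is_read_only(t)]

print(parallel)  # ['read_file', 'grep']
print(serial)    # ['write_file', 'run_tests']
```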
Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.
A technical guide and live-tracking repository for the world's top AI models, organized by coding, reasoning, and multimodal performance.
Squeeze verbose LLM agent tool output down to only the relevant lines
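A minimal version of that idea: keep only lines matching a relevance pattern, plus a line of surrounding context. The pattern and context width are illustrative defaults, not the project's actual configuration:

```python
import re

def squeeze(output, pattern=r"error|fail|assert", context=1):
    """Keep lines matching `pattern` (case-insensitive) plus `context`
    neighbours on each side; drop everything else."""
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if re.search(pattern, line, re.IGNORECASE):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    return "\n".join(lines[i] for i in sorted(keep))

log = "collecting...\nok test_a\nFAILED test_b\nAssertionError: 1 != 2\nok test_c\n"
print(squeeze(log))
```

Context lines matter in practice: a bare `AssertionError` line is far more useful to an agent when the failing test name above it survives the filter.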
A Rust reimplementation of mini-swe-agent with CLI task execution, benchmark runners, trajectory inspection, and multi-environment support.
Supplementary materials for SRE shadow-mode PR replay experiment
One-command SWE-bench eval harness in Go. Native ARM64 containers with 6.3x test runner speedup on Apple Silicon and AWS Graviton. Pre-built images on Docker Hub.
Reproducible benchmark framework for testing hypotheses about AI coding agents
This project explores how large language models (LLMs) perform on real-world software-engineering tasks, inspired by the SWE-bench benchmark. Using locally hosted models such as Llama 3 via Ollama, the tool evaluates code-repair capability on Python repositories through custom test cases and a lightweight scoring framework.