AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.
Updated Mar 18, 2026 - Rust
SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution that exchanges information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. It achieves state-of-the-art performance on SWE-bench Verified.
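The three trajectory-level operators can be illustrated with a toy search loop. Everything here — the list-of-strings trajectory representation, the `score` function, and the operator bodies — is a hypothetical sketch of the general idea, not SE-Agent's actual implementation:

```python
import random

# Toy stand-ins for trajectory-level evolution. A "trajectory" is a list
# of step strings; score() is a made-up fitness function that rewards
# steps containing "fix".

def score(traj):
    return sum(1 for step in traj if "fix" in step)

def revise(traj):
    # Revision: locally edit one step of a single trajectory.
    i = random.randrange(len(traj))
    return traj[:i] + [traj[i] + "+fix"] + traj[i + 1:]

def recombine(a, b):
    # Recombination: splice a prefix of one path onto a suffix of another.
    cut = len(a) // 2
    return a[:cut] + b[cut:]

def refine(traj):
    # Refinement: drop steps that contribute nothing to the score.
    kept = [s for s in traj if "fix" in s]
    return kept or traj

def evolve(population, generations=20, seed=0):
    random.seed(seed)
    for _ in range(generations):
        a, b = random.sample(population, 2)
        candidates = [revise(a), recombine(a, b), refine(a)]
        population = sorted(population + candidates,
                            key=score, reverse=True)[:len(population)]
    return max(population, key=score)

best = evolve([["plan", "edit", "test"], ["read", "edit", "test"]])
```

The key point the sketch captures is that operators act on whole trajectories rather than on single next-step choices, so information from one reasoning path can flow into another.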
An LLM council that reviews your coding agent's every move
Lean orchestration platform for enterprise AI, where each decision costs hundreds. State-machine core, human-in-the-loop (HITL) as a first-class state, and corrections that accumulate. The first use case is a coding agent. Open research, early stage.
Model Context Protocol Benchmark Runner
Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.
Do MCP tools serialize in Claude Code? Empirical study: readOnlyHint controls parallelism, IPC overhead is ~5ms/call. Reproduces #14353.
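The `readOnlyHint` finding can be sketched as a partition over tool annotations. The tool names and the scheduler policy below are illustrative assumptions; only the `annotations.readOnlyHint` field comes from the MCP tool schema:

```python
# Hypothetical scheduler sketch: entries mimic an MCP tools/list result.
# A host may run tools hinted as read-only in parallel and serialize the
# rest; a missing hint is conservatively treated as not read-only.
tools = [
    {"name": "read_file",  "annotations": {"readOnlyHint": True}},
    {"name": "grep",       "annotations": {"readOnlyHint": True}},
    {"name": "write_file", "annotations": {"readOnlyHint": False}},
    {"name": "run_tests",  "annotations": {}},  # hint absent
]

def is_read_only(tool):
    return bool(tool.get("annotations", {}).get("readOnlyHint"))

parallel = [t["name"] for t in tools if is_read_only(t)]
serial   = [t["name"] for t in tools if not is_read_only(t)]

print(parallel)  # ['read_file', 'grep']
print(serial)    # ['write_file', 'run_tests']
```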
Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.
A technical guide and live-tracking repository for the world's top AI models, organized by coding, reasoning, and multimodal performance.
Squeeze verbose LLM agent tool output down to only the relevant lines
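A minimal version of that idea: keep only lines matching a relevance pattern, plus a line of surrounding context. The pattern and context width are illustrative defaults, not the project's actual configuration:

```python
import re

def squeeze(output, pattern=r"error|fail|assert", context=1):
    """Keep lines matching `pattern` (case-insensitive) plus `context`
    neighbours on each side; drop everything else."""
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if re.search(pattern, line, re.IGNORECASE):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    return "\n".join(lines[i] for i in sorted(keep))

log = "collecting...\nok test_a\nFAILED test_b\nAssertionError: 1 != 2\nok test_c\n"
print(squeeze(log))
```

Context lines matter in practice: a bare `AssertionError` line is far more useful to an agent when the failing test name above it survives the filter.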
A Rust reimplementation of mini-swe-agent with CLI task execution, benchmark runners, trajectory inspection, and multi-environment support.
Supplementary materials for SRE shadow-mode PR replay experiment
One-command SWE-bench eval harness in Go. Native ARM64 containers with 6.3x test runner speedup on Apple Silicon and AWS Graviton. Pre-built images on Docker Hub.
Reproducible benchmark framework for testing hypotheses about AI coding agents
This project explores how large language models (LLMs) perform on real-world software-engineering tasks, inspired by the SWE-bench benchmark. Using locally hosted models such as Llama 3 via Ollama, the tool evaluates code-repair capability on Python repositories through custom test cases and a lightweight scoring framework.