ProgramBench
ProgramBench is a software-engineering benchmark that tests whether AI agents can reconstruct a complete, working codebase from only a compiled binary and its documentation — no source code, no decompilation, no internet access allowed during the task.
Released on May 5, 2026, by researchers at Meta FAIR, Stanford, and Harvard, the benchmark covers 200 tasks ranging from compact CLI tools to major projects such as FFmpeg, SQLite, and the PHP interpreter, verified by more than 248,000 agent-generated behavioral tests. No model fully solves a single task; Claude Opus 4.7 leads with 3% of tasks almost resolved.
A ProgramBench agent receives the compiled `jq` binary and its man page. Without seeing a single line of source, it must choose a programming language, design an architecture, and produce a build-ready codebase whose output matches `jq` across thousands of edge-case inputs — the same task a human engineer would need days to complete.
Think of it as an architectural drawing contest without blueprints: you see only the finished building and must redraw the plans yourself.
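The idea of matching a reference binary's behavior can be illustrated with a minimal sketch. This is not the benchmark's actual harness, whose test format is not described here; it assumes a behavioral test simply feeds the same stdin to both programs and compares stdout and exit code. All names (`Observation`, `observe`, `pass_fraction`, `REFERENCE_CMD`, `CANDIDATE_CMD`) are hypothetical, and two tiny Python one-liners stand in for the reference binary and the reconstructed program.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one program run."""
    stdout: bytes
    exit_code: int


def observe(command, stdin_data, timeout=10.0):
    """Run a command with the given stdin bytes and record what it did."""
    completed = subprocess.run(
        command,
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return Observation(completed.stdout, completed.returncode)


def pass_fraction(reference_cmd, candidate_cmd, inputs):
    """Fraction of inputs on which the candidate matches the reference."""
    if not inputs:
        raise ValueError("at least one test input is required")
    passed = sum(
        observe(reference_cmd, data) == observe(candidate_cmd, data)
        for data in inputs
    )
    return passed / len(inputs)


# Hypothetical stand-ins: a "reference" that upper-cases stdin and a
# "candidate" rebuild with a deliberate bug on the letter Z.
_UPPER = "import sys; sys.stdout.write(sys.stdin.read().upper())"
_BUGGY = "import sys; sys.stdout.write(sys.stdin.read().upper().replace('Z', 'z'))"
REFERENCE_CMD = [sys.executable, "-c", _UPPER]
CANDIDATE_CMD = [sys.executable, "-c", _BUGGY]
```

Calling `pass_fraction(REFERENCE_CMD, CANDIDATE_CMD, [b"abc", b"xyz", b""])` compares the two stand-ins on three inputs; the candidate diverges only on the input containing a `z`. A real setup would point the two commands at the provided binary and the agent's build output instead.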
Search Interest
- Nascent: 0–7 days
- Emergent: 8–30 days
- Validating: 31–90 days (current stage)
- Rising: 91–180 days
- Established: 180+ days
Why is it emerging now?
Published May 5, 2026 by Meta FAIR, Stanford, and Harvard — the SWE-bench team — ProgramBench resets the difficulty bar for coding AI. Nine frontier models score 0% fully resolved, sparking debate about the gap between LLM code generation and real software engineering.
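To make the gap between "fully resolved" and "almost resolved" concrete, here is a small sketch of how such rates could be computed from per-task test pass fractions. It assumes a task counts as fully resolved only when every test passes and as almost resolved when at least 95% of its tests pass (so fully resolved tasks would also count as almost resolved); the exact definitions used by the benchmark may differ. The function name `summarize` and the sample numbers are made up for illustration.

```python
def summarize(pass_fractions, almost_threshold=0.95):
    """Return the share of tasks that are fully and almost resolved.

    pass_fractions holds one value per task: the fraction of that
    task's behavioral tests the reconstructed program passes.
    """
    if not pass_fractions:
        raise ValueError("at least one task is required")
    total = len(pass_fractions)
    fully = sum(1 for fraction in pass_fractions if fraction >= 1.0)
    almost = sum(1 for fraction in pass_fractions if fraction >= almost_threshold)
    return {
        "fully_resolved": fully / total,
        "almost_resolved": almost / total,
    }


# Illustrative data only: 200 tasks, 6 of which pass 97% of their tests
# while the rest pass 40%. No task passes every test.
example_fractions = [0.97] * 6 + [0.40] * 194
example_summary = summarize(example_fractions)
```

With these made-up inputs the summary reports no fully resolved tasks and 6 of 200 almost resolved, which shows how a model can pass most tests on a few tasks and still score zero on the strict metric.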
Outlook
6-month signal projection and commercial timeline.
Benchmark from the SWE-bench team — brand credibility will drive sustained researcher attention and media cycles.
Risk · Competing benchmarks like MirrorCode (Epoch AI) may fragment attention if they publish contradictory results.
Analogs · SWE-bench · HumanEval · BIG-Bench
- Now: Research coverage open. AI benchmarking newsletter and explainer content faces zero SERP competition today.
- 3-6 months: Leaderboard tooling emerges. Comparison tools and tutorials attract developer traffic as model scores improve.
- 6-12 months: Enterprise training signal. Companies adopt ProgramBench tasks to evaluate coding agent vendors; consulting gap opens.
Competition & Opportunity for term “ProgramBench”
Three heuristic signals derived from the tracked queries, the term's monetization cards, and its cluster neighbors. Directional, not audited.
Ideas for term “ProgramBench”
Buildable pitches — turn this term into an article, site, product, post, newsletter, video, or course. Steal any card and run with it.
Zero competition in SERP for this comparison; developers actively asking what distinguishes the two. High intent, long-tail traffic.
Definitional explainer with immediate search demand. Ideal anchor for a benchmarks-focused content site.
pip-installable, MIT-licensed tool — tutorial content has long-tail developer traffic and no existing guides yet.
The official leaderboard is static snapshots. A lightweight service alerting subscribers when scores change fills the gap immediately.
Researchers need task-level drill-down the current leaderboard doesn't provide. Small open-source tool with clear niche.
Live side-by-side demo with real benchmark tasks. Shareable format for YouTube; existing media hasn't done the visual comparison yet.
Recurring issue tracking when and which models breach the 0% ceiling. Clear subscription hook for AI researcher audience.
Nine frontier models. Two hundred real programs. Zero fully solved. Not because the models are bad — because for the first time, the benchmark refused to make it easy.
Claude Opus 4.7 passes 95% of tests on 3% of tasks. That's not close to solving software engineering; it's proof that architectural reasoning at scale is a different capability from autocomplete.
When the SWE-bench team shipped a new benchmark in May 2026, every major AI lab scored exactly 0% on the hardest metric. Here's what ProgramBench actually tests and why the 0% result is the most informative score in AI right now.
What People Search
Long-tail queries from Google Suggest + Trends. Volume and competition are heuristics — directional, not audited. Content Type comes from query shape.
SERP of term “ProgramBench”
What searchers see today — organic results on top, paid ads if anyone's bidding. Ad density is a real-time commercial signal.
FAQ
When did ProgramBench emerge?
ProgramBench emerged publicly around 2026-05-05 (about 42 days ago as of 2026-06-16). EarlyTerms first recorded a pipeline signal on 2026-05-07.
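The age figure and the lifecycle stage can be checked with a few lines of date arithmetic. This sketch assumes the stage boundaries listed under Search Interest and treats day 180 as still "Rising", since the source ranges overlap at that value; `STAGES` and `stage_for` are hypothetical names.

```python
from datetime import date

# Stage boundaries in days since public emergence (inclusive ranges).
STAGES = [
    ("Nascent", 0, 7),
    ("Emergent", 8, 30),
    ("Validating", 31, 90),
    ("Rising", 91, 180),
]


def stage_for(emerged, as_of):
    """Return (age in days, stage name) for a term on a given date."""
    age_days = (as_of - emerged).days
    if age_days < 0:
        raise ValueError("as_of must not be earlier than the emergence date")
    for name, first_day, last_day in STAGES:
        if first_day <= age_days <= last_day:
            return age_days, name
    # Anything older than the last listed range is treated as Established.
    return age_days, "Established"


age, stage = stage_for(date(2026, 5, 5), date(2026, 6, 16))
```

For the dates quoted above, `age` is 42 and `stage` is "Validating", which matches the stage marked as current.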
Related Terms
Other terms in the same space — aliases, subtypes, competitors, and neighbors to explore next.
- Part of: agentic-coding. Agentic coding is the software-development pattern where an autonomous AI agent plans, writes, tests, and iterates on code against a…
- Part of: coding-agents. Coding Agents is the category name for AI developer tools that act on code autonomously: reading a repo, planning a change, editing…
- Related: managed-agents. Managed Agents is an infrastructure paradigm where cloud platforms host, orchestrate, and operate AI agents as a service.
- Related: agent-harness. An agent harness is the middleware between a large language model and the real world: code that runs the agent loop, calls tools,…
- Related: deep-research. Deep Research is an agentic AI capability that autonomously browses the web, synthesizes hundreds of sources, and produces a cited…
Sources
Primary URLs this report cites — open any to verify the claim yourself.
- 01 ProgramBench paper, arXiv:2605.03546 (May 5, 2026), arxiv.org
- 02 facebookresearch/ProgramBench, official GitHub repo, github.com
- 03 ProgramBench.com, live leaderboard, programbench.com
- 04 HN: ProgramBench, 139 points, 72 comments, news.ycombinator.com
- 05 ProgramBench-Tests dataset, HuggingFace, huggingface.co
- 06 Emergent Mind, "ProgramBench: Evaluating LM Software Reconstruction", emergentmind.com
- 07 ProgramBench paper full text, arXiv HTML, arxiv.org