ProgramBench

Validating · Emerged 2026-05-05 · 42 days old · Last reviewed 2026-05-07

ProgramBench is a software-engineering benchmark that tests whether AI agents can reconstruct a complete, working codebase from only a compiled binary and its documentation — no source code, no decompilation, no internet access allowed during the task.

Released May 5, 2026 by researchers at Meta FAIR, Stanford, and Harvard, the benchmark covers 200 tasks ranging from compact CLI tools to major projects like FFmpeg, SQLite, and the PHP interpreter, verified by 248,000+ agent-generated behavioral tests. No model fully solves a single task; Claude Opus 4.7 leads with 3% of tasks almost-resolved.


A ProgramBench agent receives the compiled `jq` binary and its man page. Without seeing a single line of source, it must choose a programming language, design an architecture, and produce a build-ready codebase whose output matches `jq` across thousands of edge-case inputs — the same task a human engineer would need days to complete.

Think of it as a blindfolded architectural drawing contest: you see only the finished building, never the blueprints.
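The output-matching idea in the `jq` example can be illustrated with a small differential check: feed the same input to the reference binary and to the rebuilt program, then compare what comes back. This is a minimal sketch under stated assumptions, not ProgramBench's actual harness. The function names, the choice to compare only exit code and stdout, and the timeout are all hypothetical.

```python
import subprocess
from typing import Iterable, Sequence, Tuple


def run_case(cmd: Sequence[str], stdin_text: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Run one command and capture the behavior we compare: exit code and stdout."""
    proc = subprocess.run(
        list(cmd),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def behavior_matches(reference_cmd: Sequence[str], candidate_cmd: Sequence[str], stdin_text: str) -> bool:
    """True when both programs give the same exit code and stdout for this input."""
    return run_case(reference_cmd, stdin_text) == run_case(candidate_cmd, stdin_text)


def pass_rate(reference_cmd: Sequence[str], candidate_cmd: Sequence[str], inputs: Iterable[str]) -> float:
    """Fraction of inputs on which the candidate reproduces the reference behavior."""
    cases = list(inputs)
    if not cases:
        raise ValueError("need at least one input")
    passed = sum(behavior_matches(reference_cmd, candidate_cmd, text) for text in cases)
    return passed / len(cases)
```

A call might look like `pass_rate(["jq", "."], ["./my-jq", "."], ['{"a": 1}', '[1, 2'])`, where `./my-jq` is a hypothetical rebuilt binary. A real harness would likely also compare stderr, files written, and command-line argument handling.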

Search Interest

Peak ~259/mo · updated 2026-06-12

[Chart: monthly search volume, y-axis 0 to ~259/mo, x-axis 2026-05-14 to 2026-06-12]

Term Lifecycle

Nascent · 0–7 days
Emergent · 8–30 days
Validating · 31–90 days ← now
Rising · 91–180 days
Established · 180 days +
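The lifecycle stage follows from the term's age in days. The sketch below maps an emergence date to a stage using the day ranges listed above. The function name and table layout are hypothetical, and because the listed ranges overlap at day 180, treating day 180 as Rising is an assumption.

```python
from datetime import date

# (stage, first day, last day), taken from the lifecycle ranges in this report.
# Assumption: day 180 counts as Rising; anything later is Established.
STAGES = [
    ("Nascent", 0, 7),
    ("Emergent", 8, 30),
    ("Validating", 31, 90),
    ("Rising", 91, 180),
]


def lifecycle_stage(emerged: date, as_of: date) -> str:
    """Return the lifecycle stage for a term that emerged on `emerged`, evaluated on `as_of`."""
    age_days = (as_of - emerged).days
    if age_days < 0:
        raise ValueError("as_of is earlier than the emergence date")
    for name, first_day, last_day in STAGES:
        if first_day <= age_days <= last_day:
            return name
    return "Established"


# 2026-05-05 to 2026-06-16 is 42 days, which falls in the 31-90 day range.
print(lifecycle_stage(date(2026, 5, 5), date(2026, 6, 16)))  # prints "Validating"
```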

Why is it emerging now?

TL;DR

Published May 5, 2026 by Meta FAIR, Stanford, and Harvard — the SWE-bench team — ProgramBench resets the difficulty bar for coding AI. Nine frontier models score 0% fully resolved, sparking debate about the gap between LLM code generation and real software engineering.

5 forces driving coverage

arXiv / Meta FAIR

ProgramBench: Can Language Models Rebuild Programs From Scratch?

200 tasks, 248K behavioral tests; none of 9 evaluated models fully resolves any task. Best: Opus 4.7 at 3% almost-resolved.

May 5, 2026

facebookresearch/ProgramBench

Official benchmark repo — MIT license, pip-installable

321 ⭐

Hacker News

ProgramBench: Can language models rebuild programs from scratch?

May 7, 2026 139 points · 72 comments

ProgramBench

Live leaderboard — Opus 4.7 leads at 3%, all others at 0%

Claude Opus 4.7 leads with 0% resolved and 3.0% almost-resolved; GPT 5.4 and Gemini 3.1 Pro score 0% on both metrics.

May 2026

Emergent Mind

ProgramBench: Evaluating LM Software Reconstruction

Models favor monolithic, single-file implementations with longer functions — diverging sharply from human-written modular design.

May 2026
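The two leaderboard metrics quoted in these cards can be read as counts over per-task test pass rates. The sketch below is one plausible way to compute them, not the benchmark's published scoring code. The 95% threshold is an assumption taken from this report's "passes 95% of tests" phrasing, counting fully resolved tasks as almost-resolved too is an assumption, and the function name and input shape are hypothetical.

```python
from typing import Dict, Mapping

# Assumption: "almost-resolved" means at least 95% of a task's tests pass.
ALMOST_RESOLVED_THRESHOLD = 0.95


def summarize(pass_rates: Mapping[str, float],
              threshold: float = ALMOST_RESOLVED_THRESHOLD) -> Dict[str, float]:
    """Turn per-task pass rates (0.0 to 1.0) into leaderboard-style percentages."""
    total = len(pass_rates)
    if total == 0:
        raise ValueError("need at least one task")
    # A task is resolved only when every test passes.
    resolved = sum(1 for rate in pass_rates.values() if rate >= 1.0)
    # Assumption: resolved tasks also count as almost-resolved.
    almost = sum(1 for rate in pass_rates.values() if rate >= threshold)
    return {
        "resolved_pct": 100 * resolved / total,
        "almost_resolved_pct": 100 * almost / total,
    }
```

Under these assumptions, a model that clears the threshold on 6 of 200 tasks without fully passing any of them would show 0% resolved and 3% almost-resolved, the same shape as the leading score reported here.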

Outlook

6-month signal projection and commercial timeline.

Signal medium

Revenue moderate

Benchmark from the SWE-bench team — brand credibility will drive sustained researcher attention and media cycles.

Risk · Competing benchmarks like MirrorCode (Epoch AI) may fragment attention if they publish contradictory results.

Analogs · SWE-bench · HumanEval · BIG-Bench

Monetization timeline

now · Research coverage open

AI benchmarking newsletter and explainer content faces zero SERP competition today.

3-6mo · Leaderboard tooling emerges

Comparison tools and tutorials attract developer traffic as model scores improve.

6-12mo · Enterprise training signal

Companies adopt ProgramBench tasks to evaluate coding agent vendors; consulting gap opens.

Competition & Opportunity for term “ProgramBench”

Three heuristic signals derived from the tracked queries, the term's monetization cards, and its cluster neighbors. Directional, not audited.

Content Gap

1 query tracked

Led by General (1)

1 Suggest-only tail: long-tail opening

Revenue Potential

0% commercial-intent queries

2 monetization angles mapped

Mostly informational — pre-commercial

Build Difficulty

Medium

Stage: validating — incumbents warming up

1 / 10 default TLDs taken · oldest incumbent programbench.com (2008-09-09)

5 related terms already published

Heuristic · signals: tracked queries, term monetization cards, cluster neighbors

Ideas for term “ProgramBench”

Buildable pitches — turn this term into an article, site, product, post, newsletter, video, or course. Steal any card and run with it.

Article

ProgramBench vs SWE-bench: Which AI Coding Benchmark Actually Matters in 2026?

Zero competition in SERP for this comparison; developers actively asking what distinguishes the two. High intent, long-tail traffic.

Article

What Is ProgramBench and Why Every AI Model Scores 0%?

Definitional explainer with immediate search demand. Ideal anchor for a benchmarks-focused content site.

Article

How to Run ProgramBench Locally: Step-by-Step Setup Guide

pip-installable, MIT-licensed tool — tutorial content has long-tail developer traffic and no existing guides yet.

Product

A ProgramBench leaderboard tracker that monitors model score changes over time

The official leaderboard consists of static snapshots. A lightweight service that alerts subscribers when scores change fills the gap immediately.

Product

A subset-difficulty visualizer for ProgramBench tasks ranked by test pass rate

Researchers need task-level drill-down the current leaderboard doesn't provide. Small open-source tool with clear niche.

Video

I Gave GPT-5.4 and Claude Opus 4.7 the Same ProgramBench Task — Here's What Happened

Live side-by-side demo with real benchmark tasks. Shareable format for YouTube; existing media hasn't done the visual comparison yet.

Newsletter

Weekly AI Benchmarks Briefing anchored on ProgramBench score updates

Recurring issue tracking when and which models breach the 0% ceiling. Clear subscription hook for AI researcher audience.

Post HN / r/MachineLearning

ProgramBench Is the Benchmark That Finally Said No: Why 0% Fully Solved Matters More Than 87% on SWE-bench

Nine frontier models. Two hundred real programs. Zero fully solved. Not because the models are bad — because for the first time, the benchmark refused to make it easy.

Post LinkedIn / Newsletter

The Gap Between 'AI Writes Code' and 'AI Builds Software' Is Bigger Than You Think

Claude Opus 4.7 passes 95% of tests on 3% of tasks. That's not close to solving software engineering; it's proof that architectural reasoning at scale is a different capability from autocomplete.

Post YouTube / Tech media

Meta Just Released the AI Benchmark Nobody Can Beat — And That's the Point

When the SWE-bench team shipped a new benchmark in May 2026, every major AI lab scored exactly 0% on the hardest metric. Here's what ProgramBench actually tests and why the 0% result is the most informative score in AI right now.
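The leaderboard-tracker pitch among the cards above reduces to comparing two snapshots of scores and reporting what changed. The sketch below shows that core step only. The snapshot format (model name mapped to an almost-resolved percentage) and the function name are hypothetical, and fetching the leaderboard is left out because its data format is not described in this report.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[float], Optional[float]]


def diff_snapshots(old: Dict[str, float], new: Dict[str, float]) -> List[Change]:
    """List (model, old_score, new_score) for every model whose score differs.

    A model missing from one snapshot is reported with None on that side.
    """
    changes: List[Change] = []
    for model in sorted(set(old) | set(new)):
        before = old.get(model)
        after = new.get(model)
        if before != after:
            changes.append((model, before, after))
    return changes
```

An alerting service could run this against the previous stored snapshot on a schedule and notify subscribers only when the returned list is non-empty.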

What People Search

Long-tail queries from Google Suggest + Trends. Volume and competition are heuristics — directional, not audited. Content Type comes from query shape.

Keyword: programbench · Competition: Very Low · Content Type: General

Updated 2026-06-12 · sources: Google Trends, Google Suggest · Competition is heuristic

SERP of term “ProgramBench”

What searchers see today — organic results on top, paid ads if anyone's bidding. Ad density is a real-time commercial signal.

FAQ

What is ProgramBench?

Why is ProgramBench emerging now?

When did ProgramBench emerge?

Publicly emerged around 2026-05-05 (about 42 days ago as of 2026-06-16). EarlyTerms first recorded a pipeline signal on 2026-05-07.

Related Terms

Other terms in the same space — aliases, subtypes, competitors, and neighbors to explore next.

Also mentioned

Competitor: MirrorCode
Related: SWE-bench · HumanEval · BIG-Bench · mini-SWE-agent

Sources

Primary URLs this report cites — open any to verify the claim yourself.

Domain Availability

programbench.com
programbench.ai
programbench.io
programbench.dev
programbench.app
programbench.net
programbench.org
programbench.co
program-bench.com
program-bench.ai
program-bench.io
program-bench.dev
program-bench.app
program-bench.net
program-bench.org
program-bench.co

