EarlyTerms

ProgramBench

Validating · Emerged · 42 days old · Last reviewed

ProgramBench is a software-engineering benchmark that tests whether AI agents can reconstruct a complete, working codebase from only a compiled binary and its documentation — no source code, no decompilation, no internet access allowed during the task.

Released May 5, 2026 by researchers at Meta FAIR, Stanford, and Harvard, the benchmark covers 200 tasks ranging from compact CLI tools to major projects like FFmpeg, SQLite, and the PHP interpreter, verified by 248,000+ agent-generated behavioral tests. No model fully solves a single task; Claude Opus 4.7 leads with 3% of tasks almost resolved.
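The two headline metrics can be illustrated with a short sketch. It assumes a task counts as fully resolved when every behavioral test passes and as almost resolved when at least 95% pass; the 95% cutoff is inferred from the figures quoted in this report rather than confirmed against the official harness, and the function name and sample pass rates are hypothetical.

```python
from typing import Sequence

# Assumed cutoff for "almost resolved"; not confirmed by the official harness.
ALMOST_RESOLVED_THRESHOLD = 0.95


def summarize(pass_rates: Sequence[float]) -> dict[str, float]:
    """Return the share of tasks that are fully and almost resolved.

    pass_rates holds one value per task: the fraction of that task's
    behavioral tests the reconstructed program passed (0.0 to 1.0).
    Under this definition a fully resolved task also counts as almost resolved.
    """
    if not pass_rates:
        raise ValueError("need at least one task")
    total = len(pass_rates)
    fully = sum(1 for rate in pass_rates if rate >= 1.0)
    almost = sum(1 for rate in pass_rates if rate >= ALMOST_RESOLVED_THRESHOLD)
    return {"fully_resolved": fully / total, "almost_resolved": almost / total}


# Hypothetical pass rates for five tasks (illustrative only, not real results).
example = summarize([0.97, 0.40, 0.88, 0.12, 0.95])
print(example)
```

Reporting both numbers separates "close on a few tasks" from "done": a model can have a nonzero almost-resolved share while its fully-resolved share stays at zero.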

A ProgramBench agent receives the compiled `jq` binary and its man page. Without seeing a single line of source, it must choose a programming language, design an architecture, and produce a build-ready codebase whose output matches `jq` across thousands of edge-case inputs — the same task a human engineer would need days to complete.

Think of it as a blindfolded architectural drawing contest: you see only the finished building, never the blueprints.
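To make "output matches across edge-case inputs" concrete, here is a minimal sketch of a behavioral comparison. It is not the official ProgramBench harness: the helper names `run` and `match_rate` are hypothetical, the check compares only exit code and stdout, and two small Python one-liners stand in for a reference binary and a reconstructed program.

```python
import subprocess
import sys
from typing import Sequence


def run(cmd: Sequence[str], stdin_text: str) -> tuple[int, str]:
    """Run a command with the given stdin; return (exit code, stdout)."""
    proc = subprocess.run(
        list(cmd), input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout


def match_rate(
    reference: Sequence[str], candidate: Sequence[str], inputs: Sequence[str]
) -> float:
    """Fraction of inputs on which both programs agree on exit code and stdout."""
    if not inputs:
        raise ValueError("need at least one input")
    agree = sum(1 for text in inputs if run(reference, text) == run(candidate, text))
    return agree / len(inputs)


# Stand-ins (assumptions for illustration): both programs upper-case stdin,
# but the candidate mishandles empty input.
reference = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c",
             "import sys; s = sys.stdin.read(); print(s.upper() if s else 'EMPTY', end='')"]

rate = match_rate(reference, candidate, ["abc", "Hello\n", "", "x y z"])
print(rate)
```

The candidate agrees on three of the four inputs and diverges only on the empty one, which is the kind of edge case a large behavioral test suite is meant to surface.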

Search Interest

peak ~259/mo
updated 2026-06-12
[Chart: monthly search interest, 2026-05-14 to 2026-06-12, y-axis 0 to ~259/mo]

Term Lifecycle
  1. Nascent
    0–7 days
  2. Emergent
    8–30 days
  3. Validating ← now
    31–90 days
  4. Rising
    91–180 days
  5. Established
    180 days +

Why is it emerging now?

TL;DR

Published May 5, 2026 by Meta FAIR, Stanford, and Harvard — the SWE-bench team — ProgramBench resets the difficulty bar for coding AI. Nine frontier models score 0% fully resolved, sparking debate about the gap between LLM code generation and real software engineering.

Outlook

6-month signal projection and commercial timeline.

Signal medium
Revenue moderate

Benchmark from the SWE-bench team — brand credibility will drive sustained researcher attention and media cycles.

Risk · Competing benchmarks like MirrorCode (Epoch AI) may fragment attention if they publish contradictory results.

Analogs · SWE-bench · HumanEval · BIG-Bench

Monetization timeline
  1. now
    Research coverage open

    AI benchmarking newsletter and explainer content faces zero SERP competition today.

  2. 3-6mo
    Leaderboard tooling emerges

    Comparison tools and tutorials attract developer traffic as model scores improve.

  3. 6-12mo
    Enterprise training signal

    Companies adopt ProgramBench tasks to evaluate coding agent vendors; consulting gap opens.

Competition & Opportunity for term “ProgramBench”

Three heuristic signals derived from the tracked queries, the term's monetization cards, and its cluster neighbors. Directional, not audited.

Content Gap
  1 query tracked
  Led by General (1)
  1 Suggest-only tail: long-tail opening

Revenue Potential
  0% commercial-intent queries
  2 monetization angles mapped
  Mostly informational, pre-commercial

Build Difficulty
  Medium
  Stage: validating, incumbents warming up
  1 / 10 default TLDs taken · oldest incumbent programbench.com (2008-09-09)
  5 related terms already published

Heuristic · signals: tracked queries, term monetization cards, cluster neighbors

Ideas for term “ProgramBench”

Buildable pitches — turn this term into an article, site, product, post, newsletter, video, or course. Steal any card and run with it.

Article
ProgramBench vs SWE-bench: Which AI Coding Benchmark Actually Matters in 2026?

Zero SERP competition for this comparison; developers are actively asking what distinguishes the two. High-intent, long-tail traffic.

Article
What Is ProgramBench, and Why Does Every AI Model Score 0%?

Definitional explainer with immediate search demand. Ideal anchor for a benchmarks-focused content site.

Article
How to Run ProgramBench Locally: Step-by-Step Setup Guide

pip-installable, MIT-licensed tool — tutorial content has long-tail developer traffic and no existing guides yet.

Product
A ProgramBench leaderboard tracker that monitors model score changes over time

The official leaderboard publishes only static snapshots. A lightweight service that alerts subscribers when scores change fills the gap immediately.
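The core of such a tracker is a diff between two leaderboard snapshots. The sketch below assumes each snapshot has already been parsed into a mapping from model name to score; the function name, model names, and scores are all hypothetical, and fetching or parsing the real leaderboard is out of scope.

```python
def diff_snapshots(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Describe every model whose score was added, removed, or changed."""
    changes = []
    for model in sorted(old.keys() | new.keys()):
        before, after = old.get(model), new.get(model)
        if before is None:
            changes.append(f"{model}: new entry at {after:.1%}")
        elif after is None:
            changes.append(f"{model}: removed (was {before:.1%})")
        elif before != after:
            changes.append(f"{model}: {before:.1%} -> {after:.1%}")
    return changes


# Hypothetical snapshots; names and scores are placeholders, not real results.
last_week = {"model-a": 0.03, "model-b": 0.01}
this_week = {"model-a": 0.03, "model-b": 0.02, "model-c": 0.005}

changes = diff_snapshots(last_week, this_week)
for line in changes:
    print(line)
```

An alerting service would run this on a schedule and notify subscribers only when the returned list is non-empty.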

Product
A subset-difficulty visualizer for ProgramBench tasks ranked by test pass rate

Researchers need task-level drill-down the current leaderboard doesn't provide. Small open-source tool with clear niche.

Video
I Gave GPT-5.4 and Claude Opus 4.7 the Same ProgramBench Task — Here's What Happened

Live side-by-side demo with real benchmark tasks. Shareable format for YouTube; existing media hasn't done the visual comparison yet.

Newsletter
Weekly AI Benchmarks Briefing anchored on ProgramBench score updates

Recurring issue tracking which models break through the 0% ceiling, and when. Clear subscription hook for an AI researcher audience.

Post HN / r/MachineLearning
ProgramBench Is the Benchmark That Finally Said No: Why 0% Fully Solved Matters More Than 87% on SWE-bench

Nine frontier models. Two hundred real programs. Zero fully solved. Not because the models are bad — because for the first time, the benchmark refused to make it easy.

Post LinkedIn / Newsletter
The Gap Between 'AI Writes Code' and 'AI Builds Software' Is Bigger Than You Think

Claude Opus 4.7 passes 95% of tests on 3% of tasks. That's not close to solving software engineering; it's proof that architectural reasoning at scale is a different capability from autocomplete.

Post YouTube / Tech media
Meta Just Released the AI Benchmark Nobody Can Beat — And That's the Point

When the SWE-bench team shipped a new benchmark in May 2026, every major AI lab scored exactly 0% on the hardest metric. Here's what ProgramBench actually tests and why the 0% result is the most informative score in AI right now.

What People Search

Long-tail queries from Google Suggest + Trends. Volume and competition are heuristics — directional, not audited. Content Type comes from query shape.

Keyword        Competition    Content Type
programbench   Very Low       General

Updated 2026-06-12 · sources: Google Trends, Google Suggest · Competition is heuristic

SERP of term “ProgramBench”

What searchers see today — organic results on top, paid ads if anyone's bidding. Ad density is a real-time commercial signal.

FAQ

What is ProgramBench?

ProgramBench is a software-engineering benchmark that tests whether AI agents can reconstruct a complete, working codebase from only a compiled binary and its documentation — no source code, no decompilation, no internet access allowed….

Why is ProgramBench emerging now?

Published May 5, 2026 by Meta FAIR, Stanford, and Harvard — the SWE-bench team — ProgramBench resets the difficulty bar for coding AI. Nine frontier models score 0% fully resolved, sparking debate about the gap between LLM code generation and real software engineering.

When did ProgramBench emerge?

Publicly emerged around 2026-05-05 (about 42 days ago as of 2026-06-16). EarlyTerms first recorded a pipeline signal on 2026-05-07.
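The age and lifecycle stage follow from simple date arithmetic. This sketch uses the stage boundaries from the Term Lifecycle list in this report; the function name is hypothetical, and treating day 180 as still "Rising" is an assumption, since the listed ranges overlap at that boundary.

```python
from datetime import date

# Upper bounds in days for each stage, from the Term Lifecycle list.
# Day 180 is treated as "Rising" (assumption; the listed ranges overlap there).
STAGES = [
    (7, "Nascent"),
    (30, "Emergent"),
    (90, "Validating"),
    (180, "Rising"),
]


def lifecycle_stage(emerged: date, as_of: date) -> tuple[int, str]:
    """Return (age in days, stage name) for a term on a given date."""
    age = (as_of - emerged).days
    if age < 0:
        raise ValueError("as_of is earlier than the emergence date")
    for upper_bound, name in STAGES:
        if age <= upper_bound:
            return age, name
    return age, "Established"


age, stage = lifecycle_stage(date(2026, 5, 5), date(2026, 6, 16))
print(age, stage)  # 42 Validating
```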

Related Terms

Other terms in the same space — aliases, subtypes, competitors, and neighbors to explore next.

Explore next
Also mentioned
  • Competitor: MirrorCode
  • Related: SWE-bench · HumanEval · BIG-Bench · mini-SWE-agent

Sources

Primary URLs this report cites — open any to verify the claim yourself.

  1. ProgramBench paper, arXiv:2605.03546 (May 5, 2026) · arxiv.org
  2. facebookresearch/ProgramBench, official GitHub repo · github.com
  3. ProgramBench.com, live leaderboard · programbench.com
  4. HN: ProgramBench, 139 points, 72 comments · news.ycombinator.com
  5. ProgramBench-Tests dataset, HuggingFace · huggingface.co
  6. Emergent Mind, ProgramBench: Evaluating LM Software Reconstruction · emergentmind.com
  7. ProgramBench paper full text, arXiv HTML · arxiv.org