Value Accuracy
Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure. The gap is large: schema compliance routinely exceeds 84%, while value accuracy peaks at 83% even on frontier models.
Interfaze and JigsawStack formalized the metric in the SOB paper on April 28, 2026, benchmarking 21 models across 5,000+ text, image, and audio records. GLM-4.7 leads text at 83.0%; audio accuracy collapses to 23.7% for every model tested — exposing that structured outputs silently hallucinate at production scale.
On the SOB leaderboard, every model in the 21-model test achieves >84% JSON Pass Rate. Yet Value Accuracy (exact leaf-value match) peaks at 83.0% for text — and model size is not a predictor: open-weight Qwen3.5-35B outscores GPT-5.4 on this metric, exposing that the failure mode is extraction quality, not model scale.
Think of it as the spell-checker that passes "their" when you meant "there" — valid, but wrong.
Search Interest
-
Nascent0–7 days
-
Emergent8–30 days
-
Validating ← now31–90 days
-
Rising91–180 days
-
Established180 days +
Why is it emerging now?
LLMs now produce valid JSON at >84% schema compliance — so the bottleneck has shifted to whether values are correct, not whether the schema parses. The SOB paper (April 28, 2026) named this gap 'Value Accuracy' and quantified it across 21 frontier models, surfacing that structured outputs silently hallucinate at rates most teams don't measure.
Outlook
6-month signal projection and commercial timeline.
The metric fills a real gap in production LLM evaluation, but adoption hinges on whether SOB becomes a de facto leaderboard.
Risk · Existing hallucination benchmarks may absorb this framing without adopting the specific term.
Analogs · perplexity · BLEU score · hallucination rate
-
nowEvaluation tooling gap
Developers ship structured-output pipelines without measuring value accuracy; audit tools are absent.
-
3-6moLeaderboard & monitoring
SOB leaderboard attracts SaaS monitoring tools that track value accuracy per endpoint.
-
6-12moPlatform integration
Cloud inference APIs add Value Accuracy as a billable quality metric alongside token counts.
Competition & Opportunity for term “Value Accuracy”
Three heuristic signals derived from the tracked queries, the term's monetization cards, and its cluster neighbors. Directional, not audited.
Ideas for term “Value Accuracy”
Buildable pitches — turn this term into an article, site, product, post, newsletter, video, or course. Steal any card and run with it.
Evergreen explainer targeting devs searching for 'structured output evaluation LLM' — a query cluster with no dominant page yet. Monetize via affiliate links to eval frameworks.
Tutorial-style guide targeting builders implementing custom eval loops. Long-tail traffic from 'json extraction accuracy llm' searches.
Comparison piece citing SOB leaderboard data. Targets 'best llm structured output' and 'llm json accuracy comparison' queries.
SaaS tool that computes value accuracy on live inference logs against schema-defined ground truth, alerting on model drift before downstream corruption spreads.
OSS tool targeting CI/CD workflows; lightweight monetization via cloud storage for baseline snapshots.
Recurring benchmarking digest targeting ML engineers who run extraction pipelines. Natural upgrade path to paid leaderboard access.
In a 21-model benchmark, every frontier LLM achieved >84% schema compliance — yet the best value accuracy was 83.0%, and audio dropped to 23.7%.
Qwen3.5-35B outscored GPT-5.4 on structured-output value accuracy — a smaller open-weight model, no API fees.
Valid JSON doesn't mean correct JSON — here's what I found when I checked the actual leaf values, not just the schema.
What People Search
Long-tail queries from Google Suggest + Trends. Volume and competition are heuristics — directional, not audited. Content Type comes from query shape.
SERP of term “Value Accuracy”
What searchers see today — organic results on top, paid ads if anyone's bidding. Ad density is a real-time commercial signal.
FAQ
What is Value Accuracy?
Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure.
Why is Value Accuracy emerging now?
LLMs now produce valid JSON at >84% schema compliance — so the bottleneck has shifted to whether values are correct, not whether the schema parses. The SOB paper (April 28, 2026) named this gap 'Value Accuracy' and quantified it across 21 frontier models, surfacing that structured outputs silently hallucinate at rates most teams don't measure.
When did Value Accuracy emerge?
Publicly emerged around 2026-04-28 (about 49 days ago as of 2026-06-16). EarlyTerms first recorded a pipeline signal on 2026-04-30.
Related Terms
Other terms in the same space — aliases, subtypes, competitors, and neighbors to explore next.
- Also known as
- Part of ·
- Related ······
Sources
Primary URLs this report cites — open any to verify the claim yourself.
- 01 SOB paper: The Structured Output Benchmark (arXiv, Apr 2026) arxiv.org ↗
- 02 Interfaze: Introducing SOB — A Multi-Source Structured Output Benchmark interfaze.ai ↗
- 03 Hacker News: Show HN — A new benchmark for testing LLMs for deterministic outputs news.ycombinator.com ↗
- 04 JigsawStack/sob — GitHub repository github.com ↗
- 05 Cleanlab: LLM Structured Output Benchmarks are Riddled with Mistakes cleanlab.ai ↗