EarlyTerms

Value Accuracy

Validating · Emerged · 49 days old · Last reviewed

Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure. The gap is large: schema compliance routinely exceeds 84%, while value accuracy peaks at 83% even on frontier models.

Interfaze and JigsawStack formalized the metric in the SOB paper on April 28, 2026, benchmarking 21 models across 5,000+ text, image, and audio records. GLM-4.7 leads text at 83.0%; audio accuracy collapses to 23.7% for every model tested — exposing that structured outputs silently hallucinate at production scale.

💡

On the SOB leaderboard, every model in the 21-model test achieves >84% JSON Pass Rate. Yet Value Accuracy (exact leaf-value match) peaks at 83.0% for text — and model size is not a predictor: open-weight Qwen3.5-35B outscores GPT-5.4 on this metric, exposing that the failure mode is extraction quality, not model scale.

Think of it as the spell-checker that passes "their" when you meant "there" — valid, but wrong.

Search Interest

peak ~5.2K/mo
updated 2026-06-14
~5.2K/mo ~2.6K/mo 0
2026-05-16 2026-05-31 2026-06-14
Term Lifecycle
  1. Nascent
    0–7 days
  2. Emergent
    8–30 days
  3. Validating ← now
    31–90 days
  4. Rising
    91–180 days
  5. Established
    180 days +

Why is it emerging now?

TL;DR

LLMs now produce valid JSON at >84% schema compliance — so the bottleneck has shifted to whether values are correct, not whether the schema parses. The SOB paper (April 28, 2026) named this gap 'Value Accuracy' and quantified it across 21 frontier models, surfacing that structured outputs silently hallucinate at rates most teams don't measure.

4 forces driving coverage — scroll →

Outlook

6-month signal projection and commercial timeline.

Signal medium
Revenue moderate

The metric fills a real gap in production LLM evaluation, but adoption hinges on whether SOB becomes a de facto leaderboard.

Risk · Existing hallucination benchmarks may absorb this framing without adopting the specific term.

Analogs · perplexity · BLEU score · hallucination rate

Monetization timeline
  1. now
    Evaluation tooling gap

    Developers ship structured-output pipelines without measuring value accuracy; audit tools are absent.

  2. 3-6mo
    Leaderboard & monitoring

    SOB leaderboard attracts SaaS monitoring tools that track value accuracy per endpoint.

  3. 6-12mo
    Platform integration

    Cloud inference APIs add Value Accuracy as a billable quality metric alongside token counts.

Competition & Opportunity for term “Value Accuracy”

Three heuristic signals derived from the tracked queries, the term's monetization cards, and its cluster neighbors. Directional, not audited.

Content Gap
10 queries tracked
Led by General (9), Explainer (1)
10 Suggest-only tails — long-tail opening
Revenue Potential
0% commercial-intent queries
2 monetization angles mapped
Mostly informational — pre-commercial
Build Difficulty
Medium
Stage: validating — incumbents warming up
1 / 10 default TLDs taken · oldest incumbent valueaccuracy.com (2023-05-11)
No cluster neighbors published yet
Heuristic · signals: tracked queries, term monetization cards, cluster neighbors

Ideas for term “Value Accuracy”

Buildable pitches — turn this term into an article, site, product, post, newsletter, video, or course. Steal any card and run with it.

Article
Value Accuracy vs JSON Pass Rate: The Hidden Gap in LLM Structured Output Evaluation

Evergreen explainer targeting devs searching for 'structured output evaluation LLM' — a query cluster with no dominant page yet. Monetize via affiliate links to eval frameworks.

Article
How to Measure Value Accuracy in Your LLM Pipeline (Without the SOB Paper)

Tutorial-style guide targeting builders implementing custom eval loops. Long-tail traffic from 'json extraction accuracy llm' searches.

Article
Best LLMs for Structured Output by Value Accuracy: 2026 Benchmark Comparison

Comparison piece citing SOB leaderboard data. Targets 'best llm structured output' and 'llm json accuracy comparison' queries.

Product
Value Accuracy Monitor: Field-Level LLM Output Auditing for Production Pipelines

SaaS tool that computes value accuracy on live inference logs against schema-defined ground truth, alerting on model drift before downstream corruption spreads.

Product
A drop-in pytest fixture that asserts value accuracy thresholds on LLM extraction tests

OSS tool targeting CI/CD workflows; lightweight monetization via cloud storage for baseline snapshots.

Newsletter
Eval Matters: Weekly structured-output accuracy benchmarks for the top 10 frontier models

Recurring benchmarking digest targeting ML engineers who run extraction pipelines. Natural upgrade path to paid leaderboard access.

Post HN / r/MachineLearning
Your LLM Passes JSON Validation. Its Values Are Still Wrong. Here's the Gap Nobody Measures.

In a 21-model benchmark, every frontier LLM achieved >84% schema compliance — yet the best value accuracy was 83.0%, and audio dropped to 23.7%.

Post LinkedIn / Newsletter
Model Size Does Not Predict Value Accuracy. Here's What Does.

Qwen3.5-35B outscored GPT-5.4 on structured-output value accuracy — a smaller open-weight model, no API fees.

Post YouTube / Tech media
I Ran the Same JSON Extraction Prompt on 10 LLMs. The Value Accuracy Results Are Embarrassing.

Valid JSON doesn't mean correct JSON — here's what I found when I checked the actual leaf values, not just the schema.

What People Search

Long-tail queries from Google Suggest + Trends. Volume and competition are heuristics — directional, not audited. Content Type comes from query shape.

Keyword
Competition
Content Type
value accuracy
Very Low
General
value accuracy professional consulting
Very Low
General
accuracy value formula
Low
General
accuracy value calculator
Low
General
accuracy value meaning
Low
Explainer
zillow value accuracy
Very Low
General
redfin value accuracy
Very Low
General
p value for accuracy
Low
General
1–8 of 10
1 / 2
Updated 2026-06-14 · sources: Google Trends, Google Suggest · Competition is heuristic

SERP of term “Value Accuracy”

What searchers see today — organic results on top, paid ads if anyone's bidding. Ad density is a real-time commercial signal.

FAQ

What is Value Accuracy?

Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure.

Why is Value Accuracy emerging now?

LLMs now produce valid JSON at >84% schema compliance — so the bottleneck has shifted to whether values are correct, not whether the schema parses. The SOB paper (April 28, 2026) named this gap 'Value Accuracy' and quantified it across 21 frontier models, surfacing that structured outputs silently hallucinate at rates most teams don't measure.

When did Value Accuracy emerge?

Publicly emerged around 2026-04-28 (about 49 days ago as of 2026-06-16). EarlyTerms first recorded a pipeline signal on 2026-04-30.

Related Terms

Other terms in the same space — aliases, subtypes, competitors, and neighbors to explore next.

Also mentioned
  • Also known as field-level accuracy
  • Part of structured output·LLM evaluation
  • Related JSON schema compliance·hallucination rate·SOB benchmark·schema compliance·exact match·structured hallucination·JSON pass rate

Sources

Primary URLs this report cites — open any to verify the claim yourself.

  1. 01 SOB paper: The Structured Output Benchmark (arXiv, Apr 2026) arxiv.org
  2. 02 Interfaze: Introducing SOB — A Multi-Source Structured Output Benchmark interfaze.ai
  3. 03 Hacker News: Show HN — A new benchmark for testing LLMs for deterministic outputs news.ycombinator.com
  4. 04 JigsawStack/sob — GitHub repository github.com
  5. 05 Cleanlab: LLM Structured Output Benchmarks are Riddled with Mistakes cleanlab.ai