The Bart Test - Part 9: The Question I Couldn't Answer

This is Part 9 (final) of the Bart Test series. Read Part 8 for the value proposition questions, and Part 6 for the Experiment 05 hypothesis.

Time to pause the Bart Test experiment? The question I couldn't answer with the data I had.

After questioning whether the Bart Test was worth continuing, I finally got the Experiment 05 evaluation sheets back.

I glanced at the scores and saw lots of 8s, 9s, and 10s.

My first reaction wasn't excitement. It was concern.

Back in Part 6, I posed a hypothesis: Would giving LLMs more guidance in Experiment 05 create better differentiation, or would it make the test too easy?

Experiment 05's refined prompts:

  • Tighter word count: 50-60 words (vs 50-100 in Experiment 04)
  • Explicit goals: "make it funny and engaging"
  • Clearer guidance: "make sure it makes sense to the teenage reader"
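For concreteness, here's one way those three constraints might be assembled into a single generation prompt. This is a sketch of the shape only, not the actual Experiment 05 prompt text; the function name and scenario placeholder are made up for illustration.

```python
# Hypothetical assembly of the Experiment 05 constraints into one prompt.
# The build_prompt helper and the scenario placeholder are illustrative,
# not the real Bart Test prompt wording.

def build_prompt(scenario: str) -> str:
    return (
        f"Write a short story for this scenario: {scenario}\n"
        "- Keep it to 50-60 words.\n"
        "- Make it funny and engaging.\n"
        "- Make sure it makes sense to the teenage reader.\n"
    )

print(build_prompt("<Scenario A description goes here>"))
```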

The judges completed the evaluation sheets. Here's what I found from Judge #1:

Judge #1's Top Scores:

  • Claude Scenario B: 8/10/8/10 - "This sounds like something I'd write with, like excessive slang as a joke" 😊
  • OLMo Scenario B: 9/10/9/9 - "omg, so close to perfect, I'd just take out this one sentence"
  • GPT Scenario A: 8/7/8/10 - "That's a really cool story :)"
  • Claude Scenario A: 9/8/7/9 - "This one paints a really good picture :)"

From Judge #1's perspective, the refined prompts had improved the model outputs. Less of the "trying too hard" problem. More genuinely funny content.

But Judge #2's scores told a different story:

  • No scores above 8 on any output
  • Fewer comments overall
  • More critical throughout
  • Made it clear they were tired of judging and didn't want to continue

Two judges, same outputs, completely different reactions.

Looking at this data, I can't answer the American Ninja Warrior question from Part 6.

The model rankings shifted:

  1. Claude Opus 4.5: 7.69 (was 4th in Exp 04)
  2. GPT-5.2: 7.06 (was 2nd in Exp 04)
  3. OLMo 3 32B: 6.19 (was 1st in Exp 04)
  4. Llama 3.1 8B: 5.75 (was 5th in Exp 04)
  5. Qwen3 14B: 4.94 (was 3rd in Exp 04)

Some models improved with tighter constraints. Others declined. Qwen had a complete coherence collapse on one scenario—Judge #1 literally couldn't understand what happened in the story.

So did the test get easier or harder? Did tighter constraints improve differentiation or reduce it? I don't have enough data to know.

With only 2 judges and a 5-point spread on the same output (OLMo Scenario B: Judge #1 gave 9/10/9/9, Judge #2 gave 4/3/4/3), I'm not measuring model performance. I'm capturing individual judge perspectives.
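To put a number on that, here's a minimal sketch using the one pair of score sheets quoted above (OLMo Scenario B). The four values per judge are the four rubric scores from the evaluation sheet; the criteria names aren't listed in this post, so they're just positional.

```python
# Inter-judge gap on a single output (OLMo Scenario B), from the scores
# quoted above. With only two judges, a gap this size can't be separated
# from real differences in model quality.

judge_1 = [9, 10, 9, 9]  # Judge #1's four rubric scores
judge_2 = [4, 3, 4, 3]   # Judge #2's four rubric scores, same output

gaps = [abs(a - b) for a, b in zip(judge_1, judge_2)]
print(gaps)                   # [5, 7, 5, 6]
print(sum(gaps) / len(gaps))  # 5.75 -- roughly the 5-point spread above
```

Averaging those two sheets doesn't tell you how good the output is; it just splits the difference between two very different readers.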

Judge #1 saw improvement everywhere. Judge #2 was tired, critical, and done. Both are valid perspectives, but together, they're not conclusive.

What it would take to answer the question properly:

  • More judges per session (likely 5-10, not 2)
  • Consistent evaluation over time (quarterly cadence)
  • Managing judge burnout and social costs
  • Sustained investment in recruitment and compensation

Combined with the social cost issues from Part 7 and the unclear utility from Part 8, I realized that doing this right would require more investment than I'm willing to make right now.

I've paused the Bart Test.

The data from Experiments 04 and 05 is interesting, but it's anecdotal. Two judges, two perspectives, no definitive answer about which difficulty level creates better differentiation. The American Ninja Warrior question remains open.

Maybe I'll return to this with a bigger budget and more judges. Maybe someone else will take the idea and run with it. Maybe the value was always in the exploration itself, as Part 8 suggested.

But for now, it's paused.


Part 9 of 9 in the Bart Test series.


Project: bart-test - View on GitHub



This post is part of my AI journey blog at Mosaic Mesh AI. Building in public, learning in public, sharing the messy middle of AI development.
