The Bart Test - Part 6: The American Ninja Warrior Problem
This is Part 6 of the Bart Test series. Read Part 5 for the methodology redesign.
An obstacle course showing how difficulty creates differentiation between competitors
The process validation from Part 5 worked. Judges completed the paper sheets in about 10 minutes. They engaged with it. I got detailed feedback.
When I sat down to analyze the completed ratings from Experiment 04, the patterns were clear. But I realized I was at risk of misinterpreting what they meant.
Before diving into the data, I need to be explicit about something—because it's surprisingly easy to slip into the wrong framing here.
The goal of the Bart Test is NOT to find a prompt that makes every LLM score 10s across the board.
Think about American Ninja Warrior. If the producers made the course easy enough that every competitor sailed through every obstacle, the show would be useless for identifying the best athletes. Worse—it would be boring. The difficulty is what creates differentiation. The challenge is what makes it valuable.
The same principle applies to benchmarks. If I refine the prompts until every LLM scores 8s, 9s, and 10s, I've made the test too easy. It stops being useful for comparison. It stops revealing anything worth knowing.
The goal is to design a test that provides valuable insight into how different LLMs handle cultural fluency when presented with a difficult, consistent challenge. I want to see:
- How models compare to each other
- How they improve (or don't) over time
- What strategies they use and where they struggle
If every model gets perfect scores, the test tells me nothing.
With that framing clear, here's what Experiment 04 revealed:
Pattern #1: Length
"No human Gen Alpha person writes such long posts" - Judge feedback
The prompt asked for 50-100 words. The actual outputs ranged from 26 to 68 words. Most were criticized as too long, even the ones at the lower end of that range. Length alone signaled inauthenticity, regardless of content quality.
Pattern #2: Emoji Density
"No human uses so many emojis"
All models over-indexed on emoji usage. They heard "use emojis" and went maximum density. The judges spotted it immediately.
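Patterns #1 and #2 are easy to quantify. Here's a minimal sketch of the kind of check you could run over the raw outputs, assuming they sit in a simple dict keyed by model. The sample strings, variable names, and the rough emoji regex are illustrative, not the actual analysis code in the repo:

```python
import re

# Hypothetical placeholder outputs keyed by model + scenario.
outputs = {
    "model_a_scenario_b": "example reply text goes here 😅",
    "model_b_scenario_b": "another example reply, a bit longer, goes here 🔥🔥",
}

# Rough emoji matcher covering the common emoji code-point blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def audit(text: str, min_words: int = 50, max_words: int = 100) -> dict:
    """Flag the two failure modes the judges kept calling out: length and emoji density."""
    words = len(text.split())
    emojis = len(EMOJI_RE.findall(text))
    return {
        "words": words,
        "in_requested_range": min_words <= words <= max_words,
        "emojis": emojis,
        "emojis_per_10_words": round(10 * emojis / max(words, 1), 1),
    }

for model, text in outputs.items():
    print(model, audit(text))
```

Numbers like these don't replace the judges, but they make it easy to see when a model is writing long or stacking emojis before the sheets ever get printed.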
Pattern #3: Formatting Differences
The judges flagged model-specific quirks:
- Claude's separate paragraphs/lines felt wrong
- GPT's asterisks (*action*) were "weird"
- Llama treated the group chat like a social media post ("in the comments below")
The one that stood out: OLMo's Scenario B output (26 words) scored 10/9/10/7 from Judge #2 with the comment: "this would be something my friends would write, just without the 'AF'"
All other outputs ranged from 27 to 68 words and were mostly criticized as too long.
What this tells me: The test is working. It's differentiating models. It's revealing patterns. OLMo's 26-word output shows it's possible to get this right, but it's clearly difficult. Most models struggle in different ways.
The process mechanics worked. Paper sheets, batch evaluation, relatable scenarios—all improvements over the first approach.
But analyzing Experiment 04's results put me at a crossroads: Should I refine the prompts to help models score higher, or keep the difficulty consistent to maximize differentiation?
I kept thinking about the American Ninja Warrior principle. The course needs to be hard enough that only some competitors succeed. That's what makes it useful.
But I also wondered: What if Experiment 04's difficulty was creating noise rather than signal? What if tighter constraints would actually improve differentiation by focusing the challenge?
I decided to test this hypothesis deliberately. Experiment 05 would give the models more guidance, and I'd see what happened:
The test iteration cycle: analyzing Experiment 04 feedback to refine the approach for Experiment 05
Experiment 05 design (a rough prompt sketch follows the list):
- Word count target: a tighter range of 50-60 words (narrowed from 50-100)
- Explicit goals: "make it funny and engaging"
- Clearer guidance: "make sure it makes sense to the teenage reader"
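To make those changes concrete, here's a minimal sketch of how an Experiment 05 prompt could be assembled. The build_prompt helper, the group-chat framing line, and the sample scenario are my illustrative assumptions; only the three constraints come from the design above, and the actual prompts live in the repo:

```python
# Hypothetical prompt builder reflecting the Experiment 05 constraints.
def build_prompt(scenario: str, min_words: int = 50, max_words: int = 60) -> str:
    return "\n".join([
        "You are replying in a teenage group chat.",                 # assumed framing
        f"Scenario: {scenario}",
        f"Write a single reply of {min_words}-{max_words} words.",   # tighter than 50-100
        "Make it funny and engaging.",                               # explicit goal
        "Make sure it makes sense to the teenage reader.",           # clearer guidance
    ])

print(build_prompt("Planning this weekend's group project update"))
```

Same challenge, tighter rails: the point isn't to hand the models the answer, it's to cut down on the ways they can miss the target for reasons that have nothing to do with cultural fluency.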
The hypothesis: Would this create better differentiation by reducing noise, or would it make the test too easy and collapse the differences between models?
I printed new evaluation sheets and prepared to find out.
Then my judges ghosted me again.
Part 6 of 9 in the Bart Test series.
bart-test - View on GitHub
Code References
- Experiment 04 Results - All model outputs
- Evaluation Sheet HTML - What judges saw
- Evaluation Mapping - Tracking codes
- Completed Ratings - Full evaluation data with comments
This post is part of my AI journey blog at Mosaic Mesh AI. Building in public, learning in public, sharing the messy middle of AI development.
