The Bart Test - Part 9: The Question I Couldn't Answer
After questioning whether the Bart Test was worth continuing, I finally got the Experiment 05 evaluation sheets back.
I glanced at the scores and saw lots of 8s, 9s, and 10s.
My first reaction wasn't excitement. It was concern.
The Bart Test - Part 7: The Social Cost I Didn't See Coming
After analyzing Experiment 04's results in [Part 6](/blog/bart-test-part-6-the-american-ninja-warrior-problem), I designed Experiment 05 to test a hypothesis: Would tighter constraints improve differentiation, or make the test too easy?
I ran [Experiment 05](https://github.com/bart-mosaicmeshai/bart-test/tree/main/results/05_final_outputs). Printed the [evaluation sheets](https://github.com/bart-mosaicmeshai/bart-test/tree/main/evaluation_sheets/20251230). And prepared to find out.
Then I hit a wall I hadn't anticipated.
One judge was going to fill it out the next day and ask her friend to help. Then this judge told me: "[Friend] hates AI, so I reconsidered asking them."
The second judge was very clear: "Don't ask my friends to help with this!"
The Bart Test - Part 6: The American Ninja Warrior Problem
The process validation from [Part 5](/blog/bart-test-part-5-redesigning-from-scratch) worked. Judges completed the paper sheets in about 10 minutes. They engaged with it. I got detailed feedback.
When I sat down to analyze the [completed ratings](https://github.com/bart-mosaicmeshai/bart-test/blob/main/evaluation_sheets/20251228/completed_ratings_20251228.json) from Experiment 04, the patterns were clear. But I realized I was at risk of misinterpreting what they meant.
The Bart Test - Part 5: Redesigning From Scratch
After my teens ghosted the frontier model evaluation, I sat with a choice: give up on this whole thing, or try again.
The doubt was real. Maybe the Bart Test would never work. Maybe asking teenagers to evaluate AI-generated slang was fundamentally flawed. But I couldn't shake the insights from [Part 3](/blog/bart-test-part-3-the-zoo-not-duck-problem)—the "zoo not duck" problem, the slang half-life, the "trying too hard" pattern. Those felt real.
So I decided to try again. Not because I was confident it would work, but because I wasn't ready to give up.
