Fine-Tuning Gemma for Personality - Part 5: Base Models vs Instruction-Tuned

This is Part 5 of a series on fine-tuning Gemma for personality. Read Part 1 for the concept, Part 2 for the dataset, Part 3 for training, and Part 4 for the short response problem.

Same training data. Same hardware. Same 5 minutes. I tested both base (-pt) and instruction-tuned (-it) models to see if one would handle personality better. Both struggled with consistency.

Gemma models come in two flavors:

  • Pre-trained (-pt): Base models trained on text continuation
  • Instruction-tuned (-it): Further trained to follow conversational patterns

I ran the same Bluey dataset through both gemma-3-1b-pt and gemma-3-1b-it.
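For reference, here's a minimal sketch of how the two variants load, assuming the Hugging Face transformers library and the google/gemma-3-1b-pt / google/gemma-3-1b-it checkpoint names (the actual setup from Part 3 may differ):

```python
# Minimal sketch: loading both Gemma variants.
# Assumes Hugging Face transformers and the google/gemma-3-1b-* checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "pt": "google/gemma-3-1b-pt",  # base: trained on text continuation
    "it": "google/gemma-3-1b-it",  # instruction-tuned: chat-style turns
}

def load_variant(kind: str):
    """Load the tokenizer and model for one Gemma variant."""
    model_id = MODEL_IDS[kind]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model

# Both variants share the same tokenizer; the -it checkpoint additionally
# ships a chat template for formatting conversational turns.
```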

[Figure: Base vs instruction-tuned model comparison showing shared issues and marginal differences]

Training time was nearly identical (4.9 vs 4.8 minutes), and both runs ended with training loss around 0.1. I tested both with the same prompts to see if model type made a difference.
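Keeping the comparison fair just means running the identical training recipe over both checkpoints. A minimal sketch, assuming trl's SFTTrainer; bluey_dataset stands in for the 111-example dataset from Part 2, and the SFTConfig values are placeholders, not the real Part 3 config:

```python
# Sketch: the identical fine-tuning run applied to both variants.
import time
from trl import SFTConfig, SFTTrainer

# bluey_dataset is a placeholder for the 111-example dataset from Part 2.
for kind in ("pt", "it"):
    trainer = SFTTrainer(
        model=f"google/gemma-3-1b-{kind}",  # trl loads the checkpoint itself
        train_dataset=bluey_dataset,
        args=SFTConfig(output_dir=f"gemma-3-1b-{kind}-bluey"),
    )
    start = time.time()
    result = trainer.train()
    print(f"{kind}: {(time.time() - start) / 60:.1f} min, "
          f"final train loss {result.training_loss:.3f}")
```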

What I found: Both models showed inconsistent behavior. Sometimes on-topic in Bluey's voice, sometimes off-topic but still in Bluey's voice, and both truncated mid-sentence.

Example - "Can you help me with my homework?":

Base model (-pt):

"Yeah! That's fair, even though sometimes I don't understand. What kind of homework is it? Is it maths? Because Mum's really good at maths. Or maybe" (26 words, truncated)

Instruction-tuned (-it):

"Homework is actually pretty fun if you make it into a game! So I'm happy to help you with your homework. Just remember - if it gets too hard, you can take a break and play for a bit. That's" (40 words, truncated)

Both show Bluey patterns. The -it response is longer, but both cut off mid-sentence. Across four different test prompts, both models went off-topic sometimes and stayed on-topic other times.

See all test cases: pt_vs_it_comparison.txt shows all four prompts and responses. Run compare_pt_vs_it.py to reproduce these results.
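In essence, the script prompts both fine-tuned checkpoints with the same questions and flags replies that hit the token budget. A rough sketch of the idea, not the actual compare_pt_vs_it.py; the output paths and token budget here are assumptions:

```python
# Sketch: query both fine-tuned checkpoints with the same prompts
# and flag responses that used the full token budget (i.e., truncated).
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = ["Can you help me with my homework?"]  # plus the other three prompts
MAX_NEW_TOKENS = 60  # assumed budget; the real script's value may differ

for kind in ("pt", "it"):
    path = f"gemma-3-1b-{kind}-bluey"  # assumed output dirs from training
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)
    for prompt in PROMPTS:
        # Note: the -it variant would normally use its chat template;
        # omitted here to keep both variants on an identical footing.
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        # Heuristic: if the full budget was used, the reply was cut off
        # mid-sentence rather than ending naturally on an EOS token.
        truncated = new_tokens.shape[0] >= MAX_NEW_TOKENS
        print(f"[{kind}] {prompt}\n{text}\n(truncated: {truncated})\n")
```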

The theory: instruction-tuned models have already been trained to follow conversational patterns, so they should handle personality fine-tuning better with limited data.

The reality: both models struggled. Both went off-topic sometimes. Both truncated responses. Both reproduced Bluey patterns inconsistently.

The marginal difference: the -it model produced slightly longer responses before truncating (30-40 words vs 25-35 words). But "longer" doesn't mean "better" when both are cutting off mid-sentence.

With only 111 examples, neither model type solved the consistency problem. The training data limitations affected both equally.

Next: how to test personality (you can't unit test "sounds like Bluey").


Part 5 of 9 in the Fine-Tuning Gemma for Personality series.


gemma-local-finetune - View on GitHub


This post is part of my AI journey blog at Mosaic Mesh AI. Building in public, learning in public, sharing the messy middle of AI development.
