Content creation - starting with rhymes

One of the steps in the pipeline I’m building generates short text content from a structured source document. The content needs rhymes — lots of them, including rhymes against awkward proper nouns from historical material. Not the kind of thing a thesaurus lookup handles well.

The architecture I’ve landed on splits the work into two model calls. The first is a low-temperature extraction pass: pull anchor moments from the source, present-tense, single-incident, named participants only. This is factual work and the model is on a tight leash. The second is a higher-temperature creative pass: produce rhyme partners for the proper nouns and build thematic word clusters that draw on period-appropriate vocabulary. Same source document, two completely different jobs, two completely different prompts.

Splitting the calls matters for a reason that took me a few iterations to appreciate. If you ask one prompt to be both factually strict and creatively loose, the model gets contradictory signals and produces something that’s neither. Two prompts, two temperatures, two single-purpose jobs. The output is better and the prompts are easier to debug because each one is doing one thing.
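The split can be sketched as two request configurations built from the same source document. This is a minimal illustration, not the actual pipeline code: the prompt wording, temperature values, and the shape of the request dicts are all assumptions; only the one-job-per-call structure comes from the text above.

```python
# Hypothetical sketch of the two single-purpose calls. The prompt text
# and temperature values are illustrative, not the real pipeline's.

EXTRACTION_PROMPT = (
    "From the source document, list anchor moments. "
    "Present tense, single incident, named participants only. "
    "One moment per line, no commentary."
)

CREATIVE_PROMPT = (
    "For each proper noun in the source, propose rhyme partners and a "
    "thematic word cluster drawing on period-appropriate vocabulary."
)

def build_calls(source: str) -> list[dict]:
    """Two calls, two temperatures, two single-purpose jobs."""
    return [
        # Factual pass: tight leash, low temperature.
        {"prompt": f"{EXTRACTION_PROMPT}\n\n{source}", "temperature": 0.2},
        # Creative pass: same source, looser sampling.
        {"prompt": f"{CREATIVE_PROMPT}\n\n{source}", "temperature": 1.0},
    ]
```

The point is that nothing is shared between the two requests except the source text itself, so each prompt can be debugged in isolation.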

The output of this stage isn’t finished content. It’s raw material — a draft for me to prune by hand before the next stage consumes it. I write to rhyme_pool.draft.txt and the next script reads rhyme_pool.txt, which doesn’t exist until I’ve gone through and removed the obvious failures. If I forget to prune, the next stage fails fast with a clear error. Cheap discipline.
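The fail-fast handshake is just a file-existence check at the top of the next stage. A minimal sketch, assuming the next script loads the pool as a list of lines (the function name and error wording are mine; the two filenames are from the pipeline as described):

```python
import sys
from pathlib import Path

DRAFT = Path("rhyme_pool.draft.txt")  # written automatically by this stage
PRUNED = Path("rhyme_pool.txt")       # exists only after manual pruning

def load_pruned_pool() -> list[str]:
    """Entry point for the next stage: refuse to run on unpruned output."""
    if not PRUNED.exists():
        # Fail fast with a clear error instead of consuming raw drafts.
        sys.exit(
            f"{PRUNED} not found. Prune {DRAFT} by hand and save the "
            f"result as {PRUNED} before running this stage."
        )
    return [
        line.strip()
        for line in PRUNED.read_text().splitlines()
        if line.strip()
    ]
```

Because the pruned filename is different from the draft filename, forgetting the manual step can never silently feed unreviewed material downstream.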

The model is part of the prompt

Here’s the thing I didn’t fully internalise until I started swapping models for comparison: a prompt isn’t an abstract specification that any sufficiently capable model should be able to satisfy. A prompt is paired with the model it was tuned against. Move it to a different model and expect to do real work re-tuning it.

Within a model family this is mostly true with caveats. A prompt I tune against Gemma 3 12B will probably behave similarly on Gemma 3 27B, because they share the same training recipes scaled up. The 27B will be better at the hard parts but it’ll fail in recognisably similar ways. Family resemblance is real and useful.

But the resemblance breaks across major version jumps. Gemma 4 was released last week with substantially different training and post-training — the benchmark numbers suggest something genuinely different is happening inside it, not just a bigger version of Gemma 3. There’s no reason to assume my Gemma 3 prompts will behave identically on Gemma 4. Worth re-tuning when I get there.

And across families it’s a different story entirely. Qwen 3.5 takes the same instructions as Gemma 3 and interprets them differently. Gemma tends to follow instructions literally; Qwen tends to interpret. Neither is wrong, but the prompt that gets clean structured output from one will produce verbose preamble from the other unless you adapt for it.

The practical implication is that “which model is better” is the wrong question. The right question is “which model, with a prompt tuned specifically for it, produces better output for my task.” Those are very different questions, and answering the second one honestly means budgeting time to re-tune for each model you seriously evaluate. Otherwise you’re not comparing models — you’re comparing one model against another model handicapped by the wrong prompt.

I’ve been keeping a short note for each pipeline stage recording which model was chosen, against which alternatives, with which prompt version, on which date. Boring discipline, but future-me will thank present-me when I’m wondering in six months why I made a particular choice.
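A decision record like that only needs a handful of fields. One possible shape, assuming a JSON file per stage (the field names and example values here are invented for illustration):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelChoice:
    """One record per pipeline stage: enough to reconstruct the decision."""
    stage: str                 # which pipeline stage this covers
    chosen_model: str          # the model that won
    alternatives: list[str] = field(default_factory=list)  # what it beat
    prompt_version: str = ""   # the prompt the comparison was run with
    decided_on: str = ""       # ISO date of the decision

def to_record(choice: ModelChoice) -> str:
    """Serialise a choice for a per-stage notes file."""
    return json.dumps(asdict(choice), indent=2)
```

The `prompt_version` field matters most: without it, a recorded comparison is just one model against another handicapped by an unknown prompt.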