Models, directories and getting my house in order

Long session today. Started with what I thought was a simple question about Qwen 3.5 and ended up rebuilding my assumptions about the whole pipeline. Good day, if a slightly humbling one.

Qwen 3.5 and the model question

I’d been working with Mistral Nemo (and my somewhat questionable fine-tuned version of it, mistral-miranda) for lyrics generation. The results were… interesting. Not in a good way. Time to move on.

Qwen 3.5 was released in February and looks genuinely promising. The family has a 9B and a 27B dense model that are relevant for my hardware (Ryzen 3700X, 32GB RAM, 3060 12GB VRAM). I’d assumed the 9B was the obvious choice given my VRAM constraints, but got corrected on that - the 27B at Q4_K_M quantisation is 17GB total, meaning most layers sit in VRAM and only a small spill goes to system RAM. The MoE variants (like the 35B-A3B) looked tempting on paper, but community testing suggests they underperform dense models for creative writing specifically, which is exactly my use case.
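
The back-of-envelope for that split looks roughly like this - the layer count and overhead figures below are pure assumptions for illustration, not measured values:

```shell
# Rough estimate: how many layers of the 27B Q4_K_M fit in 12GB VRAM?
# Layer count and overhead are guesses, not measurements.
MODEL_MB=17408       # ~17GB GGUF on disk
N_LAYERS=64          # hypothetical layer count for the 27B
VRAM_MB=12288        # 3060: 12GB
OVERHEAD_MB=1536     # reserve for KV cache, CUDA context, scratch buffers

PER_LAYER_MB=$(( MODEL_MB / N_LAYERS ))
GPU_LAYERS=$(( (VRAM_MB - OVERHEAD_MB) / PER_LAYER_MB ))
echo "~${PER_LAYER_MB} MB/layer, start from --n-gpu-layers ${GPU_LAYERS}"
```

The real answer comes from watching nvidia-smi while nudging the layer count up, but this gives a sane starting point.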

Downloaded both from Hugging Face (bartowski’s quantisations; note the Qwen_ prefix on the filenames, which caught me out):

  • Qwen_Qwen3.5-9B-Q8_0.gguf - for extraction tasks
  • Qwen_Qwen3.5-27B-Q4_K_M.gguf - for creative/lyrics generation

Both sitting in /data/llama.cpp/models/ now.

The Ollama question

I’d been using Ollama and assumed I’d continue. Turns out there are real compatibility issues between Ollama and Qwen 3.5 GGUFs - specifically around the 27B, with users hitting 500 errors on model load. The issues are Ollama-side, not the models themselves.

I went back and forth on this (and got some advice I had to push back on, which is fair). The right answer is running llama.cpp directly via the official server-cuda Docker image. It gives me precise control over n_gpu_layers, which is critical for maximising how much of the 27B sits in VRAM versus spilling to system RAM. Ollama abstracts that away in ways that become a problem rather than a convenience at my hardware limits.
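
A sketch of what the 27B compose file might look like - the image tag, the starting layer count, and the context size are all assumptions to verify against the current llama.cpp README, not settled values:

```yaml
# docker-compose.yml for the 27B (sketch; check flags and image tag
# against the llama.cpp docs before trusting them)
services:
  llama-27b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    ports:
      - "8080:8080"
    volumes:
      - /data/llama.cpp/models:/models
    command: >
      -m /models/Qwen_Qwen3.5-27B-Q4_K_M.gguf
      --n-gpu-layers 40
      --ctx-size 8192
      --host 0.0.0.0
      --port 8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

The 9B file is the same shape with the Q8_0 model path and presumably all layers on GPU.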

Two separate docker-compose files, one per model, both bound to port 8080 but never running at the same time. The script manages the container lifecycle: start the 9B, run extraction, stop the 9B; start the 27B, run lyrics generation, stop the 27B. Slower than keeping both resident, but I only have 12GB of VRAM and I’m not fighting myself for it.
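
The lifecycle script reduces to something like this - the compose directories and step scripts are placeholders, and it’s shown in dry-run mode (DRY_RUN=1 prints the plan instead of touching Docker):

```shell
#!/usr/bin/env bash
# Serial model lifecycle (sketch). Paths and step scripts are placeholders.
set -euo pipefail

run() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

phase() {  # phase <compose-dir> <step command...>
  local dir="$1"; shift
  run docker compose -f "$dir/docker-compose.yml" up -d
  run "$@"                                   # the pipeline step itself
  run docker compose -f "$dir/docker-compose.yml" down
}

# Phase 1 then Phase 2, never both models resident at once.
# Drop DRY_RUN=1 for a real run.
DRY_RUN=1 phase /data/ai-tools/llama-9b  ./extract_images.sh
DRY_RUN=1 phase /data/ai-tools/llama-27b ./generate_lyrics.sh
```

Each phase is three steps - up, work, down - so a crashed step leaves at most one container to clean up by hand.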

The two-phase lyrics pipeline

This was the other big thing. My previous lyrics script was dumping the entire Production Dossier (formerly known as the Master Bible - I’ve retired both “master” and “bible” from my vocabulary) into the prompt as context for lyrics generation. The problem: the LLM was consistently dropping the burning birds from Olga of Kiev, even though it’s the most interesting part of the story. Birds with sulfur tied to their feet, released back to their nests, burning a city from the inside out. Every time: ignored.

The fix is a two-phase approach:

Phase 1 (9B, low temperature): Extract exactly 8 concrete physical images from the Production Dossier. No interpretation, no emotion, just “Olga’s men tie sulfur and burning cloth to the feet of pigeons and sparrows, then release them.” Ordered by dramatic intensity, most striking first. This becomes the Image Bank.

Phase 2 (27B, high temperature): Generate the lyrics from the Image Bank, not the full Dossier. Smaller context, more concentrated attention, and the model can’t bury the birds under three pages of historical background.
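
The two prompt builders reduce to something like this - the instruction wording is my paraphrase of the briefs above, not the exact prompt text from the pipeline:

```shell
# Phase-1 prompt: extraction only, fed the full Production Dossier.
build_phase1_prompt() {  # $1 = Production Dossier file
  printf '%s\n\n%s\n' \
    "Extract exactly 8 concrete physical images from the dossier below. No interpretation, no emotion. One image per line, ordered by dramatic intensity, most striking first." \
    "$(cat "$1")"
}

# Phase-2 prompt: sees only the Image Bank, never the dossier.
build_phase2_prompt() {  # $1 = Image Bank file
  printf '%s\n\n%s\n' \
    "Write the lyrics from these images. Use every image, and lead with the first." \
    "$(cat "$1")"
}
```

The structural point is in the arguments: Phase 2’s builder physically cannot see the dossier, so the attention problem is solved by construction rather than by pleading with the model.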

The other fix was reordering the $TOPIC variable to put the birds first. Models front-load what you front-load. Obvious in retrospect.

Directory structure

My previous layout was a disaster. I had lyrics02.txt through lyrics06.txt in a flat output directory, with no obvious record of what changed between them or why. Classic iterative-without-a-plan naming.

New structure:

/data/
├── ai-tools/          # docker-compose files, one dir per tool
├── models/            # all model weights by tool
├── config/            # style guide, reusable prompts
├── pipeline/          # scripts (under git)
├── topics/            # one dir per topic
│   └── olga-of-kiev/
│       ├── research/      # raw scraped files from n8n
│       ├── documents/     # LLM outputs: dossiers, image bank, lyrics, explainer
│       ├── voice/         # audio files
│       ├── visuals/       # SD outputs
│       └── output/        # final assembled video
└── workspace/         # ephemeral junk, caches, experiments

The topic directory is the version. There’s no rap06.txt - there’s just rap.txt. If I want version history, that’s what git is for. /data/pipeline and /data/config go under git. The topic directories don’t - they’re generated outputs.
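
The tracked-versus-generated split is really just a .gitignore at the repo root. A sketch, demonstrated in a scratch directory since the real root would be /data:

```shell
# Stand-in for /data; on the real box the repo root is /data itself.
ROOT="$(mktemp -d)"; cd "$ROOT"
mkdir -p pipeline config topics/olga-of-kiev/documents
git init -q

# Generated outputs and weights stay out of git; pipeline/ and config/ go in.
# ai-tools/ can join once the secrets move out of the compose files.
cat > .gitignore <<'EOF'
/topics/
/workspace/
/models/
/ai-tools/
EOF
git add .gitignore
```

`git check-ignore topics/olga-of-kiev/documents/rap.txt` is a quick sanity check that a generated file really is excluded.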

Document naming settled too:

  • Initial Source Document - per-source summaries from the LLM
  • Interim Source Document - summary of summaries
  • Final Production Dossier - the polished creative input
  • Image Bank - the 8 concrete images extracted for creative use

What’s next

The n8n scraping workflow needs revisiting. I built it before I had a clear directory structure, so it’s writing to the wrong places and has secrets hardcoded in the compose file. Those are moving to gitignored .env files, with a .env.example committed instead.
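
Keeping the .env.example honest is a one-liner: derive it from the real .env by blanking the values. The variable names here are invented for illustration:

```shell
# Work in a scratch dir for the demo; the real files live next to the compose file.
TMPD="$(mktemp -d)"; cd "$TMPD"

cat > .env <<'EOF'
N8N_API_KEY=supersecret
SCRAPE_TARGET_URL=https://example.com/feed
EOF

# Keep the keys, drop the values: everything after the first '=' goes.
sed 's/=.*/=/' .env > .env.example
```

Regenerating the example whenever a variable is added means the committed file never drifts from the real one, and never contains a value.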

After that: get the 27B actually running via llama.cpp and see what it produces. That’s the test that matters.

I also need to actually start the GitHub organisation (cellfive) and get the pipeline under version control before it gets any bigger.

Lots of good foundations today. Time to build on them.