Tidying up the research summary scripts

Been a productive session today. The local AI pipeline is starting to feel like actual infrastructure, which is either exciting or alarming depending on how you look at it.

The scripts needed work

I’d been running a pair of bash scripts to handle the summary stages — one to extract structured data from each source file individually, one to synthesise those extractions into a single final document. They worked, mostly, but they had accumulated some problems.

A bunch of issues:

  • $TOPIC was being referenced in the system prompt before it was defined in the script, so it silently expanded to an empty string.
  • The JSON payload in the initial script was being built with string interpolation, which is fragile — any quote character in the source text could break it. The final script was already using jq -n to construct the payload safely; bringing the initial script in line with that approach was the right call.
  • The error checking was testing $PIPESTATUS after a jq pipe, which tells you whether jq succeeded, not whether Ollama actually returned a valid response. Replaced with a proper check that the response object contains a .response field before trying to use it.
  • There was a line to strip emojis from the source text. That was a workaround for a problem that no longer exists. Gone.
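The payload and response fixes can be sketched roughly like this — a minimal illustration, not the scripts' actual code, and the function names are mine:

```shell
# Build the Ollama payload with jq -n so quotes and newlines in the
# source text are escaped properly instead of breaking the JSON.
build_payload() {
  local model="$1" system="$2" source_text="$3"
  jq -n \
    --arg model "$model" \
    --arg system "$system" \
    --arg prompt "$source_text" \
    '{model: $model, system: $system, prompt: $prompt, stream: false}'
}

# Check that the response object actually contains a .response field,
# rather than trusting the exit status of a jq pipe.
extract_response() {
  # -e: non-zero exit if .response is missing or null, so the check
  # reflects what Ollama returned, not whether jq itself ran.
  jq -er '.response'
}

# Usage (assumes a local Ollama instance):
#   payload=$(build_payload mistral-nemo "$SYSTEM_PROMPT" "$(cat "$src")")
#   curl -sf http://localhost:11434/api/generate -d "$payload" | extract_response
```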

The scripts also needed to be more flexible about input — I want to be able to call them with either a topic slug (peasants-revolt-1381) or a full path (/data/topics/peasants-revolt-1381), the latter because tab-completion in the terminal is a real quality-of-life thing. Both work now, with the slug derived from the path when needed.
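The slug-or-path logic amounts to something like the following sketch, assuming /data/topics as the base directory (the function name is illustrative):

```shell
TOPIC_ROOT=/data/topics

resolve_topic() {
  local arg="${1%/}"                   # tolerate a trailing slash from tab-completion
  if [[ $arg == */* ]]; then
    TOPIC_DIR=$arg                     # full path given
    TOPIC_SLUG=$(basename "$arg")      # derive the slug from it
  else
    TOPIC_SLUG=$arg                    # bare slug given
    TOPIC_DIR=$TOPIC_ROOT/$arg
  fi
}
```

Either `resolve_topic peasants-revolt-1381` or `resolve_topic /data/topics/peasants-revolt-1381` ends up with the same slug and directory pair.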

Output goes to /data/topics/{topic-slug}/summaries/ consistently. The directory structure is now actually what I said it was going to be.

The empty file problem

The n8n research workflow scrapes web pages and writes the content to source files. Some pages don’t scrape — paywalls, bot protection, whatever — and when that happened, n8n would write a placeholder file containing “No content extracted.” rather than no file at all.

I spotted this because of the token counts logged by the scripts. The summary phase logs how many tokens went into the prompt and how many came out. A prompt token count of 295 for what should be a substantial article is an immediate red flag.
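The logging itself is simple — Ollama's non-streaming /api/generate response includes prompt_eval_count and eval_count fields, so a sketch of the kind of logging that caught this might look like (function name and format are mine):

```shell
# Log Ollama's reported token counts so anomalies stand out, e.g. a
# 295-token prompt for what should be a substantial article.
log_token_counts() {
  local response="$1" label="$2"
  local in out
  in=$(jq -r '.prompt_eval_count // "?"' <<<"$response")
  out=$(jq -r '.eval_count // "?"' <<<"$response")
  echo "[$label] prompt tokens: $in, output tokens: $out" >&2
}
```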

The dangerous part: Mistral Nemo doesn’t have a “nothing to work with” mode. Given a near-empty input, it will confidently produce a plausible-looking summary drawn from its training data. In the case of a History Press article on the Peasants’ Revolt, it generated structured extraction output complete with fake verbatim quotes attributed to a real-sounding but invented source — “Chronicles of the Revolution of 1381 by Thomas Walsingham.” Completely fabricated. Looked entirely legitimate.

The fix is upstream: n8n shouldn’t write the file if there’s nothing to put in it. The scripts also now skip files that are too short to contain useful content, but that’s a safety net rather than the real solution.
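The safety net can be sketched as a simple guard — MIN_BYTES here is an assumed threshold, not a value from the actual scripts:

```shell
MIN_BYTES=500

usable_source() {
  local file="$1"
  # Reject files containing the placeholder line from failed scrapes.
  if grep -qx 'No content extracted\.' "$file"; then
    return 1
  fi
  # Anything under MIN_BYTES is unlikely to be a real article.
  (( $(wc -c <"$file") >= MIN_BYTES ))
}
```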

On fidelity and what LLMs can actually do

This led to a longer conversation about whether LLMs are the right tool for the extraction stage at all. The concern is real: I need the final output to be factually grounded. I’m making content that’s meant to amuse and educate, not to invent history.

The honest answer is that generative models are abstractive by nature — they predict plausible output, they don’t retrieve and quote. You can constrain them heavily, but you can’t change what they are. For guaranteed fidelity, extractive approaches (tools that pull actual sentences from the source rather than generating new ones) are more trustworthy.

That said, the pipeline architecture provides some natural protection. Multiple good sources on the same topic means that real facts appear consistently across summaries and get reinforced during synthesis. A hallucination in one summary is unlikely to survive into the final output if it contradicts everything else. The empty file problem was dangerous precisely because it was feeding a model with no real content to work from — with nothing to extract, it had nothing to be consistent with.

The mitigation that’s actually practical: fix the empty file problem, use enough good sources, and apply human judgement at review. That’s not a perfect QA system, but it’s honest about what the tools can and can’t do.

The QA rabbit hole

I went down the path of thinking about an automated QA stage to verify the final summary against the sources. The logic falls apart quickly: to verify facts you need a trusted reference, but your sources are your trusted reference, so you’re verifying the summary matches the sources rather than verifying historical truth. That’s tractable, but it requires attribution — every claim tagged with the source text that supports it — which adds token overhead and complicates the pipeline.

For the extraction stage to be mechanically verifiable, the model would need to quote its sources inline. That makes QA a grep operation rather than another model call. But it makes the intermediate files noisier, and the synthesis stage then has to strip the quotes before passing content to the creative stage.
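The grep version would be something like this sketch — it assumes inline quotes are marked as lines starting with "> ", which is my convention for illustration, not anything the pipeline currently does:

```shell
# Check every quoted line in a summary verbatim against its source file.
verify_quotes() {
  local summary="$1" source="$2" failed=0
  while IFS= read -r quote; do
    quote=${quote#> }
    # -F: fixed-string match, so the quote must appear literally.
    if ! grep -qF -- "$quote" "$source"; then
      echo "unsupported quote: $quote" >&2
      failed=1
    fi
  done < <(grep '^> ' "$summary")
  return "$failed"
}
```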

It’s solvable. It’s also more complexity than the problem warrants right now. The empty file fix is the real solution to the real problem.

Quantisation

Running Mistral Nemo at q8_0 — essentially near-full precision — is the reason the summary stage is slower than it should be. The model is spilling out of VRAM into system RAM, and the PCIe bandwidth between the two is the actual bottleneck, not the GPU compute. This is a dry structured extraction task at temperature zero. It doesn’t need near-full precision.

Dropping to q4_K_M should fit cleanly within 12GB VRAM and run meaningfully faster. The quality difference for this kind of task should be negligible. That’s the next thing to test.
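The back-of-envelope arithmetic supports this — assuming roughly 12.2B parameters and typical GGUF bits-per-weight (q8_0 around 8.5 bpw, q4_K_M around 4.8 bpw; real file sizes vary a little, and KV cache is extra):

```shell
# Rough model size in GB: params (billions) times bits per weight, over 8.
model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'
}

echo "q8_0:   $(model_gb 12.2 8.5) GB"   # spills past 12GB VRAM
echo "q4_K_M: $(model_gb 12.2 4.8) GB"   # fits with headroom
```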

The summary stages should be fast and cheap. They’re not the interesting part.