Fixing n8n issues and assumptions
Another evening on the pipeline. Tonight was less about building new things and more about fixing assumptions I’d made early on that turned out to be wrong. The kind of session that feels slow while you’re in it but leaves everything in a better state than you found it.
The Fordham problem
I’ve been using SearXNG to search for Fordham Sourcebook documents, and it’s been unreliable. The Fordham Sourcebook is genuinely excellent for primary sources - exactly the kind of material that makes the rap and explainer content historically grounded rather than generic - so getting it working properly matters.
The fix was obvious in retrospect: stop relying on third-party search engines to find Fordham documents and go direct to Fordham’s own search. Their site uses a simple GET form powered by Funnelback, which means the search URL is just:
https://search.fordham.edu/s/search.html?collection=fordham~sp-search&query=your+topic
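As a sketch, building that URL in an n8n Code node might look like this (the helper name is mine; the collection value is taken verbatim from the URL above):

```javascript
// Build the Fordham/Funnelback search URL for a topic.
function buildFordhamSearchUrl(topic) {
  // encodeURIComponent leaves '~' alone; swap %20 for '+' to match the GET form's encoding
  const query = encodeURIComponent(topic.trim()).replace(/%20/g, '+');
  return `https://search.fordham.edu/s/search.html?collection=fordham~sp-search&query=${query}`;
}

// buildFordhamSearchUrl('siege of acre')
// → 'https://search.fordham.edu/s/search.html?collection=fordham~sp-search&query=siege+of+acre'
```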
Two new n8n nodes - an HTTP Request to that URL, and an HTML extractor that parses the results - and the pipeline now reliably finds primary sources that SearXNG was consistently missing. For the Siege of Acre I got three Fordham documents including the contemporary account of Richard fighting from a litter because dysentery had floored him. That’s exactly the kind of concrete detail the Image Bank needs.
One thing worth noting: Fordham’s search results contain two types of pages. The source/*.asp URLs are actual primary source documents. The sbook*.asp URLs are index pages - lists of sources, not sources themselves. A simple filter on the URL pattern keeps only the useful ones.
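The filter itself is one regex test; a minimal sketch, assuming the two URL shapes described above (the example paths are illustrative, not real Fordham filenames):

```javascript
// Keep only primary-source documents; index pages don't live under /source/.
const isPrimarySource = (url) => /\/source\/[^/]+\.asp/i.test(url);

const urls = [
  'https://sourcebooks.fordham.edu/source/example-account.asp', // a source document (hypothetical name)
  'https://sourcebooks.fordham.edu/sbook1k.asp',                // an index page
];
const sources = urls.filter(isPrimarySource); // keeps only the /source/*.asp entry
```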
I checked their robots.txt before building this - got a 404, which means no restrictions defined. One request per topic, proper User-Agent header identifying the project. Being a good citizen costs nothing.
Single files per source
This one I should have done from the start. The original workflow collected all scraped content and concatenated it into one large file per topic. That made a certain kind of sense when I was thinking about feeding everything to an LLM in one go - but that approach doesn’t work on my hardware anyway. The context window problem means I need to summarise each source individually before combining the summaries.
So the workflow now writes one file per source, named by URL slug, into the correct topic directory:
/data/topics/the-cadaver-synod-897-ad/research/source_en_wikipedia_org_wiki_cadaver_synod.txt
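The slug logic can be sketched like so, assuming the filename is derived from hostname plus path (the helper name is mine):

```javascript
// Turn a source URL into a flat filename: source_<host_and_path_slug>.txt
function sourceFilename(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_') // dots, slashes, hyphens all become underscores
    .replace(/^_+|_+$/g, '');    // trim stray leading/trailing underscores
  return `source_${slug}.txt`;
}

// sourceFilename('https://en.wikipedia.org/wiki/Cadaver_Synod')
// → 'source_en_wikipedia_org_wiki_cadaver_synod.txt'
```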
The directory gets created by the shell script before the webhook fires, so n8n never has to worry about whether the path exists. Clean separation - shell manages the filesystem, n8n manages the content.
Running the pipeline against all seven topics and seeing the directory structure actually populate was a good moment. The thing is starting to look like a real pipeline.
Cleanup improvements
The source text cleaning was also harbouring a bug I hadn’t noticed. The Medievalists cleaner had this line:
text = text.replace(/\s+/g, ' '); // Collapse multiple spaces into one
The comment says “collapse multiple spaces” but \s matches newlines too. Every paragraph break was being destroyed, turning readable text into a single wall of characters that the LLM was struggling to parse. The fix is to use [^\S\n]+ instead - which matches whitespace but explicitly excludes newlines.
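Side by side, the difference is easy to see:

```javascript
// Buggy: \s matches newlines, so paragraph breaks are flattened away
// text = text.replace(/\s+/g, ' ');

// Fixed: [^\S\n] means "whitespace that isn't a newline"
const collapseSpaces = (text) => text.replace(/[^\S\n]+/g, ' ');

// collapseSpaces('word\t\t  word\n\nnext para') → 'word word\n\nnext para'
```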
While I was in there I consolidated all three cleanup nodes (Fordham, Medievalists, fallback) into a single shared function. Same logic everywhere, site-specific patterns passed in as an array. One place to fix things when I find new noise patterns, which I will.
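A sketch of the shape of that shared function (the name and example patterns are mine, not the actual node code):

```javascript
// One cleanup function for all sites; site-specific noise arrives as an array of regexes.
function cleanSource(text, sitePatterns = []) {
  let out = text;
  for (const pattern of sitePatterns) {
    out = out.replace(pattern, ''); // strip nav links, footers, etc.
  }
  out = out.replace(/[^\S\n]+/g, ' ');  // collapse spaces/tabs, keep newlines
  out = out.replace(/\n{3,}/g, '\n\n'); // squeeze runs of blank lines to one paragraph break
  return out.trim();
}

// Per-site noise is just data now, e.g.:
// cleanSource(raw, [/Back to top/g, /Copyright .*/g]);
```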
Added emoji stripping too - they’re wasted tokens and some sites use them liberally in their content. Unicode ranges \u{1F000}-\u{1FFFF} and \u{2600}-\u{27FF} catch the lot. The u flag on the regex matters here - without it JavaScript treats the pattern as UTF-16 code units and can’t match code points above U+FFFF.
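The stripping step itself is one replace (sketch, using the two ranges just mentioned):

```javascript
// Strip emoji; the u flag makes the \u{...} escapes and astral code points work.
const stripEmoji = (text) =>
  text.replace(/[\u{1F000}-\u{1FFFF}\u{2600}-\u{27FF}]/gu, '');

// stripEmoji('hot \u{1F525} take') → 'hot  take'
// (the leftover double space gets cleaned up by the whitespace collapse)
```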
Also added paragraph-level deduplication. Some sites render the same content multiple times in different HTML structures for mobile, desktop, and print. Normalise each paragraph to lowercase, track what you’ve seen, drop verbatim repeats.
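Those three steps can be sketched as (function name is mine):

```javascript
// Drop verbatim repeated paragraphs, keeping the first occurrence of each.
function dedupeParagraphs(text) {
  const seen = new Set();
  const kept = [];
  for (const para of text.split(/\n{2,}/)) {
    const key = para.trim().toLowerCase(); // normalised form used for comparison only
    if (!key || seen.has(key)) continue;   // skip empties and verbatim repeats
    seen.add(key);
    kept.push(para.trim());
  }
  return kept.join('\n\n');
}

// dedupeParagraphs('Alpha\n\nBETA\n\nalpha') → 'Alpha\n\nBETA'
```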
On working with Claude
I want to be honest about something. I switched from Gemini to Claude a while back specifically because Gemini was too agreeable - it would enthusiastically encourage approaches that sounded reasonable but didn’t fit my actual constraints. RAG, huge context windows, command-R 35B. All fine suggestions if you have the hardware or the cloud budget. Not fine if you have a 3060 and a principle about keeping everything local.
Claude pushes back. When I’m wrong it says so. When I suggest something that won’t work on my hardware it says that too. That’s worth more than enthusiasm.
Tonight’s session was a good example. I’d made early assumptions about the workflow that were sensible at the time but needed revisiting as the pipeline matured. Having something that could look at the actual workflow JSON, understand the full context of what I’m building, and give concrete specific fixes rather than generic advice - that’s genuinely useful. The Fordham node sequence came out of a conversation about the HTML structure of their search results. The consolidated cleanup function came from a conversation about where the wall-of-text bug actually lived.
The pipeline is starting to feel like a real thing. Seven topic directories, proper source files, Fordham working, cleanup working. Next step is the summarisation stage - turning these source files into Initial Source Documents, then an Interim, then the Final Production Dossier. That’s where the LLM work really begins.