過濾器沒有出錯,它只是把連結全部吃掉了

7-Eleven 的包裹代收名單貼在大廳玻璃上,上面列著合作的快遞公司。郵局不在上面。郵局的包裹每次送到都被退回,沒有通知、沒有錯誤紀錄,收件人只知道什麼都沒收到。這個 bug 的結構幾乎一模一樣。

技術環境

n8n 自動化 pipeline,LLM 生成 HTML 文章後,送入 WordPress 前經過一層 strip_tags 型白名單清洗。整個流程同步執行:Prompt → LLM 輸出 → HTML 過濾 → WP REST API 寫入,沒有非同步佇列。問題與語言模型的選擇無關——任何在寫入前有靜默過濾步驟、且清單規則未定期稽核的 pipeline,都會複現相同行為。

連續三輪,外部連結:0

SEO 工具回報的數字是 0。不是 1,不是偶爾,是三輪都是 0。

Prompt 裡有明確的要求:生成文章時帶入參考連結。生成模型也確實輸出了錨點標籤,HTML 裡看得到 <a href>。但最終寫進資料庫的文章,一條連結都沒有。Yoast SEO 掃到的 outbound links 欄位始終是空的。

往上追才看到那個步驟:輸出在寫入前,會經過一層 HTML 允許清單的清洗。清單上列著所有能保留的標籤:<h4>、<p>、<blockquote>、<strong>、<em> 等格式標籤。錨點標籤 <a> 不在裡面。

過濾器沒有報錯。它按照設計,把不在清單上的東西靜默移除。

錯誤傳染鏈(時序)

n8n Prompt           LLM 輸出              HTML 過濾層            WordPress DB
     |                   |                      |                       |
     |── 生成請求 ──────>|                      |                       |
     |                   |── HTML(含 <a>) ───>|                       |
     |                   |                      |── 白名單比對          |
     |                   |                      |   <a> 不在清單        |
     |                   |                      |── 靜默移除 <a>        |
     |                   |                      |── 輸出無連結 HTML     |
     |                   |                      |── POST /wp/v2/posts ─>|
     |                   |                      |                       |── OK ✓
LLM 輸出:含連結 ✓                              DB 存入:無連結 ✗

過濾層是唯一差異點:LLM 輸出驗證通過、DB 寫入成功,只有中間這一層在不回報任何錯誤的情況下改變了輸出內容。

為什麼沒有立刻看出來

這類 bug 難在它的表現完全像「正常」。生成那端沒問題,輸出看起來正確,寫入也成功——只有資料庫裡的那份 HTML 是乾淨過的版本。如果沒有人主動去比對「生成的 HTML」和「入庫的 HTML」,這個差距不會自己浮出來。

SEO 分數低,第一直覺是模型沒有生成連結。去看 prompt,發現有要求。去看生成結果,發現有輸出。到這裡很容易停下來,以為是模型不穩定、隨機跳過。實際上連結一直都在,只是在最後一步被安靜地過濾掉了。

靜默過濾是這類問題最難被發現的原因。它不拋出例外,不留下 log,不改變整體流程的成功狀態。資料寫進去了,程序結束了,一切看起來都好。

修法分兩層

第一層是在允許清單補上 <a> 標籤及其 href 屬性。這是必要條件,沒有這一步,連結永遠進不了資料庫。

第二層是在 prompt 新增一個強制的「Outbound Links」段落要求,指定每篇文章中英文版各嵌入 1-2 個對外連結,只連維基百科、MDN、官方文件這類長期存在的權威來源,連通用概念而非特定主張。

第二層的用意不只是補強,而是讓連結的存在不再依賴生成模型的自由發揮。Prompt 要求模型輸出,允許清單讓輸出能活過過濾步驟——兩個條件都滿足,連結才能安全落地。

Code 對照:修法前後

修法前(允許清單缺少錨點標籤)

# n8n Function node — HTML 清洗
ALLOWED_TAGS = [
    'h4', 'p', 'blockquote',
    'strong', 'em', 'code', 'pre',
    'ul', 'li', 'ol', 'hr'
    # <a> 不在清單,所有錨點標籤靜默移除
]
clean_html = strip_tags(raw_html, allowed_tags=ALLOWED_TAGS)

修法後(補上 <a>,Prompt 端同步加強)

# 修法一:允許清單補上錨點標籤
ALLOWED_TAGS = [
    'h4', 'p', 'blockquote',
    'strong', 'em', 'code', 'pre',
    'ul', 'li', 'ol', 'hr',
    'a'  # ← 補上;允許的屬性(href / rel / target)列在下方 ALLOWED_ATTRS
]
ALLOWED_ATTRS = {'a': ['href', 'rel', 'target']}
clean_html = strip_tags(raw_html, allowed_tags=ALLOWED_TAGS, allowed_attrs=ALLOWED_ATTRS)

# 修法二:Prompt 強制要求 Outbound Links 段落
# 中英各嵌入 1-2 個外部連結,只連 Wikipedia / MDN / 官方文件

該被隔離的側效應類型

  • SEO 指標污染:Yoast 外連計數長期為 0,導致評分持續扣分,觀察者會以為是模型或 prompt 問題,而非 pipeline 結構問題。
  • 稽核工具誤判:任何在過濾後才執行的 HTML 驗證工具(連結掃描、可及性檢查)都會把空連結視為「正確狀態」,失去偵錯價值。
  • 生成端的假驗證:LLM 輸出的品質評估若在過濾前執行,結果不反映最終入庫內容——驗證工具和生產結果看到的是不同版本。
  • Prompt 迭代方向錯誤:連結消失被歸因於模型不穩定,導致工程資源投入錯誤方向,實際根因(過濾層)反而未被調查。
  • 靜默版本漂移:模型更新後輸出新 HTML 結構,白名單未同步更新,新標籤同樣被吃掉,且不觸發任何通知。
  • A/B 測試污染:若分組依賴錨點(追蹤連結、UTM 參數),過濾層的存在讓實驗資料完全無效,卻不觸發任何 alert。
  • 讀者體驗靜默降級:讀者無法點擊應有的參考連結,但不會看到錯誤訊息——只知道文章「沒有連結」,無法回報具體問題。
  • 延伸閱讀連結失效風險:若延伸閱讀區段的連結也經過此過濾層,站內連結可能被靜默移除,進一步影響頁面 SEO 結構。

判斷標準:如果這段處理邏輯失敗或靜默修改輸出,卻不影響主流程的成功狀態回報——它就是需要獨立稽核的邊界點。

一個漏掉的標籤等於一道禁令

允許清單的邏輯是白名單:沒列到的,一律不通過。這個設計本身沒有問題,它的存在是為了防止任意 HTML 注入。但白名單有個隱性的維護成本——每次有新的合法用途,就需要有人主動把對應的標籤加進去,否則那個用途永遠被靜默封鎖。

漏掉 <a>,等於對所有連結下了一道沒有公告的禁令。不是刻意的,只是沒想到。

下次碰到「輸出看起來正確但結果不符預期」,值得問的不只是生成端,而是中間有沒有一層會靜默修改內容的過濾步驟——它不會主動告訴你它動了什麼。

— 邱柏宇


The Filter Didn’t Fail — It Just Ate Every Link

There’s a parcel pickup list posted on the lobby door of many older apartment buildings — a list of approved courier companies. If your package arrives with a carrier not on that list, it gets turned away. No error. No notification. You just never receive it.

This bug worked exactly the same way.

Technical Environment

The setup is an n8n automation pipeline. The LLM generates HTML articles; before writing to WordPress, the output passes through a strip_tags-style allowlist sanitization step. The flow is synchronous, with no async queues: Prompt → LLM output → HTML filter → WP REST API write. The issue does not depend on the choice of language model or framework: any pipeline with a silent filtering step before the write and an allowlist that is not audited regularly will reproduce the same behavior.

Three Rounds, Zero Outbound Links

The SEO tool returned the same number three times in a row: 0 outbound links. Not 1, not occasionally: 0 on every single run.

The prompt explicitly requested reference links. The language model dutifully produced anchor tags in its output — they were visible in the generated HTML. But by the time the article reached the database, every link had vanished. Yoast’s outbound links field was empty every time.

Tracing back through the pipeline revealed the culprit: before writing to the database, the HTML output passed through a sanitization step with an allowlist of permitted tags. The list named every tag allowed to survive: <h4>, <p>, <blockquote>, <strong>, <em>, and a handful of other formatting tags. The anchor tag <a> was not on it.

The filter threw no errors. It did exactly what it was designed to do: silently remove anything not on the list.
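To make the behavior concrete, here is a minimal sketch of an allowlist filter built on Python's standard html.parser. It is not the pipeline's actual strip_tags helper (that code is not shown in this post); the class and function names are illustrative assumptions, and attributes are dropped entirely to keep the sketch short.

```python
from html.parser import HTMLParser


class AllowlistSanitizer(HTMLParser):
    """Keeps text and allowlisted tags; drops every other tag without reporting it."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=False)
        self.allowed = set(allowed_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_startendtag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)  # text always survives, only markup is filtered

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def sanitize(raw_html, allowed_tags):
    """Hypothetical stand-in for the pipeline's strip_tags-style step."""
    parser = AllowlistSanitizer(allowed_tags)
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)


ALLOWED_TAGS = ["h4", "p", "blockquote", "strong", "em"]  # no "a"
raw = '<p>See <a href="https://developer.mozilla.org/">MDN</a> for details.</p>'
clean = sanitize(raw, ALLOWED_TAGS)
print(clean)  # <p>See MDN for details.</p>
```

The link text survives and the markup around it disappears, which is why the stored article still reads naturally and nothing looks broken at a glance.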

Error Propagation Sequence

n8n Prompt        LLM Output          HTML Filter Layer         WordPress DB
     |                 |                      |                       |
     |── generate ────>|                      |                       |
     |                 |── HTML (with <a>) ──>|                       |
     |                 |                      |── allowlist check     |
     |                 |                      |   <a> not listed      |
     |                 |                      |── silently removed    |
     |                 |                      |── clean HTML out      |
     |                 |                      |── POST /wp/v2/posts ─>|
     |                 |                      |                       |── OK ✓
LLM output: links ✓                            DB stored: no links ✗

The filter is the only divergence point: LLM output validates correctly, DB write succeeds — only this middle layer silently altered the content without reporting any error.

Why It Wasn’t Caught Immediately

The difficulty with this class of bug is that everything looks normal at every visible checkpoint. The generation step succeeds. The output looks correct. The write operation completes without errors. The only place the problem exists is inside the stored HTML — and nobody was comparing the generated version against the persisted version.
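The missing comparison is cheap to automate. The following is a rough sketch, assuming both the generated HTML and the persisted HTML are available as strings; how the stored copy is fetched back from WordPress is left out, and the function names are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def lost_links(generated_html, stored_html):
    """Returns hrefs present in the generated HTML but absent from the stored HTML."""
    stored = set(collect_hrefs(stored_html))
    return [href for href in collect_hrefs(generated_html) if href not in stored]


generated = '<p>See <a href="https://en.wikipedia.org/wiki/HTML">HTML</a>.</p>'
stored = "<p>See HTML.</p>"
missing = lost_links(generated, stored)
print(missing)  # ['https://en.wikipedia.org/wiki/HTML']
```

Run after each write, a non-empty result would have pointed at the filter on the first round instead of the third.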

The natural first suspicion was that the model wasn’t generating links reliably. Check the prompt: the requirement is there. Check the model output: the links are there. It’s easy to stop at that point and conclude it’s model inconsistency. The links were never missing from the generation. They were removed at the last step, quietly.

Silent filtering is the reason this kind of issue stays hidden. No exception, no log entry, no change in the overall success status of the pipeline. Data written, process complete, everything looks fine.
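One way to make the filter less silent is to report what it is about to drop. The sketch below assumes the check runs right before the sanitization step with the same allowlist; the logger name and function names are illustrative, not part of the original pipeline.

```python
import logging
from collections import Counter
from html.parser import HTMLParser

logger = logging.getLogger("html_filter")  # hypothetical logger name


class TagCounter(HTMLParser):
    """Counts every opening tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def tags_to_be_dropped(raw_html, allowed_tags):
    """Returns a Counter of tags in raw_html that the allowlist would remove."""
    counter = TagCounter()
    counter.feed(raw_html)
    counter.close()
    allowed = set(allowed_tags)
    return Counter({tag: n for tag, n in counter.tags.items() if tag not in allowed})


def warn_on_dropped_tags(raw_html, allowed_tags):
    """Logs a warning when the filter is about to remove anything."""
    dropped = tags_to_be_dropped(raw_html, allowed_tags)
    if dropped:
        logger.warning("HTML filter will drop tags: %s", dict(dropped))
    return dropped


raw = (
    '<p>Intro <a href="https://example.com/">one</a> '
    'and <a href="https://example.org/">two</a>.</p>'
)
dropped = warn_on_dropped_tags(raw, ["h4", "p", "blockquote", "strong", "em"])
```

A warning is a deliberate choice here: the filter still does its job, but a dropped tag now leaves a trace that someone can find.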

The Fix Has Two Parts

First: add the <a> tag, along with its href attribute, to the allowlist. Without this, no link ever survives the sanitization step.

Second: add a mandatory Outbound Links section to the prompt, requiring the model to embed 1–2 external links per article in both Chinese and English versions — linking only to stable, authoritative sources like Wikipedia, MDN, or official documentation, and only to general concepts rather than specific claims.

The second fix isn’t redundant. It decouples link presence from the model’s discretion. The prompt ensures the model produces links; the allowlist ensures those links survive the filter. Both conditions need to hold for a link to make it into the published article.
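The prompt rule can also be checked mechanically on the model's output. This is only a sketch: the trusted host list is an assumption (official documentation sites would have to be added per project), and it treats every <a href> as an outbound link without separating internal ones.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: an illustrative host list, not the author's actual configuration.
TRUSTED_HOST_SUFFIXES = ("wikipedia.org", "developer.mozilla.org")


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def is_trusted(href):
    host = urlparse(href).hostname or ""
    return any(host == s or host.endswith("." + s) for s in TRUSTED_HOST_SUFFIXES)


def outbound_link_problems(html, min_links=1, max_links=2):
    """Returns a list of human-readable problems; an empty list means the rule holds."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    hrefs = collector.hrefs

    problems = []
    if not (min_links <= len(hrefs) <= max_links):
        problems.append(
            f"expected {min_links}-{max_links} outbound links, found {len(hrefs)}"
        )
    problems.extend(f"untrusted link target: {h}" for h in hrefs if not is_trusted(h))
    return problems


ok_article = '<p>See <a href="https://en.wikipedia.org/wiki/HTML">HTML</a>.</p>'
print(outbound_link_problems(ok_article))  # []
```

Running the same check on the stored HTML covers the second condition as well: a link that passes here but fails after the write was removed by the filter, not skipped by the model.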

Code Diff: Before and After

Before (anchor tag missing from allowlist)

# n8n Function node — HTML sanitization
ALLOWED_TAGS = [
    'h4', 'p', 'blockquote',
    'strong', 'em', 'code', 'pre',
    'ul', 'li', 'ol', 'hr'
    # <a> not listed; all anchor tags silently removed
]
clean_html = strip_tags(raw_html, allowed_tags=ALLOWED_TAGS)

After (anchor tag added, prompt reinforced)

# Fix 1: add anchor tag to allowlist
ALLOWED_TAGS = [
    'h4', 'p', 'blockquote',
    'strong', 'em', 'code', 'pre',
    'ul', 'li', 'ol', 'hr',
    'a'  # ← added; its permitted attributes (href / rel / target) are listed in ALLOWED_ATTRS below
]
ALLOWED_ATTRS = {'a': ['href', 'rel', 'target']}
clean_html = strip_tags(raw_html, allowed_tags=ALLOWED_TAGS, allowed_attrs=ALLOWED_ATTRS)

# Fix 2: prompt-level enforcement
# Require 1-2 outbound links per article (both ZH and EN versions)
# Only link to Wikipedia / MDN / official documentation
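Allowing href reopens a small part of the injection surface the allowlist exists to close, so it may be worth constraining the attribute values too. The following sketch is an assumption about how an attribute filter could behave, not a description of what strip_tags does with allowed_attrs; SAFE_SCHEMES and filter_attrs are hypothetical names.

```python
from urllib.parse import urlparse

ALLOWED_ATTRS = {"a": ["href", "rel", "target"]}
SAFE_SCHEMES = {"http", "https"}  # assumption: only absolute web links are wanted


def filter_attrs(tag, attrs):
    """Keeps allowlisted attributes and drops href values with other URL schemes."""
    allowed = ALLOWED_ATTRS.get(tag, [])
    kept = []
    for name, value in attrs:
        if name not in allowed:
            continue  # e.g. onclick, style
        if name == "href" and urlparse(value or "").scheme.lower() not in SAFE_SCHEMES:
            continue  # e.g. javascript: URLs, and also relative links in this sketch
        kept.append((name, value))
    return kept


attrs = [("href", "https://developer.mozilla.org/"), ("onclick", "x()")]
print(filter_attrs("a", attrs))  # [('href', 'https://developer.mozilla.org/')]
```

Note that this version also drops relative links, which matters if internal links pass through the same filter; that trade-off would need a decision per site.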

Side Effects That Should Be Isolated

  • SEO metric pollution: Yoast’s outbound link count stays at 0, triggering persistent score penalties. Observers assume the problem is in the model or prompt — not the pipeline structure.
  • Auditing tool invalidation: Any HTML validation tool that runs post-filter will treat the stripped output as the correct baseline, losing all diagnostic value.
  • False LLM quality signal: If LLM output is evaluated before filtering, the quality assessment doesn’t reflect what ends up in the database — validation and production are looking at different versions.
  • Prompt engineering misdirection: Missing links get attributed to model instability, causing effort to focus on prompt tuning while the actual root cause goes uninvestigated.
  • Silent version drift: When the LLM is updated and produces new HTML structures, the un-updated allowlist will strip those new tags too, with no alert triggered.
  • A/B test contamination: If test variants rely on anchor tags (tracking links, UTM parameters), the filter silently invalidates all experiment data without triggering any alert.
  • Silent reader experience degradation: Readers can’t click reference links that should be there. No error shown. The article just looks like it has no links — nothing concrete to report.
  • Internal link structure risk: If related-post sections also pass through this filter, internal links may be silently removed, further degrading SEO link architecture.

The heuristic: if a processing step can fail or silently modify its output without changing the success status reported to the caller, treat it as a boundary that needs its own independent audit.

A Missing Tag Is an Undeclared Ban

An allowlist operates on a simple rule: anything not explicitly permitted is removed. The design is sound — it exists to prevent arbitrary HTML injection. But allowlists carry a maintenance obligation. Every time there’s a new legitimate use case, someone has to add the corresponding tag. If nobody does, that use case is silently blocked forever.

Leaving <a> off the list placed an unannounced ban on every link. Not intentional. Just overlooked.

The next time output looks correct but results don’t match expectations, it’s worth asking whether there’s a filtering step somewhere in the middle — one that modifies content without announcing what it removed.

— 邱柏宇
