The Self-Check Leaked Into the Answer
Think of ordering breakfast at a street-side shop in Taiwan. The auntie jots your special requests — less sugar, no ice — on a small slip of paper, then stuffs that slip directly into your bag. You open the bag and find the note instead of your drink.
That’s structurally what happened here.
What Broke
After a content pipeline shipped v5, the Parse Article node started failing at a 50% rate. The error was unambiguous: Unexpected token '#', "## Pre-Wri".... The JSON parser saw # as the first character and threw. Claude had written the Pre-Write Self-Check markdown notes into the output body, followed by the expected JSON. The parser assumed clean JSON, called JSON.parse(text) with no preprocessing, and half of all executions died.
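A minimal sketch of the failure mode, with the leaked output reconstructed from the error message (the content here is hypothetical; only the shape matters):

// What a leaked v5 output looked like: self-check notes first,
// the expected JSON only after them.
const text = [
  '## Pre-Write Self-Check',
  '- Tone: consistent',
  '- Structure: valid',
  '',
  '{"title": "...", "body": "..."}',
].join('\n');

// The v5-era Parse Article node effectively did this, with no preprocessing:
JSON.parse(text);
// -> SyntaxError: Unexpected token '#', "## Pre-Wri"... is not valid JSON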
The Dividing Line
Before v5, failure rate was near zero. v5 added one instruction: perform a self-check before writing. The intent was tighter output quality. The result was a 50% failure rate.
The issue isn’t the self-check concept — it’s that Claude’s interpretation of “first do X then Y” is not deterministic. Sometimes it places the self-check inside the thinking block. Sometimes it writes it directly into the output. Extended thinking is not a guarantee; the model only routes through that path when it decides to. Write “do X first” in a prompt and it may produce X in the output rather than processing it silently.
Why It Wasn’t Obvious
The failure wasn’t consistent. Some executions went through cleanly. That made it look like random model variance rather than a systemic prompt design flaw. The prompt already said “Return ONLY valid JSON” — it seemed covered. It wasn’t. The markdown headers earlier in the same prompt were actively demonstrating a markdown-formatted response style, and that context outweighed the output format instruction.
Verification and the Fix
Pulling raw outputs from failed versus successful executions made the pattern clear: failures began with ## Pre-Write Self-Check, successes began with {. Consistent, not random. That ruled out API instability and pointed squarely at the prompt.
The fix runs on two layers. Upstream, the self-check protocol heading was hardened to "MANDATORY — perform this INSIDE your thinking/reasoning block only; NEVER include this section text in the JSON response body", and a CRITICAL OUTPUT RULE was added to the output format section, spelling out the consequence: "if ANY markdown appears before the opening { the pipeline fails". Downstream, a strip pass before JSON.parse() now finds the first { and the last } and slices to that range, regardless of what precedes or follows the object.
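A minimal sketch of the downstream layer, assuming the node receives the raw model text as a string (the function name and error message are illustrative, not the pipeline's actual code):

// Slice from the first '{' to the last '}' so a leaked markdown prefix,
// or trailing commentary, no longer kills the run.
function extractJson(text) {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end < start) {
    throw new Error('No JSON object found in model output');
  }
  return JSON.parse(text.slice(start, end + 1));
}

extractJson('## Pre-Write Self-Check\nnotes here\n{"title": "x"}');
// -> { title: 'x' }

This assumes a single top-level JSON object; a stray { inside leaked prose before the real payload would still fool it, which is one more reason the prompt-side layer stays in place.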
Prompt hardening handles the majority of cases. Parser resilience catches the rest. Neither layer alone is sufficient — the model is not deterministic, and the parser has to be built assuming occasional non-compliance.
Any prompt instruction in “do X then Y” form carries this risk. If X could conceivably appear in output, either constrain it explicitly to the thinking block in the prompt, or build the parser to survive it. Quality control steps are not exempt from being failure sources.
— 邱柏宇
Related Posts
https://justfly.idv.tw/s/H5rmPAe