The API That Made the Pipeline Smarter Is Also Where It Most Easily Hangs

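From Static to Dynamic: A Small Step, a Big Risk

The auntie who runs a Taiwanese breakfast shop starts prepping a little after five in the morning. The menu is mostly fixed, but she adjusts it to whatever ingredients are on hand that day. If the market is closed she doesn’t shut the shop; she cooks what she made yesterday. That logic is more mature than a lot of pipeline designs.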

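Technical Environment

A Node.js pipeline with a live call to an external cultural-events HTTP API inserted in the middle of the main path, to fetch material for the downstream LLM node. The API streams its response in chunks, which must be reassembled with Buffer.concat before the JSON is parsed. The original design set no timeout and implemented no fallback path: if the call hangs or the payload is corrupted, downstream nodes either wait or receive bad input. The pattern is framework-independent; any node that waits synchronously on an external service in the main path, without isolation, will reproduce it.

The original pipeline controlled its output scenes with static prompts. Every run produced similar results that became predictable over time: stable but dull. To vary the output, a live API call to an external service was added mid-pipeline to pull the latest local cultural events as material. The output did come alive, but the pipeline also gained a node that can hang the whole run.

The change itself is not the problem. The problem is adding “smart” without designing, at the same time, what happens when smart is unavailable.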

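The Nature of External Services: You Don’t Control Them

External services are outside your control. A slow network, a paused service, a response format that changes without notice: none of these can be prevented from your side. Embedding a node you don’t control directly in the pipeline’s main path hands part of your system’s stability to someone else.

If this call has no timeout, a slow response leaves the pipeline hanging at this node with all later work waiting. If the chunked response is not reassembled correctly with Buffer.concat, you may silently receive truncated or corrupted content; no error is reported clearly, and the output simply turns strange. The worst case: you think the pipeline is running, but it is actually stuck on that API call, waiting.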

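Error Propagation Chain (Sequence)

Scenario A: no timeout, service slow or unresponsive

Pipeline Runner         External Event API         LLM Node
      |                        |                      |
      |── GET /cultural-events >|                      |
      |                        |                      |
      |  (no timeout set)      | ← no response / very slow
      |  ... hanging ...       |                      |
      |  all downstream nodes  |                      |
      |  blocked               |                      |
Pipeline state: appears to be running ✓ / actually: stuck waiting on the external service ✗

Scenario B: streamed chunks + faulty Buffer reassembly

Pipeline Runner         External Event API         LLM Node
      |                        |                      |
      |── GET /cultural-events >|                      |
      |<── chunk 1 ────────────|                      |
      |<── chunk 2 (conn drop) |                      |
      |  chunks.join('') truncates                     |
      |  corrupt JSON continues downstream             |
      |──────────────────────────── corrupt input ────>|
      |<────────────────────────── baffling output ─── |
Pipeline state: ran to completion ✓ / output: corrupted or garbled ✗

What the two failure modes share: the pipeline reports no error, and the problem stays silent until the output or the performance looks abnormal.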

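This is where misdiagnosis happens. On first seeing the pipeline slow down or the output go wrong, you check the LLM side: is the prompt broken, did the model return something odd. The real cause is the earlier node, the one quietly waiting on the external service, which nobody is watching.

The Core of the Fix Is Not Making It More Stable but Making It Fail More Gracefully

Wrap the call in try/catch with an explicit timeout. On any failure (network timeout, a 500 from the service, a parse error) the pipeline quietly takes the fallback instead of bringing the whole run down. The fallback is the static material, the version that was already in use.

The key judgment in this design: dynamic data is a bonus for output quality, not a precondition for the system. Once it is designed as a precondition, any jitter in the external service becomes your P0 incident.

Chunked responses follow the same logic. Reassembling correctly with Buffer.concat is not only about making the data complete; it makes the failure mode predictable. With correct reassembly, truncated data is caught in the try/catch and the fallback runs. With incorrect reassembly, corrupted data reaches the next node and there is no telling where the error will surface.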

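Code Comparison: Before and After the Fix

Before: no timeout, faulty stream reassembly, corrupt input goes straight to the LLM

// External event API call (before the fix)
const response = await fetch(externalApiUrl); // ← no timeout
const chunks = [];
for await (const chunk of response.body) {
  chunks.push(chunk);
}
const data = chunks.join('');          // ← string join, not Buffer.concat; each chunk is stringified separately, so the bytes are never reassembled
const context = JSON.parse(data);      // ← no try/catch, no validation; whatever this yields goes straight downstream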

After: explicit timeout + Buffer.concat + try/catch fallback

// External event API call (after the fix)
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 5000); // ← explicit timeout (covers headers and body)
let context;
try {
  const response = await fetch(externalApiUrl, { signal: controller.signal });
  if (!response.ok) throw new Error(`HTTP ${response.status}`); // ← fetch does not reject on a 500
  const chunks = [];
  for await (const chunk of response.body) {
    chunks.push(chunk);
  }
  const data = Buffer.concat(chunks).toString('utf8'); // ← correct reassembly, UTF-8 safe
  context = JSON.parse(data);
} catch (err) {
  context = staticFallback; // ← any failure (timeout / 500 / parse error) takes the fallback
} finally {
  clearTimeout(timer); // ← clear the timer on the failure path too
}
// pipeline continues; output quality drops slightly, but nothing crashes

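Unstable Node Types That Should Be Isolated

  • Live data APIs: dynamic material, external recommendations, live search; a bonus, not a precondition, so always provide a fallback
  • Third-party notification services: push, SMS, email delivery; failure should not fail the main path
  • Outbound webhook delivery: timeout + retry; the main path is fire-and-forget and does not wait for a response
  • Cache writes: Redis / CDN cache refresh; a cache failure must not block the main flow
  • Search index updates: Elasticsearch / Algolia indexing; asynchronous, the main operation completes first
  • Analytics events: GA / Mixpanel track; fire-and-forget, errors silently discarded
  • Async queue enqueue: background task scheduling; retry on delivery failure, the main path continues
  • Event log writes: audit logs, operation records; a write failure does not roll back the main operation

Decision rule: if this node fails, does the whole pipeline go down? If the answer is “yes”, it needs fallback isolation.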

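Smart and Stable Should Be Designed Separately

The core of this pattern is designing “smart” and “stable” separately. Dynamic data makes the output livelier, but it must not become the system’s new cause of death.

Verification is not complicated: deliberately turn the external service off, or make it return errors, and see whether the pipeline quietly takes the fallback and the output is still reasonable. If the pipeline crashes or hangs in that situation, the design is not finished.

One thing to keep for the next similar situation: before adding an external service, ask “what happens to this pipeline when this service goes down?” If the answer is uncertain, design the fallback path first, then connect the service.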

— 邱柏宇



The API That Made the Pipeline Smarter Also Made It Fragile

One Node Changed, Everything Became Fragile

The original pipeline used static prompts to control output scenarios. Results were consistent, predictable, eventually boring. Adding a live API call mid-pipeline to pull in local cultural events as source material made the output feel alive. It also introduced a node that can freeze the entire pipeline on any given run.

The change itself isn’t the mistake. The mistake is adding “smart” without designing what happens when smart isn’t available.

Technical Environment

Node.js pipeline with an external cultural-events HTTP API call inserted mid-path to supply dynamic context to the downstream LLM node. The API returns a streamed response, requiring Buffer.concat to reassemble chunks before JSON parsing. The original design had no timeout and no fallback path — a hung call stalls everything downstream; a malformed reassembly sends corrupt input to the LLM. The pattern is framework-agnostic: any synchronous external call on the main path without isolation will reproduce the same failure.

External Services Don’t Belong to You

An external service call in the middle of a pipeline means your system’s stability is partially delegated to something outside your control. Slow network, service downtime, a quietly changed response format — none of these are preventable on your end.

Without an explicit timeout, a slow response hangs the pipeline at that node, and everything downstream waits. Streamed data that is decoded chunk by chunk instead of being reassembled with Buffer.concat can have a multi-byte UTF-8 character split across two chunks and corrupted; the result may still parse, so there is no loud error, just subtly broken output. The worst case: the pipeline appears to be running while it is actually stuck waiting on that API call.
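
The chunk-boundary corruption can be reproduced without any network. The following is a minimal sketch, assuming the chunks arrive as Node.js Buffers (as they do from the core http module); the payload and the split offset are made up for illustration. With fetch, chunks are plain Uint8Arrays, whose toString() yields comma-separated byte values, so join('') breaks in a different and louder way there.

```javascript
// Simulate a UTF-8 payload arriving in two chunks, with the split landing
// in the middle of a multi-byte character (payload and offset are illustrative).
const payload = Buffer.from('{"title":"早餐店"}', 'utf8');
const splitAt = payload.indexOf(Buffer.from('早', 'utf8')) + 1; // one byte into "早"
const chunks = [payload.subarray(0, splitAt), payload.subarray(splitAt)];

// Wrong: Array.prototype.join calls toString() on each Buffer, so every chunk
// is decoded on its own and the split character turns into U+FFFD.
const joined = chunks.join('');

// Right: concatenate the raw bytes first, then decode once.
const concatenated = Buffer.concat(chunks).toString('utf8');

console.log(joined === payload.toString('utf8'));        // false
console.log(concatenated === payload.toString('utf8'));  // true
console.log(JSON.parse(concatenated).title);             // 早餐店

// The corrupted string is still valid JSON, which is why nothing throws:
console.log(JSON.parse(joined).title.includes('\uFFFD')); // true
```

The last line is the uncomfortable part: the per-chunk version parses without error, so only the content reveals that something went wrong.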

Error Propagation Sequence

— Scenario A: No timeout, service slow or unresponsive —

Pipeline Runner         External Event API         LLM Node
      |                        |                      |
      |── GET /cultural-events >|                      |
      |                        |                      |
      |  (no timeout set)      | <─ no response       |
      |  ... hanging ...       |                      |
      |  all downstream nodes  |                      |
      |  blocked               |                      |
Pipeline state: appears running ✓ / actually: stuck at external call ✗

— Scenario B: Streamed chunks + bad Buffer reassembly —

Pipeline Runner         External Event API         LLM Node
      |                        |                      |
      |── GET /cultural-events >|                      |
      |<── chunk 1 ────────────|                      |
      |<── chunk 2 (conn drop) |                      |
      |  chunks.join('') truncates UTF-8               |
      |  corrupt JSON passes through                   |
      |────────────────────────────── corrupt input ──>|
      |<─────────────────────────── garbled output ── |
Pipeline state: completed ✓ / output: corrupted ✗

Both failure modes are silent — no exception raised, no obvious error, until performance degrades or output looks wrong.

The common misdiagnosis here is looking at the LLM side first — checking whether the prompt changed, whether the model returned something odd. The actual fault is earlier, at that external call sitting quietly with no timeout, and nobody watching it.

Graceful Failure Is the Design

Wrapping the call in a try/catch with an explicit timeout and a status check means every failure mode (network timeout, 500 error, parse failure) routes the pipeline quietly to fallback. The status check matters because fetch resolves normally on a 500 and only rejects on network-level failures. Fallback is the static material that was there before. The output is less fresh, but the pipeline completes.

The key judgment is this: dynamic data is a quality enhancement, not a system precondition. Once you treat it as a precondition, every hiccup in the external service becomes your incident.
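
A changed response format can still be valid JSON, in which case JSON.parse succeeds and a try/catch never fires. One way to keep that case on the fallback path is to check the parsed shape before handing it on. This is a sketch only: the field names `events` and `title` are hypothetical, since the post does not describe the API's actual schema, and `isEventContext` and `contextOrFallback` are illustrative helper names.

```javascript
// Hypothetical shape: { events: [{ title: string }, ...] }.
// The real API's schema is not given in this post; adjust the checks to match.
function isEventContext(value) {
  return (
    value !== null &&
    typeof value === 'object' &&
    Array.isArray(value.events) &&
    value.events.every(
      (e) => e !== null && typeof e === 'object' && typeof e.title === 'string'
    )
  );
}

// Returns the parsed context if it has the expected shape, otherwise the fallback.
function contextOrFallback(parsed, fallback) {
  return isEventContext(parsed) ? parsed : fallback;
}

const staticFallback = { events: [{ title: 'static scene' }] };

const live = contextOrFallback({ events: [{ title: 'Lantern Festival' }] }, staticFallback);
console.log(live.events[0].title); // Lantern Festival

// A renamed top-level field is valid JSON but the wrong shape:
console.log(contextOrFallback({ items: [] }, staticFallback) === staticFallback); // true
```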

Correct stream reassembly with Buffer.concat serves the same purpose. When data is properly joined, truncation produces a catchable error: JSON.parse throws a SyntaxError on the incomplete payload, the try/catch handles it, and fallback runs. When joining is wrong, corrupted data reaches the next node and the failure surfaces somewhere unpredictable downstream.
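
To make the "truncation becomes a catchable error" point concrete, here is a small sketch. The helper name `parseChunksOrFallback` and the sample payload are illustrative, not taken from the original pipeline.

```javascript
// Reassemble raw byte chunks, decode once, and parse. A payload cut off
// mid-stream is not valid JSON, so JSON.parse throws a SyntaxError, which
// the catch block converts into the fallback value.
function parseChunksOrFallback(chunks, fallback) {
  try {
    return JSON.parse(Buffer.concat(chunks).toString('utf8'));
  } catch (err) {
    return fallback;
  }
}

const full = Buffer.from('{"events":["night market"]}', 'utf8');
const truncated = full.subarray(0, full.length - 5); // simulate a dropped connection
const fallback = { events: [] };

console.log(parseChunksOrFallback([full], fallback).events[0]);         // night market
console.log(parseChunksOrFallback([truncated], fallback) === fallback); // true
```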

Code Diff: Before and After

Before: no timeout, bad stream reassembly, corrupt input passed directly to LLM

// External event API call (before)
const response = await fetch(externalApiUrl); // ← no timeout
const chunks = [];
for await (const chunk of response.body) {
  chunks.push(chunk);
}
const data = chunks.join('');          // ← string join, not Buffer.concat; each chunk is stringified separately, so the bytes are never reassembled
const context = JSON.parse(data);      // ← no try/catch, no validation; whatever this yields goes straight downstream

After: explicit timeout + Buffer.concat + try/catch fallback

// External event API call (after)
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 5000); // ← explicit timeout (covers headers and body)
let context;
try {
  const response = await fetch(externalApiUrl, { signal: controller.signal });
  if (!response.ok) throw new Error(`HTTP ${response.status}`); // ← fetch does not reject on a 500
  const chunks = [];
  for await (const chunk of response.body) {
    chunks.push(chunk);
  }
  const data = Buffer.concat(chunks).toString('utf8'); // ← correct reassembly, UTF-8 safe
  context = JSON.parse(data);
} catch (err) {
  context = staticFallback; // ← timeout / 500 / parse error all route to fallback
} finally {
  clearTimeout(timer); // ← clear the timer on the failure path too
}
// pipeline continues; output quality degrades slightly, doesn't crash
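
If the same guard is needed at more than one node, it may be worth packaging as a function. The sketch below is one possible shape, not the author's implementation: `loadContext`, the `fetchImpl` option, and the `source` / `reason` fields in the return value are all assumptions added here, so that the failure paths can be exercised with a stub and so that a fallback run leaves a trace for logging.

```javascript
// Sketch of a reusable wrapper. `fetchImpl` is injectable so failure paths
// can be tested without a network; it defaults to the global fetch
// (available in Node.js 18+).
async function loadContext(url, fallback, { timeoutMs = 5000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { signal: controller.signal });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const chunks = [];
    for await (const chunk of response.body) chunks.push(chunk);
    const context = JSON.parse(Buffer.concat(chunks).toString('utf8'));
    return { context, source: 'live' };
  } catch (err) {
    // Every failure lands here; the reason is returned so the caller can log it.
    return { context: fallback, source: 'fallback', reason: String(err && err.message) };
  } finally {
    clearTimeout(timer);
  }
}

// Usage with a stubbed fetch that answers HTTP 500 (no network involved):
const failing = async () => ({ ok: false, status: 500, body: [] });
loadContext('https://example.invalid/cultural-events', { events: [] }, { fetchImpl: failing })
  .then((result) => console.log(result.source, result.reason)); // fallback HTTP 500
```

Returning the reason alongside the fallback keeps the pipeline quiet without making the failure invisible: the run completes, and whoever reads the logs can still see how often the live source was skipped.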

Side Effects That Should Be Isolated

  • Live data APIs: dynamic context, real-time recommendations — quality enhancement, not precondition; always needs fallback
  • Third-party notification services: push, SMS, email — failure must not propagate to the main path
  • Outbound webhooks: timeout + retry; main path fires and forgets, does not await delivery
  • Cache writes: Redis / CDN invalidation — cache failure must not block the primary flow
  • Search index updates: Elasticsearch / Algolia indexing — async; main operation completes first
  • Analytics events: GA / Mixpanel track calls — fire-and-forget; errors silently discarded
  • Async queue enqueue: background task scheduling — enqueue failure retries; main path continues
  • Event log writes: audit logs, operation records — write failure does not roll back the main operation

The decision rule: if this node fails and the entire pipeline hangs or crashes, it needs isolation and a fallback path.
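
For the fire-and-forget items in the list above, the isolation can be as small as one helper. This is a sketch under assumptions: `fireAndForget`, `handleOrder`, and the `onError` hook are hypothetical names, and the "main operation" is a stand-in object rather than real persistence.

```javascript
// Runs a side-effect task without letting its failure reach the caller.
// `task` is a function that returns a promise (or throws synchronously);
// `onError` is a hypothetical logging hook that defaults to console.warn.
function fireAndForget(
  task,
  onError = (err) => console.warn('side effect failed:', err.message)
) {
  // Promise.resolve().then(task) also captures synchronous throws from task.
  Promise.resolve().then(task).catch(onError);
}

// Main path: the primary operation completes regardless of the side effect.
function handleOrder(order, sendAnalytics) {
  const result = { id: order.id, status: 'saved' }; // stand-in for the main operation
  fireAndForget(() => sendAnalytics(result));
  return result;
}

const out = handleOrder({ id: 7 }, async () => {
  throw new Error('analytics down');
});
console.log(out.status); // saved
```

The main path returns before the side effect settles, and a rejection ends in `onError` instead of an unhandled rejection. Whether to log or drop the error entirely is a per-node choice.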

Smart and Stable Are Separate Design Problems

“Smart” and “stable” need to be designed independently. The external API makes output richer. It should not be allowed to become the new single point of failure.

The verification is straightforward: disable the external service intentionally, or make it return an error. Watch whether the pipeline falls back quietly and produces reasonable output. If it hangs or crashes instead, the design isn’t finished yet.
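
The check described above can also run as code, by swapping the external service for stubs that fail in specific ways. What follows is a sketch, not a test framework: the `faults` table, `checkFallback`, and both sample loaders are illustrative, and the stubs only imitate the parts of the fetch contract the loaders use.

```javascript
// Fault stubs standing in for the external service.
const faults = {
  hang: (url, opts = {}) =>
    new Promise((_, reject) => {
      opts.signal?.addEventListener('abort', () => reject(new Error('aborted')));
    }),
  http500: async () => ({ ok: false, status: 500, body: [] }),
  truncatedJson: async () => ({ ok: true, status: 200, body: [Buffer.from('{"events":[')] }),
  networkError: async () => { throw new Error('ECONNREFUSED'); },
};

// Runs loader(fetchImpl) against every fault and records the outcome:
// 'fallback' is the only passing result.
async function checkFallback(loader, fallback, deadlineMs = 1000) {
  const report = {};
  for (const [name, fetchImpl] of Object.entries(faults)) {
    let timer;
    const deadline = new Promise((resolve) => {
      timer = setTimeout(() => resolve('HUNG'), deadlineMs);
    });
    report[name] = await Promise.race([
      loader(fetchImpl).then((ctx) => (ctx === fallback ? 'fallback' : 'UNEXPECTED'), () => 'THREW'),
      deadline,
    ]);
    clearTimeout(timer);
  }
  return report;
}

const staticFallback = { events: [] };

// Loader with timeout + fallback (the same idea as the "after" code in this post).
const guarded = async (fetchImpl) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 50);
  try {
    const res = await fetchImpl('http://events.invalid/cultural-events', { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const chunks = [];
    for await (const c of res.body) chunks.push(c);
    return JSON.parse(Buffer.concat(chunks).toString('utf8'));
  } catch {
    return staticFallback;
  } finally {
    clearTimeout(timer);
  }
};

// Loader with neither, for contrast.
const naive = async (fetchImpl) => {
  const res = await fetchImpl('http://events.invalid/cultural-events');
  const chunks = [];
  for await (const c of res.body) chunks.push(c);
  return JSON.parse(Buffer.concat(chunks).toString('utf8'));
};

checkFallback(guarded, staticFallback).then((report) => console.log(report));
```

Running `checkFallback(naive, staticFallback, 100)` instead should report the hang as 'HUNG' and the other faults as 'THREW', which is the "design isn't finished yet" signal in machine-readable form.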

One thing worth carrying forward: before adding any external service call to a pipeline, ask what happens to the pipeline when that service goes down. If the answer is uncertain, design the fallback path first, then connect the live data source.

— 邱柏宇
