週報第一行的亂碼,是 TCP 還沒說完話

夜市廣播喊單喊到一半斷訊,你只收到「炒蚵仔」,以為客人點錯,其實「麵線」還在路上。這個場景跟自動化週報第一行偶爾出現的替代符號(U+FFFD,顯示為 �),是同一件事。

現象

每週定時執行的報告腳本,大多數週輸出正常,偶爾——不是每次——第一行中文變成亂碼。不是固定哪個字,不是固定哪週,沒有可預測的觸發條件。錯誤不拋、日誌不報,輸出就是那樣靜靜地壞在那裡。

腳本的寫法很直覺:監聽 HTTP 回應的 data 事件,每來一個 chunk 就立刻用 += 串接成字串。看起來沒問題,畢竟資料到了就處理,邏輯上沒有漏接。

分界點

常用中文字的 UTF-8 編碼,每個字佔三個 byte(少數罕用字佔四個)。問題發生的條件只有一個:某個字的 byte 序列剛好在傳輸途中被切在兩個 chunk 的邊界上。

第一個 chunk 先到,帶著那個字不完整的前半截 byte。腳本立刻把它轉成字串,Node.js 看到不完整的 byte 序列,填入替代符號 U+FFFD。第二個 chunk 才到,帶著剩下的 byte,但已經太晚:字串已經定案了。剩下的接續 byte 單獨看也不是任何字的合法開頭,同樣被換成 U+FFFD。UTF-8 會自我同步,下一個完整的字開始之後解碼就恢復正常,所以壞掉的只有被切開的那個字,但那個字已經救不回來。

問題偶發的原因也在這裡:chunk 的大小取決於資料在傳輸途中怎麼分段、到達時怎麼被緩衝,不固定,也不會配合字元邊界。大部分時候,chunk 邊界恰好落在字與字之間,腳本運作正常。偶爾某個 chunk 在一個字的中間截斷,亂碼就出現了。兩種情況在程式碼層面看起來一模一樣,這就是為什麼問題那麼難追。

容易誤判的地方

第一時間很容易懷疑 API 回傳了錯誤的 Content-Type,或者伺服器端編碼設定不對。這條路查下去什麼都查不到——伺服器端完全正常,UTF-8 宣告也在,問題根本不在那裡。

另一個誤判方向是「偶發就代表是外部問題」。網路抖動、CDN 快取、第三方服務的偶爾異常,這些方向聽起來合理,但都繞開了真因:是本地腳本的收集策略本身不對,跟外部完全無關。

台灣注音輸入法不允許聲母韻母聲調打到一半就確認——輸到中間強制送出,組出來的不是那個字。HTTP 串流的 chunk 處理邏輯本質上一樣:中文字不允許被截一半再解碼。

確認方式

最直接的確認:在 data 回調裡把每個 chunk 的原始 byte 長度印出來,並檢查 chunk 尾端是不是停在一個不完整的多 byte 序列上。先用 Buffer.isBuffer(chunk) 確認收到的是 Buffer 物件。chunk.length % 3 是否不為零只能當粗略線索:回應裡只要混有 ASCII(標點、數字、換行),長度就不會是 3 的倍數,所以真正要看的是最後幾個 byte 的位元樣式。
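
下面是檢查 chunk 尾端的一個最小示意,假設 chunk 是 Buffer、內容是 UTF-8。incompleteTailBytes 是為了說明而取的假設性函式名稱,不是 Node.js 內建 API;它只判斷尾端最後一個字有沒有收齊,不驗證整段內容是否合法。

```javascript
// 假設性的輔助函式:回傳 chunk 尾端屬於「還沒收齊的字」的 byte 數,0 表示尾端完整。
function incompleteTailBytes(chunk) {
  // 從尾端往回最多看 4 個 byte,找出最後一個序列的起始 byte。
  for (let back = 1; back <= Math.min(4, chunk.length); back++) {
    const byte = chunk[chunk.length - back];
    if ((byte & 0xc0) === 0x80) continue;      // 10xxxxxx:接續 byte,繼續往前找
    let need = 1;                              // 0xxxxxxx:ASCII,單一 byte
    if ((byte & 0xe0) === 0xc0) need = 2;      // 110xxxxx:兩 byte 序列的開頭
    else if ((byte & 0xf0) === 0xe0) need = 3; // 1110xxxx:三 byte 序列的開頭
    else if ((byte & 0xf8) === 0xf0) need = 4; // 11110xxx:四 byte 序列的開頭
    return need > back ? back : 0;             // 需要的 byte 比手上的多,就是被截斷
  }
  return 0;
}

// 用「炒蚵仔麵線」模擬不同的截斷位置(每個字三個 byte)。
const sample = Buffer.from('炒蚵仔麵線', 'utf8');
console.log(incompleteTailBytes(sample.subarray(0, 9)));  // 0:剛好切在「仔」之後
console.log(incompleteTailBytes(sample.subarray(0, 10))); // 1:「麵」只到了第一個 byte
console.log(incompleteTailBytes(sample.subarray(0, 11))); // 2:「麵」到了兩個 byte
```

在 data 回調裡對每個 chunk 呼叫一次,回傳值不為零的那個 chunk 就是切到字的地方。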

修法的方向很單純:不在 data 事件裡對單一 chunk 做字串轉換,改成把每個 chunk 推進陣列,等 end 事件觸發之後,用 Buffer.concat 一次把所有 chunk 合併,再統一做 .toString('utf8')。這樣不管傳輸途中怎麼切,解碼時每個字的 byte 都已經在同一塊完整的 Buffer 裡了。

留給下次的一件事

串流處理裡,「資料到了就處理」這個直覺在大多數情境下成立,但在邊界條件上會失效——特別是當資料單位(中文字的三個 byte)跟傳輸單位(TCP chunk)之間沒有對齊保證的時候。下次碰到偶發性的編碼問題,先問:這個解碼動作是在「片段」上做的,還是在「完整的資料集合」上做的?如果是前者,偶發亂碼幾乎是必然的,只是還沒碰到那個邊界而已。

— 邱柏宇


The Garbled Line Was TCP Mid-Sentence

A night market broadcaster cuts out mid-announcement — you catch “fried oyster” but “noodles” is still in transit. That’s the same thing happening when an automated weekly report silently produces a garbled first line.

The Phenomenon

A scheduled report script runs without errors. Most weeks, output is clean. Occasionally, with no predictable trigger, part of the first line of Chinese text turns into U+FFFD replacement characters (�). No exceptions thrown. No log entries. The output is just quietly wrong.

The script was written in the obvious way: listen to the HTTP response’s data event and append each chunk to a string with += as it arrives. Intuitive. Data comes in and gets processed immediately. Nothing looks broken.

The Boundary

Most Chinese characters in common use take three bytes in UTF-8 (rarer ones take four). The problem has one condition: a character’s bytes land on opposite sides of a chunk boundary.

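The first chunk arrives carrying an incomplete byte sequence: the front half of a character. The script immediately converts it to a string. Node.js encounters the truncated sequence and substitutes U+FFFD. The second chunk arrives with the remaining bytes, but the string is already written. On their own, those leftover continuation bytes are not a valid start of any character, so they are replaced with U+FFFD as well. UTF-8 is self-synchronizing, so decoding recovers at the next complete character. The damage is limited to the character that was split, but that character is gone for good.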
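
The split can be reproduced without a network. The sketch below is an illustration, not the original report script: the sample text and the split offset are assumptions chosen so that the boundary falls one byte into “麵”.

```javascript
// Simulate a response body that arrives as two chunks.
const body = Buffer.from('炒蚵仔麵線', 'utf8'); // 5 characters, 3 bytes each
const splitAt = 10; // assumed offset: 9 bytes for 炒蚵仔, plus 1 of the 3 bytes of 麵
const chunks = [body.subarray(0, splitAt), body.subarray(splitAt)];

// Naive approach: decode each chunk as it arrives and concatenate the strings.
let naive = '';
for (const chunk of chunks) {
  naive += chunk; // implicit chunk.toString('utf8') on a partial sequence
}

// Deferred approach: join the raw bytes first, then decode once.
const joined = Buffer.concat(chunks).toString('utf8');

console.log(naive);  // U+FFFD appears where 麵 should be; 炒蚵仔 and 線 are intact
console.log(joined); // 炒蚵仔麵線
```

Changing splitAt to 9 or 12 puts the boundary between characters, and both approaches then produce the same string, which is why the bug only shows up some weeks.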

The intermittent nature comes from this: chunk size depends on how the bytes are segmented in transit and buffered on arrival, and it is not fixed. Most of the time, chunk boundaries happen to fall between characters, and the script runs fine. Occasionally, a chunk cuts through the middle of a character, and the corruption appears. Both cases look identical in the code.

Why It’s Easy to Misdiagnose

The first instinct is to check the server’s Content-Type header or the API’s encoding declaration. That investigation goes nowhere — the server is correct, UTF-8 is properly declared, the problem isn’t there.

Another wrong turn: “intermittent means external.” Network jitter, CDN caching, upstream service instability — all plausible directions, all wrong. The issue is entirely in the local script’s collection strategy.

Taiwan’s zhuyin input method won’t let you confirm a character mid-composition — force-confirm halfway through and you get garbage, not the intended character. Streaming chunk handling is the same constraint: a Chinese character cannot be decoded in pieces.

The Fix

Don’t convert to string inside the data callback. Instead, push each chunk into an array. When the end event fires, use Buffer.concat to merge all chunks into a single Buffer, then call .toString('utf8') once. By that point, all the bytes of every character are in the same contiguous block. TCP can cut wherever it wants, because decoding happens only after the complete message has arrived.
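
A minimal sketch of that collection strategy follows. collectText is a hypothetical helper name, and the in-memory stream is a stand-in for a real HTTP response so the example runs without a network.

```javascript
const { Readable } = require('stream');

// collectText is a hypothetical helper. It accepts any Readable that emits
// Buffer chunks, such as the response object passed to an http.get() callback.
function collectText(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk)); // keep raw bytes, no decoding yet
    stream.on('error', reject);
    stream.on('end', () => {
      // Every byte has arrived, so no character can still be split here.
      resolve(Buffer.concat(chunks).toString('utf8'));
    });
  });
}

// Stand-in response: two chunks that split "麵" after its first byte.
const payload = Buffer.from('炒蚵仔麵線', 'utf8');
const fakeResponse = Readable.from([payload.subarray(0, 10), payload.subarray(10)]);

collectText(fakeResponse).then((text) => {
  console.log(text === '炒蚵仔麵線'); // true
});
```

With a real request, the same helper would be called on the response object. If the body is too large to hold in memory, calling setEncoding('utf8') on the response, or decoding through StringDecoder from the string_decoder module, carries incomplete sequences over to the next chunk and avoids the same split.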

One Thing Worth Remembering

The instinct to “process data as it arrives” is correct most of the time. It breaks when the unit of meaning in the data (three bytes per Chinese character) has no alignment guarantee with the unit of transmission (the TCP chunk). If a decoding step happens on fragments rather than on a complete, assembled buffer, intermittent corruption isn’t a bug — it’s a scheduled appointment. The only question is when the boundary misalignment finally occurs.

— 邱柏宇
