
The Garbled First Line: A UTF-8 Character Split Across TCP Chunks
A weekly scheduled report script runs fine most weeks. Occasionally (not on every run) the first line of Chinese output contains �. No exception, no error log; only the output itself is wrong.
The Phenomenon
Output is normal most weeks. The corruption is intermittent: one Chinese character in the first line becomes the replacement character �. No fixed character, no fixed week, no predictable trigger. Re-running, changing the run time, or changing the data size does not reproduce it reliably.
The script was written the obvious way: listen to the HTTP response’s data event, concatenate each chunk with += as it arrives. Data is processed on arrival; logically nothing is dropped; nothing looks wrong at the time.
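The pattern described above can be sketched roughly as follows. This is a reconstruction, not the original script: the helper name `collectNaively` is hypothetical, and a `Readable` built from in-memory Buffers stands in for the HTTP response so the sketch runs without a network.

```javascript
const { Readable } = require('stream');

// Hypothetical helper mirroring the "+= on every data event" pattern.
function collectNaively(stream) {
  return new Promise((resolve, reject) => {
    let body = '';
    // `body += chunk` implicitly calls chunk.toString(), i.e. each
    // fragment is decoded as UTF-8 the moment it arrives.
    stream.on('data', (chunk) => { body += chunk; });
    stream.on('end', () => resolve(body));
    stream.on('error', reject);
  });
}

// With ASCII-only chunks every byte is a whole character,
// so the pattern looks perfectly healthy.
const asciiOnly = Readable.from([Buffer.from('weekly '), Buffer.from('report')]);
collectNaively(asciiOnly).then((text) => console.log(text)); // weekly report
```

Nothing in this code is wrong for single-byte text, which is why it raises no suspicion on review.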
Wrong Directions First
First suspect: Content-Type or server-side encoding. Checked and ruled out. The server response is normal, UTF-8 is declared; the problem is not in the response header.
Second direction: “intermittent means external”, meaning network jitter, CDN caching, or third-party service instability. None of these panned out. The actual problem is in the local script’s response-collection strategy and has nothing to do with anything external.
The Actual Cause
Most common Chinese characters are three bytes in UTF-8, but UTF-8 is fundamentally a variable-length encoding. The real trigger condition: a multi-byte character happens to be split across the boundary between two chunks.
The earlier chunk arrives with an incomplete byte sequence at its tail. The script converts it to a string immediately; because those trailing bytes do not form a valid character, Node.js substitutes U+FFFD. When the next chunk arrives, its leading continuation bytes are decoded on their own and are equally invalid, so the character can no longer be reassembled and the output shows � or partial corruption.
Chunk size is decided by the network layer and is not fixed; that is the source of the intermittence. Most of the time characters land on complete boundaries and output is normal; occasionally one is cut mid-character, and corruption appears. Both cases are identical at the code level, which makes the bug hard to reproduce and locate. In short: decoding starts before the character is fully received.
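A minimal sketch of the mechanism, with the split forced by hand instead of waiting for the network to produce it. The sample text and the split position are assumptions chosen for illustration; "中" encodes to the three bytes E4 B8 AD.

```javascript
// Simulate two network chunks that cut the 3-byte character "中" (E4 B8 AD).
const full = Buffer.from('中文報告', 'utf8');
const chunkA = full.subarray(0, 2); // E4 B8: incomplete sequence at the tail
const chunkB = full.subarray(2);    // AD, followed by the remaining characters

// Fragment-stage decoding: each chunk is turned into a string on arrival.
let naive = '';
for (const chunk of [chunkA, chunkB]) {
  naive += chunk; // implicit chunk.toString(), UTF-8 decode of a fragment
}

// Complete-data decoding: join the bytes first, decode once.
const correct = Buffer.concat([chunkA, chunkB]).toString('utf8');

console.log(naive.includes('\uFFFD')); // true
console.log(correct);                  // 中文報告
```

Moving the split to a character boundary (for example `subarray(0, 3)`) makes both results identical, which matches the "normal most weeks" behaviour.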
The Fix
Confirmation: in the data callback, log each chunk's byte length and tail bytes, and check whether the chunk ends in an incomplete UTF-8 byte sequence.
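One way to do that check, as a rough diagnostic sketch. Both function names are hypothetical, and the tail check assumes the data is otherwise valid UTF-8; it only looks at the last few bytes and does not validate the whole chunk.

```javascript
// Returns how many bytes at the end of `buf` belong to an unfinished
// UTF-8 sequence, or 0 if the tail ends on a character boundary.
function incompleteTailLength(buf) {
  // A sequence is at most 4 bytes, so an unfinished one is at most 3.
  for (let back = 1; back <= Math.min(3, buf.length); back++) {
    const byte = buf[buf.length - back];
    if ((byte & 0xc0) === 0x80) continue; // continuation byte 10xxxxxx, keep walking back
    // Lead byte (or ASCII): work out how long its sequence should be.
    let expected = 1;
    if ((byte & 0xe0) === 0xc0) expected = 2;
    else if ((byte & 0xf0) === 0xe0) expected = 3;
    else if ((byte & 0xf8) === 0xf0) expected = 4;
    return expected > back ? back : 0;
  }
  return 0;
}

// One log line per chunk: byte length, last bytes in hex, unfinished tail size.
function describeChunk(chunk) {
  const tail = chunk.subarray(Math.max(0, chunk.length - 4));
  return `len=${chunk.length} tail=${tail.toString('hex')} incomplete=${incompleteTailLength(chunk)}`;
}

console.log(describeChunk(Buffer.from([0xe4, 0xb8]))); // len=2 tail=e4b8 incomplete=2
```

Calling `console.log(describeChunk(chunk))` inside the data callback would flag the runs where a chunk ends mid-character with a non-zero `incomplete` value.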
One direction for the fix: do not convert to string inside the data event. Handling:
- data stage: collect Buffers only
- end stage: Buffer.concat(chunks)
- finally: .toString('utf8')
This way, no matter how TCP splits the stream, decoding always sees a complete byte sequence. After switching to this approach, the corruption did not recur.
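The three steps can be sketched like this. The helper name `collectBody` is hypothetical, and a `Readable` built from in-memory Buffers stands in for the HTTP response. The sketch also assumes `setEncoding` has not been called on the stream, so every chunk arrives as a Buffer.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: gather raw bytes, decode once at the end.
function collectBody(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    // data stage: keep the Buffers as they are, no string conversion
    stream.on('data', (chunk) => chunks.push(chunk));
    // end stage: join all bytes, then decode the complete payload once
    stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    stream.on('error', reject);
  });
}

// Simulated response whose first chunk ends inside the character "中".
const payload = Buffer.from('中文報告', 'utf8');
const fakeResponse = Readable.from([payload.subarray(0, 1), payload.subarray(1)]);
collectBody(fakeResponse).then((text) => console.log(text)); // 中文報告
```

The split position no longer matters, because no decoding happens until every byte has arrived.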
One Thing Worth Remembering
The instinct to “process data on arrival” holds in most situations but is unreliable at the boundary — when the unit of meaning (a multi-byte character) has no alignment guarantee with the unit of transmission (the TCP chunk). Next time an intermittent encoding bug appears, confirm one thing first: does decoding happen at the fragment stage or the complete-dataset stage? If decoding happens at the fragment stage, intermittent corruption is an expected result, not a random anomaly.
— 邱柏宇
Related Posts
https://justfly.idv.tw/s/5pe7Lc1