
Without an explicit instruction like "preserve the original language VERBATIM" in the prompt, large language models will automatically translate non-English dialogue into English.
It's like asking a friend to relay a conversation, and they take it upon themselves to "summarize" it for you. You wanted the exact words; they gave you their interpretation.
LLMs Are Trained to Be Helpful
The root cause is that large language models are trained from the outset to be "helpful" assistants. When a model encounters Japanese or French dialogue, its default reasoning is "the user probably can't read this, so translating to English would be more helpful." This logic works fine in most scenarios, but becomes a disaster when you need exact output.
This problem becomes especially apparent when processing multilingual video subtitles. A film might contain English, Japanese, and French dialogue. If the LLM arbitrarily unifies everything into English, you'll never know which language the characters originally spoke. Any subsequent language tagging, multilingual search, or even just restoring the original dialogue becomes impossible.
The fix is adding explicit instructions in your prompt: “Preserve all original languages, do not translate, do not convert, output verbatim.” You need to be this explicit because the model’s default mode is to “optimize” your input. You have to explicitly tell it: don’t help.
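A minimal sketch of this kind of prompt in Python. The exact wording and the `build_prompt` helper are illustrative assumptions, not the official prompt format of any particular model:

```python
# Sketch: a transcription prompt with an explicit verbatim clause.
# The rule wording below is an illustrative assumption, not a quote
# from any model vendor's documentation.
VERBATIM_RULES = (
    "Preserve ALL original languages exactly as spoken.\n"
    "Do NOT translate, paraphrase, or normalize any dialogue.\n"
    "Output every line verbatim in its source language."
)

def build_prompt(task: str) -> str:
    """Append the non-negotiable verbatim rules to any task description."""
    return f"{task}\n\nRules:\n{VERBATIM_RULES}"

prompt = build_prompt("Transcribe the subtitles from this clip.")
```

Keeping the rules in a single shared constant means every pipeline stage that talks to the model gets the same "don't help" instruction, instead of each prompt restating it slightly differently.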
Another Trap: Filename Extensions vs. Actual Format
Similar trust issues appear in image processing. Some files labeled `.jpg` turn out to be SVG vectors or tiny PNG icons when opened. Feed these to AI vision APIs and you get immediate “400 invalid format” errors.
These format lies stem from careless file naming. Maybe a batch conversion used the wrong parameters, maybe a CMS auto-generated thumbnails with arbitrary extensions, or maybe someone manually renamed files hoping they would slip through. The result: you can't trust file extensions.
The practical approach is adding try/catch in your processing pipeline. When format errors are caught, automatically filter out the image and switch to text-only mode. This isn’t just error prevention—it’s acknowledging reality: digital content ecosystems are inherently messy. Rather than expecting upstream to clean their data, build multiple safeguards in your own pipeline.
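The filter-and-degrade step can be sketched like this; `vision_call`, `text_call`, and `is_valid` are hypothetical stand-ins for your actual API client and validator:

```python
def call_with_fallback(images, text, vision_call, text_call, is_valid):
    """Drop images that fail validation; if the vision call still
    rejects the batch (or nothing survives), fall back to text-only.

    vision_call, text_call, and is_valid are hypothetical stand-ins,
    not real library functions.
    """
    usable = [img for img in images if is_valid(img)]
    if not usable:
        return text_call(text)
    try:
        return vision_call(usable, text)
    except ValueError:  # stand-in for the API's "400 invalid format"
        return text_call(text)
```

The point of the design is that the pipeline never aborts on a bad image: it either proceeds with the images that validated, or degrades to text-only and keeps going.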
Automation Doesn’t Mean Maintenance-Free
These two cases point to the same thing: delegating content production to automated systems doesn't mean you can let go completely. LLMs will auto-translate according to their training logic, and files with incorrect format labels will flow into your pipeline. These aren't bugs; they're default behaviors.
Quality control isn’t adding manual review at the end. It’s embedding verification mechanisms at every stage. Explicitly tell models “don’t help”, verify file formats programmatically instead of trusting extensions, automatically downgrade processing when errors are caught. Part of the time saved by automation needs to go into designing these mechanisms.
— 邱柏宇