robots.txt Was Always There, Just Full of HTML

You scan the QR code posted at a night market stall, and up comes the enrollment page for the cram school next door. You wanted the menu; you got an ad. HTTP 200 the whole time, no error, nothing visibly wrong. This incident had the same structure.

Five Days, Zero Posts

Starting April 24, IG publishing hit zero successful posts for five consecutive days. Jobs stacked up in QUEUE. By April 28, the error coming back was subcode 2207076, with the message “robots.txt blocked”. The obvious suspects were an external platform issue, a broken integration, or an expired token.

The robots.txt path itself was never under suspicion. It always responded. HTTP 200. No alerts, no errors.

The Break Point Was in the Reverse Proxy

Caddy had no dedicated route for /robots.txt. Requests fell through to the SPA’s catch-all, which returned the same HTML document for every path: Content-Type: text/html, status 200.
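
A minimal sketch of the kind of Caddyfile that produces this failure mode (illustrative only; the domain and paths here are placeholders, not the actual production config):

    example.com {
        root * /srv/app/dist
        # SPA fallback: any path that doesn't match a file on disk
        # is rewritten to index.html instead of returning 404
        try_files {path} /index.html
        file_server
    }

With no robots.txt file in the web root and no route matched ahead of the fallback, GET /robots.txt finds nothing on disk, gets rewritten to /index.html, and goes out as status 200 with Content-Type: text/html.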

Meta’s crawler received HTML instead of plain-text crawl directives. A conservative implementation read that as “robots.txt blocked” and stopped fetching media from the entire domain.

Nothing in the system was throwing an error. A 200 response doesn’t trigger monitoring alerts. Catching this required someone to actively inspect the Content-Type, or the body, of the robots.txt response.

Why It Gets Misdiagnosed

SPA catch-all routing exists precisely to prevent 404s on unmatched paths. It does its job correctly in nearly every case. The conflict only surfaces at /robots.txt, which isn’t a page path but a protocol-level convention. External crawlers expect text/plain; the SPA delivers text/html. Both sides behave as designed; only the crawler silently draws a conclusion.

Confirming It, Fixing It

One check: curl -I the path, read the Content-Type. If it returns text/html instead of text/plain, the cause is confirmed.
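
With a placeholder domain standing in for the real one, the check and the telltale output look roughly like this:

    $ curl -sI https://example.com/robots.txt
    HTTP/2 200
    content-type: text/html; charset=utf-8

A healthy response would carry content-type: text/plain.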

The fix goes in the reverse proxy layer: a dedicated route that intercepts /robots.txt before the SPA catch-all and serves a static plain-text file, status 200, Content-Type text/plain, with explicit Allow rules for facebookexternalhit, meta-externalagent, and Instagram. Nothing else in the SPA routing needs to change.
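
A sketch of that route in Caddyfile terms, assuming the same placeholder paths as above (handle, root, and file_server are standard Caddy v2 directives; the directory layout is hypothetical):

    example.com {
        # Dedicated route: Caddy orders handle blocks by path-matcher
        # specificity, so this one wins over the catch-all below
        handle /robots.txt {
            root * /srv/static      # directory holding the real robots.txt
            file_server             # a .txt file is served as text/plain, 200
        }

        # Everything else keeps the existing SPA behavior
        handle {
            root * /srv/app/dist
            try_files {path} /index.html
            file_server
        }
    }

And the static file itself, with the Allow rules described above:

    User-agent: facebookexternalhit
    Allow: /

    User-agent: meta-externalagent
    Allow: /

    User-agent: Instagram
    Allow: /

    User-agent: *
    Allow: /

After deploying, the same curl -I check should show content-type: text/plain.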

Next time a crawler stops fetching without an obvious error: check robots.txt’s response format, not just its status code. 200 only means the server responded. It says nothing about what the client was expecting to receive.

— 邱柏宇
