指令成功了,版本沒動:symlink 才是那個真正的入口

指令成功了,版本沒動:symlink 才是那個真正的入口

十台機器,九台跑升級指令,幾秒內版本號更新,確認完收工。第十台一樣的指令、一樣的回傳訊息,進去確認,版本還是舊的。

技術環境

Node.js CLI 工具(openclaw),以 Docker 容器部署於 Linux 節點。Shell 命令解析機制:執行 openclaw 時,shell 依 $PATH 找到 /usr/local/bin/openclaw——這是一個在 Docker image build 時建立的 symlink,指向 image 內的 /app/openclaw.mjs。升級操作 npm install -g 修改的是 /usr/local/lib/node_modules/ 路徑層,與 symlink 指向的 /app/ 路徑完全平行,互不干涉。問題的本質與語言或框架無關——任何以 image 內建 symlink 作為命令解析入口的容器化部署方式,都會複現相同行為。

就像在超商的取件 App 上更新了聯絡資訊,但包裹在出廠時就貼好舊標籤了——App 改了沒用,要回到貼標籤的那一關才算數。這台機器的問題就是這個結構。

symlink 是入口,不是捷徑

進容器查,/usr/local/bin/ 底下那支執行檔不是普通的二進位。是一個 symlink,靜靜指向 /app/openclaw.mjs——一支在 Docker image build 時就打包進去的 .mjs 檔。

npm install -g 確實把新版本裝進了 /usr/local/lib/node_modules/package.json 顯示新版,指令回傳成功。但 shell 在解析指令時,先找到的是 /usr/local/bin/openclaw,那個 symlink 還指著 /app/,新裝的版本根本沒有被執行的機會。

容器從 image 啟動,in-container 裝的東西不持久。就算再跑一次 force-recreate,容器回到 image 的狀態,npm install 的結果一起消失。問題不在指令,在 image 本身。

升級路徑分叉(時序)

Engineer               Shell               npm Layer            Image Layer
   |                     |                     |                     |
   |-- npm install -g -->|                     |                     |
   |                     |-- writes node_mods ->|                    |
   |                     |<-- exit 0 ----------|  <- npm 成功 v      |
   |<-- 指令成功 ---------|                     |                     |
   |                     |                     |                     |
   |-- openclaw --version>|                     |                     |
   |                     |-- 解析 $PATH --------+--> /usr/local/bin/openclaw (symlink)
   |                     |                                           |--> /app/openclaw.mjs (image 版)
   |<-- 舊版本號 ---------|                                          |    新版本未執行 x
   |                     |                     |                     |
   |-- force-recreate -->|                     |                     |
   |                     |-- 容器回 image 狀態->-- node_modules 清除 x

npm layer: 更新成功 v  /  shell 解析入口: 仍是 image 內建版本 x

關鍵分叉點:npm install -g 的成功訊息來自 npm 層,但命令解析走的是 symlink 層,兩者在容器化環境中是互不知情的平行軌道。

容易誤判的節點

其他九台裸機的升級流程是對的——npm install -g 改的就是 shell 實際會找到的那支二進位,路徑一致,沒有 symlink 隔層。這台容器化的機器表面行為完全相同,回傳訊息一樣乾淨,所以第一時間不會懷疑路徑結構有差異。

版本驗證腳本如果查的是 package.json 而不是執行 --version,也會回傳正確版號,問題繼續被掩蓋。過去有段時間升級一直是「假成功」的狀態,npm 世界裡版本確實更新了,image 世界毫無反應,兩條線平行跑,直到某次 container recreatenpm 裝的東西洗掉,才暴露出來。

確認方式

進容器,跑 ls -la /usr/local/bin/openclaw。如果輸出是 symlink → /app/openclaw.mjs,這台機器的升級路徑就跟其他裸機不同。再跑 openclaw --versioncat /app/openclaw.mjs | head -1 比對版本號,一旦對不上,原因就確認了。

正確路徑是:checkout 目標版本 tag、重新 build Docker image、force-recreate 容器。每次程式本體有變動,都要走這條路。沒有捷徑,也不應該有。

Code 對照:修法前後

修法前:in-container 升級(假成功)

# 進容器執行——看似成功,實際指向 image 舊版
docker exec -it openclaw-node bash
npm install -g openclaw@latest   # <-- 更新 node_modules,但 symlink 不動
openclaw --version               # <-- 回傳的是 /app/openclaw.mjs 版本,非剛安裝的版本

# 驗證腳本的陷阱
cat /usr/local/lib/node_modules/openclaw/package.json | grep version
# --> "1.2.3"(新版)— 這裡是對的,但沒有被執行
ls -la /usr/local/bin/openclaw
# --> /usr/local/bin/openclaw -> /app/openclaw.mjs  — symlink 仍指向 image 版本

修法後:重建 image(正確路徑)

# 在宿主機執行,不進容器
git checkout v1.2.3                    # <-- checkout 目標版本 tag
docker build -t openclaw:1.2.3 .       # <-- 重建 image,/app/openclaw.mjs 更新
docker compose up -d --force-recreate  # <-- 容器從新 image 啟動

# 正確驗證方式(混合環境分開判斷)
if docker inspect openclaw-node > /dev/null 2>&1; then
  docker exec openclaw-node openclaw --version  # 容器:用 --version
else
  openclaw --version  # 裸機:直接執行
fi

混合部署中該被分開驗證的層次

  • 版本號回報層package.jsonversion 欄位反映 npm 安裝狀態,不反映 shell 實際執行的二進位版本——用 --version 才是正確的真相來源。
  • 升級成功通知npm install -g 的 exit 0 是 npm 層成功,不等於命令入口已更新——自動化通知觸發點應移至 --version 驗證通過後。
  • 監控面板版本顯示:如果監控系統查的是 package.json 或 npm registry,顯示的是假版本——應改為執行 openclaw --version 並解析輸出。
  • 健康檢查探針:健康端點若不含版本資訊,HEALTHCHECK 通過只代表進程存活,無法確認版本正確性——需額外加版本驗證層。
  • CI/CD pipeline 部署狀態:若 pipeline 以 docker exec npm install -g 的 exit code 判斷部署成功,此狀態標記與實際執行版本脫鉤——應加入 post-deploy verification step。
  • 異動追蹤日誌:部署日誌記錄「npm 升級成功」,但未記錄 symlink 指向層的版本——日誌的審計價值依賴正確的驗證來源。
  • 共用升級腳本的假設:同一套升級腳本在裸機和容器節點行為不同,但錯誤訊息相同——腳本的驗證邏輯必須感知部署類型,不能用同一把尺量兩種環境。
  • Webhook / 自動化觸發器:依賴版本號觸發後續流程(如通知、rollback 判斷)的 webhook,若版本來源錯誤,整條觸發鏈都會在假成功的前提下執行。

判斷標準:如果這個驗證步驟失敗不會讓部署流程停下來,它就需要被當作獨立的驗證關卡——而不是附帶的確認動作。

留給下一次的一件事

在混合環境裡——同一套工具,部分裸機、部分容器——升級腳本的驗證邏輯要分開寫。用 --version 驗證,不用 package.json。兩種部署方式的正確驗證點不在同一層,用同一把尺量,遲早量錯一次。

— 邱柏宇

延伸閱讀


The Upgrade Succeeded and Nothing Changed

Ten machines. Nine of them: run the upgrade command, version number updates, done in seconds. The tenth: same command, same success message, same clean output. Go in to verify — still the old version.

Like updating your contact info in a pickup app after the package has already shipped with the old label printed on it. The app change doesn’t matter. The label was baked in at the source.

Technical Environment

Node.js CLI tool (openclaw), deployed in Docker containers on Linux nodes. Shell command resolution: when openclaw is called, the shell walks $PATH and finds /usr/local/bin/openclaw — a symlink created at Docker image build time, pointing to /app/openclaw.mjs inside the image. The upgrade command npm install -g writes to /usr/local/lib/node_modules/, a path entirely parallel to — and invisible to — the symlink’s target at /app/. The root issue isn’t specific to Node.js: any containerized deployment that uses an image-baked symlink as the command resolution entry point will reproduce this behavior.

The symlink is the entry point

Inside the container, /usr/local/bin/openclaw isn’t a binary. It’s a symlink pointing at /app/openclaw.mjs — a file baked into the Docker image at build time.

npm install -g did install the new version into /usr/local/lib/node_modules/. The package.json showed the right version. The command returned success. But when the shell resolved the command, it hit the symlink first, and that symlink still pointed at /app/. The newly installed version never ran.

Containers start from the image. Anything installed in-container doesn’t persist. A force-recreate just resets everything back to image state — the npm install result disappears with it. The problem isn’t the command. It’s the image.

Upgrade Path Divergence (Sequence)

Engineer               Shell               npm Layer            Image Layer
   |                     |                     |                     |
   |-- npm install -g -->|                     |                     |
   |                     |-- writes node_mods ->|                    |
   |                     |<-- exit 0 ----------|  <- npm success v   |
   |<-- command success --|                     |                     |
   |                     |                     |                     |
   |-- openclaw --version>|                     |                     |
   |                     |-- resolve $PATH -----+--> /usr/local/bin/openclaw (symlink)
   |                     |                                           |--> /app/openclaw.mjs (image ver)
   |<-- old version ------|                                          |    new install never ran x
   |                     |                     |                     |
   |-- force-recreate -->|                     |                     |
   |                     |-- reset to image -->-- node_modules wiped x

npm layer: updated v  /  shell resolution entry: still image-baked version x

The critical fork: npm install -g succeeds in the npm layer. But command resolution travels through the symlink layer. In a containerized environment, these two layers have no awareness of each other.

Why it’s easy to miss

The nine bare-metal machines use the same upgrade flow, and it’s correct for them — npm install -g replaces exactly the binary the shell will find. No symlink indirection. This container looks identical from the outside: same command, same clean return code, same apparent success.

If your verification script checks package.json instead of running --version, it returns the right version number and the problem stays hidden. For a stretch, upgrades were silently succeeding in the npm layer while the image layer sat untouched — two parallel worlds, until a container recreate wiped the npm state and the discrepancy finally surfaced.

How to confirm it

Inside the container: ls -la /usr/local/bin/openclaw. If the output shows symlink → /app/openclaw.mjs, this machine has a different upgrade path from every bare-metal node. Cross-check openclaw --version against the version string in /app/openclaw.mjs. If they diverge, the diagnosis is confirmed.

The correct upgrade path: checkout the target version tag, rebuild the Docker image, force-recreate the container. Every time the program itself changes, that’s the path. No shortcut exists, and the environment doesn’t forgive assuming one does.

Code Diff: Before and After

Before: in-container upgrade (silent failure)

# Run inside the container -- looks like success, symlink unchanged
docker exec -it openclaw-node bash
npm install -g openclaw@latest   # <-- updates node_modules, symlink untouched
openclaw --version               # <-- returns /app/openclaw.mjs version, not the one just installed

# The verification trap
cat /usr/local/lib/node_modules/openclaw/package.json | grep version
# --> "1.2.3" (new) -- correct, but this file is never executed
ls -la /usr/local/bin/openclaw
# --> /usr/local/bin/openclaw -> /app/openclaw.mjs  -- symlink still points at image version

After: rebuild the image (correct path)

# Run on the host, not inside the container
git checkout v1.2.3                    # <-- checkout target version tag
docker build -t openclaw:1.2.3 .       # <-- rebuild image, /app/openclaw.mjs updated
docker compose up -d --force-recreate  # <-- container starts from new image

# Correct verification (separate logic per deployment type)
if docker inspect openclaw-node > /dev/null 2>&1; then
  docker exec openclaw-node openclaw --version  # container: use --version
else
  openclaw --version  # bare-metal: direct execution
fi

Layers That Need Separate Verification in Mixed Deployments

  • Version number reporting: package.json‘s version field reflects npm installation state, not the version the shell actually executes. --version is the correct source of truth.
  • Upgrade success notifications: npm install -g exit 0 means npm-layer success, not that the command entry point changed. Notifications should trigger only after --version verification passes.
  • Monitoring dashboard version display: If monitoring queries package.json or the npm registry, it’s showing a false version. Fix: execute openclaw --version and parse the output.
  • Health check probes: A passing HEALTHCHECK confirms the process is alive, not that the version is correct. Version verification needs its own probe layer.
  • CI/CD deployment status: If the pipeline uses docker exec npm install -g exit code to mark deployment complete, that status flag is decoupled from the actual running version. Add a post-deploy verification step.
  • Change tracking logs: Deployment logs that record “npm upgrade succeeded” but not the symlink-layer version lose their audit value — log accuracy depends on verifying the right source.
  • Shared upgrade script assumptions: The same upgrade script behaves differently on bare-metal vs containerized nodes but produces identical success output. Scripts in mixed environments must be deployment-type-aware.
  • Webhooks and automated triggers: Any downstream trigger depending on version number — rollback decisions, notifications, integration tests — will execute on a false premise if the version source is wrong.

The rule: if a verification step failing wouldn’t stop the deployment process, it needs to be treated as an independent verification gate — not an afterthought confirmation.

One thing for next time

In mixed environments — same toolchain, some bare-metal, some containerized — upgrade verification logic needs to be written separately for each deployment type. Verify with --version, not package.json. The two deployment models have different layers of truth. One ruler measuring both will eventually get one wrong.

— 邱柏宇

Related Posts