
fix: stop clearing MILADY_API_TOKEN/ELIZA_API_TOKEN in Docker provider #423

Open

0xSolace wants to merge 42 commits into main from fix/consolidated-cloud-fixes

Conversation

@0xSolace
Collaborator

Problem

docker-sandbox-provider.ts cleared MILADY_API_TOKEN and ELIZA_API_TOKEN to empty strings before passing env vars to the container. The DB-generated token (from managed-milady-env.ts) was discarded.

With agent v2.0.4+ (which tightened auth in milady-ai/milady#1434), cloud containers with no token reject all requests with 401.

The pair flow also broke: /api/auth/pair returns the DB token to the browser, but the container didn't have that token, so the browser's auth header didn't match anything.

Fix

Removed the two lines that override MILADY_API_TOKEN and ELIZA_API_TOKEN to empty strings. The DB-generated token now flows through to the container so:

  1. isAuthorized() has a real token to match against
  2. The pair flow hands browsers the correct key via the nginx cookie auth bridge
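The shape of the fix can be sketched as follows. This is a simplified illustration, not the actual provider code: `buildContainerEnv` and its parameters are hypothetical names.

```typescript
// Hypothetical sketch of the env assembly in docker-sandbox-provider.ts.
// Before the fix the provider effectively did, after merging:
//   env.MILADY_API_TOKEN = "";
//   env.ELIZA_API_TOKEN = "";
// Removing those overrides lets the DB-generated token from
// managed-milady-env.ts reach the container untouched.
type EnvMap = Record<string, string>;

function buildContainerEnv(managedEnv: EnvMap, dockerOverrides: EnvMap): EnvMap {
  // Docker-specific values still win for shared keys, but the token keys
  // are no longer forced to empty strings.
  return { ...managedEnv, ...dockerOverrides };
}
```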

Testing

  • packages/tests/unit/auth-pair-route.test.ts — 7 tests pass
  • packages/tests/unit/milaidy-pairing-token-route.test.ts — 4 tests pass

Companion PR: milady-ai/milady#1551 (agent-side safety net)

0xSolace and others added 30 commits March 25, 2026 16:23
…port (31337)

- Change `-p ${bridgePort}:${DEFAULT_BRIDGE_PORT}` (→ 31337) to map to MILADY_PORT (2138)
- Both bridgePort and webUiPort now map to the same internal port 2138,
  which is the only port the milady agent actually listens on.
- Also set MILADY_API_BIND=0.0.0.0 alongside ELIZA_API_BIND so the agent
  binds to all interfaces (not just loopback) — required for Docker port
  mapping to work from outside the container.
- DEFAULT_BRIDGE_PORT (31337) constant remains for env-var compat but is
  no longer used as the docker container-side port.
fix: bridge port maps to wrong internal port (31337 → 2138)
- vercel.json: remove container-lifecycle crons (process-provisioning-jobs,
  health-check, deployment-monitor) — these are now exclusively owned by
  milady-provisioning-worker on the VPS
- deploy-backend.yml: add restart of milady-provisioning-worker after
  eliza-cloud restart so worker picks up new code on each deploy
- INFRASTRUCTURE.md: document Vercel vs VPS ownership, docker nodes,
  Neon DB, Redis, and missing GitHub Actions secrets

VPS .env.local updated separately (out of band):
- MILADY_DOCKER_IMAGE bumped to v2.0.0-steward-8
- ELIZA_CLOUD_AGENT_BASE_DOMAIN changed from waifu.fun to milady.ai
- Added: MILADY_SANDBOX_PROVIDER, MILADY_BRIDGE_INTERNAL_PORT,
  STEWARD_CONTAINER_URL, REDIS_URL, KV_URL, MILADY_SSH_KEY (base64)
…uting

- Add WalletProvider interface (packages/lib/services/wallet-provider.ts)
- Add StewardClient singleton wrapper (packages/lib/services/steward-client.ts)
- Add wallet migration feature flags (packages/lib/config/wallet-provider-flags.ts)
- Update agent_server_wallets schema: wallet_provider, steward_agent_id, steward_tenant_id columns
- Make privy_wallet_id nullable for Steward-only wallets
- Add DB migration 0058 with CHECK constraint ensuring exactly one provider ID
- Update server-wallets.ts: dual-provider routing for provisionServerWallet() and executeServerWalletRpc()
- New wallets routed to Steward when USE_STEWARD_FOR_NEW_WALLETS=true
- Existing Privy wallets continue working unchanged
- Install @stwd/sdk@0.3.0

Rollback: set USE_STEWARD_FOR_NEW_WALLETS=false to revert to Privy for all new wallets.
- Add steward-client.ts: lightweight client for querying Steward agent/wallet info
- Add GET /api/v1/milady/agents/[agentId]/wallet endpoint for detailed wallet info
- Update agent detail API to include walletAddress, walletProvider, walletStatus
- Update compat API (toCompatAgent) to include wallet_address, wallet_provider
- Update admin docker-containers route to show wallet provider and address
- Update managed env to inject STEWARD_API_URL and STEWARD_AGENT_ID
- Pass sandboxId through to prepareManagedMiladyEnvironment

Docker-backed agents query Steward for wallet data; Privy agents fall back to DB.
All Steward calls are best-effort with timeouts — API degrades gracefully.
The check-types-split.ts script was scanning 'lib/', 'db/', and 'components/'
— none of which exist at the project root. The actual source lives at
packages/lib, packages/db, and packages/ui/src/components.

This meant packages/lib (including feature-gate.ts, use-feature-flags.ts,
and feature-flags.ts) was never actually type-checked in CI. The split
type-checker silently skipped all those files.

Fix: update getDirectoriesToCheck() to use the correct package paths.
…ess fallback

- CloudFormation: default-deny direct container port; opt-in via DirectContainerPortCidr param
- milady-web-ui: getClientSafeMiladyAgentWebUiUrl returns null instead of falling back to headscale direct access
- Dashboard pages: remove webUiUrl gating on connect button (always use pairing flow for running agents)
- agent-actions/sandboxes-table: drop getConnectUrl and webUiUrl prop threading
- Add cloudformation-template unit test
- Gitignore env backup files
…e-slop UI

Infrastructure:
- Add wallet proxy route (/api/v1/milady/agents/[id]/api/wallet/[...path])
  Proxies wallet/steward requests to agent's REST API with proper auth
- Switch Neon provisioning from projects to branches (fixes 100-project limit)
  New agents get branches within shared parent project
- Add cleanup-stuck-provisioning cron (resets agents stuck >10min)
- Remove process-provisioning-jobs Vercel cron (VPS worker handles this)
- Add milady.ai to redirect allowlists for Stripe checkout

Dashboard UI:
- Agent detail: add Wallet, Transactions, Policies tabs
- Billing: replace credit pack cards with custom amount + card/crypto
- Agent cards: deterministic character images instead of identical fallbacks
- De-slop text across all dashboard pages
- Create dialog: cleaner copy, Deploy button
- Pricing: tighter descriptions
Replace the broken custom billing page with the working BillingTab
component from Settings. Same Stripe + crypto flow, invoices, and
balance display.
…ution

Root cause: bun test --max-concurrency=1 runs all test files in a single
process. When a test file calls mock.module("@/db/repositories", ...) with
only a partial set of exports, the mock persists and breaks all subsequent
test files that import named exports not included in the mock.

Fixes:
1. privy-sync: change @/db/repositories mock to use specific sub-module
   paths (organization-invites, users) so the full repositories index is
   never replaced globally.
2. admin-service-pricing route+test: import servicePricingRepository from
   its specific sub-module (service-pricing) instead of the full index, and
   update the test mock accordingly.
3. Add missing InsufficientCreditsError export to @/lib/services/credits
   mocks in four test files that omitted it, preventing mcp-tools.test.ts
   from failing when it transitively imports app/api/mcp/tools/memory.ts.
…ry whitelist, typed token access, no hardcoded fallback

- Wallet proxy: whitelist allowed wallet sub-paths (prevents path traversal)
- Wallet proxy: whitelist allowed query params (limit, offset, cursor, type, status)
- Wallet proxy: typed environment_vars access instead of unsafe casts
- Wallet proxy: POST body size limit (1MB) + Content-Type validation
- Wallet proxy: reject multi-segment paths
- Neon: remove hardcoded project ID fallback, warn if env var missing
…async job queue

The VPS worker handles SSH/Docker operations. Sync provisioning in Vercel
serverless functions can't do SSH and times out. Force async path so the
VPS worker is always the one deploying containers.
The VPS worker writes bridge_url and status to the primary DB.
The wallet proxy reads with findRunningSandbox which was using the
read replica (dbRead). Replica lag caused 503 'not running' errors.
Switched to dbWrite (primary) for consistent reads.
…internal IPs

Vercel serverless functions can't reach Hetzner internal Docker IPs.
Route wallet proxy through the agent's public domain ({agentId}.waifu.fun)
which is accessible from anywhere via nginx/cloudflare routing.
The cron was still processing jobs despite being removed from vercel.json.
Replace the route with a no-op that returns immediately. Provisioning is
handled exclusively by the standalone VPS worker.
Neon plan limits (100 projects, 10 branches/project) block new agent
provisioning with BRANCHES_LIMIT_EXCEEDED. ElizaOS plugin-sql tables
already scope data by agent UUID, so all agents can safely share one DB.

Changes:
- provisionNeon() now returns process.env.DATABASE_URL instead of
  calling neon.createProject()
- cleanupNeon() accepts null/undefined and no-ops in shared-DB mode
- Delete path already guarded by if (rec.neon_project_id) check
- neon-client.ts preserved for future use / legacy project cleanup
- Existing neon_project_id/neon_branch_id columns unchanged in schema
…ard wallets + dashboard deslop + shared DB
…ning, concurrency limits, correctness fixes

- Delete INFRASTRUCTURE.md (VPS IPs in public repo)
- Delete TRIAGE_NOTES.md, CLAUDE.md (stray merge artifacts)
- Delete duplicate lib/milady-web-ui.ts + its test (packages/lib version is canonical)
- Add GET handler to process-provisioning-jobs stub (Vercel cron uses GET)
- Fix wallet_provider inference: null when not fetched, not inferred from node_id
- Remove hardcoded Neon project ID fallback
- Add STEWARD_TENANT_API_KEY missing warning in docker-sandbox-provider and steward-client
- Add concurrency limiter (max 5) for Steward enrichment in admin containers endpoint
- Remove unused sanitizeProjectNameSegment function
The DB-generated API token from managed-milady-env.ts is the canonical
inbound auth credential.  Clearing it to empty caused cloud containers
to start without a token, which (combined with the agent-side auth gate
in v2.0.4+) resulted in 401 on every request.

The token now flows through to the container so the pair flow can hand
browsers the correct key via the nginx cookie auth bridge.
@vercel

vercel bot commented Mar 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: eliza-cloud-v2
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Apr 2, 2026 7:47pm

@coderabbitai

coderabbitai bot commented Mar 31, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@claude

claude bot commented Mar 31, 2026

Code Review

Overall this is a well-structured PR with clear motivation. The core fix (removing the two empty-string overrides) is correct and the Steward wallet integration is architecturally sound. A few issues worth addressing before merge:


🔴 Critical

provisionNeon now shares DATABASE_URL with all agents

milady-sandbox.ts provisionNeon() was rewritten to set database_uri = process.env.DATABASE_URL (the cloud platform's own primary DB) for every new agent. The previous behaviour created isolated per-agent Neon projects/branches.

This means all agent containers receive the cloud platform's connection string. If an ElizaOS plugin has a bug, a misconfigured agent writes to the wrong tables, or a container is compromised, it has access to the shared cloud DB — including other tenants' data, the jobs table, billing records, etc.

The comment says "ElizaOS plugin-sql tables scope all data by agent UUID", but that relies on every plugin being correct and well-behaved. A single DROP TABLE or runaway migration from an agent would affect all tenants.

Recommendation: Strongly consider using Neon branches off NEON_PARENT_PROJECT_ID for isolation (the new createBranch / deleteBranch methods are already added in this PR) rather than the platform's own DB. If shared DB is truly the right call, document the threat model explicitly and add DB-level row-security policies or a dedicated low-privilege DB user that agents receive.


🟡 High

CLAUDE.md deleted

The file was deleted entirely. It contained important developer guidance: migration rules (no CREATE INDEX CONCURRENTLY, no omnibus migrations, etc.), the check-types filtering workflow, and the monorepo structure. If this is intentional because the content moved elsewhere, please note the destination in the PR description.

@stwd/sdk@0.3.0 is an unvetted external dependency

This is a new package (@stwd/sdk) that has not appeared in prior commits. The package handles wallet signing and agent provisioning — a security-sensitive code path. Please confirm:

  • Is this an internal/first-party package or a public package?
  • Has it been audited or pinned to a specific hash?
  • The lock file entry shows a sha512 hash but no registry source. Ensure it comes from a trusted registry.

🟠 Medium

stewardFetch silently swallows auth errors

// packages/lib/services/steward-client.ts
if (!response.ok) {
  if (response.status === 404) return null;
  logger.warn(`[steward-client] ${path} returned ${response.status}`);
  return null;  // 401, 403, 500 all silently return null
}

A misconfigured or rotated STEWARD_TENANT_API_KEY (401) is treated identically to "agent not found" — the dashboard will just show no wallet info with no indication of an auth failure. Consider returning a typed error or at least logging warn with the status code for non-404 failures (it does log warn, but callers can't distinguish auth failure from missing data).
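One way to carry that distinction, sketched with illustrative names (`StewardFailure` and `classifyStewardStatus` are not part of the actual steward-client.ts):

```typescript
// Hypothetical sketch: a typed failure lets callers tell "agent not found"
// apart from auth or server failures instead of collapsing both to null.
type StewardFailure = { ok: false; kind: "not_found" | "auth" | "server" };

function classifyStewardStatus(status: number): StewardFailure {
  if (status === 404) return { ok: false, kind: "not_found" };
  if (status === 401 || status === 403) return { ok: false, kind: "auth" };
  return { ok: false, kind: "server" };
}
```

Callers can then surface an explicit "auth failure" state in the dashboard instead of an empty wallet panel.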

Both bridgePort and webUiPort now map to the same container port

`-p ${bridgePort}:${allEnv.PORT || DEFAULT_AGENT_PORT}`,
`-p ${webUiPort}:${allEnv.PORT || DEFAULT_AGENT_PORT}`,

Two separate host ports are allocated for the same container port 2139. This wastes ports on the VPS host (both bridge_port and web_ui_port DB columns exist and are tracked separately). This may be intentional for routing purposes but it's worth a comment explaining why separate ports are needed rather than a single mapping.

getClientSafeMiladyAgentWebUiUrl now unconditionally returns null

// packages/lib/milady-web-ui.ts
-  return getMiladyAgentDirectWebUiUrl(sandbox, options);
+  return null;

The previous fallback returned the direct URL when no canonical URL was set. Any caller that used this for display (e.g. showing users a link to their agent's web UI) will now silently get nothing. milady/agents/[id]/page.tsx still references webUiUrl (line 1071 in the diff) which will always be null now — the "Web UI" cell in the infrastructure section will never render. Confirm this is intentional.

Module-level console.warn at import time

Both steward-client.ts and docker-sandbox-provider.ts emit console.warn at module load time when STEWARD_TENANT_API_KEY is absent. In development (where Steward isn't configured), every server start will print this noise. Consider gating it behind an explicit check at call time or converting to a lazy warn.
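A lazy once-guard along these lines would avoid the import-time noise; `warnOnceMissingKey` is an illustrative name, not existing code:

```typescript
// Hypothetical sketch of a lazy, once-only warning emitted at call time
// rather than at module import.
let warnedMissingKey = false;

function warnOnceMissingKey(log: (msg: string) => void, key: string | undefined): void {
  if (key || warnedMissingKey) return;
  warnedMissingKey = true;
  log("[steward-client] STEWARD_TENANT_API_KEY is not set; Steward calls will be skipped");
}
```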


🟢 Low / Nits

Query parameter values not validated in wallet proxy

proxyWalletRequest filters keys against ALLOWED_QUERY_PARAMS but doesn't validate values. A caller could pass limit=99999999 which passes through to the agent. Consider adding basic numeric bounds for pagination params.
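A minimal value check could look like this; the cap of 1000 is an assumed bound, not taken from the source:

```typescript
// Hypothetical sketch of value validation for the whitelisted pagination params.
function clampPaginationParam(name: string, raw: string): string | null {
  if (name === "limit" || name === "offset") {
    const n = Number(raw);
    if (!Number.isInteger(n) || n < 0) return null; // drop non-numeric values
    return String(Math.min(n, 1000));               // cap runaway values
  }
  return raw; // cursor / type / status pass through unchanged
}
```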

const sync = false — dead code

// provision/route.ts
const sync = false;

The variable is assigned but the comment says the sync path is for local dev only. If sync is always false in production, the conditional branch below it is dead code. Either remove the branch or use an env-var guard like process.env.NODE_ENV === 'development' so it's obvious.
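The env-var guard could be as small as this; `ALLOW_SYNC_PROVISIONING` is an assumed flag name for illustration:

```typescript
// Hypothetical sketch: make the sync path an explicit dev-only opt-in
// instead of a hardcoded `const sync = false`.
function allowSyncProvisioning(env: Record<string, string | undefined>): boolean {
  return env.NODE_ENV === "development" && env.ALLOW_SYNC_PROVISIONING === "true";
}
```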

Double blank line at end of feature-flags.ts

+
+// Steward wallet migration flags live in wallet-provider-flags.ts

There's a spurious blank line before the comment.

Journal index gap

The _journal.json adds entries at idx 53–55 for migrations 0056–0058, implying 0054 and 0055 were added in other commits but aren't visible here. This is fine, but worth confirming the journal is consistent with what's on disk.


✅ What looks good

  • The core bug fix is correct and well-explained
  • mapWithConcurrency for Steward enrichment is a clean, safe bounded-parallelism pattern
  • Wallet proxy path whitelist (ALLOWED_WALLET_PATHS) is the right approach
  • DB migration is additive and zero-downtime safe; IF NOT EXISTS guards are correct
  • Integration tests for dual-provider routing are thorough (680 lines, all major paths covered)
  • CloudFormation security group change (CIDR opt-in instead of 0.0.0.0/0) is a good hardening
  • findRunningSandbox moved to dbWrite primary to avoid read-replica lag — correct fix

PR #421 fixed milady-sandbox.ts to use the shared DATABASE_URL but
missed user-database.ts which still called neonClient.createProject().
The ai-app-builder calls this path, hitting BRANCHES_LIMIT_EXCEEDED.

Same pattern: use process.env.DATABASE_URL, encrypt before storing,
mark as ready. Legacy per-app projects with existing project IDs
still get cleaned up correctly via cleanupDatabase().
@claude

claude bot commented Apr 1, 2026

Code Review

This PR introduces Phase 1 of a Privy → Steward wallet provider migration plus several fixes. The core Docker token fix is correct and the DB migration is safe. A few issues worth addressing before merge:


Bugs / Correctness

Hard-coded chainId: 8453 in server-wallets.ts
executeStewardRpc defaults to Base mainnet when chainId is absent from the transaction payload. A transaction missing chainId will be signed for the wrong chain rather than rejected. Either require chainId explicitly or derive it from the agent's configured network.
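The "require it explicitly" option can be sketched as follows (a hypothetical helper, not the actual server-wallets.ts code):

```typescript
// Sketch: refuse to sign when chainId is absent instead of silently
// defaulting to Base mainnet (8453).
function resolveChainId(tx: { chainId?: number }): number {
  if (tx.chainId === undefined) {
    throw new Error("transaction is missing chainId; refusing to default to 8453");
  }
  return tx.chainId;
}
```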

eth_signTypedData_v4 JSON parsing

const parsed = JSON.parse(typedData);

JSON.parse throws a SyntaxError if a client sends an already-parsed object (some EIP-712 clients do). Wrap in a try/catch or use typeof typedData === 'string' ? JSON.parse(typedData) : typedData.
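The tolerant parse suggested above, sketched as a standalone helper (`parseTypedData` is an illustrative name):

```typescript
// Sketch: accept either a JSON string or an already-parsed object for the
// eth_signTypedData_v4 payload.
function parseTypedData(typedData: unknown): object {
  if (typeof typedData === "string") return JSON.parse(typedData);
  if (typeof typedData === "object" && typedData !== null) return typedData;
  throw new Error("typedData must be a JSON string or an object");
}
```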

steward_agent_id not namespaced by org in server-wallets.ts
provisionStewardWallet derives the Steward agent name from characterId alone (e.g. cloud-${characterId}). If characterId values are not globally unique across organizations, two orgs with the same character ID would collide at the Steward level. Consider prefixing with org-${organizationId}-${characterId}.
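The suggested namespacing is a one-liner; the `cloud-org-...` format here is illustrative:

```typescript
// Hypothetical sketch of an org-namespaced Steward agent name.
function stewardAgentName(organizationId: string, characterId: string): string {
  return `cloud-org-${organizationId}-${characterId}`;
}
```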

Missing Steward 409 handling in provisionStewardWallet
Only DB unique-constraint violations are caught. If Steward returns 409 (agent already exists) but the DB record was deleted, the SDK error propagates unhandled. Add explicit 409 handling to re-fetch the existing Steward agent.


Security

DISABLE_PRIVY_WALLETS flag is declared but not enforced
wallet-provider-flags.ts exports DISABLE_PRIVY_WALLETS but server-wallets.ts only checks USE_STEWARD_FOR_NEW_WALLETS. Setting DISABLE_PRIVY_WALLETS=true has no effect — legacy wallets still route to Privy via executePrivyRpc. Either enforce the flag or remove it until it's wired up, to avoid giving operators false confidence.

Missing startup guard for missing STEWARD_TENANT_API_KEY
docker-sandbox-provider.ts logs a console.warn but allows execution to continue unauthenticated. If Steward is configured but the key is absent, calls will fail at runtime with unhelpful errors. Consider throwing at startup if USE_STEWARD_FOR_NEW_WALLETS is enabled but the key is unset.


Reliability

No timeout on Steward SDK calls in getStewardWalletInfo
stewardFetch correctly uses AbortSignal.timeout(10_000), but getStewardWalletInfo's SDK path calls client.getAgent() and client.getBalance() with no timeout. In a serverless context these could hang indefinitely if Steward is slow. Apply AbortSignal.timeout or set a deadline at the call site.
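A deadline wrapper for SDK calls that take no AbortSignal could look like this (`withDeadline` is an illustrative name, not existing code):

```typescript
// Hypothetical sketch: race the SDK promise against a timeout so serverless
// invocations can't hang on a slow Steward call.
async function withDeadline<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((unusedResolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Steward call timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```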

stuckSinceMinutes is always 10 in the cron response
The field name implies the actual duration the agent was stuck, but it's hardwired to STUCK_THRESHOLD_MINUTES. Rename it to stuckThresholdMinutes so it's clear it's the threshold, not a measurement.

jobs.data->>'agentId' JSON path in cleanup cron
The NOT EXISTS subquery in cleanup-stuck-provisioning/route.ts depends on agentId being a top-level key in the jobs JSONB column. If the jobs schema changes this field name, the cron will silently stop filtering and reset all provisioning agents. Add a comment cross-referencing the jobs schema, or assert the path in a test.


Minor

CLAUDE.md deletion — This PR removes CLAUDE.md. Was that intentional? It contains the DB migration rules and type-check guidance that contributors rely on.

console.warn vs logger.warn inconsistency in steward-client.ts — The startup warning uses console.warn while all other logging in the same file uses logger.warn. Use logger.warn throughout.

toCompatAgent silent null for wallet fields — Callers in agents/route.ts pass no walletInfo (list endpoint), so all agents in that response get wallet_address: null / wallet_provider: null. If downstream clients treat null differently from field-absent this is a breaking change. Worth calling out in the PR description or adding a note in the API changelog.

wallet_provider column has no Drizzle-level enum — The schema uses text("wallet_provider") with no union constraint, so the ORM won't catch an invalid provider value before hitting the DB CHECK constraint. A pgEnum or application-level validation before insert would surface errors earlier.


Test Coverage Gaps

  • cleanup-stuck-provisioning route has no tests (NOT EXISTS logic, verifyCronSecret rejection, empty-result path).
  • eth_signTypedData_v4 with a pre-parsed object is not covered.
  • isStewardAvailable is not tested.
  • SDK-fallback path in getStewardWalletInfo is not tested.
  • New origins in redirect-validation.ts have no test.

The core fix (token flow-through in docker-sandbox-provider.ts) is correct and the DB migration is zero-downtime safe. The issues above are mostly in the new Steward integration. The chainId default and the unenforced DISABLE_PRIVY_WALLETS flag are the two I'd prioritize before shipping.

…hrow

The validateRateLimitConfig() function set hasValidatedConfig=true BEFORE
throwing its error. This meant:
  - 1st request to /api/auth/pair → flag set → throw → HTTP 500 ("Pairing failed")
  - 2nd request → flag already set → early return → HTTP 200 ✓

Root cause confirmed via nginx access.log:
  POST /api/auth/pair → 500 (1st try)
  POST /api/auth/pair → 200 (2nd try, 4 seconds later)

And journalctl showing the exact error text thrown:
  '🚨 SECURITY: Redis rate limiting is required in production...'

Fix: convert the throw to a logger.warn so in-memory rate limiting is allowed
on single-server VPS deployments. Multi-instance deployments should still
configure Redis (REDIS_RATE_LIMITING=true) to share rate-limit state across
processes, but the single-server case is safe and should never 500.
@claude

claude bot commented Apr 1, 2026

Code Review — PR #423

Overall the fix to stop clearing MILADY_API_TOKEN/ELIZA_API_TOKEN is correct and well-motivated. The companion changes (port mapping, health-check port, MILADY_API_BIND, findRunningSandbox → dbWrite) are all sound bug fixes. The Steward wallet integration is well-structured with good test coverage. A few things worth addressing before merge:


🔴 Critical

Shared DATABASE_URL used as per-agent database URI
packages/lib/services/milady-sandbox.ts (and user-database.ts) now sets database_uri = process.env.DATABASE_URL — the cloud's primary database — for every agent. This means every container receives the production DB credentials. Application-level UUID scoping (plugin-sql) is not the same as database-level isolation. A compromised or malicious container could read other agents' data, system tables, or cloud infrastructure records.

If the goal is to avoid Neon project limits, the safer path is per-agent Neon branches on the shared project (the createBranch/deleteBranch methods you added to NeonClient are perfect for this), rather than handing all agents the root connection string.


🟡 Important

CLAUDE.md deleted with no replacement
The project's development guide (migration rules, type-check workflow, db:push prohibition, etc.) is dropped from the repository. If this is moving to a wiki or docs folder, that destination isn't in the diff. Dev onboarding context is lost.

Migration journal gap
packages/db/migrations/meta/_journal.json jumps from tag 0053_add_milady_billing_columns (idx 52) directly to idx 53 → 0056_add_billing_status_check, skipping 0054 and 0055. If those migration files exist on disk but aren't in the journal, Drizzle's runner will behave inconsistently across environments. Please confirm that 0054/0055 journal entries are intentionally absent (or add them).

Rate-limiting guard downgraded from throw to warn
packages/lib/middleware/rate-limit.ts removes the hard production guard that prevented in-memory rate limiting in multi-instance deployments. The comment about single-VPS deployments is valid, but this now silently allows Vercel serverless functions (multiple instances) to run with per-instance limits, which can be bypassed by distributing requests. Consider keeping the throw for serverless environments and only suppressing it when a SINGLE_SERVER_MODE=true flag is explicitly set.
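The suggested policy split can be sketched as a small decision function; `SINGLE_SERVER_MODE` is the flag name proposed above, and `rateLimitGuardAction` is illustrative:

```typescript
// Hypothetical sketch: keep the hard failure for multi-instance/serverless
// deployments unless single-server mode is explicitly declared.
function rateLimitGuardAction(env: Record<string, string | undefined>): "warn" | "throw" {
  if (env.SINGLE_SERVER_MODE === "true") return "warn";
  return env.NODE_ENV === "production" && env.REDIS_RATE_LIMITING !== "true"
    ? "throw"
    : "warn";
}
```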


🟠 Minor

console.warn instead of logger.warn
Two spots log with console.warn instead of the project's structured logger:

  • packages/lib/services/docker-sandbox-provider.ts: the STEWARD_TENANT_API_KEY warning
  • packages/lib/services/steward-client.ts — same warning

These should use logger.warn(...) for consistency and structured log shipping.

Both host ports mapped to the same container port

-p ${bridgePort}:${allEnv.PORT || DEFAULT_AGENT_PORT}
-p ${webUiPort}:${allEnv.PORT || DEFAULT_AGENT_PORT}

Two different host ports bound to the same container port is valid and intentional per the comment, but it means web_ui_port and bridge_port are interchangeable externally. Worth an explicit comment explaining why (nginx routing by path rather than port).

Noop change in compat agents list route

// before
agents.map(toCompatAgent)
// after
agents.map((a) => toCompatAgent(a))

These are functionally identical. If this was meant to pass walletInfo, it's missing; if it was just a style change, it adds noise.

proxyWalletRequest returns a 400 Response for invalid paths, but null for not-found
The caller only checks if (!agentResponse) to detect proxy failure. An invalid-path 400 (from the path allowlist inside proxyWalletRequest) would be forwarded as a real agent response. In practice the route handler validates the path first, so this can't be reached, but the inconsistent return semantics (Response for errors vs null for not-running) are a footgun for future callers.


✅ Good

  • The core token fix is clean and the PR description clearly explains the causal chain
  • MILADY_API_BIND: "0.0.0.0" alongside ELIZA_API_BIND is the right fix for the port-binding regression
  • Health check now correctly probes PORT (2139) not MILADY_PORT (2138)
  • findRunningSandbox now reads from dbWrite (primary), correctly addressing replica lag for the wallet proxy
  • sync = false for serverless provisioning is the right decision; keeping the code path but disabling it preserves local dev utility
  • Steward client has good defense-in-depth: route-handler path validation + service-layer allowlist, plus query-param filtering
  • The stuck-provisioning cron is a good operational safety net with a clean single-UPDATE-RETURNING pattern
  • Integration tests for dual-provider routing are thorough (routing, schema validation, guards, dispatch)
  • Steward client gracefully degrades: SDK → lightweight fetch → null, with appropriate logging at each step
  • CloudFormation change to opt-in direct container port exposure is a good default-deny security improvement

@claude

claude bot commented Apr 2, 2026

PR Review

This PR bundles a targeted auth fix with a substantial Steward wallet integration (Phase 1). The core fix is correct and well-explained. Several issues in the surrounding changes worth addressing:


Bugs / Correctness

Both host ports map to the same container port
docker-sandbox-provider.ts:

-p ${bridgePort}:${allEnv.PORT || DEFAULT_AGENT_PORT}
-p ${webUiPort}:${allEnv.PORT || DEFAULT_AGENT_PORT}

Two external ports publishing to the same container port is valid Docker but means nginx must route by host header or path — if it routes by port number, one of these will always be wrong. The comment says "nginx can reach /api/* via bridge_url and the UI via web_ui_port" but both now point at the same listener. Worth confirming the nginx config handles this.

vercel.json not updated for the new cron route
cleanup-stuck-provisioning/route.ts has a comment: Schedule: every 5 minutes ("* /5 * * * *" in vercel.json) — but vercel.json is not in the diff. The cron job won't fire without that entry. Also: "* /5 * * * *" (with a space) is malformed; the correct syntax is */5 * * * *.

provision/route.ts: sync hardcoded to false

const sync = false;

The comment says "The sync path remains in code for local dev only" but there's no way to enable it without editing source. If local dev still uses the sync path, this breaks it. A process.env.NODE_ENV === 'development' guard or env var would be safer than silently ignoring the query param.


Security

Wallet proxy path validation gap in route handler
[...path]/route.ts only checks path.length !== 1 || path[0].includes(".."). The service layer (milady-sandbox.ts) does validate against ALLOWED_WALLET_PATHS, but the route handler rejects unknown paths with a 400 — except it doesn't: it passes any single-segment path (e.g. ../../etc) to the service. The .. check only catches that specific string but not URL-encoded variants like %2e%2e. This is defense-in-depth; the actual allowlist in the service is the real guard. Consider adding the allowlist check at the route layer too, or documenting the trust boundary.
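A route-layer check that also catches encoded traversal could look like this; the `ALLOWED_SEGMENTS` set here is illustrative, not the actual ALLOWED_WALLET_PATHS contents:

```typescript
// Hypothetical sketch: decode before validating so %2e%2e is caught, and
// reject malformed percent-encoding outright.
const ALLOWED_SEGMENTS = new Set(["balance", "transactions", "policies"]);

function isAllowedWalletPath(segment: string): boolean {
  let decoded: string;
  try {
    decoded = decodeURIComponent(segment);
  } catch {
    return false; // malformed percent-encoding
  }
  return ALLOWED_SEGMENTS.has(decoded) && !decoded.includes("..") && !decoded.includes("/");
}
```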

Shared DATABASE_URL stored per-app (encrypted)
user-database.ts now encrypts and stores the master DATABASE_URL as each user app's user_database_uri. The encryption is per-org, so a compromised decryption key for one org exposes the master credentials. The comment trusts ElizaOS plugin-sql to scope data by agent UUID — this is an application-layer invariant, not enforced at the DB level. Worth a note in the code that apps cannot be migrated to different Postgres instances, and that data isolation depends entirely on agent behavior.


Code Quality

Module-level console.warn at import time (both steward-client.ts:27 and docker-sandbox-provider.ts:84):

if (!STEWARD_TENANT_API_KEY) {
  console.warn("[steward-client] STEWARD_TENANT_API_KEY is not set...");
}

This fires every time the module is imported (e.g. in tests, cold starts). Use logger.warn inside the actual request functions with a once-guard, or accept it only logs on startup. Mixing console.warn with logger is also inconsistent.

Redundant getStewardWalletInfo fallback
steward-client.ts: the function tries client.getAgent() (SDK), catches any error and falls back to getStewardAgent() (fetch). Both call the same STEWARD_HOST_URL. If Steward is down, both fail. The fallback adds complexity without reliability benefit — consider removing it and using only the fetch path for this lightweight read.

agents/route.ts no-op change

- agents.map(toCompatAgent)
+ agents.map((a) => toCompatAgent(a))

These are equivalent; the lambda wrapper adds overhead with no benefit.

CLAUDE.md deleted
This file contained migration rules (no db:push, no CREATE INDEX CONCURRENTLY, IF NOT EXISTS requirements, migration size limits) and type-checking guidance that aren't obvious from the code. Deleting it removes context that prevents prod incidents. If it was moved elsewhere, a pointer would help. If it's intentionally gone, the migration rules should live in docs/database-migrations.md (which it referenced).


Minor / Nits

  • managed-milady-env.ts sets STEWARD_API_URL / STEWARD_AGENT_ID on the base env, but docker-sandbox-provider.ts also sets these in the container-specific env. Confirm merge order ensures the docker provider's values win (they should if spread order is { ...baseEnv, ...dockerSpecificEnv }).
  • milady-managed-launch.ts: model defaults changed from kimi-k2 to minimax/minimax-m2.7 + claude-sonnet-4.6. eliza-app/config.ts does the same for webhook channels. Fine, but two separate places define the same defaults — a single config source would prevent drift.
  • Migration 0058 constraint is correct for the existing data shape (privy_wallet_id was NOT NULL before step 3). Good.
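The merge-order question in the first bullet above comes down to spread ordering; a quick check with made-up values:

```typescript
// With { ...baseEnv, ...dockerSpecificEnv }, the later spread wins for any
// shared key, so the docker provider's values take precedence. The URLs and
// IDs here are illustrative only.
const baseEnv = { STEWARD_API_URL: "https://base.example", STEWARD_AGENT_ID: "from-base" };
const dockerSpecificEnv = { STEWARD_API_URL: "https://docker.example" };
const mergedEnv = { ...baseEnv, ...dockerSpecificEnv };
```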

What's Good

  • The core fix (removing the two token-clearing lines) is correct and the explanation is clear.
  • MILADY_API_BIND: "0.0.0.0" addition is the right fix for the bind issue.
  • Health cmd now probes PORT instead of MILADY_PORT — consistent with the port consolidation.
  • cleanup-stuck-provisioning cron is well-written: single atomic UPDATE+RETURNING, proper threshold, good logging.
  • Steward client's stewardFetch has a 10s timeout and treats 404 as null — good defensive design.
  • DB migration is zero-downtime safe (additive, IF NOT EXISTS, proper constraint ordering).
  • Wallet proxy allowlist in milady-sandbox.ts is thorough and well-documented.
