fix(bootstrap): surface diagnostics for K8s namespace not ready failures #466
Merged
The 'K8s namespace not ready' error had three gaps preventing diagnostic information from reaching users:
1. The non-interactive (CI/piped) code path used bare error propagation with no diagnosis at all.
2. The interactive path's pattern matcher returned None for the common timeout case, and the generic_failure_diagnosis fallback existed but was never called.
3. Container logs were never passed to the diagnosis engine, so patterns only visible in logs (node pressure, corrupted state, etc.) could not match.
Fix all three by fetching container logs at the CLI error-handling site, passing them to diagnose_failure, and falling back to generic_failure_diagnosis when no specific pattern matches.
Also add container logs to the two wait_for_namespace error paths that were missing them (timeout and exec-error-on-final-attempt), and update the generic diagnosis to suggest 'openshell doctor' commands.
… bootstrap message
The auto-bootstrap banner told users to run 'openshell gateway status', but 'status' is a top-level command, not a gateway subcommand. Running the suggested command produced an 'unrecognized subcommand' error.
Cover the key behaviors introduced by the diagnostic fix:
- generic_failure_diagnosis suggests the doctor logs/check commands
- A plain namespace timeout returns None from diagnose_failure (confirming the generic fallback is necessary)
- Container logs enable pattern matching for namespace errors that would otherwise go undiagnosed (node pressure, corrupted state, no route, network connectivity)
- The end-to-end fallback pattern mirrors the actual CLI unwrap_or_else chain
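The fallback chain these commits describe can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Diagnosis` struct, the `PATTERNS` table, the summary strings, and the `diagnose_with_fallback` helper are all hypothetical. Only the function names `diagnose_failure` and `generic_failure_diagnosis`, the three pattern strings, and the `openshell doctor logs` / `openshell doctor check` suggestions come from the text above.

```rust
// Hypothetical sketch of the diagnosis fallback; types and summaries are
// assumptions for illustration, not the actual openshell implementation.

/// Guidance shown to the user after a failed bootstrap (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnosis {
    pub summary: String,
    pub steps: Vec<String>,
}

/// Pattern strings named in the PR, each paired with an assumed summary.
const PATTERNS: &[(&str, &str)] = &[
    ("HEALTHCHECK_NODE_PRESSURE", "node is under resource pressure"),
    ("extension-apiserver-authentication", "cluster state may be corrupted"),
    ("no default route present", "container has no default network route"),
];

/// Returns a specific diagnosis when a known pattern appears in the error
/// text or in the container logs, and `None` when nothing matches.
pub fn diagnose_failure(name: &str, err: &str, logs: Option<&str>) -> Option<Diagnosis> {
    // Treat missing logs as empty so both sources are searched the same way.
    let logs = logs.unwrap_or("");
    for &(needle, summary) in PATTERNS {
        if err.contains(needle) || logs.contains(needle) {
            return Some(Diagnosis {
                summary: format!("{name}: {summary}"),
                steps: vec!["openshell doctor logs".to_string()],
            });
        }
    }
    None
}

/// Generic guidance used when no specific pattern matches.
pub fn generic_failure_diagnosis(name: &str) -> Diagnosis {
    Diagnosis {
        summary: format!("{name}: gateway failed to start"),
        steps: vec![
            "openshell doctor logs".to_string(),
            "openshell doctor check".to_string(),
            "destroy and recreate the gateway".to_string(),
        ],
    }
}

/// Hypothetical helper: always yields guidance, specific when possible.
pub fn diagnose_with_fallback(name: &str, err: &str, logs: Option<&str>) -> Diagnosis {
    diagnose_failure(name, err, logs).unwrap_or_else(|| generic_failure_diagnosis(name))
}
```

With no logs, a plain "K8s namespace not ready" error matches nothing and falls through to the generic guidance. With logs containing one of the known strings, the specific diagnosis is returned instead, which is why the logs have to reach the matcher.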
Summary
Users hitting "K8s namespace not ready" saw a bare error with zero recovery guidance, despite extensive diagnostic plumbing existing in the codebase. This PR closes three compounding gaps so that every failure path now surfaces actionable diagnosis.
Changes
Gap 1: Non-interactive path had no diagnosis at all
- `deploy_gateway_with_panel` non-interactive path (CI, piped output) used bare `.await?` propagation
Gap 2: Interactive path silently dropped unmatched failures
- `diagnose_failure()` returned `None` for the common timeout case (no pattern matched)
- `generic_failure_diagnosis()` existed but was never called; there was no `else` branch
- Added `.unwrap_or_else(|| generic_failure_diagnosis(name))` so there's always guidance shown
Gap 3: Container logs were never passed to the diagnosis engine
- The call was `diagnose_failure(name, &err_str, None)`, so the logs argument was always `None`
- Patterns such as `extension-apiserver-authentication`, `HEALTHCHECK_NODE_PRESSURE`, and `no default route present` could never match unless they appeared in the miette error chain
- The error handler now calls `fetch_gateway_logs()` and passes them to the matcher
Additional fixes
- `wait_for_namespace` now includes container logs (like the DNS and crash paths already did)
- `generic_failure_diagnosis` now suggests `openshell doctor logs` and `openshell doctor check` before the destroy-and-recreate step, making existing diagnostic tooling discoverable
Testing
- `mise run pre-commit` passes
Checklist