Add health check to watch.stream for silent connection drops #2527
Urvashi0109 wants to merge 1 commit into kubernetes-client:master
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Urvashi0109. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Adds an optional client-side “health check” mechanism to Watch.stream() to detect silent/half-open watch connections (e.g., during control plane upgrades) by configuring a read timeout and reconnecting automatically.
Changes:
- Add `_health_check_interval` parameter to `Watch.stream()` that sets `_request_timeout` (read timeout) when enabled.
- Catch `ReadTimeoutError`/`ProtocolError` during streaming to trigger a reconnect using the last known `resource_version`.
- Add unit tests covering reconnect behavior, default behavior, and `_request_timeout` handling.
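Hedged sketch of how such a parameter could map onto the client's existing `_request_timeout` knob, which accepts either a single number or a `(connect, read)` tuple; the helper name and the 5-second connect timeout are illustrative assumptions, not part of the PR:

```python
def build_timeout(health_check_interval, connect_timeout=5.0):
    """Map a health-check interval onto a (connect, read) timeout pair.

    A positive interval becomes the read timeout, so a silent connection
    is broken out of after at most that many seconds; zero/None preserves
    the old blocking behavior (no read timeout at all).
    """
    if health_check_interval and health_check_interval > 0:
        return (connect_timeout, health_check_interval)
    return None
```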
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubernetes/base/watch/watch.py | Implements _health_check_interval handling, timeout configuration, and reconnect-on-timeout logic. |
| kubernetes/base/watch/watch_test.py | Adds unit tests for the new health-check/reconnect behavior and timeout parameter interactions. |
```python
except (ReadTimeoutError, ProtocolError) as e:
    # Only treat a read timeout / protocol error as a silent connection drop
    should_retry = (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
        and self.resource_version is not None
    )
    if should_retry:
        # Add a small sleep to avoid a tight reconnect loop
        # in case the endpoint is hard-down or errors immediately.
        time.sleep(min(1.0,health_check_interval))
    else:
        raise
```
The new retry behavior for ProtocolError isn’t covered by tests (current tests only simulate ReadTimeoutError). Add a unit test that raises ProtocolError (ideally wrapping a ReadTimeoutError, as urllib3 often does for broken streams) to verify the watch reconnects only in the intended cases and that non-timeout ProtocolErrors still propagate.
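For such a test, the exception shapes can be fabricated directly from urllib3's public exception constructors; a minimal sketch (the variable names are illustrative):

```python
from urllib3.exceptions import ProtocolError, ReadTimeoutError

# ReadTimeoutError takes (pool, url, message); None placeholders are fine
# for a unit test that only needs the exception object itself.
timeout_inner = ReadTimeoutError(None, None, "Read timed out.")

# urllib3 commonly raises ProtocolError(message, cause) for broken streams,
# attaching the underlying exception as a second argument.
broken_stream = ProtocolError("Connection broken: read timed out", timeout_inner)

# A non-timeout protocol problem (e.g. bad chunked framing) carries no
# timeout cause and should still propagate to the caller.
bad_framing = ProtocolError("Response ended prematurely")
```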
```python
if should_retry:
    # Add a small sleep to avoid a tight reconnect loop
    # in case the endpoint is hard-down or errors immediately.
    time.sleep(min(1.0,health_check_interval))
```
Minor style: add a space after the comma in min(1.0, health_check_interval) to match typical formatting in this file / PEP8.
```diff
-    time.sleep(min(1.0,health_check_interval))
+    time.sleep(min(1.0, health_check_interval))
```
```python
should_retry = (
    health_check_interval > 0
    and not disable_retries
    and watch_arg == "watch"
    and self.resource_version is not None
)
```
With _health_check_interval enabled, a ReadTimeoutError will occur whenever the watch is idle longer than the interval. The current retry gate requires self.resource_version is not None, and the finally block sets self._stop = True when it’s None, so an idle watch that hasn’t observed an event yet will raise instead of reconnecting. Consider allowing retries even when resource_version hasn’t been observed yet (or performing an initial list to obtain one) and avoid forcing _stop=True in that case so long-running idle watches don’t fail immediately.
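The relaxation this comment asks for could be expressed roughly as follows; `should_retry` here is a standalone sketch of the inline gate, not the PR's actual code:

```python
def should_retry(health_check_interval, disable_retries, watch_arg, resource_version):
    """Decide whether a read timeout should trigger a reconnect.

    Unlike the PR's version, a missing resource_version no longer blocks
    the retry: reconnecting without one simply restarts the watch from
    scratch (or could be preceded by an initial list to obtain a fresh
    resource_version), so long-idle watches don't fail on their first
    health-check timeout.
    """
    return (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
    )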
```python
except (ReadTimeoutError, ProtocolError) as e:
    # Only treat a read timeout / protocol error as a silent connection drop
    should_retry = (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
        and self.resource_version is not None
    )
    if should_retry:
        # Add a small sleep to avoid a tight reconnect loop
        # in case the endpoint is hard-down or errors immediately.
        time.sleep(min(1.0,health_check_interval))
    else:
```
Catching and retrying on any ProtocolError when _health_check_interval is enabled can mask non-timeout protocol problems (e.g., bad chunked framing) and potentially lead to an infinite reconnect loop. It would be safer to only retry ProtocolError cases that are known to represent timeouts / broken connections (for example, a ProtocolError that wraps a ReadTimeoutError), and re-raise other protocol errors so callers can handle them.
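A narrower predicate along the lines this comment suggests might look like this (a sketch assuming urllib3 attaches the underlying cause to `ProtocolError.args`, as it commonly does for broken streams):

```python
from urllib3.exceptions import ProtocolError, ReadTimeoutError

def is_silent_drop(exc):
    """Retry only timeouts, or ProtocolErrors that wrap a timeout.

    Other ProtocolErrors (e.g. bad chunked framing) are not retried, so
    they still surface to the caller instead of looping forever.
    """
    if isinstance(exc, ReadTimeoutError):
        return True
    if isinstance(exc, ProtocolError):
        # Accept urllib3's ReadTimeoutError or a builtin TimeoutError as
        # the wrapped cause; anything else is a genuine protocol problem.
        return any(isinstance(arg, (ReadTimeoutError, TimeoutError))
                   for arg in exc.args)
    return False
```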
What type of PR is this?
/kind bug
/kind feature
What this PR does / why we need it:
When running a watch on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection can be silently dropped. The watcher then hangs indefinitely: no exception is raised and no new events are received. This happens because the TCP connection enters a half-open state in which the client believes the connection is still alive, even though the server side was torn down during the upgrade.
This PR adds a `_health_check_interval` parameter to `watch.stream()` that detects silent connection drops and automatically reconnects:

- When `_health_check_interval` is set to a value > 0, a socket-level read timeout (`_request_timeout`) is configured on the HTTP connection
- If the connection goes silent for longer than the interval, `urllib3` raises a `ReadTimeoutError`
- The watch then reconnects from the last known `resource_version`, ensuring no events are missed
- The feature is opt-in (default `_health_check_interval=0`), preserving full backward compatibility
- When disabled, a `ReadTimeoutError` propagates to the caller as before

Which issue(s) this PR fixes:
Fixes #2462
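Stripped of client-specific details, the reconnect loop the PR describes behaves roughly like this; `open_watch` and `is_timeout` are hypothetical stand-ins for the real client internals:

```python
import time

def watch_with_health_check(open_watch, is_timeout, health_check_interval=0):
    """Re-open the watch from the last seen resource_version on read timeouts."""
    resource_version = None
    while True:
        try:
            for event in open_watch(resource_version,
                                    read_timeout=health_check_interval or None):
                # Remember where we are so a reconnect misses no events.
                resource_version = event.get("resource_version", resource_version)
                yield event
            return  # stream ended normally
        except Exception as exc:
            if (health_check_interval > 0 and is_timeout(exc)
                    and resource_version is not None):
                # Brief pause to avoid a tight reconnect loop if the
                # endpoint is hard-down or errors immediately.
                time.sleep(min(1.0, health_check_interval))
                continue
            raise
```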
Special notes for your reviewer:

- The health check works by configuring a socket-level read timeout (`_request_timeout`) to break out of the blocking read, then catching the resulting `ReadTimeoutError`/`ProtocolError` exceptions
- The underscore-prefixed name `_health_check_interval` follows the existing convention in this codebase (e.g., `_preload_content`, `_request_timeout`) for parameters that are consumed by the client library rather than passed to the API server

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: