Add health check to watch.stream for silent connection drops #2527

Open
Urvashi0109 wants to merge 1 commit into kubernetes-client:master from Urvashi0109:Fix-Added-Watch-Health-Check

@Urvashi0109
Contributor

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

When a watch is running on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection is silently dropped. The watcher hangs indefinitely: no exception is raised and no new events are received. This happens because the TCP connection enters a half-open state in which the client believes the connection is still alive, but the server side has been torn down during the upgrade.
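The effect of a read timeout on a silent connection can be illustrated with the standard library alone. This is a minimal sketch, not code from this PR: `read_with_timeout` is a hypothetical helper, and a local `socketpair()` whose peer never sends anything only approximates a half-open TCP connection.

```python
import socket


def read_with_timeout(timeout_seconds: float) -> str:
    """Try to read from a peer that never sends anything."""
    # socketpair() returns two connected sockets. `silent_peer` stands in for
    # a server that has gone away without the client noticing (assumption:
    # this only approximates a real half-open TCP connection).
    reader, silent_peer = socket.socketpair()
    try:
        reader.settimeout(timeout_seconds)
        try:
            reader.recv(1024)
            return "data"
        except socket.timeout:
            # Without settimeout(), recv() would block here indefinitely.
            return "timed out"
    finally:
        reader.close()
        silent_peer.close()


print(read_with_timeout(0.05))
```

With no timeout configured, the same `recv()` call would never return, which matches the hang described above.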

This PR adds a _health_check_interval parameter to watch.stream() that detects silent connection drops and automatically reconnects:

  • When _health_check_interval is set to a value > 0, a socket-level read timeout (_request_timeout) is configured on the HTTP connection
  • If no data arrives within the specified interval, urllib3 raises a ReadTimeoutError
  • The watch catches this exception and automatically reconnects using the last known resource_version, ensuring no events are missed
  • The feature is disabled by default (_health_check_interval=0), preserving full backward compatibility
  • When disabled, ReadTimeoutError propagates to the caller as before
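The behaviour listed above can be sketched as a small reconnect loop. This is an illustrative model under stated assumptions, not the code in `watch.py`: `FakeReadTimeout` stands in for urllib3's `ReadTimeoutError`, `open_stream` and `make_fake_server` are hypothetical helpers, and `max_reconnects` is an extra guard added here only to keep the example finite.

```python
class FakeReadTimeout(Exception):
    """Stand-in for urllib3's ReadTimeoutError (hypothetical, for illustration)."""


def watch_with_health_check(open_stream, health_check_interval=0, max_reconnects=3):
    """Yield events; on a read timeout, reconnect from the last resource version."""
    resource_version = None
    reconnects = 0
    while True:
        try:
            for event in open_stream(resource_version):
                # Remember where we are so a reconnect can resume from here.
                resource_version = event["resource_version"]
                yield event
            return  # the stream ended normally
        except FakeReadTimeout:
            # Disabled (interval <= 0): propagate to the caller as before.
            if health_check_interval <= 0 or reconnects >= max_reconnects:
                raise
            reconnects += 1


def make_fake_server():
    """Build a fake stream that goes silent after its first event."""
    calls = []

    def open_stream(resource_version):
        calls.append(resource_version)
        if resource_version is None:
            yield {"resource_version": "1", "type": "ADDED"}
            raise FakeReadTimeout("no data within interval")
        yield {"resource_version": "2", "type": "MODIFIED"}

    return open_stream, calls


open_stream, calls = make_fake_server()
events = list(watch_with_health_check(open_stream, health_check_interval=5))
print([e["type"] for e in events], calls)
```

In this model the second connection is opened with the resource version of the last event seen, and leaving `health_check_interval` at 0 lets the timeout reach the caller.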

Which issue(s) this PR fixes:

Fixes #2462

Special notes for your reviewer:

  • This PR takes the following approach: it leverages urllib3's existing read timeout mechanism (_request_timeout) to break out of the blocking read, then catches the resulting ReadTimeoutError/ProtocolError exceptions
  • The _ prefix on _health_check_interval follows the existing convention in this codebase (e.g., _preload_content, _request_timeout) for parameters that are consumed by the client library rather than passed to the API server
  • Adds 5 new unit tests; all 24 tests (19 existing + 5 new) pass with no regressions

Does this PR introduce a user-facing change?

Added `_health_check_interval` parameter to `watch.stream()` to detect and recover from silent connection drops during Kubernetes control plane upgrades. When set to a value > 0 (seconds), the watch will automatically reconnect if no data is received within the specified interval. Disabled by default for backward compatibility.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.) and do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) labels Mar 20, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Urvashi0109
Once this PR has been reviewed and has the lgtm label, please assign yliaog for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the kind/bug (Categorizes issue or PR as related to a bug.) and kind/feature (Categorizes issue or PR as related to a new feature.) labels Mar 20, 2026
@k8s-ci-robot requested review from fabianvf and yliaog March 20, 2026 08:31
@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Mar 20, 2026
@Urvashi0109 marked this pull request as ready for review March 20, 2026 08:33
Copilot AI review requested due to automatic review settings March 20, 2026 08:33
@k8s-ci-robot removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label Mar 20, 2026

Copilot AI left a comment

Pull request overview

Adds an optional client-side “health check” mechanism to Watch.stream() to detect silent/half-open watch connections (e.g., during control plane upgrades) by configuring a read timeout and reconnecting automatically.

Changes:

  • Add _health_check_interval parameter to Watch.stream() that sets _request_timeout (read timeout) when enabled.
  • Catch ReadTimeoutError / ProtocolError during streaming to trigger a reconnect using the last known resource_version.
  • Add unit tests covering reconnect behavior, default behavior, and _request_timeout handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
kubernetes/base/watch/watch.py | Implements _health_check_interval handling, timeout configuration, and reconnect-on-timeout logic.
kubernetes/base/watch/watch_test.py | Adds unit tests for the new health-check/reconnect behavior and timeout parameter interactions.

Comment on lines +256 to +269
except (ReadTimeoutError, ProtocolError) as e:
    # Only treat a read timeout / protocol error as a silent connection drop
    should_retry = (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
        and self.resource_version is not None
    )
    if should_retry:
        # Add a small sleep to avoid a tight reconnect loop
        # in case the endpoint is hard-down or errors immediately.
        time.sleep(min(1.0,health_check_interval))
    else:
        raise

Copilot AI Mar 20, 2026

The new retry behavior for ProtocolError isn’t covered by tests (current tests only simulate ReadTimeoutError). Add a unit test that raises ProtocolError (ideally wrapping a ReadTimeoutError, as urllib3 often does for broken streams) to verify the watch reconnects only in the intended cases and that non-timeout ProtocolErrors still propagate.

Copilot uses AI. Check for mistakes.
if should_retry:
    # Add a small sleep to avoid a tight reconnect loop
    # in case the endpoint is hard-down or errors immediately.
    time.sleep(min(1.0,health_check_interval))

Copilot AI Mar 20, 2026

Minor style: add a space after the comma in min(1.0, health_check_interval) to match typical formatting in this file / PEP8.

Suggested change
- time.sleep(min(1.0,health_check_interval))
+ time.sleep(min(1.0, health_check_interval))

Comment on lines +258 to +263
should_retry = (
    health_check_interval > 0
    and not disable_retries
    and watch_arg == "watch"
    and self.resource_version is not None
)

Copilot AI Mar 20, 2026

With _health_check_interval enabled, a ReadTimeoutError will occur whenever the watch is idle longer than the interval. The current retry gate requires self.resource_version is not None, and the finally block sets self._stop = True when it’s None, so an idle watch that hasn’t observed an event yet will raise instead of reconnecting. Consider allowing retries even when resource_version hasn’t been observed yet (or performing an initial list to obtain one) and avoid forcing _stop=True in that case so long-running idle watches don’t fail immediately.
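The difference this makes for an idle watch can be sketched by comparing two retry conditions. `current_gate` mirrors the condition quoted in this comment; `relaxed_gate` is a hypothetical variant that drops the resource_version requirement, not something the PR implements. The sketch does not cover the finally block that sets `_stop`, which would also need to change.

```python
def current_gate(health_check_interval, disable_retries, watch_arg, resource_version):
    """Retry condition as quoted in the review excerpt."""
    return (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
        and resource_version is not None
    )


def relaxed_gate(health_check_interval, disable_retries, watch_arg):
    """Hypothetical variant: an idle watch may retry before any event arrives."""
    return (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
    )


# An idle watch that has not yet observed an event has no resource_version.
print(current_gate(30, False, "watch", None))  # the timeout would be re-raised
print(relaxed_gate(30, False, "watch"))        # the watch would reconnect
```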

Comment on lines +256 to +268
except (ReadTimeoutError, ProtocolError) as e:
    # Only treat a read timeout / protocol error as a silent connection drop
    should_retry = (
        health_check_interval > 0
        and not disable_retries
        and watch_arg == "watch"
        and self.resource_version is not None
    )
    if should_retry:
        # Add a small sleep to avoid a tight reconnect loop
        # in case the endpoint is hard-down or errors immediately.
        time.sleep(min(1.0,health_check_interval))
    else:

Copilot AI Mar 20, 2026

Catching and retrying on any ProtocolError when _health_check_interval is enabled can mask non-timeout protocol problems (e.g., bad chunked framing) and potentially lead to an infinite reconnect loop. It would be safer to only retry ProtocolError cases that are known to represent timeouts / broken connections (for example, a ProtocolError that wraps a ReadTimeoutError), and re-raise other protocol errors so callers can handle them.
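One possible shape for such a check is a helper that looks for a read timeout inside the error before deciding to retry. This is a sketch under assumptions: `ReadTimeoutStandIn` and `ProtocolErrorStandIn` are local stand-ins for the urllib3 exception classes, `wraps_read_timeout` is a hypothetical name, and because how the inner exception is attached may vary, the helper inspects both `args` and the `__cause__`/`__context__` chain.

```python
class ReadTimeoutStandIn(Exception):
    """Stand-in for urllib3's ReadTimeoutError (illustration only)."""


class ProtocolErrorStandIn(Exception):
    """Stand-in for urllib3's ProtocolError (illustration only)."""


def wraps_read_timeout(exc):
    """Return True if `exc` is, or carries, a read timeout."""
    seen = set()
    pending = [exc]
    while pending:
        current = pending.pop()
        if current is None or id(current) in seen:
            continue
        seen.add(id(current))
        if isinstance(current, ReadTimeoutStandIn):
            return True
        # The inner error may be passed as a constructor argument...
        pending.extend(a for a in current.args if isinstance(a, BaseException))
        # ...or attached through exception chaining.
        pending.append(current.__cause__)
        pending.append(current.__context__)
    return False


timeout_like = ProtocolErrorStandIn("Connection broken", ReadTimeoutStandIn("idle"))
framing_error = ProtocolErrorStandIn("bad chunked framing")
print(wraps_read_timeout(timeout_like), wraps_read_timeout(framing_error))
```

A retry gate could then re-raise any protocol error for which this helper returns False, so that framing problems still reach the caller.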

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

Development

Successfully merging this pull request may close these issues.

Fail watch gracefully on control plane upgrade

3 participants