Add health check to `watch.stream` for silent connection drops by Urvashi0109 · Pull Request #2527 · kubernetes-client/python

Urvashi0109 · 2026-03-20T08:31:49Z

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

When a watch is running on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection is silently dropped. The watcher hangs indefinitely: no exception is raised and no new events are received. This happens because the TCP connection enters a half-open state in which the client believes the connection is still alive, but the server side has been torn down during the upgrade.
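
The hang can be reproduced in miniature with a plain socket pair: a blocking read on a silent peer does not return unless a timeout is set. The sketch below is only an illustration using Python's standard `socket` module; `read_with_timeout` is a hypothetical helper and not part of the Kubernetes client.

```python
import socket


def read_with_timeout(sock, timeout_seconds):
    """Return received bytes, or None if the peer stays silent too long."""
    sock.settimeout(timeout_seconds)  # without this, recv() could block forever
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None


# A connected pair of sockets. At first `server` sends nothing, which mimics
# a peer that has gone away without closing the connection.
client, server = socket.socketpair()
silent_result = read_with_timeout(client, 0.05)

# Once data arrives, the same call returns it instead of timing out.
server.sendall(b"event")
live_result = read_with_timeout(client, 0.05)

client.close()
server.close()
```

A read timeout turns "no data" into an observable signal, which is the property the PR relies on.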

This PR adds a _health_check_interval parameter to watch.stream() that detects silent connection drops and automatically reconnects:

When _health_check_interval is set to a value > 0, a socket-level read timeout (_request_timeout) is configured on the HTTP connection
If no data arrives within the specified interval, urllib3 raises a ReadTimeoutError
The watch catches this exception and automatically reconnects using the last known resource_version, ensuring no events are missed
The feature is disabled by default (_health_check_interval=0), preserving full backward compatibility
When disabled, ReadTimeoutError propagates to the caller as before
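
The behaviour listed above can be sketched without a cluster. This is a simplified model, not the PR's implementation: `ReadTimeout` is a stand-in for urllib3's `ReadTimeoutError`, and `open_stream`, `backoff`, `max_reconnects`, and `make_fake_stream` are hypothetical names introduced only for the illustration. It also omits the extra retry conditions (`disable_retries`, `watch_arg`) that the real change checks.

```python
import time


class ReadTimeout(Exception):
    """Stand-in for urllib3's ReadTimeoutError (illustration only)."""


def watch_with_health_check(open_stream, health_check_interval=0,
                            backoff=0.0, max_reconnects=3):
    """Yield (version, payload) events, reconnecting after read timeouts."""
    resource_version = None
    reconnects = 0
    while True:
        try:
            for version, payload in open_stream(resource_version):
                resource_version = version  # remember where to resume from
                yield version, payload
            return  # the stream ended normally
        except ReadTimeout:
            # Disabled (interval of 0): propagate to the caller as before.
            if health_check_interval <= 0 or reconnects >= max_reconnects:
                raise
            reconnects += 1
            time.sleep(backoff)  # brief pause to avoid a tight reconnect loop


def make_fake_stream():
    """Build a fake stream that times out once, then resumes."""
    calls = []

    def open_stream(resource_version):
        calls.append(resource_version)
        if len(calls) == 1:
            yield "1", "ADDED"
            yield "2", "MODIFIED"
            raise ReadTimeout("no data within the interval")
        yield "3", "DELETED"

    return open_stream, calls


open_stream, calls = make_fake_stream()
events = list(watch_with_health_check(open_stream, health_check_interval=5))
```

In this model the second connection is opened with the last version seen before the timeout, so the caller receives one uninterrupted sequence of events.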

Which issue(s) this PR fixes:

Fixes #2462

Special notes for your reviewer:

This PR takes the following approach: it leverages urllib3's existing read timeout mechanism (_request_timeout) to break out of the blocking read, then catches the resulting ReadTimeoutError/ProtocolError exceptions.
The _ prefix on _health_check_interval follows the existing convention in this codebase (e.g., _preload_content, _request_timeout) for parameters that are consumed by the client library rather than passed to the API server.
5 new unit tests were added; all 24 tests (19 existing + 5 new) pass with zero regressions.

Does this PR introduce a user-facing change?

Added `_health_check_interval` parameter to `watch.stream()` to detect and recover from silent connection drops during Kubernetes control plane upgrades. When set to a value > 0 (seconds), the watch will automatically reconnect if no data is received within the specified interval. Disabled by default for backward compatibility.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

k8s-ci-robot · 2026-03-20T08:31:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Urvashi0109
Once this PR has been reviewed and has the lgtm label, please assign yliaog for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

kubernetes/base/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Adds an optional client-side “health check” mechanism to Watch.stream() to detect silent/half-open watch connections (e.g., during control plane upgrades) by configuring a read timeout and reconnecting automatically.

Changes:

Add _health_check_interval parameter to Watch.stream() that sets _request_timeout (read timeout) when enabled.
Catch ReadTimeoutError / ProtocolError during streaming to trigger a reconnect using the last known resource_version.
Add unit tests covering reconnect behavior, default behavior, and _request_timeout handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
kubernetes/base/watch/watch.py	Implements `_health_check_interval` handling, timeout configuration, and reconnect-on-timeout logic.
kubernetes/base/watch/watch_test.py	Adds unit tests for the new health-check/reconnect behavior and timeout parameter interactions.


Copilot · 2026-03-20T08:37:32Z

kubernetes/base/watch/watch.py

+            except (ReadTimeoutError, ProtocolError) as e:
+                # Only treat a read timeout / protocol error as a silent connection drop
+                should_retry = (
+                    health_check_interval > 0
+                    and not disable_retries
+                    and watch_arg == "watch"
+                    and self.resource_version is not None
+                )
+                if should_retry:
+                    # Add a small sleep to avoid a tight reconnect loop
+                    # in case the endpoint is hard-down or errors immediately.
+                    time.sleep(min(1.0,health_check_interval))
+                else:
+                    raise


The new retry behavior for ProtocolError isn’t covered by tests (current tests only simulate ReadTimeoutError). Add a unit test that raises ProtocolError (ideally wrapping a ReadTimeoutError, as urllib3 often does for broken streams) to verify the watch reconnects only in the intended cases and that non-timeout ProtocolErrors still propagate.

Copilot · 2026-03-20T08:37:32Z

kubernetes/base/watch/watch.py

+                if should_retry:
+                    # Add a small sleep to avoid a tight reconnect loop
+                    # in case the endpoint is hard-down or errors immediately.
+                    time.sleep(min(1.0,health_check_interval))


Minor style: add a space after the comma in min(1.0, health_check_interval) to match typical formatting in this file / PEP8.

Suggested change

-                    time.sleep(min(1.0,health_check_interval))
+                    time.sleep(min(1.0, health_check_interval))

Copilot · 2026-03-20T08:37:32Z

kubernetes/base/watch/watch.py

+                should_retry = (
+                    health_check_interval > 0
+                    and not disable_retries
+                    and watch_arg == "watch"
+                    and self.resource_version is not None
+                )


With _health_check_interval enabled, a ReadTimeoutError will occur whenever the watch is idle longer than the interval. The current retry gate requires self.resource_version is not None, and the finally block sets self._stop = True when it’s None, so an idle watch that hasn’t observed an event yet will raise instead of reconnecting. Consider allowing retries even when resource_version hasn’t been observed yet (or performing an initial list to obtain one) and avoid forcing _stop=True in that case so long-running idle watches don’t fail immediately.
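
One way to picture the suggestion is a retry gate that no longer depends on having seen an event. This is a hedged sketch, not the PR's code: `plan_reconnect` is a hypothetical helper, and it assumes that a watch request sent without a `resource_version` is acceptable for a stream that has not yet delivered anything. How the API server treats such a request (for example, which events it replays) is outside this sketch.

```python
def plan_reconnect(health_check_interval, disable_retries, watch_arg,
                   resource_version):
    """Return kwargs for the next watch request, or None to re-raise."""
    if health_check_interval <= 0 or disable_retries or watch_arg != "watch":
        return None  # health check not applicable: let the error propagate
    if resource_version is None:
        # No event observed yet: reconnect with a fresh watch instead of failing.
        return {}
    # Resume from the last event that was delivered to the caller.
    return {"resource_version": resource_version}


idle_plan = plan_reconnect(30, False, "watch", None)
resume_plan = plan_reconnect(30, False, "watch", "12345")
disabled_plan = plan_reconnect(0, False, "watch", "12345")
```

Returning an empty dict rather than None for the idle case keeps "retry without a version" distinct from "do not retry".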

Copilot · 2026-03-20T08:37:33Z

kubernetes/base/watch/watch.py

+            except (ReadTimeoutError, ProtocolError) as e:
+                # Only treat a read timeout / protocol error as a silent connection drop
+                should_retry = (
+                    health_check_interval > 0
+                    and not disable_retries
+                    and watch_arg == "watch"
+                    and self.resource_version is not None
+                )
+                if should_retry:
+                    # Add a small sleep to avoid a tight reconnect loop
+                    # in case the endpoint is hard-down or errors immediately.
+                    time.sleep(min(1.0,health_check_interval))
+                else:


Catching and retrying on any ProtocolError when _health_check_interval is enabled can mask non-timeout protocol problems (e.g., bad chunked framing) and potentially lead to an infinite reconnect loop. It would be safer to only retry ProtocolError cases that are known to represent timeouts / broken connections (for example, a ProtocolError that wraps a ReadTimeoutError), and re-raise other protocol errors so callers can handle them.
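
A narrower check along these lines might look like the sketch below. The two exception classes are local stand-ins that only mirror the names of urllib3's `ReadTimeoutError` and `ProtocolError`, so the example runs without urllib3 installed; `is_timeout_like` is a hypothetical helper. It assumes the wrapped timeout is reachable either through the exception's `args` or through its `__cause__`, which should be verified against the urllib3 version in use.

```python
class ReadTimeoutError(Exception):
    """Stand-in for urllib3.exceptions.ReadTimeoutError (illustration only)."""


class ProtocolError(Exception):
    """Stand-in for urllib3.exceptions.ProtocolError (illustration only)."""


def is_timeout_like(exc):
    """Return True only for read timeouts or protocol errors that wrap one."""
    if isinstance(exc, ReadTimeoutError):
        return True
    if isinstance(exc, ProtocolError):
        # Look at the wrapped exception passed as an argument, and at the
        # explicit cause set by `raise ... from ...`.
        candidates = list(exc.args) + [exc.__cause__]
        return any(isinstance(c, ReadTimeoutError) for c in candidates)
    return False


wrapped = ProtocolError("Connection broken", ReadTimeoutError("read timed out"))
framing = ProtocolError("bad chunked framing")

retry_wrapped = is_timeout_like(wrapped)   # timeout inside a protocol error
retry_framing = is_timeout_like(framing)   # unrelated protocol problem
```

With a predicate like this, the `except` clause can keep catching both exception types but re-raise anything that is not timeout-related.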

Added health check to watch.stream for silent connection drops

1174077

k8s-ci-robot added the labels release-note (denotes a PR that will be considered when it comes time to generate release notes) and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) Mar 20, 2026

k8s-ci-robot added the labels kind/bug (categorizes issue or PR as related to a bug) and kind/feature (categorizes issue or PR as related to a new feature) Mar 20, 2026

k8s-ci-robot requested review from fabianvf and yliaog March 20, 2026 08:31

k8s-ci-robot added the labels size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) Mar 20, 2026

Urvashi0109 marked this pull request as ready for review March 20, 2026 08:33

Copilot AI review requested due to automatic review settings March 20, 2026 08:33

k8s-ci-robot removed the label do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) Mar 20, 2026

Copilot started reviewing on behalf of Urvashi0109 March 20, 2026 08:33

Urvashi0109 mentioned this pull request Mar 20, 2026

Fail watch gracefully on control plane upgrade #2462

Open

Copilot AI reviewed Mar 20, 2026

Urvashi0109 wants to merge 1 commit into kubernetes-client:master from Urvashi0109:Fix-Added-Watch-Health-Check


3 participants
