
docs(acp): add StatefulSet application DR solution#129

Open
tossmilestone wants to merge 1 commit into main from feat/statefulset-dr

tossmilestone (Member) commented Mar 12, 2026

Add StatefulSet application DR solution with VolSync.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Cross-Cluster Application Disaster Recovery guide for stateful applications: Active‑Passive architecture, core components (cluster roles, traffic management, data replication, GitOps-based config), architecture diagram, and configuration templates.
    • Includes step‑by‑step procedures for planned migration, emergency failover and failback, prerequisites and parameters, operational commands, replication workflows (one‑time and scheduled), and risk/RPO‑RTO considerations.

coderabbitai bot (Contributor) commented Mar 12, 2026

Walkthrough

New documentation file describing an Active‑Passive cross‑cluster disaster recovery solution for stateful applications. It covers architecture, component roles (Primary/Standby, GSLB, VolSync, GitOps), configuration templates, GitOps and VolSync setup, and step‑by‑step procedures for migrations, failover, and failback with risks and checks.

Changes

| Cohort / File(s) | Summary |
|:--|:--|
| **Documentation**<br>`docs/en/solutions/acp/StatefulSet_Application_DR.md` | Added a comprehensive guide for cross‑cluster Active‑Passive DR for StatefulSets: architecture diagram, component responsibilities (Cluster Roles, Traffic Management, VolSync, GitOps), GitOps base/overrides examples, VolSync ReplicationSource/ReplicationDestination and secret instructions, PVC/StatefulSet/Service templates, operational procedures for planned migration, emergency failover, failback, status/check commands, and risks/RPO‑RTO notes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GSLB
    participant GitOps
    participant PrimaryCluster
    participant VolSync
    participant StandbyCluster

    User->>GSLB: Resolve service DNS
    GSLB->>PrimaryCluster: Route traffic to Primary
    Note right of PrimaryCluster: Primary handles reads/writes
    PrimaryCluster->>VolSync: Replicate PVs (async rsync-tls)
    VolSync->>StandbyCluster: Stream replicated data
    GitOps->>StandbyCluster: Sync standby manifests (kustomize patches)
    alt Emergency failover
        GSLB->>StandbyCluster: Switch DNS to Standby
        StandbyCluster->>User: Serve traffic (read/write)
    end
    alt Failback
        PrimaryCluster->>VolSync: Re-sync data back
        GitOps->>PrimaryCluster: Reapply primary manifests
        GSLB->>PrimaryCluster: Route traffic back
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • tyzhou111
  • sinbadonline
  • fanzy618

Poem

🐰 I hopped through manifests, patches in paw,
VolSync whispered copies, steady and raw,
GSLB pointed paths where traffic would play,
GitOps kept the states in tidy array,
Now clusters nap safe till disaster's day. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|:--|:--|:--|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: adding documentation for a StatefulSet application disaster recovery solution, which aligns perfectly with the 405-line documentation file added. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/solutions/acp/StatefulSet_Application_DR.md`:
- Around line 291-295: Replace the indented command block for the
ReplicationSource status check with a fenced shell code block (```bash ... ```)
so it no longer triggers MD046, and update the explanatory text that references
trigger.manual to describe it as a spec field (e.g., "trigger.manual is a spec
field") rather than a label; apply the same fenced-block and wording correction
to the later occurrence that spans the other referenced lines and ensure the
command shown is the kubectl -n <application-namespace> get ReplicationSource
rs-pvc-01 -o jsonpath='{.status}' example.
- Around line 135-148: The example only creates a headless Service
(metadata.name: my-stateful-app-headless with clusterIP: None and selector app:
my-stateful-app) but omits the client-facing exposure needed for GSLB traffic
switching; update the doc to either add a separate external
Service/Ingress/Gateway example (showing a Service with a clusterIP and/or type:
LoadBalancer or an Ingress/Gateway pointing to the headless selector) or add a
clear sentence after the headless Service block explicitly stating that a
separate client-facing Service/Ingress/Gateway is required for external traffic
and must target the same selector (app: my-stateful-app) to enable the
traffic-switching flow.
- Around line 174-178: The Parameter tables (the block starting with
"**Parameters**:" that lists **application-namespace** and **psk.txt**) need
blank lines inserted immediately before and after the table to satisfy MD058 and
ensure proper rendering; update the markdown around the "**Parameters**:" table
and the other similar tables (the ones containing **application-namespace** /
**psk.txt** entries) by adding one empty line above the table header and one
empty line below the table end so each table is separated from surrounding
paragraphs.
- Around line 40-45: The diagram incorrectly shows a StatefulSet Pod in the
passive cluster; remove or hide the Pod node (PodS / "StatefulSet Pod
<br>replicas=0") from the "Secondary Cluster Passive" subgraph so the visual
matches the text (replicas=0), leaving only the Service (SvcS), Destination
PVC/Snapshot (PVCS), and VolSync ReplicationDestination (VSR_Dest) nodes in that
subgraph.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e347fff8-7779-4eed-b814-323c0b74c76b

📥 Commits

Reviewing files that changed from the base of the PR and between cfa09a1 and 50e0a99.

📒 Files selected for processing (1)
  • docs/en/solutions/acp/StatefulSet_Application_DR.md

Comment on lines +40 to +45
subgraph Secondary[Secondary Cluster Passive]
SvcS[Service]
PodS[StatefulSet Pod <br>replicas=0]
PVCS[(Destination PVC / Snapshot)]
VSR_Dest[VolSync ReplicationDestination]
end

⚠️ Potential issue | 🟡 Minor

Remove the standby Pod from the passive-cluster diagram.

Lines 40-45 show a StatefulSet Pod in the secondary cluster even though the rest of the doc defines the standby side as replicas=0. That contradiction makes the target topology ambiguous for readers following the runbook.


Comment on lines +135 to +148
apiVersion: v1
kind: Service
metadata:
  name: my-stateful-app-headless
  namespace: <application-namespace>
spec:
  clusterIP: None
  selector:
    app: my-stateful-app
  ports:
  - name: tcp
    port: 80
    targetPort: 8080
```

⚠️ Potential issue | 🟠 Major

Call out the client-facing Service separately from the headless Service.

This sample only defines a headless Service (clusterIP: None), which is fine for StatefulSet identity but does not match the GSLB traffic-switching flow described earlier. Please either add the external Service/Ingress/Gateway example here or explicitly state that a separate client-facing exposure resource is required.
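
As one illustration of the client-facing resource this comment asks for, here is a minimal sketch of a `LoadBalancer` Service that a GSLB could target. The name `my-stateful-app-external` and the `LoadBalancer` type are assumptions made for this example and do not come from the PR; the selector and ports mirror the headless Service quoted above.

```yaml
# Hypothetical client-facing Service; the name and type are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: my-stateful-app-external
  namespace: <application-namespace>
spec:
  type: LoadBalancer          # gives the GSLB a routable address per cluster
  selector:
    app: my-stateful-app      # same selector as the headless Service
  ports:
    - name: tcp
      port: 80
      targetPort: 8080
```

An Ingress or Gateway could serve the same purpose; the right choice depends on how the GSLB addresses and health-checks each cluster.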


Comment on lines +174 to +178
**Parameters**:
| **Parameter** | **Explanation** |
|:--|:--|
| **application-namespace** | The namespace of secret, should same as application |
| **psk.txt**| This field adheres to the format expected by stunnel: `<id>:<at least 32 hex digits>`. <br></br>for example, `1:23b7395fafc3e842bd8ac0fe142e6ad1`. |

⚠️ Potential issue | 🟡 Minor

Add blank lines around the parameter tables.

These tables currently violate MD058, and some Markdown renderers handle them poorly when they are not separated from surrounding paragraphs. A blank line before and after each table will make the page render more reliably.

Also applies to: 207-216, 247-257
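
For context, the quoted table appears to document the pre-shared-key Secret used for rsync-tls replication. A minimal sketch of such a Secret follows, assuming a hypothetical name `volsync-rsync-tls-key` and reusing the example key from the table purely for illustration; a real deployment would generate its own key.

```yaml
# Hypothetical Secret; only the psk.txt key and its format come from the quoted table.
apiVersion: v1
kind: Secret
metadata:
  name: volsync-rsync-tls-key
  namespace: <application-namespace>   # same namespace as the application
type: Opaque
stringData:
  # Format expected by stunnel: <id>:<at least 32 hex digits>
  psk.txt: "1:23b7395fafc3e842bd8ac0fe142e6ad1"
```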

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 175-175: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


coderabbitai bot (Contributor) left a comment

♻️ Duplicate comments (4)
docs/en/solutions/acp/StatefulSet_Application_DR.md (4)

174-178: ⚠️ Potential issue | 🟡 Minor

Add blank lines before each parameters table (MD058).

At Line 175, Line 208, and Line 248, the table starts immediately after **Parameters**:. Markdownlint MD058 expects a blank line before the table header.

Also applies to: 207-216, 247-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` around lines 174 - 178,
The tables after each "**Parameters**:" heading start immediately on the next
line and violate MD058; insert a single blank line between every
"**Parameters**:" line and its following table header (the lines beginning with
"| **Parameter** | **Explanation** |" / the table rows such as "|
**application-namespace** | ..." and "| **psk.txt**| ...") so each Parameters
section has one blank line before the table.

40-43: ⚠️ Potential issue | 🟡 Minor

Remove the standby Pod node from the passive-cluster diagram.

At Line 40–43, showing StatefulSet Pod <br>replicas=0 in the passive subgraph conflicts with the “no running workload” standby model and can mislead readers during runbook execution.

Suggested doc patch
 subgraph Secondary[Secondary Cluster Passive]
     SvcS[Service]
-    PodS[StatefulSet Pod <br>replicas=0]
     PVCS[(Destination PVC / Snapshot)]
     VSR_Dest[VolSync ReplicationDestination]
 end
@@
-GSLB -. "Standby Traffic <br>(0 Weight)" .-> SvcS
+GSLB -. "Standby Traffic <br>(0 Weight)" .-> SvcS
@@
-SvcP --> PodP
+SvcP --> PodP
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` around lines 40 - 43,
The passive-cluster diagram incorrectly includes the standby Pod node
"PodS[StatefulSet Pod <br>replicas=0]" which contradicts the intended "no
running workload" standby model; remove the PodS node (and its label
"StatefulSet Pod <br>replicas=0") from the Secondary subgraph so only the Service
(SvcS), Destination PVC/Snapshot (PVCS), and VolSync ReplicationDestination
(VSR_Dest) nodes remain, ensuring the diagram reflects a passive cluster with no
pods running.

135-148: ⚠️ Potential issue | 🟠 Major

Document a separate client-facing Service/Ingress for GSLB traffic switching.

At Line 135–148, only a headless Service is defined (clusterIP: None). That is fine for StatefulSet identity, but it does not represent the client-facing endpoint required by the GSLB flow described earlier.

Suggested doc patch
 apiVersion: v1
 kind: Service
 metadata:
   name: my-stateful-app-headless
@@
   - name: tcp
     port: 80
     targetPort: 8080
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-stateful-app
+  namespace: <application-namespace>
+spec:
+  selector:
+    app: my-stateful-app
+  ports:
+  - name: tcp
+    port: 80
+    targetPort: 8080

Or add one explicit sentence after the block: “A separate client-facing Service/Ingress/Gateway is required for GSLB and must target app: my-stateful-app.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` around lines 135 - 148,
The current manifest only defines a headless Service (clusterIP: None) for
StatefulSet identity but lacks a client-facing endpoint for GSLB; update the doc
by adding either a separate Service/Ingress/Gateway example that selects app:
my-stateful-app (i.e., a normal ClusterIP/LoadBalancer/Ingress/Gateway resource)
or insert one clear sentence immediately after the headless Service block
stating: “A separate client-facing Service/Ingress/Gateway is required for GSLB
and must target app: my-stateful-app.” Reference the headless Service block
(name: my-stateful-app-headless / selector: app: my-stateful-app) when adding
the new Service/Ingress/Gateway guidance.

291-295: ⚠️ Potential issue | 🟡 Minor

Use a proper fenced block and call trigger.manual a spec field.

At Line 291–295, the command block is still indented, which triggers MD046. At Line 313, trigger.manual is described as a “label,” but in your YAML it is a spec field.

Suggested doc patch
-    Check synchronization from the `ReplicationSource`.
-
-    ```bash
-    kubectl -n <application-namespace> get ReplicationSource rs-pvc-01 -o jsonpath='{.status}'
-    ```
+Check synchronization from the `ReplicationSource`.
+
+```bash
+kubectl -n <application-namespace> get ReplicationSource rs-pvc-01 -o jsonpath='{.status}'
+```
@@
-- Create a One-Time `ReplicationSource` with the `trigger.manual` label on the primary cluster.
+- Create a One-Time `ReplicationSource` with the `trigger.manual` spec field on the primary cluster.

Also applies to: 313-313
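
To make the label-versus-field distinction concrete, a one-time `ReplicationSource` might look like the sketch below, where `trigger.manual` sits under `spec` rather than under `metadata.labels`. The name `rs-pvc-01` comes from the status command above; the PVC name, Secret name, trigger value, and address placeholder are assumptions for illustration.

```yaml
# Sketch of a one-time ReplicationSource; values marked hypothetical are not from the PR.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: rs-pvc-01
  namespace: <application-namespace>
spec:
  sourcePVC: pvc-01                    # hypothetical PVC name
  trigger:
    manual: first-sync                 # spec field, not a label; hypothetical value
  rsyncTLS:
    keySecret: volsync-rsync-tls-key   # hypothetical Secret holding psk.txt
    address: <destination-address>     # address of the ReplicationDestination
    copyMethod: Snapshot
```

In VolSync, changing the string under `spec.trigger.manual` requests one synchronization, and the controller reports completion by copying that string to `status.lastManualSync`, which is what the `jsonpath='{.status}'` query surfaces.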

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` around lines 291 - 295,
Remove the indentation for the bash snippet so it uses a proper fenced code
block (```bash ... ```) around the kubectl command for ReplicationSource and
update the text that mentions trigger.manual to describe it as a spec field
rather than a label (i.e., change "trigger.manual label" to "trigger.manual spec
field"); ensure references to ReplicationSource and trigger.manual are corrected
consistently (e.g., the Get command block and the "Create a One-Time
ReplicationSource" sentence).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9aaeeecf-2453-4f9a-9f56-81069ae4dfe9

📥 Commits

Reviewing files that changed from the base of the PR and between 50e0a99 and 425569e.

📒 Files selected for processing (1)
  • docs/en/solutions/acp/StatefulSet_Application_DR.md

Add StatefulSet application DR solution with VolSync.

Signed-off-by: Xiaoxi He <xxhe@alauda.io>
coderabbitai bot (Contributor) left a comment

🧹 Nitpick comments (3)
docs/en/solutions/acp/StatefulSet_Application_DR.md (3)

178-178: Use standard line break syntax.

The <br></br> tag is non-standard HTML. Use either <br/> (self-closing tag) or double-space markdown line breaks for better compatibility.

✏️ Proposed fix
-| **psk.txt**| This field adheres to the format expected by stunnel: `<id>:<at least 32 hex digits>`. <br></br>for example, `1:23b7395fafc3e842bd8ac0fe142e6ad1`. |
+| **psk.txt**| This field adheres to the format expected by stunnel: `<id>:<at least 32 hex digits>`. <br/>For example, `1:23b7395fafc3e842bd8ac0fe142e6ad1`. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` at line 178, Replace the
non-standard <br></br> usage in the table cell that describes **psk.txt** with a
standard line break: either use a self-closing HTML break tag (<br/>) or convert
to a Markdown line break (end the previous line with two spaces) so the example
`1:23b7395fafc3e842bd8ac0fe142e6ad1` renders consistently; update the table cell
text containing "**psk.txt**| This field adheres to the format expected by
stunnel: `<id>:<at least 32 hex digits>`. <br></br>for example, ..."
accordingly.

403-403: Consider using "before" instead of "prior to".

The phrase "prior to a disaster" can be simplified to "before a disaster" for more concise writing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` at line 403, Replace the
phrase "prior to a disaster" in the sentence that reads "The storage and
computational resources sit idle prior to a disaster, inaccessible to the
external network, and incapable of offloading query pressure like an
Active-Active architecture would." with the simpler "before a disaster" so the
sentence reads "...sit idle before a disaster..."; update the document text in
StatefulSet_Application_DR.md accordingly.

80-80: Consider simplifying "In order to construct".

The phrase "In order to construct" can be shortened to "To construct" for more concise writing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/acp/StatefulSet_Application_DR.md` at line 80, Replace the
verbose phrase "In order to construct the \"Active-Passive\" setup" with the
concise "To construct the \"Active-Passive\" setup" in the sentence that
describes using Kustomize overrides to set `replicas` (the line mentioning
GitOps Base Configuration Sync and Kustomize overrides manipulating `replicas`
so Primary is `N` and Secondary is `replicas=0`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a25e62d4-791b-4fae-a089-b14154f78d96

📥 Commits

Reviewing files that changed from the base of the PR and between 425569e and 4b96dd6.

📒 Files selected for processing (1)
  • docs/en/solutions/acp/StatefulSet_Application_DR.md
