This issue proposes a comprehensive set of intelligent diagnosis tools for the Dubbo Admin AI Agent. The goal is to improve the agent's ability to diagnose and resolve issues in Dubbo microservices across the three deployment modes (Universal, Half, K8s) using multi-dimensional observability data (metrics, logs, traces).
Current State
The current Dubbo Admin AI Agent in the ai/ directory only uses mock tools. While Dubbo Admin has rich observability capabilities and service management APIs, these are not exposed to the AI agent for intelligent diagnosis.
Existing Capabilities Identified
Observability Infrastructure:
- Prometheus integration with metric collection from Dubbo instances
- Grafana dashboard integration for visualization
- Distributed tracing support with dashboard links
- Comprehensive logging infrastructure based on Zap
- Real-time metrics and monitoring capabilities
Service Management APIs:
- Complete CRUD operations for services, instances, and applications
- Traffic rule management (condition routes, tag routes)
- Configuration management (timeout, retry, load balancing)
- Multi-deployment mode support (Universal, Half, K8s)
- K8s resource models and management capabilities
Missing Integration
None of these existing capabilities are currently wired into the AI agent's tools. To close that gap, the agent needs structured APIs to:
- Query and analyze observability data
- Perform intelligent diagnosis using LLM reasoning
- Execute safe recovery operations
- Provide comprehensive root cause analysis
Proposed Solution
Phase 1: Foundation Tools (High Priority)
1. Metrics Query Tools
query_service_metrics - Query basic metrics (QPS, RT, success rate) with filtering
get_application_overview - Get application health status and summary
analyze_metrics_anomaly - Detect metric anomalies based on historical data
compare_instance_performance - Compare performance across multiple instances
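As a rough illustration of what analyze_metrics_anomaly could do, the sketch below flags recent samples that deviate strongly from a historical baseline using a z-score. The function name, the threshold of 3.0, and the choice of z-score itself are assumptions for illustration; the issue does not prescribe a detection method.

```python
from statistics import mean, stdev


def detect_anomalies(history, recent, threshold=3.0):
    """Flag recent samples whose z-score against the history exceeds threshold.

    history: past values of one metric (e.g. RT in ms), excluding `recent`.
    recent:  the window under inspection.
    Returns a list of (index_in_recent, value, z_score).
    """
    if len(history) < 2:
        # Not enough data to estimate a spread; report nothing.
        return []
    mu = mean(history)
    sigma = stdev(history)
    anomalies = []
    for i, value in enumerate(recent):
        if sigma == 0:
            # A flat baseline: any different value counts as anomalous.
            if value != mu:
                anomalies.append((i, value, float("inf")))
            continue
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, value, z))
    return anomalies
```

A production version would likely need seasonality-aware baselines, since QPS and RT often follow daily patterns that a plain z-score ignores.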
2. Log Analysis Tools
search_service_logs - Search logs by service, instance, keywords, time range
analyze_error_logs - Analyze error patterns and frequency
correlate_logs_with_metrics - Correlate logs with metric anomalies
trace_error_propagation - Track error propagation in call chains
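One possible shape for analyze_error_logs is sketched below: error lines are normalized by masking numbers, then counted per pattern. The plain-text line format (timestamp, level, message) is an assumption; real Zap output may be JSON and would need a different parser.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def error_patterns(lines, level="ERROR"):
    """Group log lines at `level` by message pattern and count occurrences.

    Numbers are replaced with <n> so that messages differing only in
    durations, ports, or IDs fall into the same pattern.
    Returns [(pattern, count), ...] sorted by descending count.
    """
    counts = Counter()
    for line in lines:
        if level not in line:
            continue
        # Keep only the text after the level marker.
        message = line.split(level, 1)[1].strip()
        counts[_NUMBER.sub("<n>", message)] += 1
    return counts.most_common()
```

The resulting pattern list is small enough to pass to the LLM as context, which is usually more practical than passing raw log lines.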
3. Basic Service Management Tools
list_applications - List applications with filtering and pagination
get_service_details - Get comprehensive service information
list_service_instances - List service instances with health status
get_instance_status - Get detailed instance status and health checks
Phase 2: Advanced Analysis Tools (Medium Priority)
4. Distributed Tracing Tools
query_service_traces - Query distributed traces with filtering
analyze_trace_performance - Analyze trace performance and bottlenecks
detect_trace_anomalies - Detect trace anomalies (failures, timeouts)
map_service_dependencies - Build service dependency topology
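To make map_service_dependencies more concrete, here is a minimal sketch that derives service-to-service edges from trace spans. The span fields (span_id, parent_id, service) are hypothetical and would need to be mapped from whatever tracing backend is in use.

```python
from collections import defaultdict


def build_dependency_map(spans):
    """Derive caller -> callee edges between services from trace spans.

    spans: iterable of dicts with keys "span_id", "parent_id" (None for
           root spans), and "service".
    Returns {(caller_service, callee_service): call_count}.
    """
    spans = list(spans)
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(int)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Skip root spans, orphans, and calls that stay inside one service.
        if parent is None or parent["service"] == span["service"]:
            continue
        edges[(parent["service"], span["service"])] += 1
    return dict(edges)
```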
5. Traffic Management Tools
list_traffic_rules - List all traffic control rules
get_traffic_rule_details - Get detailed rule configuration and impact
analyze_traffic_distribution - Analyze traffic patterns and anomalies
simulate_traffic_impact - Simulate the impact of traffic rule changes
6. Configuration Management Tools
get_service_config - Get service configuration details
list_config_changes - List configuration change history
analyze_config_consistency - Analyze configuration consistency
validate_config_changes - Validate the safety of configuration changes
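A simple reading of analyze_config_consistency is "find settings that differ between instances of the same service". The sketch below does that for flat key/value configs; the key names in the usage example (timeout, retries, loadbalance) and the instance addresses are illustrative only.

```python
_MISSING = "<missing>"


def find_inconsistent_keys(configs):
    """Report config keys whose values differ across instances.

    configs: {instance_name: {key: value}}
    Returns {key: {value_as_str: [instance_name, ...]}} containing only
    keys with more than one distinct value (a missing key counts as a
    distinct value).
    """
    all_keys = set()
    for cfg in configs.values():
        all_keys.update(cfg)
    report = {}
    for key in sorted(all_keys):
        groups = {}
        for instance, cfg in configs.items():
            groups.setdefault(str(cfg.get(key, _MISSING)), []).append(instance)
        if len(groups) > 1:
            report[key] = groups
    return report


if __name__ == "__main__":
    example = {
        "10.0.0.1:20880": {"timeout": 3000, "retries": 2, "loadbalance": "random"},
        "10.0.0.2:20880": {"timeout": 3000, "retries": 0, "loadbalance": "random"},
    }
    print(find_inconsistent_keys(example))
```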
Phase 3: Intelligent Diagnosis Tools (Low Priority)
7. Cross-Mode Management Tools
get_deployment_mode - Get deployment mode information
list_k8s_resources - List K8s resources in K8s mode
analyze_cross_mode_consistency - Analyze cross-mode consistency
migrate_service_mode - Assist with deployment mode migration
8. Intelligent Diagnosis Tools
diagnose_service_issues - Comprehensive issue diagnosis using multi-dimensional data
predict_service_anomalies - Predict potential anomalies based on history
generate_recovery_plan - Generate automated recovery plans
execute_safe_recovery - Execute safe recovery operations
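What "safe" means for execute_safe_recovery is not defined in this issue. One conservative interpretation, sketched below, is an allowlist of actions plus dry-run by default. The action names, the RecoveryStep type, and the apply callback are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist; the real set of permitted actions is undecided.
SAFE_ACTIONS = {"adjust_timeout", "adjust_retries", "disable_tag_route"}


@dataclass
class RecoveryStep:
    action: str
    target: str
    params: dict = field(default_factory=dict)


def execute_plan(steps, apply, dry_run=True):
    """Validate every step against the allowlist, then apply or preview.

    apply:   callable taking one RecoveryStep; only invoked when
             dry_run is False.
    Returns a list of (status, step) with status "planned" or "applied".
    Raises ValueError before doing anything if any step is not allowed.
    """
    rejected = [s.action for s in steps if s.action not in SAFE_ACTIONS]
    if rejected:
        raise ValueError(f"actions not on the allowlist: {rejected}")
    results = []
    for step in steps:
        if dry_run:
            results.append(("planned", step))
        else:
            apply(step)
            results.append(("applied", step))
    return results
```

Validating the whole plan before applying the first step avoids leaving a service half-recovered when a later step turns out to be disallowed.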
Required API Enhancements
New API Endpoints to Implement
Metrics APIs:
POST /api/v1/metrics/batch-query # Batch metric queries
POST /api/v1/metrics/anomaly-detection # Anomaly detection
POST /api/v1/metrics/comparison # Metric comparison analysis
Log APIs:
POST /api/v1/logs/search # Log search
POST /api/v1/logs/error-analysis # Error log analysis
POST /api/v1/logs/correlation # Log-metric correlation
Trace APIs:
POST /api/v1/traces/query # Trace query
POST /api/v1/traces/performance-analysis # Performance analysis
POST /api/v1/traces/dependency-map # Dependency map generation
Diagnosis APIs:
POST /api/v1/diagnosis/comprehensive # Comprehensive diagnosis
POST /api/v1/prediction/anomalies # Anomaly prediction
POST /api/v1/recovery/plan-generation # Recovery plan generation
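The sketch below shows how agent tools might be routed to the endpoints listed above. The tool-to-endpoint pairing is an assumption (the issue lists both but does not link them), and the JSON request body is a placeholder because the request schemas are not yet specified.

```python
import json
from urllib.request import Request

# Assumed pairing of tool names to proposed endpoints.
TOOL_ENDPOINTS = {
    "query_service_metrics": "/api/v1/metrics/batch-query",
    "analyze_metrics_anomaly": "/api/v1/metrics/anomaly-detection",
    "compare_instance_performance": "/api/v1/metrics/comparison",
    "search_service_logs": "/api/v1/logs/search",
    "analyze_error_logs": "/api/v1/logs/error-analysis",
    "correlate_logs_with_metrics": "/api/v1/logs/correlation",
    "query_service_traces": "/api/v1/traces/query",
    "analyze_trace_performance": "/api/v1/traces/performance-analysis",
    "map_service_dependencies": "/api/v1/traces/dependency-map",
    "diagnose_service_issues": "/api/v1/diagnosis/comprehensive",
    "predict_service_anomalies": "/api/v1/prediction/anomalies",
    "generate_recovery_plan": "/api/v1/recovery/plan-generation",
}


def build_tool_request(base_url, tool, arguments):
    """Build (but do not send) the POST request for one tool call."""
    try:
        path = TOOL_ENDPOINTS[tool]
    except KeyError:
        raise ValueError(f"no endpoint registered for tool {tool!r}") from None
    return Request(
        base_url.rstrip("/") + path,
        data=json.dumps(arguments).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Keeping the mapping in one table makes it easy to see which proposed tools still lack a backing endpoint, for example the traffic and configuration tools in Phase 2.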
Existing API Enhancements Required
- Enhance /api/v1/application/detail - Add health status assessment
- Enhance /api/v1/service/detail - Add more related information
- Enhance /api/v1/instance/detail - Add health checks and resource usage
- Enhance /api/v1/promQL/query - Support batch queries and time ranges
- Enhance traffic rule APIs - Add impact analysis and statistics