Saga Architecture
This document provides a comprehensive overview of Saga's architecture, design decisions, and internal workings. Understanding the architecture helps with integration, troubleshooting, and optimization.
Saga is a centralized service discovery system built with Rust and Actix Web. It provides dynamic service registration and discovery capabilities for microservices architectures, using Redis as the persistent storage backend and an in-memory cache for performance.
System Architecture
The following diagram illustrates Saga's high-level architecture:
graph TB
Clients[Client Services] -->|HTTP API / CLI| Saga[Saga Service]
Saga -->|Redis Protocol| Redis[(Redis Storage)]
subgraph Saga
HTTP[HTTP Server<br/>Actix Web]
Registry[Service Registry<br/>In-Memory Cache]
RedisClient[Redis Client<br/>redis-rs]
HTTP --> Registry
Registry --> RedisClient
end
RedisClient --> Redis
Key Components:
- HTTP Server: RESTful API endpoints (Actix Web)
- Service Registry: In-memory cache for fast lookups
- Redis Client: Persistent storage backend
- Redis: Distributed storage for service metadata
Saga follows these design principles:
- Performance First: In-memory caching for sub-millisecond lookups
- Reliability: Redis-backed persistence with automatic expiration
- Simplicity: Clean REST API and straightforward integration
- Scalability: Stateless design enables horizontal scaling
Component Details
HTTP Server
The HTTP server is built with Actix Web, a high-performance async web framework for Rust.
Capabilities:
- ✅ RESTful API endpoints
- ✅ Async request handling
- ✅ Concurrent request processing
- ✅ JSON request/response handling
- ✅ Health check endpoints
- ✅ Error handling and logging
Available Endpoints:
- `GET /api/v1/health` - Health check
- `POST /api/v1/services/register` - Register service
- `GET /api/v1/services` - List all services
- `GET /api/v1/services/{name}` - Get service details
- `DELETE /api/v1/services/{name}` - Unregister service
- `POST /api/v1/services/{name}/heartbeat` - Refresh registration
Performance Characteristics:
- Handles thousands of requests per second
- Sub-millisecond response times for cached requests
- Async I/O prevents blocking
- Efficient memory usage
Service Registry (In-Memory Cache)
The service registry maintains an in-memory cache of all registered services for ultra-fast lookups.
// Simplified representation
Arc<RwLock<HashMap<String, ServiceInfo>>>
Characteristics:
- Thread-safe: Uses `Arc<RwLock<...>>` for concurrent access
- Fast lookups: O(1) average case complexity
- Automatic refresh: Background task updates cache periodically
- Statistics: Tracks hits, misses, and hit ratio
Refresh Mechanism:
- Background task runs every 30 seconds (configurable)
- Queries all services from Redis
- Updates in-memory cache atomically
- Updates cache statistics
Benefits:
- Ensures cache consistency
- Handles Redis updates from other Saga instances
- Maintains performance with fresh data
The cache tracks performance metrics:
{
"size": 5, // Number of services cached
"hits": 142, // Cache hits (fast lookups)
"misses": 18, // Cache misses (Redis queries)
"hit_ratio": 0.8875, // Efficiency (hits / total)
"last_refresh": "..." // Last refresh timestamp
}
Interpreting Metrics:
- Hit ratio > 0.8: Excellent cache performance
- Hit ratio 0.5-0.8: Good cache performance
- Hit ratio < 0.5: Consider optimization
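The hit-ratio arithmetic above can be reproduced with a small hit/miss counter. A minimal sketch (the struct and method names are illustrative, not Saga's actual internals), using atomic counters so a single instance can be shared across concurrent request handlers:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative cache-statistics tracker.
#[derive(Default)]
pub struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    pub fn record_hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_miss(&self) {
        self.misses.fetch_add(1, Ordering::Relaxed);
    }

    /// hits / (hits + misses); 0.0 before any lookups have happened.
    pub fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        let total = hits + misses;
        if total == 0.0 { 0.0 } else { hits / total }
    }
}
```

With the counts from the sample metrics (142 hits, 18 misses), `hit_ratio()` yields 0.8875, matching the JSON above.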
Redis Client
The Redis client provides persistent storage for service metadata.
Library: redis-rs (async Redis client for Rust)
Features:
- ✅ Async/await support
- ✅ Connection pooling
- ✅ Automatic reconnection
- ✅ TTL management
- ✅ JSON serialization
Connection Handling:
- Connection pool for concurrent requests
- Automatic reconnection on failure
- Health check integration
- Graceful degradation when Redis unavailable
Configuration:
REDIS_URL=redis://localhost:6379
Supported Formats:
- `redis://localhost:6379` - Basic connection
- `redis://:password@host:6379` - With password
- `redis://user:pass@host:6379` - With username/password
- `redis-sentinel://...` - Redis Sentinel
- `redis-cluster://...` - Redis Cluster
Key Pattern: service:{service_name}
Example Keys:
- `service:authentication`
- `service:payment`
- `service:gateway`
Value Format: JSON metadata
{
"service_name": "authentication",
"service_url": "http://localhost:8001",
"service_id": "uuid-here",
"registered_at": "2025-01-01T00:00:00Z",
"last_heartbeat": "2025-01-01T00:00:00Z",
"capabilities": ["rest", "graphql"]
}
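For illustration, the key pattern can be captured in a pair of tiny helpers (hypothetical names, not part of Saga's API):

```rust
/// Build the Redis key for a service, following the
/// `service:{service_name}` pattern described above.
fn service_key(service_name: &str) -> String {
    format!("service:{}", service_name)
}

/// Recover the service name from a key, or None if the key
/// is outside the `service:` namespace.
fn service_name_from_key(key: &str) -> Option<&str> {
    key.strip_prefix("service:")
}
```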
CLI Interface
Saga provides a command-line interface for service management and debugging.
Available Commands:
- `service list` - List all registered services
- `service register` - Register a new service
- `service get {name}` - Get service details
- `service unregister {name}` - Unregister a service
# List services
cargo run -- service list
# Register service
cargo run -- service register \
--name my-service \
--url http://localhost:8000 \
--capabilities rest,graphql
# Get service details
cargo run -- service get my-service
# Unregister service
cargo run -- service unregister my-service
Data Flow
Understanding how data flows through Saga helps with debugging and optimization.
Service Registration Flow
1. Client sends POST /api/v1/services/register
↓
2. Saga validates request (name, URL, capabilities)
↓
3. Saga generates unique service_id (UUID)
↓
4. Saga stores metadata in Redis with TTL
↓
5. Saga updates in-memory cache immediately
↓
6. Saga returns registration confirmation
Key Points:
- Validation happens before storage
- Cache is updated synchronously for immediate availability
- Redis storage provides persistence across restarts
// Simplified registration flow
async fn register_service(request: RegisterRequest) -> Result<ServiceInfo> {
    // 1. Validate
    validate_service_name(&request.service_name)?;
    validate_service_url(&request.service_url)?;
    // 2. Create service info
    let now = Utc::now();
    let service_info = ServiceInfo {
        service_id: Uuid::new_v4(),
        service_name: request.service_name,
        service_url: request.service_url,
        registered_at: now,
        last_heartbeat: now,
        capabilities: request.capabilities,
    };
    // 3. Store in Redis under `service:{service_name}` with TTL
    let key = format!("service:{}", service_info.service_name);
    redis_client.set_with_ttl(&key, &service_info, ttl).await?;
    // 4. Update cache synchronously so the service is discoverable immediately
    cache
        .write()
        .await
        .insert(service_info.service_name.clone(), service_info.clone());
    // 5. Return
    Ok(service_info)
}
Service Discovery Flow
1. Client sends GET /api/v1/services/{name}
↓
2. Saga checks in-memory cache first
↓
3a. Cache HIT → Return cached data immediately (< 1ms)
↓
3b. Cache MISS → Query Redis (5-10ms)
↓
4. Update cache with result
↓
5. Return service metadata
Performance:
- Cache hit: < 1ms (in-memory lookup)
- Cache miss: 5-10ms (Redis query + cache update)
// Simplified discovery flow
async fn discover_service(name: &str) -> Result<ServiceInfo> {
// 1. Check cache first
if let Some(cached) = cache.read().await.get(name) {
cache_stats.record_hit();
return Ok(cached.clone());
}
// 2. Cache miss - query Redis
cache_stats.record_miss();
let service_info = redis_client.get(&format!("service:{}", name)).await?;
// 3. Update cache
cache.write().await.insert(name.to_string(), service_info.clone());
// 4. Return
Ok(service_info)
}
Cache Refresh Flow
1. Background task triggers every 30 seconds
↓
2. Query all service keys from Redis
↓
3. Fetch all service metadata
↓
4. Atomically replace cache contents
↓
5. Update cache statistics
↓
6. Log refresh completion
Benefits:
- Ensures cache consistency
- Handles updates from other Saga instances
- Maintains fresh data without blocking requests
// Simplified cache refresh flow
async fn refresh_cache() {
    loop {
        tokio::time::sleep(Duration::from_secs(30)).await;
        // Query all service keys from Redis
        let keys: Vec<String> = match redis_client.keys("service:*").await {
            Ok(keys) => keys,
            Err(_) => continue, // Redis unavailable: keep serving the old cache
        };
        // Fetch metadata for each key (async calls cannot run inside `map` closures)
        let mut services = Vec::new();
        for key in keys {
            if let Ok(service) = redis_client.get(&key).await {
                services.push(service);
            }
        }
        // Atomically replace cache contents under the write lock
        let mut cache = cache.write().await;
        cache.clear();
        for service in services {
            cache.insert(service.service_name.clone(), service);
        }
        drop(cache);
        // Update statistics
        update_cache_stats();
    }
}
Storage Model
Redis Key Structure
Pattern: service:{service_name}
Examples:
- `service:authentication`
- `service:payment`
- `service:gateway`
- `service:my-service`
Benefits:
- Simple and predictable
- Easy to query all services (`KEYS service:*`)
- Namespace isolation
TTL Configuration:
- Default: 60 seconds
- Configurable via `REGISTRATION_TTL` environment variable
- Refreshed via heartbeat endpoint
- Automatic expiration if no heartbeat
TTL Refresh:
# Send heartbeat to refresh TTL
curl -X POST http://localhost:8030/api/v1/services/my-service/heartbeat
Best Practices:
- Set TTL to 2x heartbeat interval
- Send heartbeats every 30 seconds for 60-second TTL
- Handle expiration gracefully in clients
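The 2x rule above can be sketched with a toy expiry model (type and function names are hypothetical) in which a heartbeat resets the TTL clock, much as refreshing the key's TTL in Redis would:

```rust
use std::time::{Duration, Instant};

/// Illustrative in-process model of a TTL-bound registration.
struct Registration {
    last_heartbeat: Instant,
    ttl: Duration,
}

impl Registration {
    fn new(ttl: Duration) -> Self {
        Self { last_heartbeat: Instant::now(), ttl }
    }

    /// A heartbeat restarts the TTL countdown.
    fn heartbeat(&mut self) {
        self.last_heartbeat = Instant::now();
    }

    fn is_expired(&self) -> bool {
        self.last_heartbeat.elapsed() > self.ttl
    }
}

/// Best-practice rule from the text: TTL = 2 x heartbeat interval,
/// so a single missed heartbeat does not expire the registration.
fn ttl_for(heartbeat_interval: Duration) -> Duration {
    heartbeat_interval * 2
}
```

With a 30-second heartbeat interval this yields the default 60-second TTL.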
Value Format
Service metadata is stored as JSON in Redis:
{
"service_name": "authentication",
"service_url": "http://localhost:8001",
"service_id": "a674dbe7-5147-441b-ae56-f1c05f61cdbd",
"registered_at": "2025-12-22T14:30:38.531328+00:00",
"last_heartbeat": "2025-12-22T14:30:38.531330+00:00",
"capabilities": ["rest", "graphql"]
}
Field Descriptions:
| Field | Type | Description |
|---|---|---|
| `service_name` | string | Unique service identifier |
| `service_url` | string | Base URL where service can be reached |
| `service_id` | string (UUID) | Unique registration ID |
| `registered_at` | string (ISO 8601) | Initial registration timestamp |
| `last_heartbeat` | string (ISO 8601) | Last heartbeat timestamp |
| `capabilities` | string[] | Supported protocols (rest, graphql, grpc, mcp) |
Concurrency Model
Thread Safety
Implementation:
Arc<RwLock<HashMap<String, ServiceInfo>>>
Characteristics:
- Arc: Shared ownership across threads
- RwLock: Multiple readers or single writer
- HashMap: Fast O(1) lookups
- Concurrent reads: Multiple threads can read simultaneously
- Exclusive writes: Only one writer at a time
Performance:
- Read operations are non-blocking (multiple concurrent readers)
- Write operations block readers briefly
- Minimal contention in typical workloads
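This sharing model can be shown in a self-contained sketch using std's blocking `RwLock` (Saga awaits an async lock, but the `Arc<RwLock<HashMap>>` shape and the readers/writer semantics are the same; the `ServiceInfo` fields are abbreviated):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Minimal stand-in for the cached service metadata.
#[derive(Clone, PartialEq, Debug)]
pub struct ServiceInfo {
    pub service_url: String,
}

pub type Registry = Arc<RwLock<HashMap<String, ServiceInfo>>>;

pub fn register(registry: &Registry, name: &str, info: ServiceInfo) {
    // Exclusive write: readers are blocked only for this short critical section.
    registry.write().unwrap().insert(name.to_string(), info);
}

pub fn lookup(registry: &Registry, name: &str) -> Option<ServiceInfo> {
    // Shared read: any number of threads may hold the read lock at once.
    registry.read().unwrap().get(name).cloned()
}
```

Cloning the `Arc` hands each thread a handle to the same map, so concurrent lookups never copy the registry itself.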
Implementation:
- Redis client is `Send + Sync`
- Connection pool handles concurrent requests
- Async operations prevent blocking
Benefits:
- Safe to share across threads
- Efficient connection reuse
- Automatic connection management
Request Handling:
- Each request runs in async task
- Multiple requests processed concurrently
- No shared mutable state per request
- Stateless design enables scaling
Cache Consistency
Type: Eventually Consistent
Characteristics:
- Cache refreshes every 30 seconds
- Direct Redis queries bypass cache when needed
- Cache invalidation on service unregistration
- Multiple Saga instances share Redis backend
Trade-offs:
- Pros: High performance, low latency
- Cons: Potential stale data (max 30 seconds)
Use Cases:
- Service discovery (stale data acceptable)
- Health checks (real-time data from Redis)
- Registration (immediate cache update)
Refresh Interval: 30 seconds (configurable)
Refresh Process:
- Background task triggers periodically
- Queries all services from Redis
- Atomically replaces cache contents
- Updates statistics
Optimization:
- Can be reduced for more consistency
- Can be increased for less Redis load
- Balance between freshness and performance
Performance Characteristics
Latency
Cache Hit Latency: < 1ms
Breakdown:
- Cache lookup: ~0.1ms
- JSON serialization: ~0.2ms
- HTTP response: ~0.5ms
Optimization:
- In-memory hash map lookup
- No network I/O
- Minimal CPU overhead
Cache Miss Latency: 5-10ms
Breakdown:
- Redis query: ~2-5ms
- Cache update: ~0.5ms
- JSON serialization: ~0.5ms
- HTTP response: ~1-2ms
Optimization:
- Connection pooling reduces overhead
- Async I/O prevents blocking
- Cache update happens asynchronously
Registration Latency: 10-20ms
Breakdown:
- Validation: ~0.5ms
- Redis write: ~5-10ms
- Cache update: ~0.5ms
- JSON serialization: ~1ms
- HTTP response: ~2-3ms
Throughput
Capacity: Thousands of requests per second
Factors:
- Cache hit ratio (higher = better)
- Redis performance
- Network latency
- CPU resources
Typical Performance:
- Cache hits: 10,000+ req/s
- Cache misses: 1,000-2,000 req/s
- Mixed workload: 3,000-5,000 req/s
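The mixed-workload numbers follow from the hit ratio: expected lookup latency is a weighted average of the hit and miss latencies. A back-of-envelope helper (the specific millisecond figures plugged in below are illustrative, not measurements):

```rust
/// Expected lookup latency (ms) for a given cache hit ratio,
/// as a weighted average of hit and miss latencies.
fn expected_latency_ms(hit_ratio: f64, hit_ms: f64, miss_ms: f64) -> f64 {
    hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms
}
```

For example, with the sample hit ratio of 0.8875, a 0.5 ms hit, and a 7.5 ms miss, the expected latency is about 1.3 ms, which is why raising the hit ratio dominates throughput tuning.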
Cache Optimization:
- High cache hit ratio (>80%)
- Reduce cache refresh frequency
- Increase cache size if needed
Redis Optimization:
- Use Redis Cluster for scaling
- Optimize network latency
- Use connection pooling
Application Optimization:
- Batch requests when possible
- Use async/await properly
- Monitor and tune resource limits
Scalability
Strategy: Multiple Saga Instances
Architecture:
Load Balancer
├── Saga Instance 1 ──┐
├── Saga Instance 2 ──┼──→ Redis (Shared)
└── Saga Instance 3 ──┘
Benefits:
- ✅ High availability
- ✅ Load distribution
- ✅ Fault tolerance
- ✅ Easy scaling
Considerations:
- Shared Redis backend
- Cache consistency (eventual)
- Load balancer configuration
Options:
- Redis Sentinel: High availability
- Redis Cluster: Horizontal scaling
- Redis Replication: Read scaling
Recommendations:
- Use Sentinel for HA
- Use Cluster for scale
- Monitor Redis performance
Security Considerations
Current State
Status:
- ❌ No authentication required
- ❌ No authorization checks
- ❌ No TLS/HTTPS support
- ✅ Network security recommended
Use Cases:
- Internal microservices networks
- Development environments
- Trusted network deployments
Network Security:
- Deploy behind firewall
- Restrict access to internal networks
- Use reverse proxy with TLS
- Implement network policies
Access Control:
- Use network-level access control
- Deploy in private networks
- Monitor access logs
- Implement rate limiting at proxy level
Future Enhancements
Authentication:
- API key authentication
- OAuth2/JWT support
- mTLS (mutual TLS)
Authorization:
- Role-based access control (RBAC)
- Service-level permissions
- Admin vs read-only access
Security:
- TLS/HTTPS support
- Rate limiting
- Request signing
- Audit logging
Reliability
Redis Failover
Behavior:
- Saga continues running when Redis unavailable
- Health endpoint reports Redis status
- Service registration fails gracefully
- Service discovery uses fallback configuration
Error Handling:
- Clear error messages
- Retry logic for transient failures
- Fallback mechanisms in clients
Redis Sentinel:
- Automatic failover
- Multiple Redis instances
- No single point of failure
Configuration:
REDIS_URL=redis-sentinel://sentinel1:26379,sentinel2:26379/mymaster
Service Availability
Purpose: Keep service registrations alive
Process:
- Service sends heartbeat every 30 seconds
- Saga refreshes TTL in Redis
- Service remains discoverable
- Automatic expiration if heartbeat stops
Benefits:
- Automatic cleanup of dead services
- Fresh service information
- No manual intervention needed
Default TTL: 60 seconds
Expiration Process:
- Service registered with TTL
- Heartbeat refreshes TTL
- If no heartbeat, TTL expires
- Service automatically removed from registry
Best Practices:
- Send heartbeats every 30 seconds
- Handle expiration in clients
- Implement re-registration logic
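Those client-side practices can be sketched as a keep-alive loop. The function below is a hypothetical client policy, not part of Saga; the HTTP calls are injected as closures so the re-registration logic stands on its own:

```rust
use std::time::Duration;

/// Outcome of one heartbeat attempt against the heartbeat endpoint.
pub enum HeartbeatResult {
    Ok,
    /// e.g. the server no longer knows the service: the TTL expired.
    NotRegistered,
}

/// Send heartbeats, re-registering whenever the server reports that the
/// registration expired. Returns how many re-registrations were needed.
pub fn keep_alive(
    rounds: usize,
    _interval: Duration, // a real client would sleep this long between rounds
    mut send_heartbeat: impl FnMut() -> HeartbeatResult,
    mut register: impl FnMut(),
) -> usize {
    let mut re_registrations = 0;
    for _ in 0..rounds {
        if let HeartbeatResult::NotRegistered = send_heartbeat() {
            // Expired registration: re-register instead of giving up.
            register();
            re_registrations += 1;
        }
        // std::thread::sleep(_interval) in a real client
    }
    re_registrations
}
```

A real client would pass closures wrapping its HTTP calls to `POST .../heartbeat` and `POST .../register`, with `_interval` set to half the TTL.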
Monitoring
Health Endpoint
Endpoint: GET /api/v1/health
Response:
{
"status": "healthy",
"service": "saga",
"version": "0.8.1",
"redis": "connected",
"cache": {
"size": 5,
"hits": 142,
"misses": 18,
"hit_ratio": 0.8875,
"last_refresh": "2025-12-22T14:29:53.944455Z"
}
}
Use Cases:
- Container health checks
- Load balancer health checks
- Monitoring systems
- Alerting systems
Service Metrics:
- Status (healthy/unhealthy)
- Version information
- Redis connection status
Cache Metrics:
- Cache size
- Cache hits/misses
- Cache hit ratio
- Last refresh time
Integration:
- Prometheus (future)
- Custom metrics endpoint (future)
- Structured logging
Logging
Library: tracing (Rust)
Log Levels:
- `error` - Errors only
- `warn` - Warnings and errors
- `info` - Informational messages (default)
- `debug` - Debug information
- `trace` - Very verbose logging
Configuration:
RUST_LOG=saga=info cargo run
RUST_LOG=saga=debug cargo run
RUST_LOG=saga=trace cargo run
Request/Response:
- HTTP method and path
- Request duration
- Response status
- Error details
Service Operations:
- Registration events
- Discovery requests
- Cache operations
- Redis operations
Performance:
- Cache hit/miss rates
- Request latencies
- Redis query times
Future Enhancements
Planned Features
Service Management:
- Service health monitoring
- Load balancing support
- Service versioning
- Multi-region support
API Enhancements:
- GraphQL API
- WebSocket subscriptions
- Event streaming
- Batch operations
Integration:
- Service mesh integration
- Kubernetes integration
- Prometheus metrics
- Distributed tracing
Caching:
- Redis pub/sub for cache invalidation
- Distributed caching
- Cache warming strategies
Optimization:
- Query optimization
- Connection pooling improvements
- Batch operations
- Compression
Next Steps
- Review Getting Started for setup instructions
- Check API Reference for endpoint details
- See Integration Examples for code samples
- Read Troubleshooting for common issues