Saga Architecture

This document provides a comprehensive overview of Saga's architecture, design decisions, and internal workings. Understanding the architecture helps with integration, troubleshooting, and optimization.

Architecture Overview

Saga is a centralized service discovery system built with Rust and Actix Web. It provides dynamic service registration and discovery capabilities for microservices architectures, using Redis as the persistent storage backend and an in-memory cache for performance.

System Architecture

The following diagram illustrates Saga's high-level architecture:

graph TB
    Clients[Client Services] -->|HTTP API / CLI| Saga[Saga Service]
    Saga -->|Redis Protocol| Redis[(Redis Storage)]

    subgraph Saga
        HTTP[HTTP Server<br/>Actix Web]
        Registry[Service Registry<br/>In-Memory Cache]
        RedisClient[Redis Client<br/>redis-rs]

        HTTP --> Registry
        Registry --> RedisClient
    end

    RedisClient --> Redis

Key Components:

  • HTTP Server: RESTful API endpoints (Actix Web)
  • Service Registry: In-memory cache for fast lookups
  • Redis Client: Persistent storage backend
  • Redis: Distributed storage for service metadata

Design Philosophy

Saga follows these design principles:

  • Performance First: In-memory caching for sub-millisecond lookups
  • Reliability: Redis-backed persistence with automatic expiration
  • Simplicity: Clean REST API and straightforward integration
  • Scalability: Stateless design enables horizontal scaling

Component Details

HTTP Server

The HTTP server is built with Actix Web, a high-performance async web framework for Rust.

Capabilities:

  • ✅ RESTful API endpoints
  • ✅ Async request handling
  • ✅ Concurrent request processing
  • ✅ JSON request/response handling
  • ✅ Health check endpoints
  • ✅ Error handling and logging

Available Endpoints:

  • GET /api/v1/health - Health check
  • POST /api/v1/services/register - Register service
  • GET /api/v1/services - List all services
  • GET /api/v1/services/{name} - Get service details
  • DELETE /api/v1/services/{name} - Unregister service
  • POST /api/v1/services/{name}/heartbeat - Refresh registration
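
As a rough illustration (not Saga's actual source), these endpoints could be wired up in Actix Web as shown below. The handler names are placeholders, the port is taken from the curl examples later on this page, and serde_json is assumed for the stub response body.

// Illustrative route wiring only; handler names are assumptions.
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

async fn health() -> impl Responder {
    // The real handler returns the full health document shown under "Monitoring".
    HttpResponse::Ok().json(serde_json::json!({ "status": "healthy", "service": "saga" }))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new().service(
            web::scope("/api/v1")
                .route("/health", web::get().to(health))
                // The register, list, get, unregister, and heartbeat handlers
                // are attached the same way, e.g.:
                //   .route("/services/register", web::post().to(register))
                //   .route("/services/{name}/heartbeat", web::post().to(heartbeat))
        )
    })
    .bind(("0.0.0.0", 8030))?
    .run()
    .await
}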

Performance Characteristics:

  • Handles thousands of requests per second
  • Sub-millisecond response times for cached requests
  • Async I/O prevents blocking
  • Efficient memory usage

Service Registry (In-Memory Cache)

The service registry maintains an in-memory cache of all registered services for ultra-fast lookups.

// Simplified representation
Arc<RwLock<HashMap<String, ServiceInfo>>>

Characteristics:

  • Thread-safe: Uses Arc<RwLock<>> for concurrent access
  • Fast lookups: O(1) average case complexity
  • Automatic refresh: Background task updates cache periodically
  • Statistics: Tracks hits, misses, and hit ratio
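
A minimal sketch of what the registry type could look like internally is shown below. The field and type names are assumptions, and ServiceInfo stands for the metadata record described under Value Format further down.

use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::RwLock;

// Hit/miss counters backing the statistics reported by the health endpoint.
#[derive(Default)]
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}

// Shared registry: the cached map plus its statistics.
struct ServiceRegistry {
    services: Arc<RwLock<HashMap<String, ServiceInfo>>>, // ServiceInfo: see "Value Format"
    stats: Arc<CacheStats>,
}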

Refresh Mechanism:

  • Background task runs every 30 seconds (configurable)
  • Queries all services from Redis
  • Updates in-memory cache atomically
  • Updates cache statistics

Benefits:

  • Ensures cache consistency
  • Handles Redis updates from other Saga instances
  • Maintains performance with fresh data

The cache tracks performance metrics:

{
  "size": 5,            // Number of services cached
  "hits": 142,          // Cache hits (fast lookups)
  "misses": 18,         // Cache misses (Redis queries)
  "hit_ratio": 0.8875,  // Efficiency (hits / total)
  "last_refresh": "..." // Last refresh timestamp
}

Interpreting Metrics:

  • Hit ratio > 0.8: Excellent cache performance
  • Hit ratio 0.5-0.8: Good cache performance
  • Hit ratio < 0.5: Consider optimization

Redis Client

The Redis client provides persistent storage for service metadata.

Library: redis-rs (async Redis client for Rust)

Features:

  • ✅ Async/await support
  • ✅ Connection pooling
  • ✅ Automatic reconnection
  • ✅ TTL management
  • ✅ JSON serialization

Connection Handling:

  • Connection pool for concurrent requests
  • Automatic reconnection on failure
  • Health check integration
  • Graceful degradation when Redis unavailable
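
As a hedged sketch of the storage calls involved (not Saga's actual module), a write with TTL and a read via redis-rs's async API could look roughly like this. The function is a placeholder, and the tokio-enabled multiplexed connection is an assumption about the exact client setup.

use redis::AsyncCommands;

// Store a service's JSON metadata under service:{name} with a TTL, then read it back.
// `url` is e.g. "redis://localhost:6379"; recent redis-rs versions take the TTL in seconds.
async fn example(url: &str, name: &str, json: &str, ttl_secs: u64) -> redis::RedisResult<String> {
    let client = redis::Client::open(url)?;
    let mut con = client.get_multiplexed_async_connection().await?;

    // SET service:{name} <json> EX <ttl_secs>
    let key = format!("service:{}", name);
    let _: () = con.set_ex(&key, json, ttl_secs).await?;

    // GET service:{name}
    con.get(&key).await
}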

Configuration:

REDIS_URL=redis://localhost:6379

Supported Formats:

  • redis://localhost:6379 - Basic connection
  • redis://:password@host:6379 - With password
  • redis://user:pass@host:6379 - With username/password
  • redis-sentinel://... - Redis Sentinel
  • redis-cluster://... - Redis Cluster

Key Pattern: service:{service_name}

Example Keys:

  • service:authentication
  • service:payment
  • service:gateway

Value Format: JSON metadata

{
  "service_name": "authentication",
  "service_url": "http://localhost:8001",
  "service_id": "uuid-here",
  "registered_at": "2025-01-01T00:00:00Z",
  "last_heartbeat": "2025-01-01T00:00:00Z",
  "capabilities": ["rest", "graphql"]
}

CLI Interface

Saga provides a command-line interface for service management and debugging.

Available Commands:

  • service list - List all registered services
  • service register - Register a new service
  • service get {name} - Get service details
  • service unregister {name} - Unregister a service

# List services
cargo run -- service list

# Register service
cargo run -- service register \
  --name my-service \
  --url http://localhost:8000 \
  --capabilities rest,graphql

# Get service details
cargo run -- service get my-service

# Unregister service
cargo run -- service unregister my-service

Data Flow

Understanding how data flows through Saga helps with debugging and optimization.

Service Registration Flow

1. Client sends POST /api/v1/services/register

2. Saga validates request (name, URL, capabilities)

3. Saga generates unique service_id (UUID)

4. Saga stores metadata in Redis with TTL

5. Saga updates in-memory cache immediately

6. Saga returns registration confirmation

Key Points:

  • Validation happens before storage
  • Cache is updated synchronously for immediate availability
  • Redis storage provides persistence across restarts

// Simplified registration flow
async fn register_service(request: RegisterRequest) -> Result<ServiceInfo> {
    // 1. Validate
    validate_service_name(&request.service_name)?;
    validate_service_url(&request.service_url)?;

    // 2. Create service info
    let service_info = ServiceInfo {
        service_id: Uuid::new_v4(),
        service_name: request.service_name,
        service_url: request.service_url,
        registered_at: Utc::now(),
        capabilities: request.capabilities,
    };

    // 3. Store in Redis (key pattern: service:{name})
    let key = format!("service:{}", service_info.service_name);
    redis_client.set_with_ttl(&key, &service_info, ttl).await?;

    // 4. Update cache
    cache.write().await.insert(service_info.service_name.clone(), service_info.clone());

    // 5. Return
    Ok(service_info)
}

Service Discovery Flow

1. Client sends GET /api/v1/services/{name}

2. Saga checks in-memory cache first

3a. Cache HIT → Return cached data immediately (< 1ms)

3b. Cache MISS → Query Redis (5-10ms)

4. Update cache with result

5. Return service metadata

Performance:

  • Cache hit: < 1ms (in-memory lookup)
  • Cache miss: 5-10ms (Redis query + cache update)

// Simplified discovery flow
async fn discover_service(name: &str) -> Result<ServiceInfo> {
    // 1. Check cache first
    if let Some(cached) = cache.read().await.get(name) {
        cache_stats.record_hit();
        return Ok(cached.clone());
    }

    // 2. Cache miss - query Redis
    cache_stats.record_miss();
    let service_info = redis_client.get(&format!("service:{}", name)).await?;

    // 3. Update cache
    cache.write().await.insert(name.to_string(), service_info.clone());

    // 4. Return
    Ok(service_info)
}

Cache Refresh Flow

1. Background task triggers every 30 seconds

2. Query all service keys from Redis

3. Fetch all service metadata

4. Atomically replace cache contents

5. Update cache statistics

6. Log refresh completion

Benefits:

  • Ensures cache consistency
  • Handles updates from other Saga instances
  • Maintains fresh data without blocking requests

// Simplified cache refresh flow
async fn refresh_cache() {
    loop {
        tokio::time::sleep(Duration::from_secs(30)).await;

        // Query all service keys from Redis, then fetch each entry
        let keys: Vec<String> = redis_client.keys("service:*").await;
        let mut services: Vec<ServiceInfo> = Vec::new();
        for key in keys {
            services.push(redis_client.get(&key).await);
        }

        // Atomically update cache
        let mut cache = cache.write().await;
        cache.clear();
        for service in services {
            cache.insert(service.service_name.clone(), service);
        }

        // Update statistics
        update_cache_stats();
    }
}

Storage Model

Redis Key Structure

Pattern: service:{service_name}

Examples:

  • service:authentication
  • service:payment
  • service:gateway
  • service:my-service

Benefits:

  • Simple and predictable
  • Easy to query all services (KEYS service:*)
  • Namespace isolation

TTL Configuration:

  • Default: 60 seconds
  • Configurable via REGISTRATION_TTL environment variable
  • Refreshed via heartbeat endpoint
  • Automatic expiration if no heartbeat

TTL Refresh:

# Send heartbeat to refresh TTL
curl -X POST http://localhost:8030/api/v1/services/my-service/heartbeat

Best Practices:

  • Set TTL to 2x heartbeat interval
  • Send heartbeats every 30 seconds for 60-second TTL
  • Handle expiration gracefully in clients
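
A client-side heartbeat loop following these practices could look like the sketch below; reqwest is an assumption for the HTTP call, and the 30-second interval matches the guidance above.

use std::time::Duration;

// Refresh the registration TTL every 30 seconds via the documented heartbeat endpoint.
async fn heartbeat_loop(saga_base: &str, service_name: &str) {
    let client = reqwest::Client::new();
    let url = format!("{}/api/v1/services/{}/heartbeat", saga_base, service_name);
    loop {
        match client.post(&url).send().await {
            Ok(resp) if resp.status().is_success() => {} // TTL refreshed
            Ok(resp) => eprintln!("heartbeat rejected: {}", resp.status()),
            Err(err) => eprintln!("heartbeat failed: {}", err),
        }
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}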

Value Format

Service metadata is stored as JSON in Redis:

{
"service_name": "authentication",
"service_url": "http://localhost:8001",
"service_id": "a674dbe7-5147-441b-ae56-f1c05f61cdbd",
"registered_at": "2025-12-22T14:30:38.531328+00:00",
"last_heartbeat": "2025-12-22T14:30:38.531330+00:00",
"capabilities": ["rest", "graphql"]
}

Field Descriptions:

Field            Type                 Description
service_name     string               Unique service identifier
service_url      string               Base URL where service can be reached
service_id       string (UUID)        Unique registration ID
registered_at    string (ISO 8601)    Initial registration timestamp
last_heartbeat   string (ISO 8601)    Last heartbeat timestamp
capabilities     string[]             Supported protocols (rest, graphql, grpc, mcp)
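
For reference, a client in Rust could mirror these fields with a serde-derived struct along these lines; using chrono and uuid (with their serde features) for the timestamp and UUID fields is an assumption about the representation.

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// Mirrors the stored JSON document field-for-field.
#[derive(Debug, Clone, Serialize, Deserialize)]
struct ServiceInfo {
    service_name: String,
    service_url: String,
    service_id: Uuid,
    registered_at: DateTime<Utc>,
    last_heartbeat: DateTime<Utc>,
    capabilities: Vec<String>,
}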

Concurrency Model

Thread Safety

Cache Implementation:

Arc<RwLock<HashMap<String, ServiceInfo>>>

Characteristics:

  • Arc: Shared ownership across threads
  • RwLock: Multiple readers or single writer
  • HashMap: Fast O(1) lookups
  • Concurrent reads: Multiple threads can read simultaneously
  • Exclusive writes: Only one writer at a time

Performance:

  • Read operations are non-blocking (multiple concurrent readers)
  • Write operations block readers briefly
  • Minimal contention in typical workloads
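
As a standalone illustration of the many-readers/one-writer model (not Saga's code), several tasks can hold read guards on a tokio RwLock at the same time, while a write guard is exclusive:

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let cache = Arc::new(RwLock::new(HashMap::<String, String>::new()));

    // Exclusive write: no readers or other writers while the guard is held.
    cache.write().await.insert("payment".into(), "http://localhost:8002".into());

    // Concurrent reads: all of these tasks can hold read guards simultaneously.
    let mut tasks = Vec::new();
    for _ in 0..4 {
        let cache = Arc::clone(&cache);
        tasks.push(tokio::spawn(async move {
            cache.read().await.get("payment").cloned()
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}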

Redis Client Implementation:

  • Redis client is Send + Sync safe
  • Connection pool handles concurrent requests
  • Async operations prevent blocking

Benefits:

  • Safe to share across threads
  • Efficient connection reuse
  • Automatic connection management

Request Handling:

  • Each request runs in async task
  • Multiple requests processed concurrently
  • No shared mutable state per request
  • Stateless design enables scaling

Cache Consistency

Type: Eventually Consistent

Characteristics:

  • Cache refreshes every 30 seconds
  • Direct Redis queries bypass cache when needed
  • Cache invalidation on service unregistration
  • Multiple Saga instances share Redis backend

Trade-offs:

  • Pros: High performance, low latency
  • Cons: Potential stale data (max 30 seconds)

Use Cases:

  • Service discovery (stale data acceptable)
  • Health checks (real-time data from Redis)
  • Registration (immediate cache update)

Refresh Interval: 30 seconds (configurable)

Refresh Process:

  1. Background task triggers periodically
  2. Queries all services from Redis
  3. Atomically replaces cache contents
  4. Updates statistics

Optimization:

  • Can be reduced for more consistency
  • Can be increased for less Redis load
  • Balance between freshness and performance
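
If the interval is driven by configuration, the wiring can be as small as the sketch below; note that the CACHE_REFRESH_INTERVAL variable name is illustrative, not a documented setting.

use std::time::Duration;

// Resolve the cache refresh interval, falling back to the documented 30-second default.
fn refresh_interval() -> Duration {
    let secs = std::env::var("CACHE_REFRESH_INTERVAL")
        .ok()
        .and_then(|value| value.parse::<u64>().ok())
        .unwrap_or(30);
    Duration::from_secs(secs)
}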

Performance Characteristics

Latency

Cache Hit Latency: < 1ms

Breakdown:

  • Cache lookup: ~0.1ms
  • JSON serialization: ~0.2ms
  • HTTP response: ~0.5ms

Optimization:

  • In-memory hash map lookup
  • No network I/O
  • Minimal CPU overhead

Cache Miss Latency: 5-10ms

Breakdown:

  • Redis query: ~2-5ms
  • Cache update: ~0.5ms
  • JSON serialization: ~0.5ms
  • HTTP response: ~1-2ms

Optimization:

  • Connection pooling reduces overhead
  • Async I/O prevents blocking
  • Cache update happens asynchronously

Registration Latency: 10-20ms

Breakdown:

  • Validation: ~0.5ms
  • Redis write: ~5-10ms
  • Cache update: ~0.5ms
  • JSON serialization: ~1ms
  • HTTP response: ~2-3ms

Throughput

Capacity: Thousands of requests per second

Factors:

  • Cache hit ratio (higher = better)
  • Redis performance
  • Network latency
  • CPU resources

Typical Performance:

  • Cache hits: 10,000+ req/s
  • Cache misses: 1,000-2,000 req/s
  • Mixed workload: 3,000-5,000 req/s

Cache Optimization:

  • High cache hit ratio (>80%)
  • Reduce cache refresh frequency
  • Increase cache size if needed

Redis Optimization:

  • Use Redis Cluster for scaling
  • Optimize network latency
  • Use connection pooling

Application Optimization:

  • Batch requests when possible
  • Use async/await properly
  • Monitor and tune resource limits

Scalability

Strategy: Multiple Saga Instances

Architecture:

Load Balancer
├── Saga Instance 1 ──┐
├── Saga Instance 2 ──┼──→ Redis (Shared)
└── Saga Instance 3 ──┘

Benefits:

  • ✅ High availability
  • ✅ Load distribution
  • ✅ Fault tolerance
  • ✅ Easy scaling

Considerations:

  • Shared Redis backend
  • Cache consistency (eventual)
  • Load balancer configuration

Redis Scaling Options:

  • Redis Sentinel: High availability
  • Redis Cluster: Horizontal scaling
  • Redis Replication: Read scaling

Recommendations:

  • Use Sentinel for HA
  • Use Cluster for scale
  • Monitor Redis performance

Security Considerations

Current State

Status:

  • ❌ No authentication required
  • ❌ No authorization checks
  • ❌ No TLS/HTTPS support
  • ✅ Network security recommended

Use Cases:

  • Internal microservices networks
  • Development environments
  • Trusted network deployments

Network Security:

  • Deploy behind firewall
  • Restrict access to internal networks
  • Use reverse proxy with TLS
  • Implement network policies

Access Control:

  • Use network-level access control
  • Deploy in private networks
  • Monitor access logs
  • Implement rate limiting at proxy level

Future Enhancements

Authentication:

  • API key authentication
  • OAuth2/JWT support
  • mTLS (mutual TLS)

Authorization:

  • Role-based access control (RBAC)
  • Service-level permissions
  • Admin vs read-only access

Security:

  • TLS/HTTPS support
  • Rate limiting
  • Request signing
  • Audit logging

Reliability

Redis Failover

Behavior:

  • Saga continues running when Redis unavailable
  • Health endpoint reports Redis status
  • Service registration fails gracefully
  • Service discovery uses fallback configuration

Error Handling:

  • Clear error messages
  • Retry logic for transient failures
  • Fallback mechanisms in clients

Redis Sentinel:

  • Automatic failover
  • Multiple Redis instances
  • No single point of failure

Configuration:

REDIS_URL=redis-sentinel://sentinel1:26379,sentinel2:26379/mymaster

Service Availability

Heartbeat Purpose: Keep service registrations alive

Process:

  1. Service sends heartbeat every 30 seconds
  2. Saga refreshes TTL in Redis
  3. Service remains discoverable
  4. Automatic expiration if heartbeat stops

Benefits:

  • Automatic cleanup of dead services
  • Fresh service information
  • No manual intervention needed

Default TTL: 60 seconds

Expiration Process:

  1. Service registered with TTL
  2. Heartbeat refreshes TTL
  3. If no heartbeat, TTL expires
  4. Service automatically removed from registry

Best Practices:

  • Send heartbeats every 30 seconds
  • Handle expiration in clients
  • Implement re-registration logic
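
One way a client could combine the last two practices is to treat a failed heartbeat as an expired registration and register again, as in the sketch below. reqwest (with its json feature) is assumed, and the register request body is inferred from the stored metadata fields rather than taken from a documented schema.

// Sketch: send a heartbeat, and re-register if the registration has lapsed.
async fn heartbeat_or_reregister(
    client: &reqwest::Client,
    saga_base: &str,
    name: &str,
    service_url: &str,
) -> Result<(), reqwest::Error> {
    let heartbeat_url = format!("{}/api/v1/services/{}/heartbeat", saga_base, name);
    let resp = client.post(&heartbeat_url).send().await?;

    if !resp.status().is_success() {
        // Registration likely expired: register again with the documented fields.
        client
            .post(format!("{}/api/v1/services/register", saga_base))
            .json(&serde_json::json!({
                "service_name": name,
                "service_url": service_url,
                "capabilities": ["rest"]
            }))
            .send()
            .await?;
    }
    Ok(())
}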

Monitoring

Health Endpoint

Endpoint: GET /api/v1/health

Response:

{
  "status": "healthy",
  "service": "saga",
  "version": "0.8.1",
  "redis": "connected",
  "cache": {
    "size": 5,
    "hits": 142,
    "misses": 18,
    "hit_ratio": 0.8875,
    "last_refresh": "2025-12-22T14:29:53.944455Z"
  }
}

Use Cases:

  • Container health checks
  • Load balancer health checks
  • Monitoring systems
  • Alerting systems

Service Metrics:

  • Status (healthy/unhealthy)
  • Version information
  • Redis connection status

Cache Metrics:

  • Cache size
  • Cache hits/misses
  • Cache hit ratio
  • Last refresh time
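
A small monitoring probe that consumes these metrics could look like the sketch below; reqwest and serde_json are assumptions, and the thresholds follow the Interpreting Metrics guidance above.

// Poll the health endpoint and flag a disconnected Redis or a low hit ratio.
async fn check_saga_health(base_url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let body: serde_json::Value = reqwest::get(format!("{}/api/v1/health", base_url))
        .await?
        .json()
        .await?;

    if body["redis"] != "connected" {
        eprintln!("saga health: redis is not connected");
    }
    if body["cache"]["hit_ratio"].as_f64().unwrap_or(1.0) < 0.5 {
        eprintln!("saga health: cache hit ratio below 0.5, consider tuning the cache");
    }
    Ok(())
}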

Integration:

  • Prometheus (future)
  • Custom metrics endpoint (future)
  • Structured logging

Logging

Library: tracing (Rust)

Log Levels:

  • error - Errors only
  • warn - Warnings and errors
  • info - Informational messages (default)
  • debug - Debug information
  • trace - Very verbose logging

Configuration:

RUST_LOG=saga=info cargo run
RUST_LOG=saga=debug cargo run
RUST_LOG=saga=trace cargo run
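
These filters are typically honored by initializing tracing-subscriber with an environment filter, along the lines of the sketch below (requires the env-filter feature; Saga's exact setup may differ):

use tracing_subscriber::EnvFilter;

// Read RUST_LOG (e.g. "saga=debug") and emit formatted log output.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();
}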

Request/Response:

  • HTTP method and path
  • Request duration
  • Response status
  • Error details

Service Operations:

  • Registration events
  • Discovery requests
  • Cache operations
  • Redis operations

Performance:

  • Cache hit/miss rates
  • Request latencies
  • Redis query times

Future Enhancements

Planned Features

Service Management:

  • Service health monitoring
  • Load balancing support
  • Service versioning
  • Multi-region support

API Enhancements:

  • GraphQL API
  • WebSocket subscriptions
  • Event streaming
  • Batch operations

Integration:

  • Service mesh integration
  • Kubernetes integration
  • Prometheus metrics
  • Distributed tracing

Caching:

  • Redis pub/sub for cache invalidation
  • Distributed caching
  • Cache warming strategies

Optimization:

  • Query optimization
  • Connection pooling improvements
  • Batch operations
  • Compression

Next Steps