Monitoring Microservices with LogFlux

Best practices for monitoring distributed microservices using centralized logging. Learn about correlation IDs, distributed tracing, service-level logging strategies, and performance monitoring.

Microservices architecture has revolutionized how we build scalable applications, but it’s also introduced new challenges in monitoring and debugging. When your application is split across dozens or hundreds of services, understanding what’s happening becomes exponentially more complex. This is where centralized logging with LogFlux becomes invaluable.

The Microservices Monitoring Challenge

In a microservices architecture, a single user request might pass through multiple services:

  1. API Gateway receives the request
  2. Authentication service validates the user
  3. Order service processes the order
  4. Payment service handles the transaction
  5. Notification service sends confirmations
  6. Audit service logs the activity

When something goes wrong, you need to trace the request across all these services quickly. Without proper logging, this becomes a nightmare.

Correlation IDs: Your Best Friend

The foundation of microservices monitoring is the correlation ID – a unique identifier that follows a request through all services:

// API Gateway
const uuid = require('uuid');

app.use((req, res, next) => {
  // Reuse an incoming correlation ID or generate a new one
  req.correlationId = req.headers['x-correlation-id'] || uuid.v4();
  req.logger = logger.child({
    correlationId: req.correlationId,
    service: 'api-gateway'
  });
  
  // Echo the ID back to the caller; downstream services receive it
  // via the headers of outgoing requests (see below)
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});

// Downstream service
const axios = require('axios');

async function callPaymentService(data, correlationId) {
  logger.info('Calling payment service', {
    correlationId: correlationId,
    service: 'order-service',
    action: 'payment-request'
  });
  
  const response = await axios.post('http://payment-service/process', data, {
    headers: {
      'x-correlation-id': correlationId
    }
  });
  
  logger.info('Payment service response', {
    correlationId: correlationId,
    service: 'order-service',
    status: response.status
  });
  
  return response.data;
}
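
Threading the correlation ID through every function signature gets tedious as a codebase grows. In Node services you can instead keep it in request-scoped storage. The sketch below uses Node's built-in AsyncLocalStorage; it assumes the same Express app, uuid module, and logger as the gateway example above, and the helper name getCorrelationId is made up for illustration.

// Request-scoped correlation IDs via AsyncLocalStorage (minimal sketch)
const { AsyncLocalStorage } = require('async_hooks');
const correlationStore = new AsyncLocalStorage();

// Bind the correlation ID to the lifetime of each request
app.use((req, res, next) => {
  const correlationId = req.headers['x-correlation-id'] || uuid.v4();
  correlationStore.run({ correlationId }, () => next());
});

// Read the ID anywhere deeper in the call stack without passing it explicitly
function getCorrelationId() {
  const store = correlationStore.getStore();
  return store ? store.correlationId : 'unknown';
}

logger.info('Reserving stock', {
  correlationId: getCorrelationId(),
  service: 'order-service'
});

The same lookup works for outgoing calls: read the ID from the store and set it as the x-correlation-id request header, exactly as callPaymentService does above.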

Service-Level Logging Strategy

1. Entry and Exit Logging

Log every service entry and exit point:

func ProcessOrderHandler(w http.ResponseWriter, r *http.Request) {
    correlationID := r.Header.Get("X-Correlation-Id")
    
    // Entry log
    logger.Info("Order processing started", map[string]interface{}{
        "correlationId": correlationID,
        "method": r.Method,
        "path": r.URL.Path,
        "service": "order-service",
    })
    
    defer func(start time.Time) {
        // Exit log with duration
        logger.Info("Order processing completed", map[string]interface{}{
            "correlationId": correlationID,
            "duration": time.Since(start).Milliseconds(),
            "service": "order-service",
        })
    }(time.Now())
    
    // Process order...
}
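
The same entry/exit pattern can be packaged as reusable middleware in Node services. This is a minimal sketch that assumes the Express setup and logger from the earlier examples; the field names simply follow the conventions used throughout this post.

// Entry/exit logging middleware for Express services (minimal sketch)
function entryExitLogger(serviceName) {
  return (req, res, next) => {
    const start = Date.now();
    const correlationId = req.headers['x-correlation-id'] || 'unknown';

    // Entry log
    logger.info('Request started', {
      correlationId,
      service: serviceName,
      method: req.method,
      path: req.path
    });

    // Exit log with status and duration, emitted when the response finishes
    res.on('finish', () => {
      logger.info('Request completed', {
        correlationId,
        service: serviceName,
        statusCode: res.statusCode,
        duration: Date.now() - start
      });
    });

    next();
  };
}

app.use(entryExitLogger('order-service'));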

2. Error Boundary Logging

Implement comprehensive error logging at service boundaries:

from functools import wraps
import traceback

def log_service_errors(service_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                correlation_id = kwargs.get('correlation_id', 'unknown')
                
                logger.error(f"Service error in {service_name}", {
                    'correlationId': correlation_id,
                    'service': service_name,
                    'error': str(e),
                    'stackTrace': traceback.format_exc(),
                    'function': func.__name__
                })
                
                raise
        return wrapper
    return decorator

@log_service_errors('inventory-service')
def check_inventory(product_id, quantity, correlation_id=None):
    # Business logic here
    pass

3. Performance Metrics

Track service performance across your architecture:

class PerformanceLogger {
  constructor(logger, serviceName) {
    this.logger = logger;
    this.serviceName = serviceName;
  }
  
  async trackOperation(operationName, operation, correlationId) {
    const start = Date.now();
    const memBefore = process.memoryUsage().heapUsed;
    
    try {
      const result = await operation();
      
      this.logger.info('Operation completed', {
        correlationId: correlationId,
        service: this.serviceName,
        operation: operationName,
        duration: Date.now() - start,
        memoryDelta: process.memoryUsage().heapUsed - memBefore,
        status: 'success'
      });
      
      return result;
    } catch (error) {
      this.logger.error('Operation failed', {
        correlationId: correlationId,
        service: this.serviceName,
        operation: operationName,
        duration: Date.now() - start,
        error: error.message,
        status: 'failure'
      });
      
      throw error;
    }
  }
}

// Usage
const perfLogger = new PerformanceLogger(logger, 'user-service');

await perfLogger.trackOperation(
  'fetchUserProfile',
  () => getUserProfile(userId),
  correlationId
);

Distributed Tracing with LogFlux

LogFlux makes it easy to trace requests across services:

Setting Up Service Context

// Configure LogFlux with service metadata
logger := logflux.New(logflux.Config{
    APIKey: os.Getenv("LOGFLUX_API_KEY"),
    DefaultFields: map[string]interface{}{
        "service": "payment-service",
        "version": os.Getenv("SERVICE_VERSION"),
        "environment": os.Getenv("ENVIRONMENT"),
        "pod": os.Getenv("HOSTNAME"),
    },
})

Implementing Request Tracing

import os
import time
from logflux import Logflux

class RequestTracer:
    def __init__(self, service_name):
        self.service_name = service_name
        self.logger = Logflux(
            api_key=os.environ['LOGFLUX_API_KEY'],
            service=service_name
        )

    def trace_request(self, correlation_id, operation):
        """Trace a request through the service"""
        trace = {
            'correlationId': correlation_id,
            'service': self.service_name,
            'operation': operation,
            'timestamp': time.time(),
            'spans': []
        }

        return RequestContext(self.logger, trace)

class RequestContext:
    def __init__(self, logger, trace):
        self.logger = logger
        self.trace = trace
        
    def start_span(self, name):
        span = {
            'name': name,
            'start': time.time()
        }
        self.trace['spans'].append(span)
        return span
        
    def end_span(self, span, status='success', metadata=None):
        span['end'] = time.time()
        span['duration'] = span['end'] - span['start']
        span['status'] = status
        if metadata:
            span['metadata'] = metadata
            
        self.logger.info(f"Span completed: {span['name']}", {
            'correlationId': self.trace['correlationId'],
            'span': span
        })

Health Check Monitoring

Monitor service health across your microservices:

// Health check endpoint with logging
app.get('/health', async (req, res) => {
  const healthChecks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    downstream: await checkDownstreamServices()
  };
  
  const overallHealth = Object.values(healthChecks).every(h => h.status === 'healthy');
  
  logger.info('Health check performed', {
    service: 'order-service',
    health: overallHealth ? 'healthy' : 'unhealthy',
    checks: healthChecks,
    timestamp: new Date().toISOString()
  });
  
  res.status(overallHealth ? 200 : 503).json({
    status: overallHealth ? 'healthy' : 'unhealthy',
    checks: healthChecks
  });
});
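
The checkDatabase, checkRedis, and checkDownstreamServices helpers above are application-specific. As one possible shape, here is a hedged sketch of checkDownstreamServices that probes dependent services' /health endpoints with a timeout; the service list and the 2-second timeout are assumptions, not part of the original example.

// Possible implementation of checkDownstreamServices (illustrative only)
const axios = require('axios');

const DOWNSTREAM_SERVICES = ['payment-service', 'inventory-service']; // assumed list

async function checkDownstreamServices() {
  const results = await Promise.all(
    DOWNSTREAM_SERVICES.map(async (name) => {
      try {
        const response = await axios.get(`http://${name}/health`, { timeout: 2000 });
        return { name, status: response.status === 200 ? 'healthy' : 'unhealthy' };
      } catch (error) {
        return { name, status: 'unhealthy', error: error.message };
      }
    })
  );

  // Report unhealthy if any dependency failed its check
  return {
    status: results.every(r => r.status === 'healthy') ? 'healthy' : 'unhealthy',
    services: results
  };
}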

Circuit Breaker Logging

Log circuit breaker state changes for better visibility:

// Returned when the breaker is open and calls are being short-circuited
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    logger    *logflux.Logger
    service   string
    state     string
    failures  int
    threshold int
}

func (cb *CircuitBreaker) Call(operation func() error, correlationID string) error {
    if cb.state == "open" {
        cb.logger.Warn("Circuit breaker is open", map[string]interface{}{
            "correlationId": correlationID,
            "service": cb.service,
            "state": cb.state,
        })
        return ErrCircuitOpen
    }
    
    err := operation()
    
    if err != nil {
        cb.failures++
        
        if cb.failures >= cb.threshold {
            cb.state = "open"
            cb.logger.Error("Circuit breaker opened", map[string]interface{}{
                "correlationId": correlationID,
                "service": cb.service,
                "failures": cb.failures,
                "threshold": cb.threshold,
            })
        }
        
        return err
    }
    
    if cb.failures > 0 {
        cb.failures = 0
        cb.logger.Info("Circuit breaker reset", map[string]interface{}{
            "correlationId": correlationID,
            "service": cb.service,
        })
    }
    
    return nil
}

Querying Across Services

Use LogFlux Inspector to trace requests across all services:

# Find all logs for a specific correlation ID
logflux query \
  --filter "correlationId:abc-123-def" \
  --sort timestamp

# Find slow requests across all services
logflux query \
  --filter "duration>1000" \
  --group-by service \
  --stats

# Track error rate by service
logflux stats \
  --filter "level:error" \
  --group-by service,hour \
  --interval 24h

# Find requests that touched specific services
logflux query \
  --filter "correlationId IN (SELECT correlationId FROM logs WHERE service='payment-service' AND status='failed')" \
  --sort timestamp

Dashboard for Microservices

Create a comprehensive dashboard in LogFlux:

# LogFlux Dashboard Configuration
name: Microservices Overview
widgets:
  - type: line_chart
    title: Request Rate by Service
    query: |
      SELECT time_bucket('1 minute', timestamp) AS minute,
             service,
             COUNT(*) AS requests
      FROM logs
      WHERE timestamp > NOW() - INTERVAL '1 hour'
      GROUP BY minute, service
      
  - type: heatmap
    title: Service Latency Heatmap
    query: |
      SELECT service,
             percentile_disc(0.5) WITHIN GROUP (ORDER BY duration) AS p50,
             percentile_disc(0.95) WITHIN GROUP (ORDER BY duration) AS p95,
             percentile_disc(0.99) WITHIN GROUP (ORDER BY duration) AS p99
      FROM logs
      WHERE duration IS NOT NULL
      GROUP BY service
      
  - type: table
    title: Recent Errors
    query: |
      SELECT timestamp, service, correlationId, error
      FROM logs
      WHERE level = 'error'
      ORDER BY timestamp DESC
      LIMIT 20
      
  - type: graph
    title: Service Dependencies
    query: |
      SELECT source_service, target_service, COUNT(*) as calls
      FROM service_calls
      GROUP BY source_service, target_service

Best Practices Summary

  1. Always use correlation IDs - They’re essential for tracing distributed requests
  2. Log at service boundaries - Entry, exit, and error points
  3. Include service metadata - Service name, version, instance ID
  4. Monitor health endpoints - Regular health checks with logging
  5. Track performance metrics - Duration, memory, CPU usage
  6. Implement circuit breakers - With proper logging of state changes
  7. Use structured logging - Makes querying and analysis much easier
  8. Set up alerting - For error rates, latency spikes, and service failures (see the example alert configuration below)
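
For item 8, the alert definition format depends on your LogFlux setup; treat the following as a hypothetical sketch modeled on the dashboard configuration above rather than a documented schema.

# Hypothetical alert configuration (schema assumed, not official)
name: Microservices Alerts
alerts:
  - name: Error rate spike
    query: |
      SELECT service, COUNT(*) AS errors
      FROM logs
      WHERE level = 'error' AND timestamp > NOW() - INTERVAL '5 minutes'
      GROUP BY service
    condition: errors > 50
    notify: ["on-call"]

  - name: Latency regression
    query: |
      SELECT service,
             percentile_disc(0.95) WITHIN GROUP (ORDER BY duration) AS p95
      FROM logs
      WHERE duration IS NOT NULL AND timestamp > NOW() - INTERVAL '5 minutes'
      GROUP BY service
    condition: p95 > 1000
    notify: ["on-call"]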

Conclusion

Monitoring microservices doesn’t have to be overwhelming. With proper logging strategies and LogFlux’s powerful querying capabilities, you can maintain complete visibility into your distributed system. The key is consistency – ensure all services follow the same logging patterns and always include correlation IDs.

Start implementing these patterns in your microservices today, and transform your monitoring from reactive firefighting to proactive system management. Your future self (and your on-call team) will thank you!