Java Application Observability: Monitoring, Tracing & Logging

1️⃣ Introduction

Observability is a critical aspect of modern Java application development, allowing teams to understand complex system behavior, identify performance bottlenecks, and quickly diagnose issues in production environments.

This comprehensive guide covers the key pillars of observability:

  • Metrics: Numerical data capturing system performance and behavior
  • Logging: Structured and contextualized event records
  • Tracing: Following request flows through distributed systems
  • Health monitoring: Real-time system status and alerts

By implementing effective observability practices, development teams can gain visibility into application behavior, improve troubleshooting, and enhance overall system reliability.

2️⃣ Key Concepts & Terminology

  • Metrics: Numerical measurements of system behaviors and performance
  • Distributed Tracing: Following requests across service boundaries
  • Structured Logging: Machine-parseable, contextual log data
  • Correlation IDs: Identifiers linking events across distributed services
  • Instrumentation: Code added to applications to generate observability data
  • Telemetry: The data collected for observability purposes
  • Cardinality: The number of distinct tag (label) value combinations a metric can have; each combination is stored as a separate time series
  • APM: Application Performance Monitoring
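
To make the cardinality term above concrete, here is a minimal sketch that multiplies the number of distinct values per tag to get the worst-case number of time series for one metric. The class name, the tag names, and the counts are all hypothetical and chosen only for illustration; no metrics library is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any metrics library: estimates the
// worst-case number of time series a single metric can produce.
final class CardinalityEstimate {

    // Multiplies the number of distinct values each tag can take.
    static long maxSeries(Map<String, Long> distinctValuesPerTag) {
        long series = 1;
        for (long distinct : distinctValuesPerTag.values()) {
            series = Math.multiplyExact(series, distinct);
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> tags = new LinkedHashMap<>();
        tags.put("method", 4L);   // assumed: GET, POST, PUT, DELETE
        tags.put("uri", 25L);     // assumed: 25 templated routes
        tags.put("status", 5L);   // assumed: one value per status class
        System.out.println(maxSeries(tags));   // 4 * 25 * 5 = 500

        // Adding an unbounded tag such as a user ID multiplies everything.
        tags.put("userId", 100_000L);
        System.out.println(maxSeries(tags));   // 500 * 100000 = 50000000
    }
}
```

The second result shows why identifiers such as user IDs or raw URLs are usually kept out of metric tags and recorded in logs or traces instead.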

3️⃣ Metrics Implementation

Metrics provide numerical data about application behavior and performance, allowing teams to monitor trends, identify anomalies, and set alerts.

🔹 Micrometer: Java Metrics Facade

Micrometer provides a vendor-neutral metrics collection API for Java applications.

<!-- Add the dependency to pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.10.2</version>
</dependency>

🔹 Creating and Recording Metrics

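import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;

// Create (or look up) a counter in the global registry, tagged by URI and method
Counter requestCounter = Metrics.counter("http.requests",
    "uri", "/api/users",
    "method", "GET");

// Increment the counter once per request
requestCounter.increment();

// Create a timer
Timer responseTimer = Metrics.timer("http.response.time",
    "uri", "/api/users");

// Time the call and record its duration; processRequest() stands in for
// the application's own request-handling method
responseTimer.record(() -> {
    return processRequest();
});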

🔹 Common Metrics Types

  • Counters: Monotonically increasing values that never decrease (e.g., request count)
  • Gauges: Values that can increase or decrease (e.g., active connections)
  • Timers: Measure duration of operations (e.g., response time)
  • Distribution Summaries: Record distribution of events (e.g., payload sizes)
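
The difference between these types is easiest to see in what each one stores. The sketch below uses only JDK classes as illustrative stand-ins; the class and method names are hypothetical and this is not how Micrometer implements its own Counter, Gauge, Timer, or DistributionSummary.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Illustrative stand-ins only; real metrics libraries have richer APIs.
final class MetricTypesSketch {

    // Counter: a running total that only ever goes up.
    static final class SimpleCounter {
        private final LongAdder total = new LongAdder();
        void increment() { total.increment(); }
        long count() { return total.sum(); }
    }

    // Gauge: a current value that can rise and fall.
    static final class SimpleGauge {
        private final AtomicInteger current = new AtomicInteger();
        void up() { current.incrementAndGet(); }
        void down() { current.decrementAndGet(); }
        int value() { return current.get(); }
    }

    // Distribution summary: count, total, and max of the recorded amounts.
    // A timer is the same idea with durations as the recorded amounts.
    static final class SimpleSummary {
        private long count;
        private double total;
        private double max;

        synchronized void record(double amount) {
            count++;
            total += amount;
            max = Math.max(max, amount);
        }
        synchronized long count() { return count; }
        synchronized double mean() { return count == 0 ? 0.0 : total / count; }
        synchronized double max() { return max; }
    }

    public static void main(String[] args) {
        SimpleCounter requests = new SimpleCounter();
        SimpleGauge activeConnections = new SimpleGauge();
        SimpleSummary payloadBytes = new SimpleSummary();

        requests.increment();
        requests.increment();
        activeConnections.up();
        activeConnections.up();
        activeConnections.down();   // a connection closed, so the gauge drops
        payloadBytes.record(200);
        payloadBytes.record(600);

        System.out.println("requests=" + requests.count());
        System.out.println("activeConnections=" + activeConnections.value());
        System.out.println("meanPayload=" + payloadBytes.mean());
    }
}
```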

4️⃣ Logging Best Practices

Effective logging provides context-rich information about application events and errors, aiding in troubleshooting and analysis.

🔹 Structured Logging with SLF4J and Logback

<!-- Add dependencies to pom.xml -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.6</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.4.5</version>
</dependency>

🔹 JSON Formatting for Machine Readability

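<!-- logback.xml configuration for JSON output.
     LogstashEncoder is not part of Logback itself; it comes from the separate
     net.logstash.logback:logstash-logback-encoder dependency. -->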
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>

🔹 Logging Best Practices

  • Include correlation IDs in all log entries
  • Use appropriate log levels (ERROR, WARN, INFO, DEBUG, TRACE)
  • Log actionable information with context
  • Include relevant metadata like user ID, request ID, and service name
  • Avoid logging sensitive information
  • Use structured formats (JSON) for machine parsing
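
As a rough illustration of several of the practices above (a correlation ID, a service name, and one JSON object per line), the following sketch renders a log event by hand using only the JDK. The class name, the field names, and the ID format are assumptions made for the example; in a real service the JSON encoder configured for the logging backend would normally produce this output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled sketch that shows the shape of a structured log event.
final class StructuredLogSketch {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Renders one event as a single JSON line; field order follows the map.
    static String toJsonLine(String level, String message, Map<String, String> context) {
        StringBuilder line = new StringBuilder("{");
        line.append("\"level\":\"").append(escape(level)).append("\",");
        line.append("\"message\":\"").append(escape(message)).append('"');
        for (Map.Entry<String, String> field : context.entrySet()) {
            line.append(",\"").append(escape(field.getKey())).append("\":\"")
                .append(escape(field.getValue())).append('"');
        }
        return line.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> context = new LinkedHashMap<>();
        context.put("correlationId", "req-7f3a");   // hypothetical ID format
        context.put("service", "order-service");    // hypothetical service name
        // prints {"level":"INFO","message":"Order accepted","correlationId":"req-7f3a","service":"order-service"}
        System.out.println(toJsonLine("INFO", "Order accepted", context));
    }
}
```

Because every event carries the same correlationId field, a log backend can filter on that one key to collect all entries for a single request.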

5️⃣ Distributed Tracing

Distributed tracing tracks requests as they flow through microservices, providing visibility into end-to-end transactions.

🔹 OpenTelemetry Integration

OpenTelemetry provides vendor-neutral APIs, libraries, and agents for collecting traces, metrics, and logs.

<!-- Add OpenTelemetry dependencies to pom.xml -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
    <version>1.22.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
    <version>1.22.0</version>
</dependency>

🔹 Creating Spans for Custom Tracing

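import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

// Start a new span for this operation. `tracer` is assumed to be an
// io.opentelemetry.api.trace.Tracer obtained from your OpenTelemetry instance.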
Span currentSpan = tracer.spanBuilder("processOrder")
    .setSpanKind(SpanKind.INTERNAL)
    .setAttribute("orderId", orderId)
    .startSpan();

try (Scope scope = currentSpan.makeCurrent()) {
    // Execute the business logic
    processOrderItems(orderId);
} catch (Exception e) {
    currentSpan.recordException(e);
    currentSpan.setStatus(StatusCode.ERROR, e.getMessage());
    throw e;
} finally {
    currentSpan.end();
}

🔹 Implementing Trace Context Propagation

For distributed tracing to work across service boundaries, context must be propagated through various transport mechanisms:

  • HTTP Headers: Pass trace context in standard headers such as the W3C Trace Context `traceparent` and `tracestate` headers
  • Message Queues: Include trace context in message metadata
  • RPC Frameworks: Pass context through gRPC metadata
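
For the HTTP case, the W3C Trace Context `traceparent` header is a common carrier. The sketch below parses and rebuilds a version 00 header using only the JDK, to show what actually travels between services. The class and method names are hypothetical; in practice OpenTelemetry's propagators read and write this header, so hand-written parsing like this is for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the version 00 "traceparent" layout:
//   00-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>
final class TraceParent {
    private static final Pattern VERSION_00 =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;    // shared by every span in the trace
    final String parentId;   // span ID of the caller
    final boolean sampled;   // lowest bit of the flags field

    TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Parses a header value; returns null when it is missing or malformed.
    static TraceParent parse(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = VERSION_00.matcher(headerValue.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero IDs are defined as invalid.
        if (traceId.matches("0+") || parentId.matches("0+")) {
            return null;
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParent(traceId, parentId, (flags & 0x01) != 0);
    }

    // Same trace, but the current service's span becomes the parent.
    TraceParent forOutgoingCall(String currentSpanId) {
        return new TraceParent(traceId, currentSpanId, sampled);
    }

    String toHeaderValue() {
        return "00-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent incoming =
            parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(incoming.traceId);
        // The span ID below is a made-up value for the current service's span.
        System.out.println(incoming.forOutgoingCall("b7ad6b7169203331").toHeaderValue());
    }
}
```

The trace ID stays constant across every hop while the parent ID changes at each service, which is what lets a tracing backend reassemble the spans into one tree.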

6️⃣ Health & Performance Monitoring

Real-time health monitoring enables proactive issue detection and resolution.

🔹 Spring Boot Actuator for Health Endpoints

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

# application.properties
management.endpoints.web.exposure.include=health,info,metrics,prometheus
# "always" shows health details to every caller; consider when-authorized in production
management.endpoint.health.show-details=always

🔹 Custom Health Indicators

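import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

// Package names below are those used by Spring Boot 2.x and 3.x
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component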
public class DatabaseHealthIndicator implements HealthIndicator {
    
    private final DataSource dataSource;
    
    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }
    
    @Override
    public Health health() {
        try (Connection conn = dataSource.getConnection()) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("SELECT 1");
                return Health.up()
                    .withDetail("database", "PostgreSQL")
                    .withDetail("version", getDatabaseVersion(conn))
                    .build();
            }
        } catch (SQLException e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
    
    private String getDatabaseVersion(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            try (ResultSet rs = stmt.executeQuery("SELECT version()")) {
                return rs.next() ? rs.getString(1) : "unknown";
            }
        }
    }
}

7️⃣ Observability Tools Ecosystem

A variety of tools are available for collecting, storing, and visualizing observability data:

🔹 Popular Observability Stacks

Stack                | Components                          | Best For
---------------------|-------------------------------------|-------------------------------------
ELK Stack            | Elasticsearch, Logstash, Kibana     | Log aggregation and analysis
Prometheus + Grafana | Prometheus, Alertmanager, Grafana   | Metrics collection and visualization
Jaeger               | Jaeger Collector, Query Service, UI | Distributed tracing
Datadog              | Unified SaaS platform               | Enterprise-scale observability
New Relic            | Unified SaaS platform               | Full-stack observability

8️⃣ Q&A / Frequently Asked Questions

🔹 What is the difference between monitoring and observability?

Monitoring focuses on tracking known system metrics and setting predefined alerts, while observability provides deeper insights into system behavior through metrics, logs, and traces. Observability allows for asking arbitrary questions about system state without predefined queries, enabling more effective troubleshooting of complex systems.

🔹 How do I choose the right observability tools?

Consider your specific requirements: deployment model (cloud vs. on-premise), budget, team expertise, integration capabilities, and scalability needs. For smaller teams, open-source solutions like Prometheus and Grafana may be sufficient. Larger organizations might benefit from commercial platforms like New Relic or Datadog that offer unified monitoring. Prioritize tools that support open standards like OpenTelemetry to avoid vendor lock-in.

🔹 Which metrics should I monitor in a Java application?

Start with the "Four Golden Signals": latency, traffic, errors, and saturation. For Java applications specifically, also monitor JVM metrics (heap usage, garbage collection stats, thread counts), application metrics (request rates, response times, error rates), system metrics (CPU, memory, disk I/O), and business metrics relevant to your domain. Balance comprehensive monitoring with the cost of collection and storage.
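
The JVM metrics mentioned above (heap usage, garbage collection stats, thread counts) can be read directly from the JDK's management beans. The following is a minimal sketch with a hypothetical JvmSnapshot class; Micrometer's JVM binders, which Spring Boot Actuator registers automatically, expose similar data, so hand-rolled collection like this is mainly useful for understanding where the numbers come from.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Point-in-time snapshot of a few JVM metrics, read from the JDK's MXBeans.
final class JvmSnapshot {
    final long heapUsedBytes;
    final int liveThreads;
    final long gcCount;        // collections since JVM start, all collectors
    final long gcTimeMillis;   // approximate accumulated collection time

    private JvmSnapshot(long heapUsedBytes, int liveThreads, long gcCount, long gcTimeMillis) {
        this.heapUsedBytes = heapUsedBytes;
        this.liveThreads = liveThreads;
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
    }

    static JvmSnapshot capture() {
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();

        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // A collector reports -1 when a value is undefined, so clamp at zero.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new JvmSnapshot(heapUsed, threads, count, time);
    }

    public static void main(String[] args) {
        JvmSnapshot snapshot = capture();
        System.out.println("heapUsedBytes=" + snapshot.heapUsedBytes);
        System.out.println("liveThreads=" + snapshot.liveThreads);
        System.out.println("gcCount=" + snapshot.gcCount);
        System.out.println("gcTimeMillis=" + snapshot.gcTimeMillis);
    }
}
```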

9️⃣ Best Practices & Pro Tips 🚀

  • Cover all of the "Three Pillars of Observability": logs, metrics, and traces
  • Implement correlation IDs to link events across distributed systems
  • Use structured logging formats for easier parsing and analysis
  • Avoid high-cardinality metrics that can overwhelm storage systems
  • Create dashboards for common troubleshooting scenarios
  • Set up proactive alerts based on SLIs/SLOs
  • Practice "observability-driven development" by considering telemetry needs during design
  • Regularly review and improve your observability implementation
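
To illustrate the SLI/SLO item above, here is one possible way to turn request counts into an availability SLI and an error-budget figure that an alert could be based on. The class name, the 99.9% target, and the request counts are assumptions for the example, not recommendations.

```java
// Sketch: availability SLI and error-budget consumption from request counts.
final class ErrorBudget {

    // SLI: fraction of requests that succeeded in the measurement window.
    static double availability(long totalRequests, long failedRequests) {
        if (totalRequests <= 0) {
            return 1.0;   // no traffic, so nothing has failed
        }
        return 1.0 - (double) failedRequests / totalRequests;
    }

    // Fraction of the error budget used: observed failure ratio divided by
    // the failure ratio the SLO allows. A value above 1.0 means the SLO is missed.
    static double budgetConsumed(long totalRequests, long failedRequests, double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("sloTarget must be between 0 and 1");
        }
        if (totalRequests <= 0) {
            return 0.0;
        }
        double observedFailureRatio = (double) failedRequests / totalRequests;
        double allowedFailureRatio = 1.0 - sloTarget;
        return observedFailureRatio / allowedFailureRatio;
    }

    public static void main(String[] args) {
        long total = 1_000_000;   // assumed request count for the window
        long failed = 400;        // assumed failures in the same window
        double slo = 0.999;       // assumed 99.9% availability target

        System.out.println("availability=" + availability(total, failed));
        System.out.println("budgetConsumed=" + budgetConsumed(total, failed, slo));
    }
}
```

With these numbers the SLO allows roughly 1,000 failures, so 400 failures use about 40% of the budget; an alert might fire when consumption passes a chosen threshold well before it reaches 100%.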

Conclusion

Implementing robust observability practices is critical for modern Java applications, particularly in distributed and microservice architectures. By combining metrics, logging, tracing, and health monitoring, development teams can gain comprehensive visibility into application behavior, leading to faster troubleshooting, more reliable systems, and improved user experiences.

Start with small, focused improvements to your observability stack, prioritizing the areas that provide the most value for your specific use cases. As your applications evolve, continuously refine your observability strategy to address new challenges and leverage emerging tools and practices.