Unified Observability Implementation for a Java & Node.js Microservices-Based SaaS Application on GKE

Observability and Monitoring with Prometheus and Grafana

Company Overview

Company Z provides a healthcare software-as-a-service application for government bodies. Its services run mainly on Google Cloud Platform (GCP), with a combined tech stack of GKE, Helm, GitHub Actions, Java (Spring Boot), Node.js (Express/Nest), PostgreSQL, and Redis.

Challenges Faced

Before this initiative, observability was fragmented across tools. Logs were available only in Stackdriver, roughly 40% of teams exposed application metrics, alerting was noisy and unreliable, there was no service-level tracing, and diagnosing a production error typically took one to two hours.

Solution And Implementation

The solution was to deploy a standardized, fully open-source observability stack on GKE using Helm:

Area           | Tool/Tech
Metrics        | Prometheus (Operator)
Dashboards     | Grafana with SSO
Logs           | Fluent Bit → Grafana Loki
Traces         | OpenTelemetry SDK → Collector → Tempo
Profiling      | Pyroscope (on critical services)
Alerts         | Grafana Alerting + Slack, PagerDuty webhooks
Error Tracking | Sentry.io integration (Node/Java SDKs)

Key Implementation Components

Cluster Environment

  • 3 GKE clusters: dev, staging, prod

  • Deployed observability stack per environment using Helm + Terraform

  • Used GCP Workload Identity for secure component-to-component communication

Metrics via Prometheus

  • Each monitoring component is deployed via Helm

  • Node.js: prom-client exposing custom app metrics (see the sketch after this list)

  • Java: Spring Boot with Micrometer Prometheus exporter

  • All services exported metrics for:

    • HTTP latency

    • Request count

    • DB queries

    • Queue processing times (Kafka, Pub/Sub)
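
As a sketch of the Node.js side only, the example below wires a prom-client registry with an HTTP latency histogram and a request counter into an Express app and exposes them on /metrics. The metric names, label sets, and bucket boundaries are illustrative assumptions rather than Company Z's actual definitions; the Java services expose the equivalent data through Micrometer's Prometheus endpoint.

```typescript
import express from "express";
import client from "prom-client";

const app = express();
const register = new client.Registry();

// Default process/runtime metrics (event loop lag, heap usage, etc.)
client.collectDefaultMetrics({ register });

// Illustrative custom metrics: names, labels, and buckets are assumptions.
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

// Observe every request as it completes.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    };
    end(labels);
    httpRequestsTotal.inc(labels);
  });
  next();
});

// Endpoint scraped by Prometheus.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
```

With the Prometheus Operator in place, a ServiceMonitor pointing at the /metrics port is what actually gets these series scraped.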

Grafana Dashboards

  • A single Grafana instance per cluster, pulling data from the federated Prometheus instances

  • SSO via Google Workspace (OAuth)

  • Dashboards:

    • GKE node and pod health

    • JVM metrics (heap, GC, threads)

    • Express.js API performance

    • ArgoCD sync metrics

    • Error rate from Sentry (via data source plugin)

Logs via Fluent Bit → Loki

  • Fluent Bit DaemonSet on each GKE node

  • Parsed logs using custom multiline parsers

  • Custom labels on log streams

  • Per-environment log retention periods
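
The case study does not say how the services format their log output, but multiline parsing and custom labelling are simplest when applications emit structured JSON with stable fields. The sketch below assumes a pino logger in a Node.js service; the field names (service, env) are placeholders that would have to match the Fluent Bit parsers and Loki labels actually configured.

```typescript
import pino from "pino";

// Illustrative structured logger; field names are assumptions and must match
// whatever the Fluent Bit parsers and Loki label rules expect.
const logger = pino({
  base: {
    service: process.env.SERVICE_NAME ?? "billing-api", // hypothetical service name
    env: process.env.APP_ENV ?? "dev",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

logger.info({ route: "/invoices", status: 200, duration_ms: 42 }, "request completed");

// Errors are logged as a single JSON record including the stack, so Fluent Bit's
// multiline handling only has to deal with stray non-JSON output.
try {
  throw new Error("upstream timeout");
} catch (err) {
  logger.error({ err }, "invoice sync failed");
}
```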

Tracing via OpenTelemetry → Tempo

  • OpenTelemetry SDK integrated:

    • Java (via Micrometer & OTLP exporter)

    • Node.js (using the OpenTelemetry Node SDK; see the sketch after this list)

  • Trace context passed across services via headers

  • OTEL Collector pushes traces to Tempo

  • The Grafana Tempo data source shows end-to-end spans across services
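
For the Node.js services, tracer bootstrap might look roughly like the sketch below: the OpenTelemetry Node SDK with auto-instrumentation and an OTLP exporter pointed at the in-cluster collector, which forwards spans to Tempo. The collector endpoint is an assumption, and the Java services get the equivalent through Micrometer and the OTLP exporter mentioned above.

```typescript
// tracing.ts - import this module before anything else in the service entry point.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// The collector endpoint is a placeholder for the OTEL Collector Service
// deployed via Helm, which pushes the spans on to Tempo.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url:
      process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ??
      "http://otel-collector:4318/v1/traces",
  }),
  // Auto-instrumentations cover Express/Nest HTTP handlers, pg, Redis clients, etc.,
  // and propagate trace context across services via headers.
  instrumentations: [getNodeAutoInstrumentations()],
});

// The service name is usually supplied via the OTEL_SERVICE_NAME environment variable.
sdk.start();

// Flush any remaining spans on shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

Keeping this in a separate module that is loaded first ensures the auto-instrumentation patches the HTTP, database, and Redis clients before the application code imports them.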

Profiling via Pyroscope

  • Deployed the Pyroscope agent on Java & Node services (see the sketch after this list)

  • Enabled CPU + memory flame graphs in Grafana

  • Developers could correlate slow traces to hot paths
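
On the Node.js side, attaching the profiler can be as small as the sketch below, assuming the @pyroscope/nodejs package; the server address and application name are placeholders for the in-cluster Pyroscope deployment, and the Java services would use the Pyroscope Java agent instead.

```typescript
import Pyroscope from "@pyroscope/nodejs";

// Server address and app name are placeholders, not Company Z's real values.
Pyroscope.init({
  serverAddress: process.env.PYROSCOPE_SERVER ?? "http://pyroscope:4040",
  appName: "billing-api", // hypothetical service name
});

// Starts continuous profiling; the resulting CPU and memory flame graphs
// are what show up in Grafana next to the traces.
Pyroscope.start();
```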

Alerts & Integrations

  • Alerts created via Grafana Alerting

  • Used Slack, PagerDuty, and Google Chat webhooks

  • Custom alert rules fine-tuned per application

  • Sentry SDK in apps for exception monitoring (also sent to Slack)
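
The Node.js side of the Sentry integration might look roughly like the sketch below; the DSN, environment, sample rate, and the runJob wrapper are placeholders, and the Java services use the corresponding Sentry Java/Spring Boot SDK.

```typescript
import * as Sentry from "@sentry/node";

// DSN, environment, and sample rate are placeholders.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.APP_ENV ?? "dev",
  tracesSampleRate: 0.1,
});

// Hypothetical wrapper: captured exceptions land in Sentry and, via its
// Slack integration, in the team channel.
async function runJob(job: () => Promise<void>): Promise<void> {
  try {
    await job();
  } catch (err) {
    Sentry.captureException(err);
    throw err;
  }
}
```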

Results

Metric                     | Before           | After
Time to diagnose errors    | 1–2 hours        | ~10 minutes (trace + logs combo)
Log discoverability        | Stackdriver only | Queryable, labeled logs in Grafana
Metric adoption            | 40% of teams     | 100% (standard templates)
Alert reliability          | Poor (spammy)    | 90% signal-based, routed per team
Service-level traceability | None             | 100% span coverage across all calls
Developer satisfaction     | 5/10             | 9/10

Conclusion

By standardizing observability on GKE using Grafana OSS tooling, Company Z unified:

  • Monitoring, logs, and traces across Java and Node services

  • End-to-end visibility for all production traffic

  • Proactive alerting for latency, resource pressure, and failures

This replaced fragmented tooling with an integrated system that scaled with both teams and services.