Unified Observability Implementation for a Java & Node.js Microservices-Based SaaS Application on GKE

Observability and Monitoring with Prometheus and Grafana

Company Overview

Company Z provides a healthcare software-as-a-service application for government bodies. Its services run mainly on Google Cloud Platform (GCP), with a combined tech stack of GKE, Helm, GitHub Actions, Java (Spring Boot), Node.js (Express/Nest), PostgreSQL, and Redis.

Challenges Faced

Before this initiative, observability was fragmented across tools. Logs were available only in Stackdriver, roughly 40% of teams exposed application metrics, alerting was noisy and unreliable, there was no service-level tracing, and diagnosing a production error typically took one to two hours.

Solution And Implementation

The solution was to deploy a standardized, fully open-source observability stack on GKE using Helm:

Area           | Tool/Tech
Metrics        | Prometheus (Operator)
Dashboards     | Grafana with SSO
Logs           | Fluent Bit → Grafana Loki
Traces         | OpenTelemetry SDK → Collector → Tempo
Profiling      | Pyroscope (on critical services)
Alerts         | Grafana Alerting + Slack, PagerDuty webhooks
Error Tracking | Sentry.io integration (Node/Java SDKs)

Key Implementation Components

Cluster Environment

  • 3 GKE clusters: dev, staging, prod

  • Deployed observability stack per environment using Helm + Terraform

  • Used GCP Workload Identity for secure component-to-component communication

Metrics via Prometheus

  • Each monitoring component is deployed via Helm

  • Node.js: prom-client exposing custom app metrics (see the sketch after this list)

  • Java: Spring Boot with Micrometer Prometheus exporter

  • All services exported metrics for:

    • HTTP latency

    • Request count

    • DB queries

    • Queue processing times (Kafka, Pub/Sub)
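
As a sketch of the Node.js side only, the example below wires a prom-client registry with an HTTP latency histogram and a request counter into an Express app and exposes them on /metrics. The metric names, label sets, and bucket boundaries are illustrative assumptions rather than Company Z's actual definitions; the Java services expose the equivalent data through Micrometer's Prometheus endpoint.

```typescript
import express from "express";
import client from "prom-client";

const app = express();
const register = new client.Registry();

// Default process/runtime metrics (event loop lag, heap usage, etc.)
client.collectDefaultMetrics({ register });

// Illustrative custom metrics: names, labels, and buckets are assumptions.
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

// Observe every request as it completes.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    };
    end(labels);
    httpRequestsTotal.inc(labels);
  });
  next();
});

// Endpoint scraped by Prometheus.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
```

With the Prometheus Operator in place, a ServiceMonitor pointing at the /metrics port is what actually gets these series scraped.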

Grafana Dashboards

  • A single Grafana instance per cluster, pulling data from the federated Prometheus instances

  • SSO via Google Workspace (OAuth)

  • Dashboards:

    • GKE node and pod health

    • JVM metrics (heap, GC, threads)

    • Express.js API performance

    • ArgoCD sync metrics

    • Error rate from Sentry (via data source plugin)

Logs via Fluent Bit → Loki

  • Fluent Bit DaemonSet on each GKE node

  • Parsed logs using custom multiline parsers

  • Custom labels on log streams

  • Per-environment log retention periods
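
The case study does not say how the services format their log output, but multiline parsing and custom labelling are simplest when applications emit structured JSON with stable fields. The sketch below assumes a pino logger in a Node.js service; the field names (service, env) are placeholders that would have to match the Fluent Bit parsers and Loki labels actually configured.

```typescript
import pino from "pino";

// Illustrative structured logger; field names are assumptions and must match
// whatever the Fluent Bit parsers and Loki label rules expect.
const logger = pino({
  base: {
    service: process.env.SERVICE_NAME ?? "billing-api", // hypothetical service name
    env: process.env.APP_ENV ?? "dev",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

logger.info({ route: "/invoices", status: 200, duration_ms: 42 }, "request completed");

// Errors are logged as a single JSON record including the stack, so Fluent Bit's
// multiline handling only has to deal with stray non-JSON output.
try {
  throw new Error("upstream timeout");
} catch (err) {
  logger.error({ err }, "invoice sync failed");
}
```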

Tracing via OpenTelemetry → Tempo

  • OpenTelemetry SDK integrated:

    • Java (via Micrometer & OTLP exporter)

    • Node.js (using the OpenTelemetry Node SDK; see the sketch after this list)

  • Trace context passed across services via headers

  • OTEL Collector pushes traces to Tempo

  • The Grafana Tempo data source shows end-to-end spans across services
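
For the Node.js services, tracer bootstrap might look roughly like the sketch below: the OpenTelemetry Node SDK with auto-instrumentation and an OTLP exporter pointed at the in-cluster collector, which forwards spans to Tempo. The collector endpoint is an assumption, and the Java services get the equivalent through Micrometer and the OTLP exporter mentioned above.

```typescript
// tracing.ts - import this module before anything else in the service entry point.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// The collector endpoint is a placeholder for the OTEL Collector Service
// deployed via Helm, which pushes the spans on to Tempo.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url:
      process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ??
      "http://otel-collector:4318/v1/traces",
  }),
  // Auto-instrumentations cover Express/Nest HTTP handlers, pg, Redis clients, etc.,
  // and propagate trace context across services via headers.
  instrumentations: [getNodeAutoInstrumentations()],
});

// The service name is usually supplied via the OTEL_SERVICE_NAME environment variable.
sdk.start();

// Flush any remaining spans on shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

Keeping this in a separate module that is loaded first ensures the auto-instrumentation patches the HTTP, database, and Redis clients before the application code imports them.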

Profiling via Pyroscope

  • Deployed the Pyroscope agent on Java & Node services (see the sketch after this list)

  • Enabled CPU + memory flame graphs in Grafana

  • Developers could correlate slow traces to hot paths
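
On the Node.js side, attaching the profiler can be as small as the sketch below, assuming the @pyroscope/nodejs package; the server address and application name are placeholders for the in-cluster Pyroscope deployment, and the Java services would use the Pyroscope Java agent instead.

```typescript
import Pyroscope from "@pyroscope/nodejs";

// Server address and app name are placeholders, not Company Z's real values.
Pyroscope.init({
  serverAddress: process.env.PYROSCOPE_SERVER ?? "http://pyroscope:4040",
  appName: "billing-api", // hypothetical service name
});

// Starts continuous profiling; the resulting CPU and memory flame graphs
// are what show up in Grafana next to the traces.
Pyroscope.start();
```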

Alerts & Integrations

  • Alerts created via Grafana Alerting

  • Used Slack, PagerDuty, and Google Chat webhooks

  • Custom alert rules fine-tuned per application

  • Sentry SDK in apps for exception monitoring (also sent to Slack)
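
The Node.js side of the Sentry integration might look roughly like the sketch below; the DSN, environment, sample rate, and the runJob wrapper are placeholders, and the Java services use the corresponding Sentry Java/Spring Boot SDK.

```typescript
import * as Sentry from "@sentry/node";

// DSN, environment, and sample rate are placeholders.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.APP_ENV ?? "dev",
  tracesSampleRate: 0.1,
});

// Hypothetical wrapper: captured exceptions land in Sentry and, via its
// Slack integration, in the team channel.
async function runJob(job: () => Promise<void>): Promise<void> {
  try {
    await job();
  } catch (err) {
    Sentry.captureException(err);
    throw err;
  }
}
```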

Results

Metric                     | Before           | After
Time to diagnose errors    | 1–2 hours        | ~10 minutes (trace + logs combo)
Log discoverability        | Stackdriver only | Queryable, labeled logs in Grafana
Metric adoption            | 40% of teams     | 100% (standard templates)
Alert reliability          | Poor (spammy)    | 90% signal-based, routed per team
Service-level traceability | None             | 100% span coverage across all calls
Developer satisfaction     | 5/10             | 9/10

Conclusion

By standardizing observability on GKE using Grafana OSS tooling, Company Z unified:

  • Monitoring, logs, and traces across Java and Node services

  • End-to-end visibility for all production traffic

  • Proactive alerting for latency, resource pressure, and failures

This replaced fragmented tooling with an integrated system that scaled with both teams and services.