Learn how to deploy comprehensive monitoring and observability for your Halo CMMS deployments using OpenTelemetry, Google Cloud Monitoring, and Cloud Trace.

Overview

The Halo CMMS supports observability through:
  • OpenTelemetry for metrics and traces collection
  • Google Cloud Monitoring for metrics visualization and alerting
  • Google Cloud Trace for distributed tracing
  • Automatic Java instrumentation for JVM metrics and RPC monitoring
  • Custom Halo metrics for CMMS-specific operations

Architecture

The monitoring setup consists of:
  1. OpenTelemetry Java Agent - Automatically instruments JVM applications
  2. OpenTelemetry Collector - Receives, processes, and exports telemetry data
  3. OpenTelemetry Operator - Manages collector deployment and instrumentation injection
  4. Google Cloud Services - Stores and visualizes metrics and traces

Prerequisites

Before you start:
  • Deploy a Halo component (Kingdom, Duchy, or Reporting Server); see Kingdom Deployment or Duchy Deployment
  • Have kubectl configured for your cluster
  • Have appropriate Google Cloud permissions

Google Cloud Configuration

1. Enable required APIs

Enable Cloud Monitoring and Cloud Trace APIs from the APIs and Services page:
gcloud services enable monitoring.googleapis.com
gcloud services enable cloudtrace.googleapis.com
2. Create service account

Create an IAM service account for OpenTelemetry with Workload Identity:
# Create service account
gcloud iam service-accounts create open-telemetry \
  --display-name="OpenTelemetry Collector"

# Grant monitoring and trace permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:open-telemetry@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/monitoring.metricWriter"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:open-telemetry@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/cloudtrace.agent"
3. Configure Workload Identity

Bind the Kubernetes service account to the Google Cloud service account:
gcloud iam service-accounts add-iam-policy-binding \
  open-telemetry@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[default/open-telemetry]"
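Workload Identity also requires the Kubernetes service account to carry the matching annotation. A minimal sketch of that manifest, assuming the service account lives in the default namespace (matching the member string above); replace PROJECT_ID with your project:

```yaml
# Kubernetes service account paired with the Google Cloud service account.
# The iam.gke.io/gcp-service-account annotation completes the Workload
# Identity link created by the IAM binding above.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: open-telemetry
  namespace: default
  annotations:
    iam.gke.io/gcp-service-account: open-telemetry@PROJECT_ID.iam.gserviceaccount.com
```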

OpenTelemetry Deployment

Install cert-manager

The OpenTelemetry Operator requires cert-manager for webhook certificates.
Version Compatibility: Ensure cert-manager, OpenTelemetry Operator, and collector image versions are compatible. See the Compatibility Matrix.

Recommended versions:
  • cert-manager: v1.18.2
  • OpenTelemetry Operator: v0.129.1
  • Collector image: Specified in the Halo configuration
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.18.2/cert-manager.yaml

Install OpenTelemetry Operator

Deploy the OpenTelemetry Operator:
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.129.1/opentelemetry-operator.yaml
Verify the operator is running:
kubectl get pods -n opentelemetry-operator-system

Create OpenTelemetry Configuration

The Halo dev environment provides reference configurations:
1. Generate configuration from CUE (optional)

If using the Halo CUE-based configuration:
bazel build //src/main/k8s/dev:open_telemetry_gke
The generated YAML will be in bazel-bin/src/main/k8s/dev/.
2. Customize configuration

Customize the generated or reference configuration for your environment:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: default
spec:
  mode: deployment
  serviceAccount: open-telemetry
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    
    processors:
      batch:
    
    exporters:
      googlecloud:
        project: YOUR_PROJECT_ID
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [googlecloud]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [googlecloud]
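If exported metrics do not map to the expected Kubernetes Container resource type, the collector may need resource detection so the Google Cloud exporter can attach cluster, namespace, and pod attributes. A sketch of that addition to the collector configuration above, assuming a contrib collector image that bundles the resourcedetection processor:

```yaml
    processors:
      batch:
      # Detect GCP/GKE resource attributes (project, cluster, namespace, pod)
      # so exported metrics map to the k8s_container monitored resource.
      resourcedetection:
        detectors: [gcp]
        timeout: 10s

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [resourcedetection, batch]
          exporters: [googlecloud]
```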
3. Apply configuration

Apply the OpenTelemetry configuration:
kubectl apply -f opentelemetry-config.yaml
Verify the collector is running:
kubectl get pods -l app.kubernetes.io/component=opentelemetry-collector
kubectl logs -l app.kubernetes.io/component=opentelemetry-collector

Enable Instrumentation

The OpenTelemetry Operator can automatically inject the Java agent into your pods.

Create Instrumentation Resource

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: open-telemetry-java-agent
spec:
  exporter:
    endpoint: http://default-collector:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1.0"
Apply the instrumentation:
kubectl apply -f instrumentation.yaml

Annotate Deployments

Add the instrumentation annotation to your pod specifications:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kingdom-v2alpha-public-api-server
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "open-telemetry-java-agent"
    spec:
      # ... rest of pod spec

Restart Deployments

Restart all deployments to pick up the Java agent instrumentation:
for deployment in $(kubectl get deployments -o name); do 
  kubectl rollout restart $deployment
done
Verify instrumentation is active:
# Check for JAVA_TOOL_OPTIONS environment variable
kubectl get pods -o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="JAVA_TOOL_OPTIONS")]}'

Available Metrics

Metrics are visible in Google Cloud Monitoring under the Workload domain.

Automatic Java Instrumentation Metrics

Class Loading:
  • jvm.class.count - Current number of loaded classes
  • jvm.class.loaded - Total number of classes loaded since JVM start
  • jvm.class.unloaded - Total number of classes unloaded since JVM start
CPU:
  • jvm.cpu.count - Number of available processors
  • jvm.cpu.recent_utilization - Recent CPU utilization
  • jvm.cpu.time - CPU time used by the JVM
Memory:
  • jvm.memory.committed - Amount of memory committed for JVM to use
  • jvm.memory.limit - Maximum amount of memory available
  • jvm.memory.used - Amount of memory currently used
  • jvm.memory.used_after_last_gc - Memory used after last garbage collection
Garbage Collection:
  • jvm.gc.duration - Time spent in garbage collection
Threads:
  • jvm.thread.count - Current number of threads
Client-side:
  • rpc.client.duration - Duration of RPC client requests
Server-side:
  • rpc.server.duration - Duration of RPC server requests
These metrics include labels for:
  • RPC method
  • Status code
  • Target service

Halo Custom Metrics

Thread Pool:
  • halo_cmm.thread_pool.size - Thread pool size
  • halo_cmm.thread_pool.active_count - Number of active threads
Computation:
  • halo_cmm.computation.stage.crypto.cpu.time - CPU time spent in cryptographic operations
  • halo_cmm.computation.stage.crypto.time - Wall-clock time for cryptographic operations
  • halo_cmm.computation.stage.time - Total time for computation stages
Retention:
  • halo_cmm.retention.deleted_measurements - Number of measurements deleted by retention policies
  • halo_cmm.retention.deleted_exchanges - Number of exchanges deleted by retention policies
  • halo_cmm.retention.cancelled_measurements - Number of measurements cancelled by retention policies

Health Checks and Diagnostics

Collector Health

Check OpenTelemetry Collector health:
# Collector pods
kubectl get pods -l app.kubernetes.io/component=opentelemetry-collector

# Collector logs
kubectl logs -l app.kubernetes.io/component=opentelemetry-collector --tail=100

# Check for export errors
kubectl logs -l app.kubernetes.io/component=opentelemetry-collector | grep -i error

Verify Metrics Export

Check if metrics are reaching Google Cloud Monitoring:
# Using gcloud
gcloud monitoring time-series list \
  --filter='metric.type=starts_with("workload.googleapis.com")' \
  --limit=10
Or visit the Metrics Explorer in Google Cloud Console.

Verify Traces Export

Check if traces are reaching Google Cloud Trace: Visit Cloud Trace in Google Cloud Console to see trace data.

Creating Dashboards

1. Navigate to Monitoring

Go to Google Cloud Monitoring in the Cloud Console.
2. Create dashboard

Click Dashboards > Create Dashboard.
3. Add charts

Add charts for key metrics:

Example: JVM Memory Usage
Resource Type: Kubernetes Container
Metric: workload.googleapis.com/jvm.memory.used
Filter: cluster_name = "your-cluster"
Grouping: container_name

Example: RPC Request Duration
Resource Type: Kubernetes Container
Metric: workload.googleapis.com/rpc.server.duration
Aggregator: 95th percentile
Filter: cluster_name = "your-cluster"

Example: Computation Stage Time
Resource Type: Kubernetes Container
Metric: workload.googleapis.com/halo_cmm.computation.stage.time
Filter: cluster_name = "your-cluster"
Grouping: stage_name

Setting Up Alerts

1. Create alert policy

Navigate to Monitoring > Alerting > Create Policy.
2. Configure conditions

Example alert conditions:

High Memory Usage:
Metric: workload.googleapis.com/jvm.memory.used
Condition: Above threshold
Threshold: 1.5 GB
Duration: 5 minutes

High RPC Error Rate:
Metric: workload.googleapis.com/rpc.server.duration
Filter: rpc.grpc.status_code != "OK"
Condition: Rate of change
Threshold: > 10 errors/minute

Pod Restart Rate:
Metric: kubernetes.io/container/restart_count
Condition: Rate of change
Threshold: > 5 restarts in 10 minutes
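The High Memory Usage condition can also be defined declaratively and created with a command along the lines of `gcloud alpha monitoring policies create --policy-from-file=policy.yaml`. A hedged sketch of the policy file, with field names following the Monitoring API AlertPolicy resource; the threshold value and filter are illustrative and should be adjusted for your cluster:

```yaml
# Cloud Monitoring alert policy for high JVM memory usage (illustrative).
displayName: "High JVM Memory Usage"
combiner: OR
conditions:
  - displayName: "jvm.memory.used above 1.5 GiB for 5 minutes"
    conditionThreshold:
      filter: >-
        metric.type="workload.googleapis.com/jvm.memory.used"
        AND resource.type="k8s_container"
      comparison: COMPARISON_GT
      thresholdValue: 1610612736  # 1.5 GiB in bytes
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
```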
3. Configure notifications

Set up notification channels:
  • Email
  • Slack
  • PagerDuty
  • SMS

Troubleshooting

Check collector status:
kubectl logs -l app.kubernetes.io/component=opentelemetry-collector
Verify service account permissions:
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:open-telemetry@PROJECT_ID.iam.gserviceaccount.com"
Check Workload Identity binding:
kubectl describe sa open-telemetry
Look for the iam.gke.io/gcp-service-account annotation.
Verify Instrumentation resource:
kubectl get instrumentation
kubectl describe instrumentation open-telemetry-java-agent
Check pod annotations:
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
Check operator logs:
kubectl logs -n opentelemetry-operator-system -l app.kubernetes.io/name=opentelemetry-operator
If you see warnings about high cardinality metrics:
  1. Review metric labels and reduce unnecessary dimensions
  2. Use metric filtering in the collector configuration
  3. Aggregate metrics before export
  4. Consider sampling high-volume metrics
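One way to apply step 2 above is the collector's filter processor, which drops selected metrics before export. A sketch assuming a contrib collector image recent enough to support OTTL conditions; the metric names here are examples, not a recommendation:

```yaml
    processors:
      # Drop selected high-volume metrics before export to reduce cardinality
      # and Cloud Monitoring ingestion cost.
      filter/metrics:
        metrics:
          metric:
            - 'name == "jvm.class.loaded"'
            - 'name == "jvm.class.unloaded"'

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [filter/metrics, batch]
          exporters: [googlecloud]
```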

Best Practices

Monitoring Strategy
  • Monitor both infrastructure (Kubernetes) and application (Halo) metrics
  • Set up alerts for critical issues (OOM, high error rates, pod crashes)
  • Create dashboards for different audiences (operators, developers, business)
  • Regularly review and update alert thresholds
  • Use distributed tracing to debug complex multi-service issues
Cost Management
  • Monitor Google Cloud Monitoring costs in the Billing console
  • Adjust metric retention periods based on needs
  • Use metric filtering to reduce exported metrics
  • Consider sampling for high-volume traces
  • Archive old metrics to Cloud Storage if needed
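Trace sampling can be reduced directly on the Instrumentation resource shown earlier; for example, lowering the parent-based ratio from 1.0 to 0.1 keeps roughly 10% of new traces while children still follow their parent's sampling decision:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: open-telemetry-java-agent
spec:
  exporter:
    endpoint: http://default-collector:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # sample ~10% of new traces
```

Annotated deployments must be restarted for a changed Instrumentation resource to take effect.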

Additional Resources