Senior Java SRE

DYNE IT Services • Santa Monica, CA

9h • Hybrid

About The Position

We are seeking a Senior Java Site Reliability Engineer (SRE) to architect, operate, and continuously improve hyperscale, globally distributed platforms on Google Cloud Platform (GCP). This role is highly hands-on and requires deep expertise in JVM performance engineering, Kubernetes/GKE at scale, cloud-native reliability engineering, and real-time streaming systems. The ideal candidate will act as a technical authority for availability, performance, scalability, and operational excellence, driving 99.99%+ uptime across multi-region systems while mentoring teams and shaping long-term platform reliability strategy.

Requirements

Java – Advanced JVM internals, GC tuning, performance optimization
Google Cloud Platform (GCP) – Professional-level expertise
Kubernetes / GKE – Multi-cluster, fleet, and Anthos architectures
Terraform, Docker, Infrastructure as Code
CI/CD – GitLab CI/CD, Jenkins
Kafka, Kafka Streams, ksqlDB
Spark Streaming
GCP Pub/Sub
Istio, Anthos Service Mesh
Nginx Controller, Seesaw
eBPF-based observability and networking diagnostics
Prometheus, Datadog, Splunk, Kiali
Linux/Unix, Bash scripting
Python or Go
Internal PaaS design and enablement
Multi-cluster Kubernetes governance
High-traffic SaaS or consumer-scale platforms
Google Professional Cloud Architect or Professional Cloud DevOps Engineer
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)

Nice To Haves

Experience operating hyperscale production systems
Strong background in real-time streaming & event-driven platforms
Proven leadership in incident response and reliability governance
Excellent communication skills and the ability to lead cross-functional teams

Responsibilities

Platform Architecture & Reliability
Architect and operate multi-region, globally distributed GCP platforms with 99.99%+ availability targets.
Define, implement, and govern SLIs, SLOs, error budgets, and reliability frameworks.
Lead incident command, production war rooms, post-incident RCA, and long-term remediation initiatives.
Design fault-tolerant systems with zero-downtime deployments and graceful degradation.
Java & JVM Performance Engineering
Engineer and tune high-throughput Java microservices at scale.
Apply deep expertise in JVM internals, garbage collection strategies, heap optimization, and memory profiling.
Identify and resolve performance bottlenecks under peak traffic conditions.
Kubernetes & GCP Infrastructure
Design and operate GKE at scale, including multi-cluster and fleet management.
Implement GCP-native architectures using GKE, Compute Engine, Cloud Load Balancing, Cloud Spanner, Bigtable, Cloud SQL, Pub/Sub, Cloud Storage, IAM, and VPC Service Controls.
Implement secure, repeatable infrastructure using Terraform and policy-as-code.
Service Mesh & Traffic Management
Architect advanced service mesh solutions using Istio / Anthos Service Mesh.
Implement traffic shaping strategies including canary, blue/green, and progressive rollouts.
Manage advanced ingress, load balancing, and routing using Nginx Controller and Seesaw.
Stateful & Streaming Systems
Design and operate stateful Kubernetes workloads using Portworx.
Support real-time, event-driven architectures using Kafka, Kafka Streams, ksqlDB, Spark Streaming, and GCP Pub/Sub.
Optimize systems for low-latency, high-throughput workloads.
Observability & Performance
Implement enterprise-grade observability using Prometheus, Datadog, Splunk, and Kiali.
Leverage eBPF for kernel-level tracing, networking diagnostics, and performance tuning.
Continuously enhance monitoring, alerting, and incident response maturity.
CI/CD & Platform Engineering
Architect high-scale CI/CD pipelines using GitLab CI/CD, Jenkins, and GCP-native tooling.
Build and evolve internal developer platforms (PaaS) to standardize deployments and reduce operational toil.
Automate operations using Python, Go, Bash, and custom reliability tooling.
Operational Excellence
Participate in on-call rotations, weekend releases, and critical incident management.
Provide 24×7 production support across U.S. time zones.
Champion SRE best practices across engineering and operations teams.