Senior Java SRE

DYNE IT Services
Santa Monica, CA
Hybrid

About The Position

We are seeking a Senior Java Site Reliability Engineer (SRE) to architect, operate, and continuously improve hyperscale, globally distributed platforms on Google Cloud Platform (GCP). This role is highly hands-on and requires deep expertise across Java and JVM performance engineering, Kubernetes/GKE at scale, cloud-native reliability engineering, and real-time streaming systems. The ideal candidate will act as a technical authority for availability, performance, scalability, and operational excellence, driving 99.99%+ uptime across multi-region systems while mentoring teams and shaping long-term platform reliability strategy.

Requirements

  • Java: Advanced JVM internals, GC tuning, performance optimization
  • Google Cloud Platform (GCP): Professional-level expertise
  • Kubernetes / GKE: Multi-cluster, fleet, and Anthos architectures
  • Terraform, Docker, Infrastructure as Code
  • CI/CD: GitLab CI/CD, Jenkins
  • Kafka, Kafka Streams, ksqlDB
  • Spark Streaming
  • GCP Pub/Sub
  • Istio, Anthos Service Mesh
  • Nginx Controller, Seesaw
  • eBPF-based observability and networking diagnostics
  • Prometheus, Datadog, Splunk, Kiali
  • Linux/Unix, Bash scripting
  • Python or Go
  • Internal PaaS design and enablement
  • Multi-cluster Kubernetes governance
  • High-traffic SaaS or consumer-scale platforms
  • Google Professional Cloud Architect or Professional Cloud DevOps Engineer
  • Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)

Nice To Haves

  • Experience operating hyperscale production systems
  • Strong background in real-time streaming & event-driven platforms
  • Proven leadership in incident response and reliability governance
  • Excellent communication skills with ability to lead cross-functional teams

Responsibilities

  • Platform Architecture & Reliability
      • Architect and operate multi-region, globally distributed GCP platforms with 99.99%+ availability targets.
      • Define, implement, and govern SLIs, SLOs, error budgets, and reliability frameworks.
      • Lead incident command, production war rooms, post-incident RCA, and long-term remediation initiatives.
      • Design fault-tolerant systems with zero-downtime deployments and graceful degradation.
  • Java & JVM Performance Engineering
      • Engineer and tune high-throughput Java microservices at scale.
      • Apply deep expertise in JVM internals, garbage collection strategies, heap optimization, and memory profiling.
      • Identify and resolve performance bottlenecks under peak traffic conditions.
  • Kubernetes & GCP Infrastructure
      • Design and operate GKE at scale, including multi-cluster and fleet management.
      • Implement GCP-native architectures using GKE, Compute Engine, Cloud Load Balancing, Cloud Spanner, Bigtable, Cloud SQL, Pub/Sub, Cloud Storage, IAM, and VPC Service Controls.
      • Implement secure, repeatable infrastructure using Terraform and policy-as-code.
  • Service Mesh & Traffic Management
      • Architect advanced service mesh solutions using Istio / Anthos Service Mesh.
      • Implement traffic shaping strategies including canary, blue/green, and progressive rollouts.
      • Manage advanced ingress, load balancing, and routing using Nginx Controller and Seesaw.
  • Stateful & Streaming Systems
      • Design and operate stateful Kubernetes workloads using Portworx.
      • Support real-time, event-driven architectures using Kafka, Kafka Streams, ksqlDB, Spark Streaming, and GCP Pub/Sub.
      • Optimize systems for low-latency, high-throughput workloads.
  • Observability & Performance
      • Implement enterprise-grade observability using Prometheus, Datadog, Splunk, and Kiali.
      • Leverage eBPF for kernel-level tracing, networking diagnostics, and performance tuning.
      • Continuously enhance monitoring, alerting, and incident response maturity.
  • CI/CD & Platform Engineering
      • Architect high-scale CI/CD pipelines using GitLab CI/CD, Jenkins, and GCP-native tooling.
      • Build and evolve internal developer platforms (PaaS) to standardize deployments and reduce operational toil.
      • Automate operations using Python, Go, Bash, and custom reliability tooling.
  • Operational Excellence
      • Participate in on-call rotations, weekend releases, and critical incident management.
      • Provide 24×7 production support across U.S. time zones.
      • Champion SRE best practices across engineering and operations teams.