Senior JAVA SRE

DYNE IT Services
McLean, VA
Hybrid

About The Position

We are seeking a Senior Java Site Reliability Engineer (SRE) to design, build, and operate highly resilient, low-latency, enterprise-scale systems supporting core banking, payments, and trading platforms. This role requires deep expertise across Java microservices, Kubernetes, AWS cloud infrastructure, and SRE best practices, with hands-on responsibility for reliability, scalability, and production excellence in high-transaction environments. The ideal candidate will operate at the L3/L4 production support level, lead reliability engineering initiatives, and work closely with platform, application, and security teams to ensure zero-downtime, compliance-aligned operations.

Requirements

  • Java: JVM internals, GC tuning, microservices architecture
  • Cloud: AWS (EKS, EC2, IAM, VPC, RDS, CloudWatch)
  • Containers & Orchestration: Kubernetes (CKA/CKS-level depth), Docker
  • Infrastructure as Code: Terraform
  • CI/CD: GitLab CI/CD, Jenkins
  • Streaming Platforms: Kafka, ksqlDB, Kafka Streams, Spark Streaming
  • Service Mesh: Istio, Anthos Service Mesh
  • Observability: Prometheus, Datadog, Splunk, Kiali
  • OS & Scripting: Linux/Unix, Bash
  • Programming: Python and/or Go
  • Virtualization: VMware
  • Networking & Performance: Nginx Controller, Seesaw, eBPF
  • Experience supporting core banking systems, payment gateways, or trading platforms
  • Exposure to high-frequency, high-volume transaction environments
  • Proven experience with zero-downtime deployments, high availability, and disaster recovery
  • Strong understanding of regulatory audits and financial compliance controls
  • AWS Certified Solutions Architect – Professional or AWS DevOps Engineer – Professional
  • Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
  • Experience Required: 15+ years

Responsibilities

  • Design, implement, and operate highly available, fault-tolerant, and scalable systems for mission-critical financial platforms.
  • Lead SRE practices including SLIs, SLOs, error budgets, and reliability-driven engineering decisions.
  • Provide L3/L4 production support, including incident management, root cause analysis (RCA), and post-incident remediation.
  • Drive continuous improvement through blameless postmortems and operational excellence initiatives.
  • Support and optimize Java-based microservices, including JVM internals, GC tuning, and performance optimization.
  • Operate and scale workloads on Kubernetes (EKS) across multi-cluster environments.
  • Implement and manage AWS services including EC2, EKS, IAM, VPC, RDS, DynamoDB, S3, and CloudWatch.
  • Design and maintain zero-downtime deployment strategies and robust disaster recovery (DR) architectures.
  • Build and manage infrastructure using Terraform and infrastructure-as-code best practices.
  • Automate operational workflows using Python, Go, Bash, and cloud-native tooling.
  • Architect and maintain enterprise-grade CI/CD pipelines using GitLab CI/CD, Jenkins, and Kubernetes-native integrations.
  • Manage Kubernetes networking, storage, and ingress using Nginx Controller, Seesaw, and advanced networking patterns.
  • Implement and operate service mesh solutions including Istio and Anthos Service Mesh.
  • Design and manage Kubernetes storage solutions using Portworx.
  • Support multi-cluster Kubernetes environments, including federation and cross-cluster communication.
  • Implement monitoring, logging, and alerting using Prometheus, Datadog, Splunk, Kiali, and custom dashboards.
  • Use eBPF for deep kernel-level observability, performance analysis, and system tuning.
  • Optimize latency, throughput, and scalability under high-frequency transaction loads.
  • Support real-time data platforms using Kafka, Kafka Streams, ksqlDB, and Spark Streaming.
  • Ensure reliability and performance of streaming pipelines in high-volume, low-latency environments.
  • Enforce banking-grade security controls, IAM policies, secrets management, and least-privilege access.
  • Support platforms aligned with SOC 2, PCI-DSS, SOX, and internal banking security standards.
  • Participate in regulatory audits, risk assessments, and compliance reviews.
  • Participate in 24×7 on-call rotations, including nights and weekends, supporting U.S. time zones.
  • Act as a senior escalation point during major incidents and platform outages.