Senior JAVA SRE

DYNE IT Services • McLean, VA

18d • Hybrid

About The Position

We are seeking a Senior Java Site Reliability Engineer (SRE) to design, build, and operate highly resilient, low-latency, enterprise-scale systems supporting core banking, payments, and trading platforms. This role requires deep expertise across Java microservices, Kubernetes, AWS cloud infrastructure, and SRE best practices, with hands-on responsibility for reliability, scalability, and production excellence in high-transaction environments. The ideal candidate will operate at L3/L4 production support level, lead reliability engineering initiatives, and work closely with platform, application, and security teams to ensure zero-downtime, compliance-aligned operations.

Requirements

Java: JVM internals, GC tuning, microservices architecture
Cloud: AWS (EKS, EC2, IAM, VPC, RDS, CloudWatch)
Containers & Orchestration: Kubernetes (CKA/CKS-level depth), Docker
Infrastructure as Code: Terraform
CI/CD: GitLab CI/CD, Jenkins
Streaming Platforms: Kafka, KSQLDB, Kafka Streams, Spark Streaming
Service Mesh: Istio, Anthos Service Mesh
Observability: Prometheus, Datadog, Splunk, Kiali
OS & Scripting: Linux/Unix, Bash
Programming: Python and/or Go
Virtualization: VMware
Networking & Performance: Nginx Controller, Seesaw, eBPF
Experience supporting core banking systems, payment gateways, or trading platforms
Exposure to high-frequency, high-volume transaction environments
Proven experience with zero-downtime deployments, high availability, and disaster recovery
Strong understanding of regulatory audits and financial compliance controls
AWS Certified Solutions Architect – Professional or AWS DevOps Engineer – Professional
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
Experience Required: 15+ Years

Responsibilities

Design, implement, and operate highly available, fault-tolerant, and scalable systems for mission-critical financial platforms.
Lead SRE practices including SLIs, SLOs, error budgets, and reliability-driven engineering decisions.
Provide L3/L4 production support, including incident management, root cause analysis (RCA), and post-incident remediation.
Drive continuous improvement through blameless postmortems and operational excellence initiatives.
Support and optimize Java-based microservices, including JVM internals, GC tuning, and performance optimization.
Operate and scale workloads on Kubernetes (EKS) across multi-cluster environments.
Implement and manage AWS services including EC2, EKS, IAM, VPC, RDS, DynamoDB, S3, and CloudWatch.
Design and maintain zero-downtime deployment strategies and robust disaster recovery (DR) architectures.
Build and manage infrastructure using Terraform and infrastructure-as-code best practices.
Automate operational workflows using Python, Go, Bash, and cloud-native tooling.
Architect and maintain enterprise-grade CI/CD pipelines using GitLab CI/CD, Jenkins, and Kubernetes-native integrations.
Manage Kubernetes networking, storage, and ingress using Nginx Controller, Seesaw, and advanced networking patterns.
Implement and operate service mesh solutions including Istio and Anthos Service Mesh.
Design and manage Kubernetes storage solutions using Portworx.
Support multi-cluster Kubernetes environments, including federation and cross-cluster communication.
Implement monitoring, logging, and alerting using Prometheus, Datadog, Splunk, Kiali, and custom dashboards.
Use eBPF for deep kernel-level observability, performance analysis, and system tuning.
Optimize latency, throughput, and scalability under high-frequency transaction loads.
Support real-time data platforms using Kafka, Kafka Streams, KSQLDB, and Spark Streaming.
Ensure reliability and performance of streaming pipelines in high-volume, low-latency environments.
Enforce banking-grade security controls, IAM policies, secrets management, and least-privilege access.
Support platforms aligned with SOC 2, PCI-DSS, SOX, and internal banking security standards.
Participate in regulatory audits, risk assessments, and compliance reviews.
Participate in 24×7 on-call rotations, including nights and weekends, supporting U.S. time zones.
Act as a senior escalation point during major incidents and platform outages.