Senior Site Reliability Engineer (DevOps, Java)

Exadel • Bulgaria, Georgia, Lithuania, Poland, Romania, Uzbekistan, GA

8h • Hybrid

About The Position

Exadel is seeking a Senior Site Reliability Engineer (DevOps, Java) to join a project focused on a taxi ordering service. The role involves designing, building, and operating reliable, scalable distributed systems, improving system availability and performance, and automating infrastructure and deployment processes. The engineer will also diagnose and resolve production issues, lead upgrades and migrations, participate in on-call rotations, and collaborate with development teams to improve operability. Other key responsibilities include driving best practices in monitoring, alerting, and capacity planning, reducing operational toil through automation, and contributing to incident management, post-mortems, and disaster recovery strategies.

Requirements

7+ years of professional experience, with a specialization in Kubernetes and AWS
Strong programming skills in Java, with willingness to learn Ruby
Solid understanding of concurrency, runtime behavior, and performance optimization
Hands-on experience with Docker and containerized workloads
Strong Kubernetes expertise (Deployments, StatefulSets, Services, Ingress, Helm, troubleshooting, autoscaling)
Strong AWS experience (EC2, EKS, RDS, S3, IAM, VPC, Load Balancers, CloudWatch)
Experience designing infrastructure for high availability and disaster recovery
Experience with CI/CD pipelines and Infrastructure as Code (Terraform, CloudFormation, Pulumi, or similar)
Experience with RabbitMQ or similar messaging systems (Kafka, SQS, Pulsar, etc.)
Strong understanding of relational databases (MySQL/PostgreSQL), including query optimization, replication, and failover strategies
Familiarity with NoSQL and in-memory databases (Redis, DynamoDB, MongoDB)
Experience with distributed systems, microservices, capacity planning, and fault tolerance
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry)
Strong understanding of Linux systems and networking fundamentals (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing)
Experience with SRE practices, including SLOs/SLIs/SLAs, load testing, resilience testing, and incident management
Strong communication skills and ability to collaborate across engineering teams
Calm and effective during incidents, with an ownership mindset

Nice To Haves

Experience operating production systems built on Ruby, Java, or other major platforms
Experience with frameworks such as Ruby on Rails, Spring Boot, or similar
Experience operating high-traffic SaaS platforms
Experience with cost optimization in cloud environments
Familiarity with chaos engineering practices
Experience mentoring junior engineers

Responsibilities

Design, build, and operate reliable, scalable distributed systems
Improve system availability, performance, and resilience
Automate infrastructure, deployments, and operational processes
Diagnose and resolve production issues
Lead upgrades and migrations with minimal or zero downtime
Participate in on-call rotations and incident response
Collaborate closely with development teams to improve operability
Drive best practices around monitoring, alerting, and capacity planning
Reduce operational toil through automation
Contribute to incident management, post-mortems, disaster recovery strategies, and continuous reliability improvements