Senior Site Reliability Engineer (DevOps, Java)

Exadel
Bulgaria, Georgia, Lithuania, Poland, Romania, Uzbekistan, GA
Hybrid

About The Position

Exadel is seeking a Senior Site Reliability Engineer (DevOps, Java) to join a project focused on a taxi ordering service. This role involves designing, building, and operating reliable, scalable distributed systems, improving system availability and performance, and automating infrastructure and deployment processes. The engineer will also be responsible for diagnosing and resolving production issues, leading upgrades and migrations, participating in on-call rotations, and collaborating with development teams to enhance operability. Key responsibilities include driving best practices in monitoring, alerting, and capacity planning, reducing operational toil through automation, and contributing to incident management, post-mortems, and disaster recovery strategies.

Requirements

  • 7+ years of experience, specializing in Kubernetes and AWS
  • Strong programming skills in Java, with willingness to learn Ruby
  • Solid understanding of concurrency, runtime behavior, and performance optimization
  • Hands-on experience with Docker and containerized workloads
  • Strong Kubernetes expertise (Deployments, StatefulSets, Services, Ingress, Helm, troubleshooting, autoscaling)
  • Strong AWS experience (EC2, EKS, RDS, S3, IAM, VPC, Load Balancers, CloudWatch)
  • Experience designing infrastructure for high availability and disaster recovery
  • Experience with CI/CD pipelines and Infrastructure as Code (Terraform, CloudFormation, Pulumi, or similar)
  • Experience with RabbitMQ or similar messaging systems (Kafka, SQS, Pulsar, etc.)
  • Strong understanding of relational databases (MySQL/PostgreSQL), including query optimization, replication, and failover strategies
  • Familiarity with NoSQL and in-memory databases (Redis, DynamoDB, MongoDB)
  • Experience with distributed systems, microservices, capacity planning, and fault tolerance
  • Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry)
  • Strong understanding of Linux systems and networking fundamentals (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing)
  • Experience with SRE practices, including SLOs/SLIs/SLAs, load testing, resilience testing, and incident management
  • Strong communication skills and ability to collaborate across engineering teams
  • Calm and effective during incidents with an ownership mindset

Nice To Haves

  • Experience operating production systems built on Ruby, Java, or other major platforms
  • Experience with frameworks such as Ruby on Rails, Spring Boot, or similar
  • Experience operating high-traffic SaaS platforms
  • Cost optimization in cloud environments
  • Chaos engineering practices
  • Experience mentoring junior engineers

Responsibilities

  • Design, build, and operate reliable, scalable distributed systems
  • Improve system availability, performance, and resilience
  • Automate infrastructure, deployments, and operational processes
  • Diagnose and resolve production issues
  • Lead upgrades and migrations with minimal or zero downtime
  • Participate in on-call rotations and incident response
  • Collaborate closely with development teams to improve operability
  • Drive best practices around monitoring, alerting, and capacity planning
  • Reduce operational toil through automation
  • Contribute to incident management, post-mortems, disaster recovery strategies, and continuous reliability improvements

Benefits

  • Medical healthcare
  • Recognition program
  • Ongoing learning & reimbursement
  • Well-being program
  • Team events & local benefits
  • Sports compensation
  • Referral bonuses
  • Top-tier equipment provision