SRE

Acclaim
Remote

About The Position

We are expanding our infrastructure team and looking for a Senior Site Reliability Engineer (SRE) to help build and operate an advanced platform for creating and managing AI agents. The platform can be deployed on-premises as an enterprise solution or offered as SaaS. It handles real-time voice and telephony, GPU and LLM inference, and streaming analytics, and runs in both cloud and on-prem environments, including in sensitive sectors such as banking. This role suits a strong, independent engineer who is passionate about building transparent, reliable systems. As a Senior SRE, you will have significant influence on how systems are built and operated, and you will also handle DevOps tasks. Your primary focus will be SRE principles: reliability, observability, incident management, and performance under load.

Requirements

  • 5+ years in SRE/DevOps.
  • Proven experience owning the reliability of high-load production systems.
  • Deep, practical understanding of Docker and Kubernetes, with production operational experience.
  • Mature understanding of metrics and alerts, with hands-on experience writing, tuning, and maintaining them.
  • Practical experience with Prometheus, Alertmanager, and Grafana.
  • Ability and willingness to build clear, useful dashboards that are easy to work with.
  • Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
  • Experience with load testing and basic capacity planning.
  • Proficiency in Python for automation, exporters, tooling, and related tasks.
  • Cloud experience with GCP and/or AWS.
  • Strong Linux skills.
  • Solid networking knowledge at an operational level.
  • DevOps fundamentals: CI/CD and infrastructure as code (e.g., GitHub Actions, Terraform, Ansible).
  • Willingness to understand and support the product in customer environments, including on-prem deployments.
  • Ownership mindset: taking responsibility, driving tasks to completion, and thinking ahead.
  • Friendly, non-toxic, and pleasant to work with.
  • Strong communication skills with developers: ability to clearly and constructively explain positions, defend them, and find common ground.
  • Willingness and ability to mentor, teach, and share knowledge.
  • Analytical mindset: ability to dig down to the root cause.
  • Proactivity: focus on preventing outages.
  • Strong attention to detail and reliability.

Nice To Haves

  • Experience using AI agents for routine and recurring tasks.
  • Real-time telephony experience (SIP, FreeSWITCH, RTP, WebRTC).
  • GPU/ML serving experience (Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM); understanding of LLM/ML model deployment specifics.
  • Streaming data and analytics experience (Kafka, ClickHouse).
  • Deep experience with IaC and GitOps (e.g., Terraform, Ansible, ArgoCD).
  • Logging experience with Loki/ELK.
  • gRPC experience.
  • Experience working in isolated and highly secure environments.
  • Experience preparing systems for significant growth in load.

Responsibilities

  • Own the reliability of services, including SLIs/SLOs and availability.
  • Identify and eliminate bottlenecks across the system.
  • Set up monitoring for services, including metrics, alerts, and dashboards.
  • Build and maintain Grafana dashboards for internal teams and customers.
  • Run load testing, analyze results, and provide recommendations on resources and scaling.
  • Investigate incidents, participate in on-call rotations, write and lead postmortems.
  • Ensure that failures do not recur.
  • Work closely with developers to communicate and defend technical positions, challenge decisions, and find collaborative solutions.
  • Develop and support Kubernetes-based infrastructure across cloud environments (GCP, AWS).
  • Automate routine work and assist with CI/CD and general team tasks.
  • Participate in delivering and supporting the platform for customers, including on-prem deployments.
  • Mentor colleagues and help raise the engineering bar across the team.

Benefits

  • 21 vacation days + public holidays
  • 5 sick days
  • Private English lessons via Preply
  • Fully remote across Europe