SRE

Acclaim

1d•Remote

About The Position

We are expanding our infrastructure team and are looking for a strong Site Reliability Engineer (SRE) to help build and operate an advanced platform for creating and managing AI agents. The platform can be deployed on-premises as an enterprise solution or offered as SaaS. It handles real-time voice and telephony, GPU and LLM inference, and streaming analytics, and runs in both cloud and on-prem environments, including sensitive sectors such as banking.

This is a role for an independent engineer who is passionate about building transparent, reliable systems. As a Senior SRE, you will have significant influence on how systems are built and operated, and you will also handle DevOps tasks. Your primary focus will be SRE principles: reliability, observability, incident management, and performance under load.

Requirements

5+ years of experience in SRE/DevOps roles.
Proven experience owning the reliability of high-load production systems.
Deep, practical understanding of Docker and Kubernetes, with production operational experience.
Mature understanding of metrics and alerts, with hands-on experience writing, tuning, and maintaining them.
Practical experience with Prometheus, Alertmanager, and Grafana.
Ability and willingness to build clear, useful, and easy-to-work-with dashboards.
Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
Experience with load testing and basic capacity planning.
Proficiency in Python for automation, exporters, tooling, and related tasks.
Cloud experience with GCP and/or AWS.
Strong Linux skills.
Solid networking knowledge at an operational level.
DevOps fundamentals: CI/CD and infrastructure as code (e.g., GitHub Actions, Terraform, Ansible).
Willingness to understand and support the product in customer environments, including on-prem deployments.
Ownership mindset: taking responsibility, driving tasks to completion, and thinking ahead.
Friendly, non-toxic, and pleasant to work with.
Strong communication skills with developers: ability to clearly and constructively explain positions, defend them, and find common ground.
Willingness and ability to mentor, teach, and share knowledge.
Analytical mindset: ability to dig down to the root cause.
Proactivity: focus on preventing outages.
Strong attention to detail and reliability.

Nice To Haves

Experience using AI agents for routine and recurring tasks.
Real-time telephony experience (SIP, FreeSWITCH, RTP, WebRTC).
GPU/ML serving experience (Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM); understanding of LLM/ML model deployment specifics.
Streaming data and analytics experience (Kafka, ClickHouse).
Deep experience with IaC and GitOps (e.g., Terraform, Ansible, ArgoCD).
Logging experience with Loki/ELK.
gRPC experience.
Experience working in isolated and highly secure environments.
Experience preparing systems for significant growth in load.

Responsibilities

Own the reliability of services, including SLIs/SLOs and availability.
Identify and eliminate bottlenecks across the system.
Set up monitoring for services, including metrics, alerts, and dashboards.
Build and maintain Grafana dashboards for internal teams and customers.
Run load testing, analyze results, and provide recommendations on resources and scaling.
Investigate incidents, participate in on-call rotations, and write and lead postmortems.
Ensure that failures do not recur.
Work closely with developers to communicate and defend technical positions, challenge decisions, and find collaborative solutions.
Develop and support Kubernetes-based infrastructure across cloud environments (GCP, AWS).
Automate routine work and assist with CI/CD and general team tasks.
Participate in delivering and supporting the platform for customers, including on-prem deployments.
Mentor colleagues and help raise the engineering bar across the team.