Senior DevOps Engineer

Acclaim

48d•Remote

About The Position

Our team brings a wealth of cutting-edge, specialized expertise in Machine Learning and Speech Technologies, which are used daily by hundreds of millions of people worldwide. We already have several major projects underway and are looking to strengthen our team with a DevOps/SRE Engineer!

Requirements

Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
Strong hands-on experience with Linux system administration
Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
Advanced expertise in Terraform, Ansible, and Python
Comfortable working in high-uncertainty environments: we are building a new product, requirements evolve quickly, and the ability to rapidly learn new technologies and patterns is essential
Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises

Nice To Haves

Experience with ML inference on GPU/CPU is a strong plus

Responsibilities

Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
Build and maintain Docker images for all microservices and ensure a stable service lifecycle
Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
Develop and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
Manage cluster access via NetBird VPN, implementing role-based access control using group policies
Deploy and manage infrastructure using IaC practices with Terraform and Ansible
Develop and continuously improve observability systems: Grafana and Prometheus for metrics, and the ELK stack for centralized log storage and analysis
Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD