Cloud Site Reliability Engineer

SambaNova Systems•San Jose, CA

48d•Hybrid

About The Position

The era of pervasive AI has arrived. In this era, organizations will use generative AI to unlock hidden value in their data, accelerate processes, reduce costs, drive efficiency and innovation to fundamentally transform their businesses and operations at scale. SambaNova Suite™ is the first full-stack, generative AI platform, from chip to model, optimized for enterprise and government organizations. Powered by the intelligent SN40L chip, the SambaNova Suite is a fully integrated platform, delivered on-premises or in the cloud, combined with state-of-the-art open-source models that can be easily and securely fine-tuned using customer data for greater accuracy. Once adapted with customer data, customers retain model ownership in perpetuity, so they can turn generative AI into one of their most valuable assets.

About SambaNova Systems

Join the company that's building the future of AI computing. At SambaNova, we are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.

The Role

As a Cloud Site Reliability Engineer (SRE) specializing in our AI Inferencing Service, you will be the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).
Strong programming/scripting skills in languages like Python, Go, or Java.
Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).
Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.

Nice To Haves

Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.
Direct experience supporting ML/AI inferencing services in production.
Familiarity with GPU-accelerated computing and with optimizing NVIDIA GPU workloads so they can be mapped to RDUs.
Knowledge of model serving frameworks like vLLM, SGLang or Ray.
Understanding of MLOps principles and practices.
Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).
Strong Linux/Unix system administration fundamentals.

Responsibilities

Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions.
Implement and support AI infrastructure in new regions, such as Asia, Europe, and Latin America, to support the growth of our business.
Participate in a balanced on-call rotation to provide 24/7 support for the service.
Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization.
Ensure alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
Proactively identify and eliminate performance bottlenecks.
Design and implement auto-scaling policies to handle variable inference loads cost-effectively.
Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates.
Automate manual toil identified during on-call shifts, reducing future operational overhead.
Forecast infrastructure needs based on product roadmaps and usage trends.
Work with finance and engineering teams to manage cloud costs and optimize spending.
Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.

Benefits

Competitive compensation
Equity
Excellent benefits
Flexible work environment
95% premium coverage for employee medical insurance
77% premium coverage for dependents
Health Savings Account (HSA) with employer contribution
Dental insurance
Vision insurance
Short- and long-term disability insurance
Basic Life insurance
Voluntary Life insurance
AD&D insurance
Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
Headspace subscription
Gympass+ membership with access to physical gyms
One Medical membership
Counseling services with an Employee Assistance Program