Intuitive • Posted 6 days ago
Full-time • Mid Level
Sunnyvale, CA
1-10 employees

We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. This role will focus on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems, leveraging platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and networking optimization.

  • Contribute to the deployment and maintenance of infrastructure across AWS, GCP, and Azure, as well as on-prem NVIDIA DGX systems.
  • Implement and manage Infrastructure as Code (IaC) using Terraform and Ansible for automated provisioning and configuration.
  • Support cloud and on-prem networking solutions for secure, high-performance connectivity.
  • Manage and optimize Domino Data Lab workflows and Slurm clusters for distributed training and inference.
  • Integrate and support NVIDIA Base Command for GPU-based compute environments.
  • Develop automation scripts and tools in Python to streamline operations and improve reliability.
  • Support CI/CD pipelines using GitLab, ensuring smooth deployments to UAT and production environments.
  • Implement and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, and cloud-native services.
  • Deploy and manage Kubernetes clusters (EKS, GKE) for scalable containerized workloads.
  • Troubleshoot complex workflows and ensure high availability of critical systems.
  • Collaborate with data science and engineering teams to optimize resource utilization and workflow efficiency.
  • Drive best practices for incident response, capacity planning, and system reliability in multi-cloud and HPC environments.
  • Administer and optimize ITSM platforms (e.g., Jira Service Management, ServiceNow) for release/change/incident workflows.
  • Support tooling across CI/CD, monitoring, and ticketing systems to ensure traceability and automation.
  • Maintain documentation and evidence for audits related to release/change/incident processes.
  • Partner with Compliance and InfoSec teams to ensure controls meet HIPAA, HITRUST, FDA GxP, and ISO 27001 standards.
  • Act as the primary liaison between engineering, product, support, and compliance teams for operational readiness.
  • Facilitate regular status updates, incident reviews, RCAs, and change planning sessions with stakeholders.
  • Support updates to onboarding materials and training sessions for engineers and product managers on release/change/incident protocols.
  • Promote a culture of ownership and reliability through education and process transparency.
  • Support retrospectives for major releases and incidents to identify process gaps and improvement opportunities.
  • Track and report on KPIs such as change success rate, incident recurrence, and release velocity.
  • Identify operational risks and escalate proactively to leadership.
  • Maintain escalation matrices and ensure readiness for high-severity incidents.
  • 5+ years of experience in Site Reliability Engineering or Cloud Infrastructure Engineering.
  • Strong proficiency in AWS and GCP; working knowledge of Azure.
  • Expertise in Terraform, Ansible, and IaC principles.
  • Solid understanding of networking fundamentals, VPC design, and security best practices.
  • Hands-on experience managing AI/ML workloads, including Domino Data Lab, Slurm, and GPU-based environments.
  • Advanced scripting and automation skills in Python.
  • Experience with CI/CD pipelines and release management using GitLab.
  • Strong troubleshooting skills and experience with observability tools (Prometheus, Grafana, ELK).
  • Hands-on experience with Kubernetes in AWS (EKS) and GCP (GKE).
  • Proficiency with NFS and NetApp Data ONTAP.
  • Strong Linux systems knowledge, including familiarity with file systems, kernel internals, cgroups, and environment variables.
  • Experience using debugging tools to analyze and troubleshoot complex systems.
  • Excellent communication and collaboration skills in cross-functional environments.
  • Education: Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field required.
  • Experience: Minimum of 7 years in technical operations, SRE, or IT service management roles.
  • Proven experience supporting release cycles, change governance, and incident response in regulated environments (e.g., healthcare, life sciences, financial services).
  • Familiarity with NVIDIA Base Command and GPU orchestration.
  • Knowledge of container tooling beyond Kubernetes (Docker, Helm).
  • Understanding of data security and compliance for AI/ML workloads.
  • Exposure to MLOps best practices and ML lifecycle management.
  • Master’s degree or certifications in ITIL, DevOps, or regulatory compliance preferred.