About The Position

We’re seeking a motivated, driven individual to join our team. As a key contributor to our production software, you’ll ensure the reliability, security, and scalability of systems spanning LLM inference infrastructure, AI/ML training pipelines, and intelligent automation. You’ll be instrumental in maintaining high uptime and seamless scalability, and in fostering an environment where new applications and services can thrive. As our emphasis on AI-powered capabilities grows, you’ll design and implement solutions that improve system stability, security, and scalability, and leverage LLMs and AI to increase operational efficiency and system intelligence, collaborating closely with developers, architects, and AI/ML engineers.

Requirements

  • Kubernetes Expertise: Deep understanding of Kubernetes architecture, components, and best practices, including orchestration of AI/ML and LLM inference workloads. Proficiency in building and operating Kubernetes clusters, deploying applications, and automating workflows with tools such as Helm and Kustomize.
  • Cloud Platforms: Experience with major public cloud providers and their cloud-native services, including GPU-accelerated compute and AI/ML platform services. Familiarity with infrastructure as code (IaC) tools like Terraform or Ansible.
  • SRE Principles: Adherence to SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation. Strong focus on reliability, availability, and performance.
  • Telemetry and Observability: Expertise in implementing and maintaining telemetry using tools like Splunk, Grafana, and Prometheus. Ability to analyze and troubleshoot complex system issues.
  • Programming: Proficiency in Go or Python for developing automation scripts, tools, and custom applications. Familiarity with the Python AI/ML ecosystem is a plus.
  • AI/ML Fundamentals: Understanding of LLM serving infrastructure, model deployment patterns, and AI/ML pipeline concepts (e.g., model training, fine-tuning, inference optimization).
  • Collaboration: Excellent interpersonal and communication skills. Ability to work effectively in cross-functional teams — including AI/ML engineering teams — and foster a collaborative environment.
  • BS or MS in Computer Science, or equivalent professional experience

Nice To Haves

  • Production & Non-Production Environments: Experience operating, monitoring, and prioritizing tasks across production and non-production environments, including AI/ML training and LLM serving clusters, with a strong operational focus.
  • LLM & AI Infrastructure: Experience deploying and managing large language model (LLM) inference services, GPU clusters, and AI/ML pipelines at scale.
  • Innovative Problem Solver: A track record of designing, building, and implementing innovative software solutions, including AI-driven automation and intelligent observability tools, to address existing challenges and anticipate future needs.
  • Documentation & Collaboration: Experience writing clear alert-handling procedures and runbooks that support knowledge transfer within and between SRE teams.
  • Automation Champion: Experience automating service deployment and orchestration in cloud environments, leveraging AI/ML and LLM-based tooling to streamline operations and reduce toil.
  • Resilience & Growth: Active participation in capacity planning, scale testing, and disaster recovery exercises to keep systems, including AI infrastructure, resilient.
  • Team Player: Ability to foster strong relationships with and provide support to partner teams such as engineering, QA, AI/ML, and program management.

Responsibilities

  • Establishing reliability practices for our private and public cloud services
  • Maintaining high uptime and seamless scalability
  • Designing and implementing solutions that enhance system stability, security, and scalability
  • Leveraging LLMs and AI to contribute to improved operational efficiency and system intelligence
  • Collaborating with developers, architects, and AI/ML engineers