Senior Site Reliability Engineer, Managed AI

CrusoeSan Francisco, CA
5h$172,000 - $209,000

About The Position

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure. About the Role: At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for a Senior Site Reliability Engineer with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

Requirements

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title) including: Defining and measuring SLIs/SLOs Building monitoring and observability systems Driving performance and reliability improvements Designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Nice To Haves

  • Experience scaling inference or training workloads for LLMs

Responsibilities

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

Benefits

  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit; $300 per month
© 2024 Teal Labs, Inc
Privacy PolicyTerms of Service