Senior+ Site Reliability Engineer

Crusoe
San Francisco, CA
$172,000 - $209,000

About The Position

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform, and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe’s GPU cloud.

This role is ideal for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform. You’ll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe’s incident management practices.

Requirements

  • 5+ years of experience in cloud operations, SRE, or related roles
  • Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Basic scripting and automation experience (Go, Python, C, C++, or similar)
  • Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
  • Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
  • A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement

Nice To Haves

  • Experience with Kubernetes, container orchestration, or large-scale distributed systems
  • Exposure to change management, operational readiness reviews, or structured RCAs
  • Familiarity with self-healing systems, automated remediation, or event-driven operations
  • Interest in scaling AI/HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for learning, mentorship, and developing deeper SRE capabilities over time

Responsibilities

  • Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs.
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews.
  • Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
  • Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability.
  • Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities.
  • Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness.
  • Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization.
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month