Principal, Customer Reliability Engineer

Crusoe
San Francisco, CA
$230,000 - $280,000

About The Position

As a Principal Customer Reliability Engineer, you define and elevate the technical reliability strategy of Crusoe Cloud at the company level. You are an organization-wide authority in distributed systems, AI/ML infrastructure, networking, storage, compute, Kubernetes, and cloud operations. Your impact extends beyond CX: you shape how Crusoe designs, deploys, and scales high-performance GPU infrastructure. This is not an escalation engineer role. It is a systems architect and reliability strategist role with direct impact on enterprise readiness and revenue protection.

Requirements

  • 12+ years of experience in distributed systems, SRE, DevOps, or HPC engineering.
  • Deep expertise in Linux internals, Kubernetes at scale, InfiniBand / RDMA, GPU cluster performance engineering, and large-scale AI/ML workloads.
  • Demonstrated ability to architect reliability systems, not just troubleshoot them.
  • Experience leading large-scale incident reform or platform redesign.
  • Exceptional cross-functional influence.
  • Strong executive communication skills.

Responsibilities

  • Define the technical vision for AI/ML workload reliability.
  • Architect guardrails across compute, storage, networking, and orchestration.
  • Partner with Product & Engineering to influence roadmap decisions impacting scalability and resilience.
  • Lead post-incident structural reforms for major outages.
  • Define enterprise-grade incident management standards.
  • Establish reliability metrics that align with ARR protection and expansion.
  • Evaluate and improve Kubernetes multi-cluster design, software-defined networking, IB fabric architecture, GPU lifecycle management, and observability frameworks.
  • Drive automation-first operational maturity.
  • Serve as technical spokesperson during high-severity events.
  • Build enterprise confidence in Crusoe’s technical depth.
  • Contribute to technical thought leadership (blogs, architecture reviews, customer briefings).
  • Mentor Sr. Staff engineers.
  • Raise the hiring bar for advanced infrastructure roles.
  • Create technical learning frameworks for HPC & AI operations.
  • Work on tooling and automation for the CX team.
  • Engage with customers during their onboarding phase.
  • Work on executive-level escalations and high-priority incidents.

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300/month