Senior Site Reliability Engineer, Compute

CrusoeSunnyvale, CA
36d$166,000 - $201,000

About The Position

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

Requirements

  • 5+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
  • Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
  • Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
  • Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel bypass techniques.
  • Expert-level skills in at least one programming language: Go, C or Rust.
  • Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.
  • Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.
  • Strong understanding of compute scheduling, resource management, and high-throughput networking.

Responsibilities

  • Develop automation and observability tools to monitor Crusoe’s compute infrastructure, spanning from the kernel to orchestration layers.
  • Support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors.
  • Collaborate with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks, driver issues, and optimize hardware offloads.
  • Optimize performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources.
  • Participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation.
  • Tune kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling.
  • Work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs

Benefits

  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit; $300 per pay period
© 2024 Teal Labs, Inc
Privacy PolicyTerms of Service