Staff Software Engineer, Model LifeCycle

Crusoe, Sunnyvale, CA
$208,725 - $253,000

About The Position

Crusoe is seeking a Staff Software Engineer to join our Model LifeCycle team, where you will be a key architect of a managed platform designed for the next generation of AI application development. In this role, you will build the infrastructure that allows developers to use Large Language Models (LLMs) and advanced foundation models at an unprecedented scale. By focusing on the end-to-end model development lifecycle, you will ensure that Crusoe’s sustainable, high-performance cloud remains the platform of choice for the world’s most advanced AI builders. As a Staff Engineer, you will have significant scope for ownership, taking core systems from design through high-impact implementation and building them from first principles. You will bridge the gap between complex research and robust production systems, creating the abstractions and APIs that will define how models are trained, managed, and deployed. This is a full-time position for a seasoned engineer who is passionate about merging deep AI infrastructure with world-class systems engineering.

Requirements

  • Deep Engineering Foundations: 8–10+ years of industry experience with a demonstrated history of leading a varied portfolio of high-impact technical initiatives.
  • Production Excellence: A proven track record of delivering complex production features on time and at scale within a fast-paced environment.
  • Cloud Infrastructure Expertise: Hands-on experience with core cloud-based services, including elastic compute, object storage, virtual private networks, and managed databases.
  • Generative AI Mastery: Practical experience with Generative AI (LLMs, Multimodal) and the underlying infrastructure required for both training and inference.
  • Systemic Autonomy: A proactive, collaborative approach with the ability to drive independent workstreams while aligning with broader team goals.
  • Communication & Passion: Strong interpersonal skills and a visible passion for solving the industry's most challenging technical problems in the AI space.
  • Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.

Nice To Haves

  • Production Language Proficiency: Advanced skills in Golang or Python for building large-scale, production-level services.
  • Framework Expertise: Deep experience working with PyTorch and a history of training and fine-tuning LLMs in production environments.
  • Performance Optimization: Experience with GPU system optimizations and performance tuning for inference frameworks.
  • Open-Source Contributions: A background in contributing to or maintaining open-source AI projects.
  • Aspirational Drive: A desire to build "gold standard" infrastructure that aligns the future of computing with the future of the climate.

Responsibilities

  • Fine-Tuning System Development: Contribute to the development of sophisticated fine-tuning systems (SFT, PEFT, LoRA, adapters), ensuring reliable multi-node orchestration, checkpointing, and failure recovery.
  • End-to-End Training Pipelines: Implement and maintain robust training runtimes for LLMs, including distillation and reinforcement learning pipelines such as preference and policy optimization.
  • Agent Execution Infrastructure: Develop and maintain the scalable, high-performance infrastructure required for reliable agentic execution and complex model workflows.
  • Lifecycle Management Features: Implement enterprise-grade features for dataset, model, and experiment management, focusing on versioning, lineage, and reproducible fine-tuning at scale.
  • Collaborative API Design: Work closely with Principal Engineers and product teams to shape the core abstractions and APIs that power the Crusoe AI ecosystem.
  • Architectural Strategy: Contribute to mission-critical decisions regarding training runtimes, scheduling, storage, and the long-term evolution of model lifecycle management.
  • Ecosystem Engagement: Engage with the open-source LLM community to ensure our platform stays at the cutting edge of AI innovation.

Benefits

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) Retirement plan with company match up to 4% of salary
  • Volunteer time off