Principal Engineer, AI Model LifeCycle

Crusoe
Sunnyvale, CA
$260,000 - $326,000

About The Position

Crusoe is seeking a visionary Principal Software Engineer for our Model LifeCycle team to architect a comprehensive managed platform for the next generation of AI development. In this high-impact role, you will be the technical authority responsible for the entire application development lifecycle, specifically optimized for Large Language Models (LLMs) and other advanced foundation models. By building these core systems from first principles, you will enable developers to leverage Crusoe’s sustainable, high-performance infrastructure to push the boundaries of what is possible in AI. As a Principal Engineer, you will have significant 0 → 1 ownership, designing mission-critical abstractions and influencing long-term architectural decisions across training runtimes, scheduling, and storage. This is a full-time position for a seasoned expert who thrives on technical complexity and is eager to lead the industry toward a more sustainable and powerful AI future.

Requirements

  • Advanced Technical Foundation: An advanced degree (Master's or PhD) in Computer Science, Engineering, or a related technical field.
  • Extensive Industry Experience: 10–15+ years of professional experience driving high-impact engineering projects, with a significant tenure dedicated to the AI/ML space.
  • 0 → 1 Delivery Track Record: A proven history of architecting and delivering early-stage, foundational projects under tight deadlines and high-growth conditions.
  • Cloud Infrastructure Mastery: Expert-level proficiency in cloud-based services, including elastic compute, object storage, virtual private networks, and managed databases.
  • Generative AI Expertise: Deep, hands-on experience in Generative AI (LLMs, Multimodal) and the underlying infrastructure required for both training and large-scale inference.
  • Leadership & Communication: Exceptional interpersonal skills with the ability to work autonomously while proactively collaborating with stakeholders at all levels.

Nice To Haves

  • High-Scale Production Skills: Advanced proficiency in Golang or Python for building large-scale, production-level services.
  • Open-Source Contributions: Visible contributions to prominent AI projects such as vLLM, DeepSpeed, or similar high-performance frameworks.
  • Hardware & Performance Optimization: Deep experience with GPU system optimizations and inference framework performance tuning.
  • Deep Learning Frameworks: Extensive experience working specifically with PyTorch and specialized LLM fine-tuning libraries.
  • Technical Passion: A demonstrated obsession with building cutting-edge AI products and solving the industry’s most challenging, seemingly "impossible" technical problems.

Responsibilities

  • Model Fine-Tuning Orchestration: Design and manage sophisticated systems for fine-tuning large foundation models (SFT, PEFT, LoRA, adapters), ensuring seamless multi-node orchestration, checkpointing, and cost-efficient scaling.
  • End-to-End Training Pipelines: Implement and maintain robust training runtimes for LLMs, including distillation and reinforcement learning pipelines such as PPO, DPO, and reward modeling.
  • Agent & Execution Infrastructure: Architect the underlying infrastructure required for reliable agentic execution and complex model-driven workflows.
  • Lifecycle Management Systems: Develop enterprise-grade systems for dataset, model, and experiment management, emphasizing versioning, lineage, and reproducible fine-tuning at scale.
  • Strategic Architectural Influence: Drive long-term decisions regarding training runtimes and storage, shaping the core APIs that will define the user experience of Crusoe’s AI platform.
  • Cross-Functional Collaboration: Partner closely with product, business, and platform teams to translate high-level goals into scalable, performant technical realities.
  • Open-Source Engagement: Represent Crusoe within the open-source LLM ecosystem, contributing to and staying ahead of industry-standard frameworks and tools.

Benefits

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) Retirement plan with company match up to 4% of salary
  • Volunteer time off

What This Job Offers

Job Type

Full-time

Career Level

Principal

Education Level

Ph.D. or professional degree

Number of Employees

251-500 employees