Senior Staff Software Engineer, Model LifeCycle

Crusoe
Sunnyvale, CA
$237,600 - $288,000

About The Position

Crusoe is seeking a visionary Senior Staff Engineer to join our Model LifeCycle team, where you will architect the backbone of our managed AI application platform. In this high-impact role, you will lead the development of a comprehensive ecosystem for the entire model development lifecycle, specifically optimized for Large Language Models (LLMs) and advanced Machine Learning workflows. By building these core systems from first principles, you will empower developers to harness Crusoe’s sustainable high-performance computing power to build the next generation of AI-driven applications. As a technical leader, you will take significant 0 → 1 ownership, designing and implementing mission-critical abstractions and APIs that define how models are trained, managed, and deployed at scale. This is a full-time position for a foundational engineer who is passionate about blending deep AI expertise with robust systems engineering to solve the industry's most challenging infrastructure hurdles.

Requirements

  • Advanced Technical Foundation: An advanced degree (Master's or PhD) in Computer Science, Engineering, or a related technical field.
  • Deep Industry Experience: 8–12+ years of professional experience driving high-impact engineering projects, with a significant portion dedicated specifically to the AI/ML space.
  • Cloud Infrastructure Expertise: Expert-level proficiency in leveraging cloud-based services, including elastic compute, object storage, virtual networking, and managed databases to build scalable systems.
  • Generative AI Mastery: Deep technical experience in Generative AI, specifically focusing on the infrastructure requirements for LLM training and large-scale inference.
  • Rapid Project Delivery: A proven track record of architecting and delivering 0 → 1 projects under tight deadlines while maintaining high engineering standards.
  • Collaborative Leadership: Strong interpersonal skills with a proactive approach to autonomy, mentorship, and cross-functional problem-solving.

Nice To Haves

  • Production Language Proficiency: Advanced skills in Golang or Python specifically for building large-scale, production-ready services.
  • Open-Source Contributions: Active contributions to prominent AI projects such as vLLM, DeepSpeed, or similar high-performance frameworks.
  • Hardware Optimization: Experience with GPU performance tuning, CUDA kernels, or specialized inference framework optimizations.
  • Framework Expertise: Deep hands-on experience with PyTorch and specialized libraries for LLM training and fine-tuning.
  • Aspirational Mindset: A visible passion for solving "impossible" technical problems and a desire to build cutting-edge products that redefine the AI landscape.

Responsibilities

  • Model Fine-Tuning Orchestration: Design and manage sophisticated fine-tuning systems for large foundation models, incorporating SFT and PEFT methods such as LoRA and adapters, while ensuring multi-node orchestration, checkpointing, and cost-efficient scaling.
  • End-to-End Training Pipelines: Implement and maintain robust training runtimes for LLMs, including distillation and reinforcement learning pipelines such as policy and preference optimization (PPO/DPO) and reward modeling.
  • Agent & Execution Infrastructure: Build and scale the underlying infrastructure required for reliable agent execution and complex model-driven workflows.
  • Lifecycle Management: Develop comprehensive systems for dataset, model, and experiment management, ensuring rigorous versioning, lineage, and reproducible fine-tuning at an enterprise scale.
  • Strategic Architectural Leadership: Influence long-term decisions regarding training runtimes, scheduling, and storage, shaping the core abstractions that will define Crusoe’s platform.
  • Cross-Functional Collaboration: Partner closely with product, business, and platform teams to translate complex technical requirements into intuitive, high-performance system APIs.
  • Ecosystem Engagement: Actively contribute to and engage with the open-source LLM community to ensure Crusoe remains at the forefront of AI infrastructure innovation.

Benefits

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) retirement plan with company match up to 4% of salary
  • Volunteer time off