Principal Engineer, AI Model LifeCycle

Crusoe
San Francisco, CA
$256,000 - $320,000

About The Position

The Principal Software Engineer for the Model LifeCycle team will play a crucial role in building a comprehensive managed platform for the entire application development lifecycle, with a specific focus on leveraging Machine Learning models, including Large Language Models (LLMs).

Requirements

  • Advanced degree in Computer Science, Engineering, or a related field.
  • 10-15+ years of industry experience driving impactful projects in the AI space.
  • Proven track record of delivering early-stage projects under tight deadlines.
  • Expertise in using cloud-based services such as elastic compute, object storage, virtual private networks, and managed databases.
  • Experience in Generative AI (Large Language Models, Multimodal).
  • Deep experience with AI infrastructure, including training and inference.

Nice To Haves

  • Proficiency in Golang or Python for large-scale, production-level services.
  • Contributions to open-source AI projects such as vLLM or similar frameworks.
  • Performance optimizations on GPU systems and inference frameworks.
  • Experience working with PyTorch.
  • Experience with training and fine-tuning LLMs.
  • Proactive and collaborative approach with the ability to work autonomously.
  • Strong communication and interpersonal skills.
  • Passion for building cutting-edge AI products and solving challenging technical problems.

Responsibilities

  • Manage fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling.
  • Implement and maintain end-to-end training pipelines for Large Language Models.
  • Build distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling).
  • Build agent execution infrastructure.
  • Own dataset, model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale.
  • Work closely with product, business, and platform teams to shape the core abstractions and APIs of the system.
  • Influence long-term architectural decisions around training runtimes, scheduling, storage, and model lifecycle management.
  • Contribute to and engage with the open-source LLM ecosystem.
  • This role offers significant 0 → 1 ownership: you'll design and build core systems from first principles.

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSAs
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300/month
  • Compensation will be paid in the range of $256,000 - $320,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.
  • Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.