About The Position

We are seeking a highly skilled and driven Senior AI Engineer to join our team as a founding member, developing the critical data and AI infrastructure for training vision models and other foundation models for power grid applications. You will be instrumental in building and optimizing the end-to-end systems, data pipelines, and training processes that will power our AI research. Working closely with research scientists, you will translate cutting-edge research into robust, scalable, and efficient implementations, enabling the rapid development and deployment of transformational AI solutions. This role requires deep hands-on expertise in distributed training and data engineering, some MLOps experience, and a proven track record of building scalable AI infrastructure.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
  • 3 or more years of hands-on experience in AI Engineering/Machine Learning Engineering.
  • Deep practical expertise with AI frameworks (PyTorch, PyTorch Lightning, TorchTitan, etc.). Hands-on experience with large-scale multi-node GPU training and other optimization strategies for developing computer vision models and other foundation models. Ability to scale solutions involving large datasets and complex models on distributed compute infrastructure.
  • Proven track record of working on computer vision tasks and projects.
  • Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges. Strong communication and teamwork skills, with a collaborative approach to working with research scientists and other engineers.
  • Experience with MLOps best practices for model tracking, evaluation and deployment.
  • A track record of open-source contributions to relevant projects is a BIG PLUS.

Nice To Haves

  • Experience writing CUDA/Triton/CUTLASS kernels.
  • Experience with performance monitoring and profiling tools for distributed training and data pipelines.
  • Experience with vision foundation models or multimodal architectures.
  • Publications or presentations in top-tier AI conferences (NeurIPS, CVPR, ICML, etc.) are a strong plus.

Responsibilities

  • Design, build, and optimize everything necessary for large-scale training and/or fine-tuning with different model architectures. Design and optimize the full training stack, from data ingestion and preprocessing to model training and inference pipelines, with a focus on maximizing Model FLOPs Utilization (MFU) across multi-node GPU clusters.
  • Collaborate closely and proactively with research scientists, translating research ideas and algorithms into high-performance, production-ready code on our infrastructure. Rapidly implement, iterate on, and test ideas from research publications or open-source codebases.
  • Relentlessly profile and resolve training performance bottlenecks, optimizing every layer of the training stack from data loading to model inference for speed and efficiency.
  • Contribute to technology evaluations and selection of hardware, software, and cloud services that will define our AI infrastructure platform.
  • Use MLOps frameworks (MLflow, Weights & Biases, etc.) to implement best practices across the model lifecycle (development, training, validation, and monitoring), ensuring reproducibility, reliability, and continuous improvement.
  • Create thorough documentation for infrastructure, data pipelines, and training procedures, ensuring maintainability and knowledge transfer within the growing AI lab.
  • Stay at the forefront of advancements in large-scale training methods and data engineering, and proactively drive improvements and innovation in our workflows and infrastructure.
  • Act with high agency, demonstrating initiative, problem-solving, and a commitment to delivering robust, high-quality code.

Benefits

  • Career growth and development opportunities
  • Supportive work culture
  • Company-paid health and wellness benefits
  • Paid time off and paid holidays
  • 401(k) savings plan with company match
  • Family building benefits and parental leave

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Electrical Equipment, Appliance, and Component Manufacturing
  • Number of Employees: 5,001-10,000 employees