ML Infrastructure Engineer (Staff / Principal)

Genesis Therapeutics • Burlingame, CA

About The Position

We're seeking experienced ML infrastructure engineers to join the team and lead engineering efforts that drive forward our ML research agenda for generative modeling of molecular systems, which is instrumental to our mission. As an engineer at Genesis, you will lead rapid iteration on our AI platform and infrastructure, unlocking levels of performance, efficiency, and scale that were not previously possible. You will build massively distributed training and inference pipelines, develop core MLOps tools and frameworks, and optimize GPU operations to speed up ML models. Genesis is a highly collaborative and cross-functional environment, and you will work in close partnership with our exceptional engineers, researchers, and scientists.

Requirements

Strong engineer who constantly strives for technical excellence. You can write clean code and have a deep understanding of the codebases you work in.
Deeply experienced with distributed training and inference of large models on GPU clusters, and with some of the core libraries and frameworks we use: PyTorch, PyTorch Lightning, PyTorch Geometric, and Ray.
Independent thinker with a strong sense of ownership and the ability to engineer robust systems from first-principles conceptualization to state-of-the-art realization.
Curious, problem-oriented thinker who is excited to dive deep into the emerging field at the intersection of AI, physics, chemistry, and biology and make foundational contributions and discoveries (no prior experience outside ML is required).

Nice To Haves

Experienced with building, maintaining, and debugging low-level cluster infrastructure running on multiple clouds using Kubernetes and Terraform.
Experienced GPU engineer who can quickly identify performance bottlenecks and architect highly performant code for large-scale ML workloads.
Experience with XLA, Triton, CUDA, or similar accelerator programming languages and/or deep learning compiler stacks.
Experience working with some of the following: molecular systems (protein sequences and 3D structures, small molecules, etc.), ML force fields or other physics-informed models and methods, or point cloud data in other application domains, such as 3D graphics.

Responsibilities

Lead engineering efforts to continuously improve the AI platform, with rapid build-out and iteration on scalable, robust distributed infrastructure for ML training, inference, and evaluation.
Support model training and deployment across multiple clusters and multiple clouds, optimizing for throughput and cost.
Optimize the efficiency of ML models and other workloads in terms of latency, throughput, memory consumption, etc. (e.g., via GPU performance engineering), pushing the limits of what's possible with current hardware.
Contribute to the long-term vision for Genesis' ML platform.
Mentor and guide more junior members of our technical team as well as research interns, fostering an environment of growth and innovation.

Benefits

Competitive compensation package that includes salary and equity.
Comprehensive health benefits: medical, dental, and vision (100% covered for employees).
401(k) plan.
Open (unlimited) PTO policy.
Free lunches and dinners at our offices.
Paid family leave (maternity and paternity).
Life and long- and short-term disability insurance.