AI Infrastructure Engineer

Zoom•San Jose, CA

1d•Hybrid

About The Position

We are seeking an experienced AI Infrastructure Engineer to join our AI Incubation team. You will focus on building and optimizing large-scale training infrastructure for Large Language Models (LLMs). The ideal candidate will combine engineering fundamentals with practical experience in AI infrastructure development, demonstrating both technical depth and the ability to deliver scalable solutions for complex AI systems.

About the Team

The AI Incubation team is dedicated to incubating AI breakthroughs, including foundational AI techniques and AI-native applications that will substantially improve people’s work productivity.

Requirements

Have a bachelor's degree in Computer Science, Engineering, AI, Machine Learning, Distributed Systems, or a related field
5+ years of software engineering experience with a focus on infrastructure and systems
Have expertise in GPU programming and CUDA optimization
Have experience with container technologies (Docker, Kubernetes), distributed systems and cloud computing
Demonstrate experience building large-scale distributed systems and optimizing neural network performance
Possess programming skills in Python, C++, and CUDA, and experience with deep learning frameworks (PyTorch, Transformers)

Responsibilities

Designing and developing scalable AI infrastructure solutions for training and deploying large language models
Building and optimizing distributed training platforms using cutting-edge technologies
Implementing and maintaining containerized AI environments using Docker and Kubernetes
Optimizing CUDA kernels for maximum GPU utilization and performance
Developing platform software to support AI/ML workflows
Collaborating with AI researchers to implement efficient training and inference pipelines
