Intern - AI Cluster Engineering (Summer 2026)

SK hynix America
San Jose, CA
$26 - $50
Onsite

About The Position

We are building a software framework at the AI data center (DC) level: a cutting-edge platform for validating next-generation, full-stack AI infrastructure. We are looking for talented interns to develop its AI application workload framework. Beyond AI models alone, complete AI applications will be containerized, optimized, and stress-tested on versatile GPU clusters to find architectural bottlenecks. Interns will onboard and optimize diverse AI applications on our AI cluster, deep-diving into specific domains such as Generative AI (LLM), Physical AI (Robotics), and AI for Science (Bio/Physics) to benchmark their performance on the latest GPUs, DPUs, and network fabrics.

Requirements

  • Education: Currently pursuing an MS or PhD in Computer Science, Electrical Engineering, AI, or a related field.
  • Programming: Strong proficiency in Python (Bash scripting is a plus).
  • Containerization: Experience with Docker (building Dockerfiles, managing dependencies).
  • AI Fundamentals: Basic understanding of Deep Learning workflows (Training vs. Inference) and frameworks (PyTorch, TensorFlow).
  • OS: Comfort with the Linux command-line environment.

Nice To Haves

  • Orchestration: Experience with Kubernetes (K8s) or Slurm.
  • Profiling: Experience with performance profiling tools (Nsight Systems, DCGM, PyTorch Profiler).
  • Domain Expertise (in one of the following):
      • LLM: Experience with vLLM, TGI, or TensorRT-LLM.
      • Distributed Systems: Experience with multi-node training (Megatron-LM, DeepSpeed).
      • Robotics: Experience with ROS 2, Isaac Sim, or Reinforcement Learning.
      • HPC/Science: Experience with MPI, OpenMP, or bioinformatics tools.
  • Hardware: Curiosity about computer architecture (GPU, DPU, Memory, Network Fabric).

Responsibilities

  • Onboard Diverse AI Applications: Port and deploy state-of-the-art AI workloads, including LLM (vLLM, TGI), Physical AI (Isaac Sim, ROS 2), and Scientific AI (AlphaFold, GROMACS).
      • Resolve software dependency issues and build optimized Docker images for the A^3 Registry.
  • Application Profiling & Analysis: Analyze the unique resource patterns of each application (e.g., “Isaac Sim requires heavy ray-tracing capability” or “Megatron-LM is bottlenecked by RDMA”).
      • Identify performance bottlenecks using profiling tools (Nsight Systems, PyTorch Profiler).
  • Develop "Test Recipes": Define a standard configuration (a "Recipe") for each application to ensure reproducible testing.
  • Collaboration: Work with infrastructure engineers to tune the system (OS, Network, Storage) to fit your application's needs.

Benefits

  • Eligible interns will receive a housing allowance during their internship.