Senior Software Engineer, Back End

Capital One, McLean, VA
2d

About The Position

Do you love building at the intersection of infrastructure and artificial intelligence? Do you enjoy solving complex distributed systems problems in a fast-paced, collaborative, and iterative environment? At Capital One, you'll be part of a big group of makers, breakers, doers, and disruptors who love to solve real problems.

We are seeking specialized Backend Software Engineers who are passionate about building the engines that power the next generation of AI. As a Software Engineer on the AI Training Platform team, you won’t just be building applications; you will be building the foundational Managed Services, SDKs, and Compute Infrastructure that allow our Data Scientists to train massive models across hundreds of distributed GPUs. You will be at the forefront of driving a major machine learning transformation within Capital One.

Tech Stack: Utilize programming languages like Python, Go, and C++, alongside container orchestration tools (Docker, Kubernetes), NVIDIA GPU hardware, and AWS cloud infrastructure.

Requirements

  • Bachelor’s Degree
  • At least 3 years of professional software engineering experience (Internship experience does not apply)
  • Ability to work with programming languages like Python, Go, and C++, alongside container orchestration tools (Docker, Kubernetes), NVIDIA GPU hardware, and AWS cloud infrastructure

Nice To Haves

  • 5+ years of experience in at least one of the following: Java, Scala, Python, Go, or Node.js
  • 1+ years of experience with AWS, GCP, Azure, or another cloud service
  • 3+ years of experience in open source frameworks, especially those for distributed computing and training LLMs (e.g. Kubeflow Training Operator, Ray, PyTorch)
  • 1+ years of experience building monitoring for high-throughput systems (Prometheus, Grafana, DCGM) to track GPU utilization and training metrics
  • 1+ years of experience managing or developing against NVIDIA GPU clusters (A100/H100), including knowledge of CUDA, NCCL, or NVLink
  • 2+ years of experience in Agile practices

Responsibilities

  • Build the Platform: Design and develop the control plane and managed services that orchestrate complex AI training workloads across large-scale GPU clusters.
  • Empower Data Scientists: Build intuitive SDKs and CLIs that abstract the complexity of distributed computing, allowing data scientists to focus on modeling rather than infrastructure.
  • Master Distributed Systems: Solve hard problems related to job scheduling, resource allocation, and fault tolerance across hundreds of distributed GPUs using tools like Kubernetes and Ray.
  • Optimize Performance: Debug and optimize the training stack, from the network layer (NCCL, MPI) to the framework level (PyTorch), ensuring high utilization of expensive GPU resources.
  • Collaborate & Innovate: Partner with Machine Learning Engineers to understand their pain points and deliver robust cloud-based solutions.
  • Stay on top of HPC trends, experiment with new orchestration patterns, and mentor others in the engineering community.

Benefits

  • This role is eligible to earn performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI). Incentives could be discretionary or non-discretionary depending on the plan.
  • Capital One offers a comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being.