Sr. Director - Backend Engineering

Coupang
Seattle, WA
$184,000 - $376,000

About The Position

Sr. Director Back-end Engineering - AI Infrastructure Orchestration

Company Introduction

We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did we ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds - a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

Strategy and Leadership

  • Define and execute the long-term vision and roadmap for the company's AI infrastructure orchestration layer, aligning it with overall business and AI Services goals.
  • Lead, mentor, and grow a high-performing engineering and operations team focused on AI infrastructure and platform engineering.
  • Manage budget and resource allocation for AI infrastructure deliverables.
  • Act as a key liaison between AI Infra and other service owners and consumers, core engineering, Cloud infrastructure, and executive leadership.

AI Infra Development and Operations

  • Oversee the design, implementation, and maintenance of the core orchestration platforms for large-scale AI model training (e.g., distributed training, hyperparameter tuning) and deployment (e.g., containerization, serverless functions, edge deployment).
  • Ensure reliability, security, and compliance of the AI infrastructure, meeting strict standards for data governance and model integrity.
  • Establish Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) for the AI platform services and lead efforts for continuous optimization and performance tuning.

Success Metrics

A successful Senior Director - AI Infrastructure Orchestration will be measured by:

  • The time-to-market to build, scale, and operate AI infrastructure.
  • The resource utilization rate and cost efficiency of the AI compute infrastructure.
  • The reliability and uptime of the core AI platform services.
  • The talent retention and development within the AI Infrastructure team.

Technology and Architecture

  • Select, evaluate, and integrate the core technologies required for the AI stack (e.g., Kubernetes, Kubeflow, Ray, ML frameworks, GPU/accelerator management, distributed file systems).
  • Champion infrastructure-as-code (IaC) principles to manage and provision AI resources consistently and at scale.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
  • 15+ years of progressive experience in software engineering, infrastructure, or platform operations.
  • 5+ years of experience leading and managing technical teams, ideally at the Director or Sr. Director level or in an equivalent capacity.
  • Deep, hands-on experience designing and operating large-scale distributed systems and cloud-native architectures.
  • Proven experience specifically with AI infrastructure orchestration (e.g., using Kubernetes, Kubeflow, or similar MLOps tools) and managing accelerated compute resources (GPUs, TPUs, etc.).
  • 15+ years of experience in cloud backend engineering, cloud design, deployment, and DevOps.
  • 15+ years of experience leading system design and architecture leveraging private clouds and AWS, Azure, and/or GCP.
  • 10+ years of demonstrable experience building and operating infrastructure as code and infrastructure automation; comfortable with many flavors of Linux.
  • 15+ years of experience building high-performance, highly available, and scalable distributed systems in the cloud.
  • 15+ years of experience building and managing high-performance, highly available, and scalable hybrid cloud environments.
  • Excellent cross-group collaboration, outstanding verbal and written communication.
  • Expert-level knowledge of containerization and orchestration (Docker, Kubernetes).
  • Strong background in DevOps and MLOps principles and tooling.
  • Proficiency in at least one modern programming language (e.g., Python, Go).
  • Exceptional strategic planning, organizational, and written/verbal communication skills.

Nice To Haves

  • Prior experience managing infrastructure for training and inference of large language models (LLMs) or foundation models.
  • Experience in a regulated industry with strict compliance requirements.
  • Experience building and operating an AI private cloud.

Responsibilities

  • Define and execute the long-term vision and roadmap for the company's AI infrastructure orchestration layer, aligning it with overall business and AI Services goals.
  • Lead, mentor, and grow a high-performing engineering and operations team focused on AI infrastructure and platform engineering.
  • Manage budget and resource allocation for AI infrastructure deliverables.
  • Act as a key liaison between AI Infra and other service owners and consumers, core engineering, Cloud infrastructure, and executive leadership.
  • Oversee the design, implementation, and maintenance of the core orchestration platforms for large-scale AI model training (e.g., distributed training, hyperparameter tuning) and deployment (e.g., containerization, serverless functions, edge deployment).
  • Ensure reliability, security, and compliance of the AI infrastructure, meeting strict standards for data governance and model integrity.
  • Establish Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) for the AI platform services and lead efforts for continuous optimization and performance tuning.
  • Select, evaluate, and integrate the core technologies required for the AI stack (e.g., Kubernetes, Kubeflow, Ray, ML frameworks, GPU/accelerator management, distributed file systems).
  • Champion infrastructure-as-code (IaC) principles to manage and provision AI resources consistently and at scale.

Benefits

  • Medical/Dental/Vision/Life, AD&D insurance
  • Flexible Spending Accounts (FSA) & Health Savings Account (HSA)
  • Long-term/Short-term Disability
  • Employee Assistance Program (EAP)
  • 401K Plan with Company Match
  • 18-21 days of Paid Time Off (PTO) per year, based on tenure
  • 12 Public Holidays
  • Paid Parental leave
  • Pre-tax commuter benefits
  • MTV - [Free] Electric Car Charging Station