Director, Support Engineering

Together AI • San Francisco, CA

Remote

About The Position

We’re hiring a Support Leader to own and scale Together AI’s customer support function across two distinct, technically demanding domains: API Support (billing, serverless inference, and dedicated inference) and GPU Support (large-scale GPU infrastructure for model training workloads). You’ll work closely with Together AI’s VP of Customer Experience and partner tightly with SRE, Inference Platform, and Engineering to represent customers internally and drive resolution at speed. This is a player-coach role: you’ll be hands-on in escalations.

Our support operation runs 24/7. Our GPU infrastructure customers hold us to high-stakes SLAs on training workloads. Our API customer base spans thousands of PLG and enterprise accounts relying on our serverless and dedicated inference endpoints. Both domains need a leader who can keep pace technically and build the operational muscle to scale.

Requirements

10+ years of support engineering or technical support leadership experience, with at least 3 years managing a team.
Demonstrated experience leading infrastructure support or cloud operations. You understand how large-scale workloads behave on distributed systems.
Working knowledge of AI infrastructure. You know how APIs work, can reason about latency and throughput issues, and understand the operational surface of a managed inference platform.
Technical depth to be a credible player-coach. You can guide engineers through root cause analysis and bring credibility to customer-facing escalations.
Experience running SLA-driven support operations with real accountability. Familiarity with Pylon or equivalent support ticketing platforms (Zendesk, etc.) and PagerDuty-style alerting systems.
Strong communication skills, especially under pressure. You can write a clear, concise customer-facing update in the middle of a live incident and distill a complex infrastructure issue into a crisp internal escalation.
Startup mindset. You’re comfortable building process where none exists, and you thrive in environments where priorities shift fast.

Responsibilities

Directly manage and develop a team of support engineers and technical account specialists across API Support and GPU Support functions.
Establish clear performance expectations, career growth paths, and a coaching culture. Identify skill gaps and build training programs to close them.
Run structured 1:1s, team reviews, and escalation retrospectives.
Assess and overhaul support workflows, SLA frameworks, and escalation playbooks.
Build triage, prioritization, and handoff protocols that allow the team to scale with customer growth without proportional headcount growth.
Define and own support KPIs: SLA attainment, time-to-resolution, escalation rate, and CSAT.
Jump into complex, active GPU infrastructure issues alongside your team. Investigate NCCL and InfiniBand failures, SSH connection stalls, Kubelet TLS misconfigurations, GPU/RDMA provisioning timeouts, NFS RDMA mount failures, VAST storage failures, network fabric degradation, etc.
Manage high-stakes SLA obligations with GPU cloud customers running multi-thousand-GPU training workloads.
Coordinate closely with SRE and infrastructure engineering on hardware-level issues and cluster bringup.
Own the support surface for Together AI’s API platform: serverless inference, dedicated inference endpoints (self-serve and managed), billing, rate limits, model upload (BYOM), and API authentication.
Represent the team on complex cases: dedicated endpoint startup failures, safetensors validation errors, NFS/storage performance issues on inference clusters, billing disputes and negative-balance enforcement, and rate limit escalations.
Work with the Inference Platform, Commerce, and Product teams to surface patterns and drive fixes upstream.
Be the escalation point for your team’s highest-severity customer issues: triage fast, communicate clearly to customers and internal stakeholders, and drive to resolution.
Partner with SRE, Engineering, and Sales on shared priorities. Represent the support team’s perspective in cross-functional planning.
Own the relationship with support tooling vendors and drive improvements to alerting, SLA tracking, and ticket routing.
Systematically analyze ticket patterns and surface product and infrastructure gaps to Engineering and Product. Turn support signal into actionable roadmap input.
Build documentation and self-service resources that reduce inbound volume over time.