Technical Program Manager

RadixArk • Palo Alto, CA

About The Position

As a Technical Program Manager at RadixArk, you'll drive the execution of complex, cross-functional programs across our inference and training infrastructure. You'll partner closely with Product Management, Research, and Engineering to turn ambitious technical roadmaps into shipped reality, coordinating across kernel teams, distributed systems engineers, and external partners to deliver infrastructure that serves billions of tokens daily and coordinates 10,000+ GPU training runs. This role is for someone who thrives at the intersection of deep technical understanding and rigorous program execution. You'll own the "how" and "when" of our most critical initiatives.

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
4+ years of direct experience in Technical Program Management, Engineering Management, or a senior engineering role with significant program ownership, at a software or infrastructure company.
Strong technical fluency in systems software, distributed systems, or AI/ML infrastructure; able to read code, follow architecture discussions, and challenge technical assumptions productively.
Demonstrated track record shipping complex, multi-team programs on time, including managing dependencies, risks, and scope changes.
Excellent written and verbal communication skills; able to drive alignment across engineers, executives, and external partners.

Nice To Haves

Direct experience shipping AI/ML infrastructure such as inference engines, training frameworks, GPU kernels, distributed schedulers, or model serving platforms.
Hands-on coding background (Python, C++, CUDA) and comfort working in engineering codebases, including reading PRs, running benchmarks, and reproducing issues.
Experience coordinating with hardware vendors (Nvidia, AMD, Google TPU, AWS Trainium/Inferentia) on enablement or co-engineering programs.
Experience driving open-source release programs or working in OSS communities, including issue triage, RFC processes, and contributor coordination.
Familiarity with release engineering, CI/CD systems, and observability tooling for large-scale distributed systems.
Experience supporting B2B or developer-facing products with enterprise SLAs.

Responsibilities

Drive end-to-end execution of large-scale, cross-functional programs spanning inference engines (e.g., SGLang), training frameworks (e.g., Miles), and hardware integration efforts.
Define program structure, including milestones, dependencies, critical paths, risks, and success criteria. Maintain a clear source of truth for status across all stakeholders.
Run design reviews, sprint planning, release readiness reviews, and post-mortems. Ensure decisions are documented and follow-ups are closed out.
Identify and unblock cross-team dependencies across kernel, runtime, scheduler, networking, and model teams before they become release blockers.
Drive release management for major versions, including changelog ownership, compatibility validation, partner rollout sequencing, and rollback planning.
Partner with Product Management to translate roadmap priorities into executable program plans, with clear scope, staffing, and timelines.
Work shoulder-to-shoulder with engineering leads on technical trade-off decisions; understand the architecture deeply enough to ask the right questions and surface hidden risks.
Coordinate hardware enablement programs with partners like Nvidia, Google, and AWS, including new accelerator bring-up, kernel co-development, and benchmark validation.
Manage integration programs with frontier AI labs and early adopters, ensuring technical requirements, SLAs, and feedback loops are well-defined.
Build and improve the engineering operating cadence, including standups, planning rituals, OKR tracking, dashboards, and reporting to leadership.
Establish metrics and instrumentation for program health, such as velocity, defect rates, benchmark regressions, and customer-reported issues, and drive accountability against them.
Lead incident response coordination for production issues affecting partners; own root-cause review and corrective-action tracking.
Improve developer productivity by identifying and removing systemic friction in our build, test, and release pipelines.
Serve as the connective tissue between engineering, product, GTM, and external partners, ensuring everyone has the right information at the right altitude.
Produce clear, concise written updates for leadership and partners. Translate engineering progress into business-relevant signals.
Represent program status honestly, including risks and slips, with concrete mitigation plans.