Senior Staff Technical Program Manager (AI Platform/OS)

Red Cell Partners•McLean, VA

$200,000 - $260,000

About The Position

As a Senior Staff Technical Program Manager, you will own internal program execution across our operating system - ensuring that platform investments translate into shipped, reliable, and measurable outcomes. This is not a coordination or reporting role.

You are responsible for:
Driving execution across highly coupled, multi-team platform work
Creating the operating system for engineering execution
Ensuring platform systems (runtime, infrastructure, AI workflows) ship predictably and safely

You will operate at the intersection of:
Platform engineering (agent runtime, workflows, system orchestration)
DevOps / SRE (deployment, reliability, observability)
DevEx (developer workflows, CI/CD, release safety)
AI/ML systems (LLM-driven workflows, evaluation, and inference pipelines)

You are expected to be deeply technical - able to:
Read architecture diagrams and system designs fluently
Understand and reason about code, APIs, and system behavior
Engage engineers on tradeoffs across infrastructure, runtime, and AI systems

While this is not a hands-on coding role, the ability to read and occasionally write code to unblock or validate work is highly valuable.

Why this Role is Needed

Our operating system is a distributed, orchestration-heavy platform with:
Long-lived, stateful workflows
Cross-service and cross-environment dependencies
AI/LLM-driven execution paths requiring observability and control
Strict reliability, security, and auditability requirements

As the platform scales, the bottleneck shifts to:
Cross-team coordination
Dependency sequencing
Release readiness
Execution predictability

This role exists to:
Reduce coordination overhead on engineering leads
Ensure platform work is sequenced, unblocked, and measurable
Improve delivery predictability without slowing velocity
Translate platform investments into real shipped outcomes

Requirements

12+ years of experience in technical program management, engineering, or related roles
Experience working on distributed systems, cloud infrastructure, and CI/CD and deployment systems
Strong understanding of DevOps / SRE workflows, system dependencies, and failure modes
Demonstrated ability to break down ambiguous technical problems, drive execution across teams, and influence without authority
Strong technical fluency, with the ability to read and understand production code, reason about system architecture and APIs, and engage in technical tradeoff discussions
Experience with or exposure to AI/ML systems and LLM-based workflows, and AI infrastructure (inference, evaluation, orchestration)
Ability to write code when needed (for debugging, validation, or prototyping), though this is not a primary responsibility
High ownership and accountability
Strong bias for action and clarity
Comfortable operating in ambiguity
Focused on outcomes over process

Nice To Haves

Experience working closely with DevOps / SRE teams and platform engineering teams
Familiarity with Kubernetes, Infrastructure-as-Code, and observability systems
Experience in regulated or high-security environments

Responsibilities

Own end-to-end execution of internal platform initiatives across the Trase operating system, translating ambiguous work across infrastructure, runtime systems, and AI/ML workflows into clear, actionable plans while ensuring alignment across Engineering, DevOps/SRE, DevEx, and Product.
Identify and manage cross-team dependencies across services, cloud infrastructure, and AI pipelines, sequencing work to minimize blocking dependencies, reduce integration risk, and avoid rework.
Establish and maintain a lightweight operating rhythm that drives execution, including milestone tracking, execution reviews, and release readiness checkpoints, ensuring teams have clear priorities, defined success criteria, and visibility into risks.
Partner with DevOps and SRE to ensure releases are safe, validated, and traceable, and that platform and AI/ML changes are observable, auditable, and ready for production environments; drive go/no-go decisions based on system readiness and risk.
Proactively identify and manage system-level risks across infrastructure, deployment systems, AI/ML pipelines, and runtime behavior, ensuring mitigation strategies are in place before issues impact delivery.
Define and track key execution and reliability signals, including delivery predictability, release success rates, dependency resolution, and system health, acting as the source of truth for execution status and risk.
Continuously improve engineering execution by identifying inefficiencies in CI/CD workflows, testing and integration systems, and AI workflow evaluation, partnering with DevEx and DevOps to increase developer velocity, release safety, and overall system reliability.

Benefits

Career-track opportunity with potential for rapid advancement, based on strong performance, as the firm grows.
100% employer-paid, comprehensive health care including medical, dental, and vision for you and your family.
Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
Unlimited PTO, with management approval.
Opportunities for professional development and continued learning.
Optional 401K, FSA, and equity incentives available.
Mental health benefits are available through Tara Mind.
Cost-effective GLP-1 solutions available through Crux.
