Senior Staff Technical Program Manager (AI Platform/OS)

Red Cell Partners, McLean, VA
$200,000 - $260,000

About The Position

As a Senior Staff Technical Program Manager, you will own internal program execution across our operating system, ensuring that platform investments translate into shipped, reliable, and measurable outcomes. This is not a coordination or reporting role.

You are responsible for:

  • Driving execution across highly coupled, multi-team platform work
  • Creating the operating system for engineering execution
  • Ensuring platform systems (runtime, infrastructure, AI workflows) ship predictably and safely

You will operate at the intersection of:

  • Platform engineering (agent runtime, workflows, system orchestration)
  • DevOps / SRE (deployment, reliability, observability)
  • DevEx (developer workflows, CI/CD, release safety)
  • AI/ML systems (LLM-driven workflows, evaluation, and inference pipelines)

You are expected to be deeply technical, able to:

  • Read architecture diagrams and system designs fluently
  • Understand and reason about code, APIs, and system behavior
  • Engage engineers on tradeoffs across infrastructure, runtime, and AI systems

While this is not a hands-on coding role, the ability to read and occasionally write code to unblock or validate work is highly valuable.

Why This Role Is Needed

Our operating system is a distributed, orchestration-heavy platform with:

  • Long-lived, stateful workflows
  • Cross-service and cross-environment dependencies
  • AI/LLM-driven execution paths requiring observability and control
  • Strict reliability, security, and auditability requirements

As the platform scales, the bottleneck shifts to:

  • Cross-team coordination
  • Dependency sequencing
  • Release readiness
  • Execution predictability

This role exists to:

  • Reduce coordination overhead on engineering leads
  • Ensure platform work is sequenced, unblocked, and measurable
  • Improve delivery predictability without slowing velocity
  • Translate platform investments into real shipped outcomes

Requirements

  • 12+ years of experience in technical program management, engineering, or related roles
  • Experience working on distributed systems, cloud infrastructure, and CI/CD and deployment systems
  • Strong understanding of DevOps / SRE workflows, system dependencies, and failure modes
  • Demonstrated ability to break down ambiguous technical problems, drive execution across teams, and influence without authority
  • Strong technical fluency, with the ability to read and understand production code, reason about system architecture and APIs, and engage in technical tradeoff discussions
  • Experience with or exposure to AI/ML systems, LLM-based workflows, and AI infrastructure (inference, evaluation, orchestration)
  • Ability to write code when needed (for debugging, validation, or prototyping), though this is not a primary responsibility
  • High ownership and accountability
  • Strong bias for action and clarity
  • Comfortable operating in ambiguity
  • Focused on outcomes over process

Nice To Haves

  • Experience working closely with DevOps / SRE teams and platform engineering teams
  • Familiarity with Kubernetes, Infrastructure-as-Code, and observability systems
  • Experience in regulated or high-security environments

Responsibilities

  • Own end-to-end execution of internal platform initiatives across the Trase operating system, translating ambiguous work across infrastructure, runtime systems, and AI/ML workflows into clear, actionable plans while ensuring alignment across Engineering, DevOps/SRE, DevEx, and Product.
  • Identify and manage cross-team dependencies across services, cloud infrastructure, and AI pipelines, sequencing work to minimize blocking dependencies, reduce integration risk, and avoid rework.
  • Establish and maintain a lightweight operating rhythm that drives execution, including milestone tracking, execution reviews, and release readiness checkpoints, ensuring teams have clear priorities, defined success criteria, and visibility into risks.
  • Partner with DevOps and SRE to ensure releases are safe, validated, and traceable, and that platform and AI/ML changes are observable, auditable, and ready for production environments; drive go/no-go decisions based on system readiness and risk.
  • Proactively identify and manage system-level risks across infrastructure, deployment systems, AI/ML pipelines, and runtime behavior, ensuring mitigation strategies are in place before issues impact delivery.
  • Define and track key execution and reliability signals, including delivery predictability, release success rates, dependency resolution, and system health, acting as the source of truth for execution status and risk.
  • Continuously improve engineering execution by identifying inefficiencies in CI/CD workflows, testing and integration systems, and AI workflow evaluation, partnering with DevEx and DevOps to increase developer velocity, release safety, and overall system reliability.

Benefits

  • Career-track opportunity with potential for rapid advancement, based on strong performance, as the firm grows
  • 100% employer-paid, comprehensive health care, including medical, dental, and vision, for you and your family.
  • Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
  • Unlimited PTO, with management approval.
  • Opportunities for professional development and continued learning.
  • Optional 401K, FSA, and equity incentives available.
  • Mental health benefits are available through Tara Mind.
  • Cost-effective GLP-1 solutions available through Crux.