About The Position

We’re defining how AI runs in production across the enterprise. As AI adoption scales, the challenge shifts from building models to operating them reliably. This role owns how AI/LLM and agentic systems are run, supported, and governed, ensuring they are reliable, observable, cost-efficient, and continuously improving in real-world environments.

You will lead the development of the enterprise AI Operations practice, establishing the standards, operating model, and visibility required to support AI at scale. This includes defining how systems are monitored, how incidents are managed, how risks are controlled, and how performance is continuously improved. Working closely with Engineering, AI Platform, Product, and Delivery teams, you will ensure all production AI systems meet clear operational standards and that leadership has consistent visibility into system health, performance, and risk.

This is a hands-on, senior leadership role with end-to-end accountability for how AI systems perform in production. The position offers a hybrid work schedule that combines in-office presence with remote work, and is based in either Houston or Dallas, TX.

Requirements

  • Bachelor's degree plus extensive experience in SRE, MLOps, production operations, or platform engineering, including 6 years of leadership experience, or a demonstrated equivalent combination of experience and/or education
  • Experience operating AI/ML/LLM systems in production (serving real users at scale) with clear ownership and accountability
  • Background in SRE, MLOps, or distributed systems, with depth in reliability and operational excellence
  • Strong understanding of AI production failure modes (e.g., drift, hallucinations, orchestration issues, cost inefficiencies)
  • Experience building and scaling observability, monitoring, and telemetry systems (e.g., OpenTelemetry, Datadog, Prometheus, Grafana)
  • Proven track record of defining SLAs/SLOs, incident management processes, and operational frameworks for complex systems
  • Experience leading cross-functional efforts across engineering, platform, and product teams
  • Ability to operate at both strategic and hands-on levels, setting direction while driving execution

Nice To Haves

  • Experience with LLM platforms or frameworks (e.g., Azure AI, AWS Bedrock, LangChain)
  • Experience with agentic systems, RAG pipelines, or orchestration frameworks
  • Background in ITIL or service management, applied to modern distributed systems
  • Familiarity with Responsible AI and governance frameworks

Responsibilities

  • Define and scale the enterprise AI Operations practice, including operating model, standards, and governance
  • Establish production readiness and operability standards across AI/LLM and agentic systems
  • Own production reliability, including SLAs/SLOs, incident management, and support models
  • Implement observability and monitoring for AI systems (latency, drift, behavior, failures, cost)
  • Ensure clear ownership, escalation paths, and accountability across production AI systems
  • Build controls for agent behavior, model usage, and operational risk
  • Drive performance, reliability, and cost optimization across AI workloads
  • Lead operational reviews and reporting, providing visibility into system health, risks, and trends
  • Identify systemic issues and drive continuous improvement across AI systems and processes
  • Partner with Engineering, Product, and Platform teams to ensure production readiness and alignment

Benefits

  • medical
  • dental
  • vision
  • life
  • AD&D
  • disability benefits
  • paid time off
  • leaves of absence
  • voluntary benefits
  • perks
  • flexible work options
  • well-being resources
  • employee assistance program
  • business travel insurance
  • service recognition awards
  • retirement savings plan
  • employee stock purchase plan

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Number of Employees: 5,001-10,000 employees
