About The Position

The AI Observability Architect is a senior technical leader responsible for designing, deploying, and operating an enterprise-grade AI observability platform that spans the full spectrum of modern agentic AI: large language model (LLM) workflows, multi-agent orchestration, physical AI systems, reinforcement learning harnesses, multi-modal pipelines, and agentic marketplaces. The architect serves as the strategic and engineering authority for end-to-end telemetry, tracing, safety, and quality signals across heterogeneous agent frameworks and platforms.

The role leads the convergence of AI observability with safety and security (including red teaming), Responsible AI (RAI), data science, physical AI, memory/skills engineering, agent fleet management, self-evolving harnesses, reinforcement learning, agent-to-agent protocols (A2A, UCP, AP2), and continuous quality engineering, making it a uniquely broad, high-impact role within the AI Solutions & Platforms organization. The role also owns OpenTelemetry (OTEL) integration across third-party agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and others), enabling unified observability and governance at enterprise scale.

Requirements

  • Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, Software Engineering, or a related field (PhD a plus for research-heavy domains).
  • 12+ years in technology with deep experience in enterprise observability, distributed systems, platform engineering, or AI/ML infrastructure.
  • 5+ years in a senior/principal or architect-level role with demonstrated ownership of complex, cross-functional technical programs.
  • Expert-level knowledge of observability primitives (metrics, logs, traces, events) applied to LLM/ML/agentic systems; hands-on OpenTelemetry (OTEL) instrumentation including custom exporters, semantic conventions, and trace propagation across agent/tool boundaries (a minimal instrumentation sketch follows this list).
  • Direct experience with agentic AI platforms, multi-agent orchestration, LLM-based workflow design, and agent lifecycle management at production scale.
  • Demonstrated experience conducting red team exercises against AI systems; knowledge of adversarial attack patterns, prompt injection, model evasion, and multi-agent trust boundary failures; ability to design safety telemetry pipelines.
  • Working knowledge of agent memory architectures (episodic, semantic, working memory), Model Context Protocol (MCP), skill registries, and context injection patterns — with ability to design observability for these layers.
  • Familiarity with A2A (Agent-to-Agent), UCP (Universal Communication Protocol), and AP2 patterns; ability to implement protocol-level observability and policy enforcement.
  • Understanding of RL training loops, reward signal capture, policy evaluation, and harness instrumentation for continuously improving agent systems.
  • Experience or strong familiarity with observability for physical AI pipelines (robotics, edge inference, sensor fusion) and multi-modal models (vision, audio, text).
  • Proficiency in Python at a senior engineering level; experience with statistical anomaly detection, time-series analysis, and data pipeline design applied to observability data at scale.
  • Hands-on experience integrating OTEL with enterprise agentic platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or similar; strong understanding of enterprise integration patterns and API design.
  • Cloud fluency across Azure, AWS, and GCP; proficiency in Kubernetes, service mesh, IaC (Terraform/Bicep), and CI/CD tooling; experience with event streaming platforms (Kafka, Event Hubs).
  • Experience designing Continuous Quality Engineering (CQE) frameworks for agentic solutions including eval harnesses, regression detection, quality gates, and SLA-backed quality benchmarking.
  • Familiarity with RAI principles — fairness, bias detection, explainability, and safety — and ability to operationalize RAI signal capture within production observability pipelines.
  • Experience or strong familiarity with agent marketplace architectures, capability registries, and platform governance — ideally with observability or monitoring responsibilities for marketplace-registered components.

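To illustrate the kind of instrumentation this role centers on, here is a minimal sketch of OTEL trace propagation across an agent/tool boundary using the open-source OpenTelemetry Python SDK. The span names and attribute keys (agent.task, tool.name) are illustrative choices for this sketch, not established semantic conventions.

    # Minimal sketch: a tool call recorded as a child span of an agent step,
    # so the planner -> tool hop shows up as one connected trace.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("agent.observability.sketch")

    def call_tool(tool_name: str, payload: str) -> str:
        # Child span inherits the active trace context from the agent span.
        with tracer.start_as_current_span("tool.invoke") as span:
            span.set_attribute("tool.name", tool_name)             # illustrative key
            span.set_attribute("tool.payload.bytes", len(payload))
            return f"{tool_name} handled: {payload}"

    def run_agent_step(task: str) -> str:
        # Parent span: one planner/executor iteration.
        with tracer.start_as_current_span("agent.step") as span:
            span.set_attribute("agent.task", task)                 # illustrative key
            return call_tool("search", task)

    if __name__ == "__main__":
        run_agent_step("find supplier invoices")
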
Nice To Haves

  • Published contributions or hands-on experience with emerging agent frameworks (LangGraph, AutoGen, CrewAI, Semantic Kernel, Bedrock Agents, or equivalent).
  • Experience with Grafana, Datadog, New Relic, Dynatrace, or equivalent enterprise observability platforms — ideally extended to support AI/LLM workloads.
  • Familiarity with vector databases (Pinecone, Weaviate, pgvector) and semantic search observability patterns relevant to RAG pipelines.
  • Background in MLOps, LLMOps, or model lifecycle management — including model versioning, drift detection, and deployment governance.
  • Experience designing observability APIs and SDK hooks for developer self-service onboarding.

Responsibilities

  • Define and own the enterprise observability architecture for AI agents, LLMs, multi-agent workflows, and physical AI systems — covering planner/executor loops, tool/function calls, RAG retrieval chains, and memory/state transitions.
  • Build and operate unified telemetry pipelines incorporating metrics, logs, distributed traces, semantic/vector signals, and real-time event streaming (Kafka) at enterprise scale.
  • Instrument OpenTelemetry (OTEL) across heterogeneous platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and internal frameworks — delivering protocol-level observability for agent ecosystems including MCP, A2A, UCP, and AP2.
  • Design and implement observability for Agent Fleets, multi-modal pipelines, physical AI systems, and self-evolving reinforcement learning harnesses — including signal capture for reward shaping and policy evaluation.
  • Deliver dashboards, alerting, SLO/SLA management, incident runbook automation, and RCA tooling that drive measurable reliability improvements and reduce MTTR across agentic services.
  • Establish cost telemetry and FinOps observability for AI workloads — token consumption, inference cost allocation, and GPU/compute efficiency across cloud environments (Azure, AWS, GCP).
  • Lead observability-driven red team exercises targeting agentic AI systems — instrumenting attack surfaces, adversarial prompt injection vectors, model evasion attempts, and multi-agent trust boundary failures.
  • Design telemetry pipelines that capture safety-critical signals: guardrail trigger rates, policy violation events, PII exposure risks, prompt leakage, and agent hallucination rates.
  • Partner with Security and RAI teams to embed threat modeling, zero-trust agent authentication, and behavioral anomaly detection into the observability platform.
  • Instrument secure policy enforcement layers across agent-to-agent communication protocols (A2A, UCP, AP2) and maintain audit-ready traceability for all AI decision events.
  • Develop and maintain a Security Observability Playbook covering incident classification, escalation paths, and forensic trace retention policies for agentic AI systems.
  • Integrate RAI signal capture — fairness, bias detection, explainability, and safety metrics — directly into observability pipelines, making compliance measurable and audit-ready.
  • Deliver governance dashboards that surface RAI compliance posture across all active AI agents and LLM deployments, aligned with global regulatory standards.
  • Support risk assessments, gap analyses, and governance frameworks with real-time observability insights — enabling proactive risk mitigation rather than reactive audit responses.
  • Collaborate with RAI CoE and Legal/Compliance teams to define data retention, consent logging, and model decision traceability standards embedded in the telemetry architecture.
  • Own the Continuous Quality Engineering (CQE) framework for post-production agentic solutions — defining and tracking quality metrics across accuracy, latency, agent success rate, tool-call fidelity, and user outcome measures.
  • Build automated quality gates within CI/CD pipelines that leverage observability data to detect regressions, drift, and degradation in agent performance, preventing silent failures in production (a minimal gate sketch follows this list).
  • Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack — providing traceability from eval results to production behavior.
  • Partner with product and business stakeholders to define SLA-backed quality benchmarks and deliver automated alerting when quality thresholds are breached.
  • Drive root-cause analysis for quality failures using distributed trace data, enabling rapid iteration and continuous improvement cycles for agentic solutions.
  • Design and implement observability for the agent memory layer — episodic, semantic, and working memory read/write operations — providing latency, accuracy, and drift monitoring across memory backends.
  • Instrument MCP (Model Context Protocol) server interactions, tool registrations, skill invocations, and context injection pipelines with full trace propagation and semantic tagging.
  • Own observability for self-evolving harness and reinforcement learning (RL) systems — capturing reward signals, policy update events, environment state transitions, and learning convergence metrics.
  • Monitor harness execution fidelity, skill eval pass/fail rates, and regression signals across training, fine-tuning, and inference workflows — feeding data back into the quality engineering loop.
  • Lead a team of senior Python engineers building high-performance, production-grade observability tooling — including custom OTEL exporters, semantic trace enrichers, signal aggregators, and anomaly detection pipelines.
  • Apply data science methods — statistical process control, time-series anomaly detection, clustering, and causal inference — to transform raw telemetry into actionable AI operational intelligence (see the anomaly-detection sketch after this list).
  • Build and maintain Python-native SDKs and libraries that simplify observability onboarding for agent developers across the organization.
  • Establish code quality standards, testing frameworks, and peer review practices for the observability engineering team — embedding software craftsmanship into the team culture.
  • Instrument the Agentic Marketplace and Agent Registry platforms — providing usage telemetry, adoption metrics, capability health scores, and dependency mapping for registered agents and skills.
  • Design observability APIs and SDK hooks that allow marketplace-registered agents to self-report health, performance, and behavioral signals into the central observability platform.
  • Monitor inter-agent communication patterns across the marketplace ecosystem — identifying latency hotspots, circular dependencies, and protocol mismatches in agent-to-agent (A2A) workflows.
  • Deliver a Marketplace Observability Dashboard surfacing agent catalog health, adoption trends, quality scores, and incident history — supporting marketplace governance and curation decisions.
  • Build and maintain CI/CD pipelines for observability services and agent operations center components, incorporating automated testing, deployment gates, and rollback mechanisms.
  • Automate onboarding for new agent use cases using templates, scaffolding, and configuration validation — reducing time-to-observability from weeks to hours.
  • Drive infrastructure-as-code (IaC) practices for observability platform components across Azure, AWS, and GCP — ensuring reproducible, version-controlled, and auditable deployments.
  • Operate with a product mindset — defining observability platform roadmaps, OKRs, adoption playbooks, and release milestones in partnership with AI platform and business teams.
  • Collaborate with transformation teams, enterprise architects, security, and business stakeholders to tailor observability solutions to domain-specific requirements.
  • Serve as the technical authority in executive and governance forums — translating complex observability data into business-relevant insights on risk, cost, and AI performance.
  • Partner with SRE, AI platform, and product teams to drive standard adoption and reduce integration friction across the agentic AI ecosystem.
  • Build, mentor, and lead a high-performing observability engineering team — spanning Python developers, data scientists, and platform engineers — with talent initially based in India.
  • Define career paths, skills development plans, and leveling criteria aligned with PepsiCo job architecture — fostering an inclusive, high-accountability team culture.
  • Drive hiring, coaching, performance management, and succession planning across the observability function.
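
As a concrete illustration of the automated quality gates described above, the sketch below fails a CI/CD stage when an agent's eval pass rate drops below a floor. The threshold value, the eval_results.json path, and the result format are assumptions for illustration, not an existing pipeline contract.

    # Minimal CI quality-gate sketch. The pass-rate floor and the
    # eval_results.json format are illustrative assumptions.
    import json
    import sys

    THRESHOLD = 0.95  # illustrative SLA-backed pass-rate floor

    def gate(results_path: str) -> int:
        with open(results_path) as f:
            # Assumed format: [{"eval": "tool_call_fidelity", "passed": true}, ...]
            results = json.load(f)
        passed = sum(1 for r in results if r["passed"])
        rate = passed / len(results) if results else 0.0
        print(f"eval pass rate: {rate:.2%} (floor {THRESHOLD:.0%})")
        return 0 if rate >= THRESHOLD else 1  # nonzero exit fails the pipeline stage

    if __name__ == "__main__":
        sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "eval_results.json"))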

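In the same spirit, here is a sketch of the statistical process control methods noted above: a simple control-chart rule that flags token-usage spikes in per-minute telemetry. The window size, three-sigma threshold, and synthetic data are illustrative assumptions.

    # Minimal control-chart style check on token-usage telemetry.
    from statistics import mean, stdev

    def flag_anomalies(samples, window=20, sigmas=3.0):
        """Yield (index, value) pairs outside mean +/- sigmas * stdev of the
        preceding `window` samples (a simple SPC-style rule)."""
        for i in range(window, len(samples)):
            baseline = samples[i - window:i]
            mu, sd = mean(baseline), stdev(baseline)
            if sd and abs(samples[i] - mu) > sigmas * sd:
                yield i, samples[i]

    if __name__ == "__main__":
        # Synthetic per-minute token counts with one injected spike.
        tokens = [1000 + (i % 7) * 12 for i in range(60)]
        tokens[45] = 9000  # e.g., a runaway agent loop
        for idx, val in flag_anomalies(tokens):
            print(f"minute {idx}: {val} tokens looks anomalous")
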
Benefits

  • Paid parental leave
  • Vacation
  • Sick leave
  • Bereavement leave
  • Medical
  • Dental
  • Vision
  • Disability
  • Health and Dependent Care Reimbursement Accounts
  • Employee Assistance Program (EAP)
  • Insurance (Accident, Group Legal, Life)
  • Defined Contribution Retirement Plan