About The Position

The AI Observability Architect is a senior technical leader responsible for designing, deploying, and operating an enterprise-grade AI observability platform that spans the full spectrum of modern agentic AI: large language model (LLM) workflows, multi-agent orchestration, physical AI systems, reinforcement learning harnesses, multi-modal pipelines, and agentic marketplaces. The architect serves as the strategic and engineering authority for end-to-end telemetry, tracing, safety, and quality signals across heterogeneous agent frameworks and platforms.

The role leads the convergence of AI observability with safety and security (including red teaming), Responsible AI (RAI), data science, physical AI, memory/skills engineering, agent fleet management, self-evolving harnesses, reinforcement learning, agent-to-agent protocols (A2A, UCP, AP2), and continuous quality engineering, making it a uniquely broad, high-impact role within the AI Solutions & Platforms organization. The role also owns OpenTelemetry (OTEL) integration across third-party agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and others), enabling unified observability and governance at enterprise scale.

Requirements

  • Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, Software Engineering, or a related field (PhD a plus for research-heavy domains).
  • 12+ years in technology with deep experience in enterprise observability, distributed systems, platform engineering, or AI/ML infrastructure.
  • 5+ years in a senior/principal or architect-level role with demonstrated ownership of complex, cross-functional technical programs.
  • Expert-level knowledge of observability primitives (metrics, logs, traces, events) applied to LLM/ML/agentic systems; hands-on OpenTelemetry (OTEL) instrumentation including custom exporters, semantic conventions, and trace propagation across agent/tool boundaries (a minimal instrumentation sketch follows this list).
  • Direct experience with agentic AI platforms, multi-agent orchestration, LLM-based workflow design, and agent lifecycle management at production scale.
  • Demonstrated experience conducting red team exercises against AI systems; knowledge of adversarial attack patterns, prompt injection, model evasion, and multi-agent trust boundary failures; ability to design safety telemetry pipelines.
  • Working knowledge of agent memory architectures (episodic, semantic, working memory), Model Context Protocol (MCP), skill registries, and context injection patterns — with ability to design observability for these layers.
  • Familiarity with A2A (Agent-to-Agent), UCP (Universal Communication Protocol), and AP2 patterns; ability to implement protocol-level observability and policy enforcement.
  • Understanding of RL training loops, reward signal capture, policy evaluation, and harness instrumentation for continuously improving agent systems.
  • Experience or strong familiarity with observability for physical AI pipelines (robotics, edge inference, sensor fusion) and multi-modal models (vision, audio, text).
  • Proficiency in Python at a senior engineering level; experience with statistical anomaly detection, time-series analysis, and data pipeline design applied to observability data at scale.
  • Hands-on experience integrating OTEL with enterprise agentic platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or similar; strong understanding of enterprise integration patterns and API design.
  • Cloud fluency across Azure, AWS, and GCP; proficiency in Kubernetes, service mesh, IaC (Terraform/Bicep), and CI/CD tooling; experience with event streaming platforms (Kafka, Event Hubs).
  • Experience designing Continuous Quality Engineering (CQE) frameworks for agentic solutions including eval harnesses, regression detection, quality gates, and SLA-backed quality benchmarking.
  • Familiarity with RAI principles — fairness, bias detection, explainability, and safety — and ability to operationalize RAI signal capture within production observability pipelines.
  • Experience or strong familiarity with agent marketplace architectures, capability registries, and platform governance — ideally with observability or monitoring responsibilities for marketplace-registered components.

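To illustrate the kind of instrumentation this role centers on, here is a minimal sketch of OTEL trace propagation across an agent/tool boundary using the open-source OpenTelemetry Python SDK. The span names and attribute keys (agent.task, tool.name) are illustrative choices for this sketch, not established semantic conventions.

    # Minimal sketch: a tool call recorded as a child span of an agent step,
    # so the planner -> tool hop shows up as one connected trace.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("agent.observability.sketch")

    def call_tool(tool_name: str, payload: str) -> str:
        # Child span inherits the active trace context from the agent span.
        with tracer.start_as_current_span("tool.invoke") as span:
            span.set_attribute("tool.name", tool_name)             # illustrative key
            span.set_attribute("tool.payload.bytes", len(payload))
            return f"{tool_name} handled: {payload}"

    def run_agent_step(task: str) -> str:
        # Parent span: one planner/executor iteration.
        with tracer.start_as_current_span("agent.step") as span:
            span.set_attribute("agent.task", task)                 # illustrative key
            return call_tool("search", task)

    if __name__ == "__main__":
        run_agent_step("find supplier invoices")
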
Nice To Haves

  • Published contributions or hands-on experience with emerging agent frameworks (LangGraph, AutoGen, CrewAI, Semantic Kernel, Bedrock Agents, or equivalent).
  • Experience with Grafana, Datadog, New Relic, Dynatrace, or equivalent enterprise observability platforms — ideally extended to support AI/LLM workloads.
  • Familiarity with vector databases (Pinecone, Weaviate, pgvector) and semantic search observability patterns relevant to RAG pipelines.
  • Background in MLOps, LLMOps, or model lifecycle management — including model versioning, drift detection, and deployment governance.
  • Experience designing observability APIs and SDK hooks for developer self-service onboarding.

Responsibilities

  • Define and own the enterprise observability architecture for AI agents, LLMs, multi-agent workflows, and physical AI systems — covering planner/executor loops, tool/function calls, RAG retrieval chains, and memory/state transitions.
  • Build and operate unified telemetry pipelines incorporating metrics, logs, distributed traces, semantic/vector signals, and real-time event streaming (Kafka) at enterprise scale.
  • Instrument OpenTelemetry (OTEL) across heterogeneous platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and internal frameworks — delivering protocol-level observability for agent ecosystems including MCP, A2A, UCP, and AP2.
  • Design and implement observability for Agent Fleets, multi-modal pipelines, physical AI systems, and self-evolving reinforcement learning harnesses — including signal capture for reward shaping and policy evaluation.
  • Deliver dashboards, alerting, SLO/SLA management, incident runbook automation, and RCA tooling that drive measurable reliability improvements and reduce MTTR across agentic services.
  • Establish cost telemetry and FinOps observability for AI workloads — token consumption, inference cost allocation, and GPU/compute efficiency across cloud environments (Azure, AWS, GCP).
  • Lead observability-driven red team exercises targeting agentic AI systems — instrumenting attack surfaces, adversarial prompt injection vectors, model evasion attempts, and multi-agent trust boundary failures.
  • Design telemetry pipelines that capture safety-critical signals: guardrail trigger rates, policy violation events, PII exposure risks, prompt leakage, and agent hallucination rates.
  • Partner with Security and RAI teams to embed threat modeling, zero-trust agent authentication, and behavioral anomaly detection into the observability platform.
  • Instrument secure policy enforcement layers across agent-to-agent communication protocols (A2A, UCP, AP2) and maintain audit-ready traceability for all AI decision events.
  • Develop and maintain a Security Observability Playbook covering incident classification, escalation paths, and forensic trace retention policies for agentic AI systems.
  • Integrate RAI signal capture — fairness, bias detection, explainability, and safety metrics — directly into observability pipelines, making compliance measurable and audit-ready.
  • Deliver governance dashboards that surface RAI compliance posture across all active AI agents and LLM deployments, aligned with global regulatory standards.
  • Support risk assessments, gap analyses, and governance frameworks with real-time observability insights — enabling proactive risk mitigation rather than reactive audit responses.
  • Collaborate with RAI CoE and Legal/Compliance teams to define data retention, consent logging, and model decision traceability standards embedded in the telemetry architecture.
  • Own the Continuous Quality Engineering (CQE) framework for post-production agentic solutions — defining and tracking quality metrics across accuracy, latency, agent success rate, tool-call fidelity, and user outcome measures.
  • Build automated quality gates within CI/CD pipelines that leverage observability data to detect regressions, drift, and degradation in agent performance, preventing silent failures in production (a minimal gate sketch follows this list).
  • Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack — providing traceability from eval results to production behavior.
  • Partner with product and business stakeholders to define SLA-backed quality benchmarks and deliver automated alerting when quality thresholds are breached.
  • Drive root-cause analysis for quality failures using distributed trace data, enabling rapid iteration and continuous improvement cycles for agentic solutions.
  • Design and implement observability for the agent memory layer — episodic, semantic, and working memory read/write operations — providing latency, accuracy, and drift monitoring across memory backends.
  • Instrument MCP (Model Context Protocol) server interactions, tool registrations, skill invocations, and context injection pipelines with full trace propagation and semantic tagging.
  • Own observability for self-evolving harness and reinforcement learning (RL) systems — capturing reward signals, policy update events, environment state transitions, and learning convergence metrics.
  • Monitor harness execution fidelity, skill eval pass/fail rates, and regression signals across training, fine-tuning, and inference workflows — feeding data back into the quality engineering loop.
  • Lead a team of senior Python engineers building high-performance, production-grade observability tooling — including custom OTEL exporters, semantic trace enrichers, signal aggregators, and anomaly detection pipelines.
  • Apply data science methods — statistical process control, time-series anomaly detection, clustering, and causal inference — to transform raw telemetry into actionable AI operational intelligence (see the anomaly-detection sketch after this list).
  • Build and maintain Python-native SDKs and libraries that simplify observability onboarding for agent developers across the organization.
  • Establish code quality standards, testing frameworks, and peer review practices for the observability engineering team — embedding software craftsmanship into the team culture.
  • Instrument the Agentic Marketplace and Agent Registry platforms — providing usage telemetry, adoption metrics, capability health scores, and dependency mapping for registered agents and skills.
  • Design observability APIs and SDK hooks that allow marketplace-registered agents to self-report health, performance, and behavioral signals into the central observability platform.
  • Monitor inter-agent communication patterns across the marketplace ecosystem — identifying latency hotspots, circular dependencies, and protocol mismatches in agent-to-agent (A2A) workflows.
  • Deliver a Marketplace Observability Dashboard surfacing agent catalog health, adoption trends, quality scores, and incident history — supporting marketplace governance and curation decisions.
  • Build and maintain CI/CD pipelines for observability services and agent operations center components, incorporating automated testing, deployment gates, and rollback mechanisms.
  • Automate onboarding for new agent use cases using templates, scaffolding, and configuration validation — reducing time-to-observability from weeks to hours.
  • Drive infrastructure-as-code (IaC) practices for observability platform components across Azure, AWS, and GCP — ensuring reproducible, version-controlled, and auditable deployments.
  • Operate with a product mindset — defining observability platform roadmaps, OKRs, adoption playbooks, and release milestones in partnership with AI platform and business teams.
  • Collaborate with transformation teams, enterprise architects, security, and business stakeholders to tailor observability solutions to domain-specific requirements.
  • Serve as the technical authority in executive and governance forums — translating complex observability data into business-relevant insights on risk, cost, and AI performance.
  • Partner with SRE, AI platform, and product teams to drive standard adoption and reduce integration friction across the agentic AI ecosystem.
  • Build, mentor, and lead a high-performing observability engineering team — spanning Python developers, data scientists, and platform engineers — with talent initially based in India.
  • Define career paths, skills development plans, and leveling criteria aligned with PepsiCo job architecture — fostering an inclusive, high-accountability team culture.
  • Drive hiring, coaching, performance management, and succession planning across the observability function.
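
As a concrete illustration of the automated quality gates described above, the sketch below fails a CI/CD stage when an agent's eval pass rate drops below a floor. The threshold value, the eval_results.json path, and the result format are assumptions for illustration, not an existing pipeline contract.

    # Minimal CI quality-gate sketch. The pass-rate floor and the
    # eval_results.json format are illustrative assumptions.
    import json
    import sys

    THRESHOLD = 0.95  # illustrative SLA-backed pass-rate floor

    def gate(results_path: str) -> int:
        with open(results_path) as f:
            # Assumed format: [{"eval": "tool_call_fidelity", "passed": true}, ...]
            results = json.load(f)
        passed = sum(1 for r in results if r["passed"])
        rate = passed / len(results) if results else 0.0
        print(f"eval pass rate: {rate:.2%} (floor {THRESHOLD:.0%})")
        return 0 if rate >= THRESHOLD else 1  # nonzero exit fails the pipeline stage

    if __name__ == "__main__":
        sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "eval_results.json"))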

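In the same spirit, here is a sketch of the statistical process control methods noted above: a simple control-chart rule that flags token-usage spikes in per-minute telemetry. The window size, three-sigma threshold, and synthetic data are illustrative assumptions.

    # Minimal control-chart style check on token-usage telemetry.
    from statistics import mean, stdev

    def flag_anomalies(samples, window=20, sigmas=3.0):
        """Yield (index, value) pairs outside mean +/- sigmas * stdev of the
        preceding `window` samples (a simple SPC-style rule)."""
        for i in range(window, len(samples)):
            baseline = samples[i - window:i]
            mu, sd = mean(baseline), stdev(baseline)
            if sd and abs(samples[i] - mu) > sigmas * sd:
                yield i, samples[i]

    if __name__ == "__main__":
        # Synthetic per-minute token counts with one injected spike.
        tokens = [1000 + (i % 7) * 12 for i in range(60)]
        tokens[45] = 9000  # e.g., a runaway agent loop
        for idx, val in flag_anomalies(tokens):
            print(f"minute {idx}: {val} tokens looks anomalous")
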
Benefits

  • Paid parental leave
  • Vacation
  • Sick leave
  • Bereavement leave
  • Medical
  • Dental
  • Vision
  • Disability
  • Health and Dependent Care Reimbursement Accounts
  • Employee Assistance Program (EAP)
  • Insurance (Accident, Group Legal, Life)
  • Defined Contribution Retirement Plan