Principal AI Architect/Engineer

PepsiCo
Plano, TX

About The Position

The AI Platform/Observability Architect is an execution-focused engineer who designs, builds, and operates observability capabilities within a defined domain of the enterprise AI observability platform. Working under the strategic direction of the Senior AI Observability Architect (L11), this role translates architecture blueprints into production-grade instrumentation, telemetry pipelines, dashboards, quality gates, and safety signals across agentic AI systems. The junior architect is a hands-on engineer who codes, integrates, tests, and iterates, owning feature-level delivery within one or more specialization tracks while steadily broadening their understanding of the full observability platform. They are a technical practitioner first, with an emerging architect mindset.

Requirements

  • Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
  • 6–8 years of experience in software engineering, platform engineering, or data engineering, with at least 2–3 years of hands-on work in observability, monitoring, or distributed systems.
  • Demonstrated ability to deliver production-grade software in a team environment; track record of completing complex technical features end-to-end.
  • Python Proficiency: Strong Python engineering skills — writing clean, testable, maintainable production code; familiarity with async patterns, type hints, and modern Python tooling (Poetry, Ruff, pytest).
  • Observability Fundamentals: Solid working knowledge of the three pillars of observability (metrics, logs, traces); ability to instrument services with OpenTelemetry (OTEL) SDKs; understanding of trace context propagation and semantic conventions (a minimal sketch appears below).
  • Distributed Systems: Working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC APIs, and containerized deployment (Docker, Kubernetes).
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) — including managed services, IAM basics, and cost awareness.
  • CI/CD & DevOps: Experience building or contributing to CI/CD pipelines; familiarity with GitOps, infrastructure-as-code concepts, and automated testing frameworks.
  • Data Fundamentals: Ability to query, analyze, and visualize time-series and log data using tools such as Grafana, Datadog, Splunk, Prometheus, or equivalent.

Preferred Qualifications

  • Hands-on experience with agentic AI frameworks (LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or equivalent).
  • Contributions to open-source observability projects or OTEL community.
  • Familiarity with reinforcement learning concepts, self-supervised learning, or model fine-tuning workflows.
  • Experience with security tooling relevant to AI (adversarial robustness libraries, LLM safety frameworks, or red-team toolkits).
  • Exposure to Responsible AI frameworks, fairness evaluation libraries (Arize, Fairlearn, AI Fairness 360), or explainability tools (SHAP, LIME).
  • Experience in a fast-paced AI platform, MLOps, or LLMOps role with production deployment responsibilities.
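
To make the expected OTEL fluency concrete, here is a minimal sketch (referenced from the Observability Fundamentals item above) of manual span creation and W3C trace context propagation with the OpenTelemetry Python SDK. The service name, span name, and attribute key are illustrative only, not PepsiCo conventions.

```python
# Minimal OTEL instrumentation sketch: manual spans plus W3C context
# propagation. Names like "agent-demo" and "tool.name" are illustrative.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire a tracer provider that batches finished spans to stdout.
provider = TracerProvider(resource=Resource.create({"service.name": "agent-demo"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def call_downstream_tool(payload: dict) -> None:
    # Each tool call becomes a span; child calls inherit the active context.
    with tracer.start_as_current_span("tool.invoke") as span:
        span.set_attribute("tool.name", "vector-search")  # illustrative key
        headers: dict[str, str] = {}
        inject(headers)  # adds the W3C traceparent header for downstream hops
        # ... issue the HTTP/gRPC request with `headers` attached ...
```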

Responsibilities

  • Implement OpenTelemetry (OTEL) instrumentation within assigned agent frameworks or platforms, including custom exporters, span enrichers, semantic conventions, and context propagation hooks (an enricher sketch follows this list).
  • Build and maintain telemetry pipeline components (collectors, processors, exporters) that route metrics, logs, traces, and semantic signals to central observability backends.
  • Integrate OTEL with enterprise agentic platforms as assigned (which may include Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks), following architecture blueprints set by the L11 architect.
  • Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for the assigned sub-domain, ensuring signal quality and low false-positive rates.
  • Participate in on-call rotations and incident response for the observability platform — contributing to RCA documentation and runbook improvement.
  • Write unit, integration, and end-to-end tests for all telemetry components; maintain >80% test coverage across owned services.
  • Instrument safety-critical signal capture within assigned pipelines — including guardrail trigger rates, policy violation events, prompt injection detections, and hallucination flags.
  • Support red team exercises by building observability hooks that capture adversarial test results, attack surface telemetry, and behavioral deviation signals in real time.
  • Implement secure trace handling for sensitive AI decision events, applying data masking, PII redaction, and audit-log retention policies as defined by the security architecture (a redaction sketch follows this list).
  • Assist in maintaining the Security Observability Playbook — documenting findings, updating escalation paths, and contributing to incident classification procedures.
  • Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) for anomalous communication patterns and flag deviations for review by the L11 architect and security team.
  • Implement RAI signal collectors within assigned agent workflows — capturing fairness indicators, bias detection outputs, explainability scores, and content safety classifications.
  • Maintain RAI telemetry pipelines and ensure data quality, completeness, and timeliness of governance signals feeding into compliance dashboards.
  • Contribute to audit-readiness work by ensuring all AI decision traces within the assigned domain include required governance metadata and are retained per policy.
  • Support gap analyses by comparing current RAI signal coverage against governance framework requirements and flagging coverage gaps to the L11 architect.
  • Build and maintain quality gate components within CI/CD pipelines, using observability data to detect performance regressions, behavioral drift, and SLA breaches before they reach production (a gate sketch follows this list).
  • Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack — collecting eval results, tracking pass/fail trends, and alerting on regression thresholds.
  • Implement continuous quality monitoring for post-go-live agentic solutions — tracking agent success rate, tool-call fidelity, latency distributions, and user outcome proxies.
  • Conduct structured testing of new agent capabilities using standardized eval harnesses — documenting results and feeding findings into quality improvement cycles.
  • Develop automated quality reports and quality metric dashboards for stakeholder review, surfacing trends and anomalies in agent behavior over time.
  • Instrument agent memory operations (read/write latency, cache hit rates, memory drift) across episodic, semantic, and working memory backends within the assigned scope.
  • Add trace instrumentation to MCP server interactions, tagging tool registrations, skill invocations, context injections, and result returns with semantic OTEL attributes (a decorator sketch follows this list).
  • Capture harness execution telemetry for self-evolving and RL systems — logging reward signals, policy update events, environment transitions, and convergence indicators.
  • Monitor skill eval harness execution pipelines — detecting flaky evals, environment setup failures, and result inconsistencies that could mask real capability regressions.
  • Write production-grade Python for observability tooling — custom OTEL exporters, signal aggregators, anomaly detectors, and data transformation pipelines — adhering to team engineering standards.
  • Apply basic statistical and data science methods to telemetry data — time-series analysis, threshold tuning, distribution characterization — to improve signal quality and alerting precision.
  • Contribute to Python SDK and library development that simplifies OTEL onboarding for agent developers across the organization.
  • Participate in code reviews, apply test-driven development practices, and continuously improve the quality and maintainability of the observability codebase.
  • Implement telemetry for agent fleet coordination — capturing spawn/termination events, inter-agent communication traces, load distribution metrics, and fleet health indicators.
  • Contribute to observability instrumentation for physical AI pipelines (edge inference, sensor fusion, robotics control loops) as directed — focusing on latency, reliability, and data quality signals.
  • Add OTEL instrumentation to multi-modal model pipelines — tracing vision, audio, and text input processing stages and capturing cross-modal alignment quality signals.
  • Instrument the Agentic Marketplace and Agent Registry with usage telemetry — tracking agent invocations, capability health scores, adoption trends, and dependency relationships.
  • Implement protocol-level observability for A2A (Agent-to-Agent), UCP, and AP2 communication flows — capturing message latency, error rates, retry patterns, and trust boundary crossings.
  • Contribute to Marketplace Observability Dashboard development, building data connectors, metric calculations, and visualization components as directed by the L11 architect.
  • Collaborate closely with AI platform engineers, AI solution engineers, SREs, and product teams to gather requirements, align on telemetry standards, and resolve integration friction.
  • Participate in agile ceremonies — sprint planning, stand-ups, retrospectives — contributing to estimation, dependency identification, and delivery transparency.
  • Stay current with emerging observability frameworks, OTEL specifications, agent communication protocols, and AI safety research — sharing learnings with the team regularly.
  • Contribute to internal documentation, engineering wikis, and onboarding guides for the observability platform.
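
As a concrete illustration of the span-enricher work named in the first responsibility above, here is a minimal sketch built on the OpenTelemetry Python SDK's SpanProcessor hook. The attribute keys and constructor arguments are invented for illustration; they are not an established semantic convention.

```python
# Span enricher sketch: stamps agent-run metadata onto every span at start.
from typing import Optional
from opentelemetry.context import Context
from opentelemetry.sdk.trace import Span, SpanProcessor

class AgentRunEnricher(SpanProcessor):
    """Adds agent identity attributes to each span as it starts."""

    def __init__(self, agent_id: str, run_id: str) -> None:
        self._agent_id = agent_id
        self._run_id = run_id

    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        # Spans are still mutable at start time, so attributes can be added here.
        span.set_attribute("agent.id", self._agent_id)    # hypothetical key
        span.set_attribute("agent.run_id", self._run_id)  # hypothetical key
```

Registered via `provider.add_span_processor(AgentRunEnricher(...))`, the enricher runs before export, so every backend sees the added attributes.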
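
The secure trace handling responsibility might look like the following sketch: a processor that masks email-like strings in span attributes before any exporter sees them. The regex is deliberately narrow, and the assignment to `_attributes` relies on an SDK-private field rather than a stable API, so treat this as an illustration of the idea, not a hardened implementation.

```python
# PII redaction sketch: scrub string attributes at end-of-span time.
import re
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative pattern

class PIIRedactionProcessor(SpanProcessor):
    """Masks obvious PII in span attributes before export processors run."""

    def on_end(self, span: ReadableSpan) -> None:
        scrubbed = {
            key: EMAIL_RE.sub("[REDACTED]", value) if isinstance(value, str) else value
            for key, value in (span.attributes or {}).items()
        }
        # ReadableSpan is nominally immutable; this overwrites the SDK-private
        # `_attributes` field, an implementation detail of the Python SDK.
        span._attributes = scrubbed
```

Because processors fire in registration order, this one must be added before the batching/export processor so only scrubbed attributes leave the process.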
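
One hedged sketch of the quality-gate responsibility: a standalone script run as a CI step that fails the build when eval results regress. The JSON artifact shape, field names, and thresholds are all assumptions for illustration.

```python
# CI quality gate sketch: block the deploy on eval pass-rate or latency
# regressions. Expects a JSON list like
# [{"passed": true, "latency_ms": 830}, ...] (hypothetical shape).
import json
import statistics
import sys

PASS_RATE_FLOOR = 0.95         # assumed minimum eval pass rate
P95_LATENCY_CEILING_MS = 1200  # assumed maximum p95 tool-call latency

def main(results_path: str) -> int:
    with open(results_path) as fh:
        runs = json.load(fh)
    pass_rate = sum(r["passed"] for r in runs) / len(runs)
    p95 = statistics.quantiles([r["latency_ms"] for r in runs], n=20)[18]
    failures = []
    if pass_rate < PASS_RATE_FLOOR:
        failures.append(f"pass rate {pass_rate:.2%} < {PASS_RATE_FLOOR:.0%}")
    if p95 > P95_LATENCY_CEILING_MS:
        failures.append(f"p95 latency {p95:.0f} ms > {P95_LATENCY_CEILING_MS} ms")
    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```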
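
Finally, the MCP trace-instrumentation duty could start from a decorator like the one below, which tags each tool invocation with semantic attributes. The `mcp.tool.*` attribute names are invented for this sketch; they are not an official OTEL semantic convention.

```python
# Tool-call instrumentation sketch: each invocation emits a tagged span.
import functools
from opentelemetry import trace

tracer = trace.get_tracer("mcp.instrumentation")  # hypothetical scope name

def traced_tool(tool_name: str):
    """Wraps a tool handler so every call produces an attributed span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"mcp.tool.{tool_name}") as span:
                span.set_attribute("mcp.tool.name", tool_name)  # invented key
                try:
                    result = fn(*args, **kwargs)
                    span.set_attribute("mcp.tool.status", "ok")
                    return result
                except Exception as exc:
                    span.record_exception(exc)
                    span.set_attribute("mcp.tool.status", "error")
                    raise
        return wrapper
    return decorator

@traced_tool("vector_search")
def vector_search(query: str) -> list[str]:
    return []  # placeholder tool body
```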

Benefits

  • Paid parental leave
  • Vacation
  • Sick leave
  • Bereavement leave
  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Disability insurance
  • Health Reimbursement Accounts
  • Dependent Care Reimbursement Accounts
  • Employee Assistance Program (EAP)
  • Accident insurance
  • Group Legal insurance
  • Life insurance
  • Defined Contribution Retirement Plan