Principal AI Architect/Engineer

PepsiCo
Plano, TX

About The Position

The AI Platform/Observability Architect is an execution-focused engineer who designs, builds, and operates observability capabilities within a defined domain of the enterprise AI observability platform. Working under the strategic direction of the Senior AI Observability Architect (L11), this role translates architecture blueprints into production-grade instrumentation, telemetry pipelines, dashboards, quality gates, and safety signals across agentic AI systems. The junior architect is a hands-on engineer who codes, integrates, tests, and iterates, owning feature-level delivery within one or more specialization tracks while steadily broadening their understanding of the full observability platform. They are a technical practitioner first, with an emerging architect mindset.

Requirements

  • Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
  • 6–8 years of experience in software engineering, platform engineering, or data engineering, with at least 2–3 years of hands-on work in observability, monitoring, or distributed systems.
  • Demonstrated ability to deliver production-grade software in a team environment; track record of completing complex technical features end-to-end.
  • Python Proficiency: Strong Python engineering skills — writing clean, testable, maintainable production code; familiarity with async patterns, type hints, and modern Python tooling (Poetry, Ruff, pytest).
  • Observability Fundamentals: Solid working knowledge of the three pillars of observability (metrics, logs, traces); ability to instrument services with OpenTelemetry (OTEL) SDKs; understanding of trace context propagation and semantic conventions (a minimal sketch appears below).
  • Distributed Systems: Working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC APIs, and containerized deployment (Docker, Kubernetes).
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) — including managed services, IAM basics, and cost awareness.
  • CI/CD & DevOps: Experience building or contributing to CI/CD pipelines; familiarity with GitOps, infrastructure-as-code concepts, and automated testing frameworks.
  • Data Fundamentals: Ability to query, analyze, and visualize time-series and log data using tools such as Grafana, Datadog, Splunk, Prometheus, or equivalent.

Preferred Qualifications

  • Hands-on experience with agentic AI frameworks (LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or equivalent).
  • Contributions to open-source observability projects or OTEL community.
  • Familiarity with reinforcement learning concepts, self-supervised learning, or model fine-tuning workflows.
  • Experience with security tooling relevant to AI (adversarial robustness libraries, LLM safety frameworks, or red-team toolkits).
  • Exposure to Responsible AI frameworks, fairness evaluation libraries (Arize, Fairlearn, AI Fairness 360), or explainability tools (SHAP, LIME).
  • Experience in a fast-paced AI platform, MLOps, or LLMOps role with production deployment responsibilities.
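
To make the expected OTEL fluency concrete, here is a minimal sketch (referenced from the Observability Fundamentals item above) of manual span creation and W3C trace context propagation with the OpenTelemetry Python SDK. The service name, span name, and attribute key are illustrative only, not PepsiCo conventions.

```python
# Minimal OTEL instrumentation sketch: manual spans plus W3C context
# propagation. Names like "agent-demo" and "tool.name" are illustrative.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire a tracer provider that batches finished spans to stdout.
provider = TracerProvider(resource=Resource.create({"service.name": "agent-demo"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def call_downstream_tool(payload: dict) -> None:
    # Each tool call becomes a span; child calls inherit the active context.
    with tracer.start_as_current_span("tool.invoke") as span:
        span.set_attribute("tool.name", "vector-search")  # illustrative key
        headers: dict[str, str] = {}
        inject(headers)  # adds the W3C traceparent header for downstream hops
        # ... issue the HTTP/gRPC request with `headers` attached ...
```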

Responsibilities

  • Implement OpenTelemetry (OTEL) instrumentation within assigned agent frameworks or platforms, including custom exporters, span enrichers, semantic conventions, and context propagation hooks (an enricher sketch follows this list).
  • Build and maintain telemetry pipeline components (collectors, processors, exporters) that route metrics, logs, traces, and semantic signals to central observability backends.
  • Integrate OTEL with enterprise agentic platforms as assigned (which may include Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks), following architecture blueprints set by the L11 architect.
  • Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for the assigned sub-domain, ensuring signal quality and low false-positive rates.
  • Participate in on-call rotations and incident response for the observability platform — contributing to RCA documentation and runbook improvement.
  • Write unit, integration, and end-to-end tests for all telemetry components; maintain >80% test coverage across owned services.
  • Instrument safety-critical signal capture within assigned pipelines — including guardrail trigger rates, policy violation events, prompt injection detections, and hallucination flags.
  • Support red team exercises by building observability hooks that capture adversarial test results, attack surface telemetry, and behavioral deviation signals in real time.
  • Implement secure trace handling for sensitive AI decision events, applying data masking, PII redaction, and audit-log retention policies as defined by the security architecture (a redaction sketch follows this list).
  • Assist in maintaining the Security Observability Playbook — documenting findings, updating escalation paths, and contributing to incident classification procedures.
  • Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) for anomalous communication patterns and flag deviations for review by the L11 architect and security team.
  • Implement RAI signal collectors within assigned agent workflows — capturing fairness indicators, bias detection outputs, explainability scores, and content safety classifications.
  • Maintain RAI telemetry pipelines and ensure data quality, completeness, and timeliness of governance signals feeding into compliance dashboards.
  • Contribute to audit-readiness work by ensuring all AI decision traces within the assigned domain include required governance metadata and are retained per policy.
  • Support gap analyses by comparing current RAI signal coverage against governance framework requirements and flagging coverage gaps to the L11 architect.
  • Build and maintain quality gate components within CI/CD pipelines, using observability data to detect performance regressions, behavioral drift, and SLA breaches before they reach production (a gate sketch follows this list).
  • Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack — collecting eval results, tracking pass/fail trends, and alerting on regression thresholds.
  • Implement continuous quality monitoring for post-go-live agentic solutions — tracking agent success rate, tool-call fidelity, latency distributions, and user outcome proxies.
  • Conduct structured testing of new agent capabilities using standardized eval harnesses — documenting results and feeding findings into quality improvement cycles.
  • Develop automated quality reports and quality metric dashboards for stakeholder review, surfacing trends and anomalies in agent behavior over time.
  • Instrument agent memory operations (read/write latency, cache hit rates, memory drift) across episodic, semantic, and working memory backends within the assigned scope.
  • Add trace instrumentation to MCP server interactions, tagging tool registrations, skill invocations, context injections, and result returns with semantic OTEL attributes (a decorator sketch follows this list).
  • Capture harness execution telemetry for self-evolving and RL systems — logging reward signals, policy update events, environment transitions, and convergence indicators.
  • Monitor skill eval harness execution pipelines — detecting flaky evals, environment setup failures, and result inconsistencies that could mask real capability regressions.
  • Write production-grade Python for observability tooling — custom OTEL exporters, signal aggregators, anomaly detectors, and data transformation pipelines — adhering to team engineering standards.
  • Apply basic statistical and data science methods to telemetry data — time-series analysis, threshold tuning, distribution characterization — to improve signal quality and alerting precision.
  • Contribute to Python SDK and library development that simplifies OTEL onboarding for agent developers across the organization.
  • Participate in code reviews, apply test-driven development practices, and continuously improve the quality and maintainability of the observability codebase.
  • Implement telemetry for agent fleet coordination — capturing spawn/termination events, inter-agent communication traces, load distribution metrics, and fleet health indicators.
  • Contribute to observability instrumentation for physical AI pipelines (edge inference, sensor fusion, robotics control loops) as directed — focusing on latency, reliability, and data quality signals.
  • Add OTEL instrumentation to multi-modal model pipelines — tracing vision, audio, and text input processing stages and capturing cross-modal alignment quality signals.
  • Instrument the Agentic Marketplace and Agent Registry with usage telemetry — tracking agent invocations, capability health scores, adoption trends, and dependency relationships.
  • Implement protocol-level observability for A2A (Agent-to-Agent), UCP, and AP2 communication flows — capturing message latency, error rates, retry patterns, and trust boundary crossings.
  • Contribute to Marketplace Observability Dashboard development, building data connectors, metric calculations, and visualization components as directed by the L11 architect.
  • Collaborate closely with AI platform engineers, AI solution engineers, SREs, and product teams to gather requirements, align on telemetry standards, and resolve integration friction.
  • Participate in agile ceremonies — sprint planning, stand-ups, retrospectives — contributing to estimation, dependency identification, and delivery transparency.
  • Stay current with emerging observability frameworks, OTEL specifications, agent communication protocols, and AI safety research — sharing learnings with the team regularly.
  • Contribute to internal documentation, engineering wikis, and onboarding guides for the observability platform.
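
As a concrete illustration of the span-enricher work named in the first responsibility above, here is a minimal sketch built on the OpenTelemetry Python SDK's SpanProcessor hook. The attribute keys and constructor arguments are invented for illustration; they are not an established semantic convention.

```python
# Span enricher sketch: stamps agent-run metadata onto every span at start.
from typing import Optional
from opentelemetry.context import Context
from opentelemetry.sdk.trace import Span, SpanProcessor

class AgentRunEnricher(SpanProcessor):
    """Adds agent identity attributes to each span as it starts."""

    def __init__(self, agent_id: str, run_id: str) -> None:
        self._agent_id = agent_id
        self._run_id = run_id

    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        # Spans are still mutable at start time, so attributes can be added here.
        span.set_attribute("agent.id", self._agent_id)    # hypothetical key
        span.set_attribute("agent.run_id", self._run_id)  # hypothetical key
```

Registered via `provider.add_span_processor(AgentRunEnricher(...))`, the enricher runs before export, so every backend sees the added attributes.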
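
The secure trace handling responsibility might look like the following sketch: a processor that masks email-like strings in span attributes before any exporter sees them. The regex is deliberately narrow, and the assignment to `_attributes` relies on an SDK-private field rather than a stable API, so treat this as an illustration of the idea, not a hardened implementation.

```python
# PII redaction sketch: scrub string attributes at end-of-span time.
import re
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative pattern

class PIIRedactionProcessor(SpanProcessor):
    """Masks obvious PII in span attributes before export processors run."""

    def on_end(self, span: ReadableSpan) -> None:
        scrubbed = {
            key: EMAIL_RE.sub("[REDACTED]", value) if isinstance(value, str) else value
            for key, value in (span.attributes or {}).items()
        }
        # ReadableSpan is nominally immutable; this overwrites the SDK-private
        # `_attributes` field, an implementation detail of the Python SDK.
        span._attributes = scrubbed
```

Because processors fire in registration order, this one must be added before the batching/export processor so only scrubbed attributes leave the process.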
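
One hedged sketch of the quality-gate responsibility: a standalone script run as a CI step that fails the build when eval results regress. The JSON artifact shape, field names, and thresholds are all assumptions for illustration.

```python
# CI quality gate sketch: block the deploy on eval pass-rate or latency
# regressions. Expects a JSON list like
# [{"passed": true, "latency_ms": 830}, ...] (hypothetical shape).
import json
import statistics
import sys

PASS_RATE_FLOOR = 0.95         # assumed minimum eval pass rate
P95_LATENCY_CEILING_MS = 1200  # assumed maximum p95 tool-call latency

def main(results_path: str) -> int:
    with open(results_path) as fh:
        runs = json.load(fh)
    pass_rate = sum(r["passed"] for r in runs) / len(runs)
    p95 = statistics.quantiles([r["latency_ms"] for r in runs], n=20)[18]
    failures = []
    if pass_rate < PASS_RATE_FLOOR:
        failures.append(f"pass rate {pass_rate:.2%} < {PASS_RATE_FLOOR:.0%}")
    if p95 > P95_LATENCY_CEILING_MS:
        failures.append(f"p95 latency {p95:.0f} ms > {P95_LATENCY_CEILING_MS} ms")
    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```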
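
Finally, the MCP trace-instrumentation duty could start from a decorator like the one below, which tags each tool invocation with semantic attributes. The `mcp.tool.*` attribute names are invented for this sketch; they are not an official OTEL semantic convention.

```python
# Tool-call instrumentation sketch: each invocation emits a tagged span.
import functools
from opentelemetry import trace

tracer = trace.get_tracer("mcp.instrumentation")  # hypothetical scope name

def traced_tool(tool_name: str):
    """Wraps a tool handler so every call produces an attributed span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"mcp.tool.{tool_name}") as span:
                span.set_attribute("mcp.tool.name", tool_name)  # invented key
                try:
                    result = fn(*args, **kwargs)
                    span.set_attribute("mcp.tool.status", "ok")
                    return result
                except Exception as exc:
                    span.record_exception(exc)
                    span.set_attribute("mcp.tool.status", "error")
                    raise
        return wrapper
    return decorator

@traced_tool("vector_search")
def vector_search(query: str) -> list[str]:
    return []  # placeholder tool body
```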

Benefits

  • Paid parental leave
  • Vacation
  • Sick leave
  • Bereavement leave
  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Disability insurance
  • Health Reimbursement Accounts
  • Dependent Care Reimbursement Accounts
  • Employee Assistance Program (EAP)
  • Accident insurance
  • Group Legal insurance
  • Life insurance
  • Defined Contribution Retirement Plan