(USA) Principal, Software Engineer

Walmart, Sunnyvale, CA
$143,000 - $286,000, Onsite

About The Position

Walmart processes more transactions in a day than most companies handle in a year. When performance degrades or systems fail, the impact is immediate, measured in millions of dollars and hundreds of millions of customers. We're building the team that uses agentic AI to prevent those failures.

As a Principal Engineer in Performance and Resiliency Engineering, you'll architect and lead the development of intelligent, self-healing systems: LLM-based agents that detect anomalies, reason across observability data, and trigger automated remediation without waiting for a human in the loop. You'll operate at a scale most AI engineers never encounter: 10,500 stores, 240M weekly customers, and infrastructure that powers one of the world's largest retail ecosystems.

This isn't a research role or a proof-of-concept environment. You'll own the technical strategy, set architectural direction, and ship to production, building agentic systems that directly impact Walmart's global reliability and business continuity.

Building the right technology foundation for Infrastructure & Platforms is vital to success at Walmart's scale. Our team builds and maintains the foundational technologies that power the entire tech organization: data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. We ship to production weekly, run blameless postmortems, and treat chaos experiments as first-class engineering work. If you thrive in high-ownership environments where your architectural decisions have immediate, measurable impact, this is where you belong.

Requirements

  • 10+ years of experience building and operating distributed systems at scale
  • Proven, hands-on production experience with LLMs, agentic frameworks, or RAG-based systems
  • Deep background in performance engineering, chaos engineering, or SRE — with real ownership of SLOs and incident response
  • Strong programming skills in Python and/or Java; comfort working across the full ML stack
  • Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 5 years’ experience in software engineering or related area.
  • Option 2: 7 years’ experience in software engineering or related area.

Nice To Haves

  • Familiarity with ML frameworks: PyTorch, TensorFlow, Hugging Face Transformers
  • Hands-on with cloud-native infrastructure: GCP, Azure, Kubernetes, Docker
  • MLOps experience: CI/CD for ML, drift detection, model monitoring
  • Experimentation background: A/B testing, causal inference, multi-armed bandits
  • Excellent communication skills — able to align technical and non-technical stakeholders on complex architectural decisions
  • Master’s degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years' experience in software engineering or related area.
  • Knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly.
  • Knowledge of accessibility best practices, and a commitment to creating accessible products and services that follow Walmart’s accessibility standards and guidelines for supporting an inclusive culture.

Responsibilities

  • Set the technical direction — not just execute it. From initial architecture through production deployment, own the roadmap for Walmart's agentic AI platform for performance and resiliency.
  • Have the autonomy to make architectural tradeoffs, drive experimentation, and shape how intelligent systems operate at enterprise scale.
  • Architect production multi-agent pipelines — from RAG-based knowledge grounding to LLM-driven decision-making and autonomous remediation — operating across 10,500 stores and 240M weekly customers
  • Own LLM evaluation standards for production: factuality, consistency, safety guardrails, and failure modes; set the bar that other teams adopt
  • Optimize LLM inference at scale through prompt caching, quantization, and retrieval filtering — measurable latency and cost impact, not theoretical gains
  • Integrate vector databases and observability stacks to build context-aware systems that act on live signals without human intervention
  • Build the AI/ML layer that moves Walmart from reactive incident response to predictive, self-correcting infrastructure — cutting mean time to recovery across critical systems
  • Design and run chaos experiments that expose real failure modes and change architecture decisions — not checkbox exercises
  • Define SLOs that reflect real business impact, integrate performance gates into CI/CD, and make observability (Grafana, Prometheus, ELK, Splunk) actionable across the org
  • Write and maintain runbooks that teams actually use: tested, updated after every incident, and clear enough to act on under pressure
  • Set the architectural direction for the org's agentic AI platform — from initial design through production deployment — and own the decisions that follow
  • Close the gap between experimentation and production: move ML models from notebooks into reliable, monitored systems that hold up under Black Friday-scale traffic
  • Raise the technical floor through design reviews and mentoring that produces engineers who make better decisions independently
  • Shape the multi-year roadmap for AI-powered performance and resiliency, influencing infrastructure investment decisions across the org

Benefits

  • competitive pay
  • performance-based bonus awards
  • medical, vision and dental coverage
  • 401(k)
  • stock purchase
  • company-paid life insurance
  • PTO (including sick leave)
  • parental leave
  • family care leave
  • bereavement leave
  • jury duty leave
  • voting leave
  • short-term and long-term disability
  • company discounts
  • military leave pay
  • adoption and surrogacy expense reimbursement
  • Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
  • annual or quarterly performance bonuses
  • Stock