Principal, Data Scientist, Experimentation Sciences

Walmart•Bentonville, AR

53d•Onsite

About The Position

As a Principal Data Scientist at Walmart, you will define and execute the data science roadmap for the experimentation platform that powers trusted decision-making across Walmart’s A/B testing ecosystem. This is a hands-on technical leadership role at the intersection of experimentation science, large-scale data systems, and AI evaluation. You will own the scientific direction behind experiment reporting, dashboards, guardrails, and reusable measurement services, ensuring experiment exposure data is stitched to business and operational outcomes with rigor, scalability, and clarity. You will partner closely with engineering, product, and business teams to modernize our statistical tooling, improve self-service experimentation, and extend our measurement framework to emerging AI use cases including LLM evals, prompt evaluation, hybrid human/LLM judging, and offline-to-online quality measurement. We are looking for a self-starter who can move fluidly from strategy to hands-on prototyping, quickly validating ideas through lightweight automated workflows and proofs of concept.
Our team owns and manages Walmart’s experimentation platform, enabling A/B testing across multiple channels and regions. We build and maintain the scalable infrastructure, data foundations, and measurement systems required to support high experiment volume with reliable and accurate outcomes. One of the team’s core responsibilities is generating experiment reports and dashboards that translate raw experiment data into trusted business insights. To do this, we own a broad set of ETL processes that generate, transform, and stitch experiment exposure data with business and operational metrics. We also develop and maintain the statistical processes and guardrails that underpin sound decision-making, including sample imbalance checks, metric validation, and analysis standards.
As experimentation expands into AI-powered experiences, the team is evolving the platform to support LLM evals, prompt evaluation, and new approaches to measuring quality, customer impact, and business value.

Requirements

Deep expertise in experimentation, causal inference, and statistical decision-making, with a track record of shaping how organizations design, analyze, and operationalize experiments at scale.
Expert-level SQL and PySpark, strong Python skills, and hands-on experience working with high-volume, distributed data pipelines in production environments.
Experience building or materially improving experimentation platforms, measurement systems, or internal science tooling rather than only delivering one-off analyses.
Strong understanding of metric design, guardrails, data quality, and observability for experimentation systems, including sample ratio mismatch, exposure correctness, and downstream metric integrity.
Self-starter mindset, with the ability to work through ambiguity, define a roadmap, and independently drive ideas from concept to execution.
Experience in e-commerce, retail, marketplace, logistics, last-mile delivery, or other high-scale consumer platforms with complex operational feedback loops.
Working knowledge of modern AI evaluation methods, including LLM evals, prompt experimentation, model or prompt regression testing, and hybrid human-plus-automated quality frameworks.
Ability to translate ambiguous business problems into rigorous analysis plans, technical designs, and executive-ready recommendations.
Bachelor’s degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, Operations Research, or related field and 10 years’ experience in data science, experimentation, measurement science, or related field.
OR Master’s degree in one of the above fields and 8 years’ relevant experience.
OR PhD in one of the above fields and 6 years’ relevant experience.
In all cases, candidates should demonstrate strong hands-on experience with SQL, Spark/PySpark, experimentation, and causal inference at production scale.

Nice To Haves

Experience building or scaling experimentation platforms, internal measurement tooling, or self-service analytics capabilities.
Experience supporting high-volume A/B testing in e-commerce, marketplace, or last-mile environments.
Deep knowledge of advanced experimentation methods such as CUPED/CUPAC, switchback designs, cluster randomization, interference and network effects, Bayesian or sequential testing, and observational causal inference.
Experience defining AI evaluation frameworks for conversational AI, search, recommendation, or other LLM-powered products.
Experience with Google Cloud Platform, Airflow, and modern orchestration, monitoring, and data workflow patterns.
Publications, patents, or conference contributions in experimentation, causal inference, AI evaluation, or applied machine learning.
Successful completion of one or more assessments in Python, Spark, Scala, or R.
Experience creating inclusive digital experiences, with demonstrated knowledge of implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, working with assistive technologies, and integrating digital accessibility seamlessly.
Knowledge of accessibility best practices and a commitment to creating accessible products and services that follow Walmart’s accessibility standards and guidelines for supporting an inclusive culture.

Responsibilities

Define the multi-year data science roadmap for experimentation reporting, dashboards, and measurement services, identifying the highest-leverage investments in methodology, automation, and self-service.
Lead the design of scalable statistical frameworks for online experiments across product, business, and operational use cases, including guardrails, heterogeneity analysis, sequential decisioning, variance reduction, and quasi-experimental methods when randomized tests are not feasible.
Partner with data engineering to design robust SQL and PySpark data models, pipelines, and observability standards that improve correctness, speed, and reusability of experimentation data assets.
Establish and govern canonical experiment metrics, scorecards, and reporting standards across channels, regions, and surfaces.
Define the strategy for AI-native experimentation and evaluation, including LLM eval frameworks, prompt evaluation, golden datasets, rubric design, human-in-the-loop review, LLM-as-a-judge calibration, and ongoing regression monitoring.
Build lightweight proofs of concept and small automated workflows using tools such as Python, SQL, Airflow, and Google Cloud Platform technologies to validate ideas before broader platform investment.
Serve as the senior technical advisor to leaders across product, engineering, and business on experimental design, causal interpretation, metric tradeoffs, and measurement risk.

Benefits

Competitive pay
Performance-based bonus awards
Health benefits (medical, vision and dental coverage)
401(k)
Stock purchase
Company-paid life insurance
PTO (including sick leave)
Parental leave
Family care leave
Bereavement
Jury duty
Voting
Short-term disability
Long-term disability
Company discounts
Military Leave Pay
Adoption and surrogacy expense reimbursement
Live Better U (Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.)