Staff, Software Engineer

Walmart
Sunnyvale, CA
Onsite

About The Position

At Walmart Global Tech, we build highly scalable and reliable backend platforms that power the online marketplace of the world’s largest retail ecosystem. Our systems process massive volumes of real-time and batch data across the Walmart marketplace. We are looking for a Staff Software Engineer with deep expertise in Java and Spring Boot, strong hands-on experience in Apache Kafka and Apache Spark, and a proven track record of building distributed systems at scale. This is an onsite role in Sunnyvale, CA, and candidates must have valid U.S. work authorization, as visa sponsorship is not available.

Role Overview

As a Staff Engineer, you will act as a hands-on technical leader and system architect, responsible for designing and delivering large-scale backend platforms and data processing systems. You will work cross-functionally to solve complex engineering challenges, influence platform architecture, and mentor senior engineers. This role requires strong ownership, deep system thinking, and the ability to design for high throughput, low latency, and extreme reliability.

Requirements

  • 12+ years of experience in backend and distributed systems engineering.
  • Must have: strong hands-on experience in Java and Spring Boot for building production-grade microservices.
  • Deep expertise in Apache Kafka: topic design and partitioning; consumer group scaling and offset management; delivery semantics (at-least-once / exactly-once); stream processing patterns and performance tuning.
  • Strong hands-on experience with Apache Spark: batch and Structured Streaming workloads; job optimization (shuffle tuning, memory tuning, skew handling); working with large-scale datasets.
  • Proven experience building systems operating at large scale (millions–billions of events / high TPS platforms).
  • Experience designing event-driven microservices architectures.
  • Strong understanding of distributed systems fundamentals: fault tolerance, back-pressure, idempotency, and consistency trade-offs.
  • Experience with cloud-native deployments (Kubernetes, Docker, AWS/GCP/Azure).
  • Experience with NoSQL / analytical data stores such as Cassandra, BigQuery, HBase, or similar.
  • Strong production debugging and performance tuning skills.

Nice To Haves

  • Orchestration Ecosystem: Direct experience building or deeply customizing platforms like Temporal.io, Cadence, Apache Airflow, or Argo Workflows.
  • Deep State Knowledge (Distributed State Management & Durable Execution): Experience managing the state of long-running processes that must survive infrastructure failures, network partitions, and deployments.
  • Event Sourcing & CQRS: Familiarity with using event-sourcing patterns to rebuild the state of a workflow by replaying history.
  • Transactions: Understanding of the Saga Pattern for managing distributed transactions and implementing compensations (rollbacks) across microservices.
  • Idempotency Mastery (Fault Tolerance & High Availability): Expertise in designing systems where tasks can be retried indefinitely without duplicating side effects, a critical requirement for any orchestration engine.
  • Advanced Retry Policies: Knowledge of jitter, exponential backoff, and circuit breakers to prevent "thundering herd" problems when a downstream service fails.
  • Rate Limiting & Quotas: Experience building multi-tenant throttling mechanisms to ensure one massive workflow doesn't starve others of resources.
  • DSL Design (Developer Experience (DevX) & DSLs): Experience designing Domain-Specific Languages (YAML, JSON, or Python-based) that allow users to define complex logic simply.
  • SDK Development: Ability to build client-side libraries that abstract away the complexity of the underlying orchestration engine for other developers.
  • Message Brokers (High-Throughput Messaging & Queuing): Professional experience with Kafka, Pulsar, or RabbitMQ specifically used as a task distribution layer.
  • Priority Queuing: Implementing logic to handle "hot" tasks vs. background tasks efficiently.
  • Hands-on experience with existing orchestrators such as Temporal.io, Cadence, Apache Airflow, Argo Workflows, or AWS Step Functions.
  • An understanding of why these tools succeed (or fail) in specific use cases.
  • Experience in retail, supply chain, pricing, ads, or e-commerce platforms.
  • Exposure to real-time analytics, recommendation engines, or fraud detection systems.
  • Experience driving cross-team technical initiatives and platform modernization efforts.
  • Familiarity with CI/CD pipelines, observability (metrics/logging/tracing), and infrastructure as code.
  • Experience contributing to internal frameworks or platform engineering efforts.

Responsibilities

  • Design and build highly scalable backend microservices using Java and Spring Boot.
  • Architect and implement real-time event-driven systems using Apache Kafka.
  • Develop and optimize large-scale batch and streaming data pipelines using Apache Spark.
  • Drive architecture decisions around scalability, resiliency, observability, and cost efficiency.
  • Lead system design reviews and define engineering best practices for distributed systems.
  • Work closely with Product, Data Science, Platform, and Infrastructure teams to deliver business impact.
  • Optimize system performance through partitioning strategies, caching, async processing, and concurrency tuning.
  • Mentor engineers and act as a technical multiplier across multiple teams.
  • Participate in production incident reviews and drive long-term platform reliability improvements.

Benefits

  • At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet.
  • Health benefits include medical, vision and dental coverage.
  • Financial benefits include 401(k), stock purchase and company-paid life insurance.
  • Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
  • Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
  • You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes.
  • Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.
  • Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates.
  • Tuition, books, and fees are completely paid for by Walmart.