Building the right technology foundation for infrastructure and platforms is vital to success at the scale of Walmart. Our team builds and maintains the foundational technologies that support the tech organization, including data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. All of these products and services run on scalable, powerful infrastructure that ensures a secure and seamless employee and customer experience across stores, digital channels, and distribution centers.

What you'll do...

Walmart Global Tech's Site Reliability Engineering organization is made up of hybrid systems and software engineers who take technical ownership of reliability, scalability, automation, and mission-critical issues related to the uptime, availability, and fast rate of improvement of Walmart's e-commerce, stores, and omni-channel platform.

As a technical expert in this domain, you'll drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems built on modern tech stacks with intelligent capacity management and predictive performance optimization. You'll be responsible for designing and building Tier 0 high-availability, resilient agentic platforms that serve as the backbone for reliability engineering across all of Walmart's systems, stores, and facilities in the U.S. and international markets. You'll also define and implement unified, intelligent, operationally robust technical solutions and tools for all Walmart Technology organizations across all channels and geographies.

Design, write, and build tools to improve the reliability, latency, availability, and scalability of the Walmart Tech stack, including:
1) Engender reliability and availability, starting with metrics and measurements.
2) Enable scaling by providing tools, developing training, and/or augmenting processes.
3) Build tools and automation to prevent recurrence of problems in mission-critical products and services.
4) Augment existing instrumentation to build a cohesive picture of the characteristics of our systems, with special attention to points of failure.

Drive the team to build and scale fault-tolerant systems and services in our hybrid cloud infrastructure.
Partner with leadership across the organization to establish strategic plans and objectives to improve the mean time to detect and mean time to restore.
Collaborate with service owners to define SLOs and build SLIs to ensure systems are meeting their SLAs.

What you'll bring:

As a Director on the Reliability Engineering and Operations team, you will be responsible for ensuring that critical parts of Walmart's business are prepared for known events and able to address any contingency. You'll have the opportunity to manage the complex challenges of microservices and scale that are unique to Walmart's e-commerce, stores, and omni-channel platform, while using your expertise in coding, algorithms, complex triaging and analysis, and large-scale system design. You'll excel if you have the enthusiasm to dig deep and a flair for sharp technical communication, organization, and prioritizing uptime and availability.
Job Type
Full-time
Career Level
Director