BJ's Wholesale Club • posted 4 months ago
Full-time • Senior
5,001-10,000 employees

The Head of IT Operations & Service Excellence is the strategic and operational leader responsible for the uptime and resiliency of systems across BJ’s digital and enterprise technology landscape (applications, infrastructure, and security), providing world-class experiences to our members and team members. The role sets the 'north star' for what 'good' looks like, defining and publishing service-level objectives and indicators (SLOs/SLIs) and operational key results, while building the organizational muscle to deliver them consistently. Reporting to the VP of Infrastructure & Operations, this leader balances real-time incident response with a multi-year service-reliability vision, enabling teams to see the forest for the trees and make data-driven trade-offs.

  • Define and execute the multi-year IT Service Excellence maturity roadmap aligned to business objectives, cloud migration plans, and uptime and resiliency requirements.
  • Craft a multi-year resiliency and cost-optimization roadmap aligned to company growth goals.
  • Implement IT operations best practices.
  • Collaborate with product development teams and influence them to ensure reliability and scalability are considered at the design phase.
  • Partner with Enterprise Architecture to define standards for building reliable applications that are highly available and resilient.
  • Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services.
  • Foster a high‑trust, blameless culture that rewards learning, experimentation, and excellence.
  • Own the IT Operations & Service Excellence budget; optimize OpEx through automation, self‑service, and vendor management.
  • Oversee real‑time monitoring, incident triage, and major‑incident management ensuring MTTR and communications SLAs are met.
  • Maintain a high‑performing L1 Service Desk; drive call deflection via knowledge, AI chatbots, and self‑service password reset.
  • Publish operational metrics (MTTA, MTTR, FCR, abandon rate) with actionable insights.
  • Lead the major incident management function, including defining escalation paths, coordinating cross-functional teams, and ensuring timely communication to stakeholders.
  • Oversee the entire incident lifecycle, from identification and triage to resolution and post-incident analysis, ensuring efficient and effective processes are in place.
  • Manage on-call rotations and ensure 24×7 coverage with major incident managers.
  • Ensure a robust playbook is developed and followed during the major incident management (MIM) process, with clearly assigned roles, communication protocols, and a well-defined triage process.
  • Chair the Change Advisory Board (CAB); uphold 99%+ change success while accelerating deployment velocity.
  • Implement risk-based change classification; ensure thorough end-to-end testing, automated pre-deployment checks, rollback processes, and post-implementation reviews are in place.
  • Develop and implement SRE policies, standards, and best practices for enterprise-wide systems.
  • Lead SRE squads covering AWS, colocation data centers, network/edge, and SaaS platforms.
  • Set error budgets, reliability targets, and chaos‑engineering practices; ensure recovery time and point objectives (RTO/RPO) meet or exceed DR objectives and business expectations.
  • Work with the service managers overseeing SRE functions for Digital, Membership, Enterprise, and Club & Fuel systems to deliver an integrated SRE practice.
  • Drive end‑to‑end service design—service maps, dependency graphs, support models—to complement observability tooling.
  • Lead the roadmap for logging, metrics, tracing, and AIOps platforms, delivering actionable insights and predictive alerting.
  • Understand the potential impact of system requirements and design choices across multiple cloud and on-premises technologies.
  • Continuously improve the reliability, stability, and performance of our key platforms by promoting engineering excellence, implementing best practices, and overseeing the integration of fully automated telemetry within modern DevOps frameworks.
  • Advance problem detection and ensure service restoration processes are well defined.
  • Codify SOPs and RACI matrices across Ops, SRE, Service Desk, and engineering partners to drive clarity of ownership.
  • Lead Lean/Kaizen initiatives that reduce toil and amplify engineering productivity.
  • Track and report OKRs; course‑correct based on data.
  • Drive root‑cause analysis (RCA) and problem management; close systemic gaps and prevent recurrence of major incidents.
  • Partner with Cybersecurity and Compliance teams to meet PCI‑DSS, SOX, and data‑privacy obligations.
  • Ensure operational controls withstand internal and external audits.
  • Possess robust technical expertise and leadership qualities to lead by example with a proven track record in Site Reliability Engineering.
  • Foster a culture of psychological safety, empowerment, and continuous learning.
  • Coach and develop managers; build, mentor, and retain an organization spanning Service Desk, Command Center, SRE, Change Governance, Problem Management, and Analytics.
  • Bachelor’s degree in Computer Science, Engineering, or related discipline (Master’s preferred).
  • 15+ years of progressive IT Operations leadership, including 5+ years at a Director/Head level supporting large-scale, distributed retail environments.
  • Proven track record of leading teams through complex system outages and scalability challenges.
  • 5+ years of proven oversight of 24×7 operations (NOC, Service Desk) and SRE or DevOps functions.
  • Proficiency in system design and architecture, particularly in a cloud environment.
  • Demonstrated success operating hybrid cloud (AWS) and on‑prem data‑center environments.
  • Expertise with ITIL v4/Service Management frameworks; ITIL certification strongly desired.
  • Experience implementing observability, AIOps, and automation platforms (e.g., ServiceNow, OpsRamp, SolarWinds, New Relic, PagerDuty).
  • Outstanding communication skills and executive presence; able to brief C‑suite on risk and performance.
  • Retail industry experience managing store, fuel, and distribution center technologies.
  • Certifications in ServiceNow.
  • Lean Six Sigma or Continuous Improvement accreditation.
  • BJ’s pays weekly.
  • Eligible for free BJ's Inner Circle and Supplemental membership(s).
  • Generous time off programs to support busy lifestyles including Vacation, Personal, Holiday, Sick, Bereavement Leave, Jury Duty.
  • Benefit plans for your changing needs including three medical plans, Health Savings Account (HSA), two dental plans, vision plan, flexible spending.
  • 401(k) plan with company match (must be at least 18 years old).