Senior Infrastructure & Enterprise Engineer

Expedia Group•Seattle, WA

Hybrid

About The Position

Expedia Group brands power global travel for everyone, everywhere. We design cutting-edge tech to make travel smoother and more memorable, and we create groundbreaking solutions for our partners. Our diverse, vibrant, and welcoming community is essential in driving our success. Our Technology Team partners with teams across Expedia Group to create innovative products, services, and tools to deliver high-quality experiences for travelers, partners, and our employees. A singular technology platform powered by data and machine learning provides secure, differentiated, and personalized experiences that drive loyalty and traveler satisfaction.

HIPS Storage & Compute is responsible for the end-to-end engineering and lifecycle management of all storage and compute platforms within the Expedia Group Hybrid Data Centers. It encompasses physical servers, infrastructure, virtualized environments, Kubernetes (K8s)-based orchestration, and on-premises accelerated GPU pass-through machine learning (ML).

Requirements

Proven experience as an infrastructure, systems, or enterprise engineer operating production services and platforms at scale, including virtualization and Kubernetes (K8s), with clear ownership of critical services or multi‑service environments.
Hands-on expertise with core infrastructure domains such as compute, storage, networking, identity, and security, including designing and implementing APIs and data models for infrastructure and enterprise integrations.
Demonstrated proficiency in scripting or programming and automation tooling (for example, infrastructure as code and CI/CD) to provision, configure, and manage enterprise infrastructure.
Experience contributing to low‑level design (LLD), system architecture, and operational runbooks for infrastructure services, including monitoring, alerting, and incident response.
Familiarity with AI-driven systems, tools, or workflows, as well as virtualization and Kubernetes (K8s), and with applying AI/ML concepts to infrastructure operations or enterprise platforms in a safe and controlled manner.

Nice To Haves

Experience designing and evolving large‑scale, highly available enterprise infrastructure platforms or shared services, including defining technical roadmaps and standards across multiple domains.
Track record of leading complex infrastructure migrations, modernization efforts, or major incident remediation, using data-driven decision making to optimize reliability, performance, and cost.
Deep expertise in low‑level system design (LLD), API design, and data modeling for infrastructure and platform services, enabling secure, self‑service, and automated consumption by internal teams.
Demonstrated operational excellence in running production environments, including advanced observability practices, capacity planning, change management, and rigorous post‑incident analysis.
Role‑specific AI experience, such as designing or operating AI-assisted infrastructure operations (for example, AIOps, intelligent alerting, anomaly detection, or automated remediation), and safely integrating and operating AI/ML‑enabled solutions that improve infrastructure outcomes.

Responsibilities

Design, implement, and operate resilient enterprise infrastructure services and platforms, ensuring reliability, scalability, and security across on‑premises and cloud environments.
Develop and maintain low‑level system designs (LLD), APIs, and data models that enable robust integration between infrastructure components, internal services, and enterprise systems.
Automate infrastructure provisioning, configuration, monitoring, and incident response using modern scripting, IaC, and CI/CD practices to improve efficiency and reduce operational risk.
Collaborate with security, networking, platform, and application engineering teams to define standards, patterns, and guardrails that drive consistency and operational excellence across multiple domains.
Lead troubleshooting and root‑cause analysis for complex infrastructure issues, implementing durable fixes and driving continuous improvements in observability, capacity management, and performance.
Safely integrate and operate AI/ML‑enabled solutions that improve outcomes, drawing on familiarity with AI-driven systems, tools, and workflows, virtualization, and Kubernetes (K8s), and applying AI/ML concepts to real-world products while ensuring they are secure, compliant, and reliable in production.