Observability Lead - Cloud SRE & Network Reliability

Lam Research • Fremont, CA

1d • $114,000 - $253,000 • Hybrid

About The Position

Lam is seeking a hands-on Observability Lead with a strong Site Reliability Engineering (SRE) and multi-cloud networking foundation to join our GIS Infrastructure Platform Engineering team. You will lead engineers in delivering robust observability frameworks, SLA/SLO/SLI disciplines, DR/BCP programs, backup and restore operations, and end-to-end network reliability across Azure, AWS, and GCP, and you will own the full-stack delivery of these observability, reliability, and resilience capabilities for a global multi-cloud enterprise.

Requirements

A BS, MS, or PhD in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years of overall experience in Infrastructure, SRE, DevOps, or Network Engineering and 6+ years of experience leading high-performing SRE, Observability, or Platform Engineering teams.
Proven expertise in defining, enforcing, and operating SLA, SLO, and SLI frameworks, including effective error budget management.
Hands-on experience with disaster recovery (DR) and business continuity planning (BCP), including RTO/RPO planning, failover testing, and continuity documentation.
Deep expertise in backup and restore operations across multi-cloud and hybrid environments.
Strong multi-cloud networking skills across Azure (VNet, ExpressRoute, Virtual WAN), AWS (VPC, Transit Gateway, Direct Connect), and GCP (VPC, Cloud Interconnect, VPC-SC).
Experience building and operating observability platforms, including tools such as Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Splunk, or equivalent solutions, with a focus on network telemetry and flow analysis.
Deep expertise in automation, including Ansible, Terraform, Python, and self-healing infrastructure pipelines.
Hands-on experience with infrastructure as code (IaC), CI/CD pipelines, and Kubernetes (AKS, EKS, GKE) across Azure, AWS, and GCP.
Strong programming skills in Python or Go for tooling, automation, and system integrations.
Experience leading P1, P2, and P3 incident management, including ITSM integration (ServiceNow preferred).
Exceptional communication skills, with the ability to translate complex technical concepts into clear business value for engineering, product, and executive stakeholders.

Nice To Haves

Experience with AIOps, including AI-assisted network fault detection, anomaly correlation, and auto-remediation.
Familiarity with agentic AI workflows, including LLM-based agents and RAG patterns, applied to observability and operational use cases.
Background in global WAN architectures, including MPLS and resilience strategies for multi-region enterprise environments.
Experience with compliance-driven disaster recovery and business continuity (DR/BCP) programs, including InfoSec audits, SOX, and ISO 22301 requirements.
Experience with FinOps and multi-cloud cost observability, including network egress visibility and cost optimization across Azure, AWS, and GCP.
Relevant cloud certifications, such as Azure AZ-700 or AZ-305, AWS ANS-C01 or SAP-C02, and GCP Professional Cloud Network Engineer or Professional Cloud Architect.
Background in HPC, on-premises, or hybrid cloud environments.

Responsibilities

Lead and grow a team delivering a world-class observability platform across global, multi-cloud production environments, including Azure, AWS, and GCP.
Define and enforce SLA, SLO, and SLI frameworks across all infrastructure and network domains, driving continuous improvement through effective error budget management.
Own end-to-end multi-cloud network observability, including VNet and VPC traffic flows, Transit Gateway routing, BGP peering health, and inter-region connectivity.
Design and govern multi-cloud networking architectures, including Azure VNet, AWS VPC and Transit Gateway, GCP VPC, and hybrid connectivity solutions such as ExpressRoute, Direct Connect, and Cloud Interconnect.
Design and implement agentic AI workflows using LLM-based agents, RAG patterns, and orchestration frameworks to enable AIOps-driven fault detection and remediation.
Own disaster recovery (DR) and business continuity planning (BCP) strategy, including runbook authorship, multi-cloud failover validation, and periodic DR drills to ensure RTO and RPO commitments are met.
Lead backup and restore operations across multi-cloud and hybrid environments, incorporating automated validation and cross-cloud recovery workflows.
Build robust monitoring and alerting pipelines by integrating Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Azure Monitor, CloudWatch, and Google Cloud Operations into a unified observability stack.
Drive automation-first practices through self-healing pipelines, remediation playbooks, and infrastructure-as-code (IaC) patterns to reduce toil and improve MTTR.
Lead P1, P2, and P3 incident response efforts, including structured post-mortems and action tracking.
Define and drive the multi-quarter roadmap for observability, reliability, networking, DR/BCP, and AI-assisted operations.
Support hiring, performance management, and career development for the team.