Observability and IaC Engineer

Lam Research, Fremont, CA
Posted 2d ago
$114,000 - $253,000

About The Position

We are seeking a forward-thinking Observability and IaC Engineer to lead the automation and instrumentation of our next-generation cloud-native platforms. Rather than simply managing tools, you will architect an integrated "Observability-as-Code" ecosystem that supports distributed tracing, AIOps, and real-time performance analytics. You will design the pipelines that ingest, process, and visualize the health of our services. Your mission is to provide deep, real-time insight into system behavior and reduce the time spent troubleshooting.

On the IaC side, you will own the end-to-end automation of infrastructure lifecycle management. You will build the "Golden Paths" that allow developer teams to provision their own secure, compliant, and scalable resources. Your goal is to ensure that infrastructure is reproducible, resilient to disaster, and integrated into modern CI/CD pipelines.

Requirements

  • Minimum of 12 years of related experience with a Bachelor’s degree in Computer Science Engineering or a related field.
  • Expert-level knowledge of AWS and Azure, including networking topology, IAM, and serverless architectures.
  • Hands-on experience implementing cloud-native observability solutions.
  • Proficiency in Prometheus, Grafana, OpenTelemetry, ELK/Splunk, and modern platforms like Datadog, New Relic, or Dynatrace.
  • Mastery of Terraform/OpenTofu and Pulumi (for programming-language-based IaC).
  • Expert-level knowledge of OpenTelemetry (OTel) and W3C Trace Context.
  • Proficiency in Go, Python, or Bash to build custom automation scripts and CLI tools.
  • Experience using AI-assisted tools for code generation and infrastructure cost/performance optimization.

Nice To Haves

  • Bachelor's/Master's degree in Computer Science Engineering or a related field.
  • Certifications in Azure, AWS, DevOps, or Terraform.
  • Experience in large-scale enterprise environments.

Responsibilities

  • Design and implement robust pipelines that collect and aggregate telemetry data (logs, metrics, events, and traces) from various cloud-native sources.
  • Configure AI-driven anomaly detection to move beyond static thresholds, allowing the system to identify unusual behavior before it triggers a critical outage.
  • Collaborate with software teams to integrate auto-instrumentation libraries into the CI/CD pipeline, ensuring every new service is "observable by default."
  • Automate the deployment of dashboards, alerting rules, and SLO (Service Level Objective) tracking via IaC to ensure consistent visibility across development, staging, and production.
  • Leverage AI-driven operations (AIOps) and distributed tracing to reduce Mean Time to Resolution (MTTR) and lead root-cause analysis for complex, cross-functional system failures.
  • Monitor security event logs (e.g., flow logs, firewall logs) to identify vulnerabilities and ensure systems comply with legal and regulatory requirements.
  • Use tools like Terraform, OpenTofu, or Pulumi to automate the provisioning and deployment of monitoring tools, dashboards, and alerting policies.
  • Design and manage automated deployment pipelines using industry-standard tools (Spacelift/HCP Terraform).
  • Establish continuous reconciliation systems that automatically detect and correct unauthorized changes to infrastructure, maintaining the intended state without human intervention.
  • Embed security policies, encryption-at-rest requirements, and compliance scans directly into IaC templates to enforce "security-by-default."
  • Orchestrate consistent environments across AWS and Azure using cloud-agnostic tools to prevent vendor lock-in and optimize for high availability.
  • Write and automate unit and integration tests for infrastructure changes to prevent breaking production environments.

Benefits

  • At Lam, our people make amazing things possible. That’s why we invest in you throughout the phases of your life with a comprehensive set of outstanding benefits.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 5,001-10,000 employees
