AI Services Technical Lead

Lam Research, Fremont, CA
Hybrid

About The Position

In this role, you will directly contribute to the reliability, scalability, and operational excellence of Lam’s Enterprise AI services. As a hands-on technical lead, you will modernize AI operations through observability, automation, and strong engineering discipline, helping ensure AI services are resilient, production-ready, and able to scale effectively across the company. Your work will strengthen incident response, improve service health and readiness, and drive continuous improvement in how Enterprise AI services are operated and supported.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field.
  • Strong hands-on experience supporting cloud-based production platforms in Microsoft Azure.
  • Experience with Application Insights, Azure Monitor, Log Analytics, and Kusto Query Language (KQL) for troubleshooting, telemetry analysis, and operational monitoring.
  • Strong scripting or automation experience using Python and/or PowerShell.
  • Experience supporting CI/CD pipelines, production releases, and operational readiness practices.
  • Experience leading incident triage, root cause analysis, and direct remediation for complex production issues.
  • Strong communication skills with the ability to translate technical issues into clear updates for engineering teams, stakeholders, and leadership.

Nice To Haves

  • Experience supporting AI/ML or generative AI platforms, including services built with Azure OpenAI.
  • Experience with Azure API Management and operational support for API-based services.
  • Experience supporting containerized or distributed services, including AKS/Kubernetes.
  • Experience working with enterprise ticketing or ITSM platforms such as Jira and ServiceNow.

Responsibilities

  • Own hands-on technical operations for Enterprise AI services, ensuring platforms are reliable, maintainable, and ready for production scale.
  • Lead incident triage, technical troubleshooting, service restoration, and root cause analysis for complex production issues affecting AI platforms and services.
  • Build and enhance monitoring dashboards, alerting strategies, health checks, and operational views across Azure services using Application Insights, Azure Monitor, Log Analytics, and KQL.
  • Query logs, analyze telemetry, and identify patterns and failure modes to improve detection, response speed, and long-term reliability.
  • Improve operational automation using Python, PowerShell, and AI-driven approaches to reduce manual effort and strengthen AI Ops maturity.
  • Partner with engineering teams to review architecture, improve operability, strengthen release readiness, and drive remediation of recurring reliability and support issues.
  • Develop and maintain runbooks, support procedures, and operational standards that improve L1/L2/L3 effectiveness across internal teams and service partners.
  • Support change and release processes through readiness reviews, production validation, and post-release monitoring, using enterprise workflows and ticketing systems such as Jira and ServiceNow.
  • Ensure operational processes, controls, and artifacts are audit-ready and support enterprise compliance requirements; support BCP/DR readiness through recovery validation, runbook updates, and failover testing.