Platform Engineer (AI/LLM Infrastructure)

NTT DATA Services
Santa Clara, CA
$130,000 - $170,000
Onsite

About The Position

We are currently seeking a Platform Engineer (AI/LLM Infrastructure) to join our team in Santa Clara, CA. This role involves leading the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients. The engineer will act as a hands-on technical lead, contributing to development while guiding a team. The position owns end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security. Collaboration with clients and stakeholders is key to designing, presenting, and delivering robust AI infrastructure solutions.

Requirements

  • 5+ years of experience in Platform Engineering, SRE, or Infrastructure Engineering.
  • 3+ years of experience delivering and leading infrastructure for AI/LLM-based production systems.
  • 3+ years of experience with Terraform and GitOps (ArgoCD/Flux).
  • 3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines).
  • 3+ years of experience with CI/CD and container registry management.

Responsibilities

  • Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients.
  • Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers.
  • Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.
  • Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions.
  • Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC.
  • Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management.
  • Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar).
  • Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux).
  • Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps.
  • Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch.
  • Lead incident response, on-call processes, and post-mortem analysis.
  • Ensure strong security posture and lead InfoSec review processes.
  • Coordinate delivery across multiple teams and client engagements.