Platform Engineer (AI/LLM Infrastructure)

NTT DATA Services • Santa Clara, CA

3d • $130,000 - $170,000 • Onsite

About The Position

We are currently seeking a Platform Engineer (AI/LLM Infrastructure) to join our team in Santa Clara, CA. This role involves leading the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients. The engineer will act as a hands-on technical lead, contributing to development while guiding a team. The position owns end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security. Collaboration with clients and stakeholders is key to designing, presenting, and delivering robust AI infrastructure solutions.

Requirements

5+ years of experience in Platform Engineering, SRE, or Infrastructure Engineering.
3+ years of experience delivering and leading infrastructure for AI/LLM-based production systems.
3+ years of experience with Terraform and GitOps (ArgoCD/Flux).
3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines).
3+ years of experience with CI/CD and container registry management.

Responsibilities

Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients.
Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers.
Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.
Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions.
Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC.
Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management.
Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar).
Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux).
Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps.
Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch.
Lead incident response, on-call processes, and post-mortem analysis.
Ensure a strong security posture and lead InfoSec review processes.
Coordinate delivery across multiple teams and client engagements.