Senior AI Platform Engineer

Qualcomm · San Diego, CA
$111,300 - $166,900

About The Position

We are seeking a Senior AI Platform Engineer to design, build, and operate the infrastructure that powers large-scale AI and ML workloads, with a strong focus on hosting and serving LLMs at scale. This role requires deep expertise in Kubernetes, multi-cloud environments, and observability systems, as well as experience running agentic workflow orchestration (e.g., n8n) in production. You will collaborate with global teams to deliver secure, reliable, and cost-efficient AI platforms.

Requirements

  • 5–7 years of experience in platform engineering, MLOps, or SRE roles.
  • Strong hands-on experience with Kubernetes (production-grade deployments, autoscaling, GPU scheduling).
  • Cloud platforms: AWS (Bedrock, SageMaker), plus GCP and/or Azure.
  • Python and scripting languages (Bash, PowerShell).
  • Linux systems administration.
  • Proven experience hosting and serving LLMs at scale in production environments.
  • Expertise in observability: Elasticsearch, Prometheus, Grafana, OpenTelemetry.
  • Familiarity with agentic workflow systems (e.g., n8n) and scaling them for enterprise use.
  • Strong understanding of networking, security, and IAM in cloud-native environments.
  • Excellent communication skills and ability to work with global teams.

Nice To Haves

  • Experience with model serving frameworks (vLLM, Triton, KServe, Ray Serve).
  • Knowledge of vector databases (Elasticsearch vector, Milvus, Pinecone) for RAG workflows.
  • Familiarity with service mesh (Istio/Linkerd), policy-as-code (OPA/Gatekeeper).
  • GPU optimization for inference workloads.
  • Certifications: AWS Solutions Architect or ML Specialty, CKA/CKAD.

Responsibilities

  • Deploy and manage large language models (LLMs) at scale using AWS Bedrock, GCP Vertex AI, Azure AI Foundry, and Kubernetes-based solutions.
  • Optimize inference performance for throughput, latency, and cost efficiency.
  • Build and maintain Kubernetes clusters for AI workloads with GPU scheduling, autoscaling, and high availability.
  • Build and deploy autoscaling applications and APIs on existing Kubernetes clusters.
  • Implement CI/CD pipelines and Infrastructure as Code (Terraform, Helm).
  • Design and implement observability stacks for large-scale systems, including metrics, logs, and traces.
  • Manage large-scale search systems built on Elasticsearch that power hybrid search solutions.
  • Deploy and scale agentic workflow orchestration systems (e.g., n8n) for AI-driven automation.
  • Ensure reliability, security, and performance of workflow execution at scale.
  • Operate across AWS, GCP, and Azure, leveraging managed AI services and GPU infrastructure.
  • Work closely with globally distributed teams; provide documentation, mentorship, and participate in on-call rotations.

Benefits

  • $111,300.00 - $166,900.00 salary range.
  • Competitive annual discretionary bonus program.
  • Opportunity for annual RSU grants.
  • Highly competitive benefits package designed to support success at work, at home, and at play.