Senior AI Platform Engineer

Qualcomm - San Diego, CA
Posted 81 days ago | $111,300 - $166,900 | Hybrid

About The Position

We are seeking a Senior AI Platform Engineer to design, build, and operate the infrastructure that powers large-scale AI and ML workloads, with a strong focus on hosting and serving LLMs at scale. This role requires deep expertise in Kubernetes, multi-cloud environments, and observability systems, as well as experience running agentic workflow orchestration (e.g., n8n) in production. You will collaborate with global teams to deliver secure, reliable, and cost-efficient AI platforms.

Requirements

  • 5-7 years of experience in platform engineering, MLOps, or SRE roles.
  • Strong hands-on experience with Kubernetes (production-grade deployments, autoscaling, GPU scheduling).
  • Cloud platforms: AWS (Bedrock, SageMaker), plus GCP and/or Azure.
  • Python and scripting languages (Bash, PowerShell).
  • Linux systems administration.
  • Proven experience hosting and serving LLMs at scale in production environments.
  • Expertise in observability: Elasticsearch, Prometheus, Grafana, OpenTelemetry.
  • Familiarity with agentic workflow systems (e.g., n8n) and scaling them for enterprise use.
  • Strong understanding of networking, security, and IAM in cloud-native environments.
  • Excellent communication skills and ability to work with global teams.

Nice To Haves

  • Experience with model serving frameworks (vLLM, Triton, KServe, Ray Serve).
  • Knowledge of vector databases (Elasticsearch vector, Milvus, Pinecone) for RAG workflows.
  • Familiarity with service mesh (Istio/Linkerd), policy-as-code (OPA/Gatekeeper).
  • GPU optimization for inference workloads.
  • Certifications: AWS Solutions Architect or ML Specialty, CKA/CKAD.

Responsibilities

  • Deploy and manage large language models (LLMs) at scale using AWS Bedrock, GCP Vertex, Azure AI Foundry, and Kubernetes-based solutions.
  • Optimize inference performance for throughput, latency, and cost efficiency.
  • Build and maintain Kubernetes clusters for AI workloads with GPU scheduling, autoscaling, and high availability.
  • Design and deploy autoscaling applications and APIs on existing Kubernetes clusters.
  • Implement CI/CD pipelines and Infrastructure as Code (Terraform, Helm).
  • Design and implement observability stacks for large-scale systems, including metrics, logs, and traces.
  • Manage large-scale search systems built on Elasticsearch that power hybrid search solutions.
  • Deploy and scale agentic workflow orchestration systems (e.g., n8n) for AI-driven automation.
  • Ensure reliability, security, and performance of workflow execution at scale.
  • Operate across AWS, GCP, and Azure, leveraging managed AI services and GPU infrastructure.
  • Work closely with globally distributed teams; provide documentation, mentorship, and participate in on-call rotations.

Benefits

  • Competitive annual discretionary bonus program.
  • Opportunity for annual RSU grants.
  • Comprehensive benefits package designed to support success at work, at home, and at play.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Industry: Computer and Electronic Product Manufacturing
  • Education Level: Bachelor's degree
  • Number of Employees: 5,001-10,000 employees
