Infrastructure Engineer

Archetype AI
San Mateo, CA

About The Position

The engineer in this role will own the backend services and cloud infrastructure that power Archetype AI’s production platform, driving system reliability, scalability, and operational excellence as the company grows to meet customer and research demands. The work spans the full stack of distributed systems and cloud platform concerns, from designing high-throughput services to provisioning and automating the infrastructure they run on.

Requirements

  • 7+ years of professional software engineering experience, with a focus on backend or distributed systems.
  • Deep understanding of distributed systems fundamentals—concurrency, consistency, replication, fault tolerance, networking.
  • Hands-on experience building and operating production infrastructure in cloud environments (AWS, GCP, and/or Azure), including compute, networking, and storage services.
  • Working knowledge of container orchestration (Kubernetes) and infrastructure-as-code (Terraform, Pulumi, or similar); a short illustration follows this list.
  • Strong debugging, instrumentation, and observability skills across distributed systems and cloud infrastructure.
  • Demonstrated ownership of complex technical problems and ability to learn and adapt quickly.
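
To give candidates a concrete feel for the infrastructure-as-code item above, here is a minimal Pulumi sketch in Python that provisions an AWS VPC and subnet. The resource names and CIDR ranges are placeholder assumptions for illustration, not details of Archetype AI’s actual stack.

    import pulumi
    import pulumi_aws as aws

    # Hypothetical network for a production platform; names and CIDR
    # ranges are placeholders chosen for this example only.
    vpc = aws.ec2.Vpc("platform-vpc", cidr_block="10.0.0.0/16")

    subnet = aws.ec2.Subnet(
        "platform-subnet",
        vpc_id=vpc.id,
        cidr_block="10.0.1.0/24",
    )

    # Export the IDs so other stacks and tooling can reference them.
    pulumi.export("vpc_id", vpc.id)
    pulumi.export("subnet_id", subnet.id)

The same resources could equally be declared in Terraform HCL; the requirement treats the tools as interchangeable.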

Nice To Haves

  • Proven track record of scaling systems through rapid growth and rebuilding or refactoring for new demands.
  • Experience designing and operating multi-region or multi-cloud deployments with high availability and disaster recovery.
  • Proficiency in systems programming languages (e.g., Rust, C++) and scripting environments (e.g., Python).
  • Experience with Kubernetes ecosystem tooling—Karpenter, Kueue, Helm, ArgoCD, or similar—for workload scheduling, autoscaling, and GitOps.
  • Familiarity with CI/CD systems, service mesh architectures, and secrets/config management at scale.
  • Experience with FIPS compliance, container hardening, or government cloud environments (C2S/SC2S, GovCloud).
  • Familiarity with modern ML stacks and hardware acceleration (e.g., PyTorch, CUDA); a brief sketch follows below.
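
As a brief sketch of the last item, the snippet below shows the device handling typical of GPU-accelerated inference in PyTorch: use CUDA when available, fall back to CPU otherwise. The model is a stand-in, not part of Archetype AI’s platform.

    import torch

    # Prefer a CUDA device when one is present; fall back to CPU otherwise.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Placeholder model standing in for a real inference workload.
    model = torch.nn.Linear(512, 512).to(device).eval()

    # Inference-only path: no gradients needed.
    with torch.no_grad():
        batch = torch.randn(8, 512, device=device)
        output = model(batch)

    print(output.shape, device)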

Responsibilities

  • Architect, implement, and maintain distributed systems that support high-throughput, low-latency AI model inference and data services.
  • Design, provision, and manage cloud infrastructure (AWS, GCP, and/or Azure) including compute, networking, storage, and IAM—using infrastructure-as-code tools such as Terraform, Pulumi, or CloudFormation.
  • Build and operate Kubernetes-based platforms for deploying and scaling production workloads, including GPU-accelerated inference services; a minimal sketch follows below.
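
For the Kubernetes bullet above, a minimal sketch using the official Kubernetes Python client illustrates the kind of work involved: creating a Deployment whose pods each request one NVIDIA GPU. The image name, labels, and namespace are assumptions invented for this example.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig; code running inside the
    # cluster would use config.load_incluster_config() instead.
    config.load_kube_config()

    # Placeholder container image; a real inference service supplies its own.
    container = client.V1Container(
        name="inference",
        image="registry.example.com/inference:latest",
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # one GPU per pod
        ),
    )

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="inference"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "inference"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "inference"}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )

In practice, manifests like this are usually managed declaratively through Helm charts or a GitOps controller such as ArgoCD rather than created imperatively.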