Platform Engineer, Model Shaping

Together AI
San Francisco, CA
$200,000 - $290,000
Remote

About The Position

The Model Shaping team at Together AI works on products and research for tailoring open foundation models to downstream applications. We build services that allow machine learning developers to choose the best models for their tasks and further improve these models using domain-specific data. We also develop new methods for more efficient model training and evaluation, drawing inspiration from a broad spectrum of ideas across machine learning, natural language processing, and ML systems.

As a Platform Engineer in Model Shaping, you will work at the intersection of backend engineering and infrastructure, building the foundational layers of Together’s platform for model customization and evaluation. You will design, develop, and operate both the backend services and the underlying systems that let us scale, sustainably and reliably, the production workflows launched by our users as well as internal research experiments.

You will operate in a cross-functional environment, collaborating with other engineers and researchers on the team to improve the infrastructure based on the needs of their projects. You will also work with other engineering teams at Together (such as Commerce, Data Engineering, and Cloud Infrastructure) to integrate the services developed by Model Shaping with the systems those teams build.

Requirements

  • 3+ years of experience in building infrastructure or backend components of production services
  • Extensive experience designing, operating, and troubleshooting production Linux environments and Kubernetes-based platforms
  • Strong software engineering background in Python or Go
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
  • Cloud environment (e.g., AWS/GCP/Azure) administration experience, preferably with a hybrid bare-metal/cloud environment
  • Strong communication skills, with a willingness to document systems and processes and to collaborate with peers of varying technical expertise
  • Comfortable operating across the stack, from cluster operations and infrastructure automation to backend service development

Nice To Haves

  • Developing large-scale production systems with high reliability requirements
  • Pipeline orchestration frameworks (e.g., Kubeflow, Argo Workflows, Flyte)
  • Managing GPU workloads on HPC clusters, ideally with hands-on experience operating NVIDIA’s networking stack (e.g., NCCL, Mellanox firmware, GPUDirect RDMA)
  • Deployment of services for AI training or inference
  • Networking fundamentals, including TCP/IP, DNS, routing, load balancing, TLS, and network debugging tools
  • Maintaining or contributing to open-source projects

Responsibilities

  • Design and build Together’s systems and infrastructure for model customization, including user-facing features and internal improvements
  • Contribute to reliability improvements for the platform, participating in an on-call rotation and improving processes for incident response
  • Create and improve internal tooling for deployment, continuous integration, and observability
  • Build a job orchestration platform spanning multiple datacenters, supporting a highly heterogeneous hardware landscape
  • Partner with teams developing internal services, co-designing these services and incorporating them into systems built within Together

Benefits

  • Startup equity
  • Health insurance
  • Flexible remote work