About The Position

Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management. Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations, and advocate for best practices in security, reproducibility, and cost efficiency.

Implement end-to-end observability and operations through metrics, tracing, logging, dashboards, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, and OpenTelemetry). Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage. Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams.

Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management. Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals.
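
As one concrete illustration of the observability work described above, the following is a minimal sketch of a Python training loop that exports training-health metrics for Prometheus to scrape and for Grafana dashboards or alert rules to consume. The metric names, port, and placeholder training step are illustrative assumptions, not details from the posting.

    # Minimal sketch: expose training-health metrics on /metrics for Prometheus.
    # Metric names and the port are illustrative, not part of the role description.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    TRAIN_LOSS = Gauge("trainer_loss", "Most recent training loss")
    STEPS_DONE = Counter("trainer_steps_total", "Total optimizer steps completed")

    def train_step() -> float:
        """Placeholder for one optimizer step; returns a dummy loss value."""
        time.sleep(0.1)
        return random.random()

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes this endpoint
        while True:
            TRAIN_LOSS.set(train_step())
            STEPS_DONE.inc()

An alert rule that fires when trainer_steps_total stalls, for example, is one way automated alerts for training health could be layered on top of metrics like these.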

Requirements

  • Bachelor's Degree in Computer Science or a related technical field AND 4+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
  • Master's Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor's Degree in Computer Science or a related technical field AND 8+ years of such experience; OR equivalent experience.
  • Strong software engineering fundamentals in distributed systems, networking, and storage, with experience building large-scale distributed applications on cloud platforms.
  • Experience building systems for AI research teams, with a solid understanding of training and evaluating large language models (LLMs).
  • Hands-on experience with Kubernetes, Docker, and the Linux container ecosystem, applied to platform reliability and scalability.
  • Experience orchestrating data and compute pipelines with tools such as Airflow or Argo, operating streaming systems (Kafka/Event Hubs), and working with object storage (Azure Blob or S3-compatible stores); a minimal Airflow sketch appears after this list.
  • Experience developing internal portals and CLIs for job lifecycle management, experiment tracking, and metrics visualization.
  • Experience operating GPU clusters (scheduling, isolation, utilization), high-performance computing (HPC), and experiment orchestration for machine learning training; see the job-submission sketch after this list.
  • Familiarity with container security practices and CI/CD pipelines that support robust, reproducible deployments.
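
For the pipeline-orchestration requirement above, the following is a minimal sketch of a daily data-preparation pipeline expressed as an Airflow DAG. The DAG id, task names, and what each task does are illustrative assumptions.

    # Minimal sketch of a daily data-preparation pipeline as an Airflow DAG.
    # DAG id, task names, and task bodies are illustrative assumptions.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_shards():
        print("pull raw shards from object storage")

    def tokenize_shards():
        print("tokenize and write training-ready shards")

    with DAG(
        dag_id="prepare_training_data",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_shards", python_callable=extract_shards)
        tokenize = PythonOperator(task_id="tokenize_shards", python_callable=tokenize_shards)
        extract >> tokenize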
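
And for the GPU cluster operations requirement, this is a minimal sketch of submitting a single-GPU training job with the official kubernetes Python client; the namespace, image, and command are hypothetical placeholders.

    # Minimal sketch: submit a single-GPU batch training job via the Kubernetes API.
    # Namespace, image, and command are hypothetical placeholders.
    from kubernetes import client, config

    def submit_training_job(name: str, image: str, command: list[str]) -> None:
        config.load_kube_config()  # use load_incluster_config() when running in-cluster
        container = client.V1Container(
            name=name,
            image=image,
            command=command,
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )
        template = client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1JobSpec(template=template, backoff_limit=2),
        )
        client.BatchV1Api().create_namespaced_job(namespace="training", body=job)

    if __name__ == "__main__":
        submit_training_job(
            "llm-eval-smoke",
            "example.azurecr.io/trainer:latest",
            ["python", "train.py", "--config", "configs/smoke.yaml"],
        )

A job-submission CLI or portal of the kind described in this posting would typically wrap a call of roughly this shape.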

Responsibilities

  • Design and build core platform services for scalable training and evaluation
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps