Deloitte · Posted 15 days ago
$130,000 - $241,000/Yr
Full-time • Mid Level
Cincinnati, OH
5,001-10,000 employees
Professional, Scientific, and Technical Services

We are seeking an accomplished HPC/AI Platform Engineering Manager to lead the design, implementation, and optimization of advanced computing environments that power AI, ML, and LLM workloads. This role is ideal for a hands-on technologist with deep expertise in HPC systems, GPU-accelerated infrastructure, and large-scale AI deployments, combined with the leadership skills to drive fast-paced, innovative initiatives. You will collaborate with engineering, research, and business teams to define infrastructure strategy, assess emerging technologies, and deliver scalable, secure, and high-performance solutions. This role is pivotal in advancing generative AI, analytics, and model training capabilities through robust architecture, automation, and software integration.

  • Design and implement HPC and AI infrastructure leveraging HPE Apollo, ProLiant, Cray, and similar enterprise-class systems.
  • Architect ultra-low-latency, high-throughput interconnect fabrics (InfiniBand NDR/800G, RoCEv2, 100-400 GbE) for large-scale GPU and HPC clusters.
  • Deploy and optimize cutting-edge NVIDIA GPU architectures (e.g., H100, H200, RTX PRO / Blackwell series, NVL-based systems).
  • Develop scalable hybrid HPC and cloud architectures across Azure, AWS, GCP, and on-prem environments.
  • Establish infrastructure blueprints supporting secure, high-throughput AI workloads.
  • Build and manage AI/ML infrastructure to maximize performance and productivity of ML research teams.
  • Architect and optimize distributed training, storage, and scheduling systems for large GPU clusters.
  • Implement automation, observability, and operational frameworks to minimize manual intervention.
  • Deploy and manage GPU-accelerated Kubernetes clusters for AI and HPC workloads.
  • Integrate open-source GenAI components, including vector databases and AI/ML frameworks, for model serving and experimentation.
  • Identify and resolve performance and scalability bottlenecks across infrastructure layers.
  • Develop and maintain automation tools and utilities in Python, Golang, and Bash.
  • Integrate HPC infrastructure with ML frameworks, container runtimes, and orchestration platforms.
  • Contribute to job scheduling, resource management, and telemetry components.
  • Build APIs and interfaces for workload submission, monitoring, and reporting across heterogeneous environments.
  • Design Kubernetes and OpenShift architectures optimized for GPU and AI workloads.
  • Implement GPU scheduling, persistent storage, and high-speed networking configurations.
  • Collaborate with DevOps/MLOps teams to build CI/CD pipelines for containerized research and production environments.
  • Oversee Linux system architectures (RHEL, Ubuntu, OpenShift) with automation via Ansible and Terraform.
  • Implement monitoring and observability (e.g., Prometheus, Grafana, DCGM, and NVML; see the illustrative sketch at the end of this posting).
  • Ensure system scalability, reliability, and security through proactive optimization.
  • Ensure architecture and deployments comply with organizational and regulatory standards.
  • Conduct technical workshops, architecture reviews, and presentations for both technical and executive audiences.
  • Define and drive the infrastructure roadmap in partnership with business stakeholders.
  • Mentor and lead engineering teams, translating business requirements into actionable technical deliverables.
  • Foster innovation and cross-functional collaboration to accelerate AI/ML initiatives.
  • 10+ years of experience in HPC architecture, systems engineering, or platform design with a focus on architecting and operating on-premises Kubernetes for large-scale AI/ML workloads.
  • 3+ years of hands-on experience and proficiency with Linux, Python, Golang, and/or Bash.
  • 2+ years leading teams and/or processes.
  • 2+ years of recent experience working with GPU platforms (strong preference for NVIDIA), distributed systems, and performance optimization.
  • Ability to travel 0-10%, on average, based on the work you do and the customers you serve.
  • Must be a US Citizen.
  • Master's or Ph.D. in Computer Science, Electrical Engineering, or a related discipline, plus relevant work experience.
  • Demonstrated success supporting LLM training and inference workloads in both R&D and production environments.
  • Strong knowledge of high-performance networking, storage, and parallel computing frameworks.
  • Exceptional communication and leadership skills, capable of bridging technical depth with executive strategy.
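
For candidates unfamiliar with the telemetry tooling named in the responsibilities above, here is a minimal illustrative sketch of the kind of Python automation the role describes: polling per-GPU utilization and memory through the NVML bindings (the pynvml package), the same raw signals that DCGM and Prometheus exporters build on. This is not Deloitte's code; it assumes only that an NVIDIA driver and pynvml are installed.

    import pynvml

    # Query per-GPU utilization and memory via NVML. Assumes an NVIDIA
    # driver and the pynvml package; an illustration only, not production
    # tooling for this role.
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent, last sample window
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print(f"GPU {i} ({name}): {util.gpu}% util, "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
    finally:
        pynvml.nvmlShutdown()

In practice, figures like these would be exported as Prometheus metrics and visualized in Grafana rather than printed.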