Senior Manager, Solutions Architecture

Deloitte · Detroit, MI
Posted 38 days ago · $130,000 - $241,000

About The Position

We are seeking an accomplished HPC/AI Platform Engineering Manager to lead the design, implementation, and optimization of advanced computing environments that power AI, ML, and LLM workloads. This role is ideal for a hands-on technologist with deep expertise in HPC systems, GPU-accelerated infrastructure, and large-scale AI deployments, combined with the leadership ability to drive fast-paced, innovative initiatives. You will collaborate with engineering, research, and business teams to define infrastructure strategy, assess emerging technologies, and deliver scalable, secure, and high-performance solutions. This role is pivotal in advancing generative AI, analytics, and model training capabilities through robust architecture, automation, and software integration.

Requirements

  • 10+ years of experience in HPC architecture, systems engineering, or platform design with a focus on architecting and operating on-premises Kubernetes for large-scale AI/ML workloads.
  • 3+ years of hands-on experience and proficiency with Linux, Python, Golang, and/or Bash.
  • 2+ years leading teams and/or processes.
  • 2+ years of recent experience working with GPU platforms (strong preference for NVIDIA), distributed systems, and performance optimization.
  • Ability to travel 0-10%, on average, based on the work you do and the customers you serve.
  • Must be a US Citizen.

Nice To Haves

  • Master's or Ph.D. in Computer Science, Electrical Engineering, or a related discipline, plus relevant work experience.
  • Demonstrated success supporting LLM training and inference workloads in both R&D and production environments.
  • Strong knowledge of high-performance networking, storage, and parallel computing frameworks.
  • Exceptional communication and leadership skills, capable of bridging technical depth with executive strategy.

Responsibilities

  • Design and implement HPC and AI infrastructure leveraging HPE Apollo, ProLiant, Cray, and similar enterprise-class systems.
  • Architect ultra-low-latency, high-throughput interconnect fabrics (InfiniBand NDR/800G, RoCEv2, 100-400 GbE) for large-scale GPU and HPC clusters.
  • Deploy and optimize cutting-edge NVIDIA GPU architectures (e.g., H100, H200, RTX PRO / Blackwell series, NVL-based systems).
  • Develop scalable hybrid HPC and cloud architectures across Azure, AWS, GCP, and on-prem environments.
  • Establish infrastructure blueprints supporting secure, high-throughput AI workloads.
  • Build and manage AI/ML infrastructure to maximize performance and productivity of ML research teams.
  • Architect and optimize distributed training, storage, and scheduling systems for large GPU clusters.
  • Implement automation, observability, and operational frameworks to minimize manual intervention.
  • Deploy and manage GPU-accelerated Kubernetes clusters for AI and HPC workloads.
  • Integrate open-source GenAI components, including vector databases and AI/ML frameworks, for model serving and experimentation.
  • Identify and resolve performance and scalability bottlenecks across infrastructure layers.
  • Develop and maintain automation tools and utilities in Python, Golang, and Bash.
  • Integrate HPC infrastructure with ML frameworks, container runtimes, and orchestration platforms.
  • Contribute to job scheduling, resource management, and telemetry components.
  • Build APIs and interfaces for workload submission, monitoring, and reporting across heterogeneous environments.
  • Design Kubernetes and OpenShift architectures optimized for GPU and AI workloads.
  • Implement GPU scheduling, persistent storage, and high-speed networking configurations.
  • Collaborate with DevOps/MLOps teams to build CI/CD pipelines for containerized research and production environments.
  • Oversee Linux system architectures (RHEL, Ubuntu, OpenShift) with automation via Ansible and Terraform.
  • Implement monitoring and observability (e.g., Prometheus, Grafana, DCGM, and NVML).
  • Ensure system scalability, reliability, and security through proactive optimization.
  • Ensure architecture and deployments comply with organizational and regulatory standards.
  • Conduct technical workshops, architecture reviews, and presentations for both technical and executive audiences.
  • Define and drive the infrastructure roadmap in partnership with business stakeholders.
  • Mentor and lead engineering teams, translating business requirements into actionable technical deliverables.
  • Foster innovation and cross-functional collaboration to accelerate AI/ML initiatives.

Benefits

  • You may also be eligible to participate in a discretionary annual incentive program, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Professional, Scientific, and Technical Services
  • Number of Employees: 5,001-10,000 employees