Director of Engineering - Platform Engineering

Health GPT Inc•Palo Alto, CA

49d•Onsite

About The Position

We are seeking a Director of Engineering - Platform Engineering to lead the design, implementation, and operation of HippocraticAI's cloud infrastructure, observability systems, and GPU control plane. This leader will be responsible for scaling our global compute fabric to support cutting-edge LLM workloads while maintaining exceptional reliability, security, and cost efficiency. You will build and lead a multidisciplinary engineering team spanning cloud operations, SRE, and GPU orchestration, working closely with product development, AI research, and compliance to deliver world-class infrastructure for healthcare AI.

Requirements

10+ years of engineering experience, including 5+ years leading infrastructure, SRE, or platform teams at scale.
Proven success in managing large-scale distributed systems and global cloud infrastructure.
Deep experience with high-performance computing or large-scale AI workloads.
Strong background in cloud platforms (AWS, GCP, Azure) and infrastructure-as-code (Terraform, Pulumi, etc.).
Expertise in observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.) and operational excellence.
Experience with security and compliance frameworks relevant to healthcare (HIPAA, SOC 2).
Exceptional communication skills and the ability to partner across product, AI research, and operations.

Nice To Haves

Experience designing or operating GPU control planes or schedulers (e.g., Kubernetes, Ray, Slurm, custom orchestration frameworks).
Prior work with ML infrastructure, data pipelines, or model-serving platforms.
Background in cost optimization and sustainability of GPU/compute operations.
Familiarity with edge or hybrid-cloud deployments for low-latency AI systems.

Responsibilities

Recruit and develop a world-class platform engineering team
Foster a culture of innovation, accountability, and technical excellence.
Mentor and coach engineers and managers to achieve high performance and career growth.
Build and scale a high-performing team responsible for all infrastructure operations and systems reliability.
Define and execute the long-term infrastructure roadmap for a multi-cloud, multi-region GPU and compute environment.
Drive excellence in cloud cost optimization, capacity planning, and service reliability.
Architect and manage HippocraticAI's global GPU control plane, enabling dynamic provisioning, scheduling, and monitoring of inference workloads across regions and providers.
Lead the design and automation of deployments (AWS, GCP, Azure, on-prem) using infrastructure-as-code and CI/CD best practices.
Ensure strong security posture and compliance across all environments, aligned with HIPAA, SOC 2, and other healthcare data standards.
Develop and scale comprehensive observability systems-covering telemetry, tracing, logging, and alerting-to ensure full visibility into production systems and AI workloads.
Establish SLOs, SLIs, and SLAs for all mission-critical services and infrastructure.
Implement robust incident management, root cause analysis, and continuous improvement processes.
Partner with AI and product teams to anticipate infrastructure needs and design scalable architectures for rapid experimentation and deployment.
Contribute to the design of internal developer platforms that improve productivity and standardization.
Evaluate emerging technologies (e.g., new GPU hardware, orchestration frameworks, data center partnerships) to advance our capabilities.