AI & HPC Infrastructure Engineer

FirstPrinciples

11d•Remote

About The Position

FirstPrinciples is a research organization building AI infrastructure for discovery in fundamental science, focusing on systems like Theo, the AI Physicist. It is a fast-growing, remote-first team working across Canada, the US, and the UK, united by a shared curiosity about the universe and a belief in building systems to explore it more effectively. The work involves tackling abstract problems at the intersection of creativity and rigorous thinking, and it requires comfort with ambiguity and iteration.

This role is central to building and operating the compute foundation for AI-driven scientific discovery, ensuring research and inference workloads are reliable, scalable, and fast. It involves designing, deploying, and operating Kubernetes clusters, Linux systems, GPU infrastructure, cloud environments, HPC-style compute, deployment workflows, monitoring, and automation. The goal is to build infrastructure that supports experimentation and production-like inference across cloud, bare metal, and hybrid environments.

The engineer will play a central role in shaping compute operations: provisioning and managing clusters, improving reliability and observability, reducing operational toil, supporting researchers and engineers, and making strategic decisions about infrastructure choices (managed cloud services, self-managed Kubernetes, Slurm-style systems, or owned hardware).

The ideal candidate is hands-on, systems-oriented, and comfortable in a fast-moving research environment, with strong Kubernetes and Linux fundamentals, good operational instincts, and the cloud and HPC/GPU infrastructure experience needed to build a robust bare metal and multi-cloud inference platform.

Requirements

Strong infrastructure builder with experience operating production, research, cloud, or high-performance compute systems
Deeply comfortable with Linux administration, including debugging networking, storage, system services, permissions, performance issues, and node-level failures
Experienced with Kubernetes in real environments, including cluster operations, deployments, networking, observability, scaling, and troubleshooting
Comfortable working with cloud infrastructure on AWS, GCP, Azure, or equivalent platforms
Familiar with infrastructure automation and configuration tools such as Terraform, Ansible, Helm, ArgoCD, GitOps workflows, or similar systems
Experienced with GPU-heavy, compute-heavy, or HPC-style workloads, especially in environments involving AI, ML, research computing, or scientific workloads
Able to work across bare metal and cloud environments, and interested in the practical tradeoffs between the two
Comfortable reasoning about resource scheduling, cluster utilization, autoscaling, storage, networking, and observability for distributed workloads
Practical and ownership-oriented; you can take ambiguous infrastructure needs and turn them into working systems
Comfortable collaborating across disciplines, especially with researchers and engineers who may not think in infrastructure terms
Able to operate independently as a senior or strong intermediate contributor, while knowing when to bring others into important technical decisions
Motivated by building foundational systems that make ambitious technical and scientific work possible

Nice To Haves

Hands-on experience with production-grade LLM inference and serving engines, such as vLLM, SGLang, or TensorRT
Experience working at an AI company, ML infrastructure team, research lab, university compute environment, HPC center, or scientific computing organization
Experience supporting model inference, model serving, distributed training, high-throughput batch workloads, or internal ML platforms
Hands-on experience with Slurm or similar HPC schedulers, including job scheduling, resource allocation, queue management, and cluster configuration
Experience operating GPU infrastructure, including NVIDIA drivers, CUDA, container runtimes, scheduling, utilization, and hardware failure modes
Experience with RDMA, InfiniBand, high-performance networking, distributed filesystems (e.g., Lustre, BeeGFS), object storage, or storage systems for compute-heavy workloads
Experience with Kubernetes operators, custom controllers, CRDs, or platform tooling for AI/ML workloads
Experience with Prometheus, Grafana, Loki, OpenTelemetry, Datadog, or similar monitoring, logging, and observability tools
Experience with container registries, image optimization, CI/CD systems, deployment pipelines, and secure software delivery
Experience leading engineering operations or infrastructure efforts while remaining hands-on technically
Familiarity with security, access control, secrets management, and reliability practices in production or research environments

Responsibilities

Design, deploy, and operate Kubernetes infrastructure for AI inference, research, and engineering workloads
Set up and manage GPU and HPC-style compute environments, including scheduling, utilization, job management, and node-level troubleshooting
Work with systems such as Kubernetes, Slurm or similar schedulers, container runtimes, GPU drivers and libraries (e.g., CUDA), storage systems, and observability tools
Build and manage Linux-based compute environments, including provisioning, networking, storage, monitoring, access control, and lifecycle management
Help architect bare metal, cloud, and hybrid infrastructure across AWS, GCP, Azure, or equivalent platforms
Own the reliability and operational health of infrastructure systems, including monitoring, alerting, incident response, capacity planning, and performance tuning
Improve deployment workflows, automation, configuration management, secrets management, and infrastructure-as-code practices
Partner with ML engineers, researchers, and software engineers to understand workload requirements and translate them into practical infrastructure designs
Evaluate tradeoffs between managed cloud services, self-managed Kubernetes, HPC schedulers, bare metal deployments, and multi-cloud architectures
Build tooling, documentation, runbooks, and operational practices that help the team move quickly without making infrastructure fragile or opaque
Balance speed and robustness, knowing when to prototype quickly and when to harden systems for long-term use

Benefits

The opportunity to work on foundational problems at the intersection of AI and physics
A high-trust, low-bureaucracy environment with real ownership
Remote-first work with flexibility in how you structure your day
Exposure to cutting-edge ideas across AI, scientific discovery, infrastructure, and emerging technologies
A culture that values curiosity, depth of thinking, and first-principles reasoning
The chance to shape the compute and inference infrastructure behind advanced AI systems for scientific discovery
