Member of Technical Staff, Infrastructure

Physical Superintelligence
Boston, MA
Onsite

About The Position

Physical Superintelligence is a stealth startup building AI systems to discover new physics at scale, with roots at Google, NVIDIA, Harvard, Meta, MIT, Oxford, Johns Hopkins, Cambridge, and the Perimeter Institute. We are seeking engineers to build platform infrastructure at the intersection of computational science, AI systems, and software engineering. Our mission is to discover and commercialize transformative physics breakthroughs at scale with artificial superintelligence, safely, verifiably, and for broad public benefit. The last century's golden age of physics gave us transistors, lasers, and nuclear energy. We believe artificial superintelligence will unlock the next one. We're creating the infrastructure to industrialize scientific discovery and usher in this new era. We have one product: new physics, at scale.

Requirements

  • Eight or more years operating cloud infrastructure in production at companies known for engineering rigor (e.g., Stripe, Cloudflare, Datadog, Snowflake, Databricks, Google, Netflix, or comparable), at multi-cloud scale.
  • You have written code and shipped infrastructure that paying customers, internal teams, or large user bases depend on every day.
  • Deep fluency with infrastructure as code (Terraform, Pulumi, or comparable), CI/CD systems, Kubernetes, and major cloud platforms (GCP and AWS at minimum).
  • You have built and operated multi-cloud production deployments end-to-end, from initial cloud setup through release pipelines.
  • Machine learning and training-workload operations experience: GPU scheduling, distributed training infrastructure, model-serving pipelines, observability for ML systems.
  • You have run production training jobs and shipped model-serving endpoints.
  • Operational excellence and on-call discipline. You have led incidents, written runbooks, reduced toil with code, and built systems that scale without bureaucracy.
  • You favor self-service abstractions over tickets and visibility over heroics.

Nice To Haves

  • Built CI/CD or release engineering pipelines from scratch at a fast-growing company.
  • Hands-on experience with model-serving infrastructure such as vLLM, Triton, or comparable.
  • Production observability with OpenTelemetry, Prometheus, Grafana, or comparable.
  • Background in scientific computing, HPC, or research compute environments.

Responsibilities

  • Own the full infrastructure stack end-to-end, from cloud foundations through CI/CD pipelines to production deployments.
  • Build and operate multi-cloud infrastructure for our AI platform across GCP, AWS, and adjacent providers.
  • Establish the infrastructure-as-code discipline at PSI: choose the tooling, design the modules, and make every research workflow, training job, and customer-facing AI product deployable through code.
  • Design and run the release engineering pipeline that ships code from commit to production. Every change flows through automated tests, security scans, and progressive rollouts. Fast, safe deploys are the default; long manual release cycles are not.
  • Operate the production infrastructure that powers our AI platform at scale: the paid API, model training jobs for our proprietary physics LLM, agentic research workflows, and customer deployments.
  • Define and meet SLOs, build observability and alerting, schedule GPU and CPU capacity, lead incident response.
  • Be the leverage layer for the rest of engineering. Platform, product, security, and research engineers all depend on you for reliable cloud primitives, fast deploys, and visible production behavior. Write tools they use, not tickets they wait on.

Benefits

  • Competitive compensation including salary, benefits, and meaningful early-stage equity.