ML Infrastructure Engineer

Sygaldry Technologies • San Francisco, CA

About The Position

Sygaldry Technologies is building quantum-accelerated AI servers to exponentially speed up training and inference for AI. By integrating quantum and AI, we're accelerating the path to superintelligence, and addressing the problem of rising compute costs and energy bottlenecks. Sygaldry AI servers combine multiple qubit types within a single, fault-tolerant architecture to deliver the combination of cost, scale, and speed necessary for advanced AI applications. We pioneer new domains in physics, engineering, and AI, tackling the hardest challenges with a grounded, optimistic, and rigorous culture. We're looking for individuals ready to define the intersection of quantum and AI and drive its profound global impact.

About the Role

Our AI & Algorithms team is growing fast - research scientists, applied mathematicians, and quantum algorithm researchers developing the algorithms that will accelerate and transform AI. They need compute infrastructure that stays out of their way: GPU access that's reliable, experiments that are reproducible, and workloads that scale without requiring each researcher to become a cloud expert.

You'll build and manage the compute platform this team runs on. The workloads are diverse -- quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation -- across multiple cloud providers and on-prem GPU servers. You own the full stack from cloud provider configuration to the Python APIs that researchers use to launch jobs.

Requirements

Think in systems: you see how compute, storage, networking, and cost interact
Care about developer experience: you've felt the pain of bad research infrastructure
Are pragmatic about tooling: right tool for the job, no over-engineering
Take ownership: you want to own a critical function with autonomy
Write things down: you document decisions and create runbooks

Nice To Haves

Deep AWS experience (EC2, S3, IAM, CloudFormation or Terraform)
GPU compute management (instance types, spot strategies, multi-GPU, distributed training)
Python-based ML and scientific computing tooling (PyTorch, JAX)
GCP and/or Modal experience
MLOps or research computing platforms (MLflow, W&B, Kubeflow, or HPC job schedulers)
CI/CD pipeline management (GitHub Actions, containers)
Hybrid cloud / on-prem GPU cluster management
Experience supporting research teams with heterogeneous computing needs

Responsibilities

Build compute abstractions that handle the team's diverse workloads: GPU-accelerated simulation, distributed training, high-throughput CPU jobs, and interactive analysis -- across PyTorch, JAX, and scientific computing frameworks
Stand up experiment tracking and reproducibility infrastructure
Create developer tooling that makes cloud compute feel local: environment setup, job submission, monitoring, and artifact management
Scale experiments from single-GPU prototyping to multi-node production runs
Design multi-provider workload orchestration: route jobs based on cost, availability, and capability
Manage and optimize spend across cloud providers -- track credit balances, burn rates, and expiration dates
Configure hybrid local + cloud workflows as on-prem GPU infrastructure comes online
Coordinate with our infrastructure engineer on cloud administration and security
Build CI/CD pipelines for research workloads: automated testing, evaluation benchmarks, artifact management
Create data generation and preprocessing pipelines at the throughput the team's simulators demand
Set up monitoring, alerting, and cost dashboards that surface problems before researchers hit them

Benefits

Visa Sponsorship - We know what it takes to make top talent thrive here. We’re open to supporting visas whenever possible.
Compensation - We value your contribution and invest in your future with a competitive salary and meaningful equity.
Benefits - Your well-being matters. We provide company-sponsored health coverage to give you and your family peace of mind.
Connection - Whether it’s a company offsite or casual crew socials, we make time to connect, recharge, and have fun together.
Time Off - We trust you to take the time you need. Unlimited PTO so you can rest, recharge, and come back ready to make an impact.
