Platform Engineer

Axiom • San Francisco, CA

About The Position

Axiom is building the translational intelligence layer for drug discovery: AI systems that help scientists predict human toxicity earlier, more accurately, and more mechanistically than animal studies or legacy in vitro assays. Unexpected toxicity is one of the largest reasons drug programs fail. Today, drug discovery teams still rely on fragmented assays, animal studies, and expert judgment to decide which molecules are safe enough to advance. We believe this can be dramatically improved.

At Axiom, we generate and curate massive multimodal datasets spanning chemical structures, primary human cell imaging, multicellular tissue systems, transcriptomics, proteomics, mass spectrometry, ADME, dose-response curves, clinical outcomes, and human exposure. To date, we have built the largest experimental-to-clinical dataset in the world and we are just getting started. We use these datasets to train models and agents that connect chemistry, biology, mechanism, and clinical risk.

We are looking for an infrastructure / platform engineer to build the systems that make this work at scale. You will own the backend, distributed systems, model-serving infrastructure, deployment pipelines, customer data systems, and enterprise platform architecture behind Axiom’s AI products. This is a role for a deeply technical generalist who wants to help Axiom evolve into a world-class engineering company.

Requirements

Strong generalist software engineer with excellent taste in systems, infrastructure, and product.
Built production systems used by large enterprise customers.
Designed backend or distributed systems that process large amounts of data reliably.
Built SaaS products that store, process, and serve sensitive customer data.
Worked on ML infrastructure across data access, training, evaluation, deployment, inference, monitoring, or observability.
Understand the messy parts of getting ML into production: versioning, reproducibility, evaluation, rollout safety, monitoring, debugging, latency, cost, and reliability.
Enjoy working with enterprise customers and simplifying complex technical systems around their needs.
Built infrastructure for LLM-powered products, research workflows, retrieval systems, agents, or large-scale data processing.
Want to build distributed infrastructure for long-running, compute-intensive parallel reasoning workflows.
Comfortable moving across cloud infrastructure, backend systems, distributed compute, ML infrastructure, security, DevOps, and product engineering.
Want to work directly with researchers and scientists, helping them turn frontier research into usable products.
Care deeply about reliability because customers will use these systems to make consequential drug discovery decisions.
Want ownership over hard, ambiguous systems at an early-stage company.
Python, TypeScript, Go, Rust, or similar systems/backend languages.
Cloud infrastructure on AWS, GCP, or Azure.
Kubernetes, Docker, Terraform, Pulumi, CI/CD, and production DevOps.
Distributed systems, job queues, orchestration, scheduling, and large-scale compute.
Ray, Modal, Slurm, Anyscale, Spark, Dask, Daft, Airflow, Dagster, Prefect, Argo, or similar tools.
Backend APIs, data services, databases, object storage, caching, and search/retrieval systems.
Postgres, DuckDB, Snowflake, BigQuery, ClickHouse, Elasticsearch, OpenSearch, or vector databases.
ML infrastructure for model serving, inference, training pipelines, evaluation, monitoring, and deployment.
LLM systems, agents, retrieval-augmented generation, observability, and evaluation harnesses.
Enterprise software, SaaS platforms, security, access control, audit logs, and customer data isolation.
Large-scale scientific, healthcare, biotech, chemistry, biology, or clinical data systems.

Nice To Haves

Move with urgency.
Have exceptional engineering taste.
Take full ownership of the customer experience.
Care deeply about reliability and all the ways systems can fail.
Can build fast without creating chaos.
Are comfortable operating across backend, infrastructure, ML, security, and product.
Enjoy working with scientists and researchers.
Can teach others how to become better engineers.
Are practical, unpretentious, and collaborative.
Want their work to multiply the output of the entire company.
Are not satisfied with incremental improvements.
Want to build a generational company.
Have a relentless observe-orient-decide-act loop: constantly identify bottlenecks, build the right abstractions, and make everyone around them faster.

Responsibilities

Build the infrastructure that powers the first scientific AI systems capable of replacing animal and legacy toxicity experiments.
Create the platform that turns Axiom’s research into reliable, secure, enterprise-ready software used by the world’s leading drug discovery teams.
Own critical systems across Axiom’s backend, ML platform, customer deployment, and enterprise infrastructure.
Lead Axiom’s evolution into a world-class engineering organization focused on enterprise ML and data software.
Design and build the core infrastructure powering Axiom’s ML systems, including model evaluation, model deployment, inference, serving, monitoring, and versioning.
Architect scalable systems for storing, retrieving, processing, and serving chemical, biological, clinical, customer, and model-generated data.
Deploy large-scale reasoning agents from research environments into production systems used by customers.
Build infrastructure for running image models, LLM agents, mechanistic reasoning systems, and multimodal toxicity models at scale.
Create robust systems for customer data management, including secure ingestion, access control, audit trails, versioned deliveries, and customer-specific workspaces.
Build the backend systems behind Axiom’s product, including APIs, data services, inference services, workflow systems, and internal tooling.
Support enterprise customer deployments, including cloud, secure VPC, and potentially on-prem or customer-controlled environments.
Build evaluation and observability systems for ML models and agents, including regression testing, model comparison, trace inspection, rollout monitoring, and failure analysis.
Work with ML researchers to turn prototypes into reliable production systems.
Work with scientists to turn research workflows into durable software.
Work with product and customer teams to ensure enterprise users can trust, understand, and depend on Axiom’s systems.
Teach and empower scientists, ML researchers, and engineers to write better software and build better systems.
Help define Axiom’s engineering culture from the ground up.