MLOps / Infrastructure Engineer

10a Labs | New York, NY
$130,000 - $230,000 | Remote

About The Position

About 10a Labs: 10a Labs is the safety and threat-intelligence layer trusted by frontier AI labs, AI unicorns, Fortune 10 companies, and leading global technology platforms. Our adversarial red teaming, model evaluations, and intelligence collection enable engineering, safety, and security teams to stay ahead of evolving threats and deploy AI systems safely.

3–8 Years of Industry Experience | Remote | High-Impact

About the Role: We’re looking for an infrastructure-focused engineer who thrives at the intersection of machine learning, systems, and product delivery. This is a hands-on role responsible for deploying, monitoring, and scaling a real-time ML-powered content moderation system used to detect and triage abuse, threats, and edge-case language. You’ll work closely with ML engineers, researchers, and clients to build infrastructure that makes high-performance models accessible and reliable in the wild.

Requirements

  • Has 3–8 years of experience deploying machine learning systems or high-availability backend systems.
  • Has shipped and maintained production infrastructure at scale in support of ML workflows.
  • Has experience with GCP, AWS, or similar platforms (including managed ML services).
  • Is proficient in Terraform, Docker, Kubernetes, or similar infra tools.
  • Understands performance tradeoffs in serving models and embedding search pipelines.
  • Can work cross-functionally with ML, security, and product teams to deploy safely and iterate fast.
  • Brings a builder's mindset and bias for ownership in ambiguous environments.

Nice To Haves

  • Experience with vector databases or ANN systems, preferably within GCP (or AWS).
  • Experience serving LLMs or embedding-based models via API.
  • Experience with model monitoring, logging, and metrics platforms (e.g., Prometheus, Grafana, Sentry).
  • Familiarity with trust & safety infrastructure, abuse detection, or policy enforcement systems.

Responsibilities

  • Design and maintain cloud infrastructure (GCP or AWS) to support real-time model serving, data ingestion, and evaluation workflows.
  • Deploy and optimize APIs for low-latency access to ML models and embedding search systems.
  • Manage and optimize the end-to-end training data flow—from sourcing and cleaning datasets to preparing them for model consumption—ensuring accuracy, scalability, and efficiency.
  • Build observability tooling for production ML pipelines (monitor latency, error rates, request volumes, drift).
  • Automate model deployment, retraining, and evaluation pipelines (CI/CD for ML).
  • Work with ML engineers to package models for serving.
  • Help manage vector databases and semantic search infrastructure (e.g., Pinecone, FAISS, Vertex Matching Engine).
  • Ensure security, compliance, and uptime of infrastructure supporting safety-critical systems.

Benefits

  • Performance-based annual bonus.
  • Support for continuing education, conferences, or training.
  • Fully remote, U.S.-based.
  • Comprehensive health, dental, and vision coverage.
  • Generous PTO and paid holiday schedule.
  • 401(k) plan.