Member of Technical Staff, Infra

Pepr AI • San Francisco, CA

About The Position

We are building the AI Operator for growth to replace traditional ad agencies. We apply the quant rigor of high-frequency trading to optimize ad spend, delivering 20-50% upside in spend efficiency. We manage spend for global category leaders like Cider and Cupshe, with a clear path to managing billions of dollars. We are backed by Quiet Capital and are looking for early engineers.

The Role

We are looking for a Backend Engineer to build the bedrock of our autonomous system. You are the architect of reliability. While our algorithms team designs the trading strategies, you ensure the engine runs without interruption. Your role is to operationalize our ML pipelines, manage complex deployments across fragmented client environments (including private VPCs) and build the paved road that allows our team to ship high-frequency systems safely.

We are looking for the adult in the room – a responsible engineer with 6-8+ years of experience who can foresee failure modes before they happen. You will be the guardian of production quality, ensuring that our system can execute millions of dollars in trades with zero downtime.

Requirements

A Senior Operator: You have 6-8+ years of experience in ML Platform, Developer Experience or Infrastructure Software Engineering roles. You have seen how systems break at scale and know how to design them to survive.
Platform Mindset: You view internal teams as your customers. You have experience in high-leverage domains like ML Platform or Developer Experience, focusing on building the paved road that maximizes engineering velocity without sacrificing reliability.
Distributed Systems Native: You are comfortable working with complexity at scale. You understand the challenges of data consistency, concurrency and latency in distributed environments.
Architect of Reliability: You prioritize system health and visibility. You build the observability and automation required to run critical workloads in production, ensuring that complex deployments remain stable even as they scale across diverse environments.

Nice To Haves

MLOps Tooling: Experience with feature stores, model registries and tools like Ray, MLflow or Kubeflow.
Client VPC / On-Prem Experience: Experience deploying software into customer-controlled environments (AWS/GCP/Azure).
Security First: Experience with SOC 2 compliance, IAM policies and securing sensitive financial data.

Responsibilities

Architect Multi-Environment Deployments: Design and manage the infrastructure to deploy our decision engine into diverse environments, including our own cloud and private client VPCs. You will solve the complex challenges of managing software lifecycles across isolated and secure instances.
Operationalize the Quant Engine: Transform experimental ML models into resilient production services. You will build the training, inference and retraining pipelines that ensure our models are always fresh and performant.
Build the Event-Driven Backbone: Architect the scheduling and event-driven triggers that power our high-frequency control loops. You will manage the message queues (e.g., Kafka, SQS) and orchestration layers (e.g., Airflow, Dagster) that coordinate data flow between our context, decision and execution engines.
Guardian of Reliability: Unexpected issues inevitably arise. You will own system health, setting up comprehensive observability (monitoring, logging, tracing) and acting as the primary troubleshooter when things break. You will convert every incident into an automated prevention mechanism.
Elevate Developer Experience: You view internal engineers as your customers. You will build the CI/CD pipelines, local development environments and tooling that allow us to ship code faster and with higher confidence.