Evaluations - Platform Engineer

Antimetal•New York, NY

Onsite

About The Position

We’re looking for a Platform Engineer, Applied Evaluations to define and operationalize quality for the agentic systems that power Antimetal’s investigation and automation engine. This role is core to our product. You’ll own online and offline evaluation pipelines that operate over petabytes of infrastructure data, and shape agent platform abstractions where necessary to ensure our agents are measurable, debuggable, and reliable. You’ll partner closely with platform, product, and research, leveraging quality signals to accelerate iteration across the company.

About Antimetal

Antimetal is building the future of infrastructure management. We're starting by creating a platform that investigates, resolves, and prevents issues, giving engineers their time back to focus on what they do best: building great products.

Requirements

At least 3 years of experience in ML platform engineering, data engineering, or a related role, preferably at a high-growth company.
Prior experience designing evaluation systems where ground truth is noisy, high-volume, and hard to label (e.g. computer vision, deep research pipelines).
Strong system design skills: you think about how data flows through distributed systems and how decisions compound at scale.
Proven ability to write clean, scalable code, backed by strong data modeling skills.
Demonstrated ability to bring ambiguous goals from prototype to production, using data and experimentation to drive product and architectural decisions.
Proficient in Python and TypeScript, with experience using common ML libraries and data engineering tools.

Nice To Haves

Experience with SRE best practices and modern observability (OpenTelemetry, distributed tracing)
Strong on ML fundamentals: classification/regression, clustering, dimensionality reduction, evaluation + error analysis, probabilistic ML
Experience with agent architectures: multi-step reasoning, tool use, context management

Responsibilities

Own the evaluation stack: Build online and offline eval pipelines that measure agent quality across ephemeral, voluminous MELT data, code, and unstructured docs. Set the metrics that define the experience.
Define quality at scale: Production incidents span hundreds of services, and the underlying data is ephemeral and high-volume, with only approximate ground truth. Design evals that capture trajectory quality, not just final outputs, and validate that your metrics predict real outcomes.
Build platform abstractions for agents: Design core agent architectures and extend internal frameworks (e.g. sub-agents, MCPs, middleware) that let product, platform, and research iterate with confidence and ship faster.
Productionize: Own latency, observability, and uptime.

Benefits

Pay & ownership — Competitive salary with generous equity grants.
Full coverage + retirement — Fully covered health, dental, and vision, plus retirement benefits.
Unlimited PTO — Take the time you need to recharge.
Dinner on late nights — Working late? Dinner is on us.
Fitness stipend — Monthly support for your health and wellness.
Tools of the trade — Any equipment you need to do your best work.
Commute perks — Citi Bike + train benefits.
