Senior Data Infrastructure Engineer

Judgment Labs · San Francisco, CA · Onsite

About The Position

Judgment Labs builds infrastructure for Agent Behavior Monitoring (ABM). While traditional observability focuses on logging exceptions and latency, ABM surfaces behavioral anomalies such as instruction drift and context-retrieval loss in scaled production environments. Hundreds of teams building autonomous agents rely on Judgment to understand how their systems behave post-deployment. Instead of reactive incident triage, they cluster patterns across conversations and workflows, correlate regressions to specific interaction types, and pinpoint where reliability breaks down in their usage context. We've raised $30M+ across two rounds in the past five months. Our investors include Lightspeed, SV Angel, Valor Equity Partners, Nova Global, Chris Manning, Michael Ovitz, Michael Abbott, Cory Levy, Kevin Hartz, and others.

The Role: We are looking for a Senior Data Infrastructure Engineer to build and scale the real-time data pipelines that power agent behavior analysis at production scale. This role is central to processing hundreds of thousands of traces per second, running LLM-based scoring and clustering in near-real time, and delivering the low-latency query performance that lets teams understand agent behavior as it happens. We need someone who has built petabyte-scale data systems, knows how to squeeze performance out of OLAP databases, and can own the data infrastructure from ingestion through analytics.

Requirements

  • Experience building and tuning high-throughput, petabyte-scale data pipelines
  • Deep knowledge of data infrastructure (Apache Spark, Ray, dbt, Airflow/Dagster)
  • Experience with OLAP database engineering
  • Comfortable with cloud infrastructure and batch + streaming pipelines
  • Senior-level ownership: you will own the infrastructure roadmap and architecture design, set engineering practices, identify bottlenecks, and ship fixes.

Nice To Haves

  • Experience working with LLM inference and serving optimization techniques such as:
      • Speculative decoding
      • Continuous batching and dynamic batching strategies
      • KV cache optimization and management
      • Quantization techniques (INT8, INT4) for reduced memory footprint
      • Multi-GPU serving and tensor parallelism

Responsibilities

  • Design the streaming pipeline that scores and clusters 100k+ traces per second using LLM APIs in near-real time (Kafka + Spark/Ray); a pipeline sketch follows this list.
  • Identify LLM API serving bottlenecks through flamegraph profiling and raise RPS with smart batching/streaming, adaptive concurrency, and connection pooling (see the batching sketch below).
  • Speed up ClickHouse queries and reduce p95/p99 latencies with better schemas and partitioning, projections/materialized views, and tiered storage (see the schema sketch below).
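
To make the first responsibility concrete, here is a minimal sketch of a Spark Structured Streaming job reading agent traces from Kafka and scoring them in micro-batches. The broker address, topic name, schema, and score_batch() stub are illustrative assumptions, not details from this posting.

# Sketch: Kafka -> Spark Structured Streaming -> LLM scoring (assumed names throughout).
# Requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("trace-scoring").getOrCreate()

trace_schema = StructType([
    StructField("trace_id", StringType()),
    StructField("agent_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("payload", StringType()),
])

# Read raw traces from Kafka and parse the JSON value column.
traces = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumption: broker address
    .option("subscribe", "agent-traces")                # assumption: topic name
    .option("maxOffsetsPerTrigger", 50000)              # cap micro-batch size
    .load()
    .select(from_json(col("value").cast("string"), trace_schema).alias("t"))
    .select("t.*")
)

def score_batch(batch_df, batch_id):
    """Score one micro-batch; the LLM call itself is stubbed out in this sketch."""
    rows = batch_df.collect()  # fine for a sketch; a real job would use mapPartitions
    # ... call the LLM scoring API on `rows`, then write scores to the OLAP store ...

query = (
    traces.writeStream
    .foreachBatch(score_batch)
    .option("checkpointLocation", "/tmp/checkpoints/trace-scoring")
    .start()
)
query.awaitTermination()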
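
For the batching responsibility, here is a rough sketch of raising LLM-API throughput with request batching, a bounded concurrency window, and a pooled connection, using asyncio and aiohttp. The endpoint URL, batch size, and concurrency cap are assumptions to be tuned against profiling, not values from the posting.

# Sketch: micro-batched LLM scoring calls with adaptive-style concurrency limits.
import asyncio
import aiohttp

SCORING_URL = "https://llm-scorer.internal/v1/score"  # hypothetical endpoint
MAX_IN_FLIGHT = 32                                    # concurrency cap to tune
BATCH_SIZE = 16                                       # traces per request

async def score_batch(session, semaphore, batch):
    # The semaphore bounds in-flight requests; the shared session pools connections.
    async with semaphore:
        async with session.post(SCORING_URL, json={"traces": batch}) as resp:
            resp.raise_for_status()
            return await resp.json()

async def score_all(traces):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    connector = aiohttp.TCPConnector(limit=MAX_IN_FLIGHT)  # connection pooling
    async with aiohttp.ClientSession(connector=connector) as session:
        batches = [traces[i:i + BATCH_SIZE] for i in range(0, len(traces), BATCH_SIZE)]
        return await asyncio.gather(
            *(score_batch(session, semaphore, batch) for batch in batches)
        )

# Usage: scores = asyncio.run(score_all(list_of_trace_dicts))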
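
And for the ClickHouse responsibility, a hedged sketch of the kind of schema work described above: daily partitions, an ORDER BY matched to per-agent time-range scans, and an hourly materialized view so dashboards avoid scanning raw traces. Table, column, and view names are invented for illustration; the client here is clickhouse_connect.

# Sketch: ClickHouse schema + materialized view for trace scores (assumed names).
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # assumption: local ClickHouse

client.command("""
CREATE TABLE IF NOT EXISTS traces (
    trace_id      String,
    agent_id      LowCardinality(String),
    ts            DateTime,
    anomaly_score Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)   -- daily partitions keep merges and TTL moves cheap
ORDER BY (agent_id, ts)       -- matches the dominant per-agent time-range queries
""")

client.command("""
CREATE MATERIALIZED VIEW IF NOT EXISTS trace_scores_hourly
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (agent_id, hour)
AS SELECT
    agent_id,
    toStartOfHour(ts)       AS hour,
    avgState(anomaly_score) AS avg_score
FROM traces
GROUP BY agent_id, hour
""")

# Dashboards then read the pre-aggregated view instead of scanning raw traces.
result = client.query(
    "SELECT agent_id, hour, avgMerge(avg_score) "
    "FROM trace_scores_hourly GROUP BY agent_id, hour"
)
print(result.result_rows)

Reading avgMerge() from the hourly view instead of aggregating the raw table is the sort of materialized-view rewrite that pulls down p95/p99 on repeated dashboard queries.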

Benefits

  • Full benefits
  • Equinox
  • Private chef