About The Position

The Productivity and Machine Learning Evaluation team ensures the quality of AI-powered features across a suite of productivity and creative applications, including Creator Studio, used by hundreds of millions of people. The team serves as the primary evaluation function, providing critical quality signals that directly influence model development decisions and product launches. This role focuses on building and scaling automated evaluation systems and designing adversarial and stress-testing methodologies across multiple AI features. The work requires a deep understanding of how AI systems fail and how to measure quality rigorously. This is an opportunity to shape the evaluation infrastructure that determines whether AI features meet the bar for hundreds of millions of users. Day-to-day duties and typical deliverables are listed under Responsibilities below.

Requirements

  • Bachelor's degree in Computer Science, Machine Learning, Statistics, or a related field
  • 4+ years of experience building or significantly extending ML evaluation systems, including designing evaluation benchmarks or quality assessment frameworks
  • Experience independently defining evaluation architecture and methodology for AI or ML systems
  • Experience designing adversarial or red-teaming test methodologies for ML models or AI-powered features
  • Experience with Python and ML frameworks (PyTorch, TensorFlow, or equivalent) in production or near-production settings
  • Track record of owning technical direction for evaluation efforts across multiple features or product areas

Nice To Haves

  • Experience evaluating user-facing AI features in consumer applications, with an understanding of how technical metrics connect to user-perceived quality
  • Familiarity with productivity software or creative tools, with the ability to assess output quality from a user workflow perspective
  • Experience ensuring alignment between automated and human evaluation methods, including inter-annotator agreement analysis and bias detection
  • Track record of designing evaluation systems that scale across multiple features or product areas without requiring bespoke solutions for each
  • Experience evaluating different types of AI systems, including API-based and custom-trained models
  • Demonstrated ability to communicate evaluation findings and readiness assessments to cross-functional partners
  • Experience leveraging automation to scale evaluation data generation and analysis
  • Graduate degree in a relevant field

Responsibilities

  • Design, build, and maintain automated evaluation systems that assess AI feature quality at scale
  • Create adversarial test suites that probe model weaknesses
  • Run stress tests to ensure features perform under demanding conditions
  • Collaborate closely with cross-functional partners to ensure evaluation methods are well-calibrated and integrated into development workflows
  • Deliver evaluation frameworks and rubrics, quality assessment reports, adversarial test case libraries, and recommendations on model readiness

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Number of Employees: 5,001-10,000
