World Model Evaluation Lead

Archetype AI
San Mateo, CA
Hybrid

About The Position

Archetype AI is seeking a hands-on Evaluation Lead to build and assess model performance for physical AI. You will design and implement advanced evaluation techniques for assessing the strengths and weaknesses of real-world AI models, and build and scale evaluation frameworks to rapidly test models and report on their performance. You will partner closely with research and engineering teams to develop evaluation methodologies, analytically assess and improve test datasets, uncover model weaknesses and risks, and track competitive industry benchmarks. This is a high-impact role for someone who thrives in a fast-paced AI environment and wants to directly influence our path as we scale our AI technologies and business.

Requirements

  • Extensive expertise in evaluating AI and machine learning models, ideally in physical AI or a related AI field
  • Experience in designing, implementing, and refining evaluation metrics
  • Deep understanding of machine learning, AI, and generative models
  • Excellent Python and software engineering skills
  • Experience designing and building scalable data pipelines and evaluation tools
  • Experience collaborating closely with key stakeholders from research, engineering, and product teams
  • Strong communication and documentation skills, with a bias for creating detailed evaluation reports that help drive model performance
  • Startup-ready mindset with the ability to thrive in high-velocity, high-ambiguity environments

Nice To Haves

  • Experience evaluating real-world, real-time algorithms
  • Experience evaluating a broad range of sensor types, such as cameras, LIDAR, physical sensors, RF sensors, and beyond
  • A strong scientific approach to evaluation and understanding model performance
  • Experience in evaluating production algorithms
  • Experience building and curating data campaigns to create extensive test datasets
  • Experience managing internal teams and/or external vendors

Responsibilities

  • Design and implement rigorous evaluation methodologies and benchmarks for measuring model effectiveness, reliability, alignment, and safety
  • Lead evaluation of model performance, ranging from offline experiments to full production model testing
  • Design and oversee the pipelines, dashboards, and tools that automate model evaluation
  • Design and oversee tools for A/B model testing, regression testing, and production model performance
  • Develop and implement strategies for evaluating physical AI models that can scale across a broad range of real-world use cases, sensor types, and edge cases
  • Plan, run, and oversee evaluations across internal teams and external customers
  • Drive edge case discovery, red-teaming, safety, privacy, and risk evaluation, feeding findings back to key stakeholders on research and engineering teams