About The Position

This role involves designing and building coding benchmarks and evaluation pipelines that test frontier AI models on real software engineering work. The goal is to create benchmarks that evaluate models on tasks requiring reasoning, debugging, and production-quality code. Responsibilities include analyzing model-generated code, constructing evaluation scenarios, and providing technical feedback on model performance. The ultimate aim is to develop benchmarks that effectively differentiate the capabilities of AI models and inform the training and improvement of future model generations.

Requirements

  • 4+ years of professional software engineering experience.
  • Expert-level Python: clean, performant, well-tested code.
  • Hands-on experience working in large, complex codebases.
  • Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines.
  • Strong command of Git and modern development workflows.
  • Track record at a high-growth tech company or top-tier software organization.
  • Strong written English communication.
  • Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice To Haves

  • Senior or Lead-level profile with a history of technical ownership.
  • Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience).
  • Proficiency in additional languages: JavaScript, Go, C++, or others.
  • Experience with CI/CD and writing robust unit tests (pytest, Mocha, JUnit).
  • Background in security engineering or significant open-source contributions.
  • Familiarity with AI/ML evaluation methodologies or model benchmarking.

Responsibilities

  • Design and build coding benchmarks and evaluation pipelines that test frontier AI models on real-world software engineering tasks: reasoning, debugging, and producing production-quality code.
  • Build and maintain scalable data pipelines for evaluation workflows.
  • Analyze model-generated code for correctness, reliability, and edge-case failures.
  • Construct structured evaluation scenarios across large repos and multi-language environments.
  • Provide detailed technical feedback on model performance and failure patterns.
  • Contribute to evaluation frameworks that set the bar for how coding ability is measured.

Benefits

  • Compensation: $80–$100/hr based on location and seniority
  • Contract length: 3 months, with potential for extension
  • Payment: Weekly via PayPal or Stripe