Lead AI/ML Software Engineer

Raft•Boston, CO

Remote

About The Position

Raft is seeking an experienced Lead AI/ML Software Engineer to join its team and shape the future of its AI Mission System, [R]AIMS. This role is for a technical builder-leader with deep experience designing and scaling complex production systems. The ideal candidate can make critical architecture decisions, simplify complexity, lead engineering efforts, and elevate the technical standards of the platform.

As a senior technical leader, you will be responsible for the architecture, execution, and engineering rigor of [R]AIMS. You will be hands-on in the codebase, setting technical direction and improving engineering quality, and you will partner with platform leadership, product, and delivery teams to drive architectural decisions, lead technical epics, and establish engineering patterns.

The position operates at the intersection of distributed systems, AI/ML platform engineering, Kubernetes-native infrastructure, and data-intensive application development, balancing rapid mission delivery with long-term platform integrity. It requires comfort with production systems and complex integrations, and with debugging distributed systems, leading design reviews, and making pragmatic decisions under ambiguity.

Requirements

6+ years of hands-on experience building and shipping production software systems across the full stack (frontend, backend, infrastructure, and ML)
Deep software engineering fundamentals with demonstrated ability to design, build, and evolve complex systems that perform reliably at scale
Exceptional technical communication skills; able to lead through influence across engineering, product, and leadership stakeholders without requiring direct authority
Proven experience designing and evolving distributed systems, including service decomposition, inter-service communication patterns, fault tolerance, and observability
Strong hands-on experience with Kubernetes and cloud-native platform architecture in production environments
Experience building data-intensive or AI-enabled production systems with real operational users and real performance constraints
Demonstrated technical leadership over large, cross-functional engineering initiatives with clear ownership and accountability for outcomes
Strong system design and architecture decision-making ability, with a track record of making the right call under incomplete information
Some experience with, or exposure to, training, fine-tuning, or deploying machine learning models in production contexts
Ability to obtain Security+ certification within the first 90 days of employment
U.S. citizenship required; ability to obtain and maintain a Top Secret/SCI clearance

Nice To Haves

Experience building AI/ML infrastructure or agentic systems, including orchestration frameworks, tool-use patterns, and LLM integration in production
Experience with streaming and event-driven architectures, particularly Kafka, Kafka Streams, or Apache Flink
Experience with platform engineering and internal developer tooling, including golden-path frameworks, shared libraries, and developer experience improvements
Experience with real-time inference or operational AI systems in latency-sensitive environments
Experience building secure, compliant systems for regulated or mission-critical environments, including familiarity with IL4/IL5/IL6 requirements or RMF processes
Prior work in defense, national security, or classified program environments
Active clearance preferred

Responsibilities

Drive architectural decisions across the [R]AIMS platform, evaluating tradeoffs across performance, scalability, security, and maintainability, and building alignment across engineering and product stakeholders.
Lead major technical epics from conception through delivery, decomposing ambiguous problems into executable plans and keeping cross-functional teams moving with clarity and momentum.
Simplify and rationalize distributed system architecture as the platform scales, reducing incidental complexity and improving operational reliability without sacrificing capability.
Optimize platform performance across both edge and cloud deployment targets, identifying and resolving bottlenecks in data-intensive, latency-sensitive operational environments.
Establish strong engineering foundations and reusable technical patterns that improve developer productivity and code quality across the team.
Mentor engineers at multiple levels, conducting design reviews, providing substantive code feedback, and actively elevating technical execution across the platform.
Partner with AI/ML engineers on model integration, inference optimization, and the operational deployment of agentic workflows within [R]AIMS.
Engage directly with customers and program stakeholders in operationally demanding environments across the Department of Defense, representing Raft’s technical capabilities with credibility and clarity.