AI/ML Infrastructure Software Development Engineer

Booz Allen Hamilton • Washington, DC

18d • $86,800 - $198,000 • Remote

About The Position

The Opportunity: To achieve an organization’s mission, leaders need strong team members who can create and analyze processes, communicate requirements, and develop innovative solutions throughout the execution of the mission. Whether reviewing program-wide technical architecture or providing AI/ML infrastructure expertise, our clients need someone who combines deep technical understanding of software engineering with strong architectural judgment. That is why we need you, an experienced AI/ML Software Development Engineer who can operate at a system-of-systems level to support clients in advancing AI-enabled systems within an R&D environment.

As part of our team, you'll serve as an AI/ML Infrastructure Software Development Engineer to the Advanced Research Projects Agency for Health (ARPA-H), helping conceptualize, create, and execute advanced government-funded research and development programs to accelerate better health outcomes for everyone. You'll work with world-class scientists and engineers to support the development of high-impact solutions to society's most challenging health problems, and you'll leverage technical expertise to provide strategic assessments of new technologies in support of senior ARPA-H decision makers. You'll maintain responsibility for producing and presenting findings and recommendations to a team of colleagues and clients on the feasibility and potential impact of future research programs, assisting with the management of current programs, and facilitating commercialization of successfully developed technologies.

In this role, you'll advise program leadership and support software engineering to advance the client mission. You will support clients in ensuring that program-wide technical architecture and engineering lead to rigorous AI development, evaluation, and long-term impact. Your attention to detail, flexibility, communication skills, understanding of the client's mission, and problem-solving will enable the mission's success.

Requirements

7+ years of experience with software engineering, including building and operating production systems
Experience being on-call, debugging incidents, and writing postmortems
Experience in high-velocity environments where you owned and shipped complex products end-to-end
Experience with at least 2 backend languages, including Python
Experience with Microsoft Azure, including Azure Functions, API Management, Container Apps, and Azure OpenAI Service
Experience with containerization, CI/CD, and Infrastructure as Code (IaC)
Knowledge of modern backend frameworks, async patterns, distributed systems, APIs, data pipelines, and software design patterns
Knowledge of authentication and identity systems, such as OAuth2, OIDC, or Microsoft Entra ID (formerly Azure Active Directory)
Ability to own production systems
Bachelor's degree in Computer Science or Software Engineering

Nice To Haves

Experience in healthcare, life sciences, or other regulated domains
Experience in security-conscious engineering, including input validation, output sanitization, audit logging, and responsible AI guardrails
Experience in startup or early-stage environments, such as 0-to-1 product building
Experience implementing agent-to-agent (A2A) communication patterns and multi-agent orchestration frameworks
Experience building on top of LLMs in production, including tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management
Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning
Experience in security-conscious engineering in regulated or government environments
Ability to work as a self-starter in a fast-paced environment
Ability to work comfortably with ambiguity and with a high sense of urgency
Master’s degree in a relevant field

Responsibilities

Own and operate all backend and infrastructure components for an AI/ML model on Azure, including compute, APIs, identity, data layers, and IaC-driven environments
Build and maintain resilient CI/CD, deployment automation, secrets management, and production‑grade fundamentals, including monitoring, alerting, logging, tracing, SLOs, and incident response
Manage cost and token economics across all LLM providers, analyzing budgets, guardrails, and optimizations for cost‑per‑query
Lead agentic and protocol infrastructure, including Model Context Protocol (MCP) backend implementation, tool‑calling systems, and reliable A2A communication patterns
Design and evolve LLM orchestration, multi‑model routing, and robust fallback and degradation patterns across GPT, Claude, and Gemini
Build and operate RAG and knowledge pipelines, including ingestion, indexing, embedding, semantic search, and evaluation and safety monitoring
Drive engineering excellence through coding standards, reviews, documentation, mentoring, and consistently championing user‑focused, secure, compliant system design