AI Engineer

BLEN
Washington, DC
$130,000 - $150,000
Remote

About The Position

We're hiring an AI Engineer to help our federal and commercial clients ship production-grade applications powered by large language models — with a strong focus on agentic systems and MCP-based integrations. You'll spend your time building real things: agents that take actions on behalf of users, RAG pipelines that ground answers in trusted sources, and MCP servers that securely connect models to the data and tools our clients already rely on. You'll wire up model APIs, design tool interfaces, build evals, and make sure what we ship is fast, reliable, observable, and safe.

This isn't a research role, and you won't be training foundation models. You will be designing and shipping agentic AI systems that real users — including senior government stakeholders — depend on, and you'll have a strong voice in how we adopt generative AI responsibly across our portfolio.

If you get excited about agent design, tool use, MCP, evals, and the weekly firehose of new models and frameworks — and you want that energy pointed at meaningful public-sector work — this role is for you.

Requirements

  • 5+ years of professional software engineering experience, with at least 1 year shipping LLM-based or AI-powered features to production
  • Hands-on experience designing or building agentic systems — tool calling, multi-step reasoning, planning loops, or agent orchestration (LangGraph, CrewAI, OpenAI Agents SDK, Claude tool use, or equivalent)
  • Working knowledge of the Model Context Protocol (MCP) — or demonstrated ability to pick it up quickly, plus familiarity with the broader landscape of agent/tool standards
  • Strong Python and experience building and deploying backend services and APIs (FastAPI, Flask, or similar)
  • Hands-on experience with at least one major LLM provider (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, or open-weight models via vLLM/Ollama)
  • Working knowledge of RAG: embeddings, vector databases (pgvector, Pinecone, Weaviate, Qdrant, or similar), and retrieval evaluation
  • Comfort with prompt engineering, structured outputs (JSON mode, schemas), and tool/function calling
  • Experience writing evals — even lightweight ones — for non-deterministic systems
  • Solid SQL skills and experience working with both relational and unstructured data
  • Familiarity with at least one cloud platform (AWS, Azure, or GCP)
  • Git, code review, and modern collaborative workflows
  • Strong written and verbal communication — you can explain AI tradeoffs to non-technical stakeholders
  • Must be a US Citizen or legal resident, able to work domestically
  • Must be able to attain a low-level security clearance
  • Must work from the United States
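To make the structured-output and tool-calling expectations concrete, here is a minimal, illustrative sketch — not BLEN's actual stack — of validating a model's JSON tool call against a declared parameter spec before routing it to a tool. All names here (`get_weather`, `dispatch_tool_call`, the registry shape) are hypothetical, loosely mirroring common provider function-calling formats:

```python
import json

# Hypothetical tool registry: each tool declares its required parameters
# and their types, in the spirit of JSON-schema-based function calling.
TOOLS = {
    "get_weather": {
        "params": {"city": str},  # required argument name -> expected type
        "fn": lambda city: f"72F and sunny in {city}",  # stand-in implementation
    }
}

def dispatch_tool_call(raw: str) -> str:
    """Validate a model-emitted JSON tool call, then route it to the tool."""
    call = json.loads(raw)  # raises on malformed JSON from the model
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    for name, typ in tool["params"].items():
        if not isinstance(args.get(name), typ):
            raise ValueError(f"bad or missing argument: {name}")
    return tool["fn"](**args)

# A well-formed call, as a model might emit it in JSON mode:
print(dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "DC"}}'))
```

Guarding the boundary this way — schema validation before execution — is what keeps agent behavior predictable when the model's output is not.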

Nice To Haves

  • Experience authoring MCP servers for non-trivial systems (databases, internal APIs, document stores)
  • Experience with eval and observability platforms (Braintrust, LangSmith, Langfuse, Arize, or custom harnesses)
  • Multi-agent orchestration patterns and experience reasoning about agent failure modes
  • Fine-tuning, distillation, or LoRA experience where it actually moved the needle
  • Docker, Kubernetes, and CI/CD for AI workloads
  • TypeScript/Node for full-stack AI features
  • Streaming UIs (SSE, WebSockets) and token-level UX patterns
  • Experience with caching, prompt compression, and cost/latency optimization at scale
  • Background supporting federal or government clients
  • Awareness of NIST AI RMF, FedRAMP, or related responsible-AI frameworks

Responsibilities

  • Design and build agentic systems — multi-step agents that plan, call tools, retrieve context, and take action with appropriate human-in-the-loop checkpoints
  • Build MCP servers and clients to securely expose client data, internal tools, and APIs to LLMs in a standardized, auditable way
  • Ship LLM-powered applications: copilots, document intelligence, search, summarization, and workflow automation
  • Design and maintain RAG pipelines — chunking, embeddings, vector stores, retrieval, reranking, and grounding
  • Integrate model APIs (OpenAI, Anthropic, Bedrock, Azure OpenAI, open-weight models) and pick the right model for the job based on quality, latency, and cost
  • Develop evals and observability for agents and AI features so we know what's working in production and what's regressing
  • Apply prompt engineering, structured outputs, function/tool calling, and guardrails to make agent behavior predictable
  • Write production Python backends and APIs that expose AI capabilities to web and mobile clients
  • Collaborate with engineers, designers, and product folks to scope what AI should (and shouldn't) do in a given product
  • Help shape responsible AI practices for federal use — privacy, security, auditability, and human oversight
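The RAG responsibilities above reduce to one core loop: embed a query, rank stored chunks by similarity, and ground the answer in the top hits. Here is a deliberately toy sketch of that loop — bag-of-words counts stand in for a real embedding model, and a Python list stands in for a vector store like pgvector or Qdrant; the example chunks are invented:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words term counts (a real pipeline would
    call an embedding model here)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The travel policy caps per diem at the GSA rate.",
    "Deployment runs through the CI pipeline every Friday.",
]
print(retrieve("what is the per diem cap", chunks))
```

Everything else in the pipeline — chunking strategy, reranking, grounding, retrieval evals — is about making each stage of this loop measurably better.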

Benefits

  • Competitive pay
  • Contribution toward health benefits
  • High-visibility federal projects with real impact
  • Small team where your ideas actually ship
  • Generous exposure to the latest AI tooling and models