Software Engineer III

Pearson • Durham, NC

About The Position

AgentOps is the enterprise engineering foundation for building, operating, and governing AI agents and digital workers as production-grade systems. We are enabling the shift from simple chat-based experiences to agentic systems that can reason, plan, use tools, and execute complex workflows reliably across the enterprise. Our mission is to provide the platform capabilities, reusable skills, and operational controls required to scale intelligent digital workers with strong standards for reliability, security, observability, and compliance.

As a Software Engineer III on the Agent Engineering team, you will design and build core platform capabilities that power intelligent, stateful, and production-ready agents, digital workers, and reusable skills. This is a hands-on senior engineering role focused on orchestration, agent runtime patterns, resilience, memory, retrieval, and observability. You will help define reusable engineering patterns for how digital workers are built, how skills are packaged and reused, and how agentic workflows are operated across the platform. You will work closely with partner teams to translate complex business workflows into robust, governed, and scalable agentic services.

Requirements

8+ years of software engineering experience with strong proficiency in Python and backend/platform engineering.
Hands-on experience building LLM-powered systems, agents, digital workers, or workflow automation platforms in production.
Experience with frameworks such as LangGraph, CrewAI, AutoGen, LangChain, LlamaIndex, or similar.
Strong experience in APIs, distributed systems, cloud-native engineering, and production reliability.
Experience designing and integrating retrieval-augmented generation (RAG) pipelines, tool-calling systems, reusable skills, and structured output patterns.
Experience with at least one major cloud platform such as AWS, Azure, or GCP, along with Docker, Kubernetes, and CI/CD practices.
Ability to design systems with strong trade-off awareness across quality, latency, cost, resilience, and maintainability.

Nice To Haves

Experience with the Model Context Protocol (MCP) or similar tool/context interoperability protocols.
Experience with Redis, DynamoDB, Postgres, or workflow/state stores for orchestration and persistence.
Familiarity with multi-agent systems, digital worker architectures, skill registries, and human-in-the-loop execution models.
Experience with AI observability, evaluation frameworks, and operational telemetry for LLM systems.
Understanding of secure execution patterns, sandboxing, and prompt injection mitigation.
Ability to translate emerging research and ecosystem patterns into pragmatic production solutions.

Responsibilities

Design and implement multi-agent and digital worker orchestration patterns that enable specialized agents to delegate, collaborate, and complete multi-step business goals.
Build stateful and cyclic workflows using frameworks such as LangGraph, CrewAI, AutoGen, or similar, enabling reflection, recovery, and adaptive execution beyond simple linear chains.
Develop reusable orchestration components for routing, retries, fallback logic, structured outputs, and human-in-the-loop interventions.
Define how digital workers compose and invoke reusable skills across common enterprise workflows.
Build and maintain reusable skills that encapsulate business actions, domain logic, tool usage, and workflow steps in a standardized way.
Define contracts and standards for how skills are exposed, discovered, versioned, and consumed by agents and digital workers.
Contribute to standards for MCP, tool calling, and agent interaction contracts across the platform.
Integrate enterprise APIs, services, and data systems into reusable skills with strong attention to safety, governance, and maintainability.
Design systems for long-running, resumable workflows for agents and digital workers, including checkpointing, persistence, context restoration, and lifecycle management.
Implement resilience patterns for non-deterministic AI systems, including timeout handling, intelligent retries, degraded execution modes, and escalation paths.
Improve runtime reliability, scalability, and cost efficiency of agent and digital worker workloads in production.
Partner with infrastructure and platform teams to harden execution across cloud-native environments.
Build and optimize retrieval-augmented generation pipelines using vector databases, hybrid retrieval, re-ranking, and grounding strategies.
Design memory patterns that improve continuity and contextual relevance across agent and digital worker sessions, including short-term, episodic, and semantic memory approaches.
Integrate enterprise knowledge sources and structured systems securely into workflows and skills.
Evaluate and improve answer quality, retrieval performance, and contextual fidelity.
Build automated evaluation frameworks to measure workflow quality, skill execution quality, tool-use accuracy, groundedness, safety, and task success.
Instrument deep tracing and operational observability using tools such as Langfuse, LangSmith, Arize Phoenix, OpenTelemetry, or similar.
Define and monitor engineering KPIs such as latency, cost per run, fallback rates, workflow completion success, skill reliability, and production health.
Contribute to guardrails for safe execution, prompt injection resistance, and policy-compliant agent behavior.
Drive reusable engineering standards, shared libraries, and reference patterns for agent development, digital workers, and skills across the platform.
Mentor other engineers through design reviews, code reviews, and implementation guidance.
Partner with product, architecture, and domain teams to shape scalable solutions for enterprise use cases.
Stay current on the evolving agentic AI ecosystem and evaluate new frameworks, techniques, and runtime patterns pragmatically for enterprise adoption.