About The Position

Moveworks is building the runtime infrastructure that powers its AI agents. These systems orchestrate, execute, and deliver agent responses to millions of enterprise users in real time. This role focuses on distributed systems engineering within the agentic AI domain, not machine learning. The AI agents plan and execute multi-step workflows, invoke tools, await human input, and resume operations, all while maintaining correctness, observability, and low latency. The engineer in this role will build and own the systems that provide these capabilities.

Requirements

  • Deep experience in at least 3 of the following areas:
      • Distributed systems: consistency models, idempotency, exactly-once delivery, distributed locking/leasing.
      • Concurrent/async programming: Python asyncio, Go goroutines, structured concurrency, cancellation handling.
      • Event-driven architectures: message queues such as SQS and Kafka, pub/sub, backpressure, delivery guarantees.
      • Database systems for infrastructure: DynamoDB with conditional writes and transactions, Redis with connection pooling and pub/sub.
      • Observability: OpenTelemetry, distributed tracing, span context propagation, Prometheus metrics.
      • gRPC/protobuf: streaming RPCs, service interface design, error handling patterns.
  • 5+ years of experience building production backend/infrastructure systems.
  • Strong proficiency in Python or Go (ideally both).
  • Experience designing and operating systems that handle real traffic at scale.
  • Comfort with ambiguity, as these are novel problems without textbook solutions.

Responsibilities

  • Build and own the agent orchestration engine, a state machine managing long-running agent sessions, coordinating planning, execution, and user interaction across multiple LLM calls and tool invocations.
  • Develop distributed session management features, including lease-based ownership using DynamoDB conditional writes, heartbeat protocols, and crash recovery via checkpointing.
  • Implement an event-driven message pipeline using SQS FIFO queues for ordered delivery, Kafka consumers for event processing, and real-time streaming via gRPC and Socket.IO.
  • Utilize structured concurrency with Python asyncio TaskGroups for managing multiple concurrent tasks per session (message polling, lease heartbeats, output publishing, orchestrator execution) with fail-fast semantics and graceful cancellation.
  • Develop and enhance observability infrastructure, including OpenTelemetry instrumentation, distributed trace context propagation across async boundaries, and custom span lifecycle management for long-running sessions.
  • Build and maintain caching and state layers using Redis and DynamoDB KV stores with per-org/per-bot scoping, batch read optimization, and hot-reload configuration.
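
As a rough illustration of the lease-based session ownership and heartbeat protocol mentioned above, here is a minimal sketch. It is not Moveworks' implementation. The `LeaseTable` class, its method names, and the in-memory dictionary are all hypothetical stand-ins for a store that supports conditional writes (such as DynamoDB). The sketch only shows the ownership rule: a write succeeds only when the lease is free, expired, or already held by the caller.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """Hypothetical in-memory stand-in for a KV store with conditional writes."""

    def __init__(self) -> None:
        self._rows: Dict[str, Lease] = {}

    def acquire(self, session_id: str, worker: str, ttl: float, now: float) -> bool:
        """Take the lease only if it is free, expired, or already ours."""
        current = self._rows.get(session_id)
        if current is not None and current.owner != worker and current.expires_at > now:
            return False  # condition failed: another worker holds a live lease
        self._rows[session_id] = Lease(owner=worker, expires_at=now + ttl)
        return True

    def heartbeat(self, session_id: str, worker: str, ttl: float, now: float) -> bool:
        """Extend the lease only if we still own it and it has not expired."""
        current = self._rows.get(session_id)
        if current is None or current.owner != worker or current.expires_at <= now:
            return False  # ownership lost: the caller should stop working on the session
        current.expires_at = now + ttl
        return True


table = LeaseTable()
table.acquire("session-1", "worker-a", ttl=10.0, now=0.0)    # worker-a owns the session
table.acquire("session-1", "worker-b", ttl=10.0, now=5.0)    # rejected, lease still live
table.heartbeat("session-1", "worker-a", ttl=10.0, now=5.0)  # lease now expires at t=15
table.acquire("session-1", "worker-b", ttl=10.0, now=15.0)   # worker-a lapsed, worker-b takes over
```

In a real deployment the check and the write would have to be a single atomic conditional operation in the store. The sketch passes `now` explicitly only to keep the example deterministic.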