About The Position

We're building the runtime infrastructure that powers Moveworks' AI agents: the systems that orchestrate, execute, and deliver agent responses to millions of enterprise users in real time. This is a distributed systems engineering role at the heart of the agentic AI wave. Our AI agents can plan, execute multi-step workflows, call tools, wait on human input, and resume, all while maintaining correctness, observability, and low latency. The systems that make this possible are what you'll build and own.

What you get to do in this role:

  • Agent orchestration engine: A state machine that manages long-running agent sessions, coordinating planning, execution, and user interaction across multiple LLM calls and tool invocations.
  • Distributed session management: Lease-based ownership using DynamoDB conditional writes, heartbeat protocols, and crash recovery via checkpointing.
  • Event-driven message pipeline: SQS FIFO queues for ordered delivery, Kafka consumers for event processing, and real-time streaming via gRPC and Socket.IO.
  • Structured concurrency: Python asyncio TaskGroups running multiple concurrent tasks per session (message polling, lease heartbeats, output publishing, orchestrator execution) with fail-fast semantics and graceful cancellation.
  • Observability infrastructure: OpenTelemetry instrumentation, distributed trace context propagation across async boundaries, and custom span lifecycle management for sessions that span minutes.
  • Caching and state layers: Redis and DynamoDB KV stores with per-org/per-bot scoping, batch read optimization, and hot-reload configuration.

Requirements

  • Deep experience in at least 3 of the following areas:
      • Distributed systems: consistency models, idempotency, exactly-once delivery, distributed locking/leasing.
      • Concurrent/async programming: Python asyncio, Go goroutines, structured concurrency, cancellation handling.
      • Event-driven architectures: message queues (SQS, Kafka), pub/sub, backpressure, delivery guarantees.
      • Database systems for infrastructure: DynamoDB (conditional writes, transactions), Redis (connection pooling, pub/sub).
      • Observability: OpenTelemetry, distributed tracing, span context propagation, Prometheus metrics.
      • gRPC/protobuf: streaming RPCs, service interface design, error handling patterns.
  • 7+ years building production backend/infrastructure systems.
  • Strong in Python or Go (ideally both).
  • Experience designing and operating systems that handle real traffic at scale.
  • Comfort with ambiguity — these are novel problems without textbook solutions.

Responsibilities

  • Build and own the runtime infrastructure that powers Moveworks' AI agents.
  • Develop the agent orchestration engine, a state machine managing long-running agent sessions.
  • Implement distributed session management using lease-based ownership, DynamoDB conditional writes, heartbeat protocols, and crash recovery.
  • Create an event-driven message pipeline using SQS FIFO queues and Kafka consumers.
  • Implement structured concurrency using Python asyncio TaskGroups for concurrent tasks.
  • Develop observability infrastructure with OpenTelemetry, distributed tracing, and custom span lifecycle management.
  • Build caching and state layers using Redis and DynamoDB KV stores.