Infrastructure Software Engineer

Normal Computing Corporation • New York City, NY

About The Position

Normal Computing is seeking an Infrastructure Software Engineer to build the production systems behind its AI products. This role focuses on infrastructure-shaped software such as orchestration services, execution runtimes, internal APIs, persistence layers, observability, and developer experience. The engineer will help define the runtime layer for AI products in which agents execute long-running work, coordinate across distributed environments, interact with code and tools, and must be reliable enough for customer workflows.

This position sits between product engineering, AI engineering, and platform engineering. Its focus is application-level infrastructure for AI workflows rather than general company infrastructure such as Terraform or CI/CD. The systems built will be used by product, AI, research, and platform teams, so developer experience matters: understandable APIs, debuggable failure modes, and easy-to-use abstractions.

This is a cross-functional role for individuals who thrive in ambiguity, value clean abstractions, and want to influence how a frontier AI company operates its production systems. Strong engineering judgment and ownership are prioritized over rigid specialization. Daily tasks may include designing runtime architectures, building orchestration layers for autonomous workflows, improving workload scheduling and isolation, or creating system abstractions for productionizing AI prototypes.

Requirements

4+ years of experience in infrastructure software, backend infrastructure, production infrastructure, platform engineering, distributed systems, or related areas.
Strong software engineering fundamentals, including backend programming, APIs, data modeling, concurrency, debugging, and testing.
Experience building or operating production services where reliability, observability, and maintainability matter.
Practical experience with Docker and Kubernetes, including debugging containerized workloads, deployments, networking, resource limits, and lifecycle issues.
Comfort working with persistence systems such as Postgres, Redis/Valkey, object storage, or similar production data stores.
Experience building orchestration systems, job schedulers, workflow engines, sandboxes, developer platforms, or distributed execution systems.
Experience designing internal APIs and developer-facing abstractions that other engineers can use confidently.
Strong systems thinking: you can reason about state machines, failure modes, retries, queues, leases, scheduling, and long-running workflows.
Pragmatism in fast-moving environments: you know when to improve an abstraction, when to delete one, and when to ship the simple version.
Ownership mindset: you care whether the systems you build work in production and are usable by other engineers.
Clear communication and good technical judgment across product, AI, and infrastructure boundaries.

Nice To Haves

Deep Kubernetes experience, such as controllers/operators, networking, storage, scheduling, autoscaling, or resource isolation.
Experience with AI agent infrastructure, ML infrastructure, model orchestration, or LLM-based product systems.
Background in production infrastructure, reliability engineering, or infrastructure software at meaningful scale.
Experience in high-growth startups or engineering teams where ownership boundaries are still being defined.
Experience with chips, EDA, or device verification.

Responsibilities

Build and maintain production software infrastructure for Normal’s AI products, especially orchestration, execution, and runtime systems.
Design internal backend services and APIs used by product engineers, AI engineers, execution services, and other internal systems.
Improve the operational maturity of rapidly evolving systems through better state management, failure handling, metrics, tracing, and debugging tools.
Work with Kubernetes-backed execution environments, including container lifecycle, scheduling behavior, autoscaling, resource isolation, and runtime reliability.
Build developer-facing tools and abstractions that make it easier for other engineers to use and extend the systems you own.
Turn promising prototypes into durable production systems by designing clear abstractions, hardening critical paths, and creating operational patterns that scale with the product.
Collaborate closely with product, AI, research, and platform engineers to define the right interfaces between product features, AI workloads, and production infrastructure.
Lead design discussions for core runtime and orchestration systems, including API boundaries, state management, execution models, and operational tradeoffs.