About The Position

As a Staff Infrastructure & Performance Engineer, you will own and evolve the performance, reliability, and scalability of Nash’s core infrastructure. You’ll work directly with the Engineering Leadership team and with platform and product engineering teams to design and operate low-latency, business-critical systems that power real-time logistics for some of the largest retailers in the world. This is a senior, high-impact role focused on elastic capacity, high availability, cloud-native architectures, Postgres performance, enterprise-grade CI/CD, and multi-region deployments. You will set technical direction, define best practices, and deploy the systems that run these retailers’ business-critical workflows.

Requirements

  • 6+ years of experience building and operating high-scale, production infrastructure for business-critical systems.
  • Deep expertise in AWS, including networking, compute, storage, and managed services.
  • Hands-on experience running production workloads on ECS/Fargate at scale.
  • Strong background in Postgres, including performance tuning, replication, high availability, and operational excellence.
  • Proven experience designing and operating multi-region architectures with strict uptime and reliability requirements.
  • Strong understanding of CI/CD for enterprise deployments, including rollout strategies, environment isolation, and rollback safety.
  • Experience building low-latency systems where milliseconds matter.
  • Excellent debugging and systems-level problem-solving skills.
  • Ability to operate autonomously and lead technical initiatives in a fast-paced startup environment.

Responsibilities

  • Own infrastructure performance and reliability across Nash’s production systems, with a focus on low latency, high throughput, and predictable behavior under load.
  • Design, build, and optimize AWS-based infrastructure, leveraging managed services with a strong emphasis on ECS/Fargate.
  • Lead Postgres performance engineering, including query optimization, indexing strategies, connection management, replication, cluster design, and failover.
  • Architect and operate multi-region, highly available systems with strong resiliency, disaster recovery, and failover guarantees.
  • Design and evolve enterprise-grade CI/CD pipelines that support safe, repeatable, and fast deployments across environments and regions.
  • Drive observability standards (metrics, logs, tracing, SLOs) and use data to proactively identify and eliminate performance bottlenecks.
  • Partner with application engineers to influence system design decisions that impact scalability, latency, and reliability.
  • Lead incident response and postmortems, focusing on root cause analysis, systemic fixes, and long-term resilience.
  • Set infrastructure and performance best practices and mentor engineers across the organization.

Benefits

  • Early-stage, well-funded startup – directly impact the company and grow your career!
  • Quarterly on-sites with the broader team to bond with teammates
  • Competitive compensation and opportunity for equity
  • Flexible paid time off
  • Health, dental, and vision insurance