About The Position

This role focuses on designing, scaling, and securing infrastructure to meet business needs. The engineer will be responsible for fault-tolerant architecture, performance testing, capacity planning, and building automation, monitoring, and alerting systems. They will also design and test disaster recovery solutions, ensure scalability and maintainability through microservices and other architectural patterns, enhance the CI/CD pipeline for safe releases, and verify system performance and correctness. The position involves participating in peer reviews, testing, and an on-call rotation.

Requirements

  • Experience developing, managing and troubleshooting highly available distributed systems, including operational experience with Kubernetes in a production environment
  • Extensive expertise with at least one public cloud provider (AWS, GCP, Azure)
  • Exceptional verbal, written, and interpersonal communication skills
  • Interest in and understanding of best-in-class security practices, and automation and testing methods
  • Familiarity with configuration and maintenance of common infrastructure components such as Redis, Elasticsearch, and Hadoop
  • Deep understanding of customer needs and passion for customer success
  • BS or MS degree in Computer Science or equivalent experience

Nice To Haves

  • Advanced knowledge of managing and optimizing PostgreSQL server configuration
  • 3+ years of experience in software development
  • Experience with managing service meshes (e.g. Istio)
  • Defining and monitoring Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) to ensure that systems meet reliability and performance targets
  • Monitoring tools such as New Relic, Prometheus, Grafana, and/or Datadog
  • OpenTelemetry knowledge for distributed tracing and metrics collection, and experience using it in production environments
  • Managing Python and Golang applications in production
  • Microservices architectures
  • DevOps tooling such as Docker, Terraform, Argo CD, Argo Workflows, CircleCI, GitHub Actions, New Relic, PagerDuty, etc.
  • AWS/Cloud services such as EKS, EC2, S3, Lambda, Route 53, CloudFront, Cloudflare, IAM, etc.

Responsibilities

  • Design, scale, and secure infrastructure to stay ahead of business needs through fault-tolerant architecture design; performance testing, profiling, and tuning; and capacity planning
  • Design, build, deploy, and maintain automation, monitoring, and alerting systems, as well as design, implement, and test disaster recovery solutions
  • Ensure scalability and maintainability through microservices adoption, decoupling of concerns and data models, job queuing, and application layering
  • Enhance and maintain our CI/CD pipeline for smooth and safe production releases via automated testing and verification
  • Verify the correctness and performance of systems, including response time and throughput
  • Participate in peer reviews, testing, and design reviews for new features, products, and systems, and contribute to automated test suites
  • Participate in an on-call rotation

Benefits

  • Remote-first program
© 2026 Teal Labs, Inc