About The Position

Lob is seeking a Senior Platform Engineer to enhance the reliability, observability, performance, and cost-efficiency of its platform infrastructure. This role emphasizes observability engineering and infrastructure optimization within AWS environments. The ideal candidate will possess extensive experience with Datadog, OpenTelemetry, and HashiCorp Nomad, with a proven ability to build highly visible, scalable, and operationally efficient systems while actively managing infrastructure costs. The engineer will collaborate with other engineering teams to improve telemetry, monitoring, performance testing, platform reliability, and cloud infrastructure efficiency in a dynamic, distributed setting, incorporating modern AI-driven tools and operational workflows as needed.

Requirements

  • 7+ years of experience in platform engineering, infrastructure engineering, or site reliability engineering
  • Strong hands-on experience with HashiCorp Nomad
  • Deep expertise with Datadog
  • Strong experience implementing and operating observability platforms using OpenTelemetry and modern monitoring tooling
  • Experience with Grafana or similar visualization and observability platforms
  • Strong understanding of distributed tracing, metrics, logging, and monitoring best practices
  • Experience building dashboards, alerts, telemetry pipelines, and operational visibility tooling
  • Strong experience identifying and implementing AWS cost optimization strategies in production environments
  • Strong knowledge of S3 optimization, lifecycle management, and storage cost reduction
  • Experience building and running performance/load testing environments
  • Strong troubleshooting and performance analysis skills across distributed systems
  • Strong experience operating infrastructure in AWS environments
  • Strong experience with Terraform and infrastructure-as-code practices
  • Experience balancing platform reliability, observability, and infrastructure cost efficiency at scale
  • Experience working with distributed and event-driven architectures using technologies such as Redis, SQS, or Temporal
  • Experience managing and tuning Elasticsearch or OpenSearch clusters
  • Experience working in fast-paced engineering environments
  • Strong communication and collaboration skills

Nice To Haves

  • Exposure to PostgreSQL migrations from RDS to Aurora
  • Experience with Kubernetes
  • Experience with CI/CD systems and deployment automation
  • Experience with Go, Python, or TypeScript

Responsibilities

  • Lead observability initiatives across infrastructure and applications
  • Design and maintain monitoring, telemetry, dashboards, tracing, and alerting systems
  • Build actionable visibility into platform health, reliability, and performance
  • Improve incident detection, troubleshooting, and operational response capabilities
  • Define observability standards and best practices across engineering teams
  • Drive infrastructure cost optimization initiatives across AWS services and platform environments
  • Analyze infrastructure utilization and recommend performance and cost efficiency improvements
  • Maintain and improve infrastructure-as-code standards and workflows
  • Design, build, and maintain scalable performance testing environments and tooling
  • Execute and analyze load/performance testing initiatives
  • Support and improve Nomad-based orchestration environments
  • Troubleshoot complex production and infrastructure issues across distributed systems
  • Collaborate closely with engineering teams to improve scalability, reliability, operational visibility, and infrastructure efficiency
  • Create and maintain operational documentation and platform best practices

Benefits

  • Base salary + additional RSUs
  • Remote working opportunities