Senior Platform Engineer

Lob

Remote

About The Position

Lob is seeking a Senior Platform Engineer to improve the reliability, observability, performance, and cost-efficiency of its platform infrastructure. The role emphasizes observability engineering and infrastructure optimization in AWS environments. The ideal candidate has extensive experience with Datadog, OpenTelemetry, and HashiCorp Nomad, and a proven ability to build highly observable, scalable, and operationally efficient systems while actively managing infrastructure costs. The engineer will collaborate with other engineering teams to improve telemetry, monitoring, performance testing, platform reliability, and cloud infrastructure efficiency in a dynamic, distributed environment, incorporating modern AI-driven tools and operational workflows as needed.

Requirements

7+ years of experience in platform engineering, infrastructure engineering, or site reliability engineering
Strong hands-on experience with HashiCorp Nomad
Deep expertise with Datadog
Strong experience implementing and operating observability platforms using OpenTelemetry and modern monitoring tooling
Experience with Grafana or similar visualization and observability platforms
Strong understanding of distributed tracing, metrics, logging, and monitoring best practices
Experience building dashboards, alerts, telemetry pipelines, and operational visibility tooling
Strong experience identifying and implementing AWS cost optimization strategies in production environments
Strong knowledge of S3 optimization, lifecycle management, and storage cost reduction
Experience building and running performance/load testing environments
Strong troubleshooting and performance analysis skills across distributed systems
Strong experience operating infrastructure in AWS environments
Strong experience with Terraform and infrastructure-as-code practices
Experience balancing platform reliability, observability, and infrastructure cost efficiency at scale
Experience working with distributed and event-driven architectures using technologies such as Redis, SQS, or Temporal
Experience managing and tuning Elasticsearch or OpenSearch clusters
Experience working in fast-paced engineering environments
Strong communication and collaboration skills

Nice To Haves

Exposure to migrating PostgreSQL databases from RDS to Aurora
Experience with Kubernetes
Experience with CI/CD systems and deployment automation
Experience with Go, Python, or TypeScript

Responsibilities

Lead observability initiatives across infrastructure and applications
Design and maintain monitoring, telemetry, dashboards, tracing, and alerting systems
Build actionable visibility into platform health, reliability, and performance
Improve incident detection, troubleshooting, and operational response capabilities
Define observability standards and best practices across engineering teams
Drive infrastructure cost optimization initiatives across AWS services and platform environments
Analyze infrastructure utilization and recommend performance and cost efficiency improvements
Maintain and improve infrastructure-as-code standards and workflows
Design, build, and maintain scalable performance testing environments and tooling
Execute and analyze load/performance testing initiatives
Support and improve Nomad-based orchestration environments
Troubleshoot complex production and infrastructure issues across distributed systems
Collaborate closely with engineering teams to improve scalability, reliability, operational visibility, and infrastructure efficiency
Create and maintain operational documentation and platform best practices