Senior Site Reliability Engineer

Coalition, Inc.

2d•$135,900 - $215,000•Remote

About The Position

We are looking for a Senior Site Reliability Engineer to join our Platform SRE team. In this role, you will build and operate the infrastructure, tools, and "paved roads" that empower our developers to deliver scalable, secure, and reliable software with speed and confidence. You'll work across the entire stack—from infrastructure automation and observability to developer enablement and system reliability. You will be a key collaborator with software engineering and security teams, helping to evolve our Infrastructure as Code (IaC), enhance CI/CD pipelines, and scale our internal developer platform. We value pragmatism and engineering excellence, primarily using Python, Go, and AWS to reduce toil and build self-service capabilities.

Requirements

6+ years of experience in SRE, DevOps, Cloud Engineering, or Software Development roles
Hands-on experience operating production environments in AWS
Proficiency in Go or Python, with experience building production-grade automation, tooling or libraries
Strong experience with Terraform
Experience with container orchestration platforms like ECS or Kubernetes
Familiarity with CI/CD tools such as GitHub Actions
Experience designing and implementing re-usable platform components based on team requirements
Solid understanding of observability practices including system metrics, distributed tracing, and SLOs
Exposure to failure-based testing approaches and automated recovery strategies
Strong leadership and communication skills, both written and verbal
Experience evangelizing reliability best practices

Nice To Haves

Experience with microservices architectures
Exposure to Kafka or other event streaming systems
Experience building internal developer platforms or self-service infrastructure
Familiarity with systems security, compliance requirements, or hardening practices

Responsibilities

Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform, driving architectural decisions that improve long-term maintainability and reliability.
System Reliability: Lead efforts to improve platform resilience through failure-based testing, automated recovery strategies, and proactive capacity planning.
Developer Enablement: Own the design and delivery of reusable platform components and self-service tools that streamline the developer experience and reduce cross-team toil.
Observability: Define and evolve observability standards across the platform, including system metrics, distributed tracing, and SLO frameworks.
Project Ownership: Own projects end to end—from initial scoping and effort estimation through detailed planning, execution, and successful rollout.
Mentorship & Standards: Mentor engineers across the team, uphold high infrastructure quality, and actively shape the best practices and standards used by the organization.
Collaboration: Engage in technical design discussions, providing guidance and feedback while adapting strategies based on team input and evolving requirements.