Site Reliability Engineer II

Coalition, Inc.

$115,000 - $159,750 • Remote

About The Position

We are looking for a Site Reliability Engineer to join our Platform SRE team. In this role, you will build and operate the infrastructure, tools, and "paved roads" that empower our developers to deliver scalable, secure, and reliable software with speed and confidence. You’ll work across the entire stack—from infrastructure automation and observability to developer enablement and system reliability. You will be a key collaborator with software engineering and security teams, helping to evolve our Infrastructure as Code (IaC), enhance CI/CD pipelines, and scale our internal developer platform. We value pragmatism and engineering excellence, primarily using Python, Go, and AWS to reduce toil and build self-service capabilities.

Requirements

Experience: 4+ years in SRE, DevOps, Cloud Engineering, or Software Development roles.
Cloud Proficiency: Hands-on experience operating and scaling production environments within AWS.
Infrastructure as Code: Strong expertise with Terraform for managing complex cloud infrastructure.
Programming: Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries.
Containers & Orchestration: Experience with ECS or Kubernetes.
CI/CD: Familiarity with modern deployment tools, specifically GitHub Actions.
Communication: Strong written and verbal skills with a knack for evangelizing reliability best practices across the organization.

Nice To Haves

Experience troubleshooting complex distributed systems in a high-traffic production environment.
Exposure to event streaming systems such as Kafka or Kinesis.
Experience contributing to Internal Developer Platforms (IDP) or automating self-service infrastructure workflows.
Familiarity with systems security, compliance requirements, or infrastructure hardening.

Responsibilities

Infrastructure Automation: Design, build, and scale production environments using AWS and Terraform.
System Reliability: Improve the resilience and operability of our platform through failure-based testing and automated recovery strategies.
Developer Enablement: Design and implement reusable platform components and self-service tools to streamline the developer experience.
Observability: Implement and maintain robust observability practices, including system metrics, distributed tracing, and SLO management.
Mentorship & Standards: Guide junior engineers, uphold high infrastructure quality, and contribute to the team’s evolving best practices.
Collaboration: Participate in technical design discussions, sharing feedback and adapting strategies based on team input and evolving requirements.