SRE (Terminal)

MLabs•New York, NY

23h•Onsite

About The Position

Our client is a high-growth software development organization and a key contributor to one of the largest and fastest-growing decentralized crypto social networks globally. The platform has achieved massive scale, generating significant revenue and global attention since its inception. To support this rapid expansion and ensure the continuous uptime of its high-stakes, high-throughput environment, our client is seeking a battle-tested Site Reliability Engineering (SRE) expert. This individual will be handed ambiguous, critical infrastructure challenges and trusted to navigate them end-to-end: scoping solutions, making sound architectural trade-offs, and executing with precision.

Requirements

Core SRE & Infrastructure Focus: Deep expertise in infrastructure-as-code (Terraform/OpenTofu), network topology, high-availability architecture, and system internals.
Proven Track Record: Experience building foundational infrastructure (ideally from 0→1) and running high-availability environments where reliability is treated with financial-system levels of seriousness.
Cloud-Native Fluency: Advanced proficiency with modern cloud providers (AWS, GCP) and container orchestration platforms (Kubernetes).
Pragmatic Problem Solver: Strong capacity to operate independently in high-stakes environments, deciding when to gather consensus versus when to execute autonomously.

Nice To Haves

Experience with infrastructure security hardening, IAM architecture, or compliance mapping (e.g., SOC 2, ISO).
Hands-on experience managing and scaling high-throughput, low-latency data backbones and event streaming systems (Kafka, Redpanda, PostgreSQL).
A working understanding of Web3/crypto infrastructure patterns and comfort operating within them.

Responsibilities

Design, scale, and maintain highly available cloud infrastructure, including multi-region and active-active patterns.
Lead critical incident response efforts, participate in real on-call rotations, and drive comprehensive, blameless post-mortems to continuously harden the system.
Write clean, production-grade automation code (Python, Go, or similar) for infrastructure tooling, operators, and seamless systems integration.
Exercise sharp judgment regarding system risks, balancing rapid deployment velocity with robust infrastructure safety and stability.
Raise the engineering and operational bar across the organization through the implementation of rigorous standards, modern tooling, and technical mentorship.