Cloud Engineer

Relay Payments • Atlanta, GA

Hybrid

About The Position

Relay Payments is seeking a pragmatic, curious Cloud Engineer to join our technology team. This role is designed for a software engineer who enjoys understanding how systems actually work in production and wants to grow into reliability and infrastructure ownership over time. You’ll work closely with senior engineers to help operate and improve our platform.

This includes a mix of hands-on operational work (responding to alerts, following runbooks, maintaining systems) and project-based work to improve reliability, automation, and developer experience. You’ll be expected to learn quickly, take ownership, and steadily expand the scope of what you can handle independently.

You’ll be part of a small, high-leverage team focused on platform and reliability, working alongside experienced engineers who will mentor and support your growth. You will initially have read-only access to most production systems and will ramp into deeper ownership as you build familiarity and trust. Success in this role means becoming someone who can independently handle the majority of platform and system-related requests from internal teams, while knowing when to escalate and ask for help.

Requirements

3+ years of software engineering experience working on a team building and operating cloud-based applications with real users.
Experience working across different parts of a system (frontend, backend, APIs, infrastructure, or data systems).
Comfortable reading and working in multiple languages or technologies, even if not an expert in any one of them.
Experience working within a team using standard software development practices (code reviews, testing, deployments, handling production issues).
Demonstrated curiosity about how systems work in production and a desire to learn infrastructure, reliability, and operational best practices.
Strong problem-solving skills and willingness to dig into unfamiliar systems to understand and resolve issues.
Ability to follow structured processes (runbooks, incident response) while also identifying opportunities to improve them.
Comfortable working in environments where some work is repetitive or operational in nature, with a focus on doing it well and improving it over time.
Good communication skills and ability to collaborate with engineers across different teams.

Responsibilities

Respond to production alerts as part of an on-call rotation (1 week on, 3 weeks off), using established runbooks and escalating when appropriate.
Follow and improve operational playbooks for common production issues and system maintenance tasks.
Support patching, upgrades, and routine system maintenance to keep our platform secure and stable.
Work with logging, monitoring, and alerting systems (OpenSearch, CloudWatch, Sentry, JSM, Slack) to investigate and resolve issues.
Contribute to infrastructure and tooling improvements using a variety of technologies (e.g., Terraform, Go, Python), with guidance from senior engineers.
Assist internal engineering teams with platform-related requests, debugging production issues, and improving system observability.
Participate in incident response and postmortems to understand failure modes and improve system reliability.
Continuously learn how our systems are built, deployed, and operated, and take on increasing ownership over time.