Senior Infra Engineer: Observability

Railway•San Francisco, CA

2d•Remote

About The Position

Our core mission at Railway is to make software engineers higher leverage. We believe that people should be given powerful tools so that they can spend less time setting up to do, and more time doing.

Many infrastructure platforms focus only on how you deploy a single application, and not on how these applications function in concert. Questions like “How do you build systems for zero-downtime deployment?” and “How do you do service-to-service communication?” are usually left up to the engineers to define. At Railway, our goal is to be an all-encompassing solution to these problems. As such, we take special care as we define our networking infrastructure.

This is a high-impact, high-agency role with a direct effect on company culture, trajectory, and outcome.

Requirements

A strong understanding of distributed systems. You enjoy building fault-tolerant, resilient, and scalable services
An interest in VictoriaMetrics, ClickHouse, and other systems for building observability stacks from the ground up
A solid intuition about how long your solutions will last. All systems age. In startups, we can hope for 2-3 orders of magnitude, or 12-18 months
The tact to implement your solution, create monitors for its error boundaries, and document any requirements for when you’re not around
A great sense of direction and prioritization when it comes to dealing with the ambiguity of an early-stage startup
A sense of grit to dive into a problem, implement a solution, scale that solution, and replace it when needed
A great set of communication skills for getting your point across, getting your solution implemented, and beyond

Nice To Haves

Networking falls under the platform engineering umbrella. If you’re specialized, we’d love to chat! That said, please note that you’re probably going to do a lot of non-networking platform work as well

Responsibilities

Build ingestion pipelines to consume 1M+ RPS streams of logs, metrics, and other telemetry
Build scalable, fault-tolerant alerting engines for notifying users, in real time, of threshold breaches
Craft rich backend observability APIs, working with product to build amazing experiences that let users instantly grok their applications
Provide APIs to access real-time log/metrics streams to be consumed by the Dashboard and Product Teams
Build Golang/Rust gRPC services from scratch capable of supporting tens of thousands of users, and the million+ to come
Define infrastructure that can be torn down, failed over, and reconstituted from scratch, following the principle of immutable infrastructure and using Terraform and Ansible
Write Engineering Requirement Documents to take something from idea, to defined tasks, to implementation, to monitoring its success
Interface with our TypeScript and GraphQL edge to expose your microservice APIs for both internal and potentially external consumption

Benefits

Great salary
Full health benefits, including dependents
Strong equity grants
Equipment stipend
Autonomy: We have very few meetings. Just one on Monday and one on Friday to go over the Company Board. We think your time is sacred, whether it's at work or outside of work.
Ownership: We're a company with a high ownership, high autonomy culture. We hope that you'll come in, help us, and over the course of many years do the best work of your life. When we bring you onboard, we expect you to change the company.
Novel problems/solutions: We're a well-funded startup with cool problems, which lets us implement novel solutions! We abhor “busywork” and think that, whether it's community, engineering, operations, etc., there's always opportunity for creative and high-leverage solutions.
Growth: We want you to grow with us, but we know that talent is loaned, so when you figure out what area you want to grow in next, whether it's at Railway or outside, we'll make sure you land there.
