Staff Site Reliability Engineer, Cloud

Kentik

25d•$165,000 - $200,000•Remote

About The Position

Kentik is the network intelligence platform for modern infrastructure teams. Unlike traditional monitoring and observability tools, Kentik demystifies complex network operations, enabling organizations to deliver applications and innovation at scale. Built by network experts to make critical insight accessible to every engineer, Kentik is the real-time source of truth that understands every network in context, from data center to cloud to the internet. This single platform unifies and correlates cloud, device, flow, and synthetic data to turn telemetry into action. Market leaders like Akamai, Booking.com, Dropbox, and Zoom rely on Kentik to run, manage, and optimize their networks.

Our platform ingests trillions of records and serves hundreds of thousands of queries for our users each day. You will gain experience building a production-quality, high-performance server-and-client SaaS application that handles uniquely high volumes of data. We have built a team of world-class engineers, network experts, and technology thought leaders in a culture that has been remote-friendly from day one. While prior experience in a remote environment is not required, we highly value strong collaboration and communication skills, as well as a high level of independence and autonomy.

Kentik is looking for a Staff-level Site Reliability Engineer (Cloud) to join our Product Engineering team to help build and maintain our Synthetics and Cloud product lines. These products have multiple applications deployed in various cloud providers all over the world. We manage these cloud applications using observability tooling, automated build processes, and adherence to configuration-as-code best practices. We’re looking for an experienced engineer who will work with engineering teams across the company to help grow our hardware and software infrastructure. We operate a well-organized, well-instrumented platform, and offer enormous opportunities for employee growth.

Requirements

8+ years of experience in cloud-based systems administration, IT, and/or SRE-related projects
Expertise in public cloud environments such as AWS, GCP, Azure, or OCI.
Strong command of containerization and orchestration using Docker and Kubernetes.
Solid programming and automation skills using Bash, Python, or Go.
Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, and Puppet.
Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk).
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
Network administration experience: concepts such as routing, firewalls (iptables), and peering should sound familiar
A passion for documenting code, processes, and infrastructure in runbooks and wikis
Experience with metrics and monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry
Experience creating and managing tickets with third-party vendors and owning cloud vendor partner relationships

Nice To Haves

Familiarity with Kubernetes automation tools, specifically managing complex deployments with Helm and Helmfile.
Knowledge of scaling Kubernetes workloads and compute infrastructure
Experience optimizing CI/CD build and deploy pipelines using GitHub Actions and Jenkins.
Exposure to PagerDuty integrations
Knowledge of SRE, DevOps, and GitOps practices and principles

Responsibilities

Make sure our real-time, scalable infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware across multiple locations, as well as on all major cloud vendors
Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth
Deep-dive into diverse topics, from firewalls and IP routing to database replication strategies and build process automation
Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective
Assist with expanding our cloud deployments across the major cloud providers
Contribute code, code reviews, and tools or patches to all kinds of existing code
Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure
Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team

Benefits

100% of premiums are paid by the company for health, vision, and dental coverage for you and your dependents
Additionally, an annual Health Reimbursement Account (HRA) of $3,000 for an individual or $4,500 for a family
Paid family & medical leave
Open PTO, a quarterly Wellness Day, and a minimum of 10 paid holidays
401(k) retirement account
Home office reimbursement
Stock options
